ClickHouse® Disaster Recovery: Tips and Tricks to Avoid Trouble in Paradise

Recorded: March 25 @ 08:00 am PT
Presenters: Robert Hodges and Alexander Zaitsev
In this webinar, Altinity CEO Robert Hodges and CTO Alexander Zaitsev walk through the full spectrum of disaster recovery options for ClickHouse®, drawing on six-plus years of operating Altinity.Cloud and working with hundreds of production customers.
Robert opens by defining disaster recovery, framing the problem around blast radius, Recovery Point Objective (RPO), Recovery Time Objective (RTO), cost, and complexity. He then introduces the four main DR strategies, each represented in a comparison table: backups, single-region multi-availability-zone (multi-AZ) replication, cross-region replication, and independent clusters. He covers what needs to be protected in a ClickHouse installation, walks through Altinity Backup for ClickHouse® (clickhouse-backup) and the built-in SQL BACKUP/RESTORE commands, explains the new Backup Database Engine introduced in ClickHouse 25.2, discusses ZooKeeper backup strategy and the SYSTEM RESTORE REPLICA command, and presents a Kafka-based technique for minimizing the backup data-loss window.
Alexander takes over for multi-AZ replication, explaining how anti-affinity rules and the Altinity Kubernetes Operator for ClickHouse® make cross-AZ placement almost automatic. He then dives deep into cross-region replication: VPC peering, CoreDNS cross-cluster forwarding, why headless Kubernetes services are required, the cross-region cluster definition for distributed DDL, and the two failover procedures (new Keeper promotion versus SYSTEM RESTORE REPLICA). He closes with the independent cluster approach using Kafka for dual ingest, and a sophisticated two-subcluster design that eliminates the need for immediate failover by keeping both regions writable at all times.
The Q&A addresses topics including cross-AZ data transfer costs, Keeper traffic optimization, the UNDROP TABLE command, insert quorum, ZooKeeper placement best practices, and plans for multi-cluster Kubernetes operator support.
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 00:08 – Welcome and housekeeping
- 01:48 – Speaker introductions: Robert Hodges and Alexander Zaitsev
- 03:08 – What is disaster recovery? Definition and blast radius hierarchy
- 06:09 – The four DR strategies: comparison table of tradeoffs
- 07:56 – What to protect in ClickHouse: config files, schema, data, RBAC metadata
- 09:00 – Backup option 1: Altinity Backup for ClickHouse (clickhouse-backup)
- 12:07 – Backup option 2: Built-in SQL BACKUP/RESTORE commands
- 13:09 – New in 25.2: the Backup Database Engine
- 13:52 – Comparison: clickhouse-backup vs. built-in SQL backup
- 15:09 – ZooKeeper backup strategy and SYSTEM RESTORE REPLICA
- 16:40 – Reducing the backup data-loss window with Kafka topic offsets
- 18:30 – Multi-AZ replication with the Altinity Kubernetes Operator
- 21:15 – YAML example: anti-affinity and zone placement in the operator
- 23:00 – Verifying AZ placement with kubectl
- 24:59 – Cross-region replication: network setup and VPC peering
- 28:34 – CoreDNS cross-cluster forwarding for service resolution
- 31:34 – Why headless Kubernetes services are required
- 33:29 – Cross-region cluster definition for distributed DDL
- 36:13 – Failover procedures: Keeper promotion vs. SYSTEM RESTORE REPLICA
- 44:55 – Independent clusters: dual ingest via Kafka
- 47:07 – Advanced: two-subcluster design for always-writable cross-region DR
- 49:14 – Summary and wrap-up
- 53:21 – Q&A: multi-cluster operator, cross-AZ costs, Keeper placement, UNDROP TABLE, insert quorum
Webinar Transcript
[00:08] – Welcome and Housekeeping
Robert: Hi everybody and welcome to our webinar on ClickHouse® Disaster Recovery: Tips and Tricks to Avoid Trouble in Paradise. My name is Robert Hodges. I am CEO of Altinity. I have also been working on ClickHouse for about six and a half years and on databases for 40. With me is Alexander Zaitsev, our CTO. We will do deeper intros in just a second.
A couple of housekeeping items. This webinar is recorded. We will send you a link to that recording as well as a slide deck as soon as this is done, within a few hours. We are here to take questions. Please feel free to put them in the Q&A box or in the chat. We will see them there. Feel free to post them at any time. If something is relevant, we may jump into it; otherwise we will hold them for the end.
One final point: a number of you sent me questions through LinkedIn when I reached out. Thank you so much for that. We will try to cover all those questions here as well.
[01:48] – Speaker Introductions
Robert: I am a database geek working on open source for coming up on two decades. Alexander, would you like to give a quick intro?
Alexander: Hey everyone. I am Alexander, co-founder of Altinity. I have been working on databases for many years, including some interesting disaster recovery scenarios.
Robert: Altinity, as you have probably heard, has been working on ClickHouse since 2017. We are the authors of Altinity.Cloud, the first cloud service for ClickHouse, running on Amazon, GCP, Hetzner, and Azure. We are the authors of the Altinity Kubernetes Operator for ClickHouse®, which many of you on this call are already using. We maintain the Altinity Backup for ClickHouse® project. We are big on ClickHouse. We love it, and we will be sharing the knowledge we have both from running the cloud and from working with hundreds of customers over the years.
[03:08] – What Is Disaster Recovery?
Robert: I found a great definition from the Google Cloud documentation: disaster recovery is your ability to restore access and functionality of services when something really bad happens. And “really bad” can span a variety of causes. The point is: some event takes out a key part of your infrastructure and you need to get it back up and running as quickly as possible.
Disasters come in all kinds of different sizes. We sometimes call this the blast radius, meaning how far does this disaster reach, and more importantly, how far away do you have to have copies of data when something bad happens. You can think of these in a rough hierarchy:
The simplest and most local is a DBA accidentally dropping a table. People do not always think of that as a disaster, but you would be surprised how often it happens. It can crash your service and you need to get it back up quickly. The hierarchy extends to node failures (machines and the racks they live in) and storage failures affecting one or more nodes. It can go all the way up to an availability zone, which is a big data center that might be hit by a tornado or hurricane. And then there are full region outages, when a cloud provider has a management plane problem and key services you depend on are simply down.
Disaster recovery has to take care of all of these. There are a lot of questions to think about: How much data can I lose? This is what we call the Recovery Point Objective (RPO). How long will it take to restart the service? That is the Recovery Time Objective (RTO). How much is it going to cost? And how much complexity will it add to my installation?
[06:09] – The Four DR Strategies
Robert: In this talk we are going to consider four different ways to handle disaster recovery with ClickHouse: backups, single-region multi-AZ replication, cross-region replication, and independent clusters that have no direct replication between servers. A handy comparison table summarizes the benefits and drawbacks of each.
Backups are the cheapest and simplest to implement for protecting your data, but they have downsides. When you have to go to backups and it is a big table, it will take a while to restore. And since backups only run at intervals, you may have a large data-loss window unless you have things very well organized. We will talk about one way to handle that.
Single-region multi-AZ replication is so easy to set up that every installation should have it configured. Cross-region replication is more complicated, but it covers you from large-scale disasters. And independent clusters give you creative ways to handle DR at the cost of potential data drift between clusters.
[07:56] – What to Protect in ClickHouse
Robert: This is a useful question to ask because sometimes people only think about the data in tables. But there are other things to restore. You want your configuration files back. You want to restore the schema for your tables so that when the data comes back, it has a place to live. The data itself is obviously important. And then RBAC metadata: things like system accounts. These need to be covered as well, including special cases where you can store them in ZooKeeper or Keeper. By the way, that is a really great feature. If you are not using it, you should look into it.
[09:00] – Backup Option 1: Altinity Backup for ClickHouse
Robert: There are a number of backup options. One that is not widely used, though some people have used it to make copies of data, is ClickHouse Copier. It is not super well-supported anymore and it does not handle things like configuration files or user accounts.
The main tool is Altinity Backup for ClickHouse®, sometimes known as clickhouse-backup because that is the name of the project in our GitHub org. It is a standalone backup utility for ClickHouse that has been around for years, maintained through many versions. One of the cool things about it is that it backs up everything.
It is a command-line tool written in Go. You can just download it and get started. Here is how you create a backup: you issue a create backup command, which makes a local copy of your data constructed using hard links. Hard links take no extra storage space but give you a stable copy of the files that you can then upload to remote storage, typically object storage. These commands can be combined into a single command that does the whole thing at once. A shadow copy remains until everything is uploaded, at which point you can wipe it out.
To restore, you reverse the process: download the files from remote storage, which creates a shadow copy, and then the restore command moves that shadow copy back into the ClickHouse server. This can be used to restore not just full databases but individual tables, partitions, and more.
There are many restore options. You can restore just the schema, which is useful if you are recreating a replica or making a schema copy in another location. You can restore single tables or partitions within tables. The project repo on GitHub has a nice documentation page covering all of this, and clickhouse-backup has good built-in help.
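The create, upload, download, and restore steps described above can be sketched as command lines. This is a minimal illustration that only builds argument lists and does not run anything; the backup name and table pattern are hypothetical, and the clickhouse-backup subcommands and flags shown (create, upload, create_remote, download, restore, --schema, --tables) should be checked against the built-in help of the version you run.

```python
from typing import List, Optional


def backup_commands(name: str) -> List[List[str]]:
    """Two-step backup: local hard-link copy, then upload to remote storage."""
    return [
        ["clickhouse-backup", "create", name],
        ["clickhouse-backup", "upload", name],
    ]


def backup_command_combined(name: str) -> List[str]:
    """Single command that creates the local backup and uploads it."""
    return ["clickhouse-backup", "create_remote", name]


def restore_commands(
    name: str, tables: Optional[str] = None, schema_only: bool = False
) -> List[List[str]]:
    """Download from remote storage, then restore into the ClickHouse server."""
    restore = ["clickhouse-backup", "restore"]
    if schema_only:
        restore.append("--schema")  # restore table definitions only
    if tables:
        restore += ["--tables", tables]  # hypothetical pattern, e.g. "default.events"
    restore.append(name)
    return [["clickhouse-backup", "download", name], restore]


if __name__ == "__main__":
    plan = backup_commands("nightly") + restore_commands("nightly", tables="default.events")
    for cmd in plan:
        print(" ".join(cmd))
```

Each list could be handed to subprocess.run on a host where clickhouse-backup is installed and configured; printing them first makes it easy to review the plan.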
[12:07] – Backup Option 2: Built-In SQL BACKUP/RESTORE Commands
Robert: The other backup option is what we call the embedded backup: SQL commands built into ClickHouse itself. Very simple. You say something like BACKUP DATABASE <name> TO DISK '<external_storage>', and then RESTORE DATABASE <name> FROM DISK '<external_storage>'. Like clickhouse-backup, this also supports restoring individual tables, so the functionality is broadly similar except that it is embedded in ClickHouse itself.
One interesting tip not in the slides: ClickHouse itself can work as the command source. There is a special table you can set up, and if you insert a row into it, clickhouse-backup will see it and treat it as a command. So in both cases you can have backups triggered from within ClickHouse itself.
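For the embedded SQL commands, here is a small sketch that builds the statements as strings. It assumes a backup disk named 'backups' has been configured and allowed for backups on the server; the disk name, archive name, database, and table are all hypothetical. In current ClickHouse syntax the destination is written as Disk('<disk_name>', '<backup_name>').

```python
def backup_database_sql(database: str, disk: str, archive: str) -> str:
    """BACKUP statement for a whole database to a configured backup disk."""
    return f"BACKUP DATABASE {database} TO Disk('{disk}', '{archive}')"


def restore_database_sql(database: str, disk: str, archive: str) -> str:
    """RESTORE statement for a whole database from the same destination."""
    return f"RESTORE DATABASE {database} FROM Disk('{disk}', '{archive}')"


def restore_table_sql(database: str, table: str, disk: str, archive: str) -> str:
    """RESTORE statement for a single table out of the backup."""
    return f"RESTORE TABLE {database}.{table} FROM Disk('{disk}', '{archive}')"


if __name__ == "__main__":
    # Hypothetical names; run the printed statements with clickhouse-client.
    print(backup_database_sql("analytics", "backups", "analytics-2025-03-25.zip"))
    print(restore_database_sql("analytics", "backups", "analytics-2025-03-25.zip"))
    print(restore_table_sql("analytics", "events", "backups", "analytics-2025-03-25.zip"))
```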
[13:09] – New in 25.2: The Backup Database Engine
Robert: There is a really cool feature that was introduced in ClickHouse 25.2. It is called the Backup Database Engine. What it allows you to do is take a backup, mount it as what appears to be a database, and then you actually have the tables in that database. You can run queries directly off those tables. They could be stored in object storage. So you can do queries on historical backups and, of course, if you need to restore, you simply SELECT the data out of this database into wherever you want it restored. It is a cool feature, and one that we need to look at supporting in clickhouse-backup as well.
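A rough sketch of how mounting a backup and selecting data out of it might look, again as generated SQL strings. The database alias, disk name, and backup name are hypothetical, and the sketch assumes the target table already exists with a matching schema. Check the Backup database engine documentation for the destinations your ClickHouse version supports.

```python
def mount_backup_sql(alias: str, source_database: str, disk: str, backup_name: str) -> str:
    """Attach a backup as a read-only database using the Backup engine."""
    return (
        f"CREATE DATABASE {alias} "
        f"ENGINE = Backup('{source_database}', Disk('{disk}', '{backup_name}'))"
    )


def copy_back_sql(alias: str, table: str, target_database: str) -> str:
    """Restore by selecting rows out of the mounted backup into a live table."""
    return f"INSERT INTO {target_database}.{table} SELECT * FROM {alias}.{table}"


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    print(mount_backup_sql("analytics_backup", "analytics", "backups", "analytics-2025-03-25.zip"))
    print(copy_back_sql("analytics_backup", "events", "analytics"))
```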
[13:52] – Comparing the Two Backup Tools
Robert: In general, Altinity Backup for ClickHouse® is more versatile. It works across many ClickHouse versions, whereas the built-in commands depend on which version you are using. It also backs up RBAC metadata, including when it is stored in ZooKeeper or Keeper, and has built-in automation features. If you want something that will work across different server types and versions, clickhouse-backup is a good choice. We have used it in Altinity.Cloud for years. It is our backup mechanism and it gets pounded on every day.
The SQL backup has the advantage of being baked into ClickHouse and having the cool Backup Database Engine feature. Both are interesting and useful.
[15:09] – ZooKeeper Backup Strategy and SYSTEM RESTORE REPLICA
Robert: Backing up ZooKeeper is an important consideration. ZooKeeper does have a snapshot backup mechanism, but the key insight is that you generally do not need to back it up separately. What you should do instead is back up the ClickHouse data itself, and then use the SYSTEM RESTORE REPLICA command to restore the ZooKeeper metadata after recovering the data.
SYSTEM RESTORE REPLICA is a command every ClickHouse operator should know. What it does is take a replica that has gotten disconnected from ZooKeeper, or whose ZooKeeper metadata has been lost, and restore the ZooKeeper metadata by simply looking at the contents of the files on disk for that server. It just puts the metadata back into ZooKeeper. This is a really great command. It means that even for a large ClickHouse cluster, you can always recover ZooKeeper metadata by going to the individual servers and reloading from disk.
If you want to back up RBAC information stored in ZooKeeper specifically, clickhouse-backup has a --rbac option that will both copy it during backup and restore it when restoring.
[16:40] – Reducing the Backup Data-Loss Window with Kafka Topic Offsets
Robert: Backups have a data-loss window. Incremental backups partially address that, but even incremental backups are expensive to run. They only run at intervals, and ClickHouse does not really have a built-in transaction log the way traditional databases do. However, if you are using Kafka or Redpanda, you already have a log built in.
Some companies handle this in their backup scripts. The process is: stop ingest temporarily, capture the Kafka topic offsets into a ClickHouse table or a clickhouse-backup table, wait for replicas to sync so you have a consistent copy, and then run the backup. As soon as the stable snapshot is created, you can re-enable ingest and start again.
When you restore, you restore the tables, reset the Kafka topic offsets from the backup, and re-enable ingest. Your consumers, which are reading from Kafka and dumping data into ClickHouse, will automatically re-read all the data they missed during the backup window. This is a really great option. How you implement it will differ depending on whether you write your own consumers or use the Kafka table engine, but it is definitely worth exploring.
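The backup and restore sequence above can be sketched as two small orchestration functions. This is only an outline of the ordering: every step is passed in as a callable, and how you pause consumers, read offsets, or reset them depends on your Kafka client or on the Kafka table engine. None of these callables come from the talk.

```python
from typing import Callable, Dict

Offsets = Dict[int, int]  # Kafka partition -> next offset to consume


def backup_with_offsets(
    stop_ingest: Callable[[], None],
    read_offsets: Callable[[], Offsets],
    save_offsets: Callable[[Offsets], None],
    wait_for_replicas: Callable[[], None],
    create_backup: Callable[[], None],
    start_ingest: Callable[[], None],
) -> Offsets:
    """Pause ingest, record offsets next to the data, back up, resume ingest."""
    stop_ingest()
    try:
        offsets = read_offsets()
        save_offsets(offsets)   # e.g. insert into a ClickHouse table that is backed up
        wait_for_replicas()     # e.g. SYSTEM SYNC REPLICA on each replicated table
        create_backup()
    finally:
        start_ingest()          # resume ingest even if the backup step failed
    return offsets


def restore_with_offsets(
    restore_tables: Callable[[], None],
    load_offsets: Callable[[], Offsets],
    reset_offsets: Callable[[Offsets], None],
    start_ingest: Callable[[], None],
) -> Offsets:
    """Restore data, rewind consumers to the saved offsets, resume ingest."""
    restore_tables()
    offsets = load_offsets()    # read back the offsets stored with the backup
    reset_offsets(offsets)      # consumers re-read everything after the snapshot
    start_ingest()
    return offsets


if __name__ == "__main__":
    log = []
    backup_with_offsets(
        lambda: log.append("stop ingest"),
        lambda: {0: 1200, 1: 1187},
        lambda o: log.append(f"save offsets {o}"),
        lambda: log.append("sync replicas"),
        lambda: log.append("create backup"),
        lambda: log.append("start ingest"),
    )
    print("\n".join(log))
```

Passing the steps in as callables keeps the ordering testable without a running Kafka or ClickHouse, which matters because the order is the whole point of the technique.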
[18:30] – Multi-AZ Replication with the Altinity Kubernetes Operator
Robert: Let us look at the second option: cross-availability-zone replication. As we saw from the comparison table, this handles failures within a single data center region. If you have three availability zones, that corresponds roughly to three data centers that may be 30 kilometers apart. If there is a tornado, they do not all get hit at once.
The basic approach is to spread your ClickHouse clusters across multiple availability zones. The networking is relatively simple, and it kind of works out of the box because the network latencies are low. You can run your Keeper servers in different availability zones. They can communicate with each other, establish consensus, and you can ingest at high rates of speed. If one data center goes down, you still have two left.
This is driven by two Kubernetes features we build on: affinity and anti-affinity. Anti-affinity causes ClickHouse servers not to land on the same machine. Affinity can drive them to land in particular availability zones. Using these two properties, we can set up replication and replicas that span availability zones.
If you are operating in the cloud, I highly recommend running this on Kubernetes and using the Altinity Kubernetes Operator for ClickHouse®. This is not just because we wrote it; Kubernetes is simply very good at placing things in different locations. The operator gives you the ability to define your cluster with a relatively small piece of YAML that the Altinity Operator will read and then issue the commands to place your servers in different locations across availability zones.
[21:15] – YAML Example: Anti-Affinity and Zone Placement
Robert: Here is an example ClickHouseInstallation resource. It is a simple YAML file. In the templates section, there is a pod template definition for one replica in US West 2A and another for US West 2B. On the next page, you see those filled out. In each pod template definition there are clauses that force replicas to be in particular zones, and there is also anti-affinity enabled by default in the operator to prevent two ClickHouse servers from landing on the same host.
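The slide itself is not reproduced in this transcript, so the following is only a sketch of what such a resource might look like, built as a Python dictionary and printed as JSON (which kubectl accepts as a manifest format). The installation name, cluster name, and template names are hypothetical. Container specs, storage, and Keeper configuration are omitted, and anti-affinity is set explicitly here as an assumption rather than relying on operator defaults. Verify the field names against the operator's own examples before using it.

```python
import json
from typing import List


def zoned_installation(name: str, cluster: str, zones: List[str]) -> dict:
    """Build a ClickHouseInstallation with one replica pinned to each zone."""
    pod_templates = [
        {
            "name": f"clickhouse-in-zone-{zone}",
            "zone": {"values": [zone]},  # force the pod into this zone
            "podDistribution": [{"type": "ClickHouseAntiAffinity"}],  # one server per host
        }
        for zone in zones
    ]
    replicas = [{"templates": {"podTemplate": t["name"]}} for t in pod_templates]
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                "clusters": [{"name": cluster, "layout": {"shards": [{"replicas": replicas}]}}]
            },
            "templates": {"podTemplates": pod_templates},
        },
    }


if __name__ == "__main__":
    manifest = zoned_installation("demo", "ch", ["us-west-2a", "us-west-2b"])
    print(json.dumps(manifest, indent=2))
```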
When you set things up like this, it is really important to verify that it actually worked. Here are quick commands to check what hosts your pods are running on and confirm they are in different zones. These can obviously be scripted together to give yourself a little report, so you can make sure your servers are located exactly where you expect.
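The commands shown in the webinar are not included in this transcript. As one possible way to script the check, the sketch below assumes you have saved the output of `kubectl get pods -o json` and `kubectl get nodes -o json`, and that nodes carry the standard topology.kubernetes.io/zone label. The function names are illustrative.

```python
import json
from typing import Dict

ZONE_LABEL = "topology.kubernetes.io/zone"


def pod_zones(pods_json: str, nodes_json: str) -> Dict[str, str]:
    """Map each pod name to the availability zone of the node it runs on."""
    node_zone = {
        n["metadata"]["name"]: n["metadata"].get("labels", {}).get(ZONE_LABEL, "unknown")
        for n in json.loads(nodes_json)["items"]
    }
    zones = {}
    for p in json.loads(pods_json)["items"]:
        node = p["spec"].get("nodeName")  # missing while the pod is unscheduled
        zones[p["metadata"]["name"]] = node_zone.get(node, "unknown")
    return zones


def zones_are_distinct(zones: Dict[str, str]) -> bool:
    """True when no two pods share a zone (unknown zones count as a clash)."""
    values = list(zones.values())
    return len(set(values)) == len(values)


if __name__ == "__main__":
    # Tiny inline sample standing in for real kubectl output.
    pods = json.dumps({"items": [
        {"metadata": {"name": "chi-demo-ch-0-0-0"}, "spec": {"nodeName": "node-1"}},
        {"metadata": {"name": "chi-demo-ch-0-1-0"}, "spec": {"nodeName": "node-2"}},
    ]})
    nodes = json.dumps({"items": [
        {"metadata": {"name": "node-1", "labels": {ZONE_LABEL: "us-west-2a"}}},
        {"metadata": {"name": "node-2", "labels": {ZONE_LABEL: "us-west-2b"}}},
    ]})
    report = pod_zones(pods, nodes)
    print(report, zones_are_distinct(report))
```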
[23:00] – Cross-AZ Data Transfer Costs
Robert: There was a great question from Yanik about cross-AZ data transfer costs. Yes, in general, the amount of data transfer for replication is roughly proportional to the amount of data ingested. But there is a key thing to watch out for: if you accidentally configure things so that ZooKeeper or Keeper lookups are going across availability zones, we have seen cases where that causes a huge amount of unexpected traffic, and you are paying two cents a gigabyte for that.
Alexander: To add to that: the cross-AZ traffic for replication is actually even smaller than the ingest traffic, because the data that gets replicated consists of already well-compressed parts. Merges and other operations execute independently on every replica, which eliminates extra replication traffic. The Keeper traffic can be a problem, though. If ClickHouse is talking to a Keeper node in a different AZ, that can generate surprisingly large traffic sometimes.
[24:59] – Cross-Region Replication: Network Setup and VPC Peering
Alexander: Cross-region replication serves two purposes. One is DR: having an availability site in another region or continent. The other, very popular reason is proximity to your customers. If you have a site in Europe and one in the US, US customers can go to your US database and European customers can go to your European database. So in addition to high availability and DR, it also gives you lower query latency for applications.
To set up cross-region replication, you need to configure the network first. ClickHouse needs to replicate data on a special replication port. Distributed DDL queries require another port. All nodes should be able to see nodes in the other region, and all nodes should connect to the same ZooKeeper ensemble.
We generally do not recommend having ZooKeeper in two regions simultaneously, and I will explain why. In the architecture we recommend, Region A is your primary region with two ClickHouse replicas and a Keeper ensemble. In Region B, you can deploy what is called an Observer Keeper, which is not active until a DR event, at which point you can do a switchover.
When working in Kubernetes, you have two clusters located in different VPCs with their own subnets. The easiest approach in AWS is VPC peering, which allows each VPC to see the address space of the other. But that is not all you need: you also need DNS resolution between the clusters, because in Kubernetes we always use Kubernetes services to refer to ClickHouse and Keeper nodes. If you have two clusters, they need to see each other’s services.
[28:34] – CoreDNS Cross-Cluster Forwarding
Alexander: To make cross-cluster service resolution work, you modify the CoreDNS configuration in both regions to forward DNS requests to the DNS server of the other Kubernetes cluster. That requires a little cloud configuration and patching the CoreDNS configmap in each cluster.
Once configured, the behavior is: when a ClickHouse pod in Region B tries to resolve a ZooKeeper service name in Region A, the local CoreDNS fails to find it locally and forwards the request to the other cluster’s DNS server, which responds with the correct IP address.
One important requirement: if there are name collisions between the two clusters, the forwarding rule will not fire correctly, because a colliding name resolves locally and is never forwarded. You need to be very careful to use different service names in each cluster. In particular, ClickHouseInstallation names must be different even though they represent the same replicated cluster.
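To see why name collisions matter, here is a toy model of the lookup order described above. It is not CoreDNS configuration, just two dictionaries standing in for each cluster's DNS records; all service names and IP addresses are made up.

```python
from typing import Dict, Optional


def resolve(name: str, local: Dict[str, str], remote: Dict[str, str]) -> Optional[str]:
    """Local records win; only names missing locally are forwarded."""
    if name in local:
        return local[name]
    return remote.get(name)  # forwarded to the other cluster's DNS


if __name__ == "__main__":
    region_a = {"keeper-a.ns.svc.cluster.local": "10.0.1.5"}
    region_b = {"clickhouse-b.ns.svc.cluster.local": "10.1.2.7"}

    # A pod in Region B looks up a Keeper service that exists only in Region A.
    print(resolve("keeper-a.ns.svc.cluster.local", region_b, region_a))

    # If both clusters used the same service name, Region B would resolve its own
    # copy and never reach Region A.
    region_a_dup = {"clickhouse.ns.svc.cluster.local": "10.0.1.9"}
    region_b_dup = {"clickhouse.ns.svc.cluster.local": "10.1.2.9"}
    print(resolve("clickhouse.ns.svc.cluster.local", region_b_dup, region_a_dup))
```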
[31:34] – Why Headless Kubernetes Services Are Required
Alexander: Kubernetes services come in two types: regular services and headless services. A regular service has a stable virtual IP address that does not change even if the underlying pod changes, which is very reliable. However, that IP address belongs to the subnet of its own Kubernetes cluster and cannot be used with VPC peering.
Headless services do not have a static IP address. Instead, they route directly to the underlying pod. Headless services work correctly with VPC peering. So when you configure cross-region replication, all ClickHouse-related services must be converted to headless services. The standard Keeper deployment already uses headless services. If you use Keeper with the Altinity Kubernetes Operator for ClickHouse®, you can create a service template with a headless service and do the same for ClickHouse itself.
The Altinity Operator has a unique feature called ClickHouseInstallation Templates that allows you to inject extra behavior into your existing running cluster. This particular template converts the services for the first replica to headless ones by setting clusterIP: None, overwriting whatever was there before. This is useful when you cannot or do not want to modify your ClickHouseInstallation directly.
[33:29] – Cross-Region Cluster Definition for Distributed DDL
Alexander: Once you have VPC peering, modified CoreDNS, and headless services, replication will start working. The remaining piece is creating a special cross-region cluster definition that combines nodes from both ClickHouseInstallations. This allows you to run distributed DDL such as CREATE TABLE ON CLUSTER, ALTER TABLE, or DROP TABLE, and have those commands automatically propagated to both regions.
The cluster definition is straightforward: it just combines the hostnames from both clusters into a single cluster definition. It is not easy overall, and it requires a lot of manual heavy lifting, but it can be done with open-source tools. On the Altinity.Cloud side, we plan to add automation in the next few months that will make this much easier to set up. But you can do it with open source today.
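As an illustration of combining hostnames from both installations, the sketch below generates a remote_servers configuration fragment with a single shard whose replicas span both regions. The cluster name and hostnames are hypothetical, the single-shard layout is an assumption, and how the fragment gets delivered to the servers (for example through the operator's configuration files section) is left out.

```python
import xml.etree.ElementTree as ET
from typing import List


def cross_region_cluster(cluster_name: str, hosts: List[str], port: int = 9000) -> str:
    """Build a remote_servers fragment with one shard and one replica per host."""
    root = ET.Element("clickhouse")
    cluster = ET.SubElement(ET.SubElement(root, "remote_servers"), cluster_name)
    shard = ET.SubElement(cluster, "shard")
    # Replicated tables handle copying data between replicas themselves.
    ET.SubElement(shard, "internal_replication").text = "true"
    for host in hosts:
        replica = ET.SubElement(shard, "replica")
        ET.SubElement(replica, "host").text = host
        ET.SubElement(replica, "port").text = str(port)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical service names for two ClickHouseInstallations.
    hosts = [
        "chi-region-a-ch-0-0.ns-a.svc.cluster.local",
        "chi-region-a-ch-0-1.ns-a.svc.cluster.local",
        "chi-region-b-ch-0-0.ns-b.svc.cluster.local",
        "chi-region-b-ch-0-1.ns-b.svc.cluster.local",
    ]
    print(cross_region_cluster("cross_region", hosts))
```

With a definition like this in place on every node, a statement such as CREATE TABLE ... ON CLUSTER cross_region would target the replicas in both regions.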
[36:13] – Failover Procedures: Keeper Promotion and SYSTEM RESTORE REPLICA
Alexander: When a DR event actually occurs, things get really complicated. Your first reaction should be to check whether the situation will resolve on its own. Maybe it is a temporary network problem and you will spend more time failing over than just waiting for the network to come back. But if your primary region is definitively down for the foreseeable future, you have to fail over. Both options here require automation.
Option 1: SYSTEM RESTORE REPLICA approach. Switch your users to your standby site. Since Keeper is not running there and your primary site is down, inserts will not work. Your dashboards can still be populated from existing data, but no new data arrives. To fix this: stop ingest from the failed primary, start a new Keeper ensemble on the standby site, reconfigure ClickHouse to point to this new Keeper instead of the one in the primary region, and then run SYSTEM RESTORE REPLICA on every replicated table. That command restores the ZooKeeper metadata state by reading from disk, making the tables writable. Then you restart ingest.
The downside is that for some period of time your tables will be read-only, which requires careful scripting. And you have to iterate through every replicated table because SYSTEM RESTORE REPLICA works on a per-table basis.
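Because the command works per table, a failover script typically needs a loop. A minimal sketch, assuming you first list read-only replicated tables from system.replicas and then generate one statement per table; the query and the helper name are illustrative, and executing the statements is left to whatever client you use.

```python
from typing import Iterable, List, Tuple

# Replicated tables that are currently read-only on this server.
READONLY_REPLICAS_SQL = "SELECT database, table FROM system.replicas WHERE is_readonly"


def restore_replica_statements(tables: Iterable[Tuple[str, str]]) -> List[str]:
    """One SYSTEM RESTORE REPLICA statement per (database, table) pair."""
    return [f"SYSTEM RESTORE REPLICA `{db}`.`{tbl}`" for db, tbl in tables]


if __name__ == "__main__":
    # Rows as they might come back from READONLY_REPLICAS_SQL (made-up names).
    rows = [("default", "events"), ("default", "sessions")]
    for stmt in restore_replica_statements(rows):
        print(stmt)
```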
Option 2: Keeper Observer promotion. If you deployed Keeper Observers in your DR site, they already have the metadata. Promote those observers to become the primary Keeper ensemble by disconnecting them from the failed primary and reconfiguring everything. After your Keeper ensemble starts acting as primary, inserts can be resumed.
Downsides: it is not automated, you still need scripting, and for some period of time the cluster will be in read-only mode. Also, promoting Keeper observers is not very straightforward compared to ZooKeeper, where this pattern is more mature. So this particular scenario is easier with ZooKeeper in some cases.
Robert: There are definitely companies who have done this successfully. It does require scripting, but it is definitely doable. SYSTEM RESTORE REPLICA is your friend because if all else fails, you can always recover the ZooKeeper metadata. I also want to mention that there are some bugs in the Keeper reconfiguration commands that make failover harder than it should be. Those are things that will eventually get fixed and make this much easier.
[44:55] – Independent Clusters: Dual Ingest via Kafka
Alexander: The last DR strategy is independent clusters: just set up independent clusters in multiple regions and multiply your ingest. This is probably the easiest approach overall. You ingest independently into every site. I personally implemented this pattern for the first time about 15 to 20 years ago with other databases. It is a universal approach, not ClickHouse-specific.
With Kafka, or any message broker that lets you consume the same data into multiple destinations, you can have your ClickHouse clusters, which are completely independent and possibly in different availability zones, different regions, or even different cloud providers, consume from different consumer groups but receive exactly the same data. Each cluster processes the same data the same way, so you can assume that in theory they are identical. In practice there could be some discrepancies from schema differences or errors on one side that are not tracked on the other, but in theory both systems will be identical. You just make sure you have the same version of software and the same schema.
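If the clusters ingest through the Kafka table engine, the only thing that has to differ between them is the consumer group. The sketch below generates the source-table DDL for each region. The table name, broker address, topic, and group names are hypothetical, and the single String column with the JSONAsString format is a placeholder for your real schema.

```python
def kafka_source_ddl(table: str, brokers: str, topic: str, consumer_group: str) -> str:
    """Kafka engine table; each independent cluster gets its own consumer group."""
    return (
        f"CREATE TABLE {table} (payload String) ENGINE = Kafka "
        f"SETTINGS kafka_broker_list = '{brokers}', "
        f"kafka_topic_list = '{topic}', "
        f"kafka_group_name = '{consumer_group}', "
        f"kafka_format = 'JSONAsString'"
    )


if __name__ == "__main__":
    for region in ("region-a", "region-b"):
        # Same topic, different group: both clusters receive every message.
        print(kafka_source_ddl("kafka_events", "broker:9092", "events", f"clickhouse-{region}"))
```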
[47:07] – Advanced: Two-Subcluster Design for Always-Writable DR
Alexander: There is an interesting advanced design that some ClickHouse users have implemented: a cross-region system that stays tolerant to writes even during a regional failure.
The previous failover scenarios resulted in a read-only period. Here is how to work around that. You deploy two subclusters. For the first subcluster, Region A is the primary. For the second subcluster, Region B is the primary. Both subclusters replicate to each other, and each subcluster has replicas in both Region A and Region B. Each subcluster has its Keeper ensemble in its own primary region.
To have a full view of the data, you define a distributed table that collects nodes from both subclusters. In every region you have a consistent, complete view of all the data.
If Region A goes down, you still have the full data in Region B and you can still do writes. You can only write to one subcluster, but you can still do writes. Similarly, if Region B goes down, you can continue writing in Region A and you still have the full data there. You do not have an outage. The advantage of this design is that you do not need to do an immediate failover. You have time to think and decide, and if Region A eventually recovers, you do not need to do anything at all. Meanwhile, the system continues to operate.
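The behavior of the two-subcluster design under a regional outage can be captured in a toy model. The subcluster names are made up, and the model encodes only what is described above: each subcluster has replicas in both regions, and its Keeper ensemble lives in its primary region.

```python
from typing import Dict, List, Set

# Subcluster -> region that hosts its Keeper ensemble (its primary region).
SUBCLUSTERS: Dict[str, str] = {"sub1": "A", "sub2": "B"}


def writable_subclusters(down: Set[str]) -> List[str]:
    """A subcluster accepts inserts only while its Keeper region is up."""
    return sorted(s for s, keeper_region in SUBCLUSTERS.items() if keeper_region not in down)


def readable_subclusters(region: str, down: Set[str]) -> List[str]:
    """Every subcluster has a replica in every region, so a live region sees all data."""
    if region in down:
        return []
    return sorted(SUBCLUSTERS)


if __name__ == "__main__":
    outage = {"A"}
    print("writable:", writable_subclusters(outage))
    print("readable from B:", readable_subclusters("B", outage))
```

In this model, losing Region A leaves one subcluster writable and both readable from Region B, which is why no immediate failover is needed.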
[49:14] – Summary and Wrap-Up
Robert: Let us do a quick summary.
Backups are the cheapest approach and you always want to have them in some form, because backups handle the drop-table scenario. If you do a DROP TABLE ... ON CLUSTER, replication will propagate that deletion and you are not getting the data back unless you have a backup somewhere.
Cross-AZ replication is something every cluster running in a cloud region should have configured. The Altinity Kubernetes Operator makes it very easy to set up. There are a few minor details, like making sure you are not generating unexpected cross-AZ traffic from Keeper, but it is so straightforward that there is no excuse not to do it.
Cross-region DR gives you coverage in case an entire region goes down. The hardest part is the custom networking setup, followed by the failover procedure. It also works well, and Alexander’s blog post on setting up cross-region ClickHouse replication in Kubernetes has all the details.
Independent clusters and the two-subcluster design reduce RPO and RTO and simplify management. The two-subcluster design is an elegant solution that should get you through most disasters, big and small, without an outage or a read-only period.
Altinity.Cloud is adding full support for cross-region DR. If you are looking for an easier path, check us out. Most of the things described here will be available by pressing a button.
For further reading, check the Altinity blog, the Altinity YouTube channel, and the ClickHouse docs for the built-in backup commands. Alexander’s cross-region replication blog article is where all the networking examples in this talk come from. We also did a recent webinar entirely on clickhouse-backup that covers incremental backups, all the ins and outs, and is available on our YouTube channel.
[53:21] – Q&A
Q: Are there plans to support multi-cluster deployment for the Altinity Kubernetes Operator for DR purposes?
Alexander: The operator operates within a single Kubernetes cluster, so multi-cluster deployments require something sitting outside the operator itself. The best example of this being done is by eBay, who implemented a “Federated ClickHouseInstallation” resource and a special operator for it that maintained ClickHouseInstallation resources across two Kubernetes clusters. They did not open source it, but it is definitely possible. We have not built it yet because the multi-region DR scenario is rare enough that it has not risen to the top of the priority list. But we are totally open to collaborating on it.
Robert: The Altinity Kubernetes Operator is completely open source and we take pull requests. If you would like to help build something like this or work collaboratively, come talk to us.
Q: How much cross-AZ data transfer does replication generate?
Alexander: It is typically less than the ingest traffic, because the data that gets replicated consists of already well-compressed parts. If you get 90% compression, only about 10% of the raw ingest volume crosses AZ boundaries for replication.
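A quick worked example of that arithmetic, using the two cents per gigabyte mentioned earlier in the talk. The daily ingest volume is a made-up figure, and the sketch assumes each part is copied once to a single replica in another AZ; actual pricing varies by cloud provider.

```python
def cross_az_replication_cost(
    raw_ingest_gb: float,
    compressed_fraction: float,
    price_per_gb: float,
    remote_replicas: int = 1,
) -> float:
    """Cost of shipping compressed parts to replicas in other availability zones."""
    replicated_gb = raw_ingest_gb * compressed_fraction * remote_replicas
    return replicated_gb * price_per_gb


if __name__ == "__main__":
    # 1,000 GB/day raw ingest (hypothetical), 90% compression leaves 10% to ship.
    cost = cross_az_replication_cost(1000, 0.10, 0.02)
    print(f"${cost:.2f} per day")
```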
Q: Would Ingress gateways work instead of CoreDNS for cross-cluster service resolution?
Robert: Maybe, but we try to find solutions that work with any cloud provider. CoreDNS is available everywhere. Ingress gateways are the place where cloud providers differ the most in their distributions. Something built for Amazon might be completely different or extremely costly in another cloud. Using native Kubernetes resources like CoreDNS keeps things portable.
Q: Is it better to keep ZooKeeper in the same cluster as ClickHouse or on separate VMs?
Robert: Keep ZooKeeper or Keeper on separate VMs or node pools with no other workloads. And we strongly recommend that you do not share Keepers or ZooKeepers across ClickHouse clusters. If one of your clusters goes crazy and puts a lot of load on the Keeper, it can cause tables in other clusters to go read-only, which stops ingest across the board. Put Keepers in their own corner, feed them well, and you will be happy. They do not use that much compute, so the cost is not significant.
Q: What about UNDROP TABLE if a drop was executed without sync?
Alexander: UNDROP TABLE exists but depends on the timing. When you drop a table, ClickHouse tries not to physically delete the inactive parts for about 8 minutes by default. If you realize immediately that you dropped the wrong table, you may be able to use UNDROP, but you do not have a lot of time. I would not rely on it as a DR strategy.
Robert: ClickHouse also has some built-in guard rails: there are server settings that cap the size of a table you can drop unless you first override them. So to drop a big table you really have to go out of your way.
Q: What about insert quorum?
Robert: Insert quorum ensures that copies of data get to multiple servers before an insert is acknowledged. It helps protect against a case where data is trapped on a single server that you then lose. However, it does not help when ZooKeeper or Keeper metadata gets corrupted. For that, SYSTEM RESTORE REPLICA is the right tool. ClickHouse is quite robust: as long as you do not lose the servers themselves, you can usually recover. It is in many ways better than other databases I have worked with.
FAQ Section
Q: What are the four main disaster recovery strategies for ClickHouse and when should I use each?
A: The four strategies are backups, multi-AZ replication, cross-region replication, and independent clusters. Backups are the cheapest and simplest option and should always be in place, primarily to guard against accidental table drops, which replication does not protect against. Multi-AZ replication is something every production cluster should have: it is easy to set up with the Altinity Kubernetes Operator and protects against node and availability zone failures within a single region. Cross-region replication protects against entire region outages and also improves read latency for geographically distributed users, but requires custom networking configuration and planned failover procedures. Independent clusters with dual ingest are the operationally simplest cross-region approach and, in the two-subcluster variant, allow both regions to remain writable even during a regional failure.
Q: What does Altinity Backup for ClickHouse back up and how does it work?
A: Altinity Backup for ClickHouse® (clickhouse-backup) backs up ClickHouse configuration files, table schemas, table data, and RBAC metadata, including when that metadata is stored in ZooKeeper or Keeper. It works by using filesystem hard links to create an instantaneous shadow copy of the ClickHouse data directory, which requires no extra disk space and causes no downtime. The shadow copy can then be uploaded to object storage in the background. Restoring reverses the process: the files are downloaded to a shadow directory and then moved back into the ClickHouse server. The tool supports full databases, individual tables, individual partitions, schema-only restores, and incremental backups.
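The hard-link mechanism can be illustrated with a small sketch. This is not how clickhouse-backup is implemented internally; it only demonstrates, on a temporary directory with made-up file names, why a hard-linked copy is instantaneous and takes no extra space.

```python
import os
import tempfile


def hard_link_snapshot(src_path, shadow_dir):
    """Create a hard link to src_path inside shadow_dir and return its path.

    No file data is copied: both names point at the same inode.
    """
    os.makedirs(shadow_dir, exist_ok=True)
    dst_path = os.path.join(shadow_dir, os.path.basename(src_path))
    os.link(src_path, dst_path)
    return dst_path


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for a file inside a ClickHouse data part.
    part_file = os.path.join(root, "data.bin")
    with open(part_file, "wb") as f:
        f.write(b"column data")

    snapshot = hard_link_snapshot(part_file, os.path.join(root, "shadow"))

    same_inode = os.stat(part_file).st_ino == os.stat(snapshot).st_ino
    link_count = os.stat(part_file).st_nlink  # two names, one copy of the data

    # Removing the original name (as a merge might) leaves the snapshot intact.
    os.remove(part_file)
    with open(snapshot, "rb") as f:
        surviving = f.read()

print(same_inode, link_count, surviving)
```

Because the data blocks stay alive as long as any name refers to them, the shadow copy remains valid for upload even if the server later removes the original file.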
Q: Do I need to back up ZooKeeper or Keeper separately?
A: Generally no. ZooKeeper and Keeper store the state of the distributed system rather than actual data: which replicas have which parts, what merges are pending, etc. That state is always changing and cannot be captured consistently from a running system. The correct approach is to back up only the ClickHouse data itself, and then use the SYSTEM RESTORE REPLICA command after recovering the data. This command reads the on-disk state of each ClickHouse server and reconstructs the ZooKeeper or Keeper metadata from it, making all replicated tables writable again.
Q: What is the SYSTEM RESTORE REPLICA command and when should I use it?
A: SYSTEM RESTORE REPLICA is a ClickHouse command that rebuilds the ZooKeeper or Keeper metadata for a replicated table by reading the data files on disk for the server where it is executed. You use it in two scenarios: after restoring a backup, to rebuild the ZooKeeper state that matches the restored data; and after a DR failover where the original Keeper is unavailable and you have started a fresh Keeper ensemble on the standby site. The command must be run once per replicated table. On large clusters this can be scripted. It is one of the most important commands for ClickHouse operators to know.
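Scripting the per-table step could look like the following sketch. It only generates the statements; running them against a server is left to whatever client you use. The table list is hypothetical, and the lookup query is one plausible way to find replicated tables, not a procedure shown in the webinar.

```python
# Query you could run on each server to list its replicated tables (an assumption).
FIND_REPLICATED_TABLES = (
    "SELECT database, name FROM system.tables WHERE engine LIKE 'Replicated%'"
)


def quote_identifier(name):
    """Wrap an identifier in backticks, rejecting names that would need escaping."""
    if not name or "`" in name or "\\" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def restore_replica_statements(tables):
    """Build one SYSTEM RESTORE REPLICA statement per (database, table) pair."""
    return [
        f"SYSTEM RESTORE REPLICA {quote_identifier(db)}.{quote_identifier(table)}"
        for db, table in tables
    ]


# Hypothetical result of FIND_REPLICATED_TABLES on one server.
tables = [("default", "events"), ("analytics", "sessions")]
statements = restore_replica_statements(tables)
for statement in statements:
    print(statement)
```

Since the command works from the on-disk state of the server where it runs, a script like this would be run against each replica in turn.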
Q: How do you set up cross-region replication on Kubernetes?
A: The main steps are: configure VPC peering so that the two Kubernetes clusters can route traffic to each other; modify CoreDNS in both clusters to forward unresolvable service names to the other cluster’s DNS server; convert all ClickHouse and Keeper services to headless services (which route to the pod directly rather than through a virtual IP that is confined to a single VPC’s subnet); deploy a cross-region cluster definition that combines hostnames from both regions for distributed DDL; and finally define the ZooKeeper section in ClickHouse configuration to point all nodes to the same Keeper ensemble. The Altinity Kubernetes Operator’s ClickHouseInstallation Template feature can inject headless service configuration into running clusters without modifying the original installation directly. Alexander Zaitsev’s blog post on setting up cross-region ClickHouse replication in Kubernetes covers all the networking details.
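To make the cross-region cluster definition step more concrete, here is a minimal sketch that renders a remote_servers section combining replicas from two regions. The cluster name and hostnames are placeholders invented for illustration, and in an operator-managed deployment you would normally supply this configuration through the ClickHouseInstallation resource rather than write XML by hand.

```python
import xml.etree.ElementTree as ET


def cross_region_cluster_xml(cluster_name, shards, port=9000):
    """Render a remote_servers definition.

    shards is a list of shards; each shard is a list of replica hostnames,
    which may live in different regions.
    """
    root = ET.Element("clickhouse")
    remote_servers = ET.SubElement(root, "remote_servers")
    cluster = ET.SubElement(remote_servers, cluster_name)
    for replica_hosts in shards:
        shard = ET.SubElement(cluster, "shard")
        # Let replicated tables handle copying data between replicas.
        ET.SubElement(shard, "internal_replication").text = "true"
        for host in replica_hosts:
            replica = ET.SubElement(shard, "replica")
            ET.SubElement(replica, "host").text = host
            ET.SubElement(replica, "port").text = str(port)
    return ET.tostring(root, encoding="unicode")


# Placeholder hostnames: one replica per region for a single shard.
shards = [[
    "ch-0-0.region-a.example.internal",
    "ch-0-0.region-b.example.internal",
]]
cluster_xml = cross_region_cluster_xml("cross_region", shards)
print(cluster_xml)
```

The hostnames must be resolvable from both Kubernetes clusters, which is what the CoreDNS forwarding and headless services described above provide.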
Q: What is the two-subcluster design and why is it useful?
A: The two-subcluster design eliminates the need for an immediate failover and keeps both regions writable at all times. You create two independent ClickHouse subclusters: the first has its primary Keeper in Region A, the second has its primary Keeper in Region B. Both subclusters have replicas in both regions and replicate data to each other. A distributed table collects nodes from both subclusters to give a complete unified view of all data. If Region A fails, Region B still has all the data and can continue accepting writes to the second subcluster. If Region B fails, Region A continues writing to the first subcluster with full data available. You do not need to trigger a failover immediately, which gives you time to assess the situation and potentially recover the failed region with no data loss.
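The failure behavior can be modeled with a small sketch. It makes a simplifying assumption that a subcluster is writable only while the region hosting its primary Keeper is up, and readable while any replica survives; the class and region names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcluster:
    name: str
    keeper_region: str       # region hosting this subcluster's primary Keeper
    replica_regions: tuple   # regions that hold replicas of its data


def status_after_failure(subclusters, failed_region):
    """Return readable/writable flags per subcluster after one region fails."""
    status = {}
    for sc in subclusters:
        surviving = [r for r in sc.replica_regions if r != failed_region]
        readable = bool(surviving)
        writable = readable and sc.keeper_region != failed_region
        status[sc.name] = {"readable": readable, "writable": writable}
    return status


subclusters = [
    Subcluster("first", keeper_region="A", replica_regions=("A", "B")),
    Subcluster("second", keeper_region="B", replica_regions=("A", "B")),
]

after_a_fails = status_after_failure(subclusters, failed_region="A")
after_b_fails = status_after_failure(subclusters, failed_region="B")
print(after_a_fails)
print(after_b_fails)
```

In either failure case all data stays readable and exactly one subcluster keeps accepting writes, which is why no immediate failover is needed.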
© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.