A New Way to Restore ClickHouse After ZooKeeper Metadata Is Lost
ClickHouse uses ZooKeeper for the ReplicatedMergeTree table engine family. Sometimes ZooKeeper data may be lost due to user error or operational failure. Every ClickHouse user has experienced this at least once, and it was not the best day of their life. If a failure results in metadata loss, ClickHouse tables turn into pumpkins: users can still query them but can no longer insert data. Recovering the tables used to require multiple SQL and file-system-level manipulations.
As of ClickHouse version 21.7, this has changed. There is a new SQL command: SYSTEM RESTORE REPLICA.
SYSTEM RESTORE REPLICA is a powerful tool. It restores table state in various situations, for example:
- Total ZooKeeper loss
- Loss of ZooKeeper metadata for a particular table
- Loss of ZooKeeper metadata for a particular replica
Even if you have a cluster with 50 replicas and one replica is down, you can still bring it back online. As a bonus, if the data on a replica was not erased, its data parts are not fetched over the network, which speeds up recovery and reduces network traffic.
Let’s try it.
Restore Replica Example
Imagine we have a cluster with three replicas.
CREATE TABLE test(n UInt32)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/test/', '{replica}')
ORDER BY n PARTITION BY n % 10;

INSERT INTO test SELECT * FROM numbers(1000);
Suddenly, ZooKeeper loses the metadata for all replicas (this can be simulated with zookeeper-cli, or with zk.delete in integration tests):
zk.delete("/clickhouse/tables/test", recursive=True)
Now we have to restart all replicas. The RESTORE query operates only on read-only tables for security reasons. If you restart a replica while its ZooKeeper metadata is missing, it will attach as read-only:
SYSTEM RESTART REPLICA test;
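You can confirm which replicated tables came back read-only by querying the system.replicas system table; its is_readonly column reflects exactly this state:

```sql
-- List replicated tables that are currently in read-only mode
SELECT database, table, is_readonly
FROM system.replicas
WHERE is_readonly = 1;
```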
And now we can restore everything. You can either execute the RESTORE query on each replica by hand or use the ON CLUSTER version.
SYSTEM RESTORE REPLICA test; -- Single replica restoration

SYSTEM RESTORE REPLICA test ON CLUSTER cluster;
It’s that simple.
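After restoration, a quick sanity check on our example table should confirm that the data is back and the table accepts writes again:

```sql
-- All 1000 rows from the earlier insert should be present
SELECT count() FROM test;

-- Inserts work again: the table is no longer read-only
INSERT INTO test VALUES (1001);
```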
How It’s Implemented
The implementation mostly relies on improved ALTER TABLE ATTACH internals. Before these changes, ALTER ATTACH pushed a GET_PART command into the replication log. While applying this command, the replica fetched the part from some other replica.
Now ALTER ATTACH pushes a special ATTACH_PART command. The replica first checks the detached folder for the missing part, comparing parts by checksums. If the part is found there, the replica attaches it from the detached folder; otherwise, it fetches the part from another replica.
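For reference, an ATTACH statement that exercises this code path looks like this (partition 0 comes from our example table's PARTITION BY n % 10 scheme):

```sql
-- Re-attach a detached partition; with the improved internals,
-- matching local parts in detached/ are reused instead of fetched
ALTER TABLE test ATTACH PARTITION 0;
```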
When invoked on a read-only replica, the RESTORE interpreter collects all data parts in all states. It then moves all parts to the detached folder and clears the internal state, so the table becomes empty.
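You can inspect what ended up in the detached folder through the system.detached_parts system table (the table filter below assumes our example table name):

```sql
-- Show detached parts for the example table, with the reason they were detached
SELECT name, reason
FROM system.detached_parts
WHERE table = 'test';
```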
After that, the interpreter tries to push the table metadata into ZooKeeper. Only the first query succeeds, so multiple requests can be executed on multiple replicas in parallel.
Finally, all partitions in the detached folder are attached. Concurrent RESTORE queries on a single replica trigger an error, so only one does the restoration job.
Try It Yourself
It is always better to be prepared for the worst-case scenario, so try out the SYSTEM RESTORE REPLICA query to bring back your cluster nodes when ZooKeeper fails. Check out the Altinity Documentation Site on setting up ClickHouse clusters with ZooKeeper and try to simulate the failure. If you are running on Altinity.Cloud, your ZooKeeper is already safely managed, so this isn't a worry. Still have questions? Contact us and let us know so we can help!