A New Way to Restore ClickHouse After ZooKeeper Metadata Is Lost

ClickHouse uses ZooKeeper for the ReplicatedMergeTree table engine family. Sometimes ZooKeeper metadata may be lost due to user error or operational failure. Every ClickHouse user has experienced this at least once, and it was not the best day of their life. If a failure results in metadata loss, ClickHouse tables turn into pumpkins: users can still query them, but can no longer insert data. Recovering such tables used to require multiple SQL statements and file-system-level manipulations.

As of ClickHouse version 21.7, this has changed. There is a new SQL command: SYSTEM RESTORE REPLICA.

SYSTEM RESTORE REPLICA is a powerful tool. It’s used to restore table state in various situations, for example:

  • Total ZooKeeper loss
  • Loss of ZooKeeper metadata for a particular table
  • Loss of ZooKeeper metadata for a particular replica

Even if you have a cluster with 50 replicas and one replica is out, you can still bring it back online. As a bonus, if the data on the replicas was not erased, data parts are not fetched over the network, which speeds up recovery and reduces network traffic.

Let’s try it.

Restore Replica Example

Imagine we have a cluster with three replicas.

CREATE TABLE test(n UInt32)
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/test/', '{replica}')
    ORDER BY n PARTITION BY n % 10;

INSERT INTO test SELECT * FROM numbers(1000);

Suddenly, ZooKeeper loses metadata for all replicas (this can be simulated by using zookeeper-cli or zk.delete in integration tests):

zk.delete("/clickhouse/tables/test", recursive=True)
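
A quick way to confirm the metadata is really gone is to query the system.zookeeper table from ClickHouse itself (this assumes the same ZooKeeper path as in the table definition above):

SELECT name FROM system.zookeeper WHERE path = '/clickhouse/tables';
-- the 'test' node should no longer appear among the children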

Now we have to restart all replicas: the RESTORE query operates only on read-only tables as a safety measure.

If you restart a replica with ZooKeeper metadata missing, it will attach as read-only:

SYSTEM RESTART REPLICA test;
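
You can check that the replica really came back as read-only by looking at the system.replicas table:

SELECT database, table, is_readonly FROM system.replicas WHERE table = 'test';
-- is_readonly should be 1 for the affected replica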

And now we can restore everything. You can either execute the RESTORE query on each of the replicas by hand or use the ON CLUSTER version:

SYSTEM RESTORE REPLICA test; -- Single replica restoration
SYSTEM RESTORE REPLICA test ON CLUSTER cluster;
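
Once the restore completes, the table should be writable again and still contain the 1000 rows we inserted earlier. A quick sanity check:

SELECT count() FROM test;
-- should return 1000 again

INSERT INTO test SELECT * FROM numbers(1000, 10);
-- inserts work again, the table is no longer read-only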

It’s that simple.

How It’s Implemented 

The implementation mostly relies on improved ALTER TABLE ... ATTACH internals. Before these changes, ALTER TABLE ... ATTACH pushed a GET_PART entry into the replication log. While applying this entry to itself, the replica fetched the part from some other replica.

Now ALTER TABLE ... ATTACH pushes a special ATTACH_PART entry instead. The replica first checks its detached folder for the missing part, comparing parts by checksums. If the part is found, the replica attaches it from the detached folder; otherwise, it fetches the part from another replica.
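
You can observe this behavior in isolation by detaching and re-attaching a partition yourself; with the new logic the parts should come back from the local detached folder on every replica instead of being downloaded (partition 3 below is just one of the n % 10 partitions from the example table):

ALTER TABLE test DETACH PARTITION 3;
ALTER TABLE test ATTACH PARTITION 3;
-- the parts are found in the detached folder by checksum and attached locally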

When invoked on a read-only replica, the RESTORE interpreter collects all data parts in all states. It then moves all of them to the detached folder and clears the table's internal state, so the table becomes empty.
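
You can peek at the detached folder at any point through the system.detached_parts table, which shows exactly what RESTORE has to work with:

SELECT name, reason FROM system.detached_parts WHERE table = 'test';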

After that, the interpreter tries to push the table metadata into ZooKeeper. Only the first query succeeds in creating it, so RESTORE can safely be executed on multiple replicas in parallel.
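
After this step the usual replication nodes (replicas, log, blocks, and so on) exist in ZooKeeper again, which you can confirm with the same system.zookeeper table as before:

SELECT name FROM system.zookeeper WHERE path = '/clickhouse/tables/test';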

Finally, all partitions from the detached folder are attached. Running multiple RESTORE queries on a single replica triggers an error, so only one of them performs the restoration work.

Try It Yourself

It is always better to be prepared for the worst-case scenario. Try out the SYSTEM RESTORE REPLICA query and practice bringing cluster nodes back online when ZooKeeper fails. Check out the Altinity Documentation Site on setting up ClickHouse clusters with ZooKeeper and try to simulate the failure. If you are running on Altinity.Cloud, your ZooKeeper is already safely managed, so this isn't a worry. Still have questions? Contact us and let us know so we can help!
