All About ZooKeeper (And ClickHouse® Keeper, too!)

Recorded: Wednesday, Mar 30 | 10 am PT
Presenters: Robert Hodges & Altinity Support Engineering
In this tutorial, Altinity CEO Robert Hodges covers everything operators need to know about ZooKeeper in ClickHouse deployments. He starts with why it’s needed at all: when ClickHouse tables are replicated across nodes, those nodes must agree on what data they hold, what merges to run, and what DDL to execute—even as nodes fail or restart. ZooKeeper solves this distributed consensus problem by acting as a consistent directory service that stores the ON CLUSTER DDL queue, part-tracking for replication, and insert deduplication hashes.
The session then goes under the hood and into operations. Robert explains the ZAB protocol—leader election, quorum-acknowledged writes, and why three nodes beat five for most workloads—and walks through the z-node tree structure with real examples. On the practical side, he covers installation (Ubuntu setup, production hardware like dedicated SSD-backed hosts and JVM tuning, key zoo.cfg and ClickHouse config files), administration (running three nodes, using system.zookeeper, four-letter commands, and Prometheus/Grafana monitoring), and troubleshooting the four most common failures: read-only tables, expired sessions, a lost single node, and complete ensemble loss (recoverable via SYSTEM RESTORE REPLICA).
The second half introduces ClickHouse Keeper, a from-scratch C++ reimplementation of ZooKeeper built on the Raft consensus protocol. It ships with ClickHouse, requires no separate installation, and works as a drop-in replacement with the same API and tooling—running either as an external ensemble or embedded for development, with clickhouse-keeper-converter handling migration. Robert closes by assessing it as production-ready on ClickHouse 22.3 LTS, with Q&A on when to prefer Keeper and plans for the Altinity Kubernetes Operator to support embedded mode.
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 0:03 – Introduction: Robert Hodges, Altinity
- 1:43 – Why does ClickHouse need ZooKeeper? The distributed consensus problem
- 4:16 – Example: ReplicatedMergeTree table creation with ON CLUSTER command
- 7:41 – What can go wrong in distributed systems: failed DDL, offline nodes, conflicting merges
- 9:55 – What ZooKeeper does: stores pending ON CLUSTER commands and replication part tracking
- 11:14 – ZooKeeper architecture: three-node ensemble, ZAB protocol, leader election
- 12:02 – ZAB protocol: leader serializes writes, forwards to followers, waits for quorum acknowledgement
- 14:55 – Performance implications: write throughput gated on leader capability and follower count
- 15:51 – ZooKeeper data structure: z-nodes as hierarchical directory (like a file system)
- 17:06 – What’s stored: task queues for ON CLUSTER DDL, table parts, deduplication hashes
- 18:55 – ClickHouse source code reference: replicated storage class describes all ZooKeeper usage
- 21:30 – Installing ZooKeeper on Ubuntu: apt install, version requirements (3.4.9+)
- 22:26 – Hardware requirements: dedicated hosts, SSD for transaction log, 4 GB RAM, disable swap, JVM heap
- 24:50 – ZooKeeper process internals: in-RAM z-node copy, myid file, zoo.cfg, data dir, dataLogDir
- 26:33 – Essential zoo.cfg settings: autopurge.purgeInterval, autopurge.snapRetainCount, server list
- 27:49 – Starting ZooKeeper and four-letter commands: ruok, mntr, stat
- 28:51 – Connecting ClickHouse to ZooKeeper: zookeeper.xml configuration
- 29:40 – Setting up macros: macros.xml with cluster, shard, replica values
- 30:14 – How many ZooKeeper nodes? 3 in production, 1 in development; why 2 is worse than 1
- 32:37 – system.zookeeper: inspecting z-nodes, verifying connectivity
- 33:46 – Using zk_cli.sh and other ZooKeeper tools
- 34:27 – Four-letter commands in more depth: ruok, conf, cons, mntr, srvr
- 35:11 – Monitoring: Prometheus + Grafana, check_zookeeper.pl, Altinity Knowledge Base
- 36:10 – Troubleshooting: read-only tables
- 39:05 – Troubleshooting: session expired
- 40:29 – Troubleshooting: single node loss — replace with same ID and restart
- 41:44 – Troubleshooting: complete ensemble loss — SYSTEM RESTORE REPLICA command
- 43:05 – Introduction to ClickHouse Keeper
- 43:38 – What ClickHouse Keeper is: C++ ZooKeeper replacement using Raft, bundled with ClickHouse
- 44:28 – Why ZooKeeper was replaced: no longer actively developed, Java complexity, zxid overflow, uncompressed logs
- 46:11 – ClickHouse Keeper as external ensemble: same patterns as ZooKeeper on dedicated hosts
- 47:14 – ClickHouse Keeper embedded in ClickHouse: ensemble runs inside ClickHouse nodes
- 48:24 – Single-node embedded mode for developers: replicated schema from day one, no external ZooKeeper needed
- 49:19 – ClickHouse Keeper configuration: keeper_server section, server id, port, raft_configuration
- 51:23 – Compatibility: system.zookeeper, four-letter commands, zk_cli.sh all work unchanged
- 51:39 – How to tell you’re talking to ClickHouse Keeper: stat command shows “clickhouse_server_vX.X”
- 52:52 – Migration: clickhouse-keeper-converter, outage required, test thoroughly
- 53:35 – Production readiness assessment: ready for production on ClickHouse 22.3 LTS
- 56:14 – Q&A: when to prefer ZooKeeper over ClickHouse Keeper
- 58:58 – Q&A: Altinity Kubernetes Operator support for embedded ClickHouse Keeper (yes, soon)
Webinar Transcript
[0:03] — Introduction and Housekeeping
Robert: Okay, welcome to our webinar “All About ZooKeeper and ClickHouse Keeper Too.” My name is Robert Hodges. Together with the folks at Altinity engineering we’ve put together a comprehensive treatment of how ZooKeeper works with ClickHouse and also the new replacement called ClickHouse Keeper.
This is being recorded and we’ll send a link as well as slides to anyone who signed up — you’ll see it in your email within a few hours. For questions, the best way is the Q&A box in Zoom’s control panel. I’ll answer relevant ones as we go along and we’ll get the rest at the end.
[1:43] — About Altinity
Robert: My day job is CEO of Altinity. What’s relevant here: I’ve been working on databases since 1983, and a lot of that has involved distributed systems — which is the foundation underlying both ZooKeeper and ClickHouse Keeper. That makes this a fun topic to cover.
Altinity has about 42–43 people, mostly engineers. If you add up our database experience it’s literally centuries. We offer support and services for ClickHouse and the applications people build on it, run Altinity.Cloud as a managed ClickHouse platform, and are the authors of the Altinity Kubernetes Operator for ClickHouse.
[3:25] — Why Does ClickHouse Need ZooKeeper?
Robert: ZooKeeper is an extra component — an extra thing to manage, requiring Java, separate installation, and so on. Why does ClickHouse need it?
To understand why, let’s work through an example. One of the fundamental ways ClickHouse delivers outstanding performance in analytic queries is horizontal scaling: spreading data across many locations over a network.
We have four ClickHouse servers split into two shards (red and blue), each with two replicas. This allows us to take advantage of the processing capabilities of many machines, scale reads (multiple replicas), and scale writes (each shard handles a subset of data). This is one of the fundamental reasons ClickHouse is fast.
[4:16] — Creating a Replicated Table
Robert: So we create a table called events_local using ReplicatedMergeTree:
CREATE TABLE events_local ON CLUSTER '{cluster}'
(
    event_date Date,
    event_type String,
    value UInt64
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/{cluster}/tables/{shard}/events',
    '{replica}'
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type);
The ON CLUSTER '{cluster}' clause blasts this command across all four nodes. The {cluster}, {shard}, and {replica} in curly brackets are macros: ClickHouse fills them in with values from each server’s macros.xml configuration file. If this succeeds, all four nodes now have the table.
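To make the macro substitution concrete, here is a minimal Python sketch of how the ZooKeeper path and replica name might resolve on two servers. The function expand_macros and the sample values (a cluster named first, shard 01, replica names chi-0-0 and chi-0-1) are hypothetical illustrations; ClickHouse performs this substitution internally from macros.xml.

```python
import re

def expand_macros(template: str, macros: dict) -> str:
    """Replace each {name} in template with macros[name]; fail on undefined macros."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"macro {name!r} is not defined on this server")
        return macros[name]
    return re.sub(r"\{(\w+)\}", substitute, template)

# Hypothetical macro values for two replicas of the same shard.
replica_a = {"cluster": "first", "shard": "01", "replica": "chi-0-0"}
replica_b = {"cluster": "first", "shard": "01", "replica": "chi-0-1"}

path = "/clickhouse/{cluster}/tables/{shard}/events"
print(expand_macros(path, replica_a))          # table path seen by replica A
print(expand_macros(path, replica_b))          # same table path on replica B
print(expand_macros("{replica}", replica_b))   # replica name differs per server
```

Replicas of the same shard resolve to the same table path but to different replica names, which is how they end up sharing one set of z-nodes while still being distinguishable.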
We can then connect to any one of the nodes and start inserting data. We could also use a Distributed table as an umbrella to route inserts automatically.
[7:41] — What Can Go Wrong
Robert: As with everything in computing, we must ask: what could go wrong?
When hosts communicate over a network there are many things that could go wrong:
- The ON CLUSTER command could fail on one node: maybe the table already exists, there’s a schema issue, or the server has a missing database definition.
- A node could be offline for maintenance during a schema change or data insert.
- Two replicas could make conflicting decisions — for example, choose to merge different combinations of parts, creating an inconsistency.
- Two replicas could delete different overlapping sets of parts, causing conflicts or errors.
In general: in distributed systems, things can fail. When they fail they can create inconsistencies. This is called the distributed consensus problem. We need a mechanism to have all nodes agree on the proper state of things, and to have a way for nodes that missed an update to find out about it afterwards.
[9:55] — What ZooKeeper Does
Robert: That’s exactly what ZooKeeper does. When you set up a ClickHouse cluster connected to ZooKeeper, two of the biggest things ZooKeeper does are:
- Remembers ON CLUSTER commands: keeps a list of them in order, so that if a node misses one it can find out from ZooKeeper what it needs to execute.
- Keeps track of what needs to be replicated: records the parts that exist across tables, so a replica can ask ZooKeeper “what parts am I missing?” and then pull them from other replicas.
ZooKeeper is a set of nodes that store a consistent database. Their job is to make sure that database stays consistent so ClickHouse instances can always ask “what’s the next ON CLUSTER command I should execute?” or “what parts have I missed?” and get an accurate answer.
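The two questions a replica asks can be sketched in a few lines of Python. This is a toy in-memory model, not ClickHouse code: the Coordinator and Replica classes, the part names, and the abbreviated SQL strings are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Coordinator:
    """Stand-in for the shared state that ZooKeeper keeps consistent."""
    ddl_queue: list = field(default_factory=list)   # ON CLUSTER commands, in order
    parts: set = field(default_factory=set)         # every part known for a table

@dataclass
class Replica:
    name: str
    next_ddl: int = 0                                # index of the next command to run
    local_parts: set = field(default_factory=set)

    def pending_ddl(self, coord: Coordinator) -> list:
        """What is the next ON CLUSTER command I should execute?"""
        return coord.ddl_queue[self.next_ddl:]

    def missing_parts(self, coord: Coordinator) -> set:
        """What parts have I missed?"""
        return coord.parts - self.local_parts

coord = Coordinator(
    ddl_queue=["CREATE TABLE events_local", "ALTER TABLE events_local"],
    parts={"202203_0_0_0", "202203_1_1_0"},
)
# A replica that ran the first command and then went offline.
late = Replica("replica-2", next_ddl=1, local_parts={"202203_0_0_0"})
print(late.pending_ddl(coord))
print(late.missing_parts(coord))
```

Because every replica reads the same ordered queue and the same part list, a node that was offline only needs its own position to work out what it missed.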
[11:14] — ZooKeeper Architecture and ZAB Protocol
Robert: In a production system we typically have three ZooKeeper nodes: zookeeper-0, zookeeper-1, zookeeper-2. These nodes form what’s called an ensemble and talk to each other using the ZooKeeper Atomic Broadcast (ZAB) protocol.
ZooKeeper clients (ClickHouse instances) connect to any node in the ensemble on port 2181. Regardless of which node they connect to, they should get the same answer — because ZooKeeper maintains consensus between the nodes.
[12:02] — How ZAB Protocol Works
Robert: Here’s how ZAB maintains consistency:
- Leader election: one of the three nodes becomes the leader. All nodes participate in an election process to agree on who the leader is.
- Write serialization: when a write request arrives (say, at a follower like zookeeper-2), the follower forwards it to the leader (zookeeper-0). The leader sends the write to all followers and ensures they acknowledge it before committing.
- Quorum acknowledgement: the leader waits until a majority (quorum) of nodes have acknowledged the write before confirming it to the client. For three nodes, quorum is two (the leader plus at least one follower).
The key insight is that having a single leader ensures all writes are serialized in the same order across all nodes. This is the heart of ZAB — and of consensus protocols like Raft and Paxos generally.
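The write path described above can be modeled with a small Python sketch. It is a deliberately simplified, in-process toy under stated assumptions: there is no networking, no leader election, and no recovery, and the Node and Leader classes are hypothetical names rather than anything from ZooKeeper's code.

```python
class Node:
    """One ensemble member holding the committed write log."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.committed = []            # (sequence number, data) in commit order

    def ack(self, seq, data):
        # An unreachable node never acknowledges a proposal.
        return self.online

    def commit(self, seq, data):
        self.committed.append((seq, data))


class Leader(Node):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.next_seq = 1

    def quorum(self):
        ensemble_size = len(self.followers) + 1
        return ensemble_size // 2 + 1

    def write(self, data):
        """Propose a write; commit it only if a majority acknowledges."""
        seq = self.next_seq
        acked = [f for f in self.followers if f.ack(seq, data)]
        if 1 + len(acked) < self.quorum():   # the leader counts toward quorum
            return None                      # not committed
        self.next_seq += 1
        self.commit(seq, data)
        for follower in acked:
            follower.commit(seq, data)
        return seq


f1 = Node("zookeeper-1")
f2 = Node("zookeeper-2", online=False)       # one follower is down
leader = Leader("zookeeper-0", [f1, f2])
print(leader.write("create part A"))         # committed: leader plus one follower
print(leader.write("create part B"))
f1.online = False                            # now only the leader is left
print(leader.write("create part C"))         # no quorum, so nothing is committed
```

Since the leader alone assigns sequence numbers, every node that commits a write commits it at the same position in the log.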
[14:55] — Performance Implications
Robert: This protocol has important performance implications:
Write throughput is gated on the leader’s capability (its hardware, CPU, storage speed) and on how many followers it needs to distribute writes to. The more followers, the slower the write throughput. This is why three nodes beat five for most production workloads: with five, the leader has to send every write to four followers instead of two, and it needs acknowledgements from two of them instead of one before it can commit.
Reads are not gated on the leader: since every node in the ensemble maintains a consistent copy, reads can arrive at any node and it will look up and return the data. Read-heavy workloads scale independently of the write bottleneck.
[15:51] — ZooKeeper Data Structure: Z-Nodes
Robert: The data in ZooKeeper looks like a directory tree. We sometimes call ZooKeeper a “distributed directory service.”
If you look inside ZooKeeper connected to a ClickHouse cluster you’ll see something like a file system, with a root node, a clickhouse/ node, and then several chains under it:
- /clickhouse/{cluster}/ (for a cluster called first, that is /clickhouse/first/): tracks all the tables and their parts for replication.
- /clickhouse/task_queue/ — stores pending and completed ON CLUSTER DDL commands.
These special nodes are called z-nodes. Unlike a file system, a z-node is simultaneously a directory (it can have children) and a file (it can have content). If you look at the task queue and examine a specific command node, you’ll see something like:
/clickhouse/task_queue/ddl/query-0000000001
value: “DROP TABLE IF EXISTS…”
This structure is complex. For most operations you should never modify it manually: the only software that should write ZooKeeper data is ClickHouse itself.
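The idea that a z-node is both a directory and a file can be sketched as a small Python tree. ZNode and ZTree are hypothetical names for illustration, and the paths and values are examples only; a real ZooKeeper also tracks versions, ACLs, and ephemeral or sequential flags that are left out here.

```python
class ZNode:
    """A node that has content (like a file) and children (like a directory)."""

    def __init__(self, value=""):
        self.value = value
        self.children = {}


class ZTree:
    def __init__(self):
        self.root = ZNode()

    def _walk(self, path, create=False):
        node = self.root
        for part in (p for p in path.split("/") if p):
            if part not in node.children:
                if not create:
                    raise KeyError(path)
                node.children[part] = ZNode()
            node = node.children[part]
        return node

    def set(self, path, value):
        self._walk(path, create=True).value = value

    def get(self, path):
        return self._walk(path).value

    def ls(self, path):
        return sorted(self._walk(path).children)


tree = ZTree()
tree.set("/clickhouse/task_queue/ddl", "queue root")
tree.set("/clickhouse/task_queue/ddl/query-0000000001", "DROP TABLE IF EXISTS ...")
print(tree.get("/clickhouse/task_queue/ddl"))   # the node has its own content
print(tree.ls("/clickhouse/task_queue/ddl"))    # and it has children
```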
[17:06] — What’s Stored in Z-Nodes
Robert: The two major categories of information stored in z-nodes are:
Task queue: pending and completed ON CLUSTER commands. ClickHouse keeps a list of them in order. Any node that was offline can come back and ask “what commands did I miss?”
Table information: for each replicated table, the parts that exist, their checksums, which merges are in progress, and deduplication hashes. The deduplication hash list is particularly clever: if an INSERT fails on one replica and you retry on another, the receiving replica checks the hash against the list. If the block was already inserted (maybe it succeeded on the first try after all), it simply ignores the duplicate.
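Here is a minimal Python sketch of the deduplication idea, assuming a bounded list of recent block hashes. The DedupLog class, the SHA-256 hash, and the window size are assumptions chosen for illustration; ClickHouse uses its own hash function and its own configurable window.

```python
import hashlib
from collections import OrderedDict

class DedupLog:
    """Remembers hashes of the most recent inserted blocks (toy model)."""

    def __init__(self, window=100):
        self.window = window
        self.hashes = OrderedDict()     # insertion-ordered set of block hashes

    def insert(self, block: bytes) -> bool:
        """Return True if the block is new, False if it is a duplicate to ignore."""
        digest = hashlib.sha256(block).hexdigest()
        if digest in self.hashes:
            return False
        self.hashes[digest] = True
        while len(self.hashes) > self.window:
            self.hashes.popitem(last=False)   # forget the oldest hash
        return True

log = DedupLog()
block = b"2022-03-30,click,1\n2022-03-30,view,7\n"
print(log.insert(block))   # first attempt is accepted
print(log.insert(block))   # a retry of the same block is ignored
```

Because the hash list lives in shared coordination state, the retry is recognized no matter which replica receives it. The bounded window also means a very old block could be accepted again once its hash has been evicted.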
For the definitive reference, the C++ source code for StorageReplicatedMergeTree in the ClickHouse repository is well-commented and describes exactly what’s stored in ZooKeeper and why. This is one of the great things about open source.
[21:30] — Installing ZooKeeper
Robert: Installing ZooKeeper is easy on most systems. On Ubuntu:
sudo apt-get update
sudo apt-get install -y zookeeper netcat
netcat (abbreviated nc) is a command-line tool used to send ZooKeeper four-letter commands. Version requirements: install ZooKeeper 3.4.9 or later. We generally recommend the version that comes with Ubuntu 18.04 from the apt repository. Check the ZooKeeper production configuration guide in the Altinity Knowledge Base for specific version recommendations.
[22:26] — Critical Hardware Requirements
Robert: Before we discuss configuration, here are the non-negotiable hardware requirements for production ZooKeeper:
Dedicated hosts: ZooKeeper must run on machines separate from ClickHouse. ClickHouse is designed to use every resource it can get — if it runs on the same host as ZooKeeper, it will steal CPU and memory from ZooKeeper, which can cause ZooKeeper to become slow or fail to maintain quorum.
SSD for the transaction log: ZooKeeper maintains a write-ahead transaction log. This is the most performance-critical file path — put it on a dedicated SSD, not shared with other processes.
Low network latency between ZooKeeper nodes: latency between nodes affects quorum acknowledgement speed. The critical metric is latency, not bandwidth.
4 GB RAM, disable swap: ZooKeeper keeps an in-RAM copy of all z-nodes. If it starts swapping you’re in serious trouble — it will become very slow and unable to maintain quorum. Set the Java heap to about 75% of available RAM (e.g., 3 GB on a 4 GB host).
Don’t share with other applications: same reason as not co-locating with ClickHouse.
[24:50] — ZooKeeper Process Internals
Robert: When ZooKeeper runs it maintains an in-RAM copy of all z-nodes. For read operations it doesn’t touch storage — it just reads from memory. It refers to four types of files:
- myid: a file containing just the server ID (1, 2, 3, etc.) — set once per node
- zoo.cfg: the configuration file
- Data directory: where snapshots are stored (periodic full copies of the z-node tree)
- dataLogDir: where transaction logs are stored — put this on a dedicated SSD
If you don’t set dataLogDir it defaults to the data directory, which is acceptable for development but not production.
[26:33] — Essential zoo.cfg Settings
Robert: The essential settings to add or modify in zoo.cfg:
# Prevent snapshots from accumulating and filling disk
autopurge.purgeInterval=1
autopurge.snapRetainCount=5
# List all servers in the ensemble
server.1=zookeeper-0:2888:3888
server.2=zookeeper-1:2888:3888
server.3=zookeeper-2:2888:3888
# Data and log directories
dataDir=/var/lib/zookeeper
dataLogDir=/ssd/zookeeper/log
autopurge.purgeInterval=1 means check every hour for old snapshots. autopurge.snapRetainCount=5 keeps at most 5 snapshots. Without these settings, snapshots accumulate indefinitely and will eventually fill your disk, causing ZooKeeper to stop — which is very bad.
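A quick sanity check of these settings can be scripted. The following Python sketch parses zoo.cfg text and flags the issues discussed above; parse_zoo_cfg and check_zoo_cfg are hypothetical helper names, and the checks are only the ones mentioned in this section, not a complete validator.

```python
def parse_zoo_cfg(text: str) -> dict:
    """Parse key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def check_zoo_cfg(settings: dict) -> list:
    """Return a list of human-readable warnings (empty if nothing is flagged)."""
    problems = []
    # ZooKeeper leaves autopurge off unless purgeInterval is a positive number.
    if int(settings.get("autopurge.purgeInterval", "0")) <= 0:
        problems.append("autopurge is disabled: snapshots will accumulate")
    if "dataLogDir" not in settings:
        problems.append("dataLogDir is not set: transaction log shares dataDir")
    servers = [key for key in settings if key.startswith("server.")]
    if servers and len(servers) % 2 == 0:
        problems.append(f"{len(servers)} servers listed: use an odd number")
    return problems

cfg = """
autopurge.purgeInterval=1
autopurge.snapRetainCount=5
server.1=zookeeper-0:2888:3888
server.2=zookeeper-1:2888:3888
server.3=zookeeper-2:2888:3888
dataDir=/var/lib/zookeeper
dataLogDir=/ssd/zookeeper/log
"""
print(check_zoo_cfg(parse_zoo_cfg(cfg)))
```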
[27:49] — Starting ZooKeeper and Four-Letter Commands
Robert: Start ZooKeeper as a service:
sudo -u zookeeper /opt/zookeeper/bin/zkServer.sh start
# Or as a systemd service:
sudo systemctl start zookeeper
ZooKeeper has a set of four-letter commands you can send to verify its state:
echo ruok | nc zookeeper-0 2181 # Are you OK? Returns “imok”
echo mntr | nc zookeeper-0 2181 # Monitor stats (for Prometheus)
echo stat | nc zookeeper-0 2181 # Server statistics and connections
echo conf | nc zookeeper-0 2181 # Configuration dump
echo cons | nc zookeeper-0 2181 # Client connections
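ruok is your basic ping: if it returns “imok”, the ZooKeeper process is up and listening on its client port. It does not prove that the node has joined a quorum, so use stat or mntr to check whether it is currently a leader or a follower. The mntr command dumps stats that can feed into Prometheus.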
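If you would rather script these checks than pipe echo into nc, a minimal Python sketch looks like this. The helper names four_letter and parse_mntr are hypothetical, and the code assumes that the server closes the connection after replying and that mntr output is one tab-separated key and value per line.

```python
import socket

def four_letter(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a four-letter command and return the full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:            # server closed the connection: reply complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_mntr(output: str) -> dict:
    """Turn tab-separated mntr lines into a dict of strings."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

# Example with a captured-style reply (values are illustrative only).
sample = "zk_server_state\tfollower\nzk_znode_count\t42\n"
print(parse_mntr(sample))
```

Against a live ensemble you might call parse_mntr(four_letter("zookeeper-0", 2181, "mntr")) and alert on fields such as the server state.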
[28:51] — Connecting ClickHouse to ZooKeeper
Robert: Configure ClickHouse to find ZooKeeper by creating /etc/clickhouse-server/config.d/zookeeper.xml:
<clickhouse>
    <zookeeper>
        <node>
            <host>zookeeper-0</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-1</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-2</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
    </distributed_ddl>
</clickhouse>
And create /etc/clickhouse-server/config.d/macros.xml for ON CLUSTER macro values. See the cluster configuration process in the Altinity Knowledge Base for the full setup procedure.
[30:14] — How Many ZooKeeper Nodes?
Robert: The answer might surprise you:
- Development: 1 node is fine. If it goes down, restart it — no big deal.
- Production: 3 nodes is optimal. Five is slower for writes (larger quorum). Two is actually less available than one.
Why is two worse than one? With two servers, quorum must be two (both must agree). If either server fails or loses connectivity to the other, the ensemble stops — it can’t form quorum. If you set quorum to one with two servers, you get a “split brain”: both servers think they’re the leader and operate independently, creating divergent state. So two gives you all the failure exposure of a two-machine deployment with none of the HA benefit. Always use odd numbers: 1 for dev, 3 for production.
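The arithmetic behind this advice fits in a few lines of Python. It is a sketch of majority-quorum math only; the function names are made up for illustration.

```python
def quorum(ensemble_size: int) -> int:
    """Smallest majority of the ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return ensemble_size - quorum(ensemble_size)

for n in (1, 2, 3, 5):
    print(f"nodes={n} quorum={quorum(n)} tolerated failures={tolerated_failures(n)}")
```

Two nodes tolerate zero failures, the same as one node, but there are now two machines that can fail. Five nodes tolerate two failures, at the cost of a larger quorum on every write.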
[32:37] — Inspecting ZooKeeper with system.zookeeper
Robert: One of the most useful tools for inspecting ZooKeeper from ClickHouse is system.zookeeper. You query it like a table, specifying the path with a WHERE clause:
-- List top-level ZooKeeper nodes
SELECT name, value, path
FROM system.zookeeper
WHERE path = '/clickhouse';
-- View the task queue
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl';
This serves two purposes: it shows you what’s in ZooKeeper, and if this query succeeds you know ClickHouse can see ZooKeeper. If it times out or fails, ZooKeeper is unreachable or not working.
You can also use zk_cli.sh (the standard ZooKeeper CLI tool) to navigate the tree, or various graphical tools that let you browse the z-node hierarchy visually.
[35:11] — Monitoring
Robert: Don’t go to production without ZooKeeper monitoring set up. Options:
- Prometheus + Alert Manager + Grafana: the modern approach. Use mntr four-letter commands to export stats; there’s a Grafana dashboard available from Grafana Labs.
- Nagios/check_zookeeper.pl: older approach, still valid.
- The care and feeding of ZooKeeper with ClickHouse guide in Altinity documentation covers the monitoring setup in detail.
- The ZooKeeper production configuration guide in the Altinity Knowledge Base has a dedicated monitoring section.
[36:10] — Troubleshooting: Read-Only Tables
Robert: If you’re inserting data and suddenly see failures like “the table is in read-only mode,” ClickHouse cannot contact ZooKeeper to record the insert. ClickHouse makes the table read-only to prevent inconsistency until ZooKeeper is available again.
Diagnosis steps:
- Run SELECT * FROM system.zookeeper WHERE path = '/clickhouse'. If this succeeds, ClickHouse can see ZooKeeper, so something else is wrong.
- If it fails (timeout), check ZooKeeper itself: echo ruok | nc zookeeper-0 2181. If that doesn’t return “imok”, ZooKeeper itself has a problem.
- Go look at the ZooKeeper logs. Common causes:
- Disk space full: ZooKeeper stops if it can’t write to the transaction log
- Lost quorum: one of the nodes is down
- zxid rollover: see the next section
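The diagnosis steps above amount to a small decision procedure, sketched here in Python. The function name and its inputs are hypothetical, and the last branch (the query fails while ruok still answers) is an inference added for completeness rather than something stated in the talk.

```python
from typing import Optional

def diagnose_read_only(system_zookeeper_ok: bool, ruok_reply: Optional[str]) -> str:
    """Map the two checks from the steps above to a next action."""
    if system_zookeeper_ok:
        return "ClickHouse can see ZooKeeper: look for another cause"
    if ruok_reply is None or ruok_reply.strip() != "imok":
        return ("ZooKeeper has a problem: check its logs for "
                "full disk, lost quorum, or zxid rollover")
    # Assumption: the process answers, so suspect quorum state or connectivity.
    return ("ZooKeeper process answers: check quorum state and the "
            "network and configuration between ClickHouse and ZooKeeper")

print(diagnose_read_only(False, None))
```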
[39:05] — Troubleshooting: Session Expired
Robert: ClickHouse maintains a single persistent TCP connection to ZooKeeper. If this connection is interrupted for any reason (network routing change, brief network blip, ZooKeeper was momentarily busy), ClickHouse will log a “session expired” message.
If it happens rarely: just ignore it. Session expiry is normal in distributed systems — ClickHouse will automatically reconnect and resume.
If it happens frequently, investigate:
- zxid overflow: each ZooKeeper transaction ID (zxid) is a 64-bit number whose upper 32 bits hold the leader epoch and whose lower 32 bits hold a per-epoch counter. When the lower 32 bits roll over, ZooKeeper forces a leader election (briefly unavailable). This typically happens with too many writes: either many applications sharing a ZooKeeper ensemble or very frequent small inserts from ClickHouse.
- Too many parts: very high insert rates can create too many small parts, causing ClickHouse to generate too many ZooKeeper writes.
- jute.maxbuffer too small: a ZooKeeper setting controlling buffer sizes. If your ZooKeeper buffers are too small for the messages ClickHouse is sending, connections can fail.
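To see why write volume drives zxid rollover, here is a short Python sketch of the ID layout and the resulting arithmetic. It relies on the usual description of the zxid as a 64-bit value with the epoch in the upper 32 bits and a counter in the lower 32 bits; the helper names and the rate of 1,000 writes per second are hypothetical.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1

def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a per-epoch counter into one 64-bit style ID."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)

def split_zxid(zxid: int) -> tuple:
    """Return (epoch, counter)."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK

def days_until_rollover(writes_per_second: float) -> float:
    """Days of sustained writes before the 32-bit counter is exhausted."""
    return (1 << COUNTER_BITS) / writes_per_second / 86_400

zxid = make_zxid(epoch=5, counter=COUNTER_MASK)   # last ID before rollover
print(hex(zxid))                                  # 0x5ffffffff
print(split_zxid(zxid))                           # (5, 4294967295)
print(round(days_until_rollover(1000), 1))        # 49.7
```

At a sustained 1,000 writes per second the counter would be exhausted in under two months, which is why reducing small inserts and avoiding shared ensembles makes forced elections rarer.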
[40:29] — Troubleshooting: Single Node Loss
Robert: Losing one ZooKeeper node (VM burns up, disk fails, etc.) is easy to recover from:
- Create a fresh host with the same ZooKeeper instance ID (myid) — try to keep the same hostname too
- Install ZooKeeper with the same zoo.cfg as the surviving nodes
- Start it
The new node will join the ensemble, get all the missing data replicated from the surviving nodes, and the ensemble will return to full health. The surviving two nodes maintain quorum throughout (with quorum = 2 out of 3), so ClickHouse never has to stop.
[41:44] — Troubleshooting: Complete Ensemble Loss
Robert: Losing the entire ZooKeeper ensemble is a serious event. Before the SYSTEM RESTORE REPLICA command existed, recovery was extremely painful: you had to convert all replicated tables back to plain MergeTree, rebuild the ZooKeeper ensemble from scratch, then convert them back to ReplicatedMergeTree.
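The modern approach uses the SYSTEM RESTORE REPLICA command: bring up a new ZooKeeper ensemble, then restore each replicated table. With ON CLUSTER the statement is distributed to every replica, so you run it once; without ON CLUSTER, run it on each ClickHouse node:
SYSTEM RESTORE REPLICA events_local ON CLUSTER '{cluster}';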
This reads the metadata from ClickHouse’s own replica filesystem and posts it back to ZooKeeper, rebuilding the distributed state. It’s a genuine lifesaver. For details and the full recovery procedure see ZooKeeper backup and SYSTEM RESTORE REPLICA.
[43:05] — Introduction to ClickHouse Keeper
Robert: Now let’s talk about one of the most interesting developments in ClickHouse over the last couple of years: ClickHouse Keeper.
[43:38] — What ClickHouse Keeper Is
Robert: ClickHouse Keeper is a from-scratch reimplementation of ZooKeeper written in C++ by Alexander Sapin of the ClickHouse core team. Key properties:
- Completely mimics the ZooKeeper API and administrative four-letter commands
- Uses the Raft consensus protocol (specifically NuRaft, developed at eBay) rather than ZAB
- Written in C++, not Java — no JVM dependencies, no JVM heap tuning
- Part of ClickHouse: no separate installation needed
For the technical deep-dive, see Alexander Sapin’s talk from the SF Bay Area ClickHouse meetup, which explains ClickHouse Keeper internals and the Raft protocol in detail.
[44:28] — Why ZooKeeper Was Replaced
Robert: The ClickHouse team had several motivations:
- Philosophy: ClickHouse should include everything you need. Kafka integration is built in; having a separate external dependency like ZooKeeper complicates operations.
- ZooKeeper is no longer actively developed: it came out of the Hadoop era. Security fixes and new features arrive slowly.
- Java complexity: adds dependencies and requires JVM tuning. Many operators find it frustrating.
- Known bugs: zxid overflow has been a long-standing issue; uncompressed logs waste disk space.
For using ClickHouse Keeper including production setup instructions, migration guidance, and version recommendations, see the Altinity Knowledge Base.
[46:11] — ClickHouse Keeper as External Ensemble
Robert: The first deployment mode is as a drop-in ZooKeeper replacement on dedicated hosts. You run three ClickHouse Keeper processes on separate machines (not on the ClickHouse servers themselves) — the same placement rules as ZooKeeper apply. Configuration lives in a keeper_server section of the ClickHouse config:
<keeper_server>
    <tcp_port>2181</tcp_port>
    <server_id>1</server_id>
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
        <raft_logs_level>warning</raft_logs_level>
    </coordination_settings>
    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>keeper-0</hostname>
            <port>9444</port>
        </server>
        <!-- server 2, server 3 -->
    </raft_configuration>
</keeper_server>
In this mode you can barely tell the difference from ZooKeeper — the system.zookeeper interface, the four-letter commands, and zk_cli.sh all work unchanged.
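In a three-node ensemble the raft_configuration section is the same on every host and only server_id changes, so the files are easy to generate. The Python sketch below builds a reduced keeper_server section using only element names from the example above; the function name and the host names keeper-1 and keeper-2 are hypothetical, and the storage paths and coordination settings are omitted for brevity.

```python
import xml.etree.ElementTree as ET

def keeper_server_xml(server_id, hosts, client_port=2181, raft_port=9444):
    """Build a reduced <keeper_server> section for one ensemble member."""
    keeper = ET.Element("keeper_server")
    ET.SubElement(keeper, "tcp_port").text = str(client_port)
    ET.SubElement(keeper, "server_id").text = str(server_id)
    raft = ET.SubElement(keeper, "raft_configuration")
    for index, host in enumerate(hosts, start=1):
        server = ET.SubElement(raft, "server")
        ET.SubElement(server, "id").text = str(index)
        ET.SubElement(server, "hostname").text = host
        ET.SubElement(server, "port").text = str(raft_port)
    if hasattr(ET, "indent"):          # pretty-print where available (Python 3.9+)
        ET.indent(keeper)
    return ET.tostring(keeper, encoding="unicode")

hosts = ["keeper-0", "keeper-1", "keeper-2"]
for server_id in (1, 2, 3):
    print(keeper_server_xml(server_id, hosts))
```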
[47:14] — ClickHouse Keeper Embedded in ClickHouse Nodes
Robert: The more architecturally interesting option: run ClickHouse Keeper inside the ClickHouse instances themselves. Here’s the embedded configuration for a single-node development setup:
<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <!-- log/snapshot paths -->
    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>localhost</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>
For a single ClickHouse instance (like on a laptop), this gives you a complete ZooKeeper replacement with no external dependencies. This is extremely powerful for development.
[48:24] — Developer Benefit: Replicated Schema from Day One
Robert: Previously, when developing on a laptop you’d typically use plain MergeTree tables (no replication) and then switch to ReplicatedMergeTree as you got closer to production. This meant your development code paths were different from production.
With embedded ClickHouse Keeper you can now use ReplicatedMergeTree from day one, on your laptop, with the exact same schema you’ll use in production. You get the same code paths, the same behavior. This is a huge step forward for development workflow.
[51:23] — Compatibility: Everything Still Works
Robert: Because ClickHouse Keeper completely mimics the ZooKeeper API:
- SELECT * FROM system.zookeeper works exactly the same
- All four-letter commands work: echo ruok | nc keeper-0 2181 returns “imok”
- zk_cli.sh and all other ZooKeeper tools work
- ON CLUSTER commands and replication work identically
One way to confirm you’re talking to ClickHouse Keeper rather than ZooKeeper: the stat four-letter command prints the version string. ZooKeeper prints its own version; ClickHouse Keeper prints “clickhouse_server_v22.3” (or whatever version is running).
[52:52] — Migrating from ZooKeeper to ClickHouse Keeper
Robert: If you’re starting fresh, just set up ClickHouse Keeper from the beginning — no migration needed.
If you have an existing ZooKeeper deployment, use clickhouse-keeper-converter to convert ZooKeeper transaction logs to ClickHouse Keeper format. The procedure requires an outage (you stop ZooKeeper, convert, start ClickHouse Keeper).
Important: test this procedure thoroughly before applying to production. There is no simple revert path — if something goes wrong after conversion, you have inconsistent state. Alternatively, you can start fresh with an empty ClickHouse Keeper ensemble and use SYSTEM RESTORE REPLICA to rebuild the metadata.
[53:35] — Production Readiness Assessment
Robert: Is ClickHouse Keeper ready for production?
Our assessment: yes, on ClickHouse 22.3 LTS. The core team performed extensive Jepsen testing (the adversarial distributed systems test suite). It’s been deployed on an increasing number of production servers including some at Altinity, and so far so good.
It also fixes a number of known ZooKeeper problems: zxid overflow, uncompressed logs, and the general Java operational complexity. For new installations we’d recommend looking very hard at ClickHouse Keeper. It’s particularly valuable for developers since the embedded single-node mode makes replicated ClickHouse development trivially easy.
[56:14] — Q&A: When to Prefer ZooKeeper
Robert: Question: is there any reason to prefer ZooKeeper over ClickHouse Keeper?
If you have ZooKeeper working well already, there’s no urgent motivation to migrate. At Altinity.Cloud we run over 100 clusters on ZooKeeper. Most problems we see are actually caused by ClickHouse misuse (overloading ZooKeeper with tiny inserts) rather than ZooKeeper itself. Once ZooKeeper is properly set up on capable hosts it has very few problems, and there’s years of operational experience and tooling around it.
On the other hand, if you’re bringing up something new, I’d definitely look hard at ClickHouse Keeper — especially because the development workflow is so much better with the embedded mode. Over time ClickHouse Keeper will definitely become the preferred solution.
[58:58] — Q&A: Altinity Kubernetes Operator Support for Embedded ClickHouse Keeper
Robert: Question: are there plans for the Altinity Kubernetes Operator to support running ClickHouse Keeper embedded in ClickHouse nodes?
Yes, absolutely. Setting up ClickHouse Keeper in Kubernetes manually is somewhat painful right now — there are tricks to making it come up and establish consensus correctly. We’ve been asked this many times. We will support it. I’d do it myself if necessary. We’re waiting for ClickHouse Keeper to stabilize fully in 22.3 LTS and then we’ll get to work on it. Eventually Altinity.Cloud will also migrate to ClickHouse Keeper.
Robert: Thank you all for attending. If you have further questions, come to altinity.com, join our Slack channel, or send email to info@altinity.com. Congratulations to Alexander Sapin for ClickHouse Keeper — a wonderful piece of technology.
FAQ
Why does ClickHouse need ZooKeeper for replication?
ClickHouse uses ZooKeeper to solve the distributed consensus problem. When data is replicated across multiple ClickHouse nodes, those nodes need to agree on what data exists, what merges to perform, what DDL commands to execute, and what parts have been successfully inserted — even when nodes go offline and come back. ZooKeeper provides a consistent, distributed directory service that stores this shared state: pending ON CLUSTER commands, replication part tracking, and deduplication hashes for inserted blocks. Without it, nodes could diverge and become inconsistent.
How does ZooKeeper maintain consistency across nodes?
ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol. The ensemble elects a leader through a voting process. All write operations go through the leader, which serializes them and broadcasts them to followers, waiting for a quorum (majority) of acknowledgements before confirming the write to clients. This ensures all nodes see writes in the same order. Read operations can be served by any node since every node maintains a consistent in-memory copy of the z-node tree.
How many ZooKeeper nodes should I run in production?
Three. Always use odd numbers to avoid quorum ambiguity. One is fine for development. Three is optimal for production: it tolerates one node failure while maintaining quorum (2 of 3). Two nodes is actually less available than one, because losing either node breaks quorum and stops the ensemble. Five nodes is slower than three because the leader needs acknowledgement from more followers.
What are the critical hardware requirements for production ZooKeeper?
Run ZooKeeper on dedicated hosts separate from ClickHouse. Put the transaction log on a dedicated SSD with nothing else writing to it. Ensure at least 4 GB RAM and disable swap completely. Keep network latency between ZooKeeper nodes low. Tune the JVM heap to approximately 75% of available RAM. Never share ZooKeeper hosts with ClickHouse or other applications, as ClickHouse will compete for resources and can cause ZooKeeper to fail or become slow.
What is ClickHouse Keeper and how does it differ from ZooKeeper?
ClickHouse Keeper is a from-scratch C++ reimplementation of ZooKeeper built into ClickHouse. It uses the Raft consensus protocol rather than ZAB, requires no Java, has no separate installation, and is bundled with every ClickHouse release. It completely mimics the ZooKeeper API — including the four-letter commands, the system.zookeeper interface, and compatibility with zk_cli.sh. It fixes several known ZooKeeper issues including zxid overflow and uncompressed transaction logs. It can run as an external three-node ensemble (like ZooKeeper) or embedded inside a single ClickHouse instance for development.
What is the embedded ClickHouse Keeper mode and why is it useful for developers?
Embedded ClickHouse Keeper runs the consensus service inside the ClickHouse server process itself, with no external ensemble needed. For a single ClickHouse instance (like a development laptop), this means you can use ReplicatedMergeTree tables from day one without any external ZooKeeper setup. This is a significant improvement over the previous approach of developing with plain MergeTree and switching to ReplicatedMergeTree before production — the schema and code paths are now identical from development through production.
© 2022 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.