Deep Dive on ClickHouse® Sharding and Replication

Recorded: March 26 @ 07:00 am PDT
Presenters: Alexander Zaitzev – Co-Founder & CTO, and Tatiana Saltykova, Senior ClickHouse Solutions Engineer

In this deep-dive tutorial, Altinity CEO Robert Hodges walks through ClickHouse sharding and replication from first principles to advanced distributed query behavior and capacity planning, aimed at developers who know the basics and want to scale horizontally. He opens with the core distinction: shards are disjoint subsets of data that increase write capacity and storage, while replicas are copies within a shard that increase read capacity and provide high availability. From there, he introduces three fundamental patterns—all-sharded (useful for system tables), all-replicated (for small dimension tables), and sharded-plus-replicated (for large fact tables)—and notes that replication happens at the table level, not the server level.

The setup and loading sections cover the practical machinery. Every distributed deployment needs three objects: a ReplicatedMergeTree local table (the actual data store), a Distributed table (a zero-data umbrella for cross-shard queries), and optionally a replicated dimension table—configured via macros.xml, the remote_servers.xml cluster definition (with the secret parameter for inter-node authentication), and ON CLUSTER DDL. For loading, Robert weighs inserting through the Distributed table (no shard awareness needed, but async by default) against inserting directly to local tables (shard-aware but synchronous), and explains insert_distributed_sync and insert_quorum for stronger durability.

The distributed SELECT section explains the initiator node model—how ClickHouse pushes subqueries to local tables and where aggregation and sorting happen—and why it uses a rule-based rather than cost-based optimizer. Robert gives detailed treatment to the distributed_product_mode setting for distributed-to-distributed JOINs (by his admission, the single most confusing thing in ClickHouse) and demonstrates debugging via system.query_log and the remote(), cluster(), and clusterAllReplicas() functions. He closes with capacity planning: the “big table model” for avoiding distributed JOINs, bucketing for large-to-large JOINs, a single-node-first scaling approach, and three options for shard rebalancing (clickhouse-copier, ALTER TABLE MOVE PART, and manual freeze/sync/attach).

Here are the slides:

Deep-Dive-on-ClickHouse-Sharding-and-Replication-2024-1-1 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

02:41 – Welcome and webinar intent: conceptual foundations for developers
03:08 – ClickHouse overview: SQL, shared-nothing, columnar, parallel execution, open source
06:53 – Why distributed data looks simple but is deeply complex
07:51 – Vertical scaling: CPU, RAM, storage on a single node
08:58 – Horizontal scaling: replication (QPS and concurrency) vs. sharding (throughput and I/O)
10:49 – Combining multiple topologies inside a single cluster
12:21 – Replication architecture: ZooKeeper/Keeper, port 9009, metadata vs. data
14:55 – Why adding replicas has minimal insert performance impact
16:10 – ClickHouse Keeper: standalone vs. in-process deployment modes
18:17 – Which table types are replicated and how to enable replication
21:36 – What operations are and are not replicated automatically
25:26 – Converting a non-replicated table to replicated
27:45 – What else ClickHouse replicates: DDL via ON CLUSTER, RBAC, UDFs, configuration
30:06 – Inspecting ZooKeeper/Keeper metadata via system.zookeeper
31:55 – Replication with object storage and zero-copy replication
34:09 – Building a sharded and replicated cluster: five steps overview
35:35 – Step 1: defining the cluster in remote_servers.xml
36:23 – Verifying cluster layout with system.clusters
36:57 – Step 2: setting macros (cluster, shard, replica)
37:50 – Step 3: creating the replicated local table with ON CLUSTER
38:19 – Step 4: creating the Distributed table
39:53 – Step 5: creating the fully replicated dimension table
40:15 – Using ON CLUSTER for distributed DDL
41:00 – Distributed vs. local inserts: tradeoffs
43:10 – How the asynchronous distributed insert queue works
45:05 – How distributed SELECT queries are rewritten and executed
47:10 – Distributed product mode: local, allow, and global
50:55 – Big-table model vs. bucketing model for distributed joins
52:30 – Parallel replicas: experimental dynamic sharding of a replicated table
54:27 – Summary, resources, and free consultation offer
55:55 – Q&A: balancing shards when scaling horizontally
58:39 – Q&A: INSERT block size and shard count

Webinar Transcript

[02:41] – Welcome and Introduction

Alexander: Good morning, good evening, everyone. Today we will have a webinar about ClickHouse® sharding and replication. This is quite an interesting topic given that ClickHouse is built for distributed systems and has a lot of capabilities to support huge systems in both sharding and replication.

I am Alexander Zaitsev, co-founder of Altinity. I do not do webinars too often, but today I am with you. This webinar is intended mostly for developers. We have a separate course for administrators that explains how to configure replication, how to troubleshoot ClickHouse clusters, and how to support and maintain them at scale. That is a very complex topic and more for advanced ClickHouse users. In this webinar we try to explain the structure of the problem: how to think about ClickHouse replication, how to think about ClickHouse sharding, how to reason about different properties and make decisions about when to use one approach or another. It is more for developers, but of course we can dive deeper into any area.

You may know us. Altinity is one of the pioneers in ClickHouse services. We run Altinity.Cloud, which is managed ClickHouse servers. We also provide support, training, and even free consultations. We are maintainers of many open-source projects including the Altinity Kubernetes Operator for ClickHouse®, which is very popular and powers thousands of ClickHouse installations worldwide.

[03:08] – ClickHouse Overview

Alexander: A very short introduction to ClickHouse. This is a SQL database, a data warehouse for real-time data workloads first and foremost. It understands SQL, runs everywhere from a laptop to very advanced cloud setups, and is built on a shared-nothing architecture. We will see how this architecture plays into sharding and replication. Data is stored in columns, which is the main property driving performance. The second driver is parallel and vectorized execution: vectorized execution applies to a single node, but parallel execution also extends to the cluster itself. Today we will look at small lab setups with a small number of shards, but if you go to huge systems with tens or hundreds of shards, this parallelism is essential. ClickHouse scales, and sharding is one of the things that allows it to scale to very large data sizes. And it is open source, which is why we like it.

[06:53] – Why Distributed Data Looks Simple but Is Not

Alexander: Distributed data is very complex. When you first look at it, it seems very simple and straightforward, but it is like the tip of an iceberg. There is a well-known narrow creek in the US that looks like a small, shallow stream, but in some places is sixty meters deep. You cannot tell just from looking. ClickHouse replication and distributed design is similar: it looks very simple when you start, but as you proceed it becomes more and more complex.

[07:51] – Vertical Scaling on a Single Node

Alexander: Let us start by discussing how ClickHouse can scale. Of course it can scale vertically on a single node, and a single host is typically characterized by three main dimensions you can scale all together or independently: the number of CPU cores or clock speed, RAM size, and storage size. In principle you can scale all three quite far, but there is definitely a limit to how large a single machine can be. In terms of storage you can in principle attach petabytes to a small machine, but you cannot use it effectively because queries will be just very slow. When you have a lot of data you have to go to distributed systems.

[08:58] – Two Horizontal Scaling Dimensions: Replication and Sharding

Alexander: ClickHouse scales horizontally in two completely different dimensions and it is very important to understand they are totally different.

Replication scales on the vertical axis. It helps with two things: first, it improves your queries per second, because queries can be executed on multiple replicas that have the same data. If one replica can execute 10 or 20 queries per second, two replicas can do twice as much. The same applies to concurrency: with replication you can serve more users concurrently.

Sharding is different. Sharding helps with throughput and I/O. It increases your ability to scale data size. In the case of replicas, a query is kind of bound to the replica and executed on a single host. With sharding, the query is executed on all hosts simultaneously, adding more power to every single query. Scaling can be almost linear in some cases. You add extra shards and proportionally your performance and throughput may increase. In reality it is not perfectly linear, but ClickHouse tries to be as close to that as possible.

[10:49] – Combining Multiple Topologies in a Single Cluster

Alexander: Inside one cluster it is possible to combine multiple patterns, and this is a very important concept. Unlike some other systems where cluster topology is hardcoded to the cluster itself, in ClickHouse it is external. Inside one server you may have a table that is both sharded and replicated. You may have a table that is only replicated on all nodes of your cluster, which is typical for small dimension tables. You may also have a table that is not replicated at all, sharded across all nodes with no replicas. This is all possible in the same cluster with four nodes.

This is used all the time in practice. If you are familiar with star schema design, you have a huge fact table and many small dimension tables. The fact table is typically sharded or sharded and replicated. Dimension tables are easier to store as a copy on every node because that is much better for queries. Your dimensions are always replicated and your fact table is sharded, all in the same cluster.

[12:21] – Replication Architecture: Keeper, Port 9009, and Immutable Data

Alexander: Let us jump straight to the network diagram of how ClickHouse replication works. For ClickHouse replication to work, we need an external coordination system. This was traditionally built using ZooKeeper, but now there is a replacement called ClickHouse Keeper.

ClickHouse replication is what is typically called master-master in other databases. You can write to any node. The insert goes to either node 1 or node 2 and they replicate to each other. There is no locking on replication. This is achieved by the shared-nothing architecture and also by the fact that ClickHouse data is immutable: once data is written to disk it will never change. It may be mutated, but that creates new data that is replicated in turn. The data itself is immutable, which allows ClickHouse to avoid the expensive locking that exists in other databases.

When you add a replica, your insert performance may degrade very insignificantly. This degradation is just the extra communication with the Keeper, which takes a little bit of network time. There are no other penalties.

For replication to work, ClickHouse uses an extra port, 9009. Data is transferred directly between nodes, as you can see in the architecture diagram. The Keeper stores no data at all. What it stores is metadata: it knows what data should be on every replica. When 100,000 rows are inserted into one table, ClickHouse creates a part and places a reference to that part’s metadata into the Keeper. Other replicas connect to the Keeper, discover that a part exists, check their local filesystem, and if they do not have it they fetch it from another replica.

This means the Keeper is very important. If you lose it, replication breaks, your ClickHouse goes read-only, and restoring the Keeper data is difficult. Two years ago a feature was added called SYSTEM RESTORE REPLICA that allows restoring Keeper data from local data on disk. Before that, recovery was very complex.

[16:10] – ClickHouse Keeper: Standalone vs. In-Process Deployment

Alexander: Keeper is a replacement for ZooKeeper. It supports the same ZooKeeper protocol and has its own extensions optimized for ClickHouse.

Keeper can be used in two different setups. You can use it as a drop-in replacement for ZooKeeper installed on separate nodes. This is the recommended way because the load profiles for ClickHouse and Keeper are very different, especially related to storage usage. It is recommended to keep them separate.

Keeper also supports what we call in-process mode, where it works as part of ClickHouse itself. It is a single binary with different roles, and certain ClickHouse nodes can also work as Keeper nodes and maintain the quorum and coordination. This looks a little bit simpler from an operational perspective at first glance because you do not need extra machines to manage. But configuration-wise it is pretty complicated, and for production workloads we actually do not recommend the in-process approach. Use the standalone separate cluster instead.

[18:17] – Which Tables Are Replicated and How to Enable Replication

Alexander: All MergeTree tables in ClickHouse can be replicated. Log tables and external table engines are not replicated. The difference between a non-replicated and a replicated table is simply the name of the engine: add Replicated to the name and your table starts to be replicated.

Here is an example creating a table that stores airports. The engine definition has two important parameters. The first defines the replica path, which identifies the unique data in ZooKeeper or Keeper. The second, inside curly braces, is the replica name, which identifies a specific instance of that data.

If you define these explicitly, you need to remember to set the right paths on every node. ClickHouse provides a way to skip those parameters completely. Just define defaults in your server XML configuration, and by default the replica path uses {uuid} and the replica name defaults to the host name. For any ClickHouse version not older than two years, just use the defaults and do not worry about those paths.

In some cases you do have to use explicit paths. For example, if you want a table to be replicated to all nodes across all shards, you need to define the replica path explicitly because the default path contains a shard marker, meaning different shards will expect different data. To replicate to all shards, remove the shard marker from the path.

One important caveat: if you use a path that contains the table name and you rename the table, the path in ZooKeeper is not renamed. You cannot change it. If you later try to create a new table with the old name, it will fail because the path is already used. With UUID-based paths you do not have this problem.

[21:36] – What Operations Are and Are Not Replicated

Alexander: When a ReplicatedMergeTree table is created, the most important operations are automatically replicated. Data inserts are replicated to all replicas. Most ALTER TABLE operations, such as adding a column, are automatically replicated so that new data containing the new column can be replicated correctly. Mutations like ALTER TABLE UPDATE and ALTER TABLE DELETE are replicated. OPTIMIZE and UPDATE are replicated as well.

Some operations are not replicated. ALTER TABLE FREEZE for backups is not replicated because you usually want to back up only a single replica, not all of them. Moving a part to a different disk or volume is not replicated because different replicas may have different storage configurations. These operations are local to the node.

CREATE TABLE, DROP TABLE, RENAME, DETACH, and ATTACH are not replicated either. If you want to execute any of these on all nodes you must do so manually on every node, or use the ON CLUSTER syntax for distributed DDL, which we will discuss in a moment.

[25:26] – Converting a Non-Replicated Table to Replicated

Alexander: In many cases you start with a non-replicated table because it is easier and you do not need any special configuration. Later you may decide to add replication. There are two approaches.

The first is manual and works in all ClickHouse versions. Create a replicated table with the same structure and then attach partitions one by one from the non-replicated table using ALTER TABLE ATTACH PARTITION FROM. This operation is nearly instant for every partition because it uses hard links and does not copy any data. To generate all the attach statements you need a bit of SQL, which you can find in the Altinity Knowledge Base on engines and ReplicatedMergeTree.

The second approach is very new, added in ClickHouse 24.2. You can add a convert_to_replicated flag in your table’s data directory. When you restart the server, ClickHouse will automatically convert the non-replicated table to replicated. This is similar to how converting an Ordinary database to Atomic works. This approach requires a very recent ClickHouse version.

The ClickHouse team is also working on an ALTER TABLE MODIFY ENGINE statement that would handle this conversion more conveniently, but that work is still in progress.

[27:45] – What Else ClickHouse Replicates

Alexander: In addition to table data, ClickHouse replicates several other things. If you add ON CLUSTER to a DDL statement, it is executed on all nodes of the cluster. We will use this when creating distributed tables shortly.

ClickHouse can also use ZooKeeper or Keeper as storage for users. Instead of using users.xml or local storage, you can configure replicated storage for user directories. All ClickHouse nodes connected to the same ZooKeeper cluster will see the same users. Similarly, user-defined functions can be stored in Keeper, and there is a special setting required to enable that.

Finally, parts of your ClickHouse configuration can be stored in Keeper itself. This is a very old feature, but it exists: you can have an external system push configuration into Keeper and ClickHouse nodes will automatically read it on startup.

[30:06] – Inspecting Keeper Metadata via system.zookeeper

Alexander: You can always inspect what is stored in ZooKeeper or Keeper. ClickHouse has a system.zookeeper table that you can query with SELECT * FROM system.zookeeper WHERE path = '/'. You start at the root and navigate deeper by specifying more specific paths like /clickhouse/tables/0/default/airports. This reveals the replication log, quorum information, replica information, and so on. The task queue used for DDL is also visible there. If you have RBAC stored in Keeper, you will see an access node with user data (without passwords, of course).

[31:55] – Replication with Object Storage and Zero-Copy Replication

Alexander: Replication works a little bit differently when object storage is involved. With standard configuration, ClickHouse stores metadata locally on disk along with references to objects in S3 or GCS. Each replica maintains its own copy of files in object storage, which works correctly but is somewhat wasteful.

Zero-copy replication was designed to address this. The idea is that when new data arrives in object storage, ClickHouse does not need to copy it to another replica; it only needs to notify the other replica that the data exists in object storage. Unfortunately the zero-copy design has some issues and problems. There is ongoing work to rewrite it, including contributions from Altinity and others in the community, but it is not yet complete. If you use object storage, use multiple copies even if it seems redundant. Zero-copy is experimental and there have even been signals from the ClickHouse team that they may consider removing it in the future.

[34:09] – Building a Sharded and Replicated Cluster: Five Steps

Alexander: Now let us go to sharding. Say we have four ClickHouse nodes and want to build a schema that is both distributed and replicated.

From top to bottom we need to build several layers. There is the Distributed table, which is a special table engine that stores no data but creates an interface to data distributed across the cluster. Under that we have ontime_local, which stores the data itself and is replicated twice and sharded twice: half the rows on one shard, the other half on the second shard, each with a replica. Finally the airports table is replicated on all nodes.

[35:35] – Step 1: Defining the Cluster in remote_servers.xml

Alexander: The first thing we need to do is define the cluster itself in the ClickHouse configuration. This is defined in a remote_servers.xml file, typically in your config.d directory. Inside the remote_servers section we define the cluster named demo. The first shard is comprised of replicas 00 and 01, and the second shard of replicas 10 and 11. So there are four hosts, two shards, and two replicas each.

If you define a cluster in remote_servers, you can always check the layout by querying system.clusters. It displays information about all clusters defined in the ClickHouse system. You may have multiple clusters defined, and they will all appear here.

[36:57] – Step 2: Setting Macros

Alexander: The second step is defining macros. The {cluster} macro is needed for distributed DDL with ON CLUSTER. The {shard} and {replica} macros can be used in replica paths, though if you are using UUID-based paths you may not need {shard} in the path. The {replica} macro is used as the replica name and is required. Typically you number shards numerically and default the replica macro to the host name.

You can always check macros from the system.macros table.

[37:50] – Step 3: Creating the Replicated Local Table with ON CLUSTER

Alexander: Third, we create ontime_local, which is a replicated and sharded table. You can see in the path that it contains a {shard} macro, so the ZooKeeper path will be different for different shards. We use the ReplicatedMergeTree engine.

[38:19] – Step 4: Creating the Distributed Table

Alexander: Fourth, we define the Distributed table. It has the same column structure as ontime_local (you can use CREATE TABLE AS ontime_local to inherit the column definitions). The only difference is the engine. The Distributed engine takes the cluster name, the database, the table name, and an optional sharding key. The sharding key is only used during inserts. You can skip it, but in some cases you will want a custom key. For example, if you store user transactions and want all information about one user to land on the same shard, use something like cityHash64(user_id) as the sharding key. That ensures all data for a user lands on the same shard.

[39:53] – Step 5: Creating the Fully Replicated Dimension Table

Alexander: Finally, we define the airports table. This time we do not use the {shard} macro in the path, because we want this table to be replicated to all shards. It will be replicated across all shards of our cluster.

[40:15] – Using ON CLUSTER for Distributed DDL

Alexander: Previously these queries had to be run on every node individually. To make this more convenient we use the ON CLUSTER statement. We just add ON CLUSTER {cluster} to any DDL and ClickHouse will automatically execute it on every node. This works for CREATE TABLE, ALTER TABLE, DROP TABLE, and everything else. It is essential when working with larger clusters.

[41:00] – Distributed vs. Local Inserts

Alexander: Now let us talk about how to load data into the cluster and how to query it. You need to decide whether to insert using the Distributed table or using local inserts.

Both options are possible. You can insert into ontime (the distributed table) or into ontime_local (the local table on each node).

Local inserts: Simpler, no automatic sharding, but you get larger blocks and usually faster performance. If you are ingesting from Kafka with multiple consumers writing to every node, the data will be more or less evenly distributed across shards without any sharding key, because you are writing large batches from all nodes simultaneously.

Distributed inserts: Sharding is done automatically by ClickHouse using the sharding key. But inserting into the Distributed table requires more resources because every insert is split into parts, one per shard, and sent over the network. Your effective insert block size will be smaller on every node, which is not ideal for ClickHouse.

When you insert into a Distributed table, ClickHouse maintains a special local queue. It first splits the data into parts for each shard and puts them in this local queue. A background process then sends data to the shards. This means that if some shards are temporarily unavailable, the data accumulates in the queue and will eventually be sent. By default this is asynchronous, so the insert returns as soon as the data is in the local queue. You can also turn on synchronous propagation, in which case ClickHouse waits until data is sent to all shards before acknowledging the insert. This is much slower and is not recommended unless you have very strong consistency requirements.

[45:05] – How Distributed SELECT Queries Are Executed

Alexander: When you query the Distributed table, ClickHouse executes it in several stages. First, it rewrites the query and sends the rewritten version to all shards. On each shard, this rewritten query is executed locally. For example, if your query contains avg(DepDelay), the rewritten query on each shard will use avgState(DepDelay) instead, which produces a partial aggregate state rather than a final result. The rewritten query also replaces ontime with ontime_local. These local queries can be seen in your query log.

The partial results are collected by the initiator node, which merges the avgState values into the final average. This is the same partial-state technology used in AggregatingMergeTree, but applied to distributed queries as well.

ClickHouse also tries to push as much processing as possible to the local tables: filters, joins, and even subqueries. The idea is to minimize the data returned to the initiator. For hedged requests, ClickHouse may send the local query to multiple replicas of the same shard simultaneously and use whichever responds first. This is a newer performance optimization that may be disabled by default but is interesting to explore.

[47:10] – Distributed Product Mode: local, allow, and global

Alexander: What happens when two tables are distributed and you want to join them? ClickHouse can handle this in three different ways, controlled by the distributed_product_mode setting.

local mode: ClickHouse rewrites both parts of the query to use local tables. The sub-select or join is executed locally on each shard against local data. This works well when the tables are co-located by the same sharding key.

allow mode: The sub-select or join is not rewritten. On every shard, the full sub-select against the Distributed table is executed. If your second table also has billions of rows distributed across 10 shards, that sub-select will be executed 10 times (once per shard of the outer query). This can be very heavyweight.

global mode: The sub-select is executed only once on the initiator node, producing a temporary table. That temporary table is then shipped to all shards of the outer query and used in the join condition. This is the most efficient approach when the sub-select or join table is moderately sized.

By default the setting is not allowed, so if you try to query two distributed tables without setting distributed_product_mode, you will get an exception. ClickHouse does not know how to do this efficiently without help from the application developer who knows how the data is distributed.

[50:55] – Big-Table Model vs. Bucketing Model for Distributed Joins

Alexander: When designing a distributed schema with multiple large tables, there are two patterns to consider.

Big-table model (star schema): You have one large table distributed across shards. All other tables are small and are replicated to every node. There are no distributed joins at all. On every node you have a full copy of all dimension tables, so joins are purely local. This is very straightforward and works well for the typical star schema in analytics.

Bucketing model: You have multiple large tables that are all distributed using exactly the same sharding key. For any given key value, the data from all large tables lands on the same shard. You can then do local joins on each shard and produce correct global results. This model allows you to join trillion-row tables, which you normally cannot do because you do not have enough RAM on a single node. If you have a cluster with many nodes and both large tables are distributed the same way, local joins become feasible.

[52:30] – Parallel Replicas: Experimental Dynamic Sharding

Alexander: The last concept worth mentioning is parallel replicas, an experimental feature that allows ClickHouse to dynamically shard a replicated table across nodes.

When you have a query going to a table that is replicated across all nodes (not sharded), ClickHouse can dynamically assign different portions of the table to different replicas, execute them in parallel, and combine the results on the initiator, just like a sharded distributed query. This requires a good sharding key, which you specify in the configuration. Without knowing your data distribution, it is hard to pick a key that divides the work evenly.

This feature was designed primarily for object storage: if you have the same data attached to multiple replicas (using zero-copy replication, for example), you can use parallel replicas to query it efficiently using multiple nodes simultaneously. You can also experiment with it in your own environment.

[54:27] – Summary and Resources

Alexander: That is the overview. I do not think it was a very deep dive, but it at least pointed you toward topics you can research further. For more information:

The ClickHouse official docs are quite detailed now. We regularly post articles about different topics including replication and sharding troubleshooting on the Altinity blog and YouTube channel. The Altinity Knowledge Base is especially useful if you hit a specific problem: searching there often gives you a solution or a hint. You can also reach out to the ClickHouse Telegram channel, the Slack workspace, or join our training courses if you want these topics covered in more depth with exercises. And we have a free consultation: just sign up and talk to our experts.

[55:55] – Q&A: Balancing Shards When Scaling Horizontally

Audience: What is the best way to balance shards when scaling horizontally?

Alexander: The ClickHouse idiomatic approach is not to balance data at all when you add shards. The data is time-series in nature, which means new data is arriving all the time and your queries typically query the most recent data (hours, days, or weeks). When you scale a cluster, you typically double it: if you had three shards you add three more and now have six. You simply wait, and new data is distributed properly across all shards. Your old data is not redistributed, but if you have TTL rules then old data will be deleted automatically after a certain time. Eventually the cluster is balanced.

Having said that, ClickHouse does have an internal capability to move data between shards. It is not documented and is called ALTER TABLE MOVE PART TO SHARD. You need to enable a couple of settings to use it. This functionality was contributed by Cloudflare. We have tried it with some clients and it works, but it is experimental and undocumented. It is only applicable in rare cases when you really need to redistribute data that cannot be handled by the time-series approach. If you are interested, send us a request and we can send you instructions.

Q: Is there an execution dependency between the number of shards and insert block size?

Alexander: Yes. If you insert into a Distributed table, your block is divided into smaller blocks, one per shard. So if you have max_block_size configured at 1 million rows but you insert into a Distributed table with 10 shards, the effective block size on each shard is about 100,000 rows. You will never get blocks as large as the configured maximum when inserting through a Distributed table.

FAQ Section

Q: What is the difference between replication and sharding in ClickHouse, and when should I use each?

A: These are two independent horizontal scaling dimensions that solve different problems. Replication creates multiple copies of the same data on different nodes. This increases your queries per second and concurrent user capacity, because any replica can serve any read. It also provides high availability. Sharding distributes different subsets of data across different nodes. This is how you scale beyond the capacity of a single machine for both storage and write throughput: every query must touch all shards to produce a complete result, but each shard handles only a fraction of the data. In practice, most production clusters use both together: sharding to handle data volume and replication within each shard for high availability and read scaling.

Q: How does ClickHouse Keeper differ from ZooKeeper and which should I use?

A: ClickHouse Keeper is a replacement for ZooKeeper that speaks the same protocol but has extensions optimized for ClickHouse. It is distributed as part of the ClickHouse binary with different runtime roles. The standalone Keeper deployment, running Keeper on its own separate machines, is the recommended production choice because the load profiles for ClickHouse and Keeper are very different. Keeper can also run in-process inside ClickHouse nodes (embedding the coordination role inside the same process), which looks simpler operationally but is more complex to configure and is not recommended for production. Either option replaces the external ZooKeeper dependency.

Q: What does the Distributed table engine do and how does it work?

A: The Distributed table engine stores no data. It is a metadata interface that knows the cluster topology (which shards, which replicas) and the name of the underlying local table on each shard. When you query a Distributed table, ClickHouse rewrites the query to use local table names, sends it to all shards in parallel, receives partial aggregation states back, merges them on the initiator node, and returns the final result. When you insert into a Distributed table, ClickHouse uses the sharding key to assign each row to a shard, buffers the data in a local async queue, and forwards it to the appropriate shard nodes in the background.

Q: What are the tradeoffs between inserting into a Distributed table vs. inserting directly into local tables?

A: Inserting directly into local tables produces larger insert blocks on each node, which is better for ClickHouse performance and reduces background merge pressure. This approach requires you to manage routing yourself (choosing which shard to insert into) or to rely on a load balancer or Kafka consumers that naturally distribute writes across shards. Inserting into a Distributed table lets ClickHouse handle routing automatically using the sharding key, but it splits each insert batch into smaller per-shard blocks, increases network traffic, and is somewhat slower overall. For best performance, prefer local inserts with client-side or infrastructure-level balancing. Use Distributed inserts when you need automatic sharding-key-based routing and cannot manage routing at the client.

Q: What are the three distributed_product_mode options and when should I use each?

A: This setting controls what happens when a query against a Distributed table includes a sub-select or join against another Distributed table. In local mode, both parts of the query are rewritten to use local tables on each shard, which is efficient when tables are co-located by the same sharding key. In allow mode, the inner Distributed table query is executed in full on every shard of the outer query, which can be very expensive if the inner table is large. In global mode, the inner query is executed once on the initiator, and the resulting temporary table is shipped to all shards of the outer query for local joining, which is the most efficient general-purpose approach. The default is to disallow the combination entirely, forcing you to make an explicit choice.

© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE^® SOLUTIONS

Get in touch with ClickHouse experts.

Related:

Leave a Reply Cancel reply