ClickHouse® Data Management Internals | Understanding MergeTree Storage, Merges, and Replication - Altinity

Recorded: November 15 @ 08:00 am PDT
Presenters: Tatiana Saltykova, Robert Hodges, and Alexander Zaitsev

In this webinar, Altinity CEO, Robert Hodges, delivers a detailed walkthrough of ClickHouse® MergeTree storage internals, merges, mutations, and replication. A special thank-you goes to Tatiana Solova, whose deep operational experience with ClickHouse and work on Altinity’s administrator training class informed a large part of the material presented.

The talk is framed as part of Altinity’s ongoing series on ClickHouse internals: understanding what ClickHouse is doing under the covers lets you make queries faster, operate storage more cost-efficiently, and avoid the class of operational problems that only appear at production scale.

MergeTree table definition. Robert reviews the key clauses that distinguish a MergeTree table definition from ordinary SQL: ENGINE, PARTITION BY, ORDER BY, and PRIMARY KEY. Partitioning divides the table into manageable ranges; ordering controls which rows are physically adjacent, which has a large effect on both query scan efficiency and compression; and the sparse primary key index must fit in RAM to support fast lookups on very large tables.

Insert-to-part creation. When an insert arrives, ClickHouse constructs the new part entirely in RAM, sorts it, compresses it, and writes it to disk atomically. The result is a directory of column files on disk. Wide parts store two files per column (a .mrk2 mark file indexing into compressed data blocks, and the compressed column data itself). Compact parts collapse all columns into fewer files for small inserts.

Directory structure: Ordinary vs. Atomic databases. On the filesystem, ClickHouse tables in an Ordinary (legacy) database are found by walking a straightforward directory tree under /var/lib/clickhouse/data/. Atomic databases, introduced a few years ago, add UUID-based indirection: the data/ tree contains soft links that point into /var/lib/clickhouse/store/ where data actually lives. This indirection makes atomic operations like EXCHANGE TABLES possible without moving data. System tables system.databases and system.tables expose the UUID, metadata path, and data paths for every object.

The merge process. ClickHouse continuously merges small parts into larger ones in the background. The merge selector scans partitions and identifies candidate parts; the merge executor schedules and runs them. During a merge, ClickHouse reads the source parts into memory, re-sorts and re-compresses the data, writes the new part, and then marks the old parts as inactive. Old parts are removed automatically after approximately eight minutes. Because both old and new parts coexist briefly, merges require free disk space equal to the size of the data being merged. system.merges shows active merges in real time; system.part_log (when enabled) provides a full historical audit of every merge and part event.

Part naming and tracking. Part names encode the partition ID, the minimum and maximum block numbers covered by the part, and the merge level. Level zero parts were created directly by inserts; level one parts have been merged once; and so on. system.parts exposes all of this along with an active flag distinguishing live parts from inactive parts waiting for cleanup.

Hard links and ALTER TABLE FREEZE. ClickHouse exploits Linux hard links throughout its storage layer. Because multiple directory entries can point to the same inode, ClickHouse can create an instant snapshot of a table’s parts in a shadow/ directory without copying any data. ALTER TABLE ... FREEZE does exactly this: it hard-links every part into shadow/, providing a stable snapshot for backup tools to copy while ClickHouse continues merging and inserting. ALTER TABLE ... UNFREEZE removes the hard links. The critical operational implication is that frozen parts consume inodes and disk space that df will report but system.parts will not, which can cause apparent disk-full conditions. Hard links are also used in mutations to avoid rewriting unchanged columns.

Part manipulation commands. ALTER TABLE DROP PARTITION, DETACH PARTITION/PART, ATTACH PARTITION/PART, MOVE PARTITION TO TABLE, and related commands operate by moving or removing directory hard links rather than by copying data, making them nearly instantaneous even on very large tables. The detached/ subdirectory inside each table’s data directory is where detached parts live until they are explicitly re-attached or dropped.

Mutations: ALTER TABLE DELETE and ALTER TABLE UPDATE. Mutations are asynchronous background rewrites of parts. An ALTER TABLE UPDATE is relatively efficient: ClickHouse only rewrites the column files that changed and hard-links all other column files from the old part into the new one, so unchanged data is not copied. An ALTER TABLE DELETE is expensive: every affected part must be fully rewritten because removing rows requires touching all column files. Mutations are tracked in system.mutations, which shows their status and any errors. Active merges executing a mutation are visible in system.merges. A stuck or runaway mutation can be stopped with KILL MUTATION WHERE ....

Common operational problems. Three problems arise from misuse of MergeTree semantics: too many parts (caused by inserts that scatter data across many partitions, or by inserts that are too small and too frequent), slow startup (caused by very large numbers of tables or parts, since ClickHouse must open file descriptors for every part on startup), and too many queued mutations (caused by issuing mutations without waiting for earlier ones to complete).

ReplicatedMergeTree. Replication in ClickHouse operates at the table level, not the database or cluster level. A ReplicatedMergeTree table is defined by specifying a ZooKeeper or Keeper path (identifying the shared logical table) and a replica name (identifying this particular copy). All tables sharing the same ZooKeeper path are replicas of one another. At the storage level, replicated tables are identical to non-replicated ones. Inserts, OPTIMIZE, and TRUNCATE are replicated. CREATE TABLE, DROP TABLE, RENAME, DETACH, and ATTACH are not replicated and must be executed on all nodes, or via ON CLUSTER. Replication is asynchronous but sequential: ZooKeeper or Keeper maintains the ordered queue of parts and scheduled merges that every replica must process. system.replication_queue and system.replicas expose the state of this queue and its replicas.

Distributed DDL and the ON CLUSTER queue. The ON CLUSTER clause on any DDL statement causes ClickHouse to place the command into a ZooKeeper task queue. A DDLWorker process on each node polls this queue and executes the command locally. system.distributed_ddl_queue shows what ClickHouse believes is in the queue; system.zookeeper lets you inspect the raw ZooKeeper entries. Stuck DDL tasks can sometimes self-heal; for persistent problems, entries can be removed directly from ZooKeeper using the Keeper CLI, but this is a last resort.

Here are the slides:

ClickHouse Data Management Internals-MergeTree Storage, Merges, Replication-2023-11-15 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

00:05 – Welcome and housekeeping
01:36 – Speaker introductions: Robert Hodges, Tatiana Solova (content), Alexander Zaitsev (CTO)
03:36 – What is ClickHouse? Overview for context
05:28 – Why understanding internals matters: faster queries, cost-efficient operation
06:24 – MergeTree: the one table engine to rule them all
07:11 – MergeTree table definition: ENGINE, PARTITION BY, ORDER BY, sparse primary key
09:13 – What happens on INSERT: constructing a part in RAM and writing to disk
10:37 – What a part looks like on disk: wide parts, .mrk2 mark files, compressed column files
14:28 – Compact parts vs. wide parts
14:52 – The /var/lib/clickhouse directory structure: data/, metadata/, store/
16:08 – Ordinary (Normal) vs. Atomic databases: UUID indirection and soft links
18:29 – Walking an Atomic database: system.databases, system.tables, soft links into store/
22:18 – Finding an Ordinary database table on disk: the data/ directory tree
25:05 – Why merges happen and how the merge algorithm works
28:47 – Part naming convention: partition_id_minblock_maxblock_level
30:10 – system.parts: active flag, level, row counts
31:16 – Part manipulation commands: DROP PARTITION, DETACH, ATTACH, MOVE
33:44 – How hard links work on Linux filesystems
37:04 – ALTER TABLE FREEZE and the shadow/ directory: instant snapshots via hard links
39:11 – Hard links, inodes, and the operational risk of frozen parts consuming disk space
41:16 – Mutations: ALTER TABLE DELETE vs. ALTER TABLE UPDATE
42:03 – How ALTER TABLE UPDATE uses hard links for unchanged columns
42:58 – How ALTER TABLE DELETE requires a full part rewrite
43:02 – system.mutations, system.merges, and KILL MUTATION
45:48 – Common problem: too many parts
47:54 – Common problem: slow startup from too many tables or parts
48:58 – Common problem: too many queued mutations
49:44 – ReplicatedMergeTree: table-level replication, ZooKeeper path, replica name
51:36 – What is and is not replicated in ReplicatedMergeTree
52:44 – Replication is asynchronous but sequential: the ZooKeeper queue
54:04 – system.replication_queue and system.replicas
55:06 – Distributed DDL and the ON CLUSTER queue: DDLWorker, system.distributed_ddl_queue
57:16 – Stuck DDL: wait first, inspect system.distributed_ddl_queue, remove from ZooKeeper as last resort
58:18 – Wrap-up: further learning, Altinity admin training class
1:01:22 – Q&A: temp_move directories in S3 cache, part count guidelines

Webinar Transcript

[00:05] – Welcome and Housekeeping

Robert: Hi everybody and welcome to our webinar on ClickHouse® data management internals. We will be talking about MergeTree storage, merges, replication, the mysterious DDL replication queue, and other good things. This is really going to dig into internals in a big way.

This video is being recorded. We will send an email to everyone who signed up with links to the recording and the slides, so you do not have to take frantic notes. We also have time for questions. I should be joined shortly by my partner in crime Alexander Zaitsev, our CTO. If you have questions, feel free to pop them into the chat or the Q&A box. If they are relevant to what I am currently discussing I will address them right away; otherwise we will have time at the end.

[01:36] – Introductions

Robert: I am Robert Hodges, Altinity CEO, but for the purposes of this talk I am a database and specifically a ClickHouse geek. I have been working with databases for about 40 years. Two other people contributed profoundly to this presentation. Tatiana Solova is an outstanding Altinity support engineer who has run ClickHouse for years, leads our administrative training class, and prepared a large part of the content in this talk. Alexander Zaitsev is our CTO and the architect of Altinity.Cloud; he has run and operated ClickHouse for many years as well.

Altinity is an enterprise provider for ClickHouse. Among other things, we operate Altinity.Cloud, the first managed ClickHouse developed for the Amazon public cloud, now also running in Google and Azure. We are the authors of the Kubernetes Operator for ClickHouse® and have long experience operating very large numbers of clusters at scale for customers and on their behalf.

[03:36] – ClickHouse Overview

Robert: For those who have not heard of ClickHouse: it is a very popular open-source real-time analytic database. It is particularly suited for real-time access to data where human analysts can do complex queries across very large amounts of data and get answers back in a second or less. With pre-aggregation, results can come back in as little as 20 milliseconds, fast enough to feed query results into a software process rendering a page or delivering recommendations.

Like MySQL it understands SQL, runs practically anywhere, and is open source (Apache 2.0). Like most data warehouses it stores data in columns, is very good at parallelizing execution using vectorized CPU instructions, and scales to many petabytes. We ourselves run Altinity.Cloud instances where individual servers attach 40 to 50 terabytes of block storage, and you can extend further to S3.

[05:28] – Why Understanding Internals Matters

Robert: This talk belongs to a series where we present what is going on under the covers in ClickHouse. The idea is simple: if you understand what ClickHouse is doing, you can make it faster, use it more efficiently, and understand the cost of different types of operations well enough to operate cost-efficiently. ClickHouse has a relatively simple execution model; the storage layer is less simple, but practically anyone can understand it given enough exposure. That knowledge is what we try to make as accessible as possible here.

[06:24] – MergeTree: The One Table Engine to Rule Them All

Robert: We are going to start with the basics of MergeTree, which is the workhorse table for big data in ClickHouse. ClickHouse has many table types, each serving a different purpose: selecting data directly from Kafka topics, maintaining data in memory, maintaining dictionaries. For this talk we focus entirely on MergeTree. All of its variants, including AggregatingMergeTree and ReplicatedMergeTree, behave essentially the same way at the storage level. What you learn here applies to all of them.

[07:11] – MergeTree Table Definition

Robert: A MergeTree table definition looks like SQL, except for some critical differences. Looking past the data types, the key elements are the ENGINE clause, the PARTITION BY clause, the ORDER BY clause, and optionally a separate PRIMARY KEY.

The PARTITION BY clause tells ClickHouse how to break the table into pieces. For time-series data you typically partition by month, which keeps all rows for a given month within the same set of parts. The ORDER BY clause controls how rows are physically sorted within each part. Ordering is critical in columnar databases because it determines which rows are adjacent, which limits how much data you scan in a query, and it profoundly affects compression since similar values compress better when they are contiguous. For large tables it is important to keep the primary key index to fewer columns than ORDER BY, because the sparse index is designed to fit in memory to allow fast lookups even on extremely large tables.

[09:13] – INSERT: Constructing a Part in RAM and Writing to Disk

Robert: When you do an insert, ClickHouse parses the statement, does some planning, then loads the data. What happens during load is that ClickHouse constructs what is called a part entirely in RAM: it sorts the data, compresses it, and writes it out to storage. This process is typically very quick; even a simple insert is mostly bound by network latency to the server rather than processing time.

So the immediate result of every insert is a new part on disk. If you just left it at that, you would end up with many small parts, which are inefficient for queries for several reasons we will cover shortly. This is why merging exists.

[10:37] – What a Part Looks Like on Disk: Wide Parts

Robert: Inside a MergeTree table the data are divided into parts, and parts live within partitions. In the example we are using, all rows in any individual part belong to the same month. There can be many parts for a given month: initially every insert produces at least one part, and merges then collapse them.

Looking closely at a part directory, rows within the part are sorted by the ORDER BY columns and a sparse index helps locate rows by those columns quickly.

This is what ClickHouse calls a wide part, and it is the primary structure to understand. A wide part has two files per column: a .mrk2 file and a compressed data file. The .mrk2 file is a mark file, which is an index into the starting points of compressed data blocks within the column’s data file. The primary key index structure points to particular marks, and marks point to particular blocks of compressed column data. This two-level indirection allows ClickHouse to efficiently locate and read just the portions of each column needed for a query.

One of the best things about ClickHouse is that unlike, say, InnoDB in MySQL where the storage is a largely opaque tablespace, you can go to the server and look directly at the individual files on disk. If you have access to the host, go to /var/lib/clickhouse/data/, navigate into the database and table directories, and you can see the parts and their constituent files.

[14:28] – Compact Parts

Robert: For small blocks, ClickHouse uses a compact part instead of a wide part. Compact parts collapse the column structure down to a much smaller number of files instead of two files per column. This means ClickHouse does not have to open hundreds of file descriptors for a small part. Compact parts are an important performance feature for systems with frequent small inserts.

[14:52] – The /var/lib/clickhouse Directory Structure

Robert: On a typical ClickHouse server the storage lives in /var/lib/clickhouse. Three directories are particularly important. The data/ directory is where you navigate to find the actual data inside your tables. The metadata/ directory is where all SQL table and database definitions are stored. The store/ directory is a relatively recent addition to the layout and is used to store table data for Atomic databases, as we are about to discuss.

[16:08] – Ordinary vs. Atomic Databases

Robert: One of the first things you encounter as you explore the on-disk structures is the difference between what ClickHouse calls Ordinary (Normal) databases and Atomic databases.

Ordinary databases are the original format. They lay things out on disk without any indirection, so the directory structure directly mirrors the logical structure: the database is a directory, the table is a subdirectory, and the parts are inside the table directory. This is simple and transparent.

However, this structure has a limitation: operations like EXCHANGE TABLES (where you atomically swap a new copy of a table into place) cannot be done safely with a plain directory structure, because renaming directories is not atomic at the filesystem level.

Atomic databases, introduced a few years ago, add UUID-based indirection. Every database and table is assigned a UUID. The actual files live under /var/lib/clickhouse/store/ organized by UUID. The entries visible under data/ are soft links that point into store/. This means that atomic operations like EXCHANGE TABLES can be done by simply flipping which UUID a soft link points to, without moving any data.

An important practical note: ClickHouse does not automatically upgrade an existing Ordinary database to Atomic. If you have an older server or an older default database, it will remain Ordinary until you convert it manually. A brand-new installation creates default as Atomic.

You can always find out which type any database is by querying system.databases. An Atomic database will have a UUID filled in; an Ordinary database will have an empty UUID. Similarly, system.tables exposes the metadata path and data paths for every table, pointing you directly to the relevant locations on disk.

[22:18] – Walking an Ordinary Database on Disk

Robert: In an Ordinary database, you can navigate directly. Do ls under data/default/ontime and you will see the part directories, the detached/ subdirectory, and a format_version.txt file. You can also query system.parts to get the same information including the exact path of every part.

The detached/ directory is important: it is where parts go when you issue an ALTER TABLE DETACH PART or ALTER TABLE DETACH PARTITION. ClickHouse moves the part directory there and then forgets about it. The data still exists on disk and can be re-attached later.

[25:05] – Why Merges Happen and How the Merge Algorithm Works

Robert: Because every insert creates at least one new part, and because small parts are inefficient for queries and consume large numbers of file descriptors (at minimum 218 files for a table with 109 columns), ClickHouse continuously merges parts in the background to produce fewer, larger parts.

The merge algorithm has two components. The merge selector runs continuously, scanning partitions and identifying sets of adjacent parts that are candidates for merging. The merge executor schedules the selected merges in the correct order and runs them.

When a merge runs, ClickHouse reads the source parts into memory, re-sorts all the data (applying the ORDER BY again across the combined rows), recompresses it, and writes the result as a new part. The old parts are then marked as inactive and removed automatically after approximately eight minutes.

Because both the old parts and the new merged part exist simultaneously on disk during the merge, you always need free disk space equal to at least the size of the data being merged. Active merging on a very busy system means you must maintain a buffer of free space, otherwise merges will stop entirely.

You can see what merges are currently running by querying system.merges. For a complete historical record, enable system.part_log, which logs every merge and every part event. This table is invaluable for understanding merge activity patterns over time.

[28:47] – Part Naming Convention

Robert: Part names encode their lineage. The format is partitionID_minBlock_maxBlock_level. The partition ID identifies which partition (for example, 201801 for January 2018). The min and max block numbers indicate which original insert blocks are contained in this part. The level indicates how many times the data has been merged: level 0 parts were created directly by an insert, level 1 have been merged once, level 2 twice, and so on. You can actually trace the provenance of any part from its name alone, which is a powerful diagnostic tool. For a deep reference on part naming and MVCC in ClickHouse®, the Altinity Knowledge Base article covers the details with live examples.

[30:10] – system.parts: Tracking the State of Your Data

Robert: system.parts is one of the most important tables for managing ClickHouse storage. It shows every part, both active and inactive, with columns for the part name, whether it is active, its merge level, its row count, its compressed size, and the partition it belongs to. After a merge you will see the old parts with active = 0, meaning they are marked for deletion but not yet removed, alongside the new merged part with active = 1. The merged part will have a much larger row count than any of its source parts.

[31:16] – Part Manipulation Commands

Robert: ClickHouse provides a family of ALTER TABLE commands for managing parts and partitions. The most commonly used ones include DROP PARTITION, DETACH PARTITION, DETACH PART, ATTACH PARTITION, ATTACH PART, MOVE PARTITION TO TABLE, and CLEAR COLUMN IN PARTITION.

These commands are extremely fast because, for operations local to a single disk, ClickHouse is not moving bytes of data. It is moving directory entries, which are hard links to the underlying files. Dropping an entire partition that contains a year of data, for example, completes in a second or two regardless of the data volume.

To use these commands you need to identify the partition IDs and part names from system.parts. A simple query selecting partition, partition_id, and name from system.parts for your table gives you all the identifiers you need.

[33:44] – How Hard Links Work on Linux

Robert: ClickHouse makes extensive use of Linux hard links throughout its storage layer, so understanding them is essential.

In a Linux filesystem, files are stored as blobs of data identified by inodes. A directory entry is just a name that points (hard links) to an inode. The important consequence is that multiple directory entries can point to the same inode, meaning two paths can refer to the same file. The file’s data is not deleted until all hard links to its inode are removed. The filesystem maintains a reference count on each inode for this purpose.

This property allows ClickHouse to create cheap snapshots of data: instead of copying files, ClickHouse creates new directory entries pointing to the existing inodes. The data is shared; only a small amount of new directory metadata is created.

[37:04] – ALTER TABLE FREEZE and the shadow/ Directory

Robert: The most visible use of hard links for operators is the ALTER TABLE FREEZE command for backups. When you run ALTER TABLE default.ontime FREEZE WITH NAME 'my_backup', ClickHouse creates a directory under /var/lib/clickhouse/shadow/my_backup/ and for every active part in the table it creates hard links inside that directory pointing to the same inode as the real part files. The operation is nearly instantaneous regardless of table size.

The result is a stable snapshot: even if ClickHouse continues inserting data and merging parts, the original part files will not disappear because the shadow directory’s hard links keep their inodes alive. A backup tool can then take its time copying the shadow directory to a remote location without risk of the data changing or disappearing.

ALTER TABLE ... UNFREEZE WITH NAME 'my_backup' removes the hard links from the shadow directory. At that point, if ClickHouse has already merged away the original parts, those files will be released and the disk space reclaimed.

[39:11] – Hard Links, Inodes, and the Risk of Frozen Parts

Robert: There is an important operational consequence of how freeze works. While a freeze is active, the part files it references occupy both disk space and inodes even though system.parts does not know they exist. This can cause two types of problems.

First, apparent disk-full conditions. If you run df and see 120 GB used on a disk where system.parts only accounts for 100 GB, the missing 20 GB may be in shadow directories from freezes. Second, inode exhaustion. Linux filesystems have a finite number of inodes. If you accumulate a very large number of small parts, each with its own set of frozen hard links, you can exhaust the inode table entirely. This is a serious problem because it makes it impossible to create new files even when disk space is available.

The practical rule is: always unfreeze backup snapshots promptly after you have finished copying them to remote storage.

[41:16] – Mutations: ALTER TABLE DELETE and ALTER TABLE UPDATE

Robert: When you need to change the content of existing rows in ClickHouse, you use mutations: ALTER TABLE DELETE WHERE ... and ALTER TABLE UPDATE column = expr WHERE .... These are asynchronous operations. When you issue one, it returns immediately, and the actual work happens in the background. This design reflects ClickHouse’s assumption that in-place changes to existing data are relatively rare in time-ordered analytic workloads.

ALTER TABLE UPDATE is relatively efficient. When ClickHouse processes the mutation on a given part, it only rewrites the column files for the columns named in the UPDATE clause. For every other column in the part, it creates hard links in the new part directory pointing back to the original inode. Only the changed columns are actually rewritten. The old part is then marked inactive and removed in the usual eight-minute window.

ALTER TABLE DELETE is expensive. Removing rows requires rewriting the entire part because all column files must be updated to remove the deleted rows’ offsets. Hard links cannot be used for any column. For large parts this generates substantial I/O and takes significant time.

[43:02] – Tracking and Controlling Mutations

Robert: Mutations are tracked in system.mutations. This table shows every active mutation, which parts have been processed, and which parts remain. If a mutation is failing it will also show the error message. The query SELECT * FROM system.mutations WHERE latest_fail_time > toDateTime(0) identifies any mutation that has encountered errors.

Active mutations currently executing can also be seen in system.merges, since the merge executor handles mutation work.

If a mutation gets stuck, is running too slowly, or was issued by mistake, you can stop it with:

sql

KILL MUTATION WHERE database = 'default' AND table = 'ontime' AND mutation_id = 'mutation_12345.txt'

ClickHouse will stop the mutation at the next opportunity. In rare cases (typically bugs) a mutation may not stop immediately, but this is uncommon.

[45:48] – Common Problem: Too Many Parts

Robert: Too many parts is one of the most common errors in production ClickHouse systems. ClickHouse will start rejecting or delaying inserts when the number of parts per partition exceeds the parts_to_throw_insert threshold.

The root cause is almost always the application generating inserts that create too many parts too quickly for the background merge process to keep up. The most typical scenario is inserting data in small batches, or inserting data where the rows span many different partition values, so a single insert block creates one new part per partition rather than one part total. For example, if you are partitioning by month and insert records spanning several years of data in one batch, ClickHouse creates a separate new part for each month represented.

The best fix is always to go back to the application and change the insert pattern: use larger batches, and ensure that each insert block targets as few partitions as possible. MergeTree settings can provide temporary relief, but they do not address the underlying cause.

[47:54] – Common Problem: Slow Startup from Too Many Tables or Parts

Robert: On startup, ClickHouse must open file descriptors for every part of every table to verify its state. A wide part with 109 columns has at minimum 218 files. If you have very large numbers of tables, or very large numbers of small parts, ClickHouse startup time can stretch to minutes or even longer.

There is an active discussion in the ClickHouse community about installations with 100,000+ tables. The simple operational guidance is: do not have a large number of tables. ClickHouse works best with a small number of very large tables. Similarly, large parts are fine. A part that is 50 gigabytes is not a problem for MergeTree; what causes problems is having thousands of tiny parts.

[48:58] – Common Problem: Too Many Queued Mutations

Robert: Mutations are processed sequentially from a queue. Because they return immediately without showing you the cost, it is easy to queue up many mutations, only to discover they are making slow progress. Mutations block one another: a mutation at the head of the queue must complete (or be killed) before later ones can start. If you queue many large mutations and the first one takes hours, the rest will wait.

The operational guidance is to be conservative about mutations: issue them one at a time, verify each one completes before issuing the next, and avoid running large mutations on tables with many large parts unless you understand the time and I/O cost involved.

[49:44] – ReplicatedMergeTree: Table-Level Replication

Robert: MergeTree tables support replication at the level of individual tables, not at the database or cluster level. To enable replication, you change the engine to ReplicatedMergeTree and supply two additional parameters.

The first parameter is a ZooKeeper or ClickHouse Keeper path that identifies the logical table. All replicas sharing this path will replicate data to one another. The second parameter is the replica name, which uniquely identifies this specific copy. Typically the replica name is set to the host name of the server using the {replica} macro.

At the storage level, replicated tables are structurally identical to non-replicated ones. The replication logic is layered on top of the same part structures, mark files, and column files.

[51:36] – What Is and Is Not Replicated

Robert: In a ReplicatedMergeTree table, inserts, OPTIMIZE TABLE, and TRUNCATE are automatically replicated. ALTER TABLE DDL operations that change the schema (adding a column, modifying a column type) are also replicated, which is necessary for new data to be compatible across replicas.

Operations that are specifically not replicated include CREATE TABLE, DROP TABLE, RENAME TABLE, DETACH, and ATTACH. These must be executed on every node individually, or via ON CLUSTER for distributed DDL. Two operations on an existing table are also not replicated: ALTER TABLE ... FREEZE (you typically only want to back up one replica) and ALTER TABLE ... MOVE PART TO DISK/VOLUME (because different replicas may have different storage configurations).

[52:44] – Replication Is Asynchronous but Sequential

Robert: Replication in ClickHouse is asynchronous: an insert does not wait for all replicas to acknowledge before returning. However, it is sequential: ZooKeeper or Keeper maintains an ordered queue of operations that every replica must process in the same order. This includes both new parts from inserts and merge schedules, since merges also change parts and must be synchronized across replicas to keep them consistent.

A replica that has been offline, for example for maintenance, will reconnect and process the queue to catch up on everything it missed. system.replication_queue shows the current state of this queue, including what type of operation is pending and which parts are involved. system.replicas gives a higher-level view of replica health and synchronization status across all replicated tables.

[55:06] – Distributed DDL and the ON CLUSTER Queue

Robert: The ON CLUSTER clause makes it possible to issue a single DDL command that executes on every node in a cluster. For example, CREATE TABLE IF NOT EXISTS ontime ON CLUSTER '{cluster}' (...) will create the table on all nodes without having to connect to each one individually.

Internally this works through a ZooKeeper task queue. When a node receives an ON CLUSTER DDL command, it writes the command as an entry into a ZooKeeper path. A background process called the DDL worker on each node polls this path, finds tasks assigned to it, and executes them locally. Results are written back to ZooKeeper so the initiating node can report success or failure.

You can inspect the state of this queue from two angles. system.distributed_ddl_queue shows what ClickHouse believes is in the queue. system.zookeeper lets you inspect the raw ZooKeeper nodes themselves. For diagnosing why a particular node is not executing DDL, querying system.distributed_ddl_queue cluster-wide using clusterAllReplicas() is especially useful.

For more detail on how ALTERs and DDL operations work internally, including mutation mechanics and the DDL queue, the Altinity Knowledge Base has a dedicated article.

[57:16] – Stuck DDL: What to Do

Robert: DDL tasks can occasionally get stuck, for example if a node is offline long enough that its DDL worker falls behind, or if there is a transient error during execution. The first and most important step is to wait. ClickHouse has many self-healing mechanisms for the DDL queue and many apparent problems will resolve themselves.

If waiting does not help, inspect system.distributed_ddl_queue to understand which task is stuck and why. As a last resort, DDL entries can be removed directly from ZooKeeper using the Keeper CLI or the ZooKeeper CLI, which effectively cancels those tasks. This is a destructive operation and should only be done if you understand exactly what is being removed. If you are an Altinity customer, this is exactly the kind of situation we help with.

[58:18] – Wrap-Up and Further Learning

Robert: This is a deep subject. There is much more that we did not cover today: S3 object storage, detailed query execution and how it accesses parts, compact versus wide parts in depth, column storage formats, and caching, which is a critically important topic for understanding ClickHouse performance when storage is slow. These topics all have existing resources in the Altinity library.

If you want to learn more, the best path is the Altinity administrator training class for ClickHouse®. The material in this talk represents just the tip of what that class covers, including resource management, cluster management, and much more.

For written documentation, the ClickHouse official docs are excellent. The Altinity Knowledge Base has practical, down-to-earth articles on specific problems. Altinity’s YouTube channel and blog have many additional videos and articles. And if you really want to understand what ClickHouse does with storage, reading the C++ source code is often the best guide: it is quite readable and well-commented.

[1:01:22] – Q&A

Audience: I have hundreds of directories named temp_move_NNN_timestamp_tablename. What is going on?

Alexander: Those temp_move directories are created when ClickHouse is moving data between different storage volumes, for example from one disk to another. When ClickHouse starts a move operation, it creates a temporary directory first. Normally this gets cleaned up when the move completes.

Robert: Right, similarly when ClickHouse does a merge it creates a temporary directory that is normally cleaned up. If these temporary directories are persisting for a long time, something is failing partway through the move and not cleaning up after itself. In this case it sounds like you are using S3 storage and the directories are appearing in the S3 cache. This looks like a specific bug. Please log this as an issue in the ClickHouse issue tracker: the people working on object storage would want to know about it.

Audience: How many parts is too many? How many tables?

Robert: For parts, aim for roughly a thousand active parts per partition or less. Large parts are fine. A single part that is 50 gigabytes is perfectly healthy for MergeTree, so do not be concerned about part size. What causes problems is a very large number of small parts. For tables, the same principle applies: ClickHouse works best with a small number of very large tables. Having tens of thousands of tables will create serious startup time and resource management problems. If you can redesign to reduce the number of tables, you should.

FAQ Section

Q: What is the difference between an Ordinary (Normal) and an Atomic database in ClickHouse, and does it matter?

A: Ordinary databases, which are the original ClickHouse format, store tables directly in the filesystem hierarchy under /var/lib/clickhouse/data/. Atomic databases, introduced a few years ago, give every database and table a UUID and store the actual data under /var/lib/clickhouse/store/ organized by those UUIDs. The entries visible under data/ are soft links pointing into store/. The practical difference is that Atomic databases support atomic table operations like EXCHANGE TABLES, where two tables are swapped instantaneously by flipping soft links rather than by moving data. This is useful for blue-green deployments and safe schema migrations. ClickHouse does not upgrade Ordinary databases automatically, so older servers may still have Ordinary databases, but new installations default to Atomic. You can check system.databases to see which type each database is.

Q: Why do ALTER TABLE commands like DROP PARTITION and DETACH PART complete almost instantly even on huge tables?

A: Because they operate on hard links rather than on data bytes. In the ClickHouse storage model, a part is a directory of files. Those files are referenced by hard links from the part directory. Dropping a partition means removing the hard links that reference the part files, which the Linux filesystem handles as a simple directory operation regardless of how much data is inside. The data bytes are not physically removed until the reference count on each inode reaches zero, which happens automatically when ClickHouse removes both the part directory and any shadow directories referencing those files. The actual on-disk reclamation happens in the background, but the operation completes immediately from ClickHouse’s perspective.

Q: What is ALTER TABLE FREEZE used for and what are the operational risks?

A: ALTER TABLE FREEZE creates an instant snapshot of a table by hard-linking all its active part files into a shadow/ subdirectory. This is the foundation of ClickHouse’s native backup mechanism. Because the snapshot uses hard links, it is instantaneous and does not block inserts or merges. The risk is that while the freeze is active, the hard-linked files cannot be reclaimed even if ClickHouse merges or drops the originals. This means the shadow directory consumes both disk space and inodes that system.parts will not account for, which can produce misleading disk-full conditions. Always unfreeze backup snapshots promptly after copying them to remote storage to release those resources. For more on this, see the Altinity blog article on ClickHouse backups.

Q: What is the difference between ALTER TABLE DELETE and ALTER TABLE UPDATE in terms of cost?

A: They differ significantly in how much data they must rewrite. ALTER TABLE UPDATE only rewrites the column files for the columns being changed. All unchanged column files in the new part are created as hard links pointing to the original files, so their data is not copied. This makes column updates relatively affordable even on large tables. ALTER TABLE DELETE, by contrast, must rewrite every column in every affected part because removing rows requires recalculating the offsets and storage positions for all remaining rows in all columns. There is no way to share unchanged data via hard links when rows are being deleted. For this reason, large-scale deletes are expensive operations in ClickHouse, and the recommendation is to use TTL-based expiry or partition drops when possible rather than row-level deletes.

Q: What system tables are most useful for diagnosing MergeTree operational problems?

A: The most important ones are: system.parts, which shows the state of every active and inactive part including row counts, sizes, and merge levels; system.merges, which shows currently running merges and mutations; system.part_log, when enabled, which provides a complete historical record of all part events including merges, inserts, and drops; system.mutations, which shows all pending and completed mutations and any errors they have encountered; system.replication_queue, which shows the per-table replication work queue for replicated tables; system.replicas, which gives a high-level health view of all replicated tables; and system.distributed_ddl_queue, which shows the state of pending ON CLUSTER DDL tasks. Taken together these tables give you a complete picture of what ClickHouse is doing with your data at any moment.

© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE^® SOLUTIONS

Get in touch with ClickHouse experts.

Leave a Reply Cancel reply