The Doctor Is In–Quick First Aid for Broken ClickHouse® Clusters

Recorded: October 16 @ 08:00 am PDT
Presenters: Tatiana Saltikova, Robert Hodges, and Diego Nieto

In this webinar, Altinity’s CEO, Robert Hodges, and Support Engineer, Diego Nieto, walk through six of the most common and damaging production ClickHouse® cluster problems: how to recognize them, what is actually happening under the covers, and how to fix them. A special thank-you goes to Tatiana Solova, who helped prepare many of the materials used in the presentation.

The talk opens by noting that these problems are hard to reproduce on a laptop. They appear on large, heavily loaded systems with real production data, which is precisely where they matter most. Robert and Diego cover each problem in sequence.

Problem 1: Too many simultaneous queries. When max_concurrent_queries is exceeded, ClickHouse rejects queries outright. The right fix order is: make queries more efficient first, then add resources, then use per-user profile limits, and only then raise the limit itself (after confirming the server is not already resource-saturated).

Problem 2: Too many parts. The infamous TOO_MANY_PARTS error is triggered when the number of parts per partition exceeds parts_to_throw_insert. Root causes include inserts that are too small, inserts that spray data across many partitions, materialized views with different partitioning from their source, and ClickHouse not merging small blocks fast enough. Fixes include larger batches, async inserts, Kafka buffering, and matching insert blocks to partition keys.

Problem 3: Detached and broken parts. Parts can end up detached by humans, as leftovers from merges, due to replication issues, or because of genuine corruption. The system.detached_parts table is your diagnostic starting point. The correct action depends on the reason field: no reason means manually detached (just re-attach); ignored or covered-by-broken means safe to drop; cloned is a leftover from replica repair; broken-on-start or broken means corrupt data; unexpected means ZooKeeper does not know about it and SYSTEM RESTORE REPLICA is needed.

Problem 4: Suspicious parts at startup. When ClickHouse detects too many broken parts on startup, it stops rather than proceeding. This is a circuit breaker protecting against misconfiguration. Check your configuration first. If it is safe to proceed, options include raising max_suspicious_broken_parts, editing the table SQL file to add the setting per-table, or using the force_restore_data flag.

Problem 5: Stuck mutations. Mutations (ALTER TABLE DELETE, column modifications, adding projections) are processed in a sequential queue. They can get stuck due to a bad mutation that will always fail, a second mutation blocked by the first, or simply very slow progress on a large table. Diagnose using system.mutations and system.merges. Fix by using KILL MUTATION to clear bad or blocking entries, or wait patiently for large legitimate mutations to finish.

Problem 6: Stuck replication and lost replicas. Replication depends on ZooKeeper or Keeper. Common causes of stuck replication include replicas going read-only because they cannot reach ZooKeeper, the replication queue blocked by large or broken mutations, old stuck tasks in the queue, and resource overload. Lost replicas occur when a replica is offline long enough that the Keeper log (which stores the last 1,000 operations per table) has moved on. The fix for a lost replica involves detaching the table, running DROP REPLICA to clear stale ZooKeeper metadata, attaching the table, running SYSTEM RESTORE REPLICA, and then SYSTEM SYNC REPLICA.

The session closes with Q&A covering how mutations affect concurrent reads (they do not block reads; new parts replace old ones atomically at merge completion), what the “leader” concept in system.replicas means, and how to specify a preferred projection for a query using the preferred_optimized_projection_name setting.

Here are the slides:

The-Doctor-Is-In–Quick-First-Aid-for-Broken-ClickHouse-Clusters-2024-10-17 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

00:19 – Welcome, housekeeping, and shout-out to Tatiana Solova
02:16 – Introductions: Robert Hodges and Diego Natto
04:00 – ClickHouse overview for context
06:32 – Why these problems are hard to reproduce outside production
07:26 – Problem 1: Too many simultaneous queries
11:10 – Fix 1: Making queries more efficient
12:05 – Fix 2: Per-user settings profiles to protect other users
13:01 – Fix 3: When and how to raise max_concurrent_queries safely
16:28 – Problem 2: Too many parts
17:00 – What parts are and why they matter
19:37 – The parts_to_throw_insert limit and what triggers it
21:19 – Root causes and fixes for too many parts
25:49 – Problem 3: Detached and broken parts
27:55 – What can go wrong with parts
28:30 – Using system.detached_parts to diagnose
30:34 – Fix table: actions by reason code
33:25 – Problem 4: Suspicious parts blocking startup
35:04 – Fixes for the suspicious parts circuit breaker
39:36 – Problem 5: Stuck mutations
40:00 – What mutations are and how they are queued
42:58 – Three ways mutations get stuck
43:06 – Diagnosing with system.mutations and system.merges
44:43 – Fix: KILL MUTATION or wait
44:56 – Problem 6: Stuck replication and lost replicas
46:52 – How replication works: ZooKeeper, the replication queue
47:00 – Diagnosing replication health
49:04 – Common replication problems and fixes
54:20 – Lost replicas: the Keeper log limit and when it matters
56:17 – Fix: detach, drop replica, attach, SYSTEM RESTORE REPLICA, SYSTEM SYNC REPLICA
58:04 – Q&A: mutations and concurrent reads, leader in system.replicas, projection selection
59:46 – Summary and wrap-up

Webinar Transcript

[00:19] – Welcome and Housekeeping

Robert: Welcome everybody. Today’s webinar is “The Doctor Is In: Quick First Aid for Broken ClickHouse® Clusters.” My name is Robert Hodges and I will be presenting today with Diego Natto. We will do introductions in just a second. We also want to give a huge shout-out upfront to Tatiana Solova, who helped prepare many of the materials used in our presentation today.

A few tips to help you enjoy it more. This presentation is being recorded. You will get a link to the recording as well as a link to the slides shortly after the webinar ends, usually the same day or at the latest the next morning. We also have time for Q&A. If you have questions about these topics or about other ways ClickHouse servers can break, feel free to put them into the webinar chat or the Zoom Q&A box. We will take questions at the end of each section. We are going to talk about six important types of problems that you can have with ClickHouse clusters.

Important warning: if you or a loved one is experiencing a data-threatening ClickHouse emergency, please stop listening to this webinar right now and get help immediately. You can talk to us on Slack, that is where Diego and I usually hang out along with a lot of other Altinity people. You can also hit the contact link on our website, set up a meeting, or go to the ClickHouse Slack. Get help. Do not wait.

[02:16] – Introductions

Robert: My name is Robert Hodges. I am the CEO of Altinity. I have been working on databases for about 40 years. I have been working a great deal on Kubernetes and data since 2018, and I have been working with ClickHouse itself since 2019.

Diego: Hello, nice to be here. I am Diego Natto, a support software engineer at Altinity. I am mostly interested in databases, ClickHouse obviously, and others like PostgreSQL. Also Python and REST a little bit.

Robert: Just a little bit about Altinity: we do support and cloud services for ClickHouse. Altinity.Cloud is our cloud platform, which allows you to run ClickHouse either in our cloud if you just want an endpoint and get started, or in your own cloud accounts using a bring-your-own-cloud model. We are the authors of the Altinity Kubernetes Operator for ClickHouse®, one of the most popular operators on GitHub for databases. We are also the maintainers of Altinity Backup for ClickHouse®, Altinity Stable® Builds for ClickHouse®, and a number of other open-source projects. On top of that, we have contributed on the order of 600 PRs directly to ClickHouse itself.

[04:00] – ClickHouse Overview

Robert: ClickHouse® is a real-time analytic database. You can think of it as combining the best aspects of a transactional database like MySQL with an analytic database like Vertica. From the MySQL side: SQL understanding, tables, runs practically anywhere, open source, simple to set up. From the analytic side: a shared-nothing architecture where processes on a network each have their own storage and communicate over the network to execute queries. ClickHouse stores data in columns. Those columns are highly compressed and include secondary indexes and primary key indexes that allow you to find things quickly. ClickHouse has vectorized execution, treating columns as arrays that can be pushed onto individual CPU cores. It is very efficient at soaking up all available cores. It can also fan queries out in parallel across many nodes on the network. It is very scalable: systems processing more than a petabyte of compressed data are common now, and we are seeing even larger storage as time goes on. With Apache 2.0 licensing and a very large community, ClickHouse is probably the single most popular real-time analytic database on GitHub.

[06:32] – Why These Problems Are Hard to Reproduce

Robert: The problems we are going to talk about today are interesting because if you are just running ClickHouse on your laptop you are very unlikely to see them. They are actually quite hard to recreate in a test environment. They appear on large systems with a lot of data, heavily loaded, maybe underresourced, maybe hitting peaks in processing. We are going to dive in and talk about six of these problems: how you recognize them, what is actually going on, and how you fix them.

[07:26] – Problem 1: Too Many Simultaneous Queries

Robert: One of the most common problems people hit in production ClickHouse systems is a “too many simultaneous queries” error. If you have a busy ClickHouse system with a lot of different users or applications accessing it, you may from time to time see an error like this: Code: 202. Too many simultaneous queries. Maximum: 25. Your query does not run; it just breaks and you have to retry later. This is a surprisingly common error in busy systems.

When people see this, their immediate response is “I need to fix this.” There are a number of options. They range from making your queries more efficient all the way to increasing the max_concurrent_queries limit, which is the setting being exceeded when you get that error message.

If you just talk to your manager, he or she will probably say “just raise the limit.” It turns out that is possibly the worst way to fix it in many cases. In most production systems the limit is set at a reasonable level, and what it is really doing is protecting you from having your ClickHouse server swamped by a very large number of users, in which case queries just go slower and slower until the system grinds to a halt.

The best fix, when you have time, is to make the queries more efficient. The reason queries are piling up is because they are taking a while to run. If your queries run faster, you are less likely to hit this limit. If queries run very fast indeed, the number of concurrent queries will never reach a high level.

The second thing, the quick fix for when this happens on Sunday morning, is to add more resources. If you are operating in a cloud or on Kubernetes, it is usually pretty simple to scale up to a larger VM size. Adding more CPU, and to a lesser extent memory, can make this problem disappear quickly.

Third, sometimes particular applications or users are beating up the system. You can put per-user limits on those particular users so their queries break, but they will not break everybody else’s. This is a middle ground when you cannot fix the issue yourself and cannot get somebody else to fix their code either.

Finally, when you have looked at all these, you can go ahead and raise max_concurrent_queries, but we will show you how to do that safely.

Making queries more efficient. Here is a trivial example: computing the average of a series of numbers by scanning a billion rows versus computing it analytically with a closed-form expression. It is a silly example, but it illustrates that by thinking about a query differently it can run much faster.

Limiting users with settings profiles. A settings profile allows you to set limits that apply to a given user or set of users. Here is an example using the RBAC ALTER USER command: setting max_concurrent_queries_for_user to 15 (so any given user can only have 15 concurrent queries) and setting a total limit under that profile to 20. In the test system where the server limit was 25, this protects other users.

Raising max_concurrent_queries safely. There are times when raising this is absolutely the right thing to do. First, check whether the server is actually overloaded. Look at load average with htop. If the load average is not particularly high and CPUs are not saturating, then it is time to think about raising the limit. If you do not have enough resources, your next step is to add them and then run the check again. If you see adequate resources, you can go ahead and raise max_concurrent_queries. This is a global server setting you can set via XML in a max_connections.d config file. ClickHouse will pick it up without a restart. If you skip this check and just raise the limit from 100 to 500, you will simply pile up 500 queries instead of 100 and the system will jam to a halt again.

Diego Q&A: These restrictions apply to ClickHouse users, not operating system users or host-level processes. There is no connection between OS users and ClickHouse users. They are totally different, like any database that has its own set of users.

[16:28] – Problem 2: Too Many Parts

Robert: Practically every person who uses ClickHouse at a moderately large scale with real data has seen this error. Let me first explain what parts actually are.

In large analytic databases, we partition the data: we divide the table into pieces that can be treated as essentially independent for the purposes of both querying and managing the data. For example, if you have time-series data from an observability system, you might partition by month, grouping all the data from a single month into a single partition. Within each partition, ClickHouse further splits the data into parts: chunks of the table where everything within a chunk is within a single partition. Parts are the basic unit of read and write when inserting data and when doing updates.

When you insert a chunk of data, ClickHouse creates a part of up to a million rows by default and writes it to storage. You can immediately query that part after the insert returns. However, these parts, which are often 10,000 to 500,000 rows, are too small for efficient querying. ClickHouse watches parts as they arrive and systematically rewrites them into bigger parts in the background. This process of merging: longer run lengths, fewer files to open, more efficient reads. It is also key to enabling high-performance updates and deletes. This is largely the origin of the name “MergeTree.”

The error. When you see the “too many parts” error, ClickHouse is hitting a per-partition limit called parts_to_throw_insert. The default is 300 parts per partition. When that limit is exceeded, ClickHouse starts rejecting inserts (or delaying them with the related parts_to_delay_insert setting). The message about “merges are proceeding significantly slower than inserts” is what usually grabs people’s attention, but that is not always what is actually happening. What is happening is the part count limit is being exceeded.

Root causes and fixes:

Inserts are too small. This is the most common cause. Analytic databases with columnar storage are very efficient for reading but not particularly efficient for writing, because by default ClickHouse creates a couple of files per column. If you have a 100-column table, writing a single row means creating about 200 files. We want large batches. A million rows is great. Ten million rows is also good. ClickHouse can take very large batches, so do not worry about that. Sometimes it is a simple change in your app. You can also enable async inserts, which buffers inserts in memory and flushes them as a larger batch. The downside is potential data loss if ClickHouse crashes while the buffer is in memory. A third approach is routing inserts through Kafka or Redpanda, which lets you buffer data from upstream sources and then read from the Kafka topic in larger batches.

Inserts going to multiple partitions. When you are loading data by time, it is great if all rows in a single insert have roughly the same date or the same month, because then all of them land in one partition and create one part. If the insert contains data spread across many different months, ClickHouse will create separate parts in each partition. A 100,000-row insert that spans 100 partitions creates 100 parts, not 1. Try to match your insert blocks to a single partition.

Materialized views with different partitioning. This is a subtle one. A materialized view is basically a trigger that operates on arriving data: it runs a query on the inserted rows and puts the results into another table. If that target table has radically different partitioning from the source, the materialized view will quietly generate hundreds or even thousands of tiny parts. This is a production problem that can take a while to figure out.

ClickHouse not merging fast enough. There are cases where ClickHouse is simply not merging quickly enough. There are advanced settings for this, for example max_bytes_to_merge_at_min_space_in_pool, which tells ClickHouse to preferentially merge smaller parts first to prevent them from building up. At this point you will need to read the code: these settings are not well documented and the C++ comments are usually your best guide.

[25:49] – Problem 3: Detached and Broken Parts

Robert: Speaking of parts, let us consider bad things that can happen to them inside ClickHouse. In addition to having too many parts, you can have parts that get detached or become broken.

What can go wrong with a part. On a cluster with replication across three ClickHouse nodes, a part can end up detached for several reasons. Humans sometimes detach parts intentionally, for example during maintenance or when moving data. Parts can be left over from a merge operation where ClickHouse merged several parts into a single larger one but failed to clean up the source parts. A part can get out of date and not be dropped, or fail to have its metadata recorded in Keeper. And parts can get corrupted: a single bit flip in a checksums.txt file inside the part is enough to prevent ClickHouse from reading it.

If a part has problems, it usually ends up in system.detached_parts. You can query this table directly. Here is an example showing one part that was manually detached by a user (I happen to know because I did it myself) and another that was broken on startup because a file in the part was corrupted. You can simulate this yourself by deleting a file from a part directory and restarting the server.

Fix table by reason code. The reason column in system.detached_parts is your guide:

No reason: The part was detached by a human. Find out whether it was detached intentionally. If not, run ALTER TABLE ... ATTACH PART to put it back. This is nearly instantaneous.
ignored or covered-by-broken: A bigger part already covers the same data blocks. These are safe to delete with ALTER TABLE ... DROP DETACHED PART.
cloned: Leftover from repairing a lost replica. Also safe to drop.
broken-on-start or broken: Corrupt data. If the part is already covered by another copy (for example, because it was fetched from another replica), safe to delete. To find out if it is covered, compare system.detached_parts with system.parts. There is a query for this in the Altinity Knowledge Base. If the part is not covered, you have real data loss, and you need to determine whether the data is something you can afford to lose. The blog article on understanding detached parts covers the full lifecycle in detail.
unexpected: ClickHouse found a part on disk but cannot find it in ZooKeeper. This means ClickHouse does not think it belongs to the table. Do not delete this one without running SYSTEM RESTORE REPLICA first to see if the metadata can be recovered.

Diego Q&A on data loss from broken parts: If you drop a part because it is broken and some of the columns in that part are corrupted, you may be able to recover the non-corrupted columns but those specific corrupted columns will be lost. Data loss from broken parts can be partial.

[33:25] – Problem 4: Suspicious Parts Blocking Startup

Robert: A variation on the broken-parts problem is what ClickHouse calls “suspicious parts.” This problem typically happens on startup.

When ClickHouse starts up, it examines each table’s parts to make sure they are healthy. If ClickHouse sees more than a certain number of what appear to be broken parts, it stops. The reason is important: ClickHouse is protecting you from a case where you changed the configuration, for example you changed a ZooKeeper path for a replicated table or moved storage around and left the config inconsistent. ClickHouse is essentially giving you a chance to look at the situation and fix it before proceeding, rather than plowing ahead, detaching all the broken parts, and restarting, which could hide a serious misconfiguration.

What to do. The first thing is to check your configuration. If you have not changed it recently, you are probably fine. In that case there are several options to get past this: raise the max_suspicious_broken_parts MergeTree setting to a large number, which will let ClickHouse come up and you can then investigate the detached parts. You can also edit the SQL file that contains the ATTACH TABLE command for that specific table and add the setting there, so other tables will still be protected. If you can access the filesystem and know the data is safe to restore from other replicas, you can use the force_restore_data flag, which tells ClickHouse to fetch data from other replicas and come up. See the Knowledge Base article on suspiciously many broken parts for the full procedure.

Long-term prevention. If you are seeing broken parts recurrently, these problems are usually a sign of something wrong with how you are running ClickHouse. Scale up your hosts, because one way to get broken parts is ClickHouse crashing in the middle of an operation, leaving partially written data. Load the system less by making queries more efficient. Avoid overusing replication: having hundreds of replicated tables creates situations where replication can get bogged down, differences can accumulate between hosts, and crashes during writes can lead to inconsistent parts. ClickHouse works best with a few really large tables.

[39:36] – Problem 5: Stuck Mutations

Robert: Let us talk about stuck mutations. The first question people new to ClickHouse always ask is: what exactly is a mutation?

A mutation means you have asked ClickHouse to change a table. They come in different varieties:

ALTER TABLE DELETE WHERE ...: the most heavyweight, because every part that contains matching rows must be completely rewritten.
Adding or dropping a projection: only affects the projection files within the part, no full rewrite needed.
ALTER TABLE MODIFY COLUMN: only affects the one column, not the rest of the part.

In all cases, ClickHouse processes these mutations in a queue and creates new parts. The mutation is submitted to a queue via an ALTER command, and unless you use the SYNC option, it happens in the background. ClickHouse has background threads that look for parts needing rewriting and process them over time.

Three ways mutations get stuck:

Bad mutation that always fails. For example, you might run an ALTER TABLE MODIFY COLUMN to convert a string column to an integer, but some values in the data cannot be converted. ClickHouse will queue this mutation happily but it will always fail when it tries to process any part containing invalid values. You will see errors in system.mutations.

Second mutation blocked by the first. Mutations run sequentially in the queue. If a bad mutation is first in the queue, a good mutation behind it will never run until the bad one is cleared.

Mutation on a very large table. If you legitimately need to delete rows from a table with thousands of very large parts, that mutation is going to be slow. Very slow. This is probably the single most common reason people see mutations stuck.

Diagnosing with system tables. system.merges shows you actual merges and mutations happening on parts right now. system.mutations shows you the queue and the order of processing. The most useful diagnostic query is to look for mutations that have had at least one error by filtering on latest_fail_time > toDateTime(0). This gives you the error message and the timestamp of the last failure.

Fixing stuck mutations. Use KILL MUTATION WHERE database = '...' AND table = '...' AND mutation_id = '...' to remove a bad or blocking mutation from the queue. This is like killing a query: provide matching column values and the mutation is removed. In rare cases a mutation will not stop immediately, but generally this unblocks the queue and lets other mutations run. The other answer, for a mutation you actually want and that is processing a huge table, is simply to wait.

Diego Q&A: do mutations block reads? The short answer is no. When a mutation is scheduled, ClickHouse reads the parts and applies the mutation operations, creating new parts. The original parts remain live throughout this process. Queries continue reading the old parts. When the mutation finishes and the new parts are complete, ClickHouse atomically marks the old parts as inactive and the new parts as active. Queries that were already in progress against the old parts continue to completion; all new queries read the new parts. There is no blocking.

[44:56] – Problem 6: Stuck Replication and Lost Replicas

Robert: In a related story, stuck mutations can lead to stuck replication, but there are other ways replication can get stuck as well.

How replication works. When you insert a part into a replica, ClickHouse writes the part to storage and updates the metadata in ZooKeeper. Other ClickHouse servers that are replicating that table watch that metadata. When they see a new part arrive, they fetch it and apply it. There is a queue of things waiting to be replicated: new parts, merge operations, and so on. ClickHouse uses ZooKeeper or Keeper to hold this queue and ensure it is properly synchronized so that operations happen in order and do not corrupt data.

Checking replication health. If you think replication is broken, there are several things to check. First, use system.replicas. It has a row for every table using a ReplicatedMergeTree engine and tells you the state of that table. You can run this query over an entire cluster. One of the most important columns to look for is is_readonly: a replica in read-only mode cannot accept changes, usually because it cannot talk to ZooKeeper. Next, check ZooKeeper connectivity using system.zookeeper_connection. If ClickHouse cannot talk to ZooKeeper, it cannot replicate. If you need to dig deeper, you can get the ZooKeeper path from system.replicas and run SELECT against system.zookeeper to inspect the actual metadata. Finally, check the state of the replication queue itself with system.replication_queue: a simple query counting entries and flagging those with errors will tell you whether you have a blocked queue and what the errors are.

Common replication problems and fixes:

Replicas are read-only. Most common reason is they cannot see ZooKeeper. Check ZooKeeper health, fix connectivity, and once ZooKeeper is reachable the replicas will typically recover. You can also use SYSTEM RESTORE REPLICA to rebuild the ZooKeeper metadata from disk if needed. See the Altinity Knowledge Base on ClickHouse replication problems for detailed diagnostic queries.

Replication queue blocked by mutations. This is one of the single most common reasons replication is stuck. Look for large mutations in system.mutations. If they are blocking, either wait or kill them and run them later. For broken mutations, delete them to free up the queue.

Old or stuck tasks in the queue. For tasks over 24 hours old that have not completed, MUTATE_PARTS and MERGE_PARTS tasks can be safely deleted from the queue. Look for high values in num_tries or num_postponed.

Replication overloaded. Sometimes just adding resources to your ClickHouse cluster will cure replication lag. Replication requires reading and writing. If you do not have sufficient resources or have very high query load, replication falls behind. Adding resources can fix it. And finally, a really useful piece of advice: wait. A lot of replication problems, if you just wait, will get better. If there was a large batch job or a large mutation passing through, the system may lag for a while but will bounce back once the load passes.

Diego Q&A on the “leader” concept in system.replicas: The leader replica is the one that is ahead of other replicas in terms of executed operations in the replicated log. It is essentially a pointer indicating which replica has written more operations. In older versions of ClickHouse this was more meaningful for troubleshooting, but in current versions you generally do not need to check whether a replica is the leader or not. It is more of a legacy concept.

Lost replicas. The Keeper log stores the last 1,000 operations per table. If a replica is offline long enough that the log has moved on, the replica can no longer tell what parts it is missing just by reading the queue. ClickHouse marks it as lost. The replica must then poll an active replica table to find out which parts it is missing and sync up. This normally happens automatically. But sometimes you may need to help ClickHouse by manually restoring the replica.

The restore procedure. First, detach the table so it becomes unavailable. Run DROP REPLICA against the ZooKeeper path for that replica, which wipes out the stale ZooKeeper metadata. Then attach the table: it will come back but will be read-only because ZooKeeper metadata is missing. Run SYSTEM RESTORE REPLICA, which internally goes through a process of detaching and reattaching all the table’s partitions, recreating the ZooKeeper metadata from the local disk files. Finally run SYSTEM SYNC REPLICA to wait until the replica is properly synchronized. One caveat: if you had parts that were intentionally detached for a valid reason, this procedure will reattach them, so check your detached parts before running it.

[59:46] – Summary and Wrap-Up

Robert: A few summary points.

ClickHouse is actually quite hard to break. The hardest part about doing a talk like this is simulating these conditions, because they are tricky to recreate. If you are running on a laptop or in a test environment, chances are you will not see these problems. You will see them on production systems with a lot of data and heavy load, which is precisely where you care most about them.

You can fix these problems as we have shown. Diego probably has more experience than almost anybody on our team at fixing broken parts. But the best thing to do is prevent the problems in the first place.

Prevention: Write a good application. Use a good schema with good compression levels to minimize IO. Write efficient queries that read the least amount of data needed. These will all help queries run faster and put less load on the system.

Provision adequate hardware. When your systems need CPU, they need to actually have it. RAM and storage bandwidth also matter. Inadequate resources can cause ClickHouse to go from serving a query in 100 milliseconds to 15 seconds.

Do not abuse mutations and replication. Refreshing the entire contents of a big table on a daily basis via mutation is going to cause trouble. It creates a huge mutation that can block other operations and block replication. And if the system crashes during that period, you have a mess.

When problems come up: First, wait. Sometimes they cure themselves because the load was transient. As soon as the peak passes, ClickHouse will bounce back.

If that does not work, fix it as we have shown. For the most part these problems do not cause data loss, but there is definitely careful analysis that goes into some of them.

If things get serious, we are available to help. We do this for a living. Contact us. We also offer training on exactly this topic: some of the slides in this talk are taken from our administration training class.

For further reading, the Altinity Knowledge Base is probably the best place to find these specific procedures. For some of these problems, particularly around settings, the ClickHouse source code itself is often the best reference: read the C++ comments.

FAQ Section

Q: What is the “too many parts” error and what is the most common cause?

A: The TOO_MANY_PARTS error occurs when the number of data parts in a single partition of a MergeTree table exceeds the parts_to_throw_insert threshold, which defaults to 300. ClickHouse responds by either delaying or rejecting inserts to protect the background merge process from being overwhelmed. The most common cause is inserts that are too small. Because ClickHouse stores each column in its own files, every insert creates at least one file per column. Inserting a row at a time means creating hundreds of tiny parts, which the merge process cannot keep up with. The fix is to batch inserts into groups of at least tens of thousands of rows, ideally a million or more.

Q: How do I know whether a detached part is safe to delete?

A: Check the reason column in system.detached_parts. Parts with the reason ignored or covered-by-broken are safe to drop because a larger valid part already covers their data. Parts with the reason cloned are leftovers from replica repair and are also safe to drop. Parts with no reason were detached by a human and should only be dropped after confirming they are no longer needed. Parts with reason broken-on-start or broken represent corrupt data: check whether the same part exists in system.parts (meaning it was fetched from another replica and the broken copy is already replaced). If it is covered, drop it. If it is not covered, you have real data loss. Parts with reason unexpected should not be deleted without first running SYSTEM RESTORE REPLICA to see whether the metadata can be recovered.

Q: What is the safest way to kill a stuck mutation?

A: Use KILL MUTATION WHERE database = '<db>' AND table = '<table>' AND mutation_id = '<id>'. You can find the mutation ID and any errors by querying system.mutations filtered by latest_fail_time > toDateTime(0). The KILL MUTATION command removes the mutation from the queue and is generally safe. In rare cases a mutation does not stop immediately due to a bug, but this is uncommon. After killing a blocking mutation, any mutations behind it in the queue will begin processing.

Q: Do mutations block reads on the table while they are running?

A: No. ClickHouse handles mutations without blocking reads. When a mutation is being applied to a part, it creates a new version of that part with the changes applied. The original part remains live and visible to all read queries throughout the process. When the mutation finishes and the new part is fully written, ClickHouse atomically marks the old part as inactive and the new part as active. Queries that were already reading the old part continue to completion. All new queries will read the new part. There is no point during a mutation where reads are blocked.

Q: What is the SYSTEM RESTORE REPLICA command and when should I use it?

A: SYSTEM RESTORE REPLICA rebuilds the ZooKeeper or Keeper metadata for a replicated table by reading the actual data parts on disk. You use it in two scenarios: first, after a replica has gone so far out of sync that ClickHouse cannot recover automatically, which you diagnose by seeing the replica in read-only mode or finding that its replication queue is hopelessly out of date. Second, when you have parts in system.detached_parts with reason unexpected, meaning ClickHouse found parts on disk that have no corresponding ZooKeeper metadata. The procedure is: detach the table, run DROP REPLICA to clear the stale ZooKeeper node, attach the table (it comes back read-only), run SYSTEM RESTORE REPLICA, and then run SYSTEM SYNC REPLICA to wait for the replica to catch up.

Q: Why does the “suspiciously many broken parts” error stop ClickHouse from starting?

A: ClickHouse does this deliberately as a circuit breaker. If it detects too many broken parts on startup, it stops rather than proceeding and detaching them all. The reason is that a large number of broken parts often indicates a configuration problem, such as pointing a replicated table to the wrong ZooKeeper path or having moved storage without updating the configuration. In that case, blindly detaching all the parts and continuing would cause data loss and hide the underlying problem. By stopping, ClickHouse gives you the chance to check your configuration first. If your configuration is correct and the broken parts are a result of a crash or hardware failure, you can safely proceed by raising max_suspicious_broken_parts, editing the table SQL file to add the setting for just that table, or using the force_restore_data flag.

© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE^® SOLUTIONS

Get in touch with ClickHouse experts.

Related:

Leave a Reply Cancel reply