Five Ways to Make Your ClickHouse® Slow (and How to Avoid Them)

Recorded: February 18 @ 08:00 am PST
Presenters: Robert Hodges

In this webinar, Altinity CEO Robert Hodges and ClickHouse architect Mikhail Filimonov break down five of the most common performance killers that teams encounter when building or scaling ClickHouse-based systems. Starting from a quick review of the shared-nothing, columnar architecture that makes ClickHouse exceptionally fast, the presenters then walk through each major pitfall with concrete examples drawn from the public “On Time” airline flight dataset.

The five problems covered are: (1) poor table design, particularly over-partitioning by high-cardinality columns instead of time; (2) too many small inserts that overwhelm the merge process; (3) poorly structured queries, including inappropriate aggregates and suboptimal joins; (4) insufficient hardware resources, compounded by the fact that ClickHouse servers simultaneously handle queries, inserts, merges, and updates; and (5) migrations from non-relational databases such as Splunk or Elasticsearch, where data organization assumptions differ fundamentally from ClickHouse’s model.

Throughout the session, Hodges and Filimonov emphasize that performance problems are nearly always solvable but require testing against realistic data and workloads. They cover practical tools, including system tables, async inserts, HyperLogLog aggregates, dictionaries, and ClickHouse Keeper monitoring. The session closes with an audience Q&A covering join spill-to-disk behavior, materialized views, sharding strategy, and Keeper health metrics.

Here are the slides:

Five Ways to Make Your ClickHouse Slow-2026-02-18 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

00:00 – Introduction and housekeeping
01:14 – About Altinity and session overview
03:54 – ClickHouse architecture review: shared-nothing, columnar storage, and replication
07:29 – Why ClickHouse is fast: columnar compression and parallel reads
12:43 – Problem 1: Bad table design and over-partitioning
21:31 – Background merges, partition by time best practices, compression codecs
23:26 – Problem 2: Too many small inserts and write amplification
28:39 – Async inserts and Kafka-based ingestion
31:25 – Problem 3: Bad queries, aggregation memory, and join optimization
39:28 – Using system.query_log to profile queries
41:05 – Problem 4: Insufficient resources and capacity planning
49:39 – Problem 5: Migrating from non-relational databases (Splunk, Elasticsearch)
57:09 – Summary and action items
59:52 – Q&A: joins and spill-to-disk, dictionaries, materialized views, sharding

Webinar Transcript

[00:00] – Welcome and Housekeeping

Robert: Hello everybody, and welcome to our webinar on five ways to make your ClickHouse slow, and how to avoid those ways. My name is Robert Hodges. I’m going to be here today with Mikhail Filimonov, and we are going to get started in one minute because I’m going to give people 60 seconds to join. We had a bunch of signups. Hopefully, more people will be dropping in, and then we’ll get started. I’m going to mute and we’ll be back in 60 seconds. Thank you very much.

[01:14] – Introductions and About Altinity

Robert: All right, the 60 seconds are over. Let’s get going on five ways to make your ClickHouse slow and how to avoid them. My name is Robert Hodges. I’m CEO of Altinity. I have with me here today my colleague, Mikhail Filimonov, who’s our ClickHouse architect, and we are going to be talking about ClickHouse performance.

Before we dive in, I’d like to cover a few brief things that will make this webinar more fun and enjoyable. First, it is being recorded. Not only that, we will give you that recording as well as a copy of the slides, so no need to make hasty notes. You’re going to get it all in email within a few hours of this talk ending.

We also have time for questions. Mikhail is here and can help answer questions both in the foreground and in the background. If you have a question as we go along, please go ahead and pop it into the Q&A box. It’s available on the Zoom menu bar, and there’s also a chat. You can throw questions in there as well. We will answer them no matter where you put them, and we should have a little bit of time at the end for extra questions. If there’s something you want to know, just throw it out there and we’ll answer it if we can.

With that, let’s get going. Quick introduction to Altinity. If you haven’t heard of us, many of you are friends and have been working with us for years, or you otherwise contribute to the ClickHouse community. We are a vendor for ClickHouse. We are not the same as ClickHouse Incorporated. We collaborate on open source, but we are competitors in the market.

What do we do? We started out offering enterprise support, 24×7 support for ClickHouse, for any purpose, from embedded systems to huge cloud clusters with hundreds of nodes. We also run Altinity.Cloud, which was the first cloud for ClickHouse introduced in North America on Amazon and has been going strong for six years. If you’ve been around, you may be aware that we are the authors of the Altinity Kubernetes Operator for ClickHouse, a very popular way of running ClickHouse on Kubernetes. We also have something called Project Antalya, where we’re adapting ClickHouse to use Iceberg as native storage. Those are all things we work on. We’d be happy to talk about them. But today, we’re going to talk about performance.

[03:54] – ClickHouse Architecture Overview

Robert: To get started, I’d like to give a quick overview of the ClickHouse architecture. Most people on the call are pretty familiar with this, but let’s review it just in case.

ClickHouse is what we call a real-time analytics database. It’s designed to read large quantities of data very quickly, to ingest them quickly, and to answer queries across data sets that extend into trillions of rows.

The architecture ClickHouse is based on is called shared-nothing. This is a traditional architecture, particularly for analytic systems, and it’s been around for decades. The way it works is you have ClickHouse servers with attached storage. If you’re running on Amazon, that would be EBS. It could be attached NVMe SSD if you’re running on your own rack machines. Those servers are then connected across a network.

When we make tables, we can have them be replicated so that updates sent to any of the tables will automatically replicate to others. The way that works is through something called ClickHouse Keeper, which used to be ZooKeeper. It’s now shifting to Keeper, a rewrite of Apache ZooKeeper in C++, that maintains a list of who has which parts so that ClickHouse knows how to move the data around so that every node has its own copy.

We can use block storage, and over time ClickHouse has become very good at reading data from object storage, particularly S3, but also the native APIs of Azure and Google. That’s something we’ve covered in other webinars. So this is the basic architecture. It has grown over the years and become extremely popular.

What do people build with it? The canonical system is what we call a real-time event pipeline. Imagine you have a security management system reading things like DNS requests. Those could be collected from individual machines and fed into an event stream like Kafka to buffer them and separate producers from consumers. Kafka would deliver them to ClickHouse, where they’d be stored in MergeTree tables. MergeTree is the workhorse engine for large-scale data. The data, which can arrive at millions of rows per second, is then used to feed dashboards, APIs, and alerts. These are standard things people build on the application side, but the core storage engine is ClickHouse itself.

[07:29] – Why ClickHouse Is Fast: Columnar Storage and Compression

Robert: This has become very popular, so it’s worth looking at a small example. In this talk, I’ll mostly use examples drawn from a very popular data set called On Time. This sample data set has a row for every commercial flight in the United States over the last 30 or 40 years. We use a subset, but it has about 200 million rows. The sample query asks which carriers had the worst cancellation rates. There’s a carrier called G4 that had 16% of their flights canceled in 2020, and you wouldn’t want to fly them.

Notice the performance numbers. This scanned just under 200 million rows and came back in less than a second. You might think that isn’t super fast, but we did everything possible to make it slow. We used an m7g xlarge Graviton instance with four vCPUs, 14 gigs of allocated memory, very slow EBS storage set to 125 megabytes per second, and a special setting that forced cold reads using direct IO to skip the buffer cache. And we still got results under a second. If we tune those settings up with bigger instances and data resident in RAM, it becomes much faster.

How does it do this? The secret is columnar storage that is compressed and accessed in parallel. To be totally clear about what’s going on: a database like PostgreSQL or MySQL stores table data in rows. This particular table has 109 columns. If you touch any value in any column for a row, you read the entire row. And another side effect is that rows tend to be minimally compressed or not at all.

ClickHouse stores data in columns. Each of the 109 columns is stored in two files: one with offsets and one with actual data. So if you have a query that touches only three columns, like the one we just showed, you only read the files for those three columns. This is very efficient for wide table scans.

Here’s the math on why this is so good. In a row database like MySQL, that query would touch every row and read 61 gigabytes of data, single-threaded. In ClickHouse, because of columnar storage, we’re going to read only three columns, cutting the data by 98.5%. We’re reading about a gigabyte when expanded in memory, but the actual data read from disk is only 17 megabytes because the data compresses extremely well. That’s a 3,000 to 4,000x reduction in data, which is what makes full table scans efficient. Finally, ClickHouse parallelizes this across four threads in our example, each reading about four megabytes. This is why ClickHouse is fast.

[12:43] – Problem 1: Bad Table Design and Over-Partitioning

Robert: With that, it might seem like performance is a given. Well, it’s not quite that simple. Let’s dive into what can actually go wrong.

The first problem people tend to run into is table design. This is the first place people land, and if you haven’t used ClickHouse before, it’s really important to understand how it works.

ClickHouse, like many analytic databases, breaks tables into pieces called parts. When you have a ClickHouse table definition, it includes two really important clauses. The PARTITION BY clause says: take the table and break it into sections divided by year. The ORDER BY clause says: within those sections, order the parts first by carrier, then by flight date. If you go onto the host and look on disk, you’ll see a bunch of parts with names like “2017” followed by numbers that show how many times those parts have been merged.

ClickHouse asks you to make this decision for it. It doesn’t do it automatically. As a rule of thumb, you want to get about a thousand parts or so. You can have more or less and it will work fine, but if you have a lot more, it won’t work well.

The second thing is the sparse index. It’s sometimes called the primary key index. It’s designed to fit in memory, which means it doesn’t store an entry for every single row, but by default for about every 8,000 rows. Those entries point to offsets in the different files that represent the different columns. So that’s how you lay out storage. You have to make this decision yourself, and that gives you the opportunity to make mistakes.

Let’s make a change. We’re going to partition by carrier and also break parts up by month. On the face of it, this seems reasonable. If you’ve worked in multi-tenant systems, it’s very common in MySQL to put each tenant in their own database. So why not put each tenant in their own parts? If we get a GDPR request, we can just drop all the parts for that tenant, and they’re out of there. But this actually has some bad effects on performance.

So we create the table with this partitioning scheme, insert the data, and when we run that same query again, it becomes vastly slower. It was less than a second before. With this design, that query goes to five seconds, six times slower, simply because this choice increased the number of parts, and each part has files for 109 columns. ClickHouse now needs to open a vast number of files to do the scan.

The reason this happens is that airline flights don’t distribute evenly across carriers. Southwest does the most flights of any US airline, accounting for about one-seventh of all commercial flights. There’s enormous variation between carriers in terms of flight volume. That means at the tail end of the distribution, you get a huge number of very small parts. Small parts are slower to open and read.

The partition-by-year approach works well because about the same number of flights happen every year: roughly five million. So you get partitions with even sizes, and they tend to be large.

When it comes to how to pick an ORDER BY, PRIMARY KEY, and PARTITION BY for the MergeTree family, the cheat sheet is: partition by time, picking some reasonable period. If you’re holding data for 14 days, partition by day. If you have 30 years of data, partition by year. The second thing, which you have to fight against if you’re used to other databases, is not to split the data for tenants. Sort it instead. What that does is arrange your tenants into contiguous sections of data, so queries are really fast because it’s all located in the same part of storage. The trade-off is that if you do need to remove a tenant, you’ll have to update the parts. But in systems like observability, that’s a small price to pay.

Other things you can do for schema performance include using proper data types, using what are called codecs, which morph data before compression. And then if that’s still not enough, you can switch from the default LZ4 compression to ZSTD. We only recommend this when you’ve got idle CPU capacity but your IO is taking up too much time.

[21:31] – Background Merges, Measuring Compression, and Codecs

Mikhail: For large systems, you definitely want a good partitioning scheme that results in nice large partitions. There’s another thing that can happen if you don’t get the partition by right: it can cause excessive merging as new data arrives. If the parts are small, ClickHouse will say it wants to make them bigger, and as new rows arrive it will keep going back to those small partitions and try to grow them, leading to more merge activity.

Robert: Measuring compression is straightforward thanks to ClickHouse’s system tables, which are the best of any database I’ve ever worked with. They allow you to measure compression levels at the part level, table level, or column level. A typical query would check the compression for the columns you’re reading in a query, and the formatReadableSize function makes the output human-readable. You’ll soon love these tables. They’re great.

[23:26] – Problem 2: Too Many Small Inserts

Robert: The next problem is too many inserts. This is a very common problem in analytic databases, because if you learned SQL on relational databases, you’re used to inserts that look like INSERT INTO some_table VALUES … with a row or two at a time. If you’re doing this in ClickHouse, you’re doing it wrong.

Small inserts beat the daylights out of your server because ClickHouse wants to coalesce them into larger files. This process is called merging. When you do an insert, ClickHouse creates a new part. One virtue of this is that the inserted data is instantly queryable in columnar format. The problem is that the files are so small that performance suffers.

In the background, ClickHouse will run merges. It looks for small contiguous parts, grabs them, and rewrites them to make bigger parts. Over time it ends up with the largest parts it can make, up to a point at which it stops merging. This means ClickHouse will eventually reach a part size that makes queries efficient, but in the meantime, you pay in two ways. First, it takes CPU time to grab, read, sort, and rewrite these parts. Second, it consumes IO bandwidth. We call this write amplification. When data is inserted, it’s going to be rewritten again and again until the part becomes large enough that ClickHouse no longer touches it.

This can swamp your system because it takes up CPU and IO bandwidth needed for queries. There are a couple of ways to get around this.

One is simply to batch the data yourself. Put data in big CSV or Parquet files and do it a million rows at a time. ClickHouse does not care how big these are. You just push them up via the HTTP interface.

In some cases, that’s not convenient, for example, if data is coming from Fluent Bit. By default, Fluent Bit only writes 2 MB chunks. For situations like this, we can use async inserts. When you enable them, you can have a bunch of clients simultaneously inserting into ClickHouse. ClickHouse will put those rows into a single buffer, and then, after a reasonable period, say half a second, or after hitting a certain size, it will flush the whole buffer to storage and create one larger part. This is a great feature. When it was first introduced, it was somewhat unstable, but it’s now a standard recommendation for dealing with small inserts.

To set it up, you enable async inserts, configure the wait_for_async_insert setting, and attach those settings to a profile assigned to the user doing the inserts. When those inserts arrive, they’ll automatically get buffered and merged.

[28:39] – Async Inserts and Kafka-Based Ingestion

Robert: Is it possible to enable async inserts for Kafka-based ingestion?

Mikhail: Actually, for async inserts, if you use the Kafka engine inside ClickHouse, it has the same buffering functionality built into the engine itself. So you don’t need async inserts, and they won’t work if you try to enable them for the Kafka engine. Kafka itself works in a very similar way, so you don’t need async inserts for the Kafka table engine.

But if you’re doing inserts from an external component, for example, Kafka Connect, a Golang application, or something similar, then from ClickHouse’s perspective, it’s a regular insert, and you can use async inserts in that case.

Robert: That’s exactly the answer I would have given. It just sounds more authoritative coming from you, Mikhail, since you actually wrote the code. There’s kind of a balance with Kafka. A lot of people use it because it simplifies systems in one sense: it handles the small insert problem and uses connections more efficiently. You don’t have thousands of clients consuming TCP connections. The flip side is that you have to manage Kafka. You can decide which approach fits better.

[31:25] – Problem 3: Bad Queries

Robert: Let’s talk about bad queries. Good queries are all the same, but bad queries are all different in their own way.

Very small differences in what you ask for, how you formulate a query, or what language you use can make huge differences in response time. Here’s a simple example. We have a query asking for the average departure delay broken out by carrier. That’s very simple. We could ask for a more detailed form of it, where we break the data out not just by carrier but also by flight date, and also collect the tail numbers associated with those delays.

As you might expect, it costs more. This second query is 4.25 times slower. More interestingly, it uses vastly more memory, over 20,000 times as much. The reason is that averages are very compact to store in memory. But when we split them out and request an exact, unique count, we’re storing a lot of information in memory.

One option is simply to ask for less information. Fewer group-by keys, that kind of thing. But there might be reasons why you need that information. Let’s dig into this.

What happens when we process a query? The query gets parsed and turned into an abstract syntax tree, a plan is made, and then we do a scan. In our example server this means four threads marching through the data, buffering what they’re scanning in hash tables in memory. At the end, we take all those hash tables, merge them together, sort the result, and hand it back to the client.

The key thing is these hash tables. What dominates their size is how much aggregation we’re doing. The group-by key identifies a slot in the hash table that accumulates data for that particular key. The more group-by keys, the bigger the hash tables. The merge step collapses them, then we sort and return.

Knowing that hash tables can get big, how can we make them smaller? One simple way is to choose different aggregates. A great example: picking unique values over time, like the unique aircraft associated with flights on a particular date. If you use uniqExact, it builds a hash table and stores each individual value. We can change that to use uniq, which gives an approximate count using a data structure called HyperLogLog. That has a fixed size. When we switch to this, we get almost exactly the same answer, but it’s 15% faster and, more importantly, uses five times less RAM.

Other things to look at: joins. Contrary to ClickHouse’s reputation, it does joins pretty well, but there are still places where you can get big performance advantages if you look closely.

Here’s an example: we’re joining to get the full airport name instead of just the IATA code. In the original query, we were joining directly. In the optimized version, we break the query out so the scan and aggregation happen in a subquery first. That subquery doesn’t join anything. We let it return with aggregation done, which gives us a much smaller data set, and then join that smaller data set against the airport name table. As a result, the query runs much faster and uses less RAM.

This is an interesting case because, as Mikhail and I discussed before, ClickHouse does have an increasingly capable query analyzer, but it doesn’t make this optimization automatically. What was happening was that each of those scan threads was dragging along the airport names with the additional information it was collecting. That was causing both high memory use and extra processing time.

Understanding what ClickHouse is doing internally is really important for knowing what optimizations you can make. That’s a very common theme: knowing what ClickHouse is doing inside gives you insight into how to make it run efficiently.

How do you get these performance stats? You can read them off the ClickHouse client, but you can get even more information out of system.query_log. If you’re optimizing queries, live in that table. It provides great information, and it’s all available through SQL.

One more thing on query optimization: test on real data. We’ve been giving examples going bottom up, but what you really want to do is put realistic data out there and run queries against it. Take the queries you don’t like and start optimizing them. Only apply these tricks if the query is actually slow. If it’s fast enough for your users and doesn’t take too much machine capacity, leave it alone. ClickHouse gets a lot of things right all by itself. But in the cases where it doesn’t, you need to dig in and use these optimizations. It’s very rare that you can’t make a ClickHouse query faster, but it requires testing against real data.

[41:05] – Problem 4: Insufficient Resources

Robert: Problem number four: simply not having enough resources available to ClickHouse.

Let’s say we had that bad partition scheme and couldn’t fix it, after hours of optimization. Well, at that point, we may have no choice but to throw hardware at it. We were running on a 4-vCPU machine with very slow EBS access at 125 megabytes per second. I can just upgrade it. If I’m running Kubernetes, using the Altinity Kubernetes Operator for ClickHouse, it’s very simple to upgrade to a 32-vCPU server and also increase the IO bandwidth. By making that change and rerunning the query, I get much faster cold reads, and the hot reads are faster than the properly optimized schema on the smaller box.

So resources are really helpful, and sometimes when you have a problem, the right answer is just to add more resources to buy time while you figure out the underlying issue.

There’s one thing that’s really important to understand about resource management with ClickHouse: it’s not just queries. ClickHouse servers do four things simultaneously. Every day, they’re handling queries, inserting data, performing merges, and handling updates. Processing inserts can be expensive, especially if you’re parsing JSON. Merges are computationally intensive and burn storage bandwidth. Updates are also real: in practice, you may need to remove tenants for GDPR, or handle late-arriving data that affects measurements.

All of these put pressure on CPU, RAM, and IO bandwidth. And there’s another consideration: a lot of inserts or merges put pressure on ClickHouse Keeper. If Keeper is slow to answer, inserts slow down. If it gets overloaded enough, your tables go read-only. You don’t want that.

So when thinking about resource management, it’s really critical to do at least four things. One: When you’re deploying a system, test on real workloads. This is the number one thing you can do. If you’re planning a 100-server system, figure out your total workload per server and test on a single server, beating the daylights out of it. Optimize queries, get to the best performance you can, then start adding replicas and shards. Use real data, the real data size, and make sure the test is concurrent. Queries that run fine with one test user behave very differently when 400 analysts are sending the same query simultaneously.

Two: detect trouble. Build monitoring in. Monitor CPU, IO wait, RAM usage, buffer cache exhaustion, and network. Three: scale quickly when trouble hits. Ideally, you see it coming before it happens. If you do have performance problems on Sunday, just scale up, go home, and deal with it on Monday. Kubernetes is very helpful here. You can boost the VM size and buy time to fix the app. Four: loop back and fix it. When you see resource problems, figure out what’s wrong and address the root cause.

[47:24] – Monitoring ClickHouse Keeper

Mikhail: To monitor Keeper pressure, start with the usual metrics: CPU, disk IO, and similar standard metrics. One of the most important Keeper-specific metrics is outstanding requests, meaning how many requests are in the queue. There is also a metric showing fsync time, which shows the round-trip to disk. Keeper relies on fsync heavily, so disk latency needs to be really low.

Robert: It’s really important to keep Keeper from getting overloaded, because when it does, it’s not just that things slow down. If Keeper crashes, your tables go read-only and your inserts stop. We know from experience it’s tempting to under-resource Keeper because when systems are lightly loaded, Keeper barely uses any resources. The problem is that when it does become loaded, it’s difficult to recover because scaling those systems up under load can cause additional problems. My personal feeling is: get good metrics, look at what Mikhail’s talking about, add a bigger machine than you think you need, and then forget about it.

[49:39] – Problem 5: Migrating from Non-Relational Databases

Robert: The last problem is one we’ve had to grapple with the most over time: migration from non-compatible databases to ClickHouse.

What’s a compatible database? PostgreSQL and MySQL are compatible. A query like selecting average departure delay runs on MySQL or PostgreSQL, and when you migrate the data to ClickHouse the same query runs. The only difference is that it typically runs anywhere from a thousand to two thousand times faster. The reason is these are both tabular systems. Yes, PostgreSQL and MySQL store data in rows, but they have the same data types and similar SQL dialects. These migrations tend to be fairly straightforward: dump the data out of MySQL and into ClickHouse.

Where it gets harder is migrating from other systems. A very popular one is Splunk. Another is Elasticsearch. These systems are challenging because they organize data very differently.

Splunk stores incoming log messages as events in a full-text index. Unlike a regular database index like a primary key, full-text indexes, also called inverted indexes, store values in the index rather than keys. This lets you ask questions where you scan over the data and ask for all records that have any combination of values, and you can get the answer back very quickly.

When you get to ClickHouse, this is a problem because ClickHouse is not organized this way. ClickHouse stores data in a table and uses secondary indexes, which we call skip indexes. These are used to determine where not to look when answering queries. When migrating from Splunk, you need to think carefully about how to achieve that same kind of performance.

First, indexes are per column. You’ll need to identify which columns people are querying. You might add a full-text index, though ClickHouse’s full-text index has been rewritten a couple of times, and the current version is in beta, and it can be potentially very large. ClickHouse does have good Bloom filter indexes. Mikhail did a great blog article on those a couple of years back. They’re somewhat fussy and take some playing around with to tune properly, but they’re another solid option. Finally, you can use the LIKE operator, which performs very quickly in ClickHouse, using SIMD instructions on Intel hardware.

For any of these options, you need to understand that people coming from Splunk or Elasticsearch are accustomed to a certain search experience. You may be changing their lives if you don’t let them do it the same way. So you have to support those patterns. Second, you need to test on real data to make sure your choice of indexes and queries solves the problem. This is where it gets genuinely hard.

The point is that when you’re moving from a non-relational database, you’re moving to a different data organization with different expectations. Many people have done this successfully, and we’ve helped many customers through it, but it doesn’t happen by magic. You have to test queries under realistic load, understand the search patterns people expect, figure out how to replicate them in ClickHouse, and build in time for it.

This applies not just to Splunk and Elasticsearch, but even to databases like Snowflake, where a lot of the optimization around partitioning, codecs, compression, and sort orders is done for you automatically. When you come to ClickHouse, you need to make those choices yourself.

One thing I love about ClickHouse’s system log messages: if you’re using indexes, you can see in the logs when a bloom filter index is helping you skip granules. It might tell you it’s wiping out half your granules. That may not be great; you’d want it to skip even more. But it’s information you can use to guide further tuning.

[57:09] – Summary and Key Takeaways

Robert: So that’s it. We’ve hit the end. These performance problems are very common when people come into ClickHouse or when their systems grow. Here’s what you can do to avoid each one.

Bad schema: Tune your schema to reduce IO. Put in realistic amounts of data, and you’ll quickly see what’s happening. ClickHouse’s system tables are very powerful tools for optimizing the schema.

Inserts: Always make inserts as big as possible, but not bigger. If you need data in 30 seconds, put it in every 5 seconds to give the rest of your pipeline time to move the data. Use a reasonable value consistent with your SLAs.

Queries: Test on real data and fix the slow ones. This is where we keep saying it: put a real workload on it. Use a single host if you’re building a large system, but beat that host up. Test for concurrency and realistic data size.

Hardware: Test the hardware on realistic workloads and be able to increase capacity before you hit problems. That means not just having the ability to scale up, but having metrics that tell you you’re running into trouble before things fail.

Migrations from non-relational databases: Plan time into the schedule. Think about what people liked about the system they’re leaving. There’s a reason people pay a lot of money for Splunk: it’s a pretty good database. When you move to ClickHouse, take into consideration what those databases do well. Don’t just drop it into ClickHouse and expect to start saving money without doing any work.

And finally, don’t assume ClickHouse will be fast. When it’s been properly tested and tuned, it’s really great. It’s the best and fastest database I’ve ever worked with, but you need to test it, prove it for yourself, and make the changes and the investment to get the speed.

[59:52] – Q&A: Joins, Spill-to-Disk, Dictionaries, Materialized Views, and Sharding

Mikhail: Starting with the question about joins and group-bys: there’s a newer option to spill join data and hash table data to disk. The question is whether we should still prefer the old recommendation of fitting the right-hand table into RAM.

The answer is yes. If you can do that, it’s better. Every kind of spilling to disk, every kind of swapping, is always worse for performance. You do it not to make things fancier, but because you have no other choice. If your join doesn’t fit in RAM, you can use the spill-to-disk option. But if you can run in the old-school way with the right-hand table in RAM, that’s faster. Spill to disk means joins will be slower, but they’ll eventually finish without dying due to out-of-memory errors.

Robert: One thing we didn’t talk about is dictionaries. These were developed very early in ClickHouse and they’re still really powerful because they allow sharing the right-hand side of a join across multiple queries, keeping it ready all the time in server memory.

Mikhail: Exactly. For slowly changing dimensions in a star schema, things like resolving client ID to client name or rules-type lookups, dictionaries are the best possible thing you can do. It can be much better than joins. Of course, if you’re using a tool that writes joins for you, the joins should also work. But remember that there are more options in ClickHouse.

On the question about materialized views: using one raw table and ten to twenty target tables is usually not the greatest practice. Every single insert into the original table will trigger a sequence of inserts to all 20 tables. It’s like an after-insert trigger. Every time you insert to the main table, it will trigger 20 more inserts, impacting insert performance significantly. You’ll need to deal with merges inside 20 tables instead of one.

The typical way to avoid this is either to push something inside the main table itself by creating a projection, or if you need several tiers of data, like bronze, silver, and gold, make the movements between those tiers not in real time. Don’t trigger them by every single insert; trigger them every 10 seconds or every minute. This way, you won’t impact insert speed. You can still fill your bronze layer very quickly, and every 10 seconds or every minute you trigger a flush from the freshest data in the bronze layer to the silver tier.

On sharding: it’s a very broad topic, but one of the main decision points is whether you have several tables with some normalization. Sharding can help you make joins local, which is very important. If you can’t do joins locally, the nodes need to exchange a significant amount of data, which impacts query speed. Network speed is not something you can easily scale; it will never come close to the speed of RAM. If nodes need to exchange significant amounts of data, the query won’t scale linearly. The amount of data exchanged is heavily impacted by the sharding schema.

Robert: One thing we’ve run into is that when you have shards, you’re going to use a distributed table to query across them. You need to be careful about what happens at the initiator node, which is the node where the query hits first. If you write complex queries, some processing can end up forced onto the initiator node and become a huge bottleneck. I’ve seen cases where the scans on shards took only a few seconds in aggregate across dozens of threads, but then it took a hundred seconds to process the data through the distributed table engine because some operations ended up serialized. This can happen with complex queries that have a lot of CTEs. Keep it simple. That’s usually the way to solve that problem.

One key thing about adding shards in real time: if you partition by time, you can add shards very simply by adding more hardware and letting new data redistribute. In the best case, you just add more hardware and let the shards fill up. If you’re keeping data for 14 days, after 14 days your new shards will be equally balanced with the old ones.

I think we’ll call it a day. Mikhail, thank you so much. You helped this go from a subpar talk to definitely above average. I hope you’ve all enjoyed this. If you have more questions, we’re reachable at altinity.com. You can come to our Slack channel or contact us on LinkedIn. We love databases and we love to help. Good luck with your ClickHouse applications, and thank you very much for joining us today.

Mikhail: Thank you, guys, too. Bye-bye.

FAQ Section

Q: What is the most common ClickHouse performance mistake for new users?

A: The most common mistake is bad table design, specifically over-partitioning by a high-cardinality column like a tenant ID or carrier instead of by time. This creates a very large number of small parts, each with files for every column, which dramatically increases the IO cost of every query. The fix is to partition by time and sort by the tenant or high-cardinality column within the partition.

Q: What are async inserts, and when should I use them?

A: Async inserts are a ClickHouse feature that buffers multiple small inserts from many clients into a single in-memory buffer and flushes them to storage as one larger part. They are the recommended solution when your architecture doesn’t allow client-side batching, for example, when using agents like Fluent Bit that write small chunks by default. Enable them by setting async_insert = 1 and wait_for_async_insert = 1 on the inserting user’s profile. Note that they are not needed, and won’t work, when using the Kafka table engine, which has equivalent buffering built in.

Q: How can I reduce memory usage in aggregation queries?

A: The most effective approach is to choose approximate aggregates when exact precision isn’t required. Replacing uniqExact with uniq (which uses HyperLogLog internally) can reduce RAM usage by five times or more while returning results that are very close to exact. Additionally, restructuring joins so that aggregation happens in a subquery first reduces the amount of data each thread must carry through the scan phase, lowering both memory use and execution time.

Q: What is ClickHouse Keeper, and why does it matter for performance?

A: ClickHouse Keeper is a C++ rewrite of Apache ZooKeeper that coordinates replication across ClickHouse nodes. It tracks which parts exist on which replicas. If Keeper becomes overloaded or slow to respond, insert performance degrades immediately. If it becomes overloaded enough to crash, tables go read-only and inserts stop entirely. Key metrics to monitor are outstanding requests, fsync latency, and standard CPU and disk IO. The practical recommendation is to provision a larger Keeper node than you think you need and monitor it proactively.

Q: What should I expect when migrating from Splunk or Elasticsearch to ClickHouse?

A: These are non-trivial migrations because Splunk and Elasticsearch are built around full-text (inverted) indexes, allowing arbitrary token searches across unstructured log data. ClickHouse stores data in sorted, columnar tables and relies on the primary key sort order plus optional skip indexes like Bloom filters for search performance. The migration requires identifying which columns are searched most often, applying appropriate skip indexes, testing on real data at realistic concurrency, and building time into the project schedule. It is absolutely achievable, but it does not happen automatically.

Q: How does sharding affect ClickHouse query performance?

A: Sharding distributes data across multiple nodes using a distributed table layer. The most important consideration is choosing a shard key that keeps related data on the same node so that joins can happen locally. If the query planner must move large amounts of data between nodes to answer a join, query performance will not scale linearly with the number of shards because network bandwidth is far slower than RAM. Additionally, complex queries with many CTEs can serialize processing onto the initiator node, creating a bottleneck. Keeping distributed queries as simple as possible is generally the best approach.

© Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE^® SOLUTIONS

Get in touch with ClickHouse experts.

Related:

Leave a Reply Cancel reply