Five Things Every New ClickHouse® User Should Know (Part 2: Admin)

Recorded: August 20 @ 08:00 am PDT
Presenters: Robert Hodges & Tatiana Saltykova
This webinar is the second installment in Altinity’s “Five Things Every New ClickHouse® User Should Know” series, with Part 1 focusing on development. Robert Hodges (CEO, Altinity) and Tatiana Saltykova (ClickHouse administration instructor and Altinity production cluster engineer) cover five high-priority administration topics in approximately one hour.
The session begins with a practical overview of deployment environments: Kubernetes with the Altinity Kubernetes Operator for ClickHouse®, bare-metal or VM deployments, and managed cloud options such as Altinity.Cloud, with honest trade-offs discussed for each. The replication section explains how ClickHouse’s shared-nothing architecture uses shards for throughput and replicas for high availability, how ClickHouse Keeper (and its predecessor ZooKeeper) coordinates part tracking, and what to do when the replication queue becomes congested. Schema change covers what can and cannot be changed on MergeTree tables after creation, the difference between lightweight and heavyweight schema changes, and how mutations flow through the system. The upgrade section contrasts leading-edge and trailing-edge upgrade strategies, explains the difference between upstream LTS releases and Altinity Stable® Builds for ClickHouse®, and shows how the Kubernetes operator makes rolling upgrades a one-line change. The session closes with a tour of ClickHouse’s system tables for monitoring, performance investigation, and capacity planning.
The Q&A addresses multi-tenant partitioning strategies, Kubernetes liveness probe pitfalls, and how to recover a lost replica in both Kubernetes and bare-metal environments.
Here are the slides:
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 00:06 – Welcome, housekeeping, and speaker introductions
- 01:29 – About Altinity
- 03:35 – Topic 1: Choosing your deployment environment
- 04:20 – Kubernetes with the Altinity Operator: architecture overview
- 08:00 – How Kubernetes resource allocation works; CSI drivers, autoscalers, IAM
- 09:22 – Deploying ClickHouse with the operator: YAML, rolling upgrades
- 10:10 – Bare-metal and VM deployments; packages, config files, config.d pattern
- 13:49 – Cloud deployments: Altinity.Cloud and trade-offs
- 15:30 – Topic 2: Replication, sharding, and ClickHouse Keeper
- 21:00 – ReplicatedMergeTree table definitions and Zookeeper paths
- 22:30 – How inserts flow through replication
- 25:00 – Replication queue and what to do when it gets stuck
- 27:55 – Lesson for developers: avoid bulk table refreshes on replicated tables
- 28:26 – Topic 3: Schema changes, mutations, and what cannot be altered
- 32:31 – Topic 4: Upgrading ClickHouse; LTS vs. monthly releases; Altinity Stable Builds
- 40:38 – Rolling upgrades in Kubernetes: one-line operator change
- 44:50 – Cons of Kubernetes; liveness probe pitfalls
- 47:30 – Topic 5: System tables overview and monitoring
- 54:40 – Investigating resource-heavy queries with system.query_log
- 57:25 – Summary and resources
- 59:33 – Q&A: multi-tenant settings, recovering a lost replica
Webinar Transcript
[00:06] – Welcome and Housekeeping
Robert: Hi everybody, and welcome to our webinar on five things every new ClickHouse® user should know. We’re focusing today on administration. In the webinar last month we talked about development, so this is Part 2.
A little housekeeping before we get too far in. This is being recorded, and we will send you a link to the recording and the slides within a few hours of the end of the webinar. We have time and ability to take questions. My colleague Tatiana Saltykova is also here on this talk with me. She was responsible for many of the slides and also teaches our class on administration and runs production ClickHouse clusters. She is an authority on everything we’re going to be discussing today. Feel free to throw your questions into the Q&A box or the chat. We’ll get to as many of them as we can, and maybe go a few minutes over if we have to.
My background: I am Robert Hodges, Altinity’s CEO. I have been working with ClickHouse since the beginning of 2019, specifically January 1st. It’s my favorite database, as I think is probably the case for Tatiana as well. So let’s dive in.
[01:29] – About Altinity
Robert: Who are we? Altinity is a vendor for ClickHouse. We have two main offerings. The first, which we began with back in 2017, is enterprise support: 24×7 support for people who like to run ClickHouse themselves. That’s about half our customers. The other half run Altinity.Cloud, a cloud service that runs ClickHouse for you either in our account or in your account. We are not the same as clickhouse.com. People sometimes get confused. We are in fact competitors, although we collaborate on making the code better.
[03:35] – Topic 1: Choosing Your Deployment Environment
Robert: ClickHouse can literally run anywhere. There was a blog article a few years back about running ClickHouse on a mobile phone. I don’t think you’d want to do that for production, but wherever you choose to run it, you should understand the environment well.
It is very popular for people who build SaaS or large-scale environments to run things on Kubernetes. Let’s talk about that first. You can run on premises, but most people run in the cloud these days on a managed Kubernetes service like Amazon EKS. On top of Kubernetes, you’ll have base services such as Prometheus and Grafana for monitoring and observability, the ClickHouse operator to manage ClickHouse itself, your ClickHouse processes, and Keeper, which we’ll talk about shortly. You’ll manage the software using things like Terraform to set up the cloud environment and tools like Helm or Argo CD once you’re in Kubernetes.
If you go this route, what you’ll want to do is use the Altinity Kubernetes Operator for ClickHouse®. Before you can do that, you’ll need to bring up your Kubernetes environment. If you’re using EKS, we have Terraform blueprints that will help you do that and make it very easy. The blueprint we provide in open source will optionally set up the cluster as well as the operator. It’s kind of batteries included. Most people take these examples and tweak them to suit their own needs. It takes about 20 minutes for the cluster to come up in EKS.
[08:00] – Kubernetes Resource Allocation: What You Need to Understand
Robert: People say Kubernetes is complicated. For developers, I think that’s not really true. Applications are complicated, and the Kubernetes model reflects that complexity. Where it is complicated, and where you as an administrator really need to stretch yourself, is understanding how Kubernetes allocates resources, particularly compute and storage.
If you’re running ClickHouse yourself on Kubernetes, you definitely want to understand things like cluster autoscalers. You want to understand CSI drivers, which help you allocate storage. And one thing that’s common across all of these mechanisms, including network integration, is you have to understand the underlying permissions model. For example, EKS is going to allocate things and needs suitable IAM roles to do that. There is abundant documentation on how to do this, but you do need to know it.
[09:22] – Deploying with the Altinity Operator
Robert: Why do we use operators? They make complex systems like ClickHouse simple. You can describe a basic ClickHouse installation in 6 to 10 lines of YAML. For example, if you want to change the version of ClickHouse you’re running, you just change the image property to another container image name, and the operator will take care of upgrading or downgrading ClickHouse to use that version. There are many other things you can do: change the number of replicas, the number of shards, the amount of storage, all with very simple changes.
[10:10] – Bare-Metal and VM Deployments
Robert: Kubernetes is not the only way. Many people run ClickHouse directly on their own hosts, whether racked hardware, which is common for financial services, or VMs allocated directly in a cloud. You can deploy on ARM, Intel, AMD, or PowerPC. The images are easy to get, and the Altinity Stable® Builds for ClickHouse® are located at builds.altinity.cloud.
If you’re running this way, you’ll need to understand the process running on the host. There are standard ports and a standard directory structure from the packages: /etc/clickhouse-server for configuration, /var/lib/clickhouse for data, and /var/log/clickhouse-server for logs. I strongly recommend using the packages rather than setting up the binaries yourself, because they do it correctly and, importantly, when it comes time to upgrade they will do the right thing with configuration files.
This brings up a really important point. If you’re running ClickHouse yourself, you need to understand how configuration files work. There are two main types. config.xml controls server operation: ports, which system tables are enabled, and so on. users.xml is where you set up baked-in users for ClickHouse.
For creating accounts, use SQL commands. We call this role-based access control (RBAC). The default user, the root-equivalent user, and any users that need to come up at startup you’ll define in users.xml. It’s really important not to modify these files directly. Instead, add your changes to config.d for server configuration, or users.d for user definitions. Those changes get applied to the base files. There’s a path you can check to see the fully processed version, which is useful when you’ve put something in and it broke ClickHouse.
[13:49] – Cloud Deployments and Trade-offs
Robert: Nowadays a very large number of people run ClickHouse in one of the managed clouds: ClickHouse Cloud or Altinity.Cloud. These have trade-offs. The big trade-off to understand is that clouds in general trade convenience for cost and performance.
I’m not saying clouds are necessarily more costly when you think about total cost of ownership. In fact, they can be way cheaper because you don’t have to hire people to take care of ClickHouse. But if you are operating at scale and you have data center expertise, you can run ClickHouse more cheaply, particularly with stable workloads. You can also make it run extremely fast by buying the right hardware. ClickHouse runs great on hardware with 126 or 128 cores and 256 gigs of RAM. So these are things you’ll need to think about. There is no right answer. It’s all about what your business needs.
[15:30] – Topic 2: Replication, Sharding, and ClickHouse Keeper
Robert: Understanding replication and how to manage it is a huge topic and I’m just going to touch on it, but it’ll give you a sense of how ClickHouse works in a cluster and what you need to learn to manage it well.
ClickHouse was originally developed with a shared-nothing architecture: a bunch of ClickHouse servers connected over a network, each with storage they own. Very early in its history, even when it was still closed source, it introduced replication and sharding.
Shards take your data for a large table and break it up, storing chunks on different servers. You can think of these as disjoint collections of data. When you run a query over a sharded table, you send a subquery to each shard, collect all the results, put them together, and hand them to the user. Shards allow you to increase IOPS, insert faster, and query more effectively because you’re splitting the work.
Replicas help with concurrency for queries and with high availability. If you lose a copy of the data, no problem. With three replicas, you still have two that can serve queries.
The most common organization for large tables is sharded and replicated: shards that split the large table into pieces that fit on individual servers, with each shard having two or more replicas. Another pattern is having the same data replicated to every server in the cluster: useful for reference data that you join on, like the names of your business units. There’s also the all-sharded pattern where every server has its own copy of the data but you want to query across all copies simultaneously. A great example for this is system tables, which allows you to do a single query across all servers and get answers like what processes are currently running.
[21:00] – ReplicatedMergeTree and Keeper Paths
Robert: When you define a replicated table, you’ll use an engine called ReplicatedMergeTree. In the classical definition, you provide a path. If the path is the same across nodes, those nodes are replicas. If the paths are different, they’re shards. So you can see how the paths combine replicas and shards into clusters.
What is this path actually doing? It’s a path out in ZooKeeper or Keeper. ZooKeeper keeps track of who is in the cluster, what parts of the data each node possesses, so they can find each other and ensure parts are properly replicated.
What is a cluster? It’s just a collection of servers, defined in a configuration file by convention called remote_servers.xml. It just has XML tags declaring a shard, then the specific hosts in that shard. That’s it. You can define the patterns of shards and replicas any way you want.
[22:30] – How Inserts Flow Through Replication
Robert: How does replication actually work? When somebody inserts data into one of the tables, ZooKeeper tracks that there are new parts inserted, and the other servers listening on ZooKeeper will see these parts arrive and then go fetch them from the original server using port 9009, an internal port used for moving parts around. The ZooKeeper nodes form what’s called an ensemble. You typically run three of them on their own hosts where no one can interfere with them. It’s very important for them to have plenty of CPU cycles and fast storage available.
You’ll notice I say ZooKeeper in brackets because there’s a replacement called Keeper. We’ll talk about the difference in a moment.
ZooKeeper ensures that this operates as a sequential queue. Different operations, like adding new parts or merging parts, are enqueued and processed in order. One side effect is that if something for a particular table gets stuck in the queue, it will block everything behind it. The usual cure is to wait, but first you need to find out what’s wrong. One of the most important things to do when this happens is check system.replication_queue. The system tables are glorious in ClickHouse. You can query the replication queue to count how many items are queued or look for specific items that may be blocking progress. Most of the time, if you’re not overloaded and are adequately provisioned, you won’t see problems. But when your system gets overloaded, or when people make huge numbers of changes, the replication queue can become quite full.
One lesson to pass to developers: if you have replicated tables, constantly refreshing the entire contents of a table, the way you might in MySQL by pushing a new copy of a report table every night, can really kill replication if you do it for multiple tables at once. It just has a lot of data to replicate.
[22:35] – ZooKeeper vs. ClickHouse Keeper
Robert: A question we commonly get is: what’s the difference between ZooKeeper and Keeper, and which one should I use?
ClickHouse originally used ZooKeeper because it was available and handled consensus well. Starting about four or five years ago, the ClickHouse team wrote Keeper, which implements the ZooKeeper API but is written in C++ and is bundled into the ClickHouse server process.
In a nutshell: if you’re using ZooKeeper now and it works, keep it. It’s probably not worth upgrading. If you’re developing a new system and you have the choice, pick Keeper. Long-term, Keeper is the future. ZooKeeper now has very slow development, which means if you have a bug, it may or may not be easy to get it fixed. Additionally, there are increasingly features being added to ClickHouse, like the S3Queue engine, that depend on new features only in Keeper. So for new systems, Keeper is the right choice.
As for configuration, it goes in the XML files just like everything else. And a really key thing: the system.zookeeper table is something you’ll constantly refer to when managing Keeper or ZooKeeper, to verify you’re properly connected and to read the paths inside Keeper. It’s a super useful tool to have.
[28:26] – Topic 3: Schema Changes and Mutations
Robert: Databases don’t stay static. Let’s talk about how to change the schema.
In MergeTree and its variants like ReplicatedMergeTree and AggregatingMergeTree, these are what you use for production analytic data, there are a couple of things that cannot be changed easily once set. First is PARTITION BY, which splits the table into chunks divided by a partition key. Once you set that, you can’t change it. If you want to change it, you need to copy the data to another table. This is something in ClickHouse you want to get right the first time. The general guideline is to shoot for something that gives you around a thousand parts. For most time-oriented tables, partitioning by month gives you a good start.
The other immutable aspect is ORDER BY. The order of data within parts is critical in analytic databases. You can change ORDER BY, but only when you’re adding columns, and only by appending to the end of the ordering. ClickHouse doesn’t allow changes that would cause previously written parts to become invalid. If you need to change ORDER BY or PARTITION BY substantially, it’s going to be painful. This is one of the big pain points of ClickHouse administration and requires real effort.
Schema changes in general, though, are pretty easy. If you’re used to MySQL or PostgreSQL, where tables become unavailable when you’re building indexes or doing other operations, ClickHouse is not like that. It’s usually good about doing online schema changes. Adding a column, for example, happens almost instantly. We call these lightweight schema changes: they block so briefly that you’re unlikely to notice.
There are heavyweight schema changes, like changing data types, which not only change the metadata but require rematerializing the data. ClickHouse handles this efficiently, but if you have a big table it has to make that change across many parts. It does this by generating what’s called a mutation. ClickHouse has a queue for mutations, and they’re processed one after another. Mutations mean that operations like deleting data or changing data types will take a while to flow through if you have a large table. Normally this doesn’t create a problem unless you do it a lot and you have a crowded replication queue, since mutations also need to be replicated between servers. There is a system.mutations table you can look at to see how your mutations are doing.
[32:31] – Topic 4: Upgrading ClickHouse
Robert: Upgrade is probably the least favorite part of a DBA’s life because it’s where things can go really wrong. Let’s talk about how to do it in ClickHouse.
ClickHouse is a large open-source project. In any given year, several hundred unique GitHub contributors merge changes into the project. It’s changing fast, and that means there are actually a couple of different upgrade strategies.
Every month, a new build of ClickHouse comes out. Then there are longer-term releases, called LTS releases, which come out twice a year, typically in March and August. Those two philosophies sit at opposite ends of the spectrum.
The first is a leading-edge strategy: as soon as a new version comes out, you upgrade directly. This gives you new features and bug fixes, but the downside is you’re going to hit bugs. Every ClickHouse release involves an enormous API surface that changes frequently. Settings change, SQL behaviors change, there are regressions. If you run monthly releases, there’s a good chance you’ll hit something. You need automation that lets you revert quickly, like a gopher darting back into its hole if the food isn’t good.
The second is a trailing-edge strategy, suited for large production applications where API changes can break things and testing takes significant time. In this case, you upgrade to LTS versions, test thoroughly, and then leave it for maybe another couple of years. You might schedule a month for testing against your actual application before you commit to an upgrade.
Release support windows: LTS versions get bug fixes for a year, maintained by ClickHouse Incorporated. Monthly releases get bug fixes for three months. If you’re running a non-LTS release, you need to be prepared to upgrade before it goes off support.
Altinity does builds as well. We call them Altinity Stable® Builds for ClickHouse®. They’re based on LTS releases and designed for people who want a trailing-edge upgrade strategy: few upgrades, but really sure they’re going to work. We typically publish these three months after the LTS release appears and maintain them for three years. These builds are open source just like ClickHouse itself and free to anyone, but if you want us to fix bugs, you need a support contract. It works a bit like Red Hat, except we don’t restrict access to the repo.
[38:00] – Upgrade Commands and Planning
Robert: If you’re running on EC2 or racked hardware, here’s an example of the apt upgrade flow. You set the version and then install the three packages: client, server, and common static. Notably, apt install does not restart the ClickHouse server. That means you can install the software and then choose when to cause the upgrade to happen by restarting. We have an upgrade guide on our website, written by Dima Borovtsov, who has done many of these and is very familiar with the process.
A really key point on upgrade planning: API changes. In large applications, API changes are one of the biggest problems people run into. It’s not a problem in the sense that the new features are bad, but it is a problem in the sense that it may break your applications. If those applications are critical to your business, it’s better to know that in advance rather than plow through and have things completely break.
There’s also an approach used by very large applications that upgrade at long intervals: spin up brand new servers running the new version, connect them to the old servers using replication, let the data replicate over, test the application against it, and at some point switch over to the new version. This is very resource and labor intensive, but if you have upgrades that involve major changes to the application, it’s a good way to ensure continuity.
[40:38] – Rolling Upgrades in Kubernetes
Robert: There was a question about how you do an upgrade with the Kubernetes operator. This is really glorious. Remember that YAML I showed you with the image property? That is a tag pointing to a Docker Hub image. The way you trigger an upgrade with the Kubernetes operator is you just change that image tag. The operator will then do a rolling upgrade.
This is a big reason to use an operator as opposed to writing your own implementation with stateful sets. The operator has a built-in feature where it essentially does a canary upgrade. You submit the change. The operator says, “I see you’ve got a new version. I’m going to go to one of the servers, update the pod definition so it has the new image, pull the image. If that works, I’m going to begin going in parallel across all the shards.” In each case, as it’s upgrading the replicas, it gives them a chance to shut down, drain out of the load balancer, allow current queries to finish, and then do the upgrade. You get all of that for free just by making a one-line change to the Docker image. This is one of the places where Kubernetes orchestration excels.
Many people run ClickHouse on their own hosts. In that case, you’ll want some automation to handle rolling upgrades because doing it manually across a large cluster is not something you want to do.
[44:50] – Cons of Kubernetes; the Liveness Probe Problem
Robert: Are there cons to using Kubernetes? Yes, it’s more complicated. I’ll give you a specific example of where Kubernetes is tricky.
Kubernetes is built on a philosophy that if things get sick, you just restart them. The way it determines whether something is sick is through what’s called a liveness probe: it’s constantly checking to see if a pod is still responding. If it’s not, after a certain period, Kubernetes will restart it. That’s great for a web server, but it’s not good for a server like ClickHouse where you might have an hour for caches to populate, or a long restart process where it has to check all the parts, open them up, and pull in indexes.
If you run it on your own host, the server will just get busy and non-responsive, but as soon as it clears the backlog it’ll come back. In Kubernetes, it’ll be restarted. So you have issues like that to be aware of.
The other thing about Kubernetes is you really need to understand how it works, and specifically how it connects to your cloud infrastructure underneath. If you know that, and many people today do, you’ll be fine. If you don’t, you may find it difficult. You either need to learn it or run ClickHouse a different way.
[47:30] – Topic 5: System Tables
Robert: Let’s talk about the last topic. I love system tables. They are one of the single best features in ClickHouse. They’re the best system tables of any database I’ve ever used.
This is going to be somewhat of a lightning tour because there are so many of them. I want to convey the richness available here. First, you have tables that give you basic monitoring: system.asynchronous_metrics, system.metrics, and system.events. Events are counters, like the number of inserts or the number of queries. If you export these and feed them into Grafana, you get time-series views of things like queries per second. Asynchronous metrics are counters updated in the background. The Altinity Kubernetes Operator automatically exports these in Prometheus format so they can be scraped into Prometheus and then visualized in Grafana, which is the standard and most popular way to build monitoring dashboards. Learn more about ClickHouse monitoring best practices.
But there’s a lot more. If you’re trying to figure out why IO is going slowly, there’s a raft of system tables to look at. First, check whether your tables have too many parts. If a table has 100,000 parts, that’s going to be a problem almost no matter what. You can look in system.parts and count the parts by table. system.tables and system.columns will tell you useful things like table sizes, compression levels at the column level. If something is compressing poorly, that’s a clue to try changing the compression, which you can do online.
Once you move from static analysis to live investigation, system.processes shows you what’s running right now. And for history, two particularly useful tables that you almost always want to have enabled are the part log and the query log. The part log shows you the history of everything that happened to parts: vital for understanding how fast you’re ingesting data and how big the blocks are, because ClickHouse likes big blocks. The query log shows you all queries that have been run, what exact resources they used, and how much. It’s an incredibly helpful table. Merges are a big consumer of resources. There’s system.merges, which tells you what merges are active right now, so if your system is slow you can check which specific table is the culprit.
[54:40] – Investigating Resource-Heavy Queries
Robert: Here’s a great example query from Misha Filimonov for finding resource-heavy query patterns. The normalized_query_hash feature in the query log hashes similar queries down to the same value so you can group them, then shows a sample. The ProfileEvents column in the query log is a really key field: it contains specific resource counters, like microseconds of execution time. This example also uses clusterAllReplicas, which treats every system.query_log table across your entire cluster as a separate shard and runs the query across all shards. So this query gives you statistics across your entire cluster, not just a single server.
Here’s another example: you have a specific query ID from a distributed query being sent to multiple underlying servers for reading data out of Parquet. You want to find out what your S3 statistics were for this entire distributed query, not just the query that arrived on the initiator host, but all the subqueries that were generated from it and how much demand they put on your S3 storage.
This has been a lightning tour, and it’s gone pretty deep. But the richness is there in these system tables.
[57:25] – Summary and Resources
Robert: Let me give a quick summary. ClickHouse can run anywhere. Make the choice that fits what you want to accomplish. If you’re running a trading system, good chance you don’t want Kubernetes because you want raw performance and zero unplanned downtime. Run on racked hardware. But understand how it works wherever you run it.
Replication is the single most powerful feature inside open source ClickHouse, but you need to understand how it works. Things like mutations and excessive numbers of changes can cause replication to bog down. Knowing where to look is really important. Many replication problems are cured by time: if you can stop whatever your application was doing that was causing the overload and then wait, it will clear up. ClickHouse is very robust.
ClickHouse is great for online schema changes, but if you have to materialize things it will add load to your system. There are just a couple of things, like PARTITION BY, that you’ve got to get right the first time, otherwise you’ll have to copy the data later, and that is painful.
For upgrade: the key success factors are planning, testing, and automation. The automation is particularly important because you’re often running many servers. If you have 24 servers across eight shards and three replicas, you don’t want to be doing that by hand.
And system tables: everybody uses them. The best thing you can do is export the values into Grafana, Datadog, or whatever monitoring platform you use. You want to have access to this information. It’s an incredibly rich source of understanding of what ClickHouse is up to and, when things are wrong, what’s causing the problem.
Most of the slides I presented came from Tatiana’s administration course. We teach an eight-hour course on ClickHouse administration. It’s not an hour like this. It’s eight hours of glorious delving into ClickHouse administration.
One word of caution: if you’re using an AI overview for ClickHouse administration questions, verify what it’s telling you. ClickHouse administration is complicated and there isn’t a lot of training data. If an AI tells you, “Don’t worry about detached parts, just delete them,” that’s actually not the right answer. Verify what it’s telling you, and if you have questions, come talk to us.
[59:33] – Q&A
Robert: Let’s take a few questions.
Tatiana: There’s a question about which important server session settings should be configured for multi-tenant use specifically. There’s no single setting that fits multi-tenant specifically. By default most of the settings are good to go, and usually what happens is you start with the defaults and change them as you see how your system behaves. It depends on the specific use case. I don’t think there’s something special for multi-tenancy in particular.
Robert: That’s a great topic because I would say it’s not so much settings but partitioning and sorting. In multi-tenant systems, if you come from MySQL, you think, “I want to have a database for every tenant.” That’s the right thing to do in MySQL. In ClickHouse it’s not. It’s almost always wrong.
What you want to do in ClickHouse is partition by time, because that creates partitions that are about the same size modulo growth or decrease in traffic. But then you order by tenant. Here’s an example: a database of airline data where you can think of carriers as your tenants. We’ve partitioned by year to get even chunks of data each year, and then the carrier is the tenant. By ordering on carrier, all the data for a given carrier is in a contiguous section of storage. That’s a very basic but really key lesson that we always recommend.
Tatiana: There’s also a question about how to best retrieve a lost replica when a host got lost. I’d need to know: is the storage also lost, or just the host?
If the data is completely lost, the simplest thing is to create a new replica. There are SYSTEM DROP REPLICA commands you can use to drop the metadata from Keeper so it doesn’t get stuck there, and then just create a new replica.
Robert: In Kubernetes, if you remove the PVC and then delete the pod, what will happen is the pod will come back up and automatically repopulate the data. Because the operator sets up replication for you, it’ll just happen automatically.
Tatiana: If the replicas were offline for long enough, they may get detached from their metadata in Keeper. In that case, you can use SYSTEM RESTORE REPLICA, which reads the data back off disk and gets the replica properly synced with Keeper again. So yes, if your storage is intact, you can restore from it. That’s a feature that Altinity contributed, and we use it all the time.
Robert: Great. I think we’re over time. Thank you so much everybody. Thanks to Tatiana and Josh. Please do come visit us at altinity.com or in our Slack channel. Good luck running ClickHouse. It is a wonderful database and you can build great applications. Tell your friends about it and let us know how it worked out. Thank you very much and have a great day.
FAQ Section
Q: Should I run ClickHouse on Kubernetes or on bare metal?
A: There is no universally correct answer. Kubernetes is an excellent choice for teams that need to manage multiple clusters, want rolling zero-downtime upgrades, or are already invested in the Kubernetes ecosystem. The Altinity Kubernetes Operator for ClickHouse® makes deploying and upgrading ClickHouse clusters as simple as changing a YAML image tag. However, Kubernetes adds complexity in the form of liveness probe behavior that can restart ClickHouse during a cold cache warm-up or part-loading sequence, and it requires understanding CSI drivers, autoscalers, and IAM roles. For workloads that require maximum raw performance and zero unplanned downtime, such as trading systems, racked bare-metal hardware may be the better choice.
Q: What is the difference between ZooKeeper and ClickHouse Keeper, and which one should I use?
A: Both serve the same purpose: maintaining a shared metadata tree that coordinates replication across ClickHouse nodes. ZooKeeper is the original implementation, written in Java. ClickHouse Keeper is a C++ rewrite that implements the same API and is bundled into the ClickHouse server binary. If you are already running ZooKeeper and it works, keeping it is reasonable. For new deployments, use Keeper. It has active development, it does not require a separate Java runtime, and some newer ClickHouse features depend exclusively on capabilities only available in Keeper.
Q: What are the safest upgrade paths for a large production ClickHouse deployment?
A: For large production systems where API changes can break applications and testing is expensive, a trailing-edge upgrade strategy is recommended. This means upgrading to LTS releases rather than monthly builds, testing the new version thoroughly against your actual application and SQL queries before committing to the upgrade, and ideally scheduling a testing window of several weeks. Altinity Stable® Builds for ClickHouse® are designed for this scenario: they are based on LTS releases, published after additional testing, and maintained for three years. For teams using the Altinity Kubernetes Operator for ClickHouse®, the rolling upgrade mechanism handles the cutover safely by draining replicas before restarting them, so the upgrade itself carries minimal operational risk once testing is complete.
Q: Why can’t I change PARTITION BY or ORDER BY after creating a MergeTree table?
A: The PARTITION BY expression determines how parts are physically divided on disk, and the ORDER BY expression determines the sort order of data within those parts. Changing either of these would require rewriting every part in the table, which ClickHouse does not do in place. If you need to change one of these properties after the fact, you must copy the data to a new table with the desired definition. This is one of the most operationally expensive things you can do on a large ClickHouse table, which is why getting the initial schema right is strongly emphasized. The general guideline is to partition by time and sort by the columns you most frequently filter on, with tenant or high-cardinality identifiers ordered first within partitions.
Q: What are the most important system tables for a ClickHouse administrator to know?
A: The most critical tables for day-to-day administration are system.query_log for historical query performance and resource usage; system.replication_queue for monitoring replication health and diagnosing stuck operations; system.merges for understanding current merge activity; system.mutations for tracking the progress of schema change or delete operations; system.parts for auditing part counts per table; and system.processes for seeing what is running right now. The system.metrics, system.events, and system.asynchronous_metrics tables are best exported into a monitoring system like Prometheus and Grafana for time-series visibility. Enabling part_log and query_log persistently is strongly recommended.
Q: How do you handle multi-tenant workloads in ClickHouse without creating a separate database per tenant?
A: Unlike MySQL or PostgreSQL, where putting each tenant in its own database or table is common practice, ClickHouse performs best when all tenant data is in a shared table. The correct pattern is to partition by time, which produces evenly sized partitions, and to sort the data by the tenant identifier as the leading column in the ORDER BY clause. This places each tenant’s data in a contiguous section of storage within each partition, making tenant-scoped queries highly efficient. Dropping a tenant’s data requires using ALTER TABLE … IN PARTITION or similar operations rather than dropping a table, but the performance and storage efficiency trade-off is almost always worth it.
© Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.