---
url: 'https://altinity.com/webinarspage/cloud-native-clickhouse-at-scale-using-the-altinity-kubernetes-operator-for-clickhouse'
title: 'Cloud Native ClickHouse® at Scale: Using the Altinity Kubernetes Operator for ClickHouse'
author:
  name: Cristina Munteanu
  url: 'https://altinity.com/author/cmunteanu/'
date: '2023-03-10T13:51:30-08:00'
modified: '2026-06-03T07:55:09-07:00'
type: post
summary: 'Deploy and manage ClickHouse on Kubernetes: cluster creation, shard scaling, rolling upgrades, and production best practices with the Altinity Operator.'
categories:
  - Webinars
tags:
  - Altinity
  - ClickHouse
  - Kubernetes
  - Open Source
  - webinar
image: 'https://altinity.com/wp-content/uploads/2023/03/Webinar-Mar-7.webp'
published: true
---

# Cloud Native ClickHouse® at Scale: Using the Altinity Kubernetes Operator for ClickHouse

**Recorded:** Tuesday, March 7, 2023  
**Presenters:** Robert Hodges

Altinity CEO Robert Hodges walks through the full lifecycle of running ClickHouse on Kubernetes using the Altinity Kubernetes Operator for ClickHouse, the open-source project that Altinity started in 2018 and uses as the foundation for Altinity.Cloud. The session opens by explaining why ClickHouse clusters are too complex to manage with raw Kubernetes resources alone. The operator pattern solves this with a single ClickHouseInstallation custom resource that drives reconciliation of the underlying StatefulSets, pods, PersistentVolumeClaims, services, and ZooKeeper configuration.

A live demo on Minikube walks through the complete workflow: installing the operator in one command, provisioning a dev ZooKeeper node, applying a 30-to-40-line YAML cluster definition, watching pods come up in roughly 20 seconds, adding a second shard by changing a single parameter, and triggering a rolling version upgrade by updating the container image tag. Robert then covers production-grade setup on managed Kubernetes services such as EKS and GKE, including Karpenter autoscaling, CSI driver configuration, pod placement across availability zones using pod templates and node selectors, and vertical scaling by editing the CRD to change instance type and storage size. The session closes with three essential safety rules and a survey of production essentials: external networking via service templates, backup via clickhouse-backup as a sidecar, the operator security hardening guide, and the monitoring and alerting requirements that make Kubernetes clusters operable at scale.

**Here are the slides:**

[Cloud Native ClickHouse at Scale: Using the Altinity Kubernetes Operator for ClickHouse (PDF)](https://altinity.com/wp-content/uploads/2024/05/Cloud-Native-ClickHouse-at-Scale-Using-the-Altinity-Kubernetes-Operator-for-ClickHouse.pdf)

## **Key Moments (Timestamps)**

Key moments generated with AI assistance.

- 0:06 – Introduction and housekeeping: Robert Hodges, backed by Altinity engineering

- 1:47 – Speaker introduction: Robert Hodges, CEO of Altinity

- 2:58 – ClickHouse overview: SQL analytic database, open source, columnar, parallel

- 3:51 – What Kubernetes is: orchestration, container-based apps, resource model

- 4:55 – Simple example: running ClickHouse in a Docker image on a VM

- 5:19 – Kubernetes resource model: StatefulSet, Pod, PVC, PersistentVolume

- 6:35 – The Kubernetes control loop: desired state vs. actual state, reconciliation

- 8:00 – A realistic ClickHouse cluster: 2 shards, 2 replicas, availability zones, ZooKeeper

- 9:07 – Why raw resource definitions are unmanageable: dozens to hundreds of YAML files

- 11:16 – The operator pattern: custom resource definitions and the controller

- 13:22 – The Altinity Kubernetes Operator: history (2018), Apache 2.0, how it works

- 14:57 – Installing the operator: single kubectl apply command

- 16:00 – Installing dev ZooKeeper: single yaml apply, one node for development

- 17:06 – ClickHouseInstallation CRD: shards, replicas, pod templates, volume claim templates

- 17:52 – reclaimPolicy: Retain: protecting storage from accidental deletion

- 20:07 – Live demo begins: Minikube cluster, kubectl watch, operator log view

- 21:06 – Applying dev.yaml: cluster definition loaded into Kubernetes

- 21:46 – Cluster comes up in ~20 seconds: StatefulSet, pods, services

- 23:10 – Adding a second shard: changing shards count to 2, applying dev01.yaml

- 25:28 – Adding a new shard: schema propagation and operator behavior

- 25:47 – Rolling version upgrade: changing the container image tag, applying dev02.yaml

- 28:02 – Operator upgrade behavior: one pod at a time, query draining

- 30:14 – Q&A: automatic resharding (work in progress)

- 30:49 – Q&A: schema propagation to new replicas (automatic)

- 31:38 – Q&A: parallelization and unnecessary restarts (operator improvements underway)

- 31:44 – Adding users in the CRD: RBAC access management example

- 32:25 – Transition to production setup

- 33:30 – Choosing a Kubernetes platform: managed services (EKS, GKE, AKS), Rancher, OpenShift

- 35:07 – Three steps for production: provision cluster, autoscaling, block storage

- 36:41 – Karpenter autoscaling: VM allocation and deallocation based on pod requests

- 37:30 – Block storage: EBS/GKE block storage, advantages over local SSD, CSI drivers

- 38:38 – Placing pods across availability zones: per-replica pod templates with nodeSelector

- 40:27 – Specifying VM type in pod templates: M5 large, resource requests

- 42:00 – Why resource requests matter: OOM killer, preventing co-scheduling conflicts

- 43:07 – Vertical scaling: changing VM type and storage size in the CRD

- 44:00 – Storage scaling notes: can expand but not shrink, automatic restart caveat

- 45:02 – Storage class best practices: volumeExpansion, default storage class flags

- 46:08 – Q&A: how the operator adds replicas (StatefulSet, schema copy, replication)

- 47:43 – Safety tips: never delete the operator while clusters are running

- 48:59 – Safety tip 2: reclaimPolicy Retain protects storage on accidental deletion

- 49:40 – Safety tip 3: move data off shards before deactivating them

- 50:12 – Advanced topics: external networking (service templates in 99-max.yaml)

- 52:20 – Backup: clickhouse-backup as a sidecar

- 52:55 – Security: operator hardening guide

- 53:28 – Monitoring and alerting: essential for Kubernetes operations

- 54:52 – Altinity.Cloud Anywhere: managed Kubernetes with customer data ownership

- 56:00 – Summary and call to action: open source, GitHub contributions welcome

- 58:18 – Q&A: automatic resharding (repeated)

- 59:47 – Q&A: reusing pod names across clusters (use clickhouse-backup to migrate)

- 1:01:15 – Closing remarks

---

## **Webinar Transcript**

### **[0:06] — Introduction and Housekeeping**

**Robert:** Hello everybody and welcome to today’s webinar on cloud-native ClickHouse at scale. We’ll be talking about using the Altinity Kubernetes Operator for ClickHouse. My name is Robert Hodges. I put this webinar together and I’ll be presenting it today, backed up by the great people in Altinity engineering.

A couple of things before we dive in. This webinar is being recorded, so you don’t need to take frantic notes. We will send you a link to the recording within about 24 hours if you signed up. We’ll also send you a link to the slides. For questions, use the Q&A box or post things in the chat. If a question is relevant to what we’re discussing at the time I’ll just answer it on the spot; otherwise we’ll defer to the end.

### **[1:47] — Speaker Introduction**

**Robert:** My name is Robert Hodges. My day job is I’m CEO of Altinity, but I’m basically a database geek. I’ve been working on databases for about 40 years. I do a lot of work with Kubernetes, started working with it about five years ago at VMware, and it’s of course a big part of our lives here at Altinity. The other folks who helped put this together are the Altinity engineering team, with huge amounts of experience in databases, particularly analytic databases like ClickHouse, as well as a great deal of experience in Kubernetes. In fact, we work constantly with Kubernetes. Our cloud is based on it. All of the software we’re discussing today, including [the Altinity Kubernetes Operator for ClickHouse](https://altinity.com/kubernetes-operator/), is used in our cloud. We first published the operator about four years ago and we’ve been building on it ever since.

### **[2:58] — ClickHouse Overview**

**Robert:** Quick primer in case anyone here has not worked with ClickHouse before. It’s basically like MySQL but it does analytics. There are similarities in the SQL dialect. It speaks SQL, it’s open source under Apache 2.0, and it has features that make it a really outstanding real-time analytic database: the ability to run anywhere, columnar storage, incredibly good compression, parallel execution, and all of that. ClickHouse will be the subject here, but what we’re going to be talking about for the rest of this webinar is how to package ClickHouse and run it on Kubernetes.

### **[3:51] — What Kubernetes Is**

**Robert:** Kubernetes has been described as a management system or orchestration system for container-based applications. If you use Docker or another container system, Kubernetes will run those for you in various environments and do all the things necessary to connect your distributed application with the underlying infrastructure it needs.

Here’s the simplest possible application: running ClickHouse in a Docker image on a VM somewhere with some attached storage. You probably wouldn’t use Kubernetes for just this, but it’s a good starting example because it gives us a way of describing how Kubernetes actually implements an application.

### **[5:19] — The Kubernetes Resource Model**

**Robert:** What Kubernetes does as part of container orchestration is map resource definitions to infrastructure. If you haven’t used Kubernetes before, you’ll come in and find a bunch of different kinds of resources: StatefulSets, pods, PersistentVolumeClaims. These are building blocks for distributed applications. They’re not tied to any specific infrastructure, but they’re there so you can describe to Kubernetes the shape of your distributed application and what resources it needs.

A StatefulSet describes a container-based application that needs attached storage. A pod more or less correlates to a container. A PersistentVolumeClaim is a request for storage, and a PersistentVolume is the actual patch of storage allocated to fulfill that request. These map to runtime abstractions you’ll recognize from Docker, which then map to actual physical infrastructure: the pod maps to a process running on a host controlled by cgroups, the volume maps to a patch of EBS storage on Amazon.
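As a rough illustration of these building blocks (not shown verbatim in the talk), a hand-written StatefulSet for a single ClickHouse server might look like the sketch below. The names `clickhouse-demo` and `data`, the image tag, and the 10Gi size are assumptions chosen for the example.

```yaml
# Illustrative only: one ClickHouse server managed as a plain StatefulSet.
# All names, the image tag, and the storage size are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-demo
spec:
  serviceName: clickhouse-demo      # headless Service for stable pod DNS (defined separately)
  replicas: 1
  selector:
    matchLabels:
      app: clickhouse-demo
  template:                         # the pod: one container plus its volume mount
    metadata:
      labels:
        app: clickhouse-demo
    spec:
      containers:
        - name: clickhouse
          image: clickhouse/clickhouse-server:latest
          volumeMounts:
            - name: data
              mountPath: /var/lib/clickhouse
  volumeClaimTemplates:             # yields one PersistentVolumeClaim per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Kubernetes satisfies each claim by binding it to a PersistentVolume, which on Amazon would typically be an EBS volume.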

### **[6:35] — The Kubernetes Control Loop**

**Robert:** How does Kubernetes actually do this mapping? Basically there’s a control loop running inside Kubernetes. It looks at the resource definitions as a desired state, then looks at the actual state, and decides what action it needs to take to make those two things match. It goes round and round. As the desired state changes, Kubernetes looks at each resource and asks: do I see a pod out in the environment? No, so I’m going to create it. Do I see a change in the resource definition to make storage bigger? Okay, I’m going to extend that storage. This loop allows Kubernetes to make iterative adjustments so that eventually the infrastructure state matches what you asked for.

### **[8:00] — A Realistic ClickHouse Cluster**

**Robert:** That’s pretty simple, but when we actually run ClickHouse clusters they’re quite a bit more complicated than just a single process. This is a typical example of a ClickHouse installation: two shards, two replicas per shard, the replicas spread across availability zones, and ZooKeeper with three copies also spread across availability zones.

Once you start mapping this to Kubernetes using its basic building block resources, it becomes very complicated. You can see StatefulSets, pods, PersistentVolumeClaims, and PersistentVolumes: all these resources are needed to define a ClickHouse server and map it to storage, with potentially dozens or even hundreds of resource definitions that need to be pushed into Kubernetes and properly managed.

### **[11:16] — The Operator Pattern**

**Robert:** If you just had to create and manage all these individual resource definitions yourself, this would be very difficult. The Kubernetes community recognized this several years ago and introduced something called an operator.

An operator is basically a custom resource definition plus a controller that acts on it. Kubernetes provides built-in resource definitions for things like PersistentVolume. A custom resource definition can be for anything you want; we have one that describes a ClickHouse cluster. To make it work, you need the controller: a component that watches for these resources as they’re loaded into Kubernetes and, when they change, decides what to do inside Kubernetes to make them a reality. You can think of it as a layer above the resources Kubernetes provides.

When you put a ClickHouseInstallation resource in, the operator gets a message saying a new resource definition has arrived. It looks at the other resources defined in Kubernetes and if they’re absent it creates new resource definitions; if they’re there but not in the right state it makes changes. This process is often called reconciliation: looking at the desired state and actual state and making changes to bring them into alignment.

This feature was fundamentally the thing that made Kubernetes become very good for managing data. Databases are complicated: they have lots of pieces, and processes around them like upgrades or adding replicas that need to be done in a particular order. Operators give you a way to implement that and then have a relatively simple description that users can give to say what they want.

### **[13:22] — The Altinity Kubernetes Operator for ClickHouse**

**Robert:** We started working on our ClickHouse operator in 2018. The operator model was fairly new at that time. The operator as it stands today (see [running ClickHouse on Kubernetes with the open-source Altinity Kubernetes Operator for ClickHouse](https://docs.altinity.com/altinitykubernetesoperator/)) is distributed as a container under Apache 2.0, and you load it into Kubernetes. Once it’s running you can give it ClickHouseInstallation resources, apply them with the standard kubectl command, and the operator will adjust reality to set up the cluster for you.

There are now tens of thousands of clusters worldwide that have been set up or are currently running using this operator.

### **[14:57] — Installing the Operator and ZooKeeper**

**Robert:** The first thing you do is install the operator from GitHub. You can do this in a single line. It pulls a YAML file that contains the custom resource definition for ClickHouse clusters, pulls the operator image, and installs it in the kube-system namespace. It sets up a service account and the other parameters necessary for the operator to run. On my home network the operator is up and running in about 10 seconds.

The next thing you need if you’re going to work with ClickHouse is ZooKeeper. You can also use ClickHouse Keeper, but we’re focused on ZooKeeper here. We provide a YAML file which will pop up a dev ZooKeeper for you: just one node. You create a namespace, apply the YAML file, and you’re done. There’s a warning here: in production you’d want more than one node, because if you’re using replicated tables and ZooKeeper stops, ingest stops. But for development purposes these commands will get you up and running.

### **[17:06] — The ClickHouseInstallation CRD**

**Robert:** At this point you can define your cluster. The CRD is between 30 and 40 lines of YAML to create a basic cluster. It’s quite simple. There’s a kind field that says what sort of resource this is, and a specification section with the layout, pod templates, storage templates, and ZooKeeper configuration.

For the layout: one shard and two replicas. Pod template: this just says hey I want to run ClickHouse in a container, and here’s where to find it. I’m using Altinity Stable images here, which are the long-term support builds that Altinity produces for ClickHouse. You can use any version you want.

Volume claim templates: this defines your storage. A couple of things to note. First, the resource request says how much we want: in this case 50 GB. Second, there’s a setting called reclaimPolicy: Retain. This is a feature of our operator. It says: if you delete the ClickHouseInstallation, don’t delete the underlying PersistentVolumeClaim. This protects you from accidentally losing storage if you delete your cluster, which has happened. With this setting you can recreate the cluster and it will automatically reattach to the existing storage.
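Putting those pieces together, a cluster definition along the lines described might look like the following sketch. It is not the actual dev.yaml from the demo: the resource names, the ZooKeeper host, and the image tag are assumptions, and field names should be checked against the operator documentation for the version in use.

```yaml
# Sketch of a basic ClickHouseInstallation: 1 shard, 2 replicas, 50Gi each.
# Resource names, the ZooKeeper host, and the image tag are assumptions.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "dev"
spec:
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.zoo1ns        # assumed <service>.<namespace> of the dev ZooKeeper
          port: 2181
    clusters:
      - name: "demo"
        layout:
          shardsCount: 1
          replicasCount: 2
        templates:
          podTemplate: clickhouse-stable
          volumeClaimTemplate: storage
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:<altinity-stable-tag>   # placeholder tag
    volumeClaimTemplates:
      - name: storage
        reclaimPolicy: Retain           # keep the PVC if the installation is deleted
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
```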

### **[20:07] — Live Demo: Creating a Cluster**

**Robert:** Let me now actually demonstrate this on a Kubernetes cluster running Minikube in a server in my closet. We can see a kubectl watch showing all resources in the default namespace: nothing running yet. We also have the operator log visible so we can watch what the operator is doing.

Here’s the dev.yaml cluster definition. Let’s apply it:

    kubectl apply -f dev.yaml

This loads the resource definition into Kubernetes. A second or two after that, the operator receives the event and begins creating resources. You can see a new StatefulSet being defined, then the pods starting to run, one coming up after another. Within about 20 seconds we have a full ClickHouse cluster: one shard and two replicas. There are also load balancer services created: one that load-balances across the two replicas for queries where you don’t care which server you land on, and individual services for each pod for when you need to target a specific replica, for example during ingest.

### **[23:10] — Adding a Second Shard**

**Robert:** Let’s make a change. We’ll add another shard by changing the shardsCount parameter from 1 to 2 and applying dev01.yaml:

    kubectl apply -f dev01.yaml

After a few seconds, the operator receives the event and begins creating a new StatefulSet, then new pods come up. The operator typically does things one at a time. During upgrades it upgrades pods one at a time: it waits for current queries to drain out of a pod before proceeding. This means you can do these operations while your applications are running. We now have two shards with two replicas each.
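For reference, the change between the two files amounts to one value in the cluster layout. The fragment below is a sketch with an assumed cluster name; the rest of the definition is unchanged.

```yaml
# Fragment of the cluster layout after the change.
spec:
  configuration:
    clusters:
      - name: "demo"            # assumed cluster name
        layout:
          shardsCount: 2        # was 1
          replicasCount: 2
```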

An important note on schema propagation: when you add replicas, the operator will automatically copy the schema to the new replica. You don’t have to manually run around and grab table definitions from another replica. For replicated tables, ClickHouse replication will then automatically populate the table data.

### **[25:47] — Rolling Version Upgrade**

**Robert:** Let’s do an upgrade. For dev02.yaml we change nothing except the image tag in the pod template: we replace the ClickHouse version string with a newer version:

    kubectl apply -f dev02.yaml

The operator will notice the changed image, plan the upgrade, and begin rolling it out. You can watch it in the operator log: it’s evaluating the cluster and creating a plan. Then it goes pod by pod, stopping each pod, re-spinning it with the new image while reattaching the existing storage. During this process each pod runs the new version of ClickHouse with its existing data. It’s as simple as changing a string in YAML. That’s the basic automation for upgrading between ClickHouse versions.
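The edit that triggers the rolling upgrade is confined to the pod template. A sketch, with an assumed template name and a placeholder where the newer version tag would go:

```yaml
# Fragment of the pod template; only the image tag differs from the previous file.
spec:
  templates:
    podTemplates:
      - name: clickhouse-stable       # assumed template name
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:<newer-tag>   # placeholder for the target version
```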

### **[31:44] — Adding Users in the CRD**

**Robert:** One other common question is how to add users. You can add configuration files, configuration settings, change the container entry point, add sidecars, and many other things in the CRD definition. A quick example: if you want to add a user called root to your clusters and enable access management so it can run SQL RBAC commands, you add it in the spec section of the CRD with a hashed password. It’s basically the same information you’d put in an XML file, just expressed as YAML.
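A sketch of what that users section might look like follows. The path-style keys mirror the structure of the ClickHouse users XML; the hash is a placeholder to be replaced with the SHA-256 hex digest of the real password, and the exact keys should be confirmed against the operator documentation.

```yaml
# Fragment of the spec that defines a user named root.
spec:
  configuration:
    users:
      # Allow root to run SQL RBAC commands such as CREATE USER and GRANT.
      root/access_management: 1
      # Placeholder, not a real hash.
      root/password_sha256_hex: "<sha256-hex-digest>"
```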

### **[32:25] — Production Setup**

**Robert:** That covers basic operations in a dev environment. Let’s talk about using the operator for production.

Minikube is great for development, but you’d never run production Kubernetes on it. The certificates expire after 365 days and everything stops working. What we do instead is run on managed Kubernetes services: Amazon EKS, Google GKE, Azure AKS, or DigitalOcean. There are also Red Hat OpenShift, Rancher, and kops. EKS and GKE have a minimal charge; mostly you’re paying for the underlying VMs, storage, and networking, which you’d pay for anyway.

### **[35:07] — Three Steps for Production Kubernetes**

**Robert:** Setup is pretty simple. There are basically three things to do.

First, provision the cluster. Whatever tools are available: eksctl on Amazon, gcloud on Google. You set it up and you’ll have worker nodes.

Second, set up autoscaling. For production systems you’ll want VMs to spin up automatically when needed and go away when no longer needed. We use Karpenter, which watches Kubernetes for pod resource requests and when it sees something that can’t be scheduled, provisions an appropriate VM for it. It also deallocates VMs when they’re no longer needed. This is important for [cutting compute costs by scaling ClickHouse servers to zero on Kubernetes](https://altinity.com/blog/cut-compute-costs-by-scaling-clickhouse-servers-to-zero-on-kubernetes) when clusters aren’t in use.

Third, manage block storage. You can run ClickHouse clusters on NVMe SSD local storage, but we normally use block storage like EBS. At least in Amazon and GKE environments, block storage is just as fast as local SSD for most purposes due to storage bandwidth limits on VMs. The key advantage is that it enables scaling: you can reattach the same block storage to a larger or smaller VM, giving you different compute capacity without moving data. You do need to configure the EBS CSI driver (required since Kubernetes 1.23), create the right storage class, and enable a storage class provisioner.

### **[38:38] — Placing Pods Across Availability Zones**

**Robert:** For production, you’ll want your replicas spread across availability zones. The way to do this with the operator is to use per-replica pod templates instead of a single shared template.

Instead of specifying a replicasCount of 2, you give a list of replica specifications, each pointing to a different named pod template. The first replica uses the zone-2a template, the second uses zone-2b. The operator counts the items in the list and uses the matching template for each replica.

Then you define those pod templates. Beyond just specifying the container image, you add a node selector:

    nodeSelector:
      topology.kubernetes.io/zone: us-west-2a

This says the pod can only run on a node labeled with that availability zone. If such a node doesn’t exist, the pod won’t schedule, Karpenter will notice and create the right VM, and then the pod will land on it.

You also specify the instance type in the pod spec:

    nodeSelector:
      node.kubernetes.io/instance-type: m5.large

This ensures ClickHouse runs on the right hardware.
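Combining the replica list with the two node selectors, the relevant parts of the definition might look like this sketch. The cluster name, template names, and image tag are assumptions, and the layout keys should be verified against the operator documentation.

```yaml
# One pod template per availability zone, referenced per replica.
spec:
  configuration:
    clusters:
      - name: "prod"                        # assumed cluster name
        layout:
          shardsCount: 1
          replicas:                         # one list entry per replica
            - templates:
                podTemplate: clickhouse-zone-2a
            - templates:
                podTemplate: clickhouse-zone-2b
  templates:
    podTemplates:
      - name: clickhouse-zone-2a
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-west-2a
            node.kubernetes.io/instance-type: m5.large
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:<altinity-stable-tag>   # placeholder tag
      - name: clickhouse-zone-2b
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-west-2b
            node.kubernetes.io/instance-type: m5.large
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:<altinity-stable-tag>   # placeholder tag
```

Because both labels sit in the same nodeSelector, a pod schedules only on a node that matches the zone and the instance type together.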

### **[42:00] — Why Resource Requests Matter**

**Robert:** Resource requests are important in production environments. ClickHouse can actually see how much memory is available on the VM, because Docker and other container systems don’t hide this from the process. If your resource requests are set too low, Kubernetes may decide this pod is using too much memory and kill it. If you leave resource requests unset entirely, Kubernetes may co-schedule multiple ClickHouse servers on the same node. Setting requests appropriately both prevents the OOM killer from triggering and ensures one ClickHouse process per node.

For tips on [running ClickHouse in Kubernetes](https://kb.altinity.com/altinity-kb-kubernetes/) including resource request configuration, the Altinity Knowledge Base has detailed guidance.
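As a hedged example of the resource requests discussed above, the container entry in a pod template could carry values like these. The numbers are assumptions sized for an m5.large node (2 vCPU, 8 GiB) and would need tuning to leave headroom for system pods.

```yaml
# Fragment of a pod template: the containers list with explicit requests.
containers:
  - name: clickhouse
    image: altinity/clickhouse-server:<altinity-stable-tag>   # placeholder tag
    resources:
      requests:
        memory: 6Gi     # two such pods cannot fit in 8 GiB, so only one lands per node
        cpu: "1"
```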

### **[43:07] — Vertical Scaling**

**Robert:** To scale up vertically from M5 large to M5 xlarge, you change the nodeSelector instance type in the pod template from m5.large to m5.xlarge. Apply the file and the operator will reschedule the pods on nodes of the new VM type. To increase storage from 50 GB to 100 GB, change the storage size in the volume claim template and apply. The operator will expand the underlying persistent volumes.

Currently the operator will restart your servers as it allocates more storage. This will be fixed in a future release. One important rule: you can always increase storage but you cannot shrink a filesystem once it’s in use.
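The two vertical scaling edits can be sketched as follows. They are independent changes shown together, and the template names and image tag are assumptions.

```yaml
# Vertical scaling: a larger instance type and a larger volume.
spec:
  templates:
    podTemplates:
      - name: clickhouse-zone-2a
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-west-2a
            node.kubernetes.io/instance-type: m5.xlarge   # was m5.large
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:<altinity-stable-tag>   # placeholder tag
    volumeClaimTemplates:
      - name: storage
        reclaimPolicy: Retain
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi   # was 50Gi; storage can grow but not shrink
```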

### **[45:02] — Storage Class Best Practices**

**Robert:** Pay attention to your storage classes. Even in a dev environment you should verify storage is really allocated correctly. Kubernetes is not good about reporting errors if you have a YAML syntax mistake; it will often just silently pass over it. Verify your PVCs are actually created and have the right type.

On the storage class itself, check that allowVolumeExpansion is set to true. Most Amazon and GKE storage providers support this. Also note the default storage class flag: if it is set to true, any volume claim that doesn’t specify a storage class will get this one. These settings don’t change much, so once they’re right you don’t have to worry about them again.
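A storage class with both settings might look like the sketch below, assuming the AWS EBS CSI driver. The class name and the gp3 volume type are assumptions.

```yaml
# Expandable default storage class backed by the EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable                  # assumed name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # marks this class as the default
provisioner: ebs.csi.aws.com            # AWS EBS CSI driver
parameters:
  type: gp3
allowVolumeExpansion: true              # required before PVCs can be resized
volumeBindingMode: WaitForFirstConsumer # create the volume in the zone where the pod schedules
```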

### **[46:08] — Q&A: How the Operator Adds Replicas**

**Robert:** When you add a replica, the operator adds the StatefulSet, which triggers the provisioner to add a VM. Once the VM is available, it updates the cluster metadata, including config files like remote_servers.xml, to tell the cluster about the new nodes. Then it copies the schema to the new replica. Once the schema is there, ClickHouse replication automatically populates the table data for replicated tables. The process is seamless, and if you’re using distributed tables there are standard checks to confirm the replica is caught up.

### **[47:43] — Safety Tips**

**Robert:** There are three things you absolutely want to be careful about.

**Safety tip 1:** Never run kubectl delete -f clickhouse-operator-install-bundle.yaml while you have running clusters. That YAML bundle includes the custom resource definition, which is global across the entire Kubernetes cluster. When you delete the CRD, Kubernetes no longer knows what to do with ClickHouseInstallation resources, and if you reinstall the operator it will delete them. Clean up your clusters first, then remove the operator.

**Safety tip 2:** The reclaimPolicy: Retain setting is your friend. If you accidentally delete your ClickHouseInstallation, this setting means your PersistentVolumeClaims survive. You can recreate the cluster and it will reattach to the existing storage automatically.

**Safety tip 3:** When removing shards, move the data off first. If you have two shards and reduce to one, the operator will drop the second shard and it will not save your data. The operator allows you to connect to specific nodes precisely so you can move data around before making changes.

### **[50:12] — Advanced Topics**

**Robert:** External network access is complicated and different for every implementation. Kubernetes is very portable inside but the networking differs for every environment. We have a configuration file called 99-clickhouseinstallationmax.yaml in the operator GitHub repository which gives examples of everything you can do, including how to configure internal versus external load balancers. The flag that controls what load balancer gets allocated is a service annotation. For example, setting service.beta.kubernetes.io/aws-load-balancer-internal: "true" creates an internal load balancer that won’t expose a public IP.
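A service template carrying that annotation might be sketched as follows. The template name `internal-lb` is an assumption, and the repository example file remains the authoritative reference for the available fields.

```yaml
# Service template that requests an internal AWS load balancer.
spec:
  defaults:
    templates:
      serviceTemplate: internal-lb        # assumed template name
  templates:
    serviceTemplates:
      - name: internal-lb
        metadata:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        spec:
          type: LoadBalancer
          ports:
            - name: http
              port: 8123                  # ClickHouse HTTP interface
            - name: tcp
              port: 9000                  # ClickHouse native protocol
```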

For backup, we use [Altinity Backup for ClickHouse](https://github.com/Altinity/clickhouse-backup) running as a sidecar to ClickHouse. There are also approaches using EBS volume snapshots or GKE block storage snapshots that can be effective in cloud environments.

For security, there’s a write-up called the [security hardening guide for the Altinity Kubernetes Operator for ClickHouse](https://docs.altinity.com/operationsguide/security/), which just went into the project. It covers users, passing secrets around, securing the network, and a broad range of security best practices.

For [monitoring and alerting](https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-monitoring/): if you’re running things on Kubernetes, you absolutely must invest in this. If you can’t afford to invest in monitoring and alerting, you probably shouldn’t use Kubernetes. Kubernetes hides a lot of what’s happening on your nodes, and you can’t just SSH in and look at things. You need to be able to see quickly what’s going on across potentially many nodes. Our monitoring at Altinity.Cloud pulls metrics off both the VMs running ClickHouse and ClickHouse itself, so when something goes wrong we can figure it out without laying hands on it.

### **[54:52] — Altinity.Cloud Anywhere**

**Robert:** We have many users running this themselves. There are even large internet services in China running ClickHouse clusters on the operator for their own cloud services. If you don’t want to run this yourself, Altinity.Cloud can run it for you. And if you have your own Kubernetes, for example on GKE, and you want to keep the data locally, Altinity.Cloud Anywhere allows you to install the Altinity Connector, register with Altinity.Cloud, and build a bidirectional management channel. We can then do all your cluster management from a centralized control plane. But you have the data, you have the code, and if you don’t like it you can just disconnect and your clusters continue running exactly as shown in the examples here.

### **[56:00] — Summary and Call to Action**

**Robert:** The Altinity Kubernetes Operator for ClickHouse manages ClickHouse clusters and does a pretty good job of it. It has over 1,200 stars on GitHub. In the world of operators it’s one of the most popular, and certainly the most popular for anything resembling an analytic database. You can try it out on Minikube in a few minutes, and in fact if you want to learn ClickHouse itself this is one of the best ways to do it, because setting up sharding and replication is complex configuration that the operator just handles automatically.

Most users we’re familiar with use managed Kubernetes services, and we do it ourselves. If you’re provisioning and running at scale, pay attention to scaling issues, mapping ClickHouse nodes to VMs, and taking care of services, backups, and monitoring. The docs on the GitHub site have examples of everything and you can always ask questions, file issues, and so on.

If you like this: please help us make it better. It is 100 percent open source, Apache licensed. Write blogs about it, tell people about it, log issues, and if you find something the operator is not handling, send us a pull request. We love them. We’ve accepted contributions from at least two cloud services that are using this. Contributors are very welcome.

### **[58:18] — Q&A: Automatic Resharding**

**Robert:** Andre asks again about automatic resharding. The answer is the same: it’s a work in progress, we’re very interested in the topic but I can’t give a date. Part of the reason is that ClickHouse itself doesn’t make it simple. Unlike Cassandra, ClickHouse doesn’t have automatic resharding yet, though there is work afoot on that. If this is something you’re interested in and would like to help with, contact us. You may get it faster if you help.

### **[59:47] — Q&A: Reusing Pod Names in Another Cluster**

**Robert:** The question is: can you reuse pod names in another cluster to have data transferred automatically? The answer is no. The simplest way to create a new cluster and migrate data is to use Altinity Backup for ClickHouse to back up to S3 or another storage location, create a new cluster, and restore into it. In our cloud we have this highly automated. Just change the cluster name in the YAML and everything else works. Use it to test upgrades: back up your existing data, bring it up in a new cluster, run the upgrade, make sure it’s good, then throw the test cluster away and do the real upgrade.

**Robert:** I think we’re at the end of the questions. Thank you so much for attending. Come check out the Altinity Kubernetes Operator for ClickHouse, and if you have questions, come talk to us. We run Altinity.Cloud, we do consulting, and if you think you’re really good at this and like it, we are hiring: we’re looking for a Go programmer today who can work on exactly this stuff.

With that, since we’ve covered all the questions, I’d like to thank you all very much. I’d also like to thank the Altinity engineering team who have worked on this for years and generated a lot of the information I’ve shared with you today. Thank you and have a great day.

## **FAQ**

**What problem does the Altinity Kubernetes Operator for ClickHouse solve?**

Running a production ClickHouse cluster on Kubernetes correctly requires defining dozens to hundreds of Kubernetes resources: StatefulSets, pods, PersistentVolumeClaims, PersistentVolumes, services, and ZooKeeper configuration, all coordinated across availability zones. Managing these by hand is complex and error-prone. The operator introduces a ClickHouseInstallation custom resource that describes the desired cluster in a single 30-to-40-line YAML file. A controller watches for these definitions, reconciles them against the actual state of the cluster, and creates or adjusts all the underlying resources automatically.

**How do you install the operator and create a basic cluster?**

Install the operator with a single kubectl apply command that pulls a YAML bundle from the Altinity GitHub repository. This installs the ClickHouseInstallation CRD, the operator pod in kube-system, and the required service account. Then install ZooKeeper with another YAML file. Finally, apply a ClickHouseInstallation YAML that specifies shard count, replica count, container image, storage size, and ZooKeeper location. The cluster will be running in roughly 20 seconds.
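
As a rough illustration of that last step, here is a minimal sketch of a ClickHouseInstallation covering the five items listed above. The resource name `demo`, the cluster name `cl`, the template names, the ZooKeeper host `zookeeper.zoo1ns`, the image tag, and the storage size are all assumptions chosen for the example, not values taken from the webinar.

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "demo"                        # hypothetical installation name
spec:
  defaults:
    templates:
      podTemplate: clickhouse-pod     # refers to the pod template defined below
      dataVolumeClaimTemplate: data-volume
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.zoo1ns      # assumed <service>.<namespace> of your ZooKeeper
          port: 2181
    clusters:
      - name: "cl"                    # hypothetical cluster name
        layout:
          shardsCount: 1              # shard count
          replicasCount: 2            # replica count per shard
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:22.8   # assumed image tag
    volumeClaimTemplates:
      - name: data-volume
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi           # assumed storage size
```

Saved to a file and applied with kubectl apply -f, a definition along these lines is what the operator reconciles into StatefulSets, services, and volume claims.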

**How does the operator handle rolling upgrades and schema propagation?**

To upgrade ClickHouse, change the container image tag in the pod template of your ClickHouseInstallation YAML and apply it. The operator detects the change and upgrades pods one at a time, letting current queries drain before stopping each pod, then re-spinning it with the new image while reattaching existing storage. When you add replicas, the operator automatically copies the table schema to new replicas and lets ClickHouse’s built-in replication populate the data.
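
The upgrade edit is small enough to show as a fragment. This sketch assumes a pod template named `clickhouse-pod` and uses illustrative image tags; only the `image` line differs between the old and new versions of the file.

```yaml
# Fragment of a ClickHouseInstallation; everything else in the file stays unchanged.
spec:
  templates:
    podTemplates:
      - name: clickhouse-pod          # hypothetical template name
        spec:
          containers:
            - name: clickhouse
              # previously (assumed): clickhouse/clickhouse-server:22.3
              image: clickhouse/clickhouse-server:22.8   # assumed target tag
```

Reapplying the file with the new tag is what triggers the pod-by-pod rolling upgrade.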

**What is the reclaimPolicy: Retain setting and why is it important?**

reclaimPolicy: Retain is an Altinity Operator extension to the volume claim template. It tells the operator to leave the PersistentVolumeClaims in place when the ClickHouseInstallation resource is deleted. This protects you from accidentally losing data if you delete a cluster. With this setting, recreating the cluster by reapplying the YAML will automatically reattach to the existing storage.
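
A sketch of where the setting goes, assuming a volume claim template named `data-volume` and an illustrative storage size. Note that `reclaimPolicy` sits beside `name` and `spec` on the template itself, not inside the standard Kubernetes PVC `spec`.

```yaml
# Fragment of a ClickHouseInstallation.
spec:
  templates:
    volumeClaimTemplates:
      - name: data-volume             # hypothetical template name
        reclaimPolicy: Retain         # operator extension: keep the PVC when the installation is deleted
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi           # assumed storage size
```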

**What are the three safety rules for running the operator in production?**

First, never run kubectl delete -f clickhouse-operator-install-bundle.yaml while you have running clusters. Deleting the operator bundle deletes the ClickHouseInstallation CRD, which causes Kubernetes to destroy your clusters. Delete your clusters first, then remove the operator. Second, always use reclaimPolicy: Retain to protect storage from accidental deletion. Third, never reduce the shard count without first moving the data off the shards you intend to remove. The operator will drop those shards and will not save the data automatically.

**What are the key differences between running the operator for development versus production?**

For development, Minikube is fine. For production, use managed Kubernetes services such as EKS, GKE, or AKS. Add Karpenter or node groups for autoscaling so VMs are allocated and deallocated automatically. Use block storage such as EBS rather than local SSD because block storage can be reattached to different VM sizes, enabling cost-effective vertical scaling. Define per-replica pod templates with node selectors to place replicas across availability zones. Set resource requests and limits to avoid the OOM killer and to prevent co-scheduling of multiple ClickHouse pods on the same node. And invest in monitoring, alerting, backup, and security before considering the system production-ready.
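
One way the per-replica placement and resource settings could look is sketched below. The zone names, template names, and CPU and memory figures are assumptions for illustration; the per-replica `templates` override under `layout.shards[].replicas[]` should be checked against the operator version you run.

```yaml
# Fragment of a ClickHouseInstallation: two replicas of one shard, one per zone.
spec:
  configuration:
    clusters:
      - name: "cl"                              # hypothetical cluster name
        layout:
          shards:
            - replicas:
                - templates:
                    podTemplate: clickhouse-zone-a
                - templates:
                    podTemplate: clickhouse-zone-b
  templates:
    podTemplates:
      - name: clickhouse-zone-a
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-west-2a    # assumed zone
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:22.8 # assumed image tag
              resources:
                requests: { cpu: "3", memory: 12Gi }   # assumed sizing
                limits:   { cpu: "3", memory: 12Gi }
      - name: clickhouse-zone-b
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-west-2b    # assumed zone
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:22.8
              resources:
                requests: { cpu: "3", memory: 12Gi }
                limits:   { cpu: "3", memory: 12Gi }
```

Setting requests equal to limits, sized close to the capacity of the target VM, gives the pod a guaranteed memory allocation and leaves no room for a second ClickHouse pod to be scheduled onto the same node.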

---

© 2022 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

