
Delivering Production ClickHouse® Apps on the Altinity Kubernetes Operator

Recorded: March 24 @ 08:00 am PDT
Presenters: Robert Hodges and Alexander Zaitsev

In this webinar, Robert Hodges (CEO, Altinity) and Alexander Zaitsev (CTO, Altinity) deliver a deep dive into the Altinity Kubernetes Operator for ClickHouse® — the open source operator that powers ClickHouse® deployments for companies ranging from startups to eBay and OpenAI. The session opens with a live demo spinning up a replicated ClickHouse® cluster on a laptop in under three minutes, then progresses through production-grade topics: cluster layouts, pod and storage templates, service templates, CHI installation templates, rolling upgrades with zero service interruption, storage scaling with no downtime, backup via sidecar containers, Prometheus monitoring with Grafana dashboards, and automation via Terraform and Helm. The webinar closes with a candid comparison to the ClickHouse, Inc. operator, a 2025 roadmap preview including post-start/pre-stop hooks, faster replica provisioning via volume snapshots, and a plugin system — plus a Q&A covering NVMe storage, KEDA autoscaling, and Project Antalya’s swarm clusters.

Key Moments (Timestamps)

AI helped write this. Mistakes may happen.

  • 00:04 – Welcome and housekeeping
  • 00:23 – Introduction: Robert Hodges and Alexander Zaitsev
  • 01:42 – Altinity background: ClickHouse® on Kubernetes since 2018
  • 03:18 – Live demo: spinning up a replicated cluster in under 3 minutes
  • 07:43 – ClickHouse® architecture overview: shared-nothing, replicas, Keeper
  • 09:03 – What is a Kubernetes operator? Custom resources and reconciliation
  • 10:53 – Installing the Altinity Kubernetes Operator for ClickHouse®
  • 13:27 – CHI resource deep dive: cluster layout, Keeper, pod templates, storage
  • 20:07 – Deployment tips: managed Kubernetes, node affinity, storage classes
  • 26:29 – Backup with Altinity Backup for ClickHouse® sidecar
  • 28:26 – Advanced features: service templates and CHI installation templates
  • 35:17 – Rescaling: adding/removing replicas and shards with zero downtime
  • 39:35 – Upgrades and configuration management
  • 43:51 – Maintenance tasks: stop, restart, suspend, troubleshoot mode
  • 47:25 – Monitoring: Prometheus metrics exporter and Grafana dashboards
  • 50:06 – Deployment automation: Terraform and Helm
  • 54:19 – Altinity Operator vs. ClickHouse, Inc. operator: a comparison
  • 59:21 – 2025 roadmap: hooks, CHI/CHK integration, volume snapshots, plugins
  • 1:01:54 – Q&A: NVMe, KEDA, custom metrics, Project Antalya swarm clusters

Transcript


[00:04–01:41] – Welcome and Housekeeping

Robert: Welcome everybody. We are going to deliver a webinar on delivering production ClickHouse® apps on the Altinity Kubernetes Operator for ClickHouse®. I hope the whole webinar goes better than that first line. Welcome again, and thank you so much for joining us today.

I’m going to be working with my colleague Alexander Zaitsev, who is the Altinity CTO. I am Robert Hodges, Altinity CEO. Before we start, a couple of housekeeping items so you won’t need to take frantic notes.

First, this webinar is being recorded. We’ll send you the recording and a link to the slides right after the webinar finishes — you’ll get it sometime later today. No need to take notes; as long as you’ve signed up, everything will be forwarded to your email.

Second, we do have time for questions and we encourage them. You can enter them in the Q&A box. Since there are two of us, we may be able to answer them in the background or as part of the presentation. We have a lot to cover, so we’ll probably reserve some questions for the end.

There’s also a chat. Some people like to put questions there. It doesn’t matter — type them in anywhere you want and we’ll answer them.


[01:42–03:17] – Altinity Background: ClickHouse® on Kubernetes Since 2018

Robert: Just a bit of background on Altinity. We are a vendor for ClickHouse® — the best analytic database on the planet. We’ve been around since 2017 and we have a lot of experience running ClickHouse® on Kubernetes.

We run Altinity.Cloud®, which has been up and running since 2020 and currently serves customers across five cloud platforms. We also have a large enterprise support business, supporting many organizations running ClickHouse® on Kubernetes.

This relationship with Kubernetes started early. We began getting into Kubernetes in a big way at the end of 2018. That experience led us to write the Altinity Kubernetes Operator for ClickHouse®, which was first released in early 2020 and has since become the foundation of many customer deployments as well as our own Altinity.Cloud®.

This talk will introduce you to the operator and show you the features that enable you not just to do development work, but to build full-scale data platforms — in some cases consisting of hundreds of nodes.


[03:18–07:42] – Live Demo: Spinning Up a Replicated Cluster in Under 3 Minutes

Alexander: So we start with the mini deployment I have on my laptop. The beauty of Kubernetes and the operator is that you can try production configurations on your laptop before deploying them to real production.

Let me create a namespace first:

kubectl create namespace test

Now I’ll deploy the operator using the simplest deployment mechanism:

kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml -n test

We can see a lot of resources have been created. Let me double-check that the operator is up and running. Good. Now let’s go to ClickHouse® itself.

I have a pre-created manifest that contains both a ClickHouse® Keeper installation and a ClickHouse® installation. The ClickHouse® installation references the Keeper installation — you can see in the keeper section the service name clickhouse-keeper-my-chk. Let’s create both resources:

kubectl create -f chi-and-chk.yaml -n test

Two resources were created. The operator reads those resource definitions — how many shards and replicas to create, what needs to be running — and starts spinning things up. We can see the first ClickHouse® pod starting, then the ClickHouse® Keeper pod (CHK), essentially in parallel.

Now everything is running. I can already connect to ClickHouse® using kubectl exec. I’m now in a ClickHouse® pod — one of the replicas. We started a replicated cluster, which is why we have Keeper.
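For reference, the connection command looks roughly like this; the pod name follows the operator’s chi-<installation>-<cluster>-<shard>-<replica>-0 pattern, so the exact name depends on what your CHI and cluster are called (my-chi and my-cluster here are illustrative):

kubectl exec -it chi-my-chi-my-cluster-0-0-0 -n test -- clickhouse-client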

Let me create a table. I’ll use a schema from an S3 bucket where we have some Parquet data:

CREATE TABLE ontime AS s3('s3://my-bucket/ontime/*.parquet', 'Parquet')
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ontime', '{replica}')
ORDER BY (FlightDate, Carrier);

The schema has been created on both nodes. Now let me load a small portion of data — just 100 rows to keep things fast, since object storage is far away from this mini installation:

INSERT INTO ontime SELECT * FROM s3('s3://my-bucket/ontime/*.parquet', 'Parquet') LIMIT 100;

Let me check that we have everything:

SELECT count() FROM ontime;
-- 100

We have data in ClickHouse® deployed by the Kubernetes operator — exactly 100 rows across two nodes because we have replication. It took me three minutes to spin up a replicated cluster with the operator. On bare-metal ClickHouse®, it would take quite a lot longer just to figure out how to configure it properly.


[07:43–10:52] – ClickHouse® Architecture and the Operator Model

Robert: Let’s look at the architecture of ClickHouse® that we’re representing in Kubernetes. ClickHouse® is an analytic server with a shared-nothing architecture and attached storage connected by a network. In our examples, we have two servers set up as replicas, and a ClickHouse® Keeper instance used to maintain consensus between those replicas — tracking which parts are on which node, notifying replicas when new parts arrive, and so on.

What is a Kubernetes operator?

An operator is a programming pattern built on top of Kubernetes. It allows you to create custom resources — the way we model container-based applications in Kubernetes. More importantly, it allows you to define the processing on those resources.

The way it works: you feed in a custom resource using kubectl. Kubernetes stores it, and the operator watching that resource type picks it up and performs reconciliation: it compares the desired state in the custom resource against what’s actually running in Kubernetes, creates anything that’s missing, and adjusts anything that’s different.

This is a very powerful mechanism for databases in particular, because databases don’t just have a layout — they have complicated procedures for moving between states. The operator encodes all of that knowledge so you don’t have to.
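A quick way to watch that loop from the outside (commands assume the namespace from the demo and a CHI named my-chi, which is illustrative):

kubectl get clickhouseinstallations -n test   # or the short form: kubectl get chi -n test
kubectl describe chi my-chi -n test           # shows the desired spec plus the operator's reconciliation status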


[10:53–13:26] – Installing the Altinity Kubernetes Operator for ClickHouse®

Robert: Installing the Altinity Kubernetes Operator for ClickHouse® is a simple operation that takes 10 seconds or less on a typical network connection. The operator is a container like everything else in Kubernetes, and this single command:

kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

…pulls down a YAML file containing the full operator definition and everything it needs, including:

  • Config maps — tabular information and parameters the operator needs to function
  • Service account with roles — appropriate permissions to manipulate resources within the namespace(s) being operated
  • Custom Resource Definitions (CRDs) — four of them: CHK resources, CHI resources, and two types of templates
  • The operator itself — plus a sidecar service that exports metrics to Prometheus

All of this lands in the kube-system namespace automatically.
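A quick sanity check after installation (exact pod and resource names can vary slightly between operator versions):

kubectl get pods -n kube-system | grep clickhouse-operator
kubectl get crds | grep clickhouse.altinity.com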


[13:27–20:06] – CHI Resource Deep Dive: Cluster Layout, Keeper, Pod Templates, Storage

Robert: Now let’s dig into the CHI (ClickHouseInstallation) resource and the key elements you’ll need even in a development environment: talking to Keeper, defining your cluster layout, defining your container (pod template), and defining storage (volume claim template).

Cluster layout and Keeper location:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallation
metadata:
  name: my-chi
spec:
  configuration:
    zookeeper:
      nodes:
        - host: clickhouse-keeper-my-chk
    clusters:
      - name: my-cluster
        layout:
          shardsCount: 1
          replicasCount: 2

This is all you need to set up a replicated cluster. The operator is smart about replicas: if you add replicas later, it will automatically transfer your schema to the new replicas — and even to new shards.

Server settings and configuration files:

You can add ClickHouse® configuration directly into the CHI definition. For example, changing the maximum number of concurrent queries or other settings that you’d normally manage through config files on bare-metal installations:

spec:
  configuration:
    files:
      config.d/max-connections.xml: |
        <clickhouse>
          <max_connections>500</max_connections>
        </clickhouse>

These are automatically set up in the container’s configuration when it lands on a host.

Pod templates:

spec:
  templates:
    podTemplates:
      - name: my-pod-template
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.3.5.46.altinitystable

The Altinity Kubernetes Operator for ClickHouse® works with essentially any ClickHouse® build — Altinity Stable® Builds for ClickHouse®, official ClickHouse, Inc. builds, or Project Antalya builds. The oldest version we have running in production right now is 21.11, and it still works fine with the operator. Just name the image and the operator will pull it.

Volume claim templates:

spec:
  templates:
    volumeClaimTemplates:
      - name: data-volume
        reclaimPolicy: Retain
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi

Two important settings here:

  1. Storage size — always include this. If you omit the volume claim template entirely, you’ll be using ephemeral storage, which evaporates when a container restarts.
  2. Reclaim policy: Retain — this is a critical safety mechanism. If you accidentally delete a shard or an entire cluster, the Persistent Volume Claims (PVCs) stay up. When you realize the error, you just reinstall and the operator finds the storage again.
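A rough sketch of the recovery flow described in point 2, assuming the CHI is named my-chi and defined in my-chi.yaml (both names illustrative):

kubectl delete chi my-chi -n test     # the accidental deletion
kubectl get pvc -n test               # with reclaimPolicy: Retain, the data-volume PVCs are still there
kubectl apply -f my-chi.yaml -n test  # reinstall; the operator reattaches the retained storage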

[20:07–26:28] – Deployment Tips: Managed Kubernetes, Node Affinity, Storage Classes

Robert: Everything covered so far is sufficient for development. Now let’s talk about what’s needed to turn this into a production application.

Use managed Kubernetes.

Most people don’t want to run Kubernetes themselves, and they shouldn’t have to. Every major cloud provider offers managed Kubernetes: EKS (AWS), GKE (Google), AKS (Azure). These cost roughly $70–$100/month to run — practically free — and they handle Kubernetes updates, security patches, and cluster lifecycle management automatically.

Wire up cluster autoscaling and storage provisioning.

Kubernetes needs to be correctly wired to your cloud infrastructure so it can spin VMs up and down on demand. On AWS, this is done through the Cluster Autoscaler, typically configured via Terraform. Storage is defined through a storage class — a key resource in Kubernetes that your PVCs reference, and that a storage provisioner uses to allocate matching amounts of real storage automatically.

Allocate a separate host per ClickHouse® node.

The operator makes this easy with pod distribution settings:

spec:
  templates:
    podTemplates:
      - name: my-pod-template
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: m7g.xlarge
          tolerations:
            - key: "clickhouse"
              operator: "Exists"
              effect: "NoSchedule"
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: "clickhouse.altinity.com/chi"
                        operator: "Exists"
                  topologyKey: "kubernetes.io/hostname"

  • Node selector — target a specific instance type (e.g., M7G Graviton on AWS: high performance, lower cost)
  • Pod anti-affinity — the Highlander principle: there can only be one. Ensures no two ClickHouse® replicas share the same VM
  • Tolerations — allows ClickHouse® pods to be scheduled on tainted nodes reserved exclusively for ClickHouse®, keeping them away from Keeper and other applications

Storage classes for production:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  fsType: xfs
  throughput: "500"
  iops: "6000"
allowVolumeExpansion: true

This is an example for GP3 EBS on AWS. Key features: encryption, XFS filesystem, configurable throughput and IOPS, and online volume expansion enabled. See the Altinity blog for details on how we handle live storage scaling in Altinity.Cloud®.
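With allowVolumeExpansion enabled, growing storage is just a change to the CHI’s volume claim template followed by kubectl apply; a sketch based on the data-volume template shown earlier:

spec:
  templates:
    volumeClaimTemplates:
      - name: data-volume
        reclaimPolicy: Retain
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi   # was 10Gi; the operator expands the underlying PVCs online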


[26:29–28:25] – Backup with Altinity Backup for ClickHouse®

Robert: Backup is a many-splendored thing. We run Altinity Backup for ClickHouse® — our open source backup tool on GitHub.

Here’s how it works in Kubernetes. Instead of running one container per pod, we run two — a sidecar pattern:

  1. Container 1: The ClickHouse® server
  2. Container 2: The clickhouse-backup binary

Because both containers run in the same pod, they can mount the same volumes and therefore see the same storage. This is necessary because ClickHouse® backup works by creating hard links to data files, which requires being on the same filesystem.

The backup container writes to object storage (S3 or equivalent). You allocate a bucket, ensure the IAM role is correctly wired for read/write access, and the backup runs from there. You can also use ClickHouse®’s built-in BACKUP command, which doesn’t require a sidecar. Both are valid options depending on your needs. Alexander will also cover roadmap improvements to backup support coming later this year.
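As a rough illustration of the sidecar pattern, the pod template simply gains a second container that mounts the same data volume. The backup image tag, bucket name, and environment variables below are placeholders to show the shape, not a complete configuration; see the Altinity Backup for ClickHouse® documentation for the full set of options:

spec:
  templates:
    podTemplates:
      - name: clickhouse-with-backup
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.3.5.46.altinitystable
              volumeMounts:
                - name: data-volume
                  mountPath: /var/lib/clickhouse
            - name: clickhouse-backup
              image: altinity/clickhouse-backup:latest   # placeholder tag
              command: ["clickhouse-backup", "server"]   # run the REST API server as a sidecar
              env:
                - name: REMOTE_STORAGE
                  value: "s3"
                - name: S3_BUCKET
                  value: "my-backup-bucket"               # placeholder bucket
              volumeMounts:
                - name: data-volume                       # same volume claim template as the server
                  mountPath: /var/lib/clickhouse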


[28:26–35:16] – Advanced Features: Service Templates and CHI Installation Templates

Alexander: Let me introduce some more advanced features of the operator that allow you to build very complex production systems and data platforms.

Service templates

In the operator, we support several types of services:

  • Load balancer service — randomly distributes connections across all ClickHouse® nodes in an installation (across all shards and replicas)
  • Replica service template — creates a service for every individual ClickHouse® node, allowing you to connect to a specific node from within Kubernetes
  • Cluster service template — creates a service for a specific cluster within a CHI (a single CHI can actually host multiple ClickHouse® clusters)

Here’s a simple service template example:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallationTemplate
metadata:
  name: my-service-template
spec:
  templates:
    serviceTemplates:
      - name: chi-service-{chi}
        generateName: "service-{chi}"
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: LoadBalancer

The generateName field uses macros — {chi} is replaced with the name of the ClickHouse® installation — so you don’t need to hard-wire service names. For production on cloud Kubernetes, you’ll also add annotations to instruct the cloud provider how to configure the load balancer (internal vs. external, security settings, etc.).
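For instance, on AWS an internal load balancer can be requested through annotations in the service template’s metadata; the annotation key below is the classic in-tree one and may differ if you use the AWS Load Balancer Controller:

spec:
  templates:
    serviceTemplates:
      - name: chi-service-{chi}
        generateName: "service-{chi}"
        metadata:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        spec:
          ports:
            - name: http
              port: 8123
          type: LoadBalancer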

CHI Installation Templates

This is one of the most powerful — and unique — features of the Altinity Kubernetes Operator for ClickHouse®. CHI installation templates are a separate Kubernetes resource type (abbreviated CHIT) with the same structure as a CHI, used to define reusable configuration fragments.

Use cases:

  1. Shared defaults across multiple CHI installations — define once, apply everywhere
  2. Version management — define the ClickHouse® version in a template; update the template to upgrade all installations that reference it
  3. Inversion of control — automatically inject configuration into CHI installations matching a selector, without modifying the CHIs themselves

Here’s a simple version template:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallationTemplate
metadata:
  name: clickhouse-version-template
spec:
  templates:
    podTemplates:
      - name: default
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.3.5.46.altinitystable

And referencing it from a CHI:

spec:
  useTemplates:
    - name: clickhouse-version-template

When a new ClickHouse® version is released, you update the template — not every individual CHI installation.

A more powerful example: an installation template that opens the PostgreSQL wire protocol port in ClickHouse®. This requires multiple coordinated changes — opening the container port, defining a special service template for PostgreSQL connections, and enabling the PostgreSQL port in the ClickHouse® configuration itself. You can do all of this in a single template that applies automatically to matching CHI installations via the policy: auto setting:

spec:
  templates:
    podTemplates:
      - name: clickhouse-with-pg-port
        # opens port 5432 in the container
  useTemplates:
    - name: pg-port-service-template
  configuration:
    settings:
      postgresql_port: 5432
  matchLabels:
    environment: production
  policy: auto

This is an inversion of control pattern — you never touch the CHI installation itself, but behavior is injected automatically. We use this extensively in Altinity.Cloud®.


[35:17–39:34] – Rescaling: Adding and Removing Replicas and Shards with Zero Downtime

Alexander: If you’re running in production, you typically start small but tend to grow. The operator handles both scale-up and scale-down with zero service interruption.

Adding replicas:

When you increase the replica count in your CHI definition, the operator:

  1. Creates new pods and storage
  2. Automatically creates the schema on the new replicas (all existing tables are replicated)
  3. Waits for replication to catch up before including the new replica in the load balancer

That third step is critical. On a large cluster with hundreds of terabytes of data, the new replica will be lagging behind — and if you start routing queries to it immediately, you may get incorrect results. The operator waits until replication is complete before the replica becomes available. This is one more reason to use the load balancer: if you connect directly to a specific replica, you lose this protection.

Adding a replica requires no restarts. It can be done on a running cluster transparently.
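For example, growing the earlier layout from two replicas to three is a one-line change to the CHI, applied with kubectl apply (names reuse the CHI shown earlier):

spec:
  configuration:
    clusters:
      - name: my-cluster
        layout:
          shardsCount: 1
          replicasCount: 3   # was 2; the operator provisions the new replica and its schema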

Adding shards:

The same process: increase the shard count, and the operator creates new pods, storage, and schema — waiting for replication to catch up if those shards have replicas. What the operator intentionally does not do is automatically redistribute existing data to new shards. This is a deliberate design decision: data sets are typically very large and resharding is expensive. New data is distributed across all shards going forward; old data remains where it is until it naturally ages out.

Scaling down:

Scaling from three replicas to two works the same way — the operator removes the node from the load balancer, removes resources, and importantly, cleans up ZooKeeper/Keeper metadata that ClickHouse® doesn’t clean up automatically when a node is removed.


[39:35–43:50] – Upgrades and Configuration Management

Alexander: This is something every production operator user needs to understand.

When an image (version) changes: Requires a pod restart, which is expected.

When a pod template or volume claim template changes: The operator recreates the StatefulSet. Many of these changes can’t be made on existing Kubernetes objects — this is a Kubernetes limitation. However, StatefulSet recreation is performed in a rolling fashion, so nodes are updated one at a time (or in configurable batches).

When ClickHouse® configuration changes: This varies:

  • Some changes can be applied with a fast restart using SYSTEM SHUTDOWN, which doesn’t require pod recreation — it’s just a ClickHouse® process restart
  • Some changes can be applied with no restart at all if the ClickHouse® server detects the configuration change and hot-reloads it automatically

Rolling restarts — zero service interruption:

We’ve invested a great deal of effort to make restarts as smooth as possible. The restart sequence is:

  1. Remove the node from the load balancer
  2. Remove the node from remote_servers — so distributed queries skip this node
  3. Wait for running queries to finish — clients are not interrupted
  4. Perform the restart or pod recreation
  5. Add the node back to remote_servers
  6. Add the node back to the load balancer

On large installations with hundreds of shards, multiple shards can be processed in parallel to speed things up. The degree of parallelism is configurable. Upgrading 200–300 nodes at 3–5 minutes each would take forever serially — parallel rolling upgrades make this practical.

Per-replica configuration:

Most configuration is common across all nodes, but the operator allows per-replica customization. A practical example: setting the availability zone for each replica so ClickHouse® knows which AZ it’s in. This uses per-node config.d mappings internally while presenting a clean interface in the CHI definition.
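As a rough sketch of what per-replica customization can look like (values are illustrative, and the exact fields available per replica are documented in the CHI reference), an explicit layout can attach a different config.d file to each replica, for example to set a per-replica macro:

spec:
  configuration:
    clusters:
      - name: my-cluster
        layout:
          shards:
            - replicas:
                - files:
                    config.d/zone.xml: |
                      <clickhouse><macros><zone>us-east-1a</zone></macros></clickhouse>
                - files:
                    config.d/zone.xml: |
                      <clickhouse><macros><zone>us-east-1b</zone></macros></clickhouse>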


[43:51–47:24] – Maintenance Tasks: Stop, Restart, Suspend, Troubleshoot Mode

Alexander: For day-to-day operations, the operator supports several maintenance attributes:

Stop: Scale down all StatefulSets to zero pods. Storage and all other resources are preserved. When you’re ready to restart, just remove the stop attribute (or set it to null) and the cluster comes back up with all its data intact. Ideal for development systems you don’t want running 24/7.

spec:
  stop: true

Restart: Since the operator processes Kubernetes resources rather than executing commands, triggering a restart uses a task ID pattern:

spec:
  task:
    id: restart-2025-05-13
    type: restart

The operator tracks executed tasks by ID — each task ID is executed exactly once. To trigger a new restart, change the task ID.

Suspend: Stops reconciliation. Any changes you make to the CHI while suspended are not propagated to Kubernetes resources. Useful when you want to batch multiple changes and deploy them all at once — make your changes, then un-suspend.

Troubleshoot mode: When ClickHouse® is in a crash loop and you can’t figure out why, troubleshoot mode keeps the container running after the ClickHouse® process terminates. You can then kubectl exec into the pod, check logs, and try to start the process manually to see what’s crashing.
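If the attribute name matters to you, check the current CRD reference; in recent operator versions it is a top-level spec field, roughly:

spec:
  troubleshoot: "yes"   # keep the container alive even if clickhouse-server exits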

Advanced operator configuration:

The operator itself has an extensive set of configuration options — concurrency settings, inclusion/exclusion rules for reconciliation, probe configuration, various timeouts. These allow you to tune how the operator processes its reconciliation loop to make it as safe and efficient as possible for your specific environment. Full documentation with comments is available in the operator repository.


[47:25–50:05] – Monitoring: Prometheus Metrics Exporter and Grafana Dashboards

Alexander: Monitoring is critical for production operations. The operator includes a metrics exporter sidecar that exposes a Prometheus endpoint on port 8888. This sidecar queries ClickHouse® directly and collects a wide range of data: metrics, event counts, table sizes, query statistics, and much more.

ClickHouse® itself has its own Prometheus endpoint (added more recently), but it provides less information than our metrics exporter — so we continue to rely on our own approach.

Custom metrics:

Users who want application-specific metrics can create a custom_metrics table (or any table matching the structure of system.metrics). Typically this is implemented as a SQL view that extracts application-specific information from your data and exposes it through the same Prometheus endpoint. This allows you to monitor business-level metrics alongside infrastructure metrics in the same Grafana dashboards.
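As a toy illustration of the shape such a view can take (the table and metric names are made up, and the exact object the exporter looks for is described in the operator docs), reusing the ontime table from the demo:

CREATE VIEW default.custom_metrics AS
SELECT
    'OntimeRowCount'                  AS metric,
    toInt64(count())                  AS value,
    'Total rows in the ontime table'  AS description
FROM default.ontime;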

Grafana dashboards:

We provide ready-made Grafana dashboards covering:

  • ClickHouse® server metrics (CPU, memory, I/O, connections)
  • Query analysis (running queries, query throughput, query latency)
  • ZooKeeper/Keeper health metrics
  • Insert throughput (rows per second, inserts in flight)

Here’s an example of what one of our dashboards shows on a loaded system: 3 million rows inserted per second, along with queries started per second, concurrent inserts, and many other signals. These dashboards were built by our support engineers based on years of production operations experience and are available in the operator repository.


[50:06–54:18] – Deployment Automation: Terraform and Helm

Robert: There are several ways to automate the deployment of the Altinity Kubernetes Operator for ClickHouse® and your ClickHouse® clusters. The four most common are Ansible, Terraform, Helm, and plain manifests with shell scripts. Let me focus on Terraform and Helm.

Terraform

Terraform can deploy both your cloud infrastructure and the Kubernetes resources inside it — making it the most versatile option. If you’re going to pick one automation tool, Terraform is the one.

We have two key Terraform resources:

  • Altinity EKS Terraform module — built with help from our friends at AWS. Automates the full EKS setup for ClickHouse®, including all the wiring needed for cluster autoscaling and storage provisioning. Currently AWS-focused; GKE and Azure coverage is on the roadmap.
  • Altinity.Cloud® BYOC Terraform module — for organizations that want Altinity.Cloud® to manage their ClickHouse® environment but within their own cloud account. This sets up the bring-your-own-cloud model: ClickHouse® runs in your infrastructure on a Kubernetes cluster we provision, and you manage it through the Altinity.Cloud® UI or API.

Helm

Helm is very popular and we fully support it. Key points:

We have a Helm chart for installing the operator itself:

helm repo add altinity https://altinity.github.io/clickhouse-operator
helm repo update
helm install clickhouse-operator altinity/altinity-clickhouse-operator

  • We also have a sample application chart for ClickHouse® clusters. Helm is a convenient way to manage CHI manifests, especially when paired with Argo CD for GitOps-style deployments.
  • Important Helm limitation: Helm does not handle operator upgrades well. The problem is that upgrading an operator requires updating its Custom Resource Definitions (CRDs), which change the behavior of the API that Helm is trying to upgrade — Helm can’t handle this correctly. The workaround: manually apply the new CRDs first, then run helm upgrade. This is a known issue with all Kubernetes operators and Helm, not specific to ours.
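A sketch of that workaround (the CRD manifest path is illustrative; take the CRD files from the operator release you are upgrading to):

kubectl apply -f clickhouse-operator-crds.yaml    # apply the new CRDs first
helm repo update
helm upgrade clickhouse-operator altinity/altinity-clickhouse-operator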

[54:19–59:20] – Altinity Operator vs. ClickHouse, Inc. Operator: A Comparison

Alexander: For many years, the Altinity Kubernetes Operator for ClickHouse® was the only operator for ClickHouse®. It remains the second most popular database operator on GitHub, after the PostgreSQL operator, with around 2,500 GitHub stars. We introduced it in 2019 and have been running it in production ever since.

ClickHouse, Inc. has recently started developing their own operator, which they introduced as the “official” ClickHouse® operator for Kubernetes in January 2025. We were naturally curious, and many of our users are asking about the differences. It’s hard for us to be entirely unbiased, but here’s what we see:

Maturity and production use:

The Altinity Kubernetes Operator for ClickHouse® is used in production by companies including eBay, OpenAI, Globals, and many others — managing thousands of clusters. The ClickHouse, Inc. operator is early in its development. Notably, it does not appear to be the operator they use for their own cloud, which is an important signal — for us, running our own cloud on the operator has been a critical forcing function for quality.

Platform certification:

The Altinity operator is certified across a wide range of Kubernetes platforms — not just EKS, GKE, and AKS, but also DigitalOcean, Linode, OpenShift, and Oracle Cloud Kubernetes. Certification here means we have production deployments on those platforms.

ClickHouse® version support:

The Altinity operator supports any ClickHouse® distribution — Altinity Stable® Builds, official ClickHouse, Inc. builds, or Project Antalya builds. The ClickHouse, Inc. operator appears to be focused on official builds only.

Configuration management:

The Altinity operator has deep configuration management capabilities — per-replica settings, installation templates with inversion-of-control, comprehensive lifecycle hooks. The ClickHouse, Inc. operator appears to prioritize simplicity (“fire and forget”), which may be sufficient for some use cases but falls short for complex enterprise deployments.

Database engine support:

The Altinity operator supports all database engines: Ordinary, Atomic, Replicated, and others. The ClickHouse, Inc. operator enforces the use of the Replicated database engine, which is convenient but is a limitation for some use cases.

Keeper support:

Both operators support ClickHouse® Keeper. One interesting note: Keeper has been production-ready for several years, but we recently benchmarked it and found it is still slower than ZooKeeper on high-rate transactions. We’re preparing a pull request with changes to ClickHouse® to improve Keeper concurrency.

Backup:

The Altinity operator provides full backup management via Altinity Backup for ClickHouse® — including table-level backup and restore, multiple storage backends, and a complete operational workflow. Backup support is on the ClickHouse, Inc. operator roadmap for 2025.

Both operators are Apache 2.0 licensed.


[59:21–1:01:53] – 2025 Roadmap for the Altinity Kubernetes Operator for ClickHouse®

Alexander: We’re not always great at communicating our roadmap, so let’s use this opportunity to share what’s coming in 2025:

Post-start and pre-stop hooks — We need these to better support Project Antalya. For example, when swarm nodes are shut down, we want to do it gracefully using SQL hooks injected into the shutdown sequence. We’re adding special SQL hook support to the operator for this.

CHI and CHK integration — Better integration between ClickHouseInstallation (CHI) and ClickHouseKeeperInstallation (CHK) resources is coming later this year.

Faster replica provisioning via volume snapshots — Adding new replicas today requires replicating all data from an existing node, which is slow for large datasets. A better approach: take a persistent volume snapshot of an existing replica, spin up the new replica from that snapshot, and then only catch up replication for the delta between when the snapshot was taken and when the new volume was created. Some users have already tried this approach in the wild and found it very effective.

Plugin system — We’re working on making the operator extensible via plugins. First candidates: a backup plugin (configure and manage Altinity Backup for ClickHouse® through operator manifests) and a resharding plugin (heavy resharding logic that runs outside the operator but is controlled by it).


[1:01:54–1:09:52] – Q&A Session

Robert: We have time for some questions. Thank you all for listening — we’ve been working on this since 2019, so it’s coming up on seven and a half years of development. A lot to cover.

The Altinity operator GitHub is at the QR code on the slide — that’s the place to find information. If you have questions, join our Altinity Slack for interactive help. For feature requests or bugs, please file issues on GitHub — or better still, submit PRs. We’ve had many great community contributions.


Q: ClickHouse, Inc. is working on their own operator. What’s the path forward for Altinity? (from Paul Julian)

Robert: We are 100% committed to the Altinity Kubernetes Operator for ClickHouse®. Our entire business depends on it, as do hundreds — if not thousands — of companies. About 80% of our roughly 300 customers are using the operator in some form. We’ll keep pushing as hard as we can. As Alexander pointed out, having the operator underneath production systems that we actively support — including our own cloud — is a real forcing function for quality. Rolling upgrades at scale, for example, are a very hard problem, and years of production use is what gets that right.


Q: Can you use NVMe local disks with EKS and the operator?

Alexander: NVMe drives are local disks — they’re not network-attached — so you can’t easily separate compute and storage the way you can with EBS. This is actually one of the main reasons we don’t use NVMe as primary storage in our cloud, even though we experimented with it five years ago when we started operator development.

That said, we now have a specific use case for NVMe: as a high-performance cache for Parquet data in Project Antalya. We’ll be developing expertise and support for this use case. The main challenge is how to mount local disks properly in Kubernetes — standard provisioners assume storage is cluster-wide, not node-local. Once we figure out the best approach for our cloud, we’ll share it with the community.


Q: Does the ClickHouse® Helm chart support KEDA for scale-up?

Robert: Currently, no.

Alexander: We’ve thought about it, but ClickHouse® is typically a heavy stateful database. If you have a lot of data, automatic scaling mechanisms that add new replicas don’t make sense — replicating terabytes of data to a new replica isn’t the kind of thing you want to trigger automatically.

However, this changes significantly with Project Antalya swarm clusters. Swarm nodes are stateless compute servers that only query data from object storage — they don’t own any data themselves. That makes them ideal candidates for KEDA-style autoscaling. We may very well add KEDA support for swarms in the future. Thank you for the reminder!

Robert: This is exactly the gap we’re filling with Project Antalya. Normal ClickHouse® doesn’t have built-in separation of storage and compute — that’s what we’re adding. Swarm clusters let you scale query compute up and down dynamically, including using spot instances to reduce costs. Once swarms are available, dynamic autoscaling becomes much more practical.


Q: Where can I find documentation on custom metrics?

Robert: The best place is the Altinity Kubernetes Operator for ClickHouse® GitHub repository. Another option: clone the repository and use an LLM like Claude to ask questions about the code directly. We’re actually starting to use that ourselves when we need to remember how specific parts of the code work.


Alexander: One final note: the operator is an open source project with over 100 contributors. If you have something to contribute, you are absolutely welcome. We accept contributions and we love them.

Robert: Thank you all so much for attending. We hope this was useful. Feel free to ping either of us directly on LinkedIn, join our Altinity Slack, try the operator out, and let us know what you think. Thanks!


FAQ Section

Q1: What is the Altinity Kubernetes Operator for ClickHouse® and why should I use it?

The Altinity Kubernetes Operator for ClickHouse® is an open source Kubernetes operator that automates the deployment, scaling, configuration, and lifecycle management of ClickHouse® clusters on Kubernetes. It has been in production since 2020, powers Altinity.Cloud®, and is used by companies including eBay and OpenAI. It eliminates the complexity of manually configuring ClickHouse® replication, Keeper, storage, rolling upgrades, and more — reducing cluster setup from hours to minutes.

Q2: How does the operator handle rolling upgrades without downtime?

The operator performs rolling upgrades in a carefully sequenced way: it removes a node from the load balancer, removes it from remote_servers (so distributed queries skip it), waits for any running queries to complete, performs the restart or pod recreation, and then re-adds the node to remote_servers and the load balancer. On large clusters, multiple shards can be upgraded in parallel (configurable). This means even upgrading 200–300 node clusters can be done without service interruption.

Q3: How do I add a new replica or shard to a running ClickHouse® cluster?

Simply update the replicasCount or shardsCount in your CHI definition and apply it with kubectl apply. The operator handles the rest: creating pods, provisioning storage, creating schema on the new nodes, and — critically — waiting for replication to complete before adding the new node to the load balancer. No manual steps, no downtime.

Q4: What is a CHI installation template and when should I use it?

A ClickHouseInstallation Template (CHIT) is a reusable configuration fragment you can apply to one or many CHI installations. Use cases include: managing ClickHouse® version across all clusters from a single template, defining shared pod or service configurations, and automatically injecting behavior into clusters matching a label selector without modifying the CHI itself. This inversion-of-control pattern is especially powerful in multi-tenant environments like Altinity.Cloud®.

Q5: How do I back up ClickHouse® clusters managed by the operator?

The recommended approach is Altinity Backup for ClickHouse® — deployed as a sidecar container within the ClickHouse® pod. Because the backup container runs in the same pod, it shares the filesystem and can use hard links for efficient backups. Backups are written to object storage (S3 or equivalent). The operator also supports ClickHouse®’s built-in BACKUP command as an alternative that doesn’t require a sidecar.

Q6: How does the Altinity operator compare to the ClickHouse, Inc. operator?

The Altinity Kubernetes Operator for ClickHouse® has been in production since 2019 and is used to manage thousands of clusters at major companies. The ClickHouse, Inc. operator was introduced in January 2025 and is early in its development. Key Altinity operator advantages: it powers Altinity.Cloud® itself (a critical quality driver), supports all ClickHouse® builds and database engines, has deep configuration management capabilities including installation templates and per-replica settings, and provides full backup management. Both operators are Apache 2.0 licensed. See the Altinity blog for detailed comparisons.


Copyright Notice: This content is © Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

