Petabyte-Scale Data in Real-Time: ClickHouse®, S3 Object Storage, and Data Lakes

Recorded: May 22 @ 08:00 am PDT
Presenter: Alexander Zaitsev, Co-Founder & CTO @Altinity and Robert Hodges, CEO @Altinity

This webinar presents a practical, end-to-end framework for building real-time analytics systems at petabyte scale, delivered by Altinity CEO Robert Hodges and Altinity CTO Alexander Zaitsev. The session is grounded in real customer experience and is organized around three major capabilities: elastic compute scaling, affordable S3-backed MergeTree storage, and direct query of Parquet data lakes.

The session opens with a concrete requirements exercise using an application monitoring system as a canonical example: 300 million records per second at 200 bytes each, retained for 365 days with 25% annual growth. Working through the math shows that this easily approaches three petabytes of stored data, making traditional approaches to storage expensive and unmanageable.

The first technical section covers flexible compute. Alexander Zaitsev explains vertical and horizontal scaling strategies for ClickHouse clusters in public cloud environments, emphasizing that Kubernetes provides a convenient management layer for both compute and storage. The Altinity Kubernetes Operator for ClickHouse® is highlighted as the mechanism that turns complex multi-shard, multi-replica provisioning into simple YAML edits: changing an instance type or shard count triggers automatic reconciliation.

The second section covers scaling MergeTree tables with S3 storage. Robert Hodges explains how ClickHouse storage configurations organize disks, volumes, and policies. The session covers S3 disk definitions using macros for per-server path isolation, the S3 filesystem cache that uses local disk and the OS page cache to reduce API calls and improve query latency, and tiered storage with TTL MOVE so that fresh data lands on block storage for merging and migrates to S3 on a configurable schedule. Two critical settings are highlighted: perform_ttl_move_on_insert = 0 to prevent merges from happening in S3, and the avoidance of zero-copy replication for all but expert users.

The third section covers querying Parquet data lakes. Alexander Zaitsev explains how Parquet’s columnar structure is analogous to MergeTree’s, how ClickHouse can query individual or multiple Parquet files on S3 using the s3() table function and glob patterns, how virtual columns like _file and _path help filter by filename, how schema inference works with DESCRIBE, and how to write Parquet back to S3 with partition-based file splitting. The section also covers parallel querying with s3Cluster(), delta and Iceberg catalog support, and the important trade-offs between MergeTree-on-S3 and the Parquet data lake approach. The session closes with a cheatsheet for building petabyte-scale ClickHouse clusters and a Q&A covering zero-copy replication, Iceberg catalog performance, the ALTER TABLE MODIFY TTL command, Altinity Backup for ClickHouse®, and ClickHouse’s potential as a vector database.

Here are the slides:

Petabyte-Scale-Data-in-Real-Time-ClickHouse-S3-Object-Storage-Data-Lakes-2024-05-22 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

00:49 – Welcome and housekeeping
01:59 – Speaker introductions: Robert Hodges and Alexander Zaitsev; about Altinity
03:04 – Challenges of real-time big data: application monitoring requirements and storage math
05:05 – Overview of ClickHouse: architecture, columnar storage, scale
06:15 – System design approach: test on a single server, then scale
08:18 – The capacity testing loop: deploy, load, analyze, iterate
11:44 – Key optimizations at single-server scale
12:52 – Three pillars for going big: flexible compute, S3 MergeTree, Parquet data lakes
14:16 – Section 1: Implementing flexible compute; vertical scaling on cloud VMs
16:36 – Horizontal scaling: replicas vs. shards
18:56 – Kubernetes as a cloud management layer
20:22 – The Altinity Kubernetes Operator for ClickHouse®: custom resource definitions
22:09 – Scaling compute with a YAML instance type change
23:02 – Horizontal scaling with a shard count change
25:01 – Section 2: Scaling MergeTree with S3 storage; block storage vs. object storage
29:19 – Best practice: separate S3 endpoints per server using macros
31:47 – S3 filesystem cache: setup and benefits
34:52 – Tiered storage with TTL MOVE; why to avoid merges on S3
37:04 – Key S3 settings: perform_ttl_move_on_insert and prefer_not_to_merge
39:49 – Trade-offs of MergeTree on S3
41:10 – Section 3: Querying Parquet in read-only data lakes
43:52 – Parquet vs. MergeTree: structural similarities
44:53 – Querying single and multiple Parquet files with the s3() function
46:54 – Performance considerations: partitioning, glob filters, min/max metadata
49:50 – Writing Parquet back to S3 from ClickHouse
50:46 – Parallel reads with s3Cluster(); catalog support
53:44 – Trade-offs of the Parquet data lake approach
54:05 – Cheatsheet for petabyte-scale ClickHouse clusters
57:06 – Q&A: open-source shared storage, Iceberg catalogs, ALTER TTL, backup, vector search

Webinar Transcript

[00:49] – Welcome and Housekeeping

Robert: Welcome everybody. We are going to be talking about petabyte-scale data in real time using ClickHouse®, S3 object storage, and data lakes. This is a subject near and dear to our hearts. With me today is Alexander Zaitsev, and we’ll be presenting some of the latest experience we have working with customers on this topic.

A couple of housekeeping things: this talk is being recorded on Zoom. You will get a link to the recording as well as the slides within 12 to 24 hours. We do have a chat box and a question and answer box. If you have questions during the presentation, put them in either location and we’ll take them during the talk if they’re directly relevant, or we’ll have some time at the end.

[01:59] – Introductions and About Altinity

Robert: My name is Robert Hodges. I am a database geek and have been working on databases for about 40 years, on Kubernetes since 2018. I’m CEO of Altinity. With me today is Alexander Zaitsev, Altinity CTO as well as a founder of Altinity. He is an expert in high-scale analytics system design and implementation, and did one of the first major ClickHouse implementations in the United States.

Altinity is an enterprise provider for ClickHouse. We focus on support and services for building real-time analytic applications. We have Altinity.Cloud, which was the first cloud for ClickHouse in the Amazon market as well as GCP and Azure. We wrote the Altinity Kubernetes Operator for ClickHouse®, we maintain Altinity Stable® Builds for ClickHouse®, and we also maintain Altinity Backup for ClickHouse® and a number of other useful open-source tools.

[03:04] – The Challenge: Real-Time Big Data at Scale

Robert: Let’s talk about the problems now arising in systems we see. Application monitoring is one of the most popular use cases for ClickHouse these days. You may be building a system where you collect metrics from applications running in your own data centers or VPCs, ingesting them into a database like ClickHouse, and enabling queries for visualization, alerting, and other types of consumption.

If we look at the requirements for a typical application monitoring system: 200 bytes per record, 300 million records per second ingested, retained for 365 days, with some query response time requirements, and 25% annual growth. Put these together and run the math on storage alone and you get something close to three petabytes of data being stored and queried at the end of a year of operation, assuming 90% compression.

This makes storage a dominating cost in the operation of analytic clusters. The shared-nothing architecture that ClickHouse originally used requires block storage attached to nodes. To provide high availability and allow more users to query the data, ClickHouse replicates this storage. Block storage is very expensive. On Amazon, EBS is over 10 times more expensive per byte than S3 object storage once you factor in replication. And compute is growing proportionally as data sets grow, because heavier queries on larger data sets require more CPU and RAM.

[05:05] – Overview of ClickHouse

Robert: ClickHouse is a real-time analytic database designed to give rapid, near-real-time responses on very large data sets, often with data arriving very quickly. It’s kind of like MySQL in the sense that it understands SQL, runs practically everywhere, and is open source. It’s kind of like a traditional data warehouse like Vertica in the sense that it was originally designed with a shared-nothing architecture, uses columnar storage, parallelizes practically everything, and scales to many petabytes. We’re currently working on a system with 11 petabytes of compressed data, and there are systems larger than that on ClickHouse.

[06:15] – A Process for Designing Petabyte-Scale Systems

Robert: The first time I saw a good process for designing these systems was in a presentation from Cloudflare, who’ve been doing this for a long time. The eventual system is much larger than you can conveniently set up on a laptop or a few hosts, so you need a way to attack this problem.

The first step is to figure out your requirements. Know your approximate ingest rate, how long you want to hold the data, and what data growth looks like. Data doesn’t stand still. You add some data, the system becomes successful, and it’s going to grow.

The next step is to test the core of your system on a single ClickHouse server. If you have requirements that involve end-to-end latency or upstream and downstream systems, add ingest and query facilities to simulate the end-to-end operation. But make this as small as possible: one server that you can hammer and optimize before scaling out.

Once you’ve got that single server working and understood, you can multiply it. A lot of this talk is about how to scale things out once you’ve got one node working.

[08:18] – The Capacity Testing Loop

Robert: Here’s the capacity testing loop we recommend. Deploy an initial schema and perhaps an application. Put load on the system: ingest data, run queries, ideally simultaneously. Use real data, not test data, that is truly representative of what you’re going to process in production. That includes realistic ingest rates and realistic data sizes. Put that load on, run the test, analyze the results, identify bottlenecks, adjust the design, and repeat. Keep looping through this as fast as you can until you’ve done all the optimizations you reasonably can. At that point you have an established capacity number for a single ClickHouse host.

What optimizations to look for? The first is always reducing IO. Fix your schema: use codecs, change sort orders, play around with compression to get to that 90% or better compression rate. Other optimizations include reducing query joins to single scans where possible, and then going further to query optimization and ultimately adding more CPUs and RAM for queries that need more memory. Work these out while you have a single server, and once you’ve got the core working you can begin to think about going big.

[12:52] – Three Pillars for Going Big

Robert: Once you’ve hit maximum single-server capacity, there are three major things to look at. First, ensure flexible scaling of both vertical and horizontal compute, because as more load gets put on the system you’ll need to add more compute to access it. Second, extend MergeTree tables by allowing them to write data to S3, giving you access to cheap and effectively unlimited storage. Third, look at options for reading data out of data lakes. Data doesn’t have to live inside ClickHouse to be queried. It can live in Parquet on S3, and data stored that way is accessible to other applications as well.

[14:16] – Section 1: Implementing Flexible Compute

Alexander: Since we want to go big, we need to make sure we can provide enough resources to handle load as data grows. We cannot know in advance what kind of resources will be required. We may start with enough, but in a few months the data is so large that our server or cluster is not sufficient and we need to scale. That scaling capability needs to be accessible and easy, because moving ClickHouse to different hardware is a difficult exercise.

Public cloud is immediately convenient for scaling compute. You just request a new machine, and if you use network-attached block storage you don’t need to migrate your data at all. You can reattach the storage to a new instance. In AWS, the full vertical scaling process can take just two minutes: request a VM, install ClickHouse, reattach the volume. When you scale up a VM you can add more vCPUs, more RAM, increase storage size, and increase IOPS. All of these increase your performance and ability to handle more data.

Vertical scaling is often not sufficient on its own as data grows. ClickHouse provides two dimensions of horizontal scaling. Adding replicas is more traditional: it handles more queries per second and more concurrency, but doesn’t help with data size directly. Adding shards is more interesting. Every shard stores its own portion of the data. If your full data set is one petabyte, you might deploy 100 shards and store 10 terabytes per shard. When you add shards, you don’t just add space: you also increase overall CPU power and throughput, because even with very fast storage, if one node reads one gigabyte per second, 100 shards read 100 gigabytes per second. For the Big Data sizes we’re talking about, sharding is the main way to go.

[18:56] – Kubernetes as a Cloud Management Layer

Alexander: We’ve discovered over the last years that Kubernetes is a great technology for managing cloud resources. Inside Kubernetes there are abstractions for different things. A pod is an abstraction for compute, and in public cloud Kubernetes maps this abstraction to an actual VM. When you create a pod, Kubernetes communicates with the cloud to provision a virtual machine and deploys the pod to it. Similarly, a PersistentVolume abstraction represents block storage: Kubernetes provisions the volume from the cloud provider and mounts it to the pod. Kubernetes in this scheme is a management layer that provides a convenient API to manage cloud resources and makes it very easy to scale both computing and storage.

The challenge for ClickHouse specifically is that a ClickHouse cluster consists of multiple shards and multiple replicas, so you need to create a lot of pods, a lot of persistent volumes, mount them correctly, deploy configuration, and manage all of those Kubernetes objects. To make this easier, Kubernetes provides custom resource definitions (CRDs), and developers may build operators that implement special logic to process them.

[20:22] – The Altinity Kubernetes Operator for ClickHouse®

Alexander: This is exactly what we do with the Altinity Kubernetes Operator for ClickHouse®. You have a definition of your cluster written in JSON or YAML. The operator understands this definition and, based on it, deploys all the Kubernetes resources that represent your ClickHouse cluster. If you change the definition, the operator applies the change. You don’t need to care about all the wiring.

Here’s an example: a ClickHouseInstallation manifest defines that we’re deploying ClickHouse with an m6i.large instance type. To scale up the compute, we simply change the instance type to something bigger and apply the change to Kubernetes. The operator sees it and reschedules ClickHouse on a larger node. This process can take about two minutes for a single shard. If you have 100 shards, it takes longer, but it’s still the same one-line change.

Similarly, horizontal scaling: a ClickHouseInstallation that defines one shard and two replicas can be changed to two shards and three replicas. The operator sees it and provisions new pods, new instances, and new volumes. The new shards don’t have any data initially, but replicas will replicate automatically. This process takes around 10 minutes for a modest-sized change.

Kubernetes also allows mixing and matching different compute and storage patterns: local storage, S3 object storage, even object storage in a totally different cloud account. This allows full separation of storage and compute. We strongly advocate using Kubernetes in public cloud. It’s much easier than managing infrastructure manually.

[25:01] – Section 2: Scaling MergeTree Tables with S3 Storage

Robert: Now let’s talk about scaling the size of MergeTree tables using S3 storage. ClickHouse normally stores data on file systems that may consist of one volume or multiple volumes. It consists of metadata (the table definition and the names of files) and actual table data in binary columnar format, two files per column. The operating system page cache is a key feature here: when we read from storage, Linux automatically populates pages, so if we re-read them shortly after, we get them out of RAM rather than from storage. The data has encodings based on data types, delta functions, compression, sparse indexes, and secondary indexes, all of which make queries very fast.

What’s changed over the last few years is that we can now store not only on block storage but on object storage. Object storage has big attractions. First, it has unlimited capacity. No need to worry about volumes filling up. Second, it’s four to five times cheaper than block storage for the storage cost itself. Note, though, that cloud providers also bill for API calls. If you make a lot of calls to bring data down repeatedly, that will drive the cost up. Third, it’s shareable across many servers. And fourth, it’s the predominant format for organizing data lakes.

S3-compatible storage includes many things beyond Amazon: Google Cloud Storage, MinIO, Ceph, and modern ClickHouse versions can also use native GCS and Azure APIs.

[29:19] – Best Practice: Separate S3 Endpoints Per Server

Robert: A key best practice that may seem counterintuitive: even though S3 is shareable, we don’t recommend sharing it across servers. Give each ClickHouse server its own location to write and read S3 files. This means separate copies of the data, which doesn’t give you the full sharing benefit, but at the current state of the implementation it’s vastly easier to manage and has fewer problems.

The alternative, zero-copy replication, stores data in a shared S3 bucket. It requires very careful coordination, has bugs, and is still under active development. The people who use it successfully are experts who understand the code well and their use cases thoroughly. We don’t recommend it for the general case.

For per-server isolation, a simple trick using macros keeps all servers separate. The Altinity Kubernetes Operator automatically defines {installation} and {replica} macros for every replica. Using these macros in your S3 endpoint path means that in a cluster with 100 replicas, or even multiple ClickHouse clusters, every server writes to a different part of the S3 bucket.

[31:47] – S3 Filesystem Cache

Robert: The S3 filesystem cache is very easy to set up and has important benefits. After defining an S3 disk, you define a cache disk that wraps it: give it a path, a maximum size, and a few other options. By setting this up, all reads go through the cache first. If data isn’t in the cache, ClickHouse fetches it from S3, but subsequent reads for that data come from local disk. This reduces API calls, and because the cache is stored on the local file system, it benefits from the OS page cache: popular data ends up in RAM automatically, without any extra configuration.

[34:52] – Tiered Storage with TTL MOVE

Robert: The final best practice around storage organization is tiered storage. Data arrives and goes onto a volume called default, which is block storage. At some later time, it moves to the S3 cache tier. There’s a move_factor setting that says when the default disk hits 90% capacity, start moving data off. But there’s a better way to control this: the TTL MOVE clause on the table definition. You can say that data older than seven days should be moved to S3. Data arrives, lands on default storage, sits there for up to seven days, merges, and then gets pushed to S3. By the time it arrives in S3, it’s done merging and it doesn’t change anymore.

Why is this important? Because of merges. ClickHouse background merges take smaller insert blocks and combine them into larger parts that are easier and faster to query. If you put data straight into S3 without this tiered approach, those merges happen in S3, which incurs expensive API calls and is slower than local storage. Tiered storage lets the merge work happen on local block storage where it belongs.

Two key settings: perform_ttl_move_on_insert = 0 tells ClickHouse not to move fresh data to S3 immediately on insert even if its TTL has already expired, allowing it to sit and merge a bit first. This is important because failing to set this correctly can get you rate-limited by Amazon, which is a common failure mode. The prefer_not_to_merge setting on S3 volumes is more controversial: it advises ClickHouse not to merge parts in the S3 volume, but setting it aggressively can lead to TOO_MANY_PARTS errors, so use it carefully.

The trade-offs of MergeTree on S3 are well-understood at this point. On the good side: it extends existing MergeTree tables transparently, data can still be updated, tiered storage handles merging cleanly, caching boosts performance and reduces API costs, and best practices are now well-known. On the bad side: data inside ClickHouse’s S3 MergeTree cannot be accessed by other applications. You need to set up and manage storage carefully. Backup requires special attention, and shared S3 data between servers remains a work in progress.

[41:10] – Section 3: Querying Parquet in Read-Only Data Lakes

Alexander: In many cases, data is already stored somewhere in Parquet format on an S3 bucket. It comes from other sources, other logs, other pipelines. There are two ways to handle it: load it into ClickHouse, or access it from ClickHouse directly. Direct access is often more convenient because you don’t need to move data, and it’s cheaper because you don’t need additional ClickHouse storage. You can use object storage that’s already there, and someone else may already be paying for it.

If you actually look into the Parquet file format, it’s very similar to MergeTree in many respects: both are columnar, both use column encodings and compression, and both have column metadata for efficient queries. In Parquet that means min/max statistics in row group headers. In MergeTree that means the order by and sparse index. The S3 bucket containing Parquet files roughly corresponds to a MergeTree table: the bucket or prefix is the table, and individual Parquet files roughly correspond to MergeTree parts.

[44:53] – Querying Parquet Files from ClickHouse

Alexander: ClickHouse can query a single Parquet file if you have its URL using the s3() table function. For multiple files, ClickHouse supports two types of glob patterns. The first allows enumeration of expressions to filter multiple files. The second is the ** pattern that gets all Parquet files from a prefix or even a hierarchy of prefixes.

In addition to WHERE clauses on columns in Parquet files, ClickHouse can also use WHERE clauses on virtual columns. The _file virtual column maps to the object name representing a Parquet file, and _path is the full path. You can reference them directly in queries to filter down which files to read.

ClickHouse can also do schema inference. If you don’t know what data is stored in a Parquet file, just use DESCRIBE on the bucket and ClickHouse will display the structure. You can also create a ClickHouse table directly from a Parquet file structure using CREATE TABLE AS SELECT.

[46:54] – Performance Considerations for Parquet Queries

Alexander: Performance of Parquet queries in ClickHouse is quite fast, but it very much depends on how the data is organized. A single large Parquet file is relatively slow. With partitioned Parquet data or a bucket with multiple Parquet files, ClickHouse can read objects in parallel and get much better performance.

The most important thing is to apply all kinds of filtering to reduce the data being processed. With multiple terabytes or petabytes of Parquet data, you don’t want to read it all across the network every query. First, Parquet is columnar: if you select certain columns, ClickHouse only reads those columns and nothing else. Second, apply glob filters that include date patterns if your data is organized that way, to skip entire files without reading them. Third, ClickHouse uses min/max values stored in Parquet row group headers to check whether query conditions fall inside or outside those ranges, skipping entire row groups without reading their data.

Additionally, ClickHouse can apply filters using the _file and _path virtual columns to further reduce overhead.

In development at the time of this webinar: Bloom filter support for Parquet, and a metadata cache so ClickHouse can apply min/max filtering without even making an API call to read the file header.

[49:50] – Writing Parquet Back to S3 from ClickHouse

Alexander: It’s also possible to write Parquet back from ClickHouse. If you have data in ClickHouse and want to export it, for example to offload old data or share it with other applications, you can use an INSERT INTO FUNCTION s3(...) statement. Make sure to use partitioning in the PARTITION BY clause when writing, because this is very important: you want to write multiple files organized by partition rather than a single massive file. A typical partitioning scheme is by date, writing one file per date.

[50:46] – Parallel Reads with s3Cluster() and Catalog Support

Alexander: In addition to single-node queries, ClickHouse has the s3Cluster() table function that utilizes the full cluster when working with data on S3. If you have a cluster with 100 nodes, you can use s3Cluster() and read Parquet data efficiently with all 100 servers. ClickHouse lists the bucket, finds the Parquet files, distributes them across the processing nodes, and every node loads its assigned files. The performance scales nearly linearly with the number of nodes.

ClickHouse also supports data catalogs: Delta Lake, Hudi, and Iceberg table functions allow you to connect to data organized in those catalog formats. We confirmed that Delta Lake and Hudi work out of the box, and Iceberg works when the necessary metadata file is present.

The trade-offs of the Parquet data lake approach are significant. On the good side: data that’s already in S3 can be accessed directly and efficiently. Parquet compression can actually be better than MergeTree, in part because ClickHouse’s MergeTree is optimized for speed and uses fast codecs, while Parquet’s file organization allows different codecs and compression within a single file. The same Parquet data can also be used by Spark and other tools, eliminating data silos. On the bad side: you cannot update or delete data from ClickHouse via Parquet. There’s no built-in merge mechanism for Parquet files, so compaction and reorganization requires external processes. And if you want tiered storage similar to what MergeTree provides, you have to implement it yourself upstream.

[54:05] – Cheatsheet for Petabyte-Scale ClickHouse Clusters

Alexander: Here’s a summary of things to apply when building at this scale.

First, understand design patterns for ClickHouse storage and compute. This webinar and other Altinity resources cover that. Understanding how ClickHouse works is extremely important to building large systems efficiently.

Second, benchmark on a small system. Benchmark on a single node. Don’t try to deploy a 100-shard cluster at the very beginning. Start with a single node, get maximum performance from it, then build a small sharded cluster of around five shards, understand how distributed queries work, and only then go bigger.

Third, S3-backed MergeTree works especially well when you have a lot of data. For smaller amounts, say several terabytes per node, block storage is still recommended. Beyond that, S3-backed MergeTree becomes the only practical option.

Fourth, Parquet is an interesting alternative when data already exists somewhere in that format. It’s not particularly interesting to write data from ClickHouse to Parquet and then query it back, but if data comes from other sources it can be accessed from ClickHouse quite efficiently.

Fifth, Kubernetes is the way to go to manage compute and storage in a flexible, controllable way. Strongly recommend it if you haven’t already.

Sixth, ClickHouse is not static. It evolves very rapidly. Every release brings new features related to storage, compute, and object storage. Some things discussed here may be greatly improved in three to six months and new features will definitely appear.

Seventh, design for growth. One petabyte today could be 10 petabytes in one or two years. At a previous company of Alexander’s, data size tripled every year for almost 10 years in a row. Design for that initially, otherwise it will be very hard to scale later.

[57:06] – Q&A

Robert: There’s a great question about open-source shared storage options similar to SharedMergeTree from ClickHouse Cloud.

Alexander: SharedMergeTree is a great feature of ClickHouse Cloud. A few months ago we started a community project to improve ClickHouse’s support for object storage in a way that would allow users to use it efficiently. SharedMergeTree itself is not necessarily needed, but some features it provides are needed, particularly the clear separation of storage and compute, and robust zero-copy replication. The zero-copy replication that currently exists in ClickHouse is not very reliable. We started in September to work on a replacement for zero-copy replication, which is not yet complete. We’re also working on improving the S3 Plain storage type, which stores data in S3 in a much simpler way. If the S3 Plain type is used, separation of storage and compute becomes trivial because any ClickHouse compute node can read the data directly without needing a shared metadata coordination layer. We hope to deliver something by end of year.

Robert: On the question of whether ClickHouse can auto-detect nodes in a cluster and allow shared S3 paths like Spark: this is partially available in ClickHouse Cloud with dynamic cluster configuration. In open-source ClickHouse, the s3Cluster() table function effectively does this for reading Parquet: it grabs as many resources as it can and reads files across a bunch of hosts without additional setup. ClickHouse also has a feature called dynamic sharding where, if you have 10 replicas, it can execute a query in a sharded way across those replicas because each has a copy of the data. The next step would be 10 replicas looking at the same S3 data with no local copies, which is closer but limited in current implementations.

Robert: On changing TTL for an existing table: you can do ALTER TABLE ... MODIFY TTL directly without freezing the table. That’s a pretty flexible operation.

Robert: On backups for S3-backed MergeTree: we recommend Altinity Backup for ClickHouse®. It has two ways to back up an S3-backed table: server-side copy or using the embedded BACKUP/RESTORE functionality. The built-in BACKUP command also works, but Altinity Backup provides infrastructure around it so you know where your backups are located and can manage retention and automation. We maintain it and are somewhat biased, but it is very widely used.

Robert: On ClickHouse as a vector database for RAG and LLM applications: ClickHouse does have nearest-neighbor search capabilities and vector storage. We haven’t benchmarked them extensively, and this use case is not yet very prominent among our users. ClickHouse Incorporated has published at least one article about this that would be worth checking out.

FAQ Section

Q: When should I use S3-backed MergeTree versus querying Parquet data directly from S3?

A: The right answer depends on where your data comes from and what you need to do with it. S3-backed MergeTree is the best choice when you are ingesting data into ClickHouse, need to update or delete records, or want the full performance benefits of MergeTree’s sparse indexes and merge-time optimizations. It extends existing ClickHouse tables transparently and has well-understood best practices for tiered storage and caching. Direct Parquet querying is the best choice when data already exists in Parquet format in S3 from another system and you want to query it without copying it. It saves storage costs and eliminates data movement, and the same files can simultaneously be used by other tools like Spark. The main limitation is that Parquet data cannot be updated or deleted from the ClickHouse side, and there is no built-in compaction process.

Q: What is tiered storage and why is it strongly recommended for S3-backed MergeTree?

A: Tiered storage means configuring ClickHouse so that new data lands on fast local block storage first, and then migrates to S3 after a configurable time period using the TTL MOVE clause. This is strongly recommended because ClickHouse’s background merge process, which compacts small insert blocks into large efficient parts, performs far better on local block storage than on S3. Merges in S3 are slower and generate many more API calls, which drives up both latency and AWS cost. With tiered storage, data merges on local disk during its hot phase, and by the time it migrates to S3 it is already well-compacted and rarely changes. For more detail, see the example of a table at S3 with cache in the Altinity Knowledge Base.

Q: Why should each ClickHouse server have its own separate S3 path instead of sharing one bucket?

A: Sharing a single S3 bucket across multiple ClickHouse servers requires zero-copy replication, a feature that is still under active development, has known bugs, and requires expert-level understanding of ClickHouse internals to use safely. With separate paths per server, each server manages its own copy of the data independently. This is simpler to configure, easier to debug, and avoids the complex metadata coordination that shared storage requires. A practical way to implement per-server separation is to use ClickHouse macros in the S3 endpoint path. The Altinity Kubernetes Operator for ClickHouse® automatically provides {installation} and {replica} macros that make this straightforward in Kubernetes environments.

Q: How can I optimize ClickHouse queries on Parquet files in S3 to avoid reading too much data?

A: There are several layers of optimization. First, use SELECT to name only the columns you need: since Parquet is columnar, ClickHouse only reads those column files. Second, apply glob patterns to limit which files are read at all, for example a date-based pattern. Third, ClickHouse can use min/max statistics stored in Parquet row group headers to skip entire row groups where the query condition falls outside the stored range. Fourth, you can filter on virtual columns _file and _path to exclude entire files from a read. Altinity engineers have also contributed Parquet performance improvements to upstream ClickHouse and are working on a Parquet metadata cache that would allow row group filtering without making a network call to read the file header.

Q: How does the Altinity Kubernetes Operator for ClickHouse® simplify scaling?

A: The Altinity Kubernetes Operator for ClickHouse® defines a ClickHouseInstallation custom resource that describes the desired state of a ClickHouse cluster in YAML. When you submit or change this resource, the operator reconciles the actual cluster to match: provisioning or deprovisioning pods, persistent volumes, services, and configuration files as needed. To scale vertically, change the instance type in one place and the operator reschedules ClickHouse on larger nodes. To scale horizontally, increment the shard or replica count and the operator provisions new shards automatically. New replicas synchronize data automatically through ClickHouse’s replication mechanism. This abstraction removes the need to manually manage the dozens of Kubernetes objects that a multi-shard, multi-replica ClickHouse cluster requires.

Q: What is the recommended approach for backing up S3-backed MergeTree tables?

A: The recommended tool is Altinity Backup for ClickHouse®, which is open-source and maintained by Altinity. It supports two methods for backing up S3-backed tables: server-side copy within S3, and using ClickHouse’s embedded BACKUP/RESTORE commands internally. Using Altinity Backup gives you infrastructure around the process: you can track where backups are stored, configure retention policies, automate scheduling, and restore reliably. The built-in BACKUP SQL command also works, but Altinity Backup wraps it with tooling that makes it easier to manage at scale.

© Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE^® SOLUTIONS

Get in touch with ClickHouse experts.

Related:

Leave a Reply Cancel reply