Using S3 Storage and ClickHouse®: Basic and Advanced Wizardry

Recorded: August 29 @ 09:00 am PT
Presenter: Robert Hodges, CEO @Altinity

In this deep-dive webinar, Altinity CEO Robert Hodges and CTO Alexander Zaitsev cover the full spectrum of S3 object storage integration with ClickHouse®, from fundamental concepts through advanced production considerations. The session begins with a review of why the traditional shared-nothing, block-storage architecture has capacity and cost limitations at scale, and why S3, at four to five times cheaper than EBS, is an attractive alternative for large data volumes.

The bulk of the presentation focuses on how ClickHouse MergeTree tables can be configured to use S3. Robert walks through the three key storage configuration layers: disks, volumes, and storage policies, demonstrating how to set up an S3 disk, how to add a local disk cache to recover near-EBS query performance, and how to use tiered storage with a TTL rule to automatically move data from hot block storage to cold S3 after a defined interval. He also covers the two other data movement options: the move_factor volume setting and the manual ALTER TABLE … MOVE PARTITION command.

A dedicated section addresses the important operational topic of background merges and S3 API costs. Because S3 charges per API call, heavy merge activity on S3-backed tables can generate unexpected costs and even rate-limit errors from Amazon. The recommended mitigation is tiered storage with prefer_not_to_merge and perform_ttl_move_on_insert=0 set on the S3 volume, so data is fully merged on block storage before it ever touches S3.

The webinar then examines S3 in clustered, replicated deployments. Robert explains the difference between the default model, where each ClickHouse server maintains its own copy of S3 data, and zero-copy replication, where servers share a single copy of the data by managing reference pointers. Alexander provides additional context on the complexity of reference counting and why zero-copy replication was still maturing at the time of this webinar, with ongoing fixes landing in recent releases.

A significant section covers what Alexander describes as the design differences between the current open-source S3 implementation and ClickHouse Inc.’s closed-source SharedMergeTree engine, which solves metadata separation more fundamentally by moving data references to ZooKeeper or Keeper and allowing them to be managed across servers in a unified way. Alexander emphasizes that SharedMergeTree is closed-source and there are no plans to open-source it, making ongoing community improvement of the open-source S3 path an important priority.

The final segment covers ClickHouse as a data lake reader and writer using Parquet on S3. Robert demonstrates both the S3 table engine (for simple reads and writes) and the more powerful s3() table function, which supports glob patterns for reading multiple files and handles partitioned writes well. The s3Cluster() function, which distributes query work across all servers in a cluster, rounds out the data lake discussion.

The session closes with an extended Q&A where Alexander addresses questions about backup approaches for S3-backed tables, API rate limits, zero-copy replication production readiness, CPU impact of S3-backed storage, and the current state and roadmap for S3 capabilities.

Here are the slides:

Using S3 Storage and ClickHouse–Basic and Advanced Wizardry-2023-08-29 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

0:03 – Introduction and housekeeping
1:54 – Speaker introductions: Robert Hodges and Alexander Zaitsev
3:44 – ClickHouse overview and the MergeTree table engine
7:05 – Performance of block-storage MergeTree tables
8:24 – Limitations of shared-nothing architecture at scale
9:33 – Why S3 object storage is attractive: cost, capacity, shareability
11:05 – S3 tradeoffs: different API, no file updates, no automatic caching, API cost
12:00 – Overview of two S3 usage models: MergeTree-backed and data lake
12:26 – Setting up ClickHouse to use S3: disks, volumes, storage policies
13:41 – Storage configuration example: defining an S3 disk and storage policy
15:41 – Tiered storage: layering hot block storage and cold S3
16:50 – Tiered storage configuration walkthrough
17:43 – S3 cache: adding a local cache layer on top of S3
18:37 – Cache configuration example
21:34 – Performance comparison: EBS vs. S3 direct vs. S3 cached vs. tiered
24:09 – Background merges, API costs, and S3 rate limiting
26:40 – S3 in ClickHouse clusters with replication
28:13 – Object storage replication: separate copies vs. zero-copy
30:52 – How to enable zero-copy replication
31:36 – Current state of S3 table storage in clusters: what works and what is hard
35:17 – SharedMergeTree: how it solves the metadata problem (closed-source)
38:19 – Using S3 as a data lake: introduction to Parquet
39:38 – S3 table engine for reading and writing Parquet
40:36 – The s3() table function: reading from multiple files with glob patterns
41:33 – s3Cluster(): distributing Parquet queries across a cluster
41:58 – Performance of Parquet queries versus MergeTree on S3
42:41 – Partitioned writes using the s3() table function
47:18 – How to pass S3 credentials: four methods
49:12 – Trade-offs with Parquet on S3
50:13 – S3 telemetry improvements in ClickHouse
50:58 – system.remote_data_paths table for tracking S3 files
51:39 – Coming attractions: community S3 improvements
54:08 – Contact information and Q&A
55:24 – Q&A: zero-copy replication production readiness (Alexander)
56:46 – Q&A: does larger part size reduce API costs?
58:38 – Q&A: current separation of references and data in open-source S3
1:03:34 – Q&A: CPU impact of S3-backed storage
1:03:40 – Q&A: using s3_plain disk as a write-once storage tier
1:05:31 – Q&A: backup options for S3-backed ClickHouse clusters

Webinar Transcript

[00:03] — Introduction and Housekeeping

Robert: Hello, everybody, and welcome. We’re going to be presenting on S3 storage and ClickHouse® today, and we’re going to be showing you basic and advanced wizardry for using this powerful feature for storing large amounts of data.

My name is Robert Hodges and I’m Altinity CEO. I’m joined today by Alexander Zaitsev, who is the Altinity CTO and also prepared substantial portions of this talk. He’ll be with me to provide color commentary and also to go deep on questions at the end.

Before we get going I’d like to cover a couple of things which will help you enjoy this webinar more. First, let me assure you it’s being recorded. We will send you a link to the recording as well as to these slides after the webinar, probably within about 24 hours.

Second, this webinar is backed by a bunch of examples that are out on GitHub. I’ll provide the link at the end, but the code samples you’re seeing here you can actually go get from GitHub and see how we did some of these things. No need to take frantic notes or copy and paste out of the slides.

Third, we will have time for questions. If you want to pose questions as the talk is proceeding, feel free to put them in the question-and-answer box that’s available on the Zoom console, or you can stick them in the chat. If it’s something directly relevant to the talk I’ll pause and answer it; otherwise we’ll take them at the end.

[01:54] — Speaker Introductions

Robert: So let’s go ahead and dive in and do some introductions. My name once again is Robert Hodges. I’ve been working on databases for 40 years, and I’ve been working with Kubernetes and distributed systems since 2018. My day job is CEO of Altinity.

I’ve also got Alexander Zaitsev with me. Alexander, if you’re able to unmute we can test your audio and let you introduce yourself as well.

Alexander: Hello, hello everybody.

Robert: Alexander is, as you can see on the slide, an expert in analytics. He was the person who really was the motivating force behind founding Altinity as a service provider for ClickHouse. Among the services and software that we provide, we have Altinity.Cloud, which runs on Amazon, GCP, and any Kubernetes. We’re also the producers of Altinity Stable® Builds for ClickHouse®, which some of you on this call use, and we’re the authors of the Altinity Kubernetes Operator for ClickHouse® as well as other software like Altinity Backup for ClickHouse®.

Before I leave the introductions, I want to give a special shout-out to Denis Yurichev. Many of you know him as Denny Crane. He provided some of the key examples we’re using in this talk and is a past master at use of S3 in production situations.

[03:44] — ClickHouse Overview and the MergeTree Engine

Robert: I’m going to start by going into how S3 generally fits into ClickHouse, and we’re going to proceed pretty quickly on this.

We’ll start with the obligatory slide on ClickHouse for those of you who are not familiar with it. It’s a real-time analytic database, kind of like MySQL in its flexibility, portability, and open-source license, but it deals with billions or trillions of rows of analytic data. It started out with a shared-nothing architecture, which means budget nodes connected on a network with their own storage. As you’ll see in this presentation, that architecture is now beginning to evolve into a fully separated compute and storage model. We can do this in multiple ways, but that’s another talk. Like all analytic databases, data is stored in columns, with parallel and vectorized execution, very good scaling, and it’s incredibly popular, one of the most popular analytic databases on GitHub with something like 31,000 stars.

What we’re going to do is start with the basics of putting a table into ClickHouse. We’ll be focused on the MergeTree table engine, which is your workhorse for big data. It is designed to handle billions to trillions of rows distributed potentially over many machines.

In a table definition we’ll also typically have a PARTITION BY clause, which tells us how to split the table up into parts. That’s very important because it gives ClickHouse some guidance about how to divide the table up into reasonable chunks that it can then distribute to different locations. We’ll also have ordering, which is critical in column databases. It enables us to create contiguous stretches of data, useful for both query performance and compression, which as you know are closely linked.

When we’re using traditional shared-nothing storage, here’s where the data goes. You have ClickHouse running as a process, it’s got the MergeTree engine, the data is on a disk, and you have a combination of metadata, which tells you how to find the data, like your database and table structure, and the actual table data itself. You also have the Linux page cache, a very important cache used to speed up operations on block devices.

[07:05] — Performance of Block-Storage MergeTree Tables

Robert: This basic setup has very good performance. We have some test queries we’ll be running again and again throughout the talk. If we set them up so that ClickHouse is forced to read everything from storage, using a setting shown at the bottom, we get about half a second maximum for these queries. The first query, because it uses an index, is extremely fast, just a few milliseconds. Once the data is sitting in the OS page cache it drops by about half, and in other cases you get an even bigger speedup. But in fact the data we’re dealing with is just not that large.

For reference, we’re testing on ClickHouse 23.7, using m6i instances and EBS gp3 storage with 1,000 megabytes per second throughput, a pretty performant setup. This speed is what has made ClickHouse popular.

[08:24] — Limitations of Shared-Nothing Architecture at Scale

Robert: As data grows, and many of you have found this, the shared-nothing architecture has some big limitations.

First, capacity is limited. It’s practical to store far more data than most people suppose using this model, but once you get in the range of 30 to 50 terabytes it becomes difficult to manage. It’s not that you can’t continue to add more devices, but things like backups get complicated. If a server flakes out and you have to replace it, recovery is very slow.

Second, block storage, whether SSDs or EBS, is relatively expensive. And in the replication model used in the shared-nothing architecture, you make multiple copies, so you multiply that cost.

[09:33] — Why S3 Object Storage Is Attractive

Robert: For that reason it’s been pretty attractive for a long time to think about adding S3. S3-backed MergeTree storage gives you some very desirable properties.

S3 has virtually unlimited capacity. Nobody on this call is likely to hit it. For pure storage costs it runs about four to five times cheaper than something like Elastic Block Store on Amazon, and even cheaper on Google because they have higher-priced block storage. It is shareable, meaning many processes can see it. And it is commonly used for data lakes. In fact the majority of data lake information is stored in S3.

One thing I should note is that when talking about costs it’s really important to recognize that the costs are not just the at-rest storage cost. There are four other ways S3 can run up costs, one of the most important being API calls. We’ll talk about that a little later.

[11:05] — S3 Tradeoffs and Differences from Block Storage

Robert: This sounds great, but you can’t just plug it in. S3 object storage differs from block devices in fundamental ways. First, it’s a completely different API. Second, you can’t update files: unlike block storage where you can just rewrite a block, with S3 that option is much more limited. There’s no automated caching. And there are extra costs for traffic, such as egress fees and API calls. It has a completely different cost structure from block storage.

What that means is we’re going to have to do some real work to fit this in. In the next section we’ll talk about how that happens for MergeTree storage.

Just to preview: we’re going to talk about two different ways to use S3. One is to use it for MergeTree storage, which we’ll call S3 MergeTree. The other is to use it in a data lake style, for example reading read-only data in Parquet. Both are powerful models.

[12:26] — Setting Up ClickHouse for S3: Disks, Volumes, and Storage Policies

Robert: Let’s dig into setting up ClickHouse to use S3 for MergeTree storage. You’re going to have to configure a number of things. First you’ll set up what’s called a disk, which defines a storage device that can access S3 and the bucket where you put files. You may also need to configure caching, which as we’ll see from performance numbers is very important. We’ll talk about volumes and storage policies, which make these devices visible in SQL so you can attach them to your MergeTree table.

The basic thing you’re going to be doing is setting up what are called storage configurations. Here’s a picture of how they work. You actually have a storage configuration in every ClickHouse server you run, called default, which writes to your local disk. You can add additional storage configurations. One we call s3direct, which writes to a bucket on S3 instead of local block storage. A third option is an s3_cache storage policy, which uses the S3 disk but passes reads through a cache, so if we’ve read data from S3 before there’s a reasonable chance it’s stored locally and we can grab it without incurring extra costs.

[13:41] — Storage Configuration Example: Defining an S3 Disk

Robert: Here’s an example of how storage configuration works for the simple case where we just want to write to an S3 disk. Inside the storage configuration tag there’s a disk section and a policy section. For the disk we specify type S3, an endpoint, and a metadata path. That metadata path is very important because it provides information to ClickHouse about where the actual files are located in S3 storage. That defines the disk.

Then we have a policy which uses it. The policy sets up a volume, and the volume just points back to that disk we just defined. Now this policy can be attached to a table, and that table will be able to read and write from S3.

[15:41] — Tiered Storage: Layering Hot Block Storage and Cold S3

Robert: Before we actually start using S3 tables, let’s talk about another popular way to organize storage: tiered storage. It’s a feature of ClickHouse that has existed for a number of years. We’ll set up a policy called s3_tiered with two volumes. The default volume is always present; we give it priority 1, which means data lands there first, on local disk or SSD. We then have a second volume pointing to S3 with a lower priority.

The idea is that data lands on hot block storage and then at some later time moves to S3 automatically. There are good reasons to do this, as we’ll see.

Here’s the configuration. We define the policy s3_tiered, list the volumes in order, call one hot and the other cold, say that inserts land on block storage, and then at a later time move to S3 disk when you hit 90% capacity on that disk. That’s what the move_factor means. That’s one of several ways to move data. We’ll dig into others shortly.

[17:43] — S3 Cache Configuration

Robert: The final thing to talk about before we actually apply these storage options is caching. I mentioned that the cache is important to speed up S3 queries. What we can do is enhance our S3 tiered policy so that instead of pointing directly to S3, the second volume points to an s3_cache device, which is a special type that refers to another device, in this case an S3 bucket. That means you’re using the cached device, but it’s backed by S3. So go get your data from there.

Here’s the cache device configuration: you can see it’s type cache, it refers to the S3 disk in the disk tag, it has a path for where the cache will be located, and a size. In this example it’s pretty small but you can make it any size you want, 500 gigabytes, a terabyte, whatever you have available storage for.

[19:02] — Applying Storage Policies to Tables

Robert: Okay, that was a lot of background configuration, and as usual with ClickHouse there’s a fair amount of XML, but it’s pretty easy to set up and the syntax is straightforward. Once you’ve done that you can start using it.

Here’s an example taking that table we introduced early on and creating a new version of it with a storage policy. The storage policy points to s3_direct, which means our metadata will be stored on block storage but the actual data will be written to S3.

If we want to use tiered storage with a TTL rule, here’s how we do it. We set the storage policy to s3_tiered and add what’s called a TTL. A TTL, or time-to-live, is a very powerful feature of ClickHouse tables. It allows you to define a rule based on data in the table for moving things from one storage tier to the next. In this particular case the TTL says: using the D column (which is a date), seven days after today the data will move to the volume s3_cache. What that means is: keep this for a week on block storage and then move it to S3. This happens automatically in the background; essentially a background merge process finds data that matches this predicate at intervals and moves it over to S3.

[20:59] — Two Other Ways to Move Data

Robert: There are other ways to move data. We already showed the move_factor setting in the XML a few minutes ago, which triggers movement when you hit a certain disk capacity threshold. And finally you can also just move data yourself using ALTER TABLE ... MOVE PARTITION, which moves a partition to a specified volume. For example, moving it from the hot volume to the cold volume. You can do this at any time.

[21:34] — Performance Comparison: EBS vs. S3 Configurations

Robert: After setting up all these types of tables, it’s nice to check out the performance. We’re going to try our test queries on tables configured in different ways. Recall that each table contains 100 million rows. We’re using an m6i.2xlarge with eight vCPUs. Let me share what we see:

EBS, particularly when reading from storage, is a clear winner. The queries complete in half a second even when pulling from storage. If we go to a table backed purely by S3 direct, the longest query takes almost three seconds. If we set this up with a cache but force the cache to be empty, the performance is initially the same as the S3 direct table. But once the cache is loaded it has the same performance as EBS. The cache is reasonably smart; it doesn’t pull entire files, it pulls sections of them. As long as you’re querying the same sections of data you’ll have essentially the same performance as EBS.

Finally, tiered storage is in the middle ground. The hot storage is on EBS so that runs fast, but the more data you have in S3, the more that queries look like S3 performance.

The key takeaway: S3 performance is actually pretty reasonable, especially for data you don’t query heavily and where you just want it sitting there ready for use. EBS is going to cost about five times more, so the performance kind of matches that.

[24:09] — Background Merges, API Costs, and S3 Rate Limiting

Robert: Let’s talk about background merges, because this is an important topic for S3. I mentioned that it’s not just storage cost you have to worry about; it’s also API calls.

If you use S3 direct, so your table is purely backed by S3, when merges happen they have to go out and touch a lot of files. They read files, pull them down, sort and merge them in memory, and write them back out again. You’re going to have a lot of activity in S3. One of the things that can happen if you’re doing a lot of merges in S3, for example during a migration from another system where you bring a lot of data across at once, is you’ll actually get errors from Amazon where operations fail because Amazon will rate-limit your API calls.

If you use tiered storage you actually avoid that problem. What will happen is merges typically occur in the default device. Data arrives, sits there for a while, fully merges over hours or days, and by the time it moves to S3 it really doesn’t need to merge anymore. There are two important S3 volume settings to pay attention to. One is prefer_not_to_merge, which says: once we’re in S3, don’t do background merges there unless absolutely forced to. The other is perform_ttl_move_on_insert=0, which says: when data arrives in default, make sure it stays there, don’t try to do the TTL right away. We’d really like it to hang around for a while so it can get properly merged before moving to S3.

[26:40] — S3 in ClickHouse Clusters with Replication

Robert: So far we’ve been talking about S3 storage in single servers, and in many cases that’s how people are running it. But many of you run clusters that replicate data. Here’s what it looks like if you’re using local storage: you have a ClickHouse instance, it has a disk with metadata (your table definitions and pointers to your database structure and parts) and the actual data. The metadata is shared via ZooKeeper. ClickHouse servers point to ZooKeeper, learn about different parts available on different servers, and the ReplicatedMergeTree engine automatically replicates parts by learning about them from ZooKeeper and pulling them over.

[28:13] — Object Storage Replication: Separate Copies vs. Zero-Copy

Robert: It gets more complicated when we use S3. If we’re in a situation where we’re replicating with S3 storage, ZooKeeper still stores shared metadata, but the data is no longer stored on disk. Instead you’ll just have references, and those references in the file system are actually pointers to S3 object storage.

This means that to interpret the object storage we must have the full metadata and data references present on the ClickHouse server. The server then uses those to go look things up in object storage and fetch data when it needs to read it.

Now this brings up a question: for any particular part controlled by a particular server, what is it actually pointing to? There are two ways this can work. By default in current releases, each ClickHouse server gets its own S3 files. The data references point to separate S3 files per server. In effect ClickHouse replication is also replicating parts in S3. This is not very efficient for large data sets.

If you’re storing a petabyte of data, what you’d really like to do is take advantage of the fact that S3 is visible to all servers. Why not just have one copy? This is what we call zero-copy replication: the data references from different servers actually point to the same files, shared across the cluster.

[30:52] — How to Enable Zero-Copy Replication

Robert: To enable this there is a MergeTree property called allow_remote_fs_zero_copy_replication. When you set it at the database level, any table using S3 automatically gets this property, the files are shared, and the amount of S3 storage used drops to a single copy. So for example instead of two copies of a petabyte you just store one. That’s a really powerful feature that people are using increasingly.

[31:36] — Current State of S3 Table Storage in Clusters

Robert: This is a good point to summarize where S3 storage is at right now, because it is in a state of evolution in ClickHouse.

On the good side: S3 is usable for long-tail data. We have a lot of customers using it quite successfully. Many of them have each server maintaining its own copy of the data, which works reasonably well. Performance is not bad and getting better. Replication works. We recommend being on the most recent version of ClickHouse possible, as of this talk that’s 23.7, because constant fixes keep coming in. There’s also a max_parallel_replicas setting that enables scale-out, allowing you to spread queries across all your available replicas and get good parallelization.

On the challenging side: the references and data are separated, meaning the metadata that is necessary to know what you’re looking at is still on the ClickHouse server, and every ClickHouse server needs a full copy of that metadata. This doesn’t scale out to a very large number of nodes. Zero-copy replication until recently was somewhat unstable. Preserving reference counters across a cluster is tricky. Doing backups is hard. And there are still fixes landing that need full verification, though this is getting better really fast.

[35:17] — SharedMergeTree: The Closed-Source Alternative

Robert: What does SharedMergeTree actually do?

Alexander: SharedMergeTree basically solves this problem by managing references in a much more coherent way. It does a couple of key things. First, it allows the S3 object storage files to become easily attachable to any number of servers. It also solves the problem of data references by allowing the servers to contain partial sets of information about the entire set of S3 files, so you don’t have to have this data reflected onto every single host.

If you read the article from ClickHouse Inc. about SharedMergeTree, the fundamental thing they did is move data references out of ClickHouse and into ZooKeeper or Keeper. When data references are stored in a single place, many good things happen almost for free: you can attach multiple servers, you can do dynamic scaling, zero-copy replication becomes essentially unneeded, and so on. They also collapsed metadata into a smaller number of files and objects, which decreases the number of API calls to S3, since we were talking about extra costs due to extra API calls. So they thought about that as well.

The other big thing to know is that SharedMergeTree is closed-source, and ClickHouse Inc. has no plan to open-source it at this time, though they may change that. That’s an important topic we’ll return to at the end of this talk.

[38:19] — Using S3 as a Data Lake: Introduction to Parquet

Robert: Let’s talk about a final way to use S3 inside ClickHouse, which is a data lake approach. MergeTree is great but it’s not the only format for data in the world. In particular, Parquet is a really popular format. It’s columnar, has indexing, has very good compression, and in many ways is a worthy counterpart to MergeTree. The difference being it’s a public format that anybody, for example using Arrow libraries from Python or C++, can read.

ClickHouse has put a lot of effort into reading files directly off S3. This feature started to become really usable around 2020, and there has been particularly a lot of effort in recent months on reading and writing Parquet very quickly.

[39:38] — The S3 Table Engine for Parquet

Robert: Let’s look very briefly at how you can use it. ClickHouse offers what’s called the S3 table engine, which will read files directly from S3. The way you set it up: you define your table, pick S3 as the engine, provide a URL, and specify that the table type is Parquet. It doesn’t need to be told what’s in there because it already has a schema and can figure it out itself.

You can then insert data into this. In the current implementation of the S3 table engine, it would take these rows and create one Parquet file. There is limited support for partitioning. This is a simple, rather straightforward use of read and write of Parquet.

[40:36] — The s3() Table Function and s3Cluster()

Robert: When we’re actually selecting, we see more powerful ways to do it. We can select directly from the table, which is fine as long as it’s a single file. But what we can actually do is select from the s3() table function. This is a really powerful function. We give it a URL, say it’s Parquet, and it’s going to be just like building a table on the fly. The difference is that the s3() table function is very good about reading multiple files: it has very powerful file globbing capabilities, so you can use patterns to select from multiple similar paths.

Then there’s a variation called s3Cluster(), which will run the s3() table function across the current server and distribute the work across all servers in the cluster. If you want to run a query, ClickHouse will dynamically spread it across multiple servers to get you your result back much more quickly.

And yes, to answer the question in chat: this works with any S3-compatible storage, not just Amazon S3. It works on MinIO, on GCS, on Oracle buckets. As long as it’s S3-compatible it should more or less work. You may sooner or later see a bug on newer bucket types, but they do get fixed.

Alexander: Yes, it’s about the S3 API and the bucket. ClickHouse can also work with GCS and Azure block storage directly because the developers integrated those APIs directly. For GCS there’s a wrapper on top of the S3 API provided by Google, so it’s a little bit limited. But it understands specific authorization schemes for GCS and others.

Robert: Let me give a concrete example about the behavior differences. With MinIO, if you’re doing a multi-part upload, there’s a pause in the API after all your parts go up. With Amazon S3, once the last byte arrives, S3 just has it, you can go away. With MinIO there’s actually a pause as it syncs the data. In some cases this caused connections to actually time out. Those are differences in API behavior, not API functionality. It doesn’t mean MinIO doesn’t work. We use it constantly for unit tests in the ClickHouse build pipelines. We just want to be clear that there are some differences.

[47:18] — How to Pass S3 Credentials: Four Methods

Robert: One thing we haven’t talked about yet is how to pass S3 credentials. There are many methods.

The one I personally prefer and find simplest is passing them as environment variables. In Kubernetes, this is super easy to implement: you create a secret and then pass that secret into the process as environment variables. That gives you access.

You can also pass the keys as strings in every one of the S3 endpoint definitions. But that means you have keys lying around, which is not a good thing. Sooner or later you’ll check them into GitHub and then you’ll get an unpleasant message.

Another nice way, in each of the clouds, is to define roles and grant those roles directly or indirectly to ClickHouse, so the ClickHouse process has built-in permission to access whatever S3 bucket you’re looking at. That’s great because there are no keys at all.

Finally, you can use named collections, which are really just a different way of defining data stored in an endpoint. With S3 table functions you don’t have to embed credentials in the SQL directly, you just use a reference which then resolves to something that doesn’t appear in your logs.

[49:12] — Trade-offs With Parquet on S3

Robert: I’m not going to belabor this because I want to make sure we have time for questions, but one thing I’d say: if you have a use case where the archive data is truly read-only, that’s a good pick. For example, observability data that just never changes and you want to keep it: Parquet on S3 is a really good option. The other nice thing is that once you put data out there it’s not just ClickHouse that can read it. Machine learning, data science teams can access those files without going through ClickHouse and putting load on your operational analytics. It’s a really powerful feature.

If data may change or need to be deleted, for example if tenants can request their data be deleted, Parquet on S3 is not a great pick unless you design for it explicitly, for example by splitting up Parquet files by tenant.

[50:13] — S3 Telemetry and the system.remote_data_paths Table

Robert: S3 telemetry has gotten way better in ClickHouse. The next few slides in the deck, which I’m not going to cover, show examples you can read when you get the link. It’s now increasingly easy to get information out about what ClickHouse is doing with S3. For example, if you’re getting a lot of API calls, detecting those easily is now possible. You can fetch data about the file system cache and pretty easily check the allocated storage size.

One trick I want to highlight is the system.remote_data_paths table. You may not know about this. It keeps track of the actual files that ClickHouse is managing in S3. You can see their paths, use this to check data size, and find out individual paths. There’s also a new utility called clickhouse-disks which can read the paths when you’re on the ClickHouse instance. It’s not fully documented right now but looks like an increasingly useful tool to figure out what’s going on with those references on disk.

[51:39] — Coming Attractions: Community S3 Improvements

Robert: Coming attractions. We talked about SharedMergeTree from ClickHouse Inc., which looks like a good design but is closed-source and will likely remain that way for a while. There’s a lot of interest in what everybody can do in the community to improve the existing open-source S3 capabilities. For example: more work on zero-copy replication to make sure it’s fully stable, simplifying storage organization, all the way down to documentation improvements. If these things are of interest to you, I’d invite you to join community efforts or talk to us. We’re very actively involved in this and looking for ways to help make S3 storage better. This is a really critical feature for ClickHouse, and the good news is it’s evolving quickly and becoming increasingly useful.

[54:08] — Contact Information

Robert: With that, we’re getting toward the end. You can read the best practices slides in the deck. Don’t forget the samples; here’s the link. The clickhouse-sql-examples repository is out on GitHub. There’s great documentation. There’s also a great blog by DoubleCloud that explains how S3 storage works under the cover, the best treatment of the topic to date. And for practical production use, Denis Yurichev and other people on the Altinity support team have written great Knowledge Base articles, including an excellent one which I used for the samples for this talk.

Thank you. If you’d like to talk to us more after the Q&A, please visit altinity.com. You can schedule a consultation, join our Slack, try Altinity.Cloud, or use our Altinity Stable® Builds for ClickHouse® and the Altinity Kubernetes Operator for ClickHouse®.

[54:50] — Q&A: Zero-Copy Replication Production Readiness

Robert: Okay, Alexander, I’m going to go ahead and make you the host for the Q&A.

Alexander: Thank you. I’ll try to answer questions both in voice and in the chat. A few are already answered. Let me take the question about zero-copy replication production readiness.

Zero-copy replication has actually been developed and used in production at Yandex for several years. During that time they definitely hit a lot of problems: they were losing customer data, and they were fixing bugs. ClickHouse Inc. folks started using it in their cloud as well before they developed SharedMergeTree, and they were also spending their time fixing bugs. That effort has not been forgotten. Now they claim that zero-copy replication is production-ready if you use the latest ClickHouse versions, like 23.6 or 23.7.

Backup is a different story. ClickHouse still doesn’t have a good internal backup solution. The embedded backup command is not a full solution. It can’t manage lists of backups, backup frequencies, schedules, and so on. The Altinity Backup for ClickHouse® utility is much better in terms of management, but it still doesn’t work very well with S3 storage. We’re still working on that. Every new version improves in this regard but some problems may still occur.

[58:03] — Q&A: Does Larger Part Size Reduce API Costs?

Alexander: Yes and no. It depends on what you do later with your data. When ClickHouse queries the data it sometimes needs to read the full part. In general, having a smaller number of parts is better for S3, because the fewer objects you store in S3 the better it is for performance and management. Even a simple operation like calculating the size of your bucket in the S3 client requires iterating through all your objects, which can take a very long time. So the fewer objects you have, the better.

[59:51] — Q&A: Separation of References and Data in Open-Source S3

Alexander: With the current open-source implementation there is still a fraction of the data, the references, that is not separated from compute. The reason those references exist is that object storage is very limited in operations typical to file systems. You cannot easily move an object from one place to another; you need to do a copy and then delete the old version, which is very slow.

If you know how ClickHouse works internally, it likes to do hard links. If you do some rotation, all the data that doesn’t change is just copied through hard links, which is very efficient in a file system but doesn’t work in object storage. To overcome this limitation the developers created those reference links so you can manipulate at the link level without touching the objects themselves. Objects are immutable in object storage.

But that created a different problem: when there is no clear separation, when the data of those references needs to be protected, if you lose it you cannot restore the data stored in object storage. That and a bunch of other problems is what ClickHouse Inc. tried to solve in SharedMergeTree.

[1:03:34] — Q&A: CPU Impact of S3-Backed Storage

Alexander: I don’t think S3-backed storage is more CPU-intensive compared to SSDs or HDDs. It should be pretty much the same in terms of CPU usage. But if you consider wait times, it will be different. CPU may wait for data to be read from storage or from the network, and those are different numbers. What might happen with ClickHouse and S3 is that ClickHouse creates a much bigger number of threads that try to read from or write to S3, because creating more parallel requests usually increases bandwidth. Even if you have one thread executing the query, ClickHouse may create a lot of background threads for storage I/O, and that may generate somewhat higher CPU load due to context switches.

[1:03:40] — Q&A: Using s3_plain Disk as a Write-Once Storage Tier

Alexander: Almost. There is a special disk type we haven’t discussed called s3_plain. You can put data on an S3 plain disk and then attach it to ClickHouse. The problem is that it’s a write-once solution. You can create data on an S3 plain disk using the backup command and ClickHouse’s restore mechanism, and then attach it to another ClickHouse server for restoration or inspection purposes. One thing we are considering improving in ClickHouse is making s3_plain more generally available for archival cold storage. Once the data is old enough you probably don’t expect it to change. You can imagine moving it to S3 plain in a structure identical to ClickHouse’s own data structure. Once stored there it can be attached to other servers and shared or moved easily. Right now it’s kind of possible but not very convenient.

[1:05:31] — Q&A: Backup Options for S3-Backed ClickHouse Clusters

Alexander: There are actually four options. First is the ClickHouse embedded backup command, which can produce a snapshot of the data and store it on S3. The problem is it still has some bugs with restore functionality, particularly with dependency tracking, and it is not a management solution: you can’t manage a list of backups or set schedules. Second is Altinity Backup for ClickHouse®, the open-source utility that supports different backup backends including all major cloud providers. Third, DoubleCloud open-sourced their backup solution recently. It has been used in production for many years internally at Yandex and DoubleCloud. We have zero experience with it but know it has extensive production use.

Fourth, you can actually build your own using EBS snapshots if you’re running on traditional block storage in the cloud. It’s possible to do EBS snapshots with incremental support. To get a consistent snapshot, stop all inserts, stop all merges so ClickHouse is completely idle, then do the snapshot. This requires quite a lot of automation that may not yet exist as open source.

Alexander: Thank you everybody. Robert has to leave for another important customer meeting, but thank you guys for attending. I think it was interesting. We will share this recording and slides with you, and we’ll continue discussing this topic because this is one of the important features in ClickHouse, and we haven’t covered everything. There is a lot left for future talks.

Thank you and have a good rest of your day.

FAQ

What are the main ways to configure S3 storage for ClickHouse MergeTree tables?

There are three core configurations. An s3_direct policy writes data purely to S3 object storage with no local cache. An s3_cache policy layers a local disk cache on top of S3, so frequently accessed data is served from local storage at near-EBS speed. A tiered policy pairs block storage as the hot tier with S3 as the cold tier, using a TTL rule to move data to S3 after a defined interval. For most production workloads, the tiered approach with a cache is the recommended starting point.

Why is tiered storage recommended over writing directly to S3 in ClickHouse?

Writing directly to S3 means that background merges must read files from S3, sort and merge them, and write them back, generating a large number of API calls. This can result in high S3 API costs and even rate-limit errors from Amazon under heavy load. Tiered storage keeps incoming data on block storage where it can be fully merged efficiently, then moves it to S3 after it has stabilized. Setting prefer_not_to_merge=1 and perform_ttl_move_on_insert=0 on the S3 volume reinforces this pattern.

What is zero-copy replication in ClickHouse and when should I use it?

Zero-copy replication allows multiple ClickHouse servers in a cluster to share the same underlying S3 files rather than each server maintaining its own copy. It is enabled via the allow_remote_fs_zero_copy_replication setting and can halve storage costs for replicated clusters. However, it adds complexity around reference counting and has historically had stability issues. As of ClickHouse 23.6 and 23.7, the ClickHouse community considers it production-ready, but it should still be evaluated carefully and used on recent versions.

How does the s3() table function differ from the S3 table engine?

The S3 table engine is simpler and convenient for straightforward reads and writes but handles partitioning poorly and typically produces a single large file on insert. The s3() table function is far more powerful: it supports glob patterns to read from multiple files, handles partitioned writes well (splitting output into multiple files by partition key), and has a companion s3Cluster() function that distributes query work across all nodes in a ClickHouse cluster. For most production data lake use cases, the table function is the right choice.

What are the recommended approaches for passing S3 credentials to ClickHouse?

The safest and most Kubernetes-friendly approach is to use environment variables, which can be injected from Kubernetes secrets without ever appearing in configuration files or SQL. Granting a cloud IAM role directly to the ClickHouse VM is even better in cloud environments because it eliminates keys entirely. Passing keys as strings in endpoint definitions works but creates a risk of accidental key exposure, for example through log files or version control commits. Named collections provide a middle ground for table functions, allowing credentials to be referenced by name rather than embedded in SQL.

What are the current limitations of open-source ClickHouse S3 support compared to SharedMergeTree?

The current open-source implementation stores file references (metadata pointers) on the ClickHouse server, meaning every server in a cluster must hold a full copy of this metadata. This limits scale-out and makes zero-copy replication complex. ClickHouse Inc.’s closed-source SharedMergeTree engine solves this by storing data references in ZooKeeper or Keeper, allowing partial metadata per server and making file sharing across servers architecturally clean. SharedMergeTree is not open-source and there are no current plans to release it, making ongoing community improvements to the open-source S3 path an important area of development.

© 2023 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

PRODUCTS

OPEN SOURCE SOFTWARE

CLICKHOUSE® SOLUTIONS

Get in touch with ClickHouse experts.