ClickHouse® MergeTree on S3 – Administrative Best Practices

In the previous article of this series we introduced S3-backed MergeTree tables. We described storage configurations for S3 and showed how to set up a MergeTree table that uses S3 storage. Finally, we traced through the location of data on disk and showed some handy queries for tracking S3 file locations. 

In the current article we’ll provide guidance for ClickHouse administration when using S3 with MergeTree tables. Topics include the following:

  • Configuring S3 bucket settings
  • Setting up storage configurations
  • Creating and deleting S3-backed MergeTree tables
  • Turning off zero-copy replication

Configuring S3 bucket settings

When configuring a bucket for use with MergeTree, we recommend the following S3 practices to ensure that files remain consistent and buckets do not accumulate orphan data.

  • Multi-part uploads on AWS – Set a bucket lifecycle policy to abort incomplete multipart uploads after a set period of time, for example 3 days (see the sample rule below). This prevents junk files from accumulating in S3 when multipart uploads are interrupted. 
  • Do not use life-cycle policies or other mechanisms to delete S3 files. Only ClickHouse or an informed administrator should make that decision. This prevents files from disappearing unexpectedly. 
  • Do not mix S3 buckets for ClickHouse with other applications. Conflicting file management operations in different applications can lead to accidental file deletion. 

These are key recommendations from our experience as well as that of the ClickHouse core team. See this issue for more recommendations related to S3 buckets.
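Here’s a minimal sketch of the first recommendation using the AWS CLI, assuming a bucket named mybucket; adjust the bucket name and the 3-day window to your environment.

aws s3api put-bucket-lifecycle-configuration --bucket mybucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 3}
      }
    ]
  }'

Note that this rule only cleans up incomplete multipart uploads; it does not expire or delete completed objects, in line with the recommendations above.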

Recommended storage policies for S3-backed MergeTree

Best practices for ClickHouse storage configuration include separate S3 endpoints for each server, a disk cache for S3 data, and tiered storage with hot data going initially to block storage. We’ll cover each in turn. 

Use separate S3 endpoints for each replica

We strongly recommend keeping S3 data for different ClickHouse servers on separate S3 paths. This avoids problems where files are accidentally mixed up between replicas, which then require messy cleanups. It turns out there’s a simple way to arrange paths using macros. This solution works in all environments and is especially helpful when running on Kubernetes. 

The Altinity Kubernetes Operator defines the following macros automatically for every replica, as shown in the following table. If you set up your own servers, we recommend using similar conventions. 

Name           Meaning                    Typical value
cluster        The cluster name           s3
installation   The CHI resource name      demo2
replica        The server replica name    chi-demo2-s3-0-0
shard          The server shard number    0

The simplest and most general approach is to use the installation and replica macros in your s3 disk definition. In particular, replica names constructed by the Altinity Operator include installation names as well as shard and replica indexes, which ensures unique S3 locations for each server’s files. 

Here is an example. 

<s3_disk_with_replica>
  <type>s3</type>
  <endpoint>https://s3.us-west-2.amazonaws.com/mybucket/clickhouse/{installation}/{replica}/</endpoint>
</s3_disk_with_replica>

The above will split out S3 paths by installation, which is the name of the CHI resource in Kubernetes, and the ClickHouse replica name.  Here’s an example of the resulting path:

s3://mybucket/clickhouse/demo2/chi-demo2-s3-0-0/sfk...

It’s now much easier to identify the S3 files belonging to a replica, for example to clean them up. Here’s a sample showing how to delete all S3 files belonging to a single replica.

aws s3 rm --recursive s3://mybucket/clickhouse/demo2/chi-demo2-s3-0-0/

And here’s a command to delete storage for the entire cluster. 

aws s3 rm --recursive s3://mybucket/clickhouse/demo2/

You can also include the {shard} macro between {installation} and {replica}. This is helpful for identifying storage belonging to a single shard, but it does not add much information: there are typically only a few replicas per shard, and the shard is already embedded in the replica name.
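For illustration, here’s the endpoint from the earlier example with the {shard} macro added:

<endpoint>https://s3.us-west-2.amazonaws.com/mybucket/clickhouse/{installation}/{shard}/{replica}/</endpoint>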

Finally, you can also insert the {cluster} macro, but it’s mostly overhead. We generally recommend having only a single cluster definition in a ClickHouse resource when using the Altinity Kubernetes Operator, so the {cluster} macro should not be necessary.

Use a disk cache for S3 data

ClickHouse provides a disk cache and we strongly recommend using it for S3 data. The disk cache wraps the S3 disk and directs ClickHouse to cache blocks from S3 files on local storage. The disk cache has two main benefits. 

  • It’s fast. The main reason is that it allows ClickHouse reads to benefit from faster local storage as well as the OS page cache, which keeps recently used blocks in memory. 
  • It reduces API calls to S3 for commonly read blocks. S3 API calls are metered and can become costly if there are many queries to S3 storage. Using a cache can reduce the cost substantially. 

The following example code demonstrates how to configure the disk cache and incorporate it in a storage policy.

<clickhouse> 
  <storage_configuration> 
    <disks>
      <s3_disk>
        <type>s3</type>
        <endpoint>https://s3.us-west-2.amazonaws.com/mybucket/clickhouse/{installation}/{replica}/</endpoint>
        <use_environment_credentials>1</use_environment_credentials>
      </s3_disk>
      <!-- Local disk cache -->
      <s3_disk_cache>
        <type>cache</type>
        <disk>s3_disk</disk>
        <path>/var/lib/clickhouse/s3_disk_cache/cache/</path>
        <max_size>100Gi</max_size>
        <cache_on_write_operations>1</cache_on_write_operations>
      </s3_disk_cache>
    </disks>
    <policies>
      <!-- S3-backed MergeTree with disk cache -->
      <s3_disk_cached_policy>
        <volumes>
          <main>
            <disk>s3_disk_cache</disk>
          </main>
        </volumes>
      </s3_disk_cached_policy>
    </policies>
    . . . 

The following graph shows typical performance differences between cached and uncached results. Results will vary depending on the workload, but it’s clear that the improvement is substantial even on small datasets.

The main reason for the performance improvement with the cache in this particular benchmark is that the cache, being located on block storage, can take advantage of the OS page cache. Successive reads of the same blocks tend to come from RAM instead of from disk. To get the full benefit from a disk cache, it’s best to use hosts with enough RAM to keep recently cached data in the page cache.

It’s also better to have a large cache than a small one, and to use high-speed local storage if it is an option. 

In all cases we recommend benchmarking performance using production data and adjusting cache and host configuration to get the best response time. Your own workload is always the best guide when optimizing for performance. 😉
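One quick way to see the cache at work is to run the same query against a cold and then a warm cache. Here’s a minimal sketch, using the test_s3_tiered table defined later in this article; SYSTEM DROP FILESYSTEM CACHE requires a recent ClickHouse version.

-- Start from a cold cache, then run the same query twice.
SYSTEM DROP FILESYSTEM CACHE;
SELECT count(), max(D) FROM test_s3_tiered;  -- cold run: blocks fetched from S3
SELECT count(), max(D) FROM test_s3_tiered;  -- warm run: blocks served from the disk cache

The second run should show a markedly lower response time, and the difference grows as more of the working set fits in the cache.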

Use tiered storage to put hot data on block storage

Along with file system caching, ClickHouse also supports tiered storage. Tiered storage allows you to set up a storage policy so that new data lands on a disk with fast storage, then migrates over time (or based on capacity) to slower, cheaper media like S3. We recommend using it for the following reasons. 

  • New data tends to merge actively. Merging on block storage is much faster because files are local. 
  • Queries tend to focus on recent data. Keeping this on block storage saves I/O to S3, which in turn reduces API costs. 

The following example demonstrates configuration of tiered storage. It builds on the cached disk shown in the previous example.

<clickhouse> 
  <storage_configuration> 
    <disks>
      <s3_disk>
        ...
      </s3_disk>
      <s3_disk_cache>
        . . . 
      </s3_disk_cache>
    </disks>
    <policies>
      <!-- Block storage tiered with S3-backed MergeTree -->
      <s3_tiered_policy>
        <volumes>
          <default>
            <disk>default</disk>
          </default>
          <s3_disk_cached>
            <disk>s3_disk_cache</disk>
            <prefer_not_to_merge>false</prefer_not_to_merge>
            <perform_ttl_move_on_insert>false</perform_ttl_move_on_insert>
          </s3_disk_cached>
        </volumes>
        <move_factor>0.1</move_factor>
      </s3_tiered_policy>
    </policies>
    . . . 

The following storage settings deserve special mention:  

  • move_factor – Moves table parts to the next volume automatically when free space falls below this fraction of capacity; the value 0.1 above triggers moves when the disk is roughly 90% full. This preserves free space so that merges do not run out of disk. Note that it is set at the storage policy level, as shown above. 
  • max_data_part_size_bytes – Sets a limit for part size on a particular volume. It can be used to move large parts automatically to cold storage on S3 while keeping smaller parts on hot storage until they can merge. (See the example below.) 
  • perform_ttl_move_on_insert – It’s best to set this to false. If true, ClickHouse will apply TTL MOVE directly on inserted rows. This can make inserts slow when loading older data that may already have an expired TTL. Also, it means that data will merge in S3 instead of locally, so you may incur additional API costs, not to mention slower merge performance. 
  • prefer_not_to_merge – Advises ClickHouse not to merge parts in the volume. It should generally be false, which is the default. Setting it to true can cause ClickHouse to get TOO_MANY_PARTS errors. If you use tiered storage, data should be properly merged before it reaches S3 and should not need further merging. 

The move_factor property might seem like a convenient way to distribute data between hot and cold tiers, but it’s not a complete answer. It’s definitely necessary to ensure that the disk does not fill up, which would prevent merges, but the behavior is unpredictable. For example, ClickHouse does not guarantee which parts will be moved first. It may even move newly inserted data directly to S3. 
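To illustrate max_data_part_size_bytes, here’s a hedged variant of the hot volume that sends any part larger than 10 GiB straight to the S3 tier; the threshold is illustrative and should be tuned to your actual part sizes.

          <default>
            <disk>default</disk>
            <max_data_part_size_bytes>10737418240</max_data_part_size_bytes>
          </default>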

Use TTL MOVE to transfer data between storage tiers

TTL MOVE policies move data based on a time interval. They are easier to reason about than alternatives, which include depending on the storage policy <move_factor> parameter or moving parts manually using ALTER TABLE MOVE commands.
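For reference, here’s what a manual move looks like, using the test_s3_tiered table created below; the partition value is illustrative.

ALTER TABLE test_s3_tiered MOVE PARTITION '2023-06-01' TO VOLUME 's3_disk_cached';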

Here’s an example of a TTL MOVE policy that transfers data to cached S3 disk after 7 days.

CREATE TABLE test_s3_tiered
(
    `A` Int64,
    `S` String,
    `D` Date
)
ENGINE = ReplicatedMergeTree
PARTITION BY D
ORDER BY A
TTL D + INTERVAL 7 DAY TO VOLUME 's3_disk_cached'
SETTINGS storage_policy = 's3_tiered_policy';
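Note that TO VOLUME references the volume name from the storage policy (s3_disk_cached), not the disk name. If the table already exists, you can add or change the rule with ALTER TABLE ... MODIFY TTL. For example, to lengthen the hot period to 14 days:

ALTER TABLE test_s3_tiered MODIFY TTL D + INTERVAL 14 DAY TO VOLUME 's3_disk_cached';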

As mentioned previously, you should still always set <move_factor> in the storage policy to prevent the hot volume from becoming full, which would prevent merges and also stop inserts. (That would be bad.) This assumes the hot volume is using block storage, which has a size limit that may require administrative action to extend. There are effectively no limits on the size of S3-backed storage.

Use DROP TABLE SYNC to delete S3-backed MergeTree tables

The SYNC option on DROP TABLE causes ClickHouse to clear storage immediately. It ensures that S3 storage is fully removed before the command returns. Here’s an example of use.

DROP TABLE IF EXISTS test_s3_disk ON CLUSTER `{cluster}` SYNC;
DROP TABLE IF EXISTS test_s3_disk_local ON CLUSTER `{cluster}` SYNC;

You can also use DROP DATABASE SYNC to delete all tables in a single database. Here is another example:

DROP DATABASE IF EXISTS demo ON CLUSTER `{cluster}` SYNC;

If you issue DROP TABLE without SYNC, ClickHouse does lazy clean-up of storage and deletes files within a few minutes. This is normally not a problem, but if your server disappears or fails, the cleanup may not complete, leaving orphan files on S3. This can happen, for example, when you drop a replica or an entire cluster. It can also be a problem if you create and drop tables quickly during development. 

The Altinity Kubernetes Operator for ClickHouse does not issue DROP TABLE SYNC when deleting replicas in operator versions prior to 0.24. We plan to fix this in version 0.25. In the meantime you need to remove S3 files manually after dropping a replica. See the next article in this series for guidance on cleaning up orphan S3 files.

Manually delete S3 files after dropping replicas or clusters

If you drop a ClickHouse replica or an entire cluster with S3-backed MergeTree tables, check for missed files. This is easy to do if you follow our S3 endpoint recommendations and use macros to create separate endpoints for each ClickHouse cluster or replica. You can list files under the appropriate endpoint to see if any are left. 

Here’s a sample command to list the S3 data files for a single replica using the AWS CLI. 

aws s3 ls --recursive s3://mybucket/clickhouse/demo2/chi-demo2-s3-0-0/

The commands to delete files are shown above in the section that describes how to use separate S3 endpoints for each replica. 

Ensure zero-copy replication is turned off

Older versions of ClickHouse enabled zero-copy replication by default. Zero-copy replication is a MergeTree setting and can be disabled with the following configuration setting: 

<clickhouse>
  <merge_tree>
    <allow_remote_fs_zero_copy_replication>false</allow_remote_fs_zero_copy_replication>
  </merge_tree>
</clickhouse>

Explicitly disabling the option as shown above helps prevent accidents. 
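To verify the effective value on a running server, you can query system.merge_tree_settings:

SELECT name, value FROM system.merge_tree_settings
WHERE name = 'allow_remote_fs_zero_copy_replication';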

Conclusion and pointers to the next article

In this article we delved into recommendations for S3 storage configuration and table management. These recommendations increase performance, reduce cost, and minimize the effort to keep S3 files in good order. 

In the final article, we will delve into detailed management of S3 storage.  This includes how to detect and clean up orphan S3 files used by MergeTree tables. 

If you have questions about any articles in this series please contact us directly or join the Altinity community Slack. We look forward to hearing from you.
