ClickHouse® MergeTree on S3 – Keeping Storage Healthy and Future Work

S3 and ClickHouse part 3

In the previous article of our series on S3-backed MergeTree, we gave practical advice for setting up storage configurations and managing MergeTree tables that use object storage. In this final article, we will dig into cleaning up orphans, which are S3 files that have become detached from their metadata. We will end with thoughts on improved tooling and support for S3 object storage.

What is an orphan S3 file?

As we discussed in the initial article in this series, ClickHouse keeps table file structure and identifying information (metadata) in local storage, and the actual table data in S3 files. Whenever ClickHouse loses metadata that points to an S3 file, it is no longer possible to tell where the file contents came from. We call that file an orphan. 
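
To see this relationship concretely, you can list a few rows from system.remote_data_paths, which maps each local metadata file to the S3 object it references. Here is a minimal sketch; we assume the local_path column is available in your ClickHouse version, while remote_path, size, and disk_name are the same columns we use later in this article.

-- Sample rows mapping local metadata files to the S3 objects they reference.
-- local_path is assumed to exist in your ClickHouse version.
SELECT local_path, remote_path, size
FROM system.remote_data_paths
WHERE disk_name = 's3_disk'
LIMIT 5;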

Orphan S3 files can arise from a variety of causes: replica loss, local filesystem damage on ClickHouse nodes, abrupt restarts, failed multi-part uploads, or race conditions in operations like merging and dropping tables.  

Orphans do no harm to running ClickHouse servers, but they do incur storage charges, especially if there are many of them. It’s better to clean them up.

Detection and cleanup of S3 orphan files after replica loss

In this section we’ll show how to detect and delete orphan files that occur after losing a replica. This procedure also works for random orphans that occur from other causes. 

IMPORTANT NOTE: File cleanup requires care and knowledge of ClickHouse storage. If you are not sure what you are doing, stop and check with an expert first. You can reach us using the contact information at the bottom of this article or on the Altinity website.

Creating some orphans to play with

Let’s make a mess so we have something to clean up. First, we set up all replicas to write to the same path in the S3 bucket. If we then lose track of files, they will be mixed together. For this experiment we’ll use a table that stores data with the s3_disk_policy. (Look here to see the policy definition.) The table looks like the following:

CREATE TABLE IF NOT EXISTS test_s3_disk_local ON CLUSTER `{cluster}`
(
    `A` Int64,
    `S` String,
    `D` Date
)
ENGINE = ReplicatedMergeTree
PARTITION BY D
ORDER BY A
SETTINGS storage_policy = 's3_disk_policy';
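
To give ourselves something to count, we can load a few million rows of throwaway data on one replica and let replication copy it to the other. The following is only a sketch; the row count and generated values are arbitrary.

-- Load arbitrary sample data so the table has parts on the S3 disk.
INSERT INTO test_s3_disk_local
SELECT
    number AS A,
    randomPrintableASCII(100) AS S,
    toDate('2024-01-01') + toIntervalDay(number % 10) AS D
FROM numbers(10000000);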

Next, we need to get more scientific about how much data and how many files we are holding in S3 for each replica. Let’s use an improved query to get statistics on the files that each server is tracking. This query uses the clusterAllReplicas() function to check the system.remote_data_paths table for all servers in the cluster.

SELECT hostname() AS host, disk_name, path,
  count() AS files, sum(size) AS bytes
FROM clusterAllReplicas('s3', system.remote_data_paths)
WHERE disk_name = 's3_disk'
GROUP BY host, disk_name, path ORDER BY host, disk_name ASC, path ASC
   
┌─host───────────────┬─disk_name─┬─path───────────────────────────────┬─files─┬─────bytes─┐
│ chi-demo2-s3-0-0-0 │ s3_disk   │ /var/lib/clickhouse/disks/s3_disk/ │    61 │ 819301701 │
│ chi-demo2-s3-0-1-0 │ s3_disk   │ /var/lib/clickhouse/disks/s3_disk/ │    61 │ 819301701 │
└────────────────────┴───────────┴────────────────────────────────────┴───────┴───────────┘

We can validate the numbers directly on S3. They are exactly double the values from a single replica, since both replicas store their files under the same path.

SELECT count() AS files, sum(_size) as bytes
FROM s3('https://s3.us-west-2.amazonaws.com/mybucket/clickhouse/demo2/s3_disk/**', 'One')

┌─files─┬──────bytes─┐
│   122 │ 1638603402 │
└───────┴────────────┘

Since we are running in Kubernetes, we can just drop a replica as follows. We’ll change the replicasCount in demo2-s3-01.yaml to reduce the number of replicas. The kubectl edit command makes this change quick and easy. After we save the file in the editor, the Altinity Operator will drop the replica.

kubectl edit chi demo2
. . .
spec:
  configuration:
    clusters:
    - layout:
        replicasCount: 1 <-- Change this from 2 to 1.
        shardsCount: 1

Important note! This demonstration depends on behavior in version 0.24 of the Altinity Operator: it does not properly delete S3 storage when dropping a replica. A fix is planned for version 0.25. At that point we’ll need to come up with more violent ways to drop replicas, such as deleting their block storage.

Wait for the replica to disappear, then check S3 storage again. You’ll see that there is the same number of files as before.

┌─files─┬──────bytes─┐
│   122 │ 1638603402 │
└───────┴────────────┘

However, things are quite different when we query remote data paths. 

SELECT hostname() AS host, disk_name, path,
  count() AS files, sum(size) AS bytes
FROM clusterAllReplicas('s3', system.remote_data_paths)
WHERE disk_name = 's3_disk'
GROUP BY host, disk_name, path ORDER BY host, disk_name ASC, path ASC

┌─host───────────────┬─disk_name─┬─path───────────────────────────────┬─files─┬─────bytes─┐
│ chi-demo2-s3-0-0-0 │ s3_disk   │ /var/lib/clickhouse/disks/s3_disk/ │    61 │ 819301701 │
└────────────────────┴───────────┴────────────────────────────────────┴───────┴───────────┘

ClickHouse “sees” only 61 of the S3 files. The other 61 files are orphans. 

Deleting replica S3 files the quick way

In the second blog article in this series, we recommended using different S3 endpoints for each replica. If you had done this, you could just delete the orphan files from the second replica with a command like the following and call it a day.

aws s3 rm --recursive s3://mybucket/clickhouse/demo2/chi-demo2-s3-0-1/

Otherwise, keep reading. Even if all files are under the same path we can still find orphans and delete them. 

Locating orphan file paths

We can locate orphans by joining the information from S3 with the contents of system.remote_data_paths and looking for missing entries in the ch_remote_path output column. The _path value from the s3() function and the remote_path column from system.remote_data_paths do not quite match, but a regular expression fixes things.

Any row that does not have a remote_path value could be an orphan, as illustrated by the sample rows of output.

-- Base query to compare data.
SELECT s3_subpath, s3_size, (ch_remote_path = '') as is_orphan, ch_remote_path, ch_size FROM
(
  SELECT _size as s3_size, regexpExtract(_path, '([a-zA-Z0-9\\-])/(.*)$', 2) AS s3_subpath
  FROM s3('https://s3.us-west-2.amazonaws.com/rhodges-us-west-2-mergetree-s3/clickhouse/demo2/s3_disk/**', 'One')
) AS s3
LEFT OUTER JOIN
(
  SELECT remote_path as ch_remote_path, size as ch_size
  FROM clusterAllReplicas('s3', system.remote_data_paths)
  WHERE disk_name = 's3_disk'
) ch
ON s3_subpath = ch_remote_path
ORDER BY s3_subpath LIMIT 10
FORMAT Vertical
;
Row 1:
──────
s3_subpath:     clickhouse/demo2/s3_disk/adm/emztazsgtflgrwslskummaqmnqjcf
s3_size:        510
is_orphan:      1
ch_remote_path:
ch_size:        0
…
Row 4:
──────
s3_subpath:     clickhouse/demo2/s3_disk/amy/vbwwpcstmifiufrelytjtiljedykz
s3_size:        8
is_orphan:      0
ch_remote_path: clickhouse/demo2/s3_disk/amy/vbwwpcstmifiufrelytjtiljedykz
ch_size:        8

Let’s now convert our query to a CTE and check the number and size of the orphans.  As the output shows, the orphans and non-orphans exactly match the previous sizes of each replica. 

-- Run as CTE to demonstrate stats.
WITH refs AS (
 SELECT s3_subpath, s3_size, (ch_remote_path = '') as is_orphan, ch_remote_path, ch_size FROM
 (
   SELECT _size as s3_size, regexpExtract(_path, '([a-zA-Z0-9\\-])/(.*)$', 2) AS s3_subpath
   FROM s3('https://s3.us-west-2.amazonaws.com/rhodges-us-west-2-mergetree-s3/clickhouse/demo2/s3_disk/**', 'One')
 ) AS s3
 LEFT OUTER JOIN
 (
   SELECT remote_path as ch_remote_path, size as ch_size
   FROM clusterAllReplicas('s3', system.remote_data_paths)
   WHERE disk_name = 's3_disk'
 ) ch
 ON s3_subpath = ch_remote_path
)
SELECT is_orphan, count(), sum(s3_size) FROM refs
GROUP BY is_orphan
;
┌─is_orphan─┬─count()─┬─sum(s3_size)─┐
│         0 │      61 │    819301701 │
│         1 │      61 │    819301701 │
└───────────┴─────────┴──────────────┘

This gives us confidence we are on the right track. However, there is a problem! ClickHouse does not guarantee that queries on system.remote_data_paths and the s3() function return consistent values. Depending on the order of execution, the s3() function might return a file path that has just been written, while the system.remote_data_paths results might not show it yet. You could be fooled into thinking the file is an orphan. That would be bad.

Luckily, there’s a simple solution. We can read the S3 files first into a table, wait a bit, and then join against system.remote_data_paths. We’ll show that in the next section. 

Generating commands to delete orphan S3 files

Since joining directly against S3 contents from the s3() function can be ambiguous, we will instead put the list of S3 contents in a table, rest a spell, and then join against system.remote_data_paths to generate commands that delete orphan files. 

Let’s start by creating a materialized view to fetch S3 file paths into a table. Starting with ClickHouse 23.12, we can use refreshable materialized views for this. They rerun the query for the materialized view on a regular basis and transparently update information in the view.

-- Create target table.
CREATE TABLE s3_files (
  s3_file String,
  s3_size UInt64,
  s3_subpath String,
  s3_refresh_datetime DateTime DEFAULT now()
)
ENGINE=MergeTree
PARTITION BY tuple()
ORDER BY s3_subpath;

-- Must set allow_experimental_refreshable_materialized_view = 1
CREATE MATERIALIZED VIEW s3_files_mv
REFRESH EVERY 1 HOUR TO s3_files
AS
  SELECT
    _file as s3_file,
    _size as s3_size,
    regexpExtract(_path, '([a-zA-Z0-9\\-])/(.*)$', 2) AS s3_subpath,
    now() as s3_refresh_datetime
  FROM s3('https://s3.us-west-2.amazonaws.com/rhodges-us-west-2-mergetree-s3/clickhouse/demo2/s3_disk/**', 'One');

-- Refresh the view contents immediately.
SYSTEM REFRESH VIEW s3_files_mv

The s3() function unfortunately does not have a way to select creation dates from the S3 files. It would be nice to have a _date virtual column, because then we could tell how old they are. (Looks like an easy PR.) Instead we use a hack: add an s3_refresh_datetime column and set it to the current time every time the view refreshes. This ensures that any file in the table is at least as old as the s3_refresh_datetime value.
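
Before relying on that timestamp, a quick sanity check on the snapshot is worthwhile. The following simple query against the s3_files table we just created shows how many files were captured and when the last refresh ran.

-- Sanity check: how many files the snapshot holds and when it was taken.
SELECT
    count() AS files,
    sum(s3_size) AS bytes,
    max(s3_refresh_datetime) AS last_refresh
FROM s3_files;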

Let’s wait a minute and then demonstrate we can find orphan files from our new s3_files table. 

WITH refs AS (
 SELECT s3_subpath, s3_refresh_datetime, s3_size, (ch_remote_path = '') as is_orphan, ch_remote_path, ch_size FROM
 (
   SELECT s3_size, s3_subpath, s3_refresh_datetime
   FROM s3_files
 ) AS s3
 LEFT OUTER JOIN
 (
   SELECT remote_path as ch_remote_path, size as ch_size
   FROM clusterAllReplicas('s3', system.remote_data_paths)
   WHERE disk_name = 's3_disk'
 ) ch
 ON s3_subpath = ch_remote_path
)
SELECT is_orphan, count(), sum(s3_size) FROM refs
WHERE s3_refresh_datetime < now() - INTERVAL 1 MINUTE
GROUP BY is_orphan
;
┌─is_orphan─┬─count()─┬─sum(s3_size)─┐
│         0 │      61 │    819301701 │
│         1 │      61 │    819301701 │
└───────────┴─────────┴──────────────┘

The results match our previous queries, so we are making good progress. We can now generate commands to delete the orphan files as follows. 

WITH refs AS (
 SELECT s3_subpath, s3_refresh_datetime, s3_size, (ch_remote_path = '') as is_orphan, ch_remote_path, ch_size FROM
 (
   SELECT s3_size, s3_subpath, s3_refresh_datetime
   FROM s3_files
 ) AS s3
 LEFT OUTER JOIN
 (
   SELECT remote_path as ch_remote_path, size as ch_size
   FROM clusterAllReplicas('s3', system.remote_data_paths)
   WHERE disk_name = 's3_disk'
 ) ch
 ON s3_subpath = ch_remote_path
)
SELECT 'aws s3 rm s3://mybucket/' || s3_subpath AS cmd FROM refs
WHERE is_orphan AND s3_refresh_datetime < now() - INTERVAL 1 MINUTE
ORDER BY cmd
;
aws s3 rm s3://mybucket/clickhouse/demo2/s3_disk/adm/emztazsgtflgrwslskummaqmnqjcf
aws s3 rm s3://mybucket/clickhouse/demo2/s3_disk/agu/ucsbbhqyzilwkyjmnkusroovtjcxz
aws s3 rm s3://mybucket/clickhouse/demo2/s3_disk/ahn/xiiexmzqpvetqpzcexlgxchsnbzwb
...

Run the commands and the orphan files will disappear. Very slowly. Each ‘aws’ command incurs extra overhead to connect to AWS and delete a single file. If you have 50TiB of orphan files after deleting a replica, this could take many hours or even days if you run the commands one after the other. 

If you have a lot of files, you might therefore want to consider generating an ‘aws s3api delete-objects’ command instead, as it can take a list of files. We leave this as an exercise for the reader.
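
As a starting point for that exercise, here is a sketch of the payload shape that delete-objects expects, using two of the orphan keys from the output above. The bucket name is a placeholder, and each call accepts at most 1000 keys, so a long list of orphans has to be split into batches.

# Delete several orphan files in a single API call (bucket name is a placeholder).
aws s3api delete-objects --bucket mybucket --delete '{
  "Objects": [
    {"Key": "clickhouse/demo2/s3_disk/adm/emztazsgtflgrwslskummaqmnqjcf"},
    {"Key": "clickhouse/demo2/s3_disk/agu/ucsbbhqyzilwkyjmnkusroovtjcxz"}
  ],
  "Quiet": true
}'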

Better support for MergeTree and S3

This article went deep on managing orphan S3 files in ClickHouse, revealing a rich set of queries for understanding the state of MergeTree S3 storage and correcting problems. Here are a couple of areas for further work related to S3 in general.

First, we can make improvements to ClickHouse itself. Over time ClickHouse has widened the scope of things that can be fully managed in SQL. Role-based access control (RBAC) and dictionaries are just two examples. Here are some ideas to get S3 to the same level. 

  • Expose S3 endpoint URLs in system tables so that users can locate and manage external storage. URLs are in some cases sensitive so it probably makes sense to add a privilege to restrict access to administrators. 
  • Add a _datetime virtual column to the s3() table function so that you can tell when a file was written to S3. 
  • Align S3 paths in system tables to allow convenient joins with metadata from object storage itself. 
  • Enhance system tables to enable users to trace storage back to the tables and databases that use it. This will make it easier to track down data corruption problems.
  • Add features to manipulate storage from within ClickHouse itself rather than depending on external tools. ClickHouse has commands like SYSTEM DROP REPLICA to clean up metadata in [Zoo]Keeper. Analogous commands like SYSTEM DROP S3 ENDPOINT or SYSTEM DROP S3 PATH would be natural and convenient. 

A second focus is database management. At Altinity we are developing tools to automate scanning and repair of problems with S3 files as well as other storage consistency issues. We’re also designing features in Altinity.Cloud that will help users lay out best-practice configurations automatically. We’ll come back to you when they are ready, so watch this space.  

For a longer view of S3 work, look at the MergeTree over S3 improvements (RFC) issue logged on the ClickHouse repo as well as our recent Building a Better ClickHouse article. Along with many others in the ClickHouse community, Altinity is working on both near-term and long-term features to make S3 storage an even more powerful tool for big data. 

Conclusion

S3-backed MergeTree is a major extension to ClickHouse that allows MergeTree tables to take advantage of cheap and infinitely scalable object storage. It’s now used in many installations and best practices are becoming well-established. 

This series has covered a lot of ground starting with an overview of how MergeTree uses S3 storage and working down to details like cleaning up orphan files. There’s more to be said, of course, on topics like backup and monitoring. Watch out for future articles on these topics. 

ClickHouse support for S3 is improving rapidly. We’ve discussed a number of the improvements in the Building a Better ClickHouse article referenced above. If you would like to join our efforts to improve S3 support in ClickHouse or just find out more about using object storage, please contact us or track us down on Slack. We look forward to hearing from you!
