Do-It-Yourself Multi-Volume Storage in ClickHouse

 

Mar 5, 2019

Many applications have very different requirements for acceptable latency and processing speed across different parts of the database. In time-series use cases most of your requests touch only the last day of data (‘hot’ data). Those queries should run very fast. A lot of background processing also happens on the ‘hot’ data: inserts, merges, replication, and so on. Such operations should likewise be processed with the highest possible speed and without significant latency.

If the ‘hot’ data doesn’t fit into RAM, you will probably consider using faster storage for it. SSD / NVM offers a significant performance improvement compared with a usual HDD.

Meanwhile, you have much more data from last month or last year (‘cold’ data). It is not requested as often, and bigger delays or slower access are acceptable. Most of the time that data is just stored and never queried. Using SSD and other fast, expensive storage for that ‘cold’ data sounds like overkill, not to mention a waste of money.

Databases typically have a concept of a ‘tablespace’ to manage storage volumes for different purposes. ClickHouse does not have this useful feature. Nevertheless, it is possible to implement something similar with a little bit of ingenuity.

Setting up multiple storage tiers

We have seen that ClickHouse as of Q1 2019 lacks native support for table spaces and multi-volume storage. We will therefore need to do some OS-level tweaking. Here is a step-by-step guide to setting up tables on different storage types.

1. Create two separate tables for hot and cold data.

CREATE TABLE my_db.mytable_hot (
  date Date, number UInt64 
) 
ENGINE = MergeTree 
PARTITION BY date 
ORDER BY number;

CREATE TABLE my_db.mytable_cold AS my_db.mytable_hot;

2. Symlink (or mount) the folder /var/lib/clickhouse/data/my_db/mytable_cold to slow storage. If the table is not empty, you will need to stop ClickHouse and then move the contents of the folder. A bind-mount alternative is sketched after the commands below.

mv /var/lib/clickhouse/data/my_db/mytable_cold/ /mnt/hdd/slowdb/mytable_cold
ln -s /mnt/hdd/slowdb/mytable_cold /var/lib/clickhouse/data/my_db/mytable_cold
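If you prefer a mount over a symlink, a bind mount works as well. This is only a minimal sketch, assuming the same paths as above, a package installation that runs the server as the clickhouse user, and that the commands are run as root:

service clickhouse-server stop
mkdir -p /mnt/hdd/slowdb/mytable_cold
mv /var/lib/clickhouse/data/my_db/mytable_cold/* /mnt/hdd/slowdb/mytable_cold/
chown -R clickhouse:clickhouse /mnt/hdd/slowdb/mytable_cold   # keep ClickHouse as the owner of its data
mount --bind /mnt/hdd/slowdb/mytable_cold /var/lib/clickhouse/data/my_db/mytable_cold
# To make the bind mount survive reboots, add a line like this to /etc/fstab:
# /mnt/hdd/slowdb/mytable_cold  /var/lib/clickhouse/data/my_db/mytable_cold  none  bind  0  0
service clickhouse-server start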

3. Create a table with Engine=Merge on top of mytable_hot and mytable_cold:

CREATE TABLE my_db.mytable AS my_db.mytable_hot ENGINE = Merge(my_db, '^mytable_');

4. Make your inserts into mytable_hot:

INSERT INTO my_db.mytable_hot SELECT toDate('2019-02-28') as date, number FROM numbers(200);
INSERT INTO my_db.mytable_hot SELECT toDate('2019-03-01') as date, number FROM numbers(200);

and query my_db.mytable:

SELECT date, count() FROM my_db.mytable GROUP BY date;
┌───────date─┬─count()─┐
│ 2019-02-28 │     200 │
│ 2019-03-01 │     200 │
└────────────┴─────────┘
2 rows in set. Elapsed: 0.004 sec.

As a result there is a single access point to both hot and cold data stored on different devices.

Our cold storage is empty, though, so how should we move data from hot to cold storage now?  Here are the steps to move a partition for a single date.

1. (Optional) Run OPTIMIZE ... FINAL on that partition. That way merging is finished on fast storage, and no background merges will happen on slow storage.

OPTIMIZE TABLE my_db.mytable_hot PARTITION '2019-02-28' FINAL;

2. Create a backup of the partition:

ALTER TABLE my_db.mytable_hot FREEZE PARTITION '2019-02-28';

3. Move the partition from the backup to the new location.

mv /var/lib/clickhouse/shadow/$(cat /var/lib/clickhouse/shadow/increment.txt)/data/my_db/mytable_hot/* /var/lib/clickhouse/data/my_db/mytable_cold/detached

4. Attach the partition to cold storage, then detach it from hot storage.

ALTER TABLE my_db.mytable_cold ATTACH PARTITION '2019-02-28';
ALTER TABLE my_db.mytable_hot DETACH PARTITION '2019-02-28';

Finally we’ve got the desired setup: hot data and cold data are stored on different storage types and can be selected from one logical table. You can check it by listing the data folders (see the sketch after the query below) or by selecting the virtual column _table provided by the Merge engine:

SELECT _table, date, count() FROM my_db.mytable GROUP BY _table, date
┌─_table───────┬───────date─┬─count()─┐
│ mytable_cold │ 2019-02-28 │     200 │
│ mytable_hot  │ 2019-03-01 │     200 │
└──────────────┴────────────┴─────────┘
2 rows in set. Elapsed: 0.012 sec.
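Listing the data folders shows the same split (paths as in the example above):

ls /var/lib/clickhouse/data/my_db/mytable_hot/    # active parts for 2019-03-01 (fast storage)
ls /mnt/hdd/slowdb/mytable_cold/                  # attached parts for 2019-02-28 (slow storage)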

Problems & pitfalls:

1. Currently there is no option to move partitions atomically. If a SELECT executes between the ATTACH and the DETACH, it will see doubled data for that partition. Moving should only happen when there are no requests touching that data fragment, i.e. when the data is really in a ‘cold’ state.

2. Simpler alternatives for copying partitions (ATTACH PARTITION FROM, REPLACE PARTITION FROM) do not work across storage devices, as hard links are used during the process:

ALTER TABLE my_db.mytable_cold ATTACH PARTITION '2019-02-28' FROM my_db.mytable_hot
Received exception from server (version 19.3.5):
Code: 424. DB::Exception: Received from localhost:9000, ::1. 
DB::ErrnoException. DB::ErrnoException: Cannot link 
/var/lib/clickhouse/data/my_db/mytable_hot/20190228_1_1_0/date.mrk to 
/var/lib/clickhouse/data/my_db/mytable_cold/tmp_replace_from_20190228_1_1_0/date.mrk, 
errno: 18, strerror: Invalid cross-device link.

3. A similar problem occurs when you try to back up a partition located on separate storage (FREEZE tries to create hard links to your data in the /var/lib/clickhouse/shadow folder):

ALTER TABLE my_db.mytable_cold FREEZE PARTITION '2019-02-28'
Received exception from server (version 19.3.5):
Code: 424. DB::Exception: Received from localhost:9000, ::1. 
DB::ErrnoException. DB::ErrnoException: Cannot link 
/var/lib/clickhouse/data/my_db/mytable_cold/20190228_4_4_0/date.mrk to 
/var/lib/clickhouse/shadow/9/data/my_db/mytable_cold/20190228_4_4_0/date.mrk, 
errno: 18, strerror: Invalid cross-device link.

Conclusion

If there is a missing feature in ClickHouse, it usually means there is also a workaround. As we demonstrated in this article, ClickHouse can work with multiple storage types, and the ‘tiered storage’ scenario, where ‘hot’ and ‘cold’ data are stored on different devices, can be implemented. It requires some manual work, though automation would make it simpler (a sketch of such a script follows below).
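To give an idea of what that automation could look like, here is a minimal sketch that wraps the partition-move steps above into one script. It is only a hypothetical helper: it assumes the table names and paths from this article, date-based partitions, and that it is run as root (or the clickhouse user) on the server itself, and it does not solve the atomicity issue from pitfall 1.

#!/bin/bash
# move_partition.sh (hypothetical helper): move one partition from
# my_db.mytable_hot to my_db.mytable_cold, e.g. ./move_partition.sh 2019-02-28
set -e
PARTITION="$1"

# 1. Optional: finish merges for the partition while it is still on fast storage.
clickhouse-client --query="OPTIMIZE TABLE my_db.mytable_hot PARTITION '$PARTITION' FINAL"

# 2. Freeze the partition; FREEZE places hard links under /var/lib/clickhouse/shadow.
clickhouse-client --query="ALTER TABLE my_db.mytable_hot FREEZE PARTITION '$PARTITION'"
INCREMENT=$(cat /var/lib/clickhouse/shadow/increment.txt)

# 3. Move the frozen parts into the detached folder of the cold table
#    (which points at slow storage thanks to the symlink or bind mount).
mv /var/lib/clickhouse/shadow/$INCREMENT/data/my_db/mytable_hot/* \
   /var/lib/clickhouse/data/my_db/mytable_cold/detached/

# 4. Attach on cold storage first, then detach from hot storage.
clickhouse-client --query="ALTER TABLE my_db.mytable_cold ATTACH PARTITION '$PARTITION'"
clickhouse-client --query="ALTER TABLE my_db.mytable_hot DETACH PARTITION '$PARTITION'"

# The detached parts remain under mytable_hot/detached and can be removed
# once the move has been verified.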

It is worth mentioning that support for multiple storage volumes is on the ClickHouse 2019 roadmap. We expect native ClickHouse support to appear in a few months, at which point tiered storage will be a built-in feature. You, in turn, can expect a detailed article on this blog describing exactly how it works.

  


Comments

  1. ClickHouse release 19.15.2.2 from 2019-10-01 includes formal support for tiered storage over multiple volumes.

    1. Yes, we were working on that together with the Yandex folks. This is an early release, so issues are possible. We are testing it extensively now to make sure it is production ready.
