Putting Things Where They Belong Using New TTL Moves

By Vitaliy Zakaznikov on March 23rd, 2020

Multi-volume storage is crucial in many use cases. It helps reduce storage costs and
improves query performance by allowing placement of the most critical application data on the fastest storage devices. Monitoring data is a classic use case: the value of data degrades rapidly over time. The last day, last week, last month, and last year of data have very different access patterns, which in turn correspond to different storage needs.

Suitable placement of data based on its age is therefore of great importance. ClickHouse TTL moves now provide a mechanism to achieve this. The best part about ClickHouse time-based TTL moves is that they are intuitive and correspond directly to human notions of calendar time. TTL moves make it much easier to set up multi-volume storage that corresponds to business requirements.

In previous blog articles, we introduced the multi-volume storage capability in ClickHouse.
The multi-volume feature significantly increases ClickHouse server capacity. It enables tiered storage by providing storage policies that arrange disks into volumes and set up relationships between them. However, storage policies by themselves do not provide much flexibility in controlling where ClickHouse places the data. Either the user moves data manually using ALTER TABLE [db.]table MOVE PART|PARTITION commands, or relies on the move_factor parameter to enable a simple distribution of data between volumes based on the ratio of used space.

In this article, we will look at the new table TTL moves that allow a user to define expressions
that move data automatically to a specific disk or volume defined in the storage configuration. The new TTL moves greatly enhance the multi-volume storage capability and provide the fine-grained control needed to put multi-volume storage to full use.

TTL Expressions

MergeTree is currently the only family of table engines that supports TTL expressions.
ClickHouse first added support for TTL expressions to enable automatic delete mutations. A TTL expression is simply an SQL expression that must evaluate to the Date or DateTime data type.
The expression can use explicit time intervals with the INTERVAL keyword or the toInterval conversion functions. For example,

TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR

or, using toInterval conversion functions, we could write the following expressions

TTL date_time + toIntervalMonth(ttl)
TTL date_time + toIntervalHour(ttl)

or simply

TTL date_time + INTERVAL ttl MONTH
TTL date_time + INTERVAL ttl HOUR

We can assign these expressions to a table; when assigned, they are known as table TTL expressions. A table can have only one expression for deletes, and multiple expressions for the automatic movement of parts to disks or volumes. For example, assuming we have a storage policy that defines a volume slow_volume and a disk slow_disk, the table TTL expressions could look like the following

TTL date_time + INTERVAL 1 MONTH DELETE,
   date_time + INTERVAL 1 WEEK TO VOLUME 'slow_volume',
   date_time + INTERVAL 2 WEEK TO DISK 'slow_disk';

A word of caution: because ClickHouse added support for delete TTL expressions first, if the
TO DISK or TO VOLUME clause is not specified, the DELETE clause is assumed.
Therefore, we recommend that you always be explicit and use the DELETE clause to
identify the TTL expression you want to use for deletions.

Setting Up the Demo

If you haven’t looked at multi-volume storage yet, or played around with TTL delete or move expressions, we recommend using the latest ClickHouse version, 20.3.2.1. We will use it for the rest of this article.

$ clickhouse-client
ClickHouse client version 20.3.2.1 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.3.2 revision 54433.

For demo purposes, we will use the OnTime database (USA civil flight data from 1987 to 2018).
As a shortcut, we will download and use pre-made partitions.

$ curl -O https://clickhouse-datasets.s3.yandex.net/ontime/partitions/ontime.tar
$ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
$ # check permissions of unpacked data, fix if required
$ sudo service clickhouse-server restart
$ clickhouse-client --query "select count(*) from datasets.ontime"

Of course, the TTL move expressions do not make sense without multi-volume storage. Therefore, we will simulate multiple storage devices by creating folders that represent mounted devices with different speeds and capacities.

/data/fast
/data/med0
/data/med1
/data/slow
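Creating these folders takes only a few lines. Here is a minimal sketch; note that the fallback to a temporary directory and the `clickhouse` service user are our assumptions for illustration (the article itself uses /data, which usually requires root to create):

```python
import os
import tempfile

# Root under which the simulated storage devices live. The article uses
# /data; we fall back to a temporary directory so this sketch can run
# without root privileges.
DATA_ROOT = os.environ.get("DATA_ROOT") or os.path.join(tempfile.gettempdir(), "data")
TIERS = ["fast", "med0", "med1", "slow"]

for tier in TIERS:
    os.makedirs(os.path.join(DATA_ROOT, tier), exist_ok=True)
    # In a real setup the ClickHouse server must own these folders, e.g.
    # shutil.chown(path, user="clickhouse", group="clickhouse")  # needs root
```

Remember that ClickHouse must have read and write access to every path you later reference in the storage configuration.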

Adding Storage Configuration Without Server Restart

If we check right now, our server uses only the default disk. We can find this out by looking into the system.disks table.

:) select * from system.disks

SELECT *
FROM system.disks

┌─name────┬─path─────────────────┬──free_space─┬──total_space─┬─keep_free_space─┐
│ default │ /var/lib/clickhouse/ │ 37705834496 │ 468514799616 │               0 │
└─────────┴──────────────────────┴─────────────┴──────────────┴─────────────────┘

We need more storage, and we can add new disks without restarting the server. We’ve recently added this feature to ClickHouse, so let’s give it a try. We define the storage configuration by placing storage.xml into the /etc/clickhouse-server/config.d/ folder.

<yandex>
 <storage_configuration>
   <disks>
     <default>
        <keep_free_space_bytes>1024</keep_free_space_bytes>
     </default>
     <fast>
        <path>/data/fast/</path>
     </fast>
     <med0>
         <path>/data/med0/</path>
     </med0>
     <med1>
         <path>/data/med1/</path>
     </med1>
     <slow>
         <path>/data/slow/</path>
     </slow>
   </disks>
   <policies>
     <default>
       <volumes> 
         <default>
           <disk>default</disk>
         </default>
         <fast>
           <disk>fast</disk>
         </fast>
         <med>
           <disk>med0</disk>
           <disk>med1</disk>
         </med>
         <slow>
           <disk>slow</disk>
         </slow>
       </volumes>
     </default>
   </policies>
 </storage_configuration>
</yandex>

If we reload the configuration using the SYSTEM RELOAD CONFIG command, then we should see new disks in the system.disks table.

:) SYSTEM RELOAD CONFIG
:) select * from system.disks

┌─name────┬─path─────────────────┬──free_space─┬──total_space─┬─keep_free_space─┐
│ default │ /var/lib/clickhouse/ │ 37993152512 │ 468514799616 │               0 │
│ fast    │ /data/fast/          │ 37993152512 │ 468514799616 │               0 │
│ med0    │ /data/med0/          │ 37993152512 │ 468514799616 │               0 │
│ med1    │ /data/med1/          │ 37993152512 │ 468514799616 │               0 │
│ slow    │ /data/slow/          │ 37993152512 │ 468514799616 │               0 │
└─────────┴──────────────────────┴─────────────┴──────────────┴─────────────────┘

We can retrieve the storage policy by peeking into the system.storage_policies table.

:) select * from system.storage_policies

┌─policy_name─┬─volume_name─┬─volume_priority─┬─disks───────────┬─max_data_part_size─┬─move_factor─┐
│ default     │ default     │               1 │ ['default']     │                  0 │         0.1 │
│ default     │ fast        │               2 │ ['fast']        │                  0 │         0.1 │
│ default     │ med         │               3 │ ['med0','med1'] │                  0 │         0.1 │
│ default     │ slow        │               4 │ ['slow']        │                  0 │         0.1 │
└─────────────┴─────────────┴─────────────────┴─────────────────┴────────────────────┴─────────────┘

Creating a Table That Uses TTL Moves

Let’s now look at how TTL moves can be defined when creating a new table. We will use flight data from 2010 onwards. We’ll keep the data for the last three years on the fast volume; data between three and five years old goes to med, data between five and seven years old goes to slow, and anything older is deleted. We can achieve such a scheme with the following table definition.

CREATE TABLE ontime_chunk
ENGINE = MergeTree()
PARTITION BY Year
ORDER BY FlightDate
TTL FlightDate TO VOLUME 'fast',
   FlightDate + INTERVAL 3 YEAR TO VOLUME 'med',
   FlightDate + INTERVAL 5 YEAR TO VOLUME 'slow',
   FlightDate + INTERVAL 7 YEAR DELETE
AS
SELECT *
FROM datasets.ontime
WHERE Year >= 2010

Ok.
0 rows in set. Elapsed: 116.659 sec. Processed 55.09 million rows, 40.13 GB (472.19 thousand rows/s., 344.00 MB/s.)

Note that the old functional-style syntax for MergeTree tables is not supported with TTL moves.
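The tiering scheme encoded by the TTL clauses above can be sketched in plain code. This is only an illustration: ClickHouse actually adds calendar intervals to FlightDate and compares against the current time, whereas here we approximate a year as 365.25 days, and the function name `destination` is ours.

```python
from datetime import date

def destination(flight_date: date, today: date) -> str:
    """Approximate the tier a part ends up on under the table TTL above."""
    age_years = (today - flight_date).days / 365.25  # rough calendar-year age
    if age_years >= 7:
        return "delete"  # FlightDate + INTERVAL 7 YEAR DELETE
    if age_years >= 5:
        return "slow"    # FlightDate + INTERVAL 5 YEAR TO VOLUME 'slow'
    if age_years >= 3:
        return "med"     # FlightDate + INTERVAL 3 YEAR TO VOLUME 'med'
    return "fast"        # TTL FlightDate TO VOLUME 'fast'

print(destination(date(2018, 6, 1), date(2020, 3, 23)))  # fast
print(destination(date(2012, 1, 1), date(2020, 3, 23)))  # delete
```

Each part is evaluated against every TTL expression, and the most "aged" matching rule wins, which is why the rules can be listed in any order.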

Once the table is created and populated, we can check where the table parts are stored by looking at the system.parts table. The query below gives us some basic statistics.

SELECT
   partition,
   disk_name,
   count(),
   sum(rows),
   max(bytes_on_disk),
   avg(bytes_on_disk)
FROM system.parts
WHERE active AND (table = 'ontime_chunk')
GROUP BY
   partition,
   disk_name
ORDER BY partition DESC

┌─partition─┬─disk_name─┬─count()─┬─sum(rows)─┬─max(bytes_on_disk)─┬─avg(bytes_on_disk)─┐
│ 2018      │ fast      │       7 │   6033426 │          264508145 │  86241164.71428572 │
│ 2017      │ fast      │       4 │   5674621 │          274419521 │       138846953.25 │
│ 2016      │ med1      │       3 │   2169891 │          147475124 │  70784017.66666667 │
│ 2016      │ med0      │       3 │   3447767 │          316232407 │ 112364801.66666667 │
│ 2015      │ med1      │       2 │   5265093 │          260304509 │          259205244 │
│ 2015      │ med0      │       1 │    553986 │           54923408 │           54923408 │
│ 2014      │ slow      │       2 │   5819811 │          315955553 │        288535865.5 │
│ 2013      │ slow      │       7 │   5089209 │          266864685 │           71928783 │
│ 2012      │ slow      │       6 │    436874 │            9818520 │  7203465.166666667 │
│ 2011      │ slow      │       2 │     62029 │            5946491 │            2973249 │
│ 2010      │ slow      │       3 │    113398 │            8838400 │ 3741370.6666666665 │
└───────────┴───────────┴─────────┴───────────┴────────────────────┴────────────────────┘

As shown above, ClickHouse has applied TTL moves on insert, and parts are almost where we
expect them to be. Why almost? Because TTL delete expressions are a different beast. ClickHouse does not evaluate them during insert, and thus we still see data for the years 2010 to 2013 on the slow disk that we want deleted.

Digression About TTL Deletes

ClickHouse does not handle TTL moves and deletes in the same manner. I have included a TTL delete expression in the example above on purpose to make this point. The reason is that
TTL deletes cause an expensive mutation operation, so they can be much more expensive than TTL moves, which just copy parts between disks. Keep this in mind when working with TTLs.

But given that most users will use TTL deletes and moves together, it is essential to point out that ClickHouse controls the frequency of TTL deletes with the merge_with_ttl_timeout MergeTree table setting. It defines the minimum time in seconds before a merge with TTL can be repeated, and by default is set to 86400 seconds (24 hours). In practice, this means a TTL delete is performed at most once every 24 hours on only one partition, unless a background merge occurs. So in the worst case, ClickHouse currently deletes one partition matching the TTL delete expression every 24 hours.

This behavior might not be ideal, so if you want a TTL delete expression to perform deletes faster, you can lower the merge_with_ttl_timeout setting for the table. For example, we can set it to one hour as follows.

ALTER TABLE [db.]table MODIFY SETTING merge_with_ttl_timeout = 3600

Now you should see ClickHouse deleting parts according to your TTL delete expression every hour. Of course, if your table is not too big, you could force the deletes using the
OPTIMIZE TABLE [db.]table FINAL statement. However, this is not recommended for large tables.

Adding TTL Moves To An Existing Table

We’ve seen how one can create a table with TTL moves defined up-front. However, if you already have a table or you would like to change existing TTL move expressions, you have to resort to using the ALTER TABLE [db.]table MODIFY TTL command.

Note that when modifying TTL expressions, you have to re-list all TTLs: all the move expressions
and the delete expression, if present.

Let’s re-use the table above and change TTL expressions. We now want everything except for the data for the last three years to be placed on the slow volume or deleted at some point in time if it is older than seven years.

ALTER TABLE ontime_chunk
   MODIFY TTL FlightDate TO VOLUME 'fast', FlightDate + toIntervalYear(3) TO VOLUME 'slow', FlightDate + toIntervalYear(7) DELETE
Ok.
0 rows in set. Elapsed: 0.037 sec.

That was fast! But wait, did it move anything? No, it didn’t. The new TTL expressions are assigned only to new parts, either when an insert comes in or when a new part is created
as a result of a background merge. For existing parts, the new TTL can be applied by materializing TTLs with the ALTER TABLE [db.]table MATERIALIZE TTL statement. If we execute it on our table, the command returns quickly.

ALTER TABLE ontime_chunk
   MATERIALIZE TTL
Ok.
0 rows in set. Elapsed: 0.044 sec.

All it does is rewrite the ttl.txt files that you can find inside each part’s folder. For example, we can take a look at what a random part contains.

$ sudo cat /data/fast/data/default/ontime_chunk/2017_113_113_0/ttl.txt
ttl format version: 1
{
 "table": {
    "min":1710457200,
    "max":1710975600
 },
 "moves":[
     {"expression":"FlightDate","min":1489532400,"max":1490050800},
     {"expression":"plus(FlightDate, toIntervalYear(3))","min":1584226800,"max":1584745200},
     {"expression":"plus(FlightDate, toIntervalYear(5))","min":1647298800,"max":1647817200}
 ]
}
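The min and max values in ttl.txt are Unix timestamps giving the range of the evaluated TTL expression over the rows in the part. A tiny sketch can decode them; here we embed just the JSON body from the file above (stripping the "ttl format version: 1" header) and render the timestamps in UTC:

```python
import json
from datetime import datetime, timezone

# The JSON portion of the ttl.txt shown above (version header removed).
ttl_json = """
{
 "table": {"min":1710457200, "max":1710975600},
 "moves":[
     {"expression":"FlightDate","min":1489532400,"max":1490050800},
     {"expression":"plus(FlightDate, toIntervalYear(3))","min":1584226800,"max":1584745200},
     {"expression":"plus(FlightDate, toIntervalYear(5))","min":1647298800,"max":1647817200}
 ]
}
"""

def ts(t: int) -> str:
    # Render a Unix timestamp as a UTC date-time string.
    return datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")

info = json.loads(ttl_json)
for move in info["moves"]:
    print(move["expression"], "->", ts(move["min"]), "..", ts(move["max"]))
```

For instance, the FlightDate expression decodes to mid-March 2017, which matches the 2017 partition this part belongs to, and each move expression is offset from it by the corresponding interval.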

After you execute the MATERIALIZE TTL command, ClickHouse starts moving the parts
to their new locations in the next background cycle. For our example it did not take long. Looking again at the system.parts table, I find that parts got moved to their new places, and some rows got deleted because of a background merge.

SELECT
   partition,
   disk_name,
   count(),
   sum(rows),
   max(bytes_on_disk),
   avg(bytes_on_disk)
FROM system.parts
WHERE active AND (table = 'ontime_chunk')
GROUP BY
   partition,
   disk_name
ORDER BY partition DESC

┌─partition─┬─disk_name─┬─count()─┬─sum(rows)─┬─max(bytes_on_disk)─┬─avg(bytes_on_disk)─┐
│ 2018      │ fast      │       8 │   6033426 │          372476291 │         75689482.5 │
│ 2017      │ fast      │       8 │   5674621 │          304683038 │       69514551.875 │
│ 2016      │ slow      │       3 │   5617658 │          396503243 │ 183260415.33333334 │
│ 2015      │ slow      │       2 │   5819079 │          329783074 │        286661116.5 │
│ 2014      │ slow      │       7 │   5819811 │          244641752 │           85566643 │
│ 2013      │ slow      │       5 │   5089209 │          383141486 │        102324330.2 │
│ 2012      │ slow      │       1 │         0 │                  3 │                  3 │
│ 2011      │ slow      │       2 │         0 │                  3 │                  3 │
│ 2010      │ slow      │       3 │         0 │                  3 │                  3 │
└───────────┴───────────┴─────────┴───────────┴────────────────────┴────────────────────┘

Some How’s and When’s

Now that we have seen how to use TTL move expressions, let’s look at how and when ClickHouse evaluates them.

How Are TTL Moves Evaluated?

ClickHouse evaluates TTL move expressions by using a dedicated pool of background threads.
The behavior of the pool is controlled by the following parameters that can be defined either in the config.xml or in a separate configuration file inside the /etc/clickhouse-server/config.d/ folder.

Here is the list of parameters and their current default values:

  • background_move_pool_size: 8
  • background_move_processing_pool_thread_sleep_seconds: 10
  • background_move_processing_pool_thread_sleep_seconds_random_part: 1.0
  • background_move_processing_pool_thread_sleep_seconds_if_nothing_to_do: 0.1
  • background_move_processing_pool_task_sleep_seconds_when_no_work_min: 10
  • background_move_processing_pool_task_sleep_seconds_when_no_work_max: 600
  • background_move_processing_pool_task_sleep_seconds_when_no_work_multiplier: 1.1
  • background_move_processing_pool_task_sleep_seconds_when_no_work_random_part: 1.0

For testing purposes, we use the following configuration file to enable instant moves.

<yandex>
   <background_move_processing_pool_thread_sleep_seconds>0.5</background_move_processing_pool_thread_sleep_seconds>
   <background_move_processing_pool_task_sleep_seconds_when_no_work_max>0.5</background_move_processing_pool_task_sleep_seconds_when_no_work_max>
</yandex>

This simple configuration highlights the two most important parameters, which you might want to adjust if needed: background_move_processing_pool_task_sleep_seconds_when_no_work_max and background_move_processing_pool_thread_sleep_seconds. The first defines the maximum time the pool can sleep when there is no work (no moves) to do. By default, ClickHouse sets it to 600 seconds, which means that after a TTL move expression is triggered, the actual
move can take up to 10 minutes to start. How long the move then takes to complete depends on the number of parts ClickHouse needs to move and on disk I/O performance. The second parameter defines the number of seconds a worker thread sleeps before picking up another task.

Based on these parameters, when the background move processing pool wakes up, it scans the TTL expressions of all parts and determines whether anything needs moving.

Note that when moving parts to a disk or a volume, the background move processing pool checks the
constraints defined by the storage policy. If ClickHouse can’t move some part according to the TTL move
expression, it will retry the move later.

When Are TTL Moves Evaluated?

ClickHouse evaluates TTL move expressions in the following cases:

  • when you insert a part
  • when a background move pool processing task wakes up
  • after ClickHouse creates a new part as a result of a background merge process
  • when a replica downloads a new part

Some Known Corner Cases

Nothing is perfect, so here is a list of some known corner cases related to TTL moves that you should keep in mind.

  • There is no SQL statement to force a TTL move without a parts merge.
  • There are differences in behavior between TTL moves and deletes.
  • Multi-threaded movement of a large number of parts on
    the same physical disk can hurt performance due to the I/O bottleneck.

Conclusion

In this article, we have looked at the new TTL move feature and how it extends the usage of storage policies. With TTL moves, ClickHouse has a powerful tool to manage how data is stored using multi-volume storage. There are still some corner cases that we are working to iron out. However, that should not keep you from trying storage policies and TTL moves to see how they can save you money by significantly reducing your storage costs. The new TTL moves will help you put things where they belong.
