There are only two hard things in Computer Science:
cache invalidation and naming things.
— PHIL KARLTON
One of the first things that every database administrator learns is that the slowest part of the database is storage. Database designers put a lot of effort into mitigating this. Indices, columnar data formats, data compression and other techniques reduce the amount of data to be read. But one cannot beat the laws of physics, so reading from storage will always be slower than reading from RAM.
For this reason, every database has a comprehensive set of caches that mitigate slow storage. It is very important to understand how different layers of caching work in the database. In this article, we will give an overview of ClickHouse caches. The article consists of two parts: the first part explains read caches, and the second part focuses on write caches and other internal caches used by ClickHouse. Let’s start!
Read Caches Overview
There are multiple caches that are populated and used during query execution.
The majority of caches are in-memory. Exceptions are the disk cache for object storage and metadata cache. Those will be covered in the second part.
Raw Data Caches
OS Page Cache
This is the first and among the most important caches. The page cache is maintained by the Linux kernel and is totally transparent to applications. When data is read from storage, it is placed in the page cache; the next access to the same data reads it from RAM. The size of the cache is not configured: it simply uses free RAM. When RAM is scarce and the data changes fast, the cache is not very effective. In ClickHouse, data parts are immutable, and therefore can be cached quite efficiently. That said, if your query takes a lot of RAM, it pushes data out of the page cache. In a well-tuned ClickHouse system, queries typically take less than 50% of available RAM, leaving the rest for the page cache.
While this cache is transparent, it is very easy to test how it works. There is a ClickHouse setting, min_bytes_to_use_direct_io, that allows you to bypass the page cache and use 'direct I/O' mode. When it is set to '1', the page cache is not used. Let's test it!
We will use the Altinity demo server to test caches: https://github.demo.altinity.cloud:8443/play. The credentials are demo/demo for read-only access.
The server contains the ‘github_events’ dataset, so it is easy to use for experiments. Let’s run the following query:
SELECT repo_name, count() stars
FROM github_events
WHERE event_type = 'WatchEvent'
  AND repo_name ILIKE '%-operator'
  AND repo_name NOT LIKE '%pipeline%'
  AND repo_name NOT LIKE 'tc39%'
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 20
It shows the top 20 operators. Altinity/clickhouse-operator for Kubernetes, with almost 1400 stars, is #9 overall, and #2 after PostgreSQL if only database operators are considered.
The query runs almost instantly, in 600ms. Now let's force direct I/O mode:
SELECT repo_name, count() stars
FROM github_events
WHERE event_type = 'WatchEvent'
  AND repo_name ILIKE '%-operator'
  AND repo_name NOT LIKE '%pipeline%'
  AND repo_name NOT LIKE 'tc39%'
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 20
SETTINGS min_bytes_to_use_direct_io = 1
This time it takes 5.6s! Almost 10 times slower than before! You can play around with adding and removing this setting to see how the performance changes.
The page cache is also used for writes, so writes can be a reason that the page cache is invalidated. ClickHouse assumes that newly inserted data (INSERTs) should be placed in the cache, since there is a good chance this data will be read right after, e.g. by merges. To control whether merges use the page cache, there is a special MergeTree setting, min_merge_bytes_to_use_direct_io. The default is 10GB. This ensures that big merges, which typically happen for historical data, do not invalidate the cache and do not affect query performance.
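As a sketch, this threshold can also be changed per table; the table name below is hypothetical, and values other than the 10GB default are illustrative:

```sql
-- Lower the direct I/O threshold for merges to 1GB on an existing table
ALTER TABLE my_table MODIFY SETTING min_merge_bytes_to_use_direct_io = 1073741824;

-- Or disable direct I/O for merges entirely, letting all merges go through the page cache
ALTER TABLE my_table MODIFY SETTING min_merge_bytes_to_use_direct_io = 0;
```

Setting the value to 0 disables direct I/O for merges, which may be reasonable on small servers where merge sizes never approach the default threshold anyway.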
Uncompressed Blocks Cache
The data on disk and in the page cache is stored in compressed blocks. Compression is LZ4 by default, but can be changed to ZSTD. While decompression is fast, it still takes time and CPU. In some cases, when the dataset is small and very low latency is required, it makes sense to enable the cache for uncompressed blocks. The size of this cache is configured by the uncompressed_cache_size server setting; it is 8GB by default. Let's try:
SELECT uniq(creator_user_login)
FROM github_events
WHERE event_type = 'PullRequestEvent' AND repo_name = 'ClickHouse/ClickHouse'
This query shows that 1200+ people created pull requests to ClickHouse. It runs in 25ms. What if we enable the uncompressed cache?
SELECT uniq(creator_user_login)
FROM github_events
WHERE event_type = 'PullRequestEvent' AND repo_name = 'ClickHouse/ClickHouse'
SETTINGS use_uncompressed_cache = 1
In reality, the uncompressed blocks cache is worth trying if your query already runs under 50ms, scans a small number of parts, and you want to make it even faster.
The current size of the cache can be viewed in system.asynchronous_metrics:
SELECT metric, value FROM system.asynchronous_metrics WHERE metric LIKE 'UncompressedCache%';

┌─metric─────────────────┬────value─┐
│ UncompressedCacheBytes │ 67145309 │
│ UncompressedCacheCells │      539 │
└────────────────────────┴──────────┘
You will see that 67MB of cache has been allocated for the previous query.
Cache hit/miss ratio can be viewed in system.events:
SELECT event, value FROM system.events WHERE event LIKE 'UncompressedCache%';

┌─event───────────────────┬─value─┐
│ UncompressedCacheHits   │  1617 │
│ UncompressedCacheMisses │   539 │
└─────────────────────────┴───────┘
It is only measured when the uncompressed cache has been enabled for the query. The same data is written into the ProfileEvents column of query_log records.
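For example, the per-query counters can be pulled from the query log like this (a sketch assuming a recent ClickHouse version where ProfileEvents is a Map column):

```sql
SELECT
    query_duration_ms,
    ProfileEvents['UncompressedCacheHits']   AS cache_hits,
    ProfileEvents['UncompressedCacheMisses'] AS cache_misses
FROM system.query_log
WHERE type = 'QueryFinish' AND query LIKE '%use_uncompressed_cache%'
ORDER BY event_time DESC
LIMIT 5
```

This is handy when you want to attribute hits and misses to a specific query rather than to the server as a whole.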
Finally, in order to drop the cache and release memory, ClickHouse has a special management SQL command:
SYSTEM DROP UNCOMPRESSED CACHE
Before going to the next two caches, let’s look into MergeTree internals for a minute:
MergeTree tables consist of parts. Every part consists of a primary key index and several column files that are sorted in primary key order. When ClickHouse executes a query, it uses the primary key index to read only a fraction of the data from the columns. While reading column data is a sequential read operation, the index and extra helper structures called 'marks' may require random access. Caching the index and marks reduces the number of I/O operations, thereby increasing query performance.
Primary Key Index Cache
The ClickHouse primary key index is used to locate data when running queries. The index is sparse, which allows it to be fully resident in RAM even for big datasets. The primary key index is always loaded into RAM for MergeTree tables, and users cannot control this. The current size of the RAM used for primary key indexes can be checked with the following query:
SELECT sum(primary_key_bytes_in_memory), sum(primary_key_bytes_in_memory_allocated)
FROM system.parts
For the demo server it is 58MB and 128MB respectively, which is very low. Is there a case when it may grow big and become a problem? Yes, there is. The typical situation is using Summing- or Aggregating-MergeTree tables with many columns in the ORDER BY. Since the primary key defaults to ORDER BY, this may result in high RAM usage for the primary key index. The workaround is to define PRIMARY KEY explicitly as a prefix of the ORDER BY columns, like this:
CREATE TABLE my_aggr_table
(
    /* columns */
)
ENGINE = SummingMergeTree
ORDER BY (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10)
PRIMARY KEY (col1, col2, col3)
Another possible reason for a big primary key is changing the index_granularity setting for MergeTree tables. In rare cases it makes sense to use a lower granule size for faster point queries. For example, if index_granularity is reduced from the default 8192 to 1024, the primary key index grows 8 times bigger and requires correspondingly more RAM.
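To find out which tables are responsible for primary key RAM usage, the same system.parts columns can be aggregated per table, for example:

```sql
SELECT
    database, table,
    formatReadableSize(sum(primary_key_bytes_in_memory))           AS pk_in_memory,
    formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS pk_allocated
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(primary_key_bytes_in_memory) DESC
LIMIT 10
```

If one table dominates the list, that is the first candidate for an explicit PRIMARY KEY prefix or an index_granularity review.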
Mark Cache
Marks in ClickHouse are references from the primary key index to other table columns; there is one mark file per table column. At query time, reading this file is an extra random I/O that is better minimized. ClickHouse uses a special mark cache for that. Unlike the primary key index, marks are not fully stored in RAM, as that would require too much memory. The size of the cache is controlled by the mark_cache_size server setting, and it is 5GB by default. The current usage can be checked with the following query:
SELECT metric, value FROM system.asynchronous_metrics WHERE metric LIKE 'Mark%';
Usually, even the default 5GB is sufficient. It is possible to get the size of all mark files with this query:
SELECT sum(marks_bytes) FROM system.parts WHERE active
On the demo server with 750GB of data, that returns just 2.5GB, so even if all marks for all columns are cached, it fits into the default cache size.
The 5GB limit should be increased if the total size of marks is much bigger and there are many cache misses, which can be checked with the following query:
SELECT event, value FROM system.events WHERE event LIKE 'Mark%';

┌─event───────────────────────┬────value─┐
│ MarkCacheHits               │     3572 │
│ MarkCacheMisses             │      254 │
│ SelectedMarks               │  9115465 │
│ WaitMarksLoadMicroseconds   │ 14042632 │
│ BackgroundLoadingMarksTasks │     9100 │
│ LoadedMarksCount            │ 10054356 │
│ LoadedMarksMemoryBytes      │ 30224696 │
└─────────────────────────────┴──────────┘
Keep an eye on cache misses, which should stay in the low single-digit percents; otherwise ClickHouse performance may be suboptimal. Another important metric is WaitMarksLoadMicroseconds, which reflects how long queries wait for marks to be loaded.
Note that the maximum size of the mark cache and the uncompressed cache is capped by the cache_size_to_ram_max_ratio server setting, which is 0.5 of available server memory by default.
As usual, the cache can be dropped by the SYSTEM DROP MARK CACHE command.
Query Cache
A query cache is a standard tool in many databases, and ClickHouse is no exception. Unlike all the previous caches, which are designed to speed up access to the data, the query cache stores query results in RAM and returns the same results for identical queries.
Let’s try a bit heavier query: it shows all merged pull requests since 2020-01-01 contributed by Altinity and ex-Altinity engineers into ClickHouse projects (almost 900 PRs!):
WITH ['Enmk', 'traceon', 'excitoon', 'quickhouse', 'Slach', 'f1yegor', 'zvonand',
      'myrcc', 'filimonov', 'arthurpassos', 'vzakaznikov', 'MyroTk', 'antip00',
      'den-crane', 'ilejn'] AS altinity_devs
SELECT repo_name, sum(merged), uniq(creator_user_login) num_committers,
       sumMap(map(creator_user_login, merged)) committers
FROM github_events
WHERE merged_at >= '2020-01-01'
  AND event_type = 'PullRequestEvent'
  AND repo_name ILIKE 'ClickHouse%'
  AND creator_user_login IN altinity_devs
GROUP BY repo_name WITH TOTALS
ORDER BY repo_name
The query runs in 1.5s, and the results do not improve with further runs. In order to enable the query cache, we need to add a setting to the query:
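The same query with the query cache turned on looks as follows; the only change is the added setting:

```sql
WITH ['Enmk', 'traceon', 'excitoon', 'quickhouse', 'Slach', 'f1yegor', 'zvonand',
      'myrcc', 'filimonov', 'arthurpassos', 'vzakaznikov', 'MyroTk', 'antip00',
      'den-crane', 'ilejn'] AS altinity_devs
SELECT repo_name, sum(merged), uniq(creator_user_login) num_committers,
       sumMap(map(creator_user_login, merged)) committers
FROM github_events
WHERE merged_at >= '2020-01-01'
  AND event_type = 'PullRequestEvent'
  AND repo_name ILIKE 'ClickHouse%'
  AND creator_user_login IN altinity_devs
GROUP BY repo_name WITH TOTALS
ORDER BY repo_name
SETTINGS use_query_cache = 1
```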
The first run populates the cache; the second run returns results in a few milliseconds!
The query cache is configured in two places: the size of the cache and some guard settings are defined in server configuration file as follows:
<query_cache>
    <max_size_in_bytes>1073741824</max_size_in_bytes>
    <max_entries>1024</max_entries>
    <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
    <max_entry_size_in_rows>30000000</max_entry_size_in_rows>
</query_cache>
These are reasonable defaults, though the cache size can be increased for big clusters. At the profile and query level there are parameters for fine-tuning cache usage, which can be viewed like this:
SELECT * FROM system.settings WHERE name LIKE 'query_cache%'

┌─name─────────────────────────────────────────────────────────────────┬─value─┐
│ query_cache_store_results_of_queries_with_nondeterministic_functions │ 0     │
│ query_cache_max_size_in_bytes                                        │ 0     │
│ query_cache_max_entries                                              │ 0     │
│ query_cache_min_query_runs                                           │ 0     │
│ query_cache_min_query_duration                                       │ 0     │
│ query_cache_compress_entries                                         │ 1     │
│ query_cache_squash_partial_results                                   │ 1     │
│ query_cache_ttl                                                      │ 60    │
│ query_cache_share_between_users                                      │ 0     │
└──────────────────────────────────────────────────────────────────────┴───────┘
For example, it is possible to define the TTL with the query_cache_ttl setting (default: 60 seconds), or define conditions for putting results into the query cache with the query_cache_min_query_runs and query_cache_min_query_duration settings.
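A sketch of caching only queries that are run repeatedly and take long enough to be worth caching; the thresholds below are illustrative:

```sql
SELECT repo_name, count() FROM github_events GROUP BY repo_name
SETTINGS use_query_cache = 1,
         query_cache_min_query_runs = 3,        -- cache only after the 3rd execution
         query_cache_min_query_duration = 1000, -- and only if the query runs at least 1s
         query_cache_ttl = 300                  -- keep the entry for 5 minutes
```

This prevents one-off ad-hoc queries from polluting the cache, while dashboards that repeat the same heavy query benefit from it.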
The system.events table provides some high-level statistics of query cache hits and misses when use_query_cache has been enabled for the query. It can be viewed as follows:
SELECT * from system.events WHERE event LIKE 'QueryCache%'
As usual, the cache can be dropped by the SYSTEM DROP QUERY CACHE command.
See also the ClickHouse documentation on this topic. It is pretty complete.
Dictionaries
Dictionaries are one of the oldest and most efficient mechanisms in ClickHouse. The very first article in the Altinity Blog was about dictionaries. They are not caches per se, but since dictionary data is stored in RAM, they can be used as a cache as well.
A ClickHouse dictionary is an efficient in-memory structure for fast key-value lookups. Dictionaries are often used as replacements for joins. They can read data from ClickHouse itself or from external sources, and refresh it on a schedule or by checking an update timestamp. There are multiple dictionary layouts with different RAM usage. You may check the Altinity Knowledge Base article that compares performance and memory usage for Hashed, SparseHashed and HashedArray dictionaries.
Let's consider an example where a dictionary is used as a cache for a heavy query. We will modify the previous query to measure how much Altinity developers contribute to ClickHouse compared to other projects. There are multiple ways to do it in ClickHouse; here is one option:
WITH ['Enmk', 'traceon', 'excitoon', 'quickhouse', 'Slach', 'f1yegor', 'zvonand',
      'myrcc', 'filimonov', 'arthurpassos', 'vzakaznikov', 'MyroTk', 'antip00',
      'den-crane', 'ilejn'] AS altinity_devs
SELECT creator_user_login, sum(merged), any(total_merged),
       sum(merged) / any(total_merged)
FROM github_events
JOIN
(
    SELECT creator_user_login, sum(merged) total_merged
    FROM github_events
    GROUP BY 1
) totals USING creator_user_login
WHERE merged_at >= '2020-01-01'
  AND event_type = 'PullRequestEvent'
  AND repo_name ILIKE 'ClickHouse%'
  AND creator_user_login IN altinity_devs
GROUP BY creator_user_login
We used a subquery to calculate total statistics for all GitHub users. The query is expensive: it runs in 14 seconds. Let's instead create a dictionary to replace the subquery:
CREATE DICTIONARY github_user_merges
(
    user_login String,
    total_merged UInt32
)
PRIMARY KEY user_login
SOURCE(CLICKHOUSE(QUERY
    'SELECT creator_user_login, sum(merged) total_merged
     FROM github_events WHERE merged > 0 GROUP BY 1'
))
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())
LIFETIME(MIN 600 MAX 900)
Here, we defined a dictionary as the result of a SQL query that should be refreshed every 600 to 900 seconds.
We can check in the system.dictionaries table that it took 37 seconds to load (for all GitHub users); there are more than 9M elements allocating 344MB of RAM. Now let's use it in a query:
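The numbers above come from system.dictionaries, which can be queried like this (column names as in current ClickHouse system tables):

```sql
SELECT name, status, element_count,
       formatReadableSize(bytes_allocated) AS ram,
       loading_duration
FROM system.dictionaries
WHERE name = 'github_user_merges'
```

The status column is also useful for catching dictionaries that failed to load or are stuck updating.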
WITH ['Enmk', 'traceon', 'excitoon', 'quickhouse', 'Slach', 'f1yegor', 'zvonand',
      'myrcc', 'filimonov', 'arthurpassos', 'vzakaznikov', 'MyroTk', 'antip00',
      'den-crane', 'ilejn'] AS altinity_devs
SELECT creator_user_login, sum(merged), any(total_merged),
       sum(merged) / any(total_merged)
FROM github_events
JOIN github_user_merges ON (user_login = creator_user_login)
WHERE merged_at >= '2020-01-01'
  AND event_type = 'PullRequestEvent'
  AND repo_name ILIKE 'ClickHouse%'
  AND creator_user_login IN altinity_devs
GROUP BY creator_user_login
The query now runs in just 5 seconds! This is 3 times faster than before, but still slow compared to the version without a join. Can we make it even faster? Of course. Let's call the dictionary directly instead of making a JOIN:
WITH ['Enmk', 'traceon', 'excitoon', 'quickhouse', 'Slach', 'f1yegor', 'zvonand',
      'myrcc', 'filimonov', 'arthurpassos', 'vzakaznikov', 'MyroTk', 'antip00',
      'den-crane', 'ilejn'] AS altinity_devs
SELECT creator_user_login, sum(merged) merged,
       any(dictGet('default.github_user_merges', 'total_merged', tuple(creator_user_login))) total_merged,
       merged / total_merged
FROM github_events
WHERE merged_at >= '2020-01-01'
  AND event_type = 'PullRequestEvent'
  AND repo_name ILIKE 'ClickHouse%'
  AND creator_user_login IN altinity_devs
GROUP BY creator_user_login
Now ClickHouse is smart enough to call the dictionary once per user, so the query takes only half a second!
If the dictionary has more than 10 million rows, the RAM usage may become a concern.
What if the dictionary is so big that it does not fit in RAM? Usually, this should be avoided, because big dictionaries do not perform well, and updating them may take a lot of time. For rare cases where very big dictionaries are actually necessary, ClickHouse supports Cached dictionary layouts: Cache, ComplexKeyCache, SSDCache, and SSDComplexKeyCache.
Cache and ComplexKeyCache allow you to define the number of dictionary cells stored in RAM, and therefore control the size of the cache. If a key is not found in the cache, ClickHouse makes a lookup call to the dictionary source. This is useful if the number of active lookups is low compared to the source size.
SSDCache and SSDComplexKeyCache store data on a local SSD and the index in RAM. This makes it possible to use very big dictionaries efficiently, for example for looking up user sessions. However, the bigger the cache, the more difficult it becomes to update.
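As a sketch, a cached layout caps RAM by the number of cells, and an SSD-backed layout additionally needs a local path; the sizes and path below are illustrative:

```sql
-- In-memory cache limited to 1M cells; missing keys are fetched from the source
LAYOUT(CACHE(SIZE_IN_CELLS 1000000))

-- SSD-backed cache: index in RAM, data on local SSD
LAYOUT(SSD_CACHE(
    BLOCK_SIZE 4096
    FILE_SIZE 16777216
    PATH '/var/lib/clickhouse/ssd_cache/github_user_merges'
))
```

Either layout clause replaces the LAYOUT(COMPLEX_KEY_HASHED_ARRAY()) line in the CREATE DICTIONARY statement shown earlier.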
In this part we introduced ClickHouse caches used to speed up SELECT query processing. Some of them are enabled by default, others require an explicit setting to turn on. Caches help to reduce I/O pressure and therefore increase database system performance and capacity.
In the second part of the article, we will explain caches that ClickHouse uses to speed up data insertion, as well as some other, more exotic internal caches. Stay tuned!