ClickHouse Black Magic: Skipping Indices

Written with Lead Support Engineer Mikhail Filimonov.

And in such indices, although small pricks
To their subsequent volumes, there is seen
The baby figure of the giant mass
Of things to come at large.

William Shakespeare

ClickHouse indices are different from traditional relational database management systems (RDMS) in that:

  • Primary keys are not unique.
  • There are no foreign keys and traditional B-tree indices.

Instead, ClickHouse uses secondary ‘skipping’ indices. Those are often confusing and hard to tune even for experienced ClickHouse users. However, when used properly they can dramatically improve query performance. In this article we will dive into skipping indices and learn some of ClickHouse black magic.

Introduction 

Let’s start with an overview of the ClickHouse storage model.

The data in MergeTree is organized as parts

  • Every part is a folder on the storage device that contains files for every column and a few additional files with metadata. 
  • The column is stored as an array of values, where position is mapped to a row number. 
  • The data in columns is sorted using the ORDER BY key of the table. 
  • Every 8192 rows in a part are called a granule

The granule size is defined by the index_granularity parameter of the MergeTree table definition, but ClickHouse sometimes may adjust it through adaptive granularity. Every granule is referenced by the ‘main’ table index aka the PRIMARY KEY that is typically the same as ORDER BY. Since the granule is 8192 rows, the index is sparse. The index also references the special ‘mark’ files, per column, that contain pointers to the data in the column files themselves.

If we look for a real world analogy, the use of a PRIMARY KEY is similar to the search in an Oxford dictionary, for example. Since all words in the dictionary are sorted alphabetically, we quickly find a page (granule) with a word we are looking for using a ‘sparse index’ – a single word printed in the page header:

… and then scan the page for the word itself.

If we take another example, Encyclopedia Britannica, then we might also associate a volume with a table partition.

However, if we need to find something that is not based on a sort order then the alphabetical index can not be used. Say we need to find all articles about Antique culture. In this case we have to do a full scan, or use a special topic index, that sometimes can be found at the end of books. Such an index references multiple pages where the topic can be found. It is not as effective as an alphabetical one but still allows us to avoid the full scan.

The main question: is it possible to make the search better but to avoid making extra copies of the data? 

Getting back to the dictionary analogy — what if we put some extra information to the top or bottom of every page, that would quickly annotate the content? For example, if the page contains an article about Antique culture we could add a small icon of an Antique temple. When looking for Antique culture articles, we still have to list all the pages, but we can quickly skip those without an icon. This is much faster than a full scan.

ClickHouse skipping indices utilize a very similar idea — they encapsulate some sort of “condensed knowledge” about the content of the granule. This knowledge allows quickly finding pages/granules that need to be scanned and skip everything else.

Index Definition

A Skipping Index can be defined when creating a table, or it can be added to an existing table when needed. In both cases the main part of the definition is the same:

INDEX index_name expr TYPE type(...) GRANULARITY granularity_value
  • index_name — any string, unique per table
  • expr – expression over table columns. Usually, it is just a column name.
  • type – index type, that defines this ‘extra knowledge’ that is stored in the index. We will discuss it in the next chapter

The last parameter GRANULARITY is an interesting one. To make it even more efficient, the index can accumulate the information about multiple granules at once. That allows ClickHouse to skip granules faster. So GRANULARITY 1 means that index is stored for every single granule, GRANULARITY 10 means that index is stored for 10 granules and so on. The optimal value depends on data distribution.

ClickHouse has several different types of skipping indices that vary by the kind of information stored on the granule. Let’s explore them!

Index Types

Minmax is the simplest one. It stores minimum and maximum values of the column (or expression) for a particular granula. During query execution ClickHouse can quickly check if column values are out of range without scanning the column. It works the best when the value of a column changes slowly over the sort order.

For example, let’s take a following table:

CREATE TABLE temperature_sensors(
    sensor_id Int32,
    datetime DateTime,
    value Decimal(5,2)
) Engine = MergeTree 
   PARTITION BY toYYYYMM(datetime) 
   ORDER BY (sensor_id, datetime);

The temperature for the sensor changes slowly over time, so if we need to run queries ‘WHERE value > 100’, then a minmax index would help a lot.

ALTER TABLE temperature_sensors ADD INDEX temp value TYPE(minmax) GRANULARITY 1

Note: When an index is added to an existing table it is not automatically updated. ALTER TABLE changes metadata, so the index is only calculated for the newly inserted data. In order to update it retroactively the additional DDL command is needed:

ALTER TABLE temperature_sensors MATERIALIZE INDEX temp

Set(N) is another index type. It stores all distinct values of a column or an expression in a granule. When an indexed column is used in a WHERE cloud, ClickHouse can read a small set instead of a full column. It works well if a column contains a small number of distinct values in a granule but values change over table order. For example, let’s consider the following table:

CREATE TABLE stats(
    source_ip IPv4,
    datetime DateTime,
    country String,
    ...
    INDEX country_idx country Set(100) GRANULARITY 100
) Engine = MergeTree
   PARTITION BY toYYYYMM(datetime)
   ORDER BY (source_ip, datetime);

Ordered IP addresses are usually close to each other geographically, therefore it makes sense to have a ‘Set’ index for column ‘country’. One granule will contain one or a small number of countries, so queries with a country filter could efficiently utilize the index.

The parameter of a Set index limits the maximum number of distinct values stored in an index. ‘0’ means unlimited. The index size needs to be significantly smaller than a column itself, otherwise it does not give any benefit.

Bloom filter indices. Bloom filter index is a tricky one, and to make it even trickier ClickHouse supports three different types of bloom filter index:

  • tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): An input string is split into alphanumeric tokens, and then tokens are stored in a bloom filter (see below). It works for the cases when an exact match on a string is searched. For example, if we have a ‘url’ column and are looking for a specific part of the URL or a query parameter.
  • ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): In this case an input string is split into n-grams (first parameter – n-gram size, usually 3 or 4), and then stored in a bloom filter. That can work for full text search.
  • bloom_filter([false_positive]): This is the most generic case.  Input strings are stored in a bloom filter “as is”. That works for exact matches in strings and also in arrays.

What the Heck is a Bloom Filter?

What are Bloom filters? This is a probabilistic data structure to speed up searches. Probabilistic means it is not accurate, but it is accurate enough for skipping indices. How is it possible at all? Let’s figure it out.

Consider we need to search in a list of values with the length n. In order to build a bloom filter we do the following:

  1. First we create a bit-array, it is usually named a vector. Let’s denote the array length as m, m<n. Initially all bits in a vector are off.
  2. For every input value we calculate a hash function that returns a hash in [0.. m -1] range. Let’s turn on bits in the vector correspondingly.

After the list of values is processed, some bits in the vector will be “on”. That means an element with such a hash has been observed in a list. 

In order to search for an element in the list we need to calculate a hash and check if the corresponding bit is “on” or “off”. If it is “off” this is a 100% guarantee that the element is missing in a list. If the bit is “on”, we will have to fail over and scan the whole list. False positives are possible because of hash collisions. So this method works the best to answer if an element is missing in the list or it is probably contained in the list. In the best case, instead of scanning the initial dataset of n elements we only need to scan a much smaller vector of m/8 bytes.

If you think about it for a bit you may conclude that if we take a smaller m value it is faster to scan a vector, but there could be more collisions. The probability of collisions depends on the number of distinct values in the original set, and it is also proportional to 1/m. In order to reduce collisions multiple hash functions can be used.

In ClickHouse, the bloom filter vector is built for and stored for a granule (or several granules). So for every granule we can quickly detect if the searched value is missing, and the granule can be skipped completely. On the other hand, false positives are useless, and we need to reduce the chance of those as much as possible.

Now one can question how to choose optimal parameters for the vector size and number of hash functions in order to keep the bloom filter compact and reduce false positives at the same time? This is a very good question. The answer very much depends on the dataset and data distribution. We will dig into this and provide some practical bloom filter tuning advice in the next article. Before then let’s get back a little bit.

When NOT TO use Skipping Indices

Skipping indices are great, but as any other type of magic they come at a price. For example, calculating bloom filter vectors can be quite expensive on insert and during merge operations. So those need to be applied only if there is a benefit in a query time.

The index is efficient if it is significantly smaller than the column itself, otherwise it may slow down the query execution. Remember, that ClickHouse can just load the full column, apply a filter and decide what granules to read for the remaining columns. It is called the PREWHERE step in the query processing. If you want to confirm the skipping index size, check the system.data_skipping_indices table and compare it with an indexed column.

Also remember that the index should really ‘skip’. If for every granule it is positive then analyzing an index is just a waste of cpu cycles. 

Using skipped indices for columns that are already in the PRIMARY KEY is also useless. It is always faster to use the PRIMARY KEY when it can be applied for a query.

Next Steps

In this article we have introduced ClickHouse skipping indices. Indices in traditional databases are used in order to locate data that match filter conditions. In contrast, skipping indices contain additional information that enables ClickHouse to quickly decide that a certain part of the data does not match filter conditions and skip over it. In order to be efficient however, they need careful design and tuning that utilizes knowledge of the data and data distribution in the indexed columns. This is where the real magic begins. We will tell you the spells soon. Stay tuned!

Share

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.