ClickHouse Black Magic, Part 2: Bloom Filters

In our previous article about ClickHouse skipping indices, we talked about how such indices work and what types are available. The idea was to provide a general overview to help users understand the role of skipping indices in general and use specific index types appropriately.

This time let’s take a closer look at skipping indices based on bloom filters. Bloom filter indices are a particularly important class of index that enables users to run “needle-in-a-haystack” searches to seek specific values scattered over potentially large tables. Bloom filters are powerful but require careful parameter tuning for best results. This article explains the theory behind bloom filters, then demonstrates parameter selection using queries on a test dataset. We hope this approach will help users use one of the most powerful indexing tools available in ClickHouse today.

Picking parameters for bloom filters

As we remember from the previous article in this series, to add a certain element to the bloom filter, we need to calculate one or more hash functions from it and turn on corresponding bits in the bit vector of the bloom filter.

Obviously, if the length of the bit vector (m) is too short, then after adding a small number of elements all the bits in this vector will be turned on, and the bloom filter will not help to skip the data.

If, on the contrary, you make the bloom filter too long, then its reading/processing cost becomes very high and the benefit of using the bloom filter becomes insignificant or even may lead to slowing queries down.

Similarly, if the cardinality of the set of elements added to one bloom filter (n) is very high, then again all or most of the bits will be turned on, and the bloom filter will work poorly.

It is also possible to calculate several (k) different hash functions from one element, so every element added will turn on several bits in the bloom filter. This allows you to significantly reduce the probability of false-positive responses of the bloom filter (for example, if 2 hash functions are used, then the probability that they BOTH return the same value for DIFFERENT elements is much lower than in the case of a single hash function).
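To make these mechanics concrete, here is a minimal bloom filter in Python. This is purely an illustrative sketch: the salted-SHA-256 scheme for deriving k bit positions is my own choice for the example and is not what ClickHouse uses internally.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative bloom filter: m bits, k hash functions."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one base hash; this scheme
        # is illustrative only (ClickHouse uses its own hash functions).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Adding an element turns on k bits in the vector.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(8192 * 8, 2)
bf.add("hello")
print(bf.might_contain("hello"))        # True
print(bf.might_contain("other-value"))  # almost certainly False
```

The asymmetry in `might_contain` is the whole point: a bloom filter can only prove absence, so a skipping index built on it can safely skip granules where the answer is False.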

On the other hand, the calculation of hash functions is not free, and the more we use hash functions, the more expensive the use of the bloom filter will be. And with too big a value of k even a single element added to the bloom filter can enable a lot of bits. So too high a k may again lead to the situation when all bits in the Bloom filter are turned on, and it will not work.

Changing these 3 parameters (m,k,n) changes the main characteristic of the bloom filter: the probability of false positives (p). The closer this probability is to zero, the better the bloom filter works, but usually the more expensive it is to use.

Thus, we are faced with the task of selection of optimal parameters (k,m) of the bloom filter, knowing the cardinality of the set n to obtain a satisfactory probability of false positives p.

I will skip the math that demonstrates this point (if you are curious, the Wikipedia article on bloom filters covers it well) and just say that it comes down to a fairly simple formula connecting these 4 parameters for optimal performance.

You can calculate these parameters in a very convenient and visual calculator: https://hur.st/bloomfilter/
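The same standard textbook formulas that such calculators use can be sketched in a few lines of Python (this is the generic bloom-filter math, not ClickHouse code; the function name is mine):

```python
from math import ceil, log

def optimal_bloom_params(n, p):
    """Standard formulas: optimal bit-vector length m and hash count k
    for n elements and target false-positive probability p."""
    m_bits = ceil(-n * log(p) / log(2) ** 2)     # m = -n * ln(p) / (ln 2)^2
    k = max(1, round(m_bits / n * log(2)))       # k = (m / n) * ln 2
    return m_bits, k

# One granule of 8192 unique values, targeting a 1% false-positive rate:
m_bits, k = optimal_bloom_params(8192, 0.01)
print(m_bits // 8, "bytes,", k, "hash functions")
```

For n = 8192 and p = 0.01 this suggests roughly a 9.8 KB vector with 7 hash functions, which agrees well with the optimal-size tables computed later in the article.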

The formulas work great, but there is one disadvantage: in real life there are costs of calculating hashes, reading / writing the bloom filters, and some other costs. So practical results may differ from the theoretical expectations.

Let’s take a deeper look at what it looks like in practice and how it will work in ClickHouse.

Test dataset design

Because real datasets tend to have uneven/complex data distribution, interpretation of the results can be difficult. Therefore, for simplicity and in order to obtain understandable and predictable results, we will conduct a study on a synthetic dataset with ‘ideal’ and very simple characteristics.

Let’s create a table with a single column containing 32-character long random strings:

CREATE TABLE source
ENGINE = MergeTree
ORDER BY tuple()
AS SELECT
    substring(replaceRegexpAll(randomPrintableASCII(100), '[^a-zA-Z0-9]', ''), 1, 32) AS s
FROM numbers_mt(20500000)
WHERE length(s) = 32
LIMIT 20000000;

Let’s check its properties:

SELECT
    min(length(s)),
    max(length(s)),
    count(),
    uniqExact(s),
    any(s)
FROM source

min(length(s)) | max(length(s)) | count()  | uniqExact(s) | any(s)
32             | 32             | 20000000 | 20000000     | j4M5pMncRhnuOnaQeBS4nICs0BUU4GFu

Our table has exactly 20 million unique random strings, each exactly 32 characters long and consisting of only alphanumerical characters.

Because each value is unique, every granule of 8192 rows contains exactly 8192 unique elements.

We will be using the tokenbf_v1 index, because it allows us to tune all parameters of the bloom filter. It actually tokenizes the string, but since our strings contain only alphanumeric characters, every row / string will have exactly one token.

In our tests we will be creating a skipping index with GRANULARITY 1, i.e. covering single granule. Our table is using the default index_granularity (8192) so there will be 8192 rows in each granule, with each row being unique. Therefore, the number of elements added to the bloom filter (n) will be exactly 8192.

Using a formula relating the probability of false positives to the optimal bloom filter size and the number of hash functions, let’s display a table for several different p:

WITH
    8192 AS n,
    number + 1 AS k,
    [0.0001, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1] AS p_all,
    arrayMap(p -> ((-k) / ln(1 - exp(ln(p) / k))), p_all) AS bits_per_elem,
    arrayMap(r -> ceil(ceil(r * n) / 8), bits_per_elem) AS m_all
SELECT
    k,
    m_all[1] AS `m for p=0.0001`,
    m_all[2] AS `m for p=0.001`,
    m_all[3] AS `m for p=0.005`,
    m_all[4] AS `m for p=0.01`,
    m_all[5] AS `m for p=0.025`,
    m_all[6] AS `m for p=0.05`,
    m_all[7] AS `m for p=0.1`
FROM numbers(10)
k  | m for p=0.0001 | m for p=0.001 | m for p=0.005 | m for p=0.01 | m for p=0.025 | m for p=0.05 | m for p=0.1
1  | 10239488       | 1023488       | 204288        | 101888       | 40446         | 19964        | 9720
2  | 203775         | 63734         | 27927         | 19439        | 11900         | 8092         | 5388
3  | 64637          | 29158         | 16382         | 12661        | 8882          | 6686         | 4924
4  | 38877          | 20919         | 13251         | 10776        | 8081          | 6397         | 4957
5  | 29672          | 17700         | 12033         | 10086        | 7872          | 6425         | 5137
6  | 25322          | 16163         | 11514         | 9848         | 7896          | 6580         | 5374
7  | 22950          | 15368         | 11321         | 9824         | 8032          | 6794         | 5636
8  | 21551          | 14959         | 11300         | 9914         | 8227          | 7040         | 5912
9  | 20696          | 14772         | 11381         | 10073        | 8457          | 7304         | 6192
10 | 20171          | 14723         | 11526         | 10273        | 8708          | 7578         | 6475

Here the rows are the number of hash functions (k), and the columns are the probability of a false positive (p). Each number in the table is the optimal size of the vector (in bytes) to obtain the desired probability with that number of hash functions.

It is clearly seen that when using a single hash function, much longer vectors must be used. And the lower the probability of false positives that is needed, the longer the bloom filter must be.

It seems that a filter of about 8 KB in size with 2-5 hash functions should give good results.
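We can sanity-check that choice directly with the false-positive formula p = (1 - e^(-k*n/m))^k, where m is the vector length in bits. A small Python sketch (parameter names are mine; small rounding differences from the ClickHouse-computed values are expected):

```python
from math import exp

def false_positive_rate(m_bytes, k, n):
    """Expected false-positive probability of a bloom filter of
    m_bytes length with k hash functions and n inserted elements."""
    bits_per_elem = m_bytes * 8 / n
    return (1 - exp(-k / bits_per_elem)) ** k

# An 8 KB filter over one granule of 8192 unique values:
for k in range(1, 6):
    print(k, round(false_positive_rate(8192, k, 8192), 4))
```

With 8 bits per element, going from 1 to 3 hash functions cuts p from about 12% to about 3%, after which the returns diminish.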

Let’s now look at the opposite: how the probabilities (p) look for different sizes of bloom filters (m) and different numbers of hash functions (k):

SET output_format_decimal_trailing_zeros=1;

WITH
    8192 AS n,
    number + 1 AS k,
    [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536] AS m_all,
    arrayMap(m -> ((m * 8) / n), m_all) AS bits_per_elem,
    arrayMap(r -> toDecimal64(power(1 - exp((-k) / r), k), 4), bits_per_elem) AS p_all
SELECT
    k,
    p_all[1] AS `p for m=128`,
    p_all[2] AS `p for m=256`,
    p_all[3] AS `p for m=512`,
    p_all[4] AS `p for m=1024`,
    p_all[5] AS `p for m=2048`,
    p_all[6] AS `p for m=4096`,
    p_all[7] AS `p for m=8192`,
    p_all[8] AS `p for m=16384`,
    p_all[9] AS `p for m=32768`,
    p_all[10] AS `p for m=65536`
FROM numbers(10);
k  | p for m=128 | m=256  | m=512  | m=1024 | m=2048 | m=4096 | m=8192 | m=16384 | m=32768 | m=65536
1  | 0.9996      | 0.9816 | 0.8646 | 0.6321 | 0.3934 | 0.2211 | 0.1175 | 0.0605  | 0.0307  | 0.0155
2  | 0.9999      | 0.9993 | 0.9637 | 0.7476 | 0.3995 | 0.1548 | 0.0489 | 0.0138  | 0.0036  | 0.0009
3  | 0.9999      | 0.9999 | 0.9925 | 0.8579 | 0.4688 | 0.1468 | 0.0305 | 0.0049  | 0.0007  | 0.0000
4  | 0.9999      | 0.9999 | 0.9986 | 0.9287 | 0.5589 | 0.1596 | 0.0239 | 0.0023  | 0.0001  | 0.0000
5  | 1.0000      | 0.9999 | 0.9997 | 0.9667 | 0.6516 | 0.1849 | 0.0216 | 0.0013  | 0.0000  | 0.0000
6  | 1.0000      | 0.9999 | 0.9999 | 0.9852 | 0.7360 | 0.2198 | 0.0215 | 0.0009  | 0.0000  | 0.0000
7  | 1.0000      | 0.9999 | 0.9999 | 0.9936 | 0.8068 | 0.2628 | 0.0229 | 0.0007  | 0.0000  | 0.0000
8  | 1.0000      | 0.9999 | 0.9999 | 0.9973 | 0.8625 | 0.3124 | 0.0254 | 0.0005  | 0.0000  | 0.0000
9  | 1.0000      | 0.9999 | 0.9999 | 0.9988 | 0.9043 | 0.3669 | 0.0292 | 0.0005  | 0.0000  | 0.0000
10 | 1.0000      | 1.0000 | 0.9999 | 0.9995 | 0.9346 | 0.4246 | 0.0341 | 0.0004  | 0.0000  | 0.0000

As you can see, the expectation is that vectors shorter than 2048 bytes will not help at all (the false-positive rate is very close to 100%).

Now let’s check those expected results.

Trial 1. Impact of number of hashes

OK, here we check in practice how such a bloom filter behaves. Let’s create tables with the same bloom filter size and different numbers of hash functions.

Note: As mentioned above, I use the tokenbf_v1 index here, because it allows you to flexibly configure all the bloom filter parameters. Since the rows in our synthetic table contain only alphanumeric values, each row has exactly 1 token. Therefore, all calculations above apply to the token_bf filter in our synthetic test. Read more about filter types below.

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_1 ( `s` String, index bf s TYPE tokenbf_v1(8192, 1, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_2 ( `s` String, index bf s TYPE tokenbf_v1(8192, 2, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_4 ( `s` String, index bf s TYPE tokenbf_v1(8192, 4, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_5 ( `s` String, index bf s TYPE tokenbf_v1(8192, 5, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_6 ( `s` String, index bf s TYPE tokenbf_v1(8192, 6, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_7 ( `s` String, index bf s TYPE tokenbf_v1(8192, 7, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_8 ( `s` String, index bf s TYPE tokenbf_v1(8192, 8, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let’s check how these indices are displayed in the system.data_skipping_indices table:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%'
table     | data_compressed_bytes | data_uncompressed_bytes
bf_8192_1 | 17474135              | 20013056
bf_8192_2 | 20082538              | 20013056
bf_8192_3 | 20097632              | 20013056
bf_8192_4 | 20098229              | 20013056
bf_8192_5 | 20098727              | 20013056
bf_8192_6 | 20099313              | 20013056
bf_8192_7 | 20099477              | 20013056
bf_8192_8 | 20099494              | 20013056

You can see that the compression ratio for them is quite poor. That is expected, since bloom filter vectors have random properties if they have optimal size.

Let’s check what the insert speed looks like for these tables:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_8192_1 SELECT * FROM source;   -- Elapsed: 2.888 sec.
INSERT INTO bf_8192_2 SELECT * FROM source;   -- Elapsed: 3.051 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 3.234 sec.
INSERT INTO bf_8192_4 SELECT * FROM source;   -- Elapsed: 3.413 sec.
INSERT INTO bf_8192_5 SELECT * FROM source;   -- Elapsed: 3.567 sec.
INSERT INTO bf_8192_6 SELECT * FROM source;   -- Elapsed: 3.800 sec.
INSERT INTO bf_8192_7 SELECT * FROM source;   -- Elapsed: 4.035 sec.
INSERT INTO bf_8192_8 SELECT * FROM source;   -- Elapsed: 4.221 sec.

As you can see, insertion predictably becomes slower with the addition of a bloom filter (there is more work to do when inserting) and with an increase in the number of hash functions (more hash functions need to be calculated). It is noticeable that adding the index itself slows down storage of the column by almost 45%, and each additional hash function adds another 8% slowdown.

Note: in this case, only one column is indexed, so the difference is more clearly visible here. In non-synthetic tests with more columns, it will be less noticeable.

Now let’s look at the read speed:

Reading speed was tested with the following script:

echo "select count() from bf_tests.bf_no_index where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-benchmark -i 100 -c 1

The number of skipped granules was taken from the query execution log.

table       | granules skipped (of 2443) | best query time (sec) | rows read (thousands) | MB read | QPS
bf_no_index | 0                          | 0.110                 | 20000                 | 820.00  | 8.001
bf_8192_1   | 2158                       | 0.031                 | 2330                  | 95.72   | 23.873
bf_8192_2   | 2319                       | 0.016                 | 1020                  | 41.65   | 54.591
bf_8192_3   | 2370                       | 0.012                 | 598.02                | 24.52   | 73.524
bf_8192_4   | 2373                       | 0.010                 | 573.44                | 23.51   | 80.132
bf_8192_5   | 2380                       | 0.011                 | 516.10                | 21.16   | 80.615
bf_8192_6   | 2383                       | 0.011                 | 491.52                | 20.15   | 83.641
bf_8192_7   | 2377                       | 0.010                 | 540.67                | 22.17   | 86.980
bf_8192_8   | 2371                       | 0.011                 | 589.82                | 24.18   | 74.192

As you can see, the reading speed directly depends on the number of skipped granules, and the number of skipped granules behaves in accordance with the theoretical predictions obtained from the formulas.
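That correspondence is easy to check numerically. If the needle occurs in exactly one granule, and we assume each of the remaining filters fires independently with probability p (a simplification), the expected number of skipped granules is (total - 1) * (1 - p):

```python
from math import exp

def expected_skipped(total_granules, m_bytes, k, n):
    """Expected granules skipped for a needle present in exactly one
    granule, assuming independent filters (a simplification)."""
    p = (1 - exp(-k * n / (m_bytes * 8))) ** k
    return round((total_granules - 1) * (1 - p))

# tokenbf_v1(8192, k, 0), 2443 granules, 8192 unique tokens per granule:
for k in (1, 2, 3):
    print(k, expected_skipped(2443, 8192, k, 8192))
```

The predictions (roughly 2155, 2323 and 2367 granules for k = 1, 2, 3) line up closely with the measured 2158, 2319 and 2370.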

Now we understand how the number of hash functions affects the speed of selects and inserts. Next, let’s try to figure out the effect of changing the size of the bloom filter vector.

Trial 2. Impact of vector length

This time let’s create a set of tables that differ only in the length of the bloom filter vector:

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_2048_3 ( `s` String, index bf s TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_4096_3 ( `s` String, index bf s TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_16384_3 ( `s` String, index bf s TYPE tokenbf_v1(16384, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_32768_3 ( `s` String, index bf s TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_65536_3 ( `s` String, index bf s TYPE tokenbf_v1(65536, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_131072_3 ( `s` String, index bf s TYPE tokenbf_v1(131072, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_262144_3 ( `s` String, index bf s TYPE tokenbf_v1(262144, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let’s check how it impacts inserts:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_2048_3 SELECT * FROM source;   -- Elapsed: 2.864 sec.
INSERT INTO bf_4096_3 SELECT * FROM source;   -- Elapsed: 2.866 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 2.904 sec.
INSERT INTO bf_16384_3 SELECT * FROM source;  -- Elapsed: 2.969 sec.
INSERT INTO bf_32768_3 SELECT * FROM source;  -- Elapsed: 3.244 sec.
INSERT INTO bf_65536_3 SELECT * FROM source;  -- Elapsed: 3.438 sec.
INSERT INTO bf_131072_3 SELECT * FROM source; -- Elapsed: 3.798 sec.
INSERT INTO bf_262144_3 SELECT * FROM source; -- Elapsed: 4.299 sec.

Obviously a larger bloom filter means “more data needs to be written”, but the slowdown for each doubling of the bloom filter size is not very significant.

Let’s look at the actual size of these bloom filters inside data_skipping_indices:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%3'
ORDER BY data_uncompressed_bytes ASC
table       | data_compressed_bytes | data_uncompressed_bytes
bf_2048_3   | 5022270               | 5003264
bf_4096_3   | 10049670              | 10006528
bf_8192_3   | 20097632              | 20013056
bf_16384_3  | 39241113              | 40026112
bf_32768_3  | 63488764              | 80052224
bf_65536_3  | 101119994             | 160104448
bf_131072_3 | 133740272             | 320208896
bf_262144_3 | 183368525             | 640417792

You can see here that for very long (and far from optimal) bloom filter vectors the compression becomes better, because in that case there are a lot of 'empty' segments in the vector.

Just for reference, here is the size of the column itself:

SELECT
    database,
    table,
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    marks_bytes
FROM system.columns
WHERE table = 'bf_262144_3'
database | table       | name | data_compressed_bytes | data_uncompressed_bytes | marks_bytes
default  | bf_262144_3 | s    | 662653819             | 660000000               | 58776

As you can see, even for the largest bloom filter selected, the size is still smaller than the column size.

Now let’s look at the speed of requests. We can check it this way:

echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-client --send_logs_level=trace 2>&1 | grep dropped
echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'"  | clickhouse-benchmark -i 100 -d 0 -c 1

The results:

table       | dropped granules (of 2443) | QPS    | fastest query (sec)
bf_no_index | 0                          | 8.001  | 0.110
bf_2048_3   | 1293                       | 12.588 | 0.068
bf_4096_3   | 2074                       | 28.467 | 0.029
bf_8192_3   | 2370                       | 78.234 | 0.011
bf_16384_3  | 2434                       | 59.685 | 0.014
bf_32768_3  | 2442                       | 22.844 | 0.033
bf_65536_3  | 2442                       | 10.880 | 0.076
bf_131072_3 | 2442                       | 7.635  | 0.116
bf_262144_3 | 2442                       | 4.543  | 0.205

As expected, a longer bloom filter means fewer false positives, which is clearly seen from the increase in the number of “skipped” granules in accordance with the expectations from the formulas.

But at the same time, it is obvious that longer bloom filters are not ‘free’. And when the length of the vector grows, the overhead of reading and processing the bloom filter itself begins to significantly exceed the benefit that it gives. It is clearly visible as a degradation of QPS with vector length growth.

              poor quality of bf (lot of false positives), but bf is small
                    │                                     and cheap to use
                    │
                    │        ┌──────── optimal value
             ▲      │        │
  query speed│      │        │
             │      │        │         high quality of bf (less of false positives)
             │      │        │               │  but bf itself is huge and expensive
             │      │        ▼               │     to use
             │      ▼      xxxxxxxxxxx       │
             │          xxxx         xxxx    │
             │        xxx                xxx ▼
             │      xxx                    xxxx
             │     xx                         xxx
             │   xx                             xxx
             │  xx                                xx
             │  x
          ───┼───────────────────────────────────────────►
             │                               bf vector size

Summary

Bloom filter indices can be tricky to tune but our experiments have revealed a number of lessons that help set parameters correctly in production systems.

  • Knowing the number of elements in the set (n), you can accurately predict the behavior of the bloom filter and choose the optimal parameters using the formulas from the article above. The next article will provide practical guidelines for estimating n in real data.
  • When changing the index_granularity of a table or the GRANULARITY of an index, you need to change the bloom filter parameters, because those settings directly impact the n parameter.
  • The more hash functions we use, the more expensive the bloom filter calculation will be. Therefore, it is undesirable to use values greater than 5, since this significantly slows down inserts and also leads to “oversaturation” of the vector with set bits, and thus performance degradation.

That being said, a single hash function would require a fairly long vector to get a satisfactory false-positive rate. Therefore, it is preferable to use from 2 to 5 hash functions, while increasing the size of the vector proportionally.

  • Vectors that are too long are very expensive to use, despite their low probability of false positives. Therefore, it is preferable to use bloom filters that are orders of magnitude smaller than the column size.
  • Internally, the bloom filter formulas work with a parameter called bits_per_element. Because of that, one of the simplest ‘rough’ ways of picking good bloom filter parameters is: take 3 hash functions and a bloom filter whose size (in bytes) equals the cardinality of the set. This can be used as a simple rule of thumb, but of course, in real life, you may need to do more checks and estimations.
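A quick check of this rule of thumb: with the filter size in bytes equal to the set cardinality, every element gets 8 bits regardless of n, so with k = 3 the expected false-positive rate stays at about 3% (a sketch using the standard formula):

```python
from math import exp

def fp_rate(m_bytes, k, n):
    # Standard bloom filter false-positive estimate.
    return (1 - exp(-k * n / (m_bytes * 8))) ** k

# Rule of thumb: filter bytes == cardinality, k = 3.
for n in (1024, 8192, 65536):
    print(n, round(fp_rate(n, 3, n), 4))
```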

In the next part of the article, we will look at how bloom filters work and can be used as well as tuned with real data in a non-synthetic example. Stay tuned!
