# ClickHouse Black Magic, Part 2: Bloom Filters

In our previous article about ClickHouse skipping indices, we talked about how such indices work and what types are available. The idea was to provide a general overview to help users understand the role of skipping indices in general and use specific index types appropriately.

This time let’s take a closer look at skipping indices based on bloom filters. Bloom filter indices are a particularly important class of index that enables users to run “needle-in-a-haystack” searches to seek specific values scattered over potentially large tables. Bloom filters are powerful but require careful parameter tuning for best results. This article explains the theory behind bloom filters, then demonstrates parameter selection using queries on a test dataset. We hope this approach will help users use one of the most powerful indexing tools available in ClickHouse today.

## Picking parameters for bloom filters

As we remember from the previous article in this series, to add an element to a bloom filter we calculate one or more hash functions of it and set the corresponding bits in the filter's bit vector.

Obviously, if the length of the bit vector (`m`) is too short, then after adding a small number of elements all the bits in this vector will be turned on, and the bloom filter will not help to skip the data.

If, on the contrary, you make the bloom filter too long, then reading and processing it becomes expensive, and the benefit of using it becomes insignificant; it may even slow queries down.

Similarly, if the cardinality of the set of elements added to one bloom filter (`n`) is very high, then again all or most of the bits will be turned on, and the bloom filter will work poorly.

It is also possible to calculate several (`k`) different hash functions from one element, so every element added will turn on several bits in the bloom filter. This allows you to significantly reduce the probability of false-positive responses of the bloom filter (for example, if 2 hash functions are used, then the probability that they BOTH return the same value for DIFFERENT elements is much lower than in the case of a single hash function).

On the other hand, calculating hash functions is not free: the more hash functions we use, the more expensive the bloom filter becomes. And with too big a value of `k`, even a single element added to the bloom filter can set a lot of bits, so too high a `k` may again lead to the situation where all bits in the bloom filter are turned on and it stops working.

Changing these 3 parameters (`m`, `k`, `n`) changes the main characteristic of the bloom filter: the probability of false positives (`p`). The closer this probability is to zero, the better the bloom filter works, but usually the more expensive it is to use.

Thus, we are faced with the task of selecting optimal parameters (`k`, `m`) of the bloom filter, knowing the cardinality `n` of the set, to obtain a satisfactory probability of false positives `p`.

I will skip the math that demonstrates the point (if you are curious, you can check the excellent Wikipedia article on bloom filters) and just say that it comes down to a fairly simple formula connecting these 4 parameters for optimal performance.

You can calculate these parameters in a very convenient, visual calculator: https://hur.st/bloomfilter/
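For quick experiments without leaving the terminal, the standard closed-form relations (the same ones the calculator evaluates) can be sketched in Python. This is an illustrative snippet, not ClickHouse code; the function names are my own:

```python
import math

def optimal_bloom_params(n: int, p: float):
    """Optimal bloom filter size in bytes and number of hash functions
    for n elements and a target false-positive rate p:
    m_bits = -n * ln(p) / (ln 2)^2,  k = m_bits / n * ln 2."""
    m_bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m_bits / n * math.log(2)))
    return math.ceil(m_bits / 8), k

def false_positive_rate(m_bytes: int, k: int, n: int) -> float:
    """Expected false-positive rate: p = (1 - e^(-k*n/m))^k, m in bits."""
    return (1 - math.exp(-k * n / (m_bytes * 8))) ** k

# One default ClickHouse granule (8192 unique tokens), 1% target rate:
print(optimal_bloom_params(8192, 0.01))  # roughly a 9.6 KB vector with 7 hashes
```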

The formulas work great, but there is one disadvantage: in real life there are costs of calculating hashes, reading and writing the bloom filters, and other overhead, so practical results may differ from the theoretical expectations.

Let’s take a deeper look at what it looks like in practice and how it will work in ClickHouse.

## Test dataset design

Because real datasets tend to have uneven/complex data distribution, interpretation of the results can be difficult. Therefore, for simplicity and in order to obtain understandable and predictable results, we will conduct a study on a synthetic dataset with ‘ideal’ and very simple characteristics.

Let’s create a table with a single column containing 32-character long random strings:

```sql
CREATE TABLE source
ENGINE = MergeTree ORDER BY tuple() AS
SELECT substring(replaceRegexpAll(randomPrintableASCII(100), '[^a-zA-Z0-9]', ''), 1, 32) AS s
FROM numbers_mt(20500000)
WHERE length(s) = 32
LIMIT 20000000;
```

Let’s check its properties:

```sql
SELECT min(length(s)), max(length(s)), count(), uniqExact(s), any(s) FROM source
```

| min(length(s)) | max(length(s)) | count() | uniqExact(s) | any(s) |
|---|---|---|---|---|
| 32 | 32 | 20000000 | 20000000 | j4M5pMncRhnuOnaQeBS4nICs0BUU4GFu |

Our table has exactly 20 million unique random strings, each exactly 32 characters long and consisting of only alphanumerical characters.

Because each value is unique, each granule of 8192 rows will contain exactly 8192 unique elements.

We will be using the `tokenbf_v1` index, because it allows us to tune all parameters of bloom filters. It actually tokenizes the string, but since our strings contain only alphanumeric characters, every row/string will have exactly 1 token.

In our tests we will be creating a skipping index with `GRANULARITY 1`, i.e. covering a single granule. Our table is using the default `index_granularity` (8192), so there will be 8192 rows in each granule, with each row being unique. Therefore, the number of elements added to the bloom filter (`n`) will be exactly 8192.

Using a formula relating the probability of false positives to the optimal bloom filter size and the number of hash functions, let's display a table for several different `p`:

```sql
WITH
    8192 AS n,
    number + 1 AS k,
    [0.0001, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1] AS p_all,
    arrayMap(p -> ((-k) / ln(1 - exp(ln(p) / k))), p_all) AS bits_per_elem,
    arrayMap(r -> ceil(ceil(r * n) / 8), bits_per_elem) AS m_all
SELECT
    k,
    m_all[1] AS `m for p=0.0001`,
    m_all[2] AS `m for p=0.001`,
    m_all[3] AS `m for p=0.005`,
    m_all[4] AS `m for p=0.01`,
    m_all[5] AS `m for p=0.025`,
    m_all[6] AS `m for p=0.05`,
    m_all[7] AS `m for p=0.1`
FROM numbers(10)
```

| k | m for p=0.0001 | m for p=0.001 | m for p=0.005 | m for p=0.01 | m for p=0.025 | m for p=0.05 | m for p=0.1 |
|---|---|---|---|---|---|---|---|
| 1 | 10239488 | 1023488 | 204288 | 101888 | 40446 | 19964 | 9720 |
| 2 | 203775 | 63734 | 27927 | 19439 | 11900 | 8092 | 5388 |
| 3 | 64637 | 29158 | 16382 | 12661 | 8882 | 6686 | 4924 |
| 4 | 38877 | 20919 | 13251 | 10776 | 8081 | 6397 | 4957 |
| 5 | 29672 | 17700 | 12033 | 10086 | 7872 | 6425 | 5137 |
| 6 | 25322 | 16163 | 11514 | 9848 | 7896 | 6580 | 5374 |
| 7 | 22950 | 15368 | 11321 | 9824 | 8032 | 6794 | 5636 |
| 8 | 21551 | 14959 | 11300 | 9914 | 8227 | 7040 | 5912 |
| 9 | 20696 | 14772 | 11381 | 10073 | 8457 | 7304 | 6192 |
| 10 | 20171 | 14723 | 11526 | 10273 | 8708 | 7578 | 6475 |

Here the rows correspond to the number of hash functions and the columns to the probability of a false positive; each cell contains the optimal size of the vector (in bytes) needed to obtain the desired probability with that number of hash functions.

It is clear that with a single hash function, much longer vectors are required, and the lower the required probability of false positives, the longer the bloom filter must be.

It seems that a filter of about 8 KB with 2-5 hash functions should give good results.

Let's now look at the opposite: how the probabilities (`p`) look for different sizes of bloom filters (`m`) and different numbers of hash functions (`k`):

```sql
SET output_format_decimal_trailing_zeros = 1;

WITH
    8192 AS n,
    number + 1 AS k,
    [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536] AS m_all,
    arrayMap(m -> ((m * 8) / n), m_all) AS bits_per_elem,
    arrayMap(r -> toDecimal64(power(1 - exp((-k) / r), k), 3), bits_per_elem) AS p_all
SELECT
    k,
    p_all[1] AS `p for m=128`,
    p_all[2] AS `p for m=256`,
    p_all[3] AS `p for m=512`,
    p_all[4] AS `p for m=1024`,
    p_all[5] AS `p for m=2048`,
    p_all[6] AS `p for m=4096`,
    p_all[7] AS `p for m=8192`,
    p_all[8] AS `p for m=16384`,
    p_all[9] AS `p for m=32768`,
    p_all[10] AS `p for m=65536`
FROM numbers(10);
```

| k | p for m=128 | p for m=256 | p for m=512 | p for m=1024 | p for m=2048 | p for m=4096 | p for m=8192 | p for m=16384 | p for m=32768 | p for m=65536 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9996 | 0.9816 | 0.8646 | 0.6321 | 0.3934 | 0.2211 | 0.1175 | 0.0605 | 0.0307 | 0.0155 |
| 2 | 0.9999 | 0.9993 | 0.9637 | 0.7476 | 0.3995 | 0.1548 | 0.0489 | 0.0138 | 0.0036 | 0.0009 |
| 3 | 0.9999 | 0.9999 | 0.9925 | 0.8579 | 0.4688 | 0.1468 | 0.0305 | 0.0049 | 0.0007 | 0.0000 |
| 4 | 0.9999 | 0.9999 | 0.9986 | 0.9287 | 0.5589 | 0.1596 | 0.0239 | 0.0023 | 0.0001 | 0.0000 |
| 5 | 1.0000 | 0.9999 | 0.9997 | 0.9667 | 0.6516 | 0.1849 | 0.0216 | 0.0013 | 0.0000 | 0.0000 |
| 6 | 1.0000 | 0.9999 | 0.9999 | 0.9852 | 0.7360 | 0.2198 | 0.0215 | 0.0009 | 0.0000 | 0.0000 |
| 7 | 1.0000 | 0.9999 | 0.9999 | 0.9936 | 0.8068 | 0.2628 | 0.0229 | 0.0007 | 0.0000 | 0.0000 |
| 8 | 1.0000 | 0.9999 | 0.9999 | 0.9973 | 0.8625 | 0.3124 | 0.0254 | 0.0005 | 0.0000 | 0.0000 |
| 9 | 1.0000 | 0.9999 | 0.9999 | 0.9988 | 0.9043 | 0.3669 | 0.0292 | 0.0005 | 0.0000 | 0.0000 |
| 10 | 1.0000 | 1.0000 | 0.9999 | 0.9995 | 0.9346 | 0.4246 | 0.0341 | 0.0004 | 0.0000 | 0.0000 |

As you can see, the expectation is that vectors shorter than 2048 bytes will not help at all (the false-positive rate is very close to 100%).
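The SQL above simply evaluates the classic false-positive formula. As a sanity check, here is the same computation in plain Python (illustrative only; the function name is my own), reproducing a couple of cells from the table:

```python
import math

def bloom_fp_rate(m_bytes: int, k: int, n: int = 8192) -> float:
    """False-positive probability p = (1 - e^(-k*n/m))^k,
    where m is the bloom filter vector length in bits."""
    m = m_bytes * 8
    return (1 - math.exp(-k * n / m)) ** k

print(round(bloom_fp_rate(8192, 1), 4))  # 0.1175, as in the k=1, m=8192 cell
print(round(bloom_fp_rate(2048, 3), 4))  # about 0.4689 (shown truncated as 0.4688)
```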

Now let’s check those expected results.

### Trial 1. Impact of number of hashes

OK, here we check in practice how such a bloom filter will behave. Let’s create tables with the same bloom filter size and different number of hash functions.

Note: as mentioned above, I use the `tokenbf_v1` index here, because it allows you to flexibly configure all the bloom filter parameters. Since the rows in our synthetic table contain only alphanumeric values, each row has exactly 1 token, so all the calculations above apply to the `tokenbf_v1` filter in our synthetic test. Read more about filter types below.

```sql
CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_1 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 1, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_2 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 2, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_4 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 4, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_5 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 5, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_6 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 6, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_7 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 7, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_8 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 8, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
```

Let's check how these indices are displayed in the `system.data_skipping_indices` table:

```sql
SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%'
```

| table | data_compressed_bytes | data_uncompressed_bytes |
|---|---|---|
| bf_8192_1 | 17474135 | 20013056 |
| bf_8192_2 | 20082538 | 20013056 |
| bf_8192_3 | 20097632 | 20013056 |
| bf_8192_4 | 20098229 | 20013056 |
| bf_8192_5 | 20098727 | 20013056 |
| bf_8192_6 | 20099313 | 20013056 |
| bf_8192_7 | 20099477 | 20013056 |
| bf_8192_8 | 20099494 | 20013056 |

You can see that the compression ratio for them is quite poor. That is expected: an optimally sized bloom filter vector looks like random data, which compresses badly.

Let's check what the insert speed looks like for these tables:

```sql
INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_8192_1 SELECT * FROM source;  -- Elapsed: 2.888 sec.
INSERT INTO bf_8192_2 SELECT * FROM source;  -- Elapsed: 3.051 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;  -- Elapsed: 3.234 sec.
INSERT INTO bf_8192_4 SELECT * FROM source;  -- Elapsed: 3.413 sec.
INSERT INTO bf_8192_5 SELECT * FROM source;  -- Elapsed: 3.567 sec.
INSERT INTO bf_8192_6 SELECT * FROM source;  -- Elapsed: 3.800 sec.
INSERT INTO bf_8192_7 SELECT * FROM source;  -- Elapsed: 4.035 sec.
INSERT INTO bf_8192_8 SELECT * FROM source;  -- Elapsed: 4.221 sec.
```

As you can see, insertion predictably becomes slower with the addition of a bloom filter (more work is needed on insert) and with an increase in the number of hash functions (more hashes need to be calculated). It is noticeable that adding the index itself slows down storing the column by almost 45%, and each additional hash function adds roughly another 8% slowdown.

Note: in this case, only one column is indexed, so the difference is more clearly visible here. In non-synthetic tests with more columns, it will be less noticeable.

Now let’s look at the read speed:

Reading speed was tested with the following script:

```bash
echo "select count() from bf_tests.bf_no_index where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-benchmark -i 100 -c 1
```

The number of skipped granules was taken from the query execution log.

| table | granules skipped (of 2443) | best query time (sec) | rows read (thousands) | MB read | QPS |
|---|---|---|---|---|---|
| bf_no_index | 0 | 0.110 | 20000 | 820.00 | 8.001 |
| bf_8192_1 | 2158 | 0.031 | 2330 | 95.72 | 23.873 |
| bf_8192_2 | 2319 | 0.016 | 1020 | 41.65 | 54.591 |
| bf_8192_3 | 2370 | 0.012 | 598.02 | 24.52 | 73.524 |
| bf_8192_4 | 2373 | 0.010 | 573.44 | 23.51 | 80.132 |
| bf_8192_5 | 2380 | 0.011 | 516.10 | 21.16 | 80.615 |
| bf_8192_6 | 2383 | 0.011 | 491.52 | 20.15 | 83.641 |
| bf_8192_7 | 2377 | 0.010 | 540.67 | 22.17 | 86.980 |
| bf_8192_8 | 2371 | 0.011 | 589.82 | 24.18 | 74.192 |

As you can see, the reading speed directly depends on the number of "skipped" granules, and the number of skipped granules behaves in accordance with the theoretical predictions obtained from the formulas.
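The measured skip counts are close to what the false-positive formula predicts: of the 2443 granules, one really contains the value, and on average a fraction `p` of the rest pass as false positives. A rough check in Python (illustrative sketch using the formula from the theory section; sizes in bytes):

```python
import math

def expected_skipped(total_granules: int, m_bytes: int, k: int, n: int = 8192) -> int:
    """Expected number of skipped granules for a needle-in-haystack query:
    everything except the one true match and the false positives among the rest."""
    p = (1 - math.exp(-k * n / (m_bytes * 8))) ** k
    return round((total_granules - 1) * (1 - p))

print(expected_skipped(2443, 8192, 1))  # about 2155 (observed: 2158)
print(expected_skipped(2443, 8192, 3))  # about 2367 (observed: 2370)
```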

Now we understand how the number of hash functions affects the speed of selects and inserts. Next, let's try to figure out the effect of changing the size of the bloom filter vector.

### Trial 2. Impact of vector length

This time let’s create a set of tables that differ only in the length of the bloom filter vector:

```sql
CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_2048_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_4096_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_16384_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(16384, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_32768_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_65536_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(65536, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_131072_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(131072, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_262144_3 ( `s` String, INDEX bf s TYPE tokenbf_v1(262144, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
```

Let's check how this impacts inserts:

```sql
INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_2048_3 SELECT * FROM source;   -- Elapsed: 2.864 sec.
INSERT INTO bf_4096_3 SELECT * FROM source;   -- Elapsed: 2.866 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 2.904 sec.
INSERT INTO bf_16384_3 SELECT * FROM source;  -- Elapsed: 2.969 sec.
INSERT INTO bf_32768_3 SELECT * FROM source;  -- Elapsed: 3.244 sec.
INSERT INTO bf_65536_3 SELECT * FROM source;  -- Elapsed: 3.438 sec.
INSERT INTO bf_131072_3 SELECT * FROM source; -- Elapsed: 3.798 sec.
INSERT INTO bf_262144_3 SELECT * FROM source; -- Elapsed: 4.299 sec.
```

Obviously a larger bloom filter means “more data needs to be written”, but the slowdown for each doubling of the bloom filter size is not very significant.

Let's look at the actual size of these bloom filters in `system.data_skipping_indices`:

```sql
SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%3'
ORDER BY data_uncompressed_bytes ASC
```

| table | data_compressed_bytes | data_uncompressed_bytes |
|---|---|---|
| bf_2048_3 | 5022270 | 5003264 |
| bf_4096_3 | 10049670 | 10006528 |
| bf_8192_3 | 20097632 | 20013056 |
| bf_16384_3 | 39241113 | 40026112 |
| bf_32768_3 | 63488764 | 80052224 |
| bf_65536_3 | 101119994 | 160104448 |
| bf_131072_3 | 133740272 | 320208896 |
| bf_262144_3 | 183368525 | 640417792 |

You can see here that for very long (and far from optimal) bloom filter vectors, compression becomes better, because in that case the vector contains a lot of 'empty' segments.

Just for reference, here is the size of the column itself:

```sql
SELECT database, table, name, data_compressed_bytes, data_uncompressed_bytes, marks_bytes
FROM system.columns
WHERE table = 'bf_262144_3'
```

| database | table | name | data_compressed_bytes | data_uncompressed_bytes | marks_bytes |
|---|---|---|---|---|---|
| default | bf_262144_3 | s | 662653819 | 660000000 | 58776 |

As you can see, even for the largest bloom filter selected, the size is still smaller than the column size.

Now let’s look at the speed of requests. We can check it this way:

```bash
echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-client --send_logs_level=trace 2>&1 | grep dropped
echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-benchmark -i 100 -d 0 -c 1
```

The results:

| table | dropped granules (of 2443) | QPS | fastest query (sec.) |
|---|---|---|---|
| bf_no_index | 0 | 8.001 | 0.110 |
| bf_2048_3 | 1293 | 12.588 | 0.068 |
| bf_4096_3 | 2074 | 28.467 | 0.029 |
| bf_8192_3 | 2370 | 78.234 | 0.011 |
| bf_16384_3 | 2434 | 59.685 | 0.014 |
| bf_32768_3 | 2442 | 22.844 | 0.033 |
| bf_65536_3 | 2442 | 10.880 | 0.076 |
| bf_131072_3 | 2442 | 7.635 | 0.116 |
| bf_262144_3 | 2442 | 4.543 | 0.205 |

As expected, a longer bloom filter means fewer false positives, which is clearly seen from the increase in the number of “skipped” granules in accordance with the expectations from the formulas.
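Here too, the drop counts track the formula. An illustrative Python estimate (my own helper, not ClickHouse code: one true granule plus `p` false positives among the remaining 2442; sizes in bytes, k = 3):

```python
import math

def expected_skipped(total_granules: int, m_bytes: int, k: int, n: int = 8192) -> int:
    """Granules we expect to skip: all except the one true match
    and the false positives among the remaining granules."""
    p = (1 - math.exp(-k * n / (m_bytes * 8))) ** k
    return round((total_granules - 1) * (1 - p))

print(expected_skipped(2443, 2048, 3))   # about 1297 (observed: 1293)
print(expected_skipped(2443, 32768, 3))  # about 2440 (observed: 2442)
```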

But at the same time, it is obvious that longer bloom filters are not ‘free’. And when the length of the vector grows, the overhead of reading and processing the bloom filter itself begins to significantly exceed the benefit that it gives. It is clearly visible as a degradation of QPS with vector length growth.

```
poor quality of bf (lot of false positives), but bf is small
│ and cheap to use
│
│ ┌──────── optimal value
▲ │ │
query speed│ │ │
│ │ │ high quality of bf (less of false positives)
│ │ │ │ but bf itself is huge and expensive
│ │ ▼ │ to use
│ ▼ xxxxxxxxxxx │
│ xxxx xxxx │
│ xxx xxx ▼
│ xxx xxxx
│ xx xxx
│ xx xxx
│ xx xx
│ x
───┼───────────────────────────────────────────►
│ bf vector size
```

## Summary

Bloom filter indices can be tricky to tune but our experiments have revealed a number of lessons that help set parameters correctly in production systems.

- Knowing the number of elements in the set (`n`), you can accurately predict the behavior of the bloom filter and choose optimal parameters using the formulas from the article above. The next article will provide practical guidelines for estimating `n` in real data.
- When changing the `index_granularity` of a table or the GRANULARITY of an index, you need to adjust the bloom filter parameters, because both directly impact the `n` parameter.
- The more hash functions we use, the more expensive the bloom filter calculation will be. Therefore, it is undesirable to use values of `k` greater than 5, since this significantly slows down inserts and also leads to "oversaturation" of the vector with set bits, and thus to performance degradation. That being said, a single hash function would require a fairly long vector to get a satisfactory false-positive rate. Therefore, it is preferable to use from 2 to 5 hash functions, while increasing the size of the vector proportionally.
- Vectors that are too long are very expensive to use, despite their low probability of false positives. Therefore, it is preferable to use bloom filters that are orders of magnitude smaller than the column size.
- Internally, the formulas for calculating bloom filter parameters use a value called `bits_per_element`. Because of that, one of the simplest rough ways of picking good bloom filter parameters is: take 3 hash functions, and a bloom filter whose size in bytes equals the cardinality of the set. This can be used as a simple rule of thumb, but of course, in real life you may need to do more checks and estimations.
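The rule of thumb works because "size in bytes equal to cardinality" fixes `bits_per_element` at 8 regardless of `n`, which pins the expected false-positive rate at about 3%. An illustrative check in Python:

```python
import math

# Rule of thumb: k = 3 hash functions, filter size in bytes equal to n.
# That fixes bits_per_element at 8, so the expected false-positive rate
# is the same for any set cardinality n:
k, bits_per_element = 3, 8
p = (1 - math.exp(-k / bits_per_element)) ** k
print(round(p, 3))  # 0.031, i.e. about 3% false positives
```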

In the next part of the article, we will look at how bloom filters work and can be used as well as tuned with real data in a non-synthetic example. Stay tuned!