ClickHouse Black Magic, Part 2: Bloom Filters

In our previous article about ClickHouse skipping indices, we talked about how such indices work and what types are available. The idea was to provide a general overview to help users understand the role of skipping indices in general and use specific index types appropriately.

This time let’s take a closer look at skipping indices based on bloom filters. Bloom filter indices are a particularly important class of index that enables users to run “needle-in-a-haystack” searches to seek specific values scattered over potentially large tables. Bloom filters are powerful but require careful parameter tuning for best results. This article explains the theory behind bloom filters, then demonstrates parameter selection using queries on a test dataset. We hope this approach will help users use one of the most powerful indexing tools available in ClickHouse today.

Picking parameters for bloom filters

As we remember from the previous article in this series, to add a certain element to the bloom filter, we need to calculate one or more hash functions from it and turn on corresponding bits in the bit vector of the bloom filter.

Obviously, if the length of the bit vector (m) is too short, then after adding a small number of elements all the bits in this vector will be turned on, and the bloom filter will not help to skip the data.

If, on the contrary, you make the bloom filter too long, then its reading/processing cost becomes very high and the benefit of using the bloom filter becomes insignificant or even may lead to slowing queries down.

Similarly, if the cardinality of the set of elements added to one bloom filter (n) is very high, then again all or most of the bits will be turned on, and the bloom filter will work poorly.

It is also possible to calculate several (k) different hash functions from one element, so every element added will turn on several bits in the bloom filter. This allows you to significantly reduce the probability of false-positive responses of the bloom filter (for example, if 2 hash functions are used, then the probability that they BOTH return the same value for DIFFERENT elements is much lower than in the case of a single hash function).
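To make these mechanics concrete, here is a minimal bloom filter in Python. This is purely an illustrative sketch: the salted-SHA-256 scheme for deriving k bit positions is my own choice for the example and is not what ClickHouse uses internally.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative bloom filter: m bits, k hash functions."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one base hash; this scheme
        # is illustrative only (ClickHouse uses its own hash functions).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Adding an element turns on k bits in the vector.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(8192 * 8, 2)
bf.add("hello")
print(bf.might_contain("hello"))        # True
print(bf.might_contain("other-value"))  # almost certainly False
```

The asymmetry in `might_contain` is the whole point: a bloom filter can only prove absence, so a skipping index built on it can safely skip granules where the answer is False.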

On the other hand, the calculation of hash functions is not free, and the more we use hash functions, the more expensive the use of the bloom filter will be. And with too big a value of k even a single element added to the bloom filter can enable a lot of bits. So too high a k may again lead to the situation when all bits in the Bloom filter are turned on, and it will not work.

Changing these 3 parameters (m,k,n) changes the main characteristic of the bloom filter: the probability of false positives (p). The closer this probability is to zero, the better the bloom filter works, but usually the more expensive it is to use.

Thus, we are faced with the task of selection of optimal parameters (k,m) of the bloom filter, knowing the cardinality of the set n to obtain a satisfactory probability of false positives p.

I will skip the math that demonstrates this point (if you are curious, the Wikipedia article on bloom filters covers it well) and just say that it comes down to a fairly simple formula connecting these 4 parameters for optimal performance.

You can calculate these parameters in a very convenient and visual calculator: https://hur.st/bloomfilter/
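The same standard textbook formulas that such calculators use can be sketched in a few lines of Python (this is the generic bloom-filter math, not ClickHouse code; the function name is mine):

```python
from math import ceil, log

def optimal_bloom_params(n, p):
    """Standard formulas: optimal bit-vector length m and hash count k
    for n elements and target false-positive probability p."""
    m_bits = ceil(-n * log(p) / log(2) ** 2)     # m = -n * ln(p) / (ln 2)^2
    k = max(1, round(m_bits / n * log(2)))       # k = (m / n) * ln 2
    return m_bits, k

# One granule of 8192 unique values, targeting a 1% false-positive rate:
m_bits, k = optimal_bloom_params(8192, 0.01)
print(m_bits // 8, "bytes,", k, "hash functions")
```

For n = 8192 and p = 0.01 this suggests roughly a 9.8 KB vector with 7 hash functions, which agrees well with the optimal-size tables computed later in the article.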

The formulas work great, but there is one disadvantage: in real life there are costs of calculating hashes, reading / writing the bloom filters, and some other costs. So practical results may differ from the theoretical expectations.

Let’s take a deeper look at what it looks like in practice and how it will work in ClickHouse.

Test dataset design

Because real datasets tend to have uneven/complex data distribution, interpretation of the results can be difficult. Therefore, for simplicity and in order to obtain understandable and predictable results, we will conduct a study on a synthetic dataset with ‘ideal’ and very simple characteristics.

Let’s create a table with a single column containing 32-character long random strings:

CREATE TABLE source
ENGINE = MergeTree
ORDER BY tuple()
AS SELECT
    substring(replaceRegexpAll(randomPrintableASCII(100), '[^a-zA-Z0-9]', ''), 1, 32) AS s
FROM numbers_mt(20500000)
WHERE length(s) = 32
LIMIT 20000000;

Let’s check its properties:

SELECT
    min(length(s)),
    max(length(s)),
    count(),
    uniqExact(s),
    any(s)
FROM source

min(length(s)) | max(length(s)) | count()  | uniqExact(s) | any(s)
32             | 32             | 20000000 | 20000000     | j4M5pMncRhnuOnaQeBS4nICs0BUU4GFu

Our table has exactly 20 million unique random strings, each exactly 32 characters long and consisting of only alphanumerical characters.

Because each value is unique, every granule of 8192 rows contains exactly 8192 unique elements.

We will be using the tokenbf_v1 index, because it allows us to tune all parameters of the bloom filter. It actually tokenizes the string, but since our strings contain only alphanumeric characters, every row / string will have exactly one token.

In our tests we will be creating a skipping index with GRANULARITY 1, i.e. covering single granule. Our table is using the default index_granularity (8192) so there will be 8192 rows in each granule, with each row being unique. Therefore, the number of elements added to the bloom filter (n) will be exactly 8192.

Using a formula relating the probability of false positives to the optimal bloom filter size and the number of hash functions, let’s display a table for several different p:

WITH
    8192 AS n,
    number + 1 AS k,
    [0.0001, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1] AS p_all,
    arrayMap(p -> ((-k) / ln(1 - exp(ln(p) / k))), p_all) AS bits_per_elem,
    arrayMap(r -> ceil(ceil(r * n) / 8), bits_per_elem) AS m_all
SELECT
    k,
    m_all[1] AS `m for p=0.0001`,
    m_all[2] AS `m for p=0.001`,
    m_all[3] AS `m for p=0.005`,
    m_all[4] AS `m for p=0.01`,
    m_all[5] AS `m for p=0.025`,
    m_all[6] AS `m for p=0.05`,
    m_all[7] AS `m for p=0.1`
FROM numbers(10)
k  | m for p=0.0001 | m for p=0.001 | m for p=0.005 | m for p=0.01 | m for p=0.025 | m for p=0.05 | m for p=0.1
1  | 10239488       | 1023488       | 204288        | 101888       | 40446         | 19964        | 9720
2  | 203775         | 63734         | 27927         | 19439        | 11900         | 8092         | 5388
3  | 64637          | 29158         | 16382         | 12661        | 8882          | 6686         | 4924
4  | 38877          | 20919         | 13251         | 10776        | 8081          | 6397         | 4957
5  | 29672          | 17700         | 12033         | 10086        | 7872          | 6425         | 5137
6  | 25322          | 16163         | 11514         | 9848         | 7896          | 6580         | 5374
7  | 22950          | 15368         | 11321         | 9824         | 8032          | 6794         | 5636
8  | 21551          | 14959         | 11300         | 9914         | 8227          | 7040         | 5912
9  | 20696          | 14772         | 11381         | 10073        | 8457          | 7304         | 6192
10 | 20171          | 14723         | 11526         | 10273        | 8708          | 7578         | 6475

Here the rows are the number of hash functions (k), and the columns are the probability of a false positive (p). Each number in the table is the optimal size of the vector (in bytes) to obtain the desired probability with that number of hash functions.

It is clearly seen that when using a single hash function, much longer vectors must be used. And the lower the probability of false positives that is needed, the longer the bloom filter must be.

It seems that a filter of about 8 KB in size with 2-5 hash functions should give good results.
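We can sanity-check that choice directly with the false-positive formula p = (1 - e^(-k*n/m))^k, where m is the vector length in bits. A small Python sketch (parameter names are mine; small rounding differences from the ClickHouse-computed values are expected):

```python
from math import exp

def false_positive_rate(m_bytes, k, n):
    """Expected false-positive probability of a bloom filter of
    m_bytes length with k hash functions and n inserted elements."""
    bits_per_elem = m_bytes * 8 / n
    return (1 - exp(-k / bits_per_elem)) ** k

# An 8 KB filter over one granule of 8192 unique values:
for k in range(1, 6):
    print(k, round(false_positive_rate(8192, k, 8192), 4))
```

With 8 bits per element, going from 1 to 3 hash functions cuts p from about 12% to about 3%, after which the returns diminish.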

Let’s now look at the opposite: how the probabilities (p) look for different sizes of bloom filters (m) and different numbers of hash functions (k):

SET output_format_decimal_trailing_zeros=1;

WITH
    8192 AS n,
    number + 1 AS k,
    [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536] AS m_all,
    arrayMap(m -> ((m * 8) / n), m_all) AS bits_per_elem,
    arrayMap(r -> toDecimal64(power(1 - exp((-k) / r), k), 4), bits_per_elem) AS p_all
SELECT
    k,
    p_all[1] AS `p for m=128`,
    p_all[2] AS `p for m=256`,
    p_all[3] AS `p for m=512`,
    p_all[4] AS `p for m=1024`,
    p_all[5] AS `p for m=2048`,
    p_all[6] AS `p for m=4096`,
    p_all[7] AS `p for m=8192`,
    p_all[8] AS `p for m=16384`,
    p_all[9] AS `p for m=32768`,
    p_all[10] AS `p for m=65536`
FROM numbers(10);
k  | p for m=128 | m=256  | m=512  | m=1024 | m=2048 | m=4096 | m=8192 | m=16384 | m=32768 | m=65536
1  | 0.9996      | 0.9816 | 0.8646 | 0.6321 | 0.3934 | 0.2211 | 0.1175 | 0.0605  | 0.0307  | 0.0155
2  | 0.9999      | 0.9993 | 0.9637 | 0.7476 | 0.3995 | 0.1548 | 0.0489 | 0.0138  | 0.0036  | 0.0009
3  | 0.9999      | 0.9999 | 0.9925 | 0.8579 | 0.4688 | 0.1468 | 0.0305 | 0.0049  | 0.0007  | 0.0000
4  | 0.9999      | 0.9999 | 0.9986 | 0.9287 | 0.5589 | 0.1596 | 0.0239 | 0.0023  | 0.0001  | 0.0000
5  | 1.0000      | 0.9999 | 0.9997 | 0.9667 | 0.6516 | 0.1849 | 0.0216 | 0.0013  | 0.0000  | 0.0000
6  | 1.0000      | 0.9999 | 0.9999 | 0.9852 | 0.7360 | 0.2198 | 0.0215 | 0.0009  | 0.0000  | 0.0000
7  | 1.0000      | 0.9999 | 0.9999 | 0.9936 | 0.8068 | 0.2628 | 0.0229 | 0.0007  | 0.0000  | 0.0000
8  | 1.0000      | 0.9999 | 0.9999 | 0.9973 | 0.8625 | 0.3124 | 0.0254 | 0.0005  | 0.0000  | 0.0000
9  | 1.0000      | 0.9999 | 0.9999 | 0.9988 | 0.9043 | 0.3669 | 0.0292 | 0.0005  | 0.0000  | 0.0000
10 | 1.0000      | 1.0000 | 0.9999 | 0.9995 | 0.9346 | 0.4246 | 0.0341 | 0.0004  | 0.0000  | 0.0000

As you can see, the expectation is that vectors shorter than 2048 bytes will not help at all (the false-positive rate is very close to 100%).

Now let’s check those expected results.

Trial 1. Impact of number of hashes

OK, here we check in practice how such a bloom filter behaves. Let’s create tables with the same bloom filter size and different numbers of hash functions.

Note: As mentioned above, I use the tokenbf_v1 index here, because it allows you to flexibly configure all the bloom filter parameters. Since the rows in our synthetic table contain only alphanumeric values, each row has exactly 1 token. Therefore, all calculations above apply to the token_bf filter in our synthetic test. Read more about filter types below.

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_1 ( `s` String, index bf s TYPE tokenbf_v1(8192, 1, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_2 ( `s` String, index bf s TYPE tokenbf_v1(8192, 2, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_4 ( `s` String, index bf s TYPE tokenbf_v1(8192, 4, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_5 ( `s` String, index bf s TYPE tokenbf_v1(8192, 5, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_6 ( `s` String, index bf s TYPE tokenbf_v1(8192, 6, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_7 ( `s` String, index bf s TYPE tokenbf_v1(8192, 7, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_8 ( `s` String, index bf s TYPE tokenbf_v1(8192, 8, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let’s check how these indices are displayed in the system.data_skipping_indices table:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%'
table     | data_compressed_bytes | data_uncompressed_bytes
bf_8192_1 | 17474135              | 20013056
bf_8192_2 | 20082538              | 20013056
bf_8192_3 | 20097632              | 20013056
bf_8192_4 | 20098229              | 20013056
bf_8192_5 | 20098727              | 20013056
bf_8192_6 | 20099313              | 20013056
bf_8192_7 | 20099477              | 20013056
bf_8192_8 | 20099494              | 20013056

You can see that the compression ratio for them is quite poor. That is expected, since bloom filter vectors have random properties if they have optimal size.

Let’s check what the insert speed looks like for these tables:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_8192_1 SELECT * FROM source;   -- Elapsed: 2.888 sec.
INSERT INTO bf_8192_2 SELECT * FROM source;   -- Elapsed: 3.051 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 3.234 sec.
INSERT INTO bf_8192_4 SELECT * FROM source;   -- Elapsed: 3.413 sec.
INSERT INTO bf_8192_5 SELECT * FROM source;   -- Elapsed: 3.567 sec.
INSERT INTO bf_8192_6 SELECT * FROM source;   -- Elapsed: 3.800 sec.
INSERT INTO bf_8192_7 SELECT * FROM source;   -- Elapsed: 4.035 sec.
INSERT INTO bf_8192_8 SELECT * FROM source;   -- Elapsed: 4.221 sec.

As you can see, insertion predictably becomes slower with the addition of a bloom filter (there is more work to do when inserting) and with an increase in the number of hash functions (more hash functions need to be calculated). It is noticeable that adding the index itself slows down storage of the column by almost 45%, and each additional hash function adds another 8% slowdown.

Note: in this case, only one column is indexed, so the difference is more clearly visible here. In non-synthetic tests with more columns, it will be less noticeable.

Now let’s look at the read speed:

Reading speed was tested with the following script:

echo "select count() from bf_tests.bf_no_index where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-benchmark -i 100 -c 1

The number of skipped granules was taken from the query execution log.

table       | granules skipped (of 2443) | best query time (sec) | rows read (thousands) | MB read | QPS
bf_no_index | 0                          | 0.110                 | 20000                 | 820.00  | 8.001
bf_8192_1   | 2158                       | 0.031                 | 2330                  | 95.72   | 23.873
bf_8192_2   | 2319                       | 0.016                 | 1020                  | 41.65   | 54.591
bf_8192_3   | 2370                       | 0.012                 | 598.02                | 24.52   | 73.524
bf_8192_4   | 2373                       | 0.010                 | 573.44                | 23.51   | 80.132
bf_8192_5   | 2380                       | 0.011                 | 516.10                | 21.16   | 80.615
bf_8192_6   | 2383                       | 0.011                 | 491.52                | 20.15   | 83.641
bf_8192_7   | 2377                       | 0.010                 | 540.67                | 22.17   | 86.980
bf_8192_8   | 2371                       | 0.011                 | 589.82                | 24.18   | 74.192

As you can see, the reading speed directly depends on the number of skipped granules, and the number of skipped granules behaves in accordance with the theoretical predictions obtained from the formulas.
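That correspondence is easy to check numerically. If the needle occurs in exactly one granule, and we assume each of the remaining filters fires independently with probability p (a simplification), the expected number of skipped granules is (total - 1) * (1 - p):

```python
from math import exp

def expected_skipped(total_granules, m_bytes, k, n):
    """Expected granules skipped for a needle present in exactly one
    granule, assuming independent filters (a simplification)."""
    p = (1 - exp(-k * n / (m_bytes * 8))) ** k
    return round((total_granules - 1) * (1 - p))

# tokenbf_v1(8192, k, 0), 2443 granules, 8192 unique tokens per granule:
for k in (1, 2, 3):
    print(k, expected_skipped(2443, 8192, k, 8192))
```

The predictions (roughly 2155, 2323 and 2367 granules for k = 1, 2, 3) line up closely with the measured 2158, 2319 and 2370.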

Now we understand how the number of hash functions affects the speed of selects and inserts. Next, let’s try to figure out the effect of changing the size of the bloom filter vector.

Trial 2. Impact of vector length

This time let’s create a set of tables that differ only in the length of the bloom filter vector:

CREATE TABLE bf_no_index ( `s` String ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_2048_3 ( `s` String, index bf s TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_4096_3 ( `s` String, index bf s TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_8192_3 ( `s` String, index bf s TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_16384_3 ( `s` String, index bf s TYPE tokenbf_v1(16384, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_32768_3 ( `s` String, index bf s TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_65536_3 ( `s` String, index bf s TYPE tokenbf_v1(65536, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_131072_3 ( `s` String, index bf s TYPE tokenbf_v1(131072, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE bf_262144_3 ( `s` String, index bf s TYPE tokenbf_v1(262144, 3, 0) GRANULARITY 1 ) ENGINE = MergeTree ORDER BY tuple();

Let’s check how it impacts inserts:

INSERT INTO bf_no_index SELECT * FROM source; -- Elapsed: 1.998 sec.
INSERT INTO bf_2048_3 SELECT * FROM source;   -- Elapsed: 2.864 sec.
INSERT INTO bf_4096_3 SELECT * FROM source;   -- Elapsed: 2.866 sec.
INSERT INTO bf_8192_3 SELECT * FROM source;   -- Elapsed: 2.904 sec.
INSERT INTO bf_16384_3 SELECT * FROM source;  -- Elapsed: 2.969 sec.
INSERT INTO bf_32768_3 SELECT * FROM source;  -- Elapsed: 3.244 sec.
INSERT INTO bf_65536_3 SELECT * FROM source;  -- Elapsed: 3.438 sec.
INSERT INTO bf_131072_3 SELECT * FROM source; -- Elapsed: 3.798 sec.
INSERT INTO bf_262144_3 SELECT * FROM source; -- Elapsed: 4.299 sec.

Obviously a larger bloom filter means “more data needs to be written”, but the slowdown for each doubling of the bloom filter size is not very significant.

Let’s look at the actual size of these bloom filters inside data_skipping_indices:

SELECT table, data_compressed_bytes, data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table LIKE 'bf%3'
ORDER BY data_uncompressed_bytes ASC
table       | data_compressed_bytes | data_uncompressed_bytes
bf_2048_3   | 5022270               | 5003264
bf_4096_3   | 10049670              | 10006528
bf_8192_3   | 20097632              | 20013056
bf_16384_3  | 39241113              | 40026112
bf_32768_3  | 63488764              | 80052224
bf_65536_3  | 101119994             | 160104448
bf_131072_3 | 133740272             | 320208896
bf_262144_3 | 183368525             | 640417792

You can see here that for very long (and far from optimal) bloom filter vectors the compression becomes better, because in that case there are a lot of 'empty' segments in the vector.

Just for reference, here is the size of the column itself:

SELECT
    database,
    table,
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    marks_bytes
FROM system.columns
WHERE table = 'bf_262144_3'
database | table       | name | data_compressed_bytes | data_uncompressed_bytes | marks_bytes
default  | bf_262144_3 | s    | 662653819             | 660000000               | 58776

As you can see, even for the largest bloom filter selected, the size is still smaller than the column size.

Now let’s look at the speed of requests. We can check it this way:

echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'" | clickhouse-client --send_logs_level=trace 2>&1 | grep dropped
echo "select count() from bf_tests.bf_2048_3 where s = 'YMBTGV0R9N3RInZFOYcNkky4UOeiY2dH'"  | clickhouse-benchmark -i 100 -d 0 -c 1

The results:

table       | dropped granules (of 2443) | QPS    | fastest query (sec)
bf_no_index | 0                          | 8.001  | 0.110
bf_2048_3   | 1293                       | 12.588 | 0.068
bf_4096_3   | 2074                       | 28.467 | 0.029
bf_8192_3   | 2370                       | 78.234 | 0.011
bf_16384_3  | 2434                       | 59.685 | 0.014
bf_32768_3  | 2442                       | 22.844 | 0.033
bf_65536_3  | 2442                       | 10.880 | 0.076
bf_131072_3 | 2442                       | 7.635  | 0.116
bf_262144_3 | 2442                       | 4.543  | 0.205

As expected, a longer bloom filter means fewer false positives, which is clearly seen from the increase in the number of “skipped” granules in accordance with the expectations from the formulas.

But at the same time, it is obvious that longer bloom filters are not ‘free’. And when the length of the vector grows, the overhead of reading and processing the bloom filter itself begins to significantly exceed the benefit that it gives. It is clearly visible as a degradation of QPS with vector length growth.

              poor quality of bf (lot of false positives), but bf is small
                    │                                     and cheap to use
                    │
                    │        ┌──────── optimal value
             ▲      │        │
  query speed│      │        │
             │      │        │         high quality of bf (less of false positives)
             │      │        │               │  but bf itself is huge and expensive
             │      │        ▼               │     to use
             │      ▼      xxxxxxxxxxx       │
             │          xxxx         xxxx    │
             │        xxx                xxx ▼
             │      xxx                    xxxx
             │     xx                         xxx
             │   xx                             xxx
             │  xx                                xx
             │  x
          ───┼───────────────────────────────────────────►
             │                               bf vector size

Summary

Bloom filter indices can be tricky to tune but our experiments have revealed a number of lessons that help set parameters correctly in production systems.

  • Knowing the number of elements in the set (n), you can accurately predict the behavior of the bloom filter and choose the optimal parameters using the formulas from the article above. The next article will provide practical guidelines for estimating n in real data.
  • When changing the index_granularity of a table or the GRANULARITY of an index, you need to change the bloom filter parameters, because those settings directly impact the n parameter.
  • The more hash functions we use, the more expensive the bloom filter calculation will be. Therefore, it is undesirable to use values greater than 5, since this significantly slows down inserts and also leads to “oversaturation” of the vector with set bits, and thus performance degradation.

That being said, a single hash function would require a fairly long vector to get a satisfactory false-positive rate. Therefore, it is preferable to use from 2 to 5 hash functions, while increasing the size of the vector proportionally.

  • Vectors that are too long are very expensive to use, despite their low probability of false positives. Therefore, it is preferable to use bloom filters that are orders of magnitude smaller than the column size.
  • Internally, the bloom filter formulas work with a parameter called bits_per_element. Because of that, one of the simplest ‘rough’ ways of picking good bloom filter parameters is: take 3 hash functions and a bloom filter whose size (in bytes) equals the cardinality of the set. This can be used as a simple rule of thumb, but of course, in real life, you may need to do more checks and estimations.
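A quick check of this rule of thumb: with the filter size in bytes equal to the set cardinality, every element gets 8 bits regardless of n, so with k = 3 the expected false-positive rate stays at about 3% (a sketch using the standard formula):

```python
from math import exp

def fp_rate(m_bytes, k, n):
    # Standard bloom filter false-positive estimate.
    return (1 - exp(-k * n / (m_bytes * 8))) ** k

# Rule of thumb: filter bytes == cardinality, k = 3.
for n in (1024, 8192, 65536):
    print(n, round(fp_rate(n, 3, n), 4))
```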

In the next part of the article, we will look at how bloom filters work and can be used as well as tuned with real data in a non-synthetic example. Stay tuned!
