ClickHouse Dictionaries, Reloaded!

 

May 19, 2020

Dictionaries.png

ClickHouse external dictionaries are a “ClickHouse way” to handle multi-dimensional schema. ClickHouse Dictionaries Explained was the very first article in the Altinity blog three years ago. ClickHouse has improved significantly since then, and dictionaries have achieved a new level of utility. In this article, I will explain two new cool ClickHouse features that completely change the way how dictionaries can be used. Change in a good sense, of course.

Overview

ClickHouse External Dictionaries are in-memory key-value structures. They can be used to replace joins in SQL. Before multiple table joins were supported in ClickHouse, dictionaries were a silver bullet for many applications. Dictionaries are plugged to external sources. A source can be a table in another database (ClickHouse, MySQL or generic ODBC), file, or web service. ClickHouse can automatically refresh dictionaries from an external source to keep them up to date. Using dictionaries in this way makes ETL processes obsolete. Since dictionaries are usually in-memory, they can be used for low latent lookup queries, as well as boosting performance of queries in general.

There have been two usability issues with dictionaries, though. First, dictionaries had to be defined in XML configuration files. There was no way to define them using DDL. That made schema management inconsistent and inconvenient. For example, if one has no physical access to the server, there was no way to create a dictionary at all. 

The second issue is SQL compatibility. BI-tools and analysts had to know how to use dictionaries in SQL queries. Humans may learn, but BI-tools can not. Therefore dictionaries were very hard to use with tools like Tableau, for example.

I am happy to inform you that both issues are gone!

DDL for dictionaries

As mentioned above, up until recently, dictionaries had to be created in bulky XML files. Things changed a few months ago when CREATE DICTIONARY was introduced. Let’s create a test table in ClickHouse, populate it with 10M rows, and plug a dictionary to this table as below. Note that you need the latest ClickHouse 20.1 version for this behavior, e.g. 20.1.11.73:

CREATE TABLE table_for_dict (
  key_column UInt64,
  third_column String
)
ENGINE = MergeTree()
ORDER BY key_column;
INSERT INTO table_for_dict select number, concat('Hello World ', toString(number)) from numbers(10000000);
CREATE DICTIONARY ndict(
  key_column UInt64 DEFAULT 0,
  third_column String DEFAULT 'qqq'
)
PRIMARY KEY key_column
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' TABLE 'table_for_dict' PASSWORD '' DB 'default'))
LIFETIME(MIN 1 MAX 10)
LAYOUT(HASHED());

We defined a dictionary in DDL! We can check it with ‘SHOW DICTIONARIES’ and ‘SHOW CREATE DICTIONARY ndict’. It is also possible to create dictionaries with ‘ON CLUSTER’ syntax, propagating changes to all nodes in the clusters.

Let’s check our dictionary with the dictGet() function:

SELECT dictGet('default.ndict', 'third_column', toUInt64(1))
┌─dictGet('default.ndict', 'third_column', toUInt64(1))─┐
│ Hello World 1                                         │
└───────────────────────────────────────────────────────┘

Note that we need to add a database. DDL dictionaries are a part of the schema, so they have a database scope. 

Now things get interesting. It turns out we do not even need to use dictGet() calls anymore. 

Join with a Dictionary

Exposing a dictionary as a table with a Dictionary engine has been available since 2018, so dictionaries could be used in SQL queries similar to a table or a view. That seemed smart to do since a dictionary is always in memory, and we would expect good performance. We can join the table and dictionary with  the following test query, for example:

SELECT *
FROM numbers(5) AS n
INNER JOIN ndict ON key_column = number
┌─number─┬─key_column─┬─third_column──┐
│      0 │          0 │ Hello World 0 │
│      1 │          1 │ Hello World 1 │
│      2 │          2 │ Hello World 2 │
│      3 │          3 │ Hello World 3 │
│      4 │          4 │ Hello World 4 │
└────────┴────────────┴───────────────┘
5 rows in set. Elapsed: 2.986 sec. Processed 10.00 million rows, 358.89 MB (3.35 million rows/s., 120.21 MB/s.) 

If you are running this query with ClickHouse version 20.3 or earlier, you will get something similar. 3 seconds when there are only 5 rows to join is not fast at all! The query engine is not aware of the dictionary, and it processes 10 million rows at the right side of the join as it would do with a table. The example is deliberate to illustrate the inefficiency in this case. 

If you are experienced with ClickHouse, however, you can rewrite the query as follows:

SELECT
    number, 
    dictGet('default.ndict', 'third_column', number)
FROM numbers(5)
┌─number─┬─dictGet('default.ndict', 'third_column', number)─┐
│      0 │ Hello World 0                                    │
│      1 │ Hello World 1                                    │
│      2 │ Hello World 2                                    │
│      3 │ Hello World 3                                    │
│      4 │ Hello World 4                                    │
└────────┴──────────────────────────────────────────────────┘
5 rows in set. Elapsed: 0.002 sec. 

Now it is fast! The change was trivial, but it required a user to know that some tables are actually dictionaries. Of course, many users questioned why ClickHouse could not rewrite the query automatically. Starting with the newest 20.4 version, it finally can! 

SELECT *
FROM numbers(5) AS n
INNER JOIN ndict ON key_column = number
┌─number─┬─key_column─┬─third_column──┐
│      0 │          0 │ Hello World 0 │
│      1 │          1 │ Hello World 1 │
│      2 │          2 │ Hello World 2 │
│      3 │          3 │ Hello World 3 │
│      4 │          4 │ Hello World 4 │
└────────┴────────────┴───────────────┘
5 rows in set. Elapsed: 0.002 sec. 

Instant response with the join! In fact, ClickHouse does not join the right table anymore. Under the hood it makes dictionary calls in an efficient way. It requires only 5 calls instead of scanning a 10M rows table.

Another advantage is that all attributes of a dictionary can be joined together — with a joinGet() function, every attribute needs to be queried automatically.

Conclusion

ClickHouse external dictionaries are finally first class citizens in ClickHouse. Dictionaries are part of the schema. They can be created with a DDL locally or ON CLUSTER. The implementation details can be hidden from the end user starting with the ClickHouse 20.4 version, since ClickHouse can automatically convert joins to dictionary calls. ClickHouse dictionaries are now more user-friendly than ever before! 

Share