ClickHouse Dictionaries Explained

One of the most useful ClickHouse features is external dictionaries. They are extremely powerful, and if used efficiently may lead to quite elegant designs. I will lead you through the dictionaries using few examples that highlight basic and advanced usage scenarios. So let’s begin.

Consider there is a huge fact table with some financial data. Finances often deal with multiple currencies. Currencies are sometimes referred by numeric codes, where 840 stands for USD and 978 for EUR. Fact table designer decided that those numeric codes need to be used as currency keys. He has some logic, though you would probably do it differently. Now we want to have additional currency properties, for example currency name and country. Traditionally this is where SQL joins enter the play. It is possible to do the same in ClickHouse, but dictionaries are more efficient.

Simple example

Typical dictionary definition looks like this.

<dictionary>
  <name>currency</name>
  <source> … </source>
  <lifetime> ... </lifetime>
  <layout><flat/></layout>
  <structure>
    <id>currency_id</id>
    <attribute>
      <name>code</name>
      <type>string</type>
    </attribute>
    <attribute>
     <name>name</name>
     <type>string</type>
    </attribute>
  </structure>
</dictionary>

This is ‘flat’ layout dictionary that is optimal for integer keyed dictionaries of small and medium size. It resides completely in memory, and lookups are very fast.

Let’s look at ‘structure’ section first. It consists of id column name and attribute definitions. Column and attribute names correspond to the names in the source dataset. Id column type is always UInt64 for this dictionary layout. It is also possible to specify expression in addition to column name for both ids and attributes using ‘expression’ tag, and null value default for attributes using ‘null’ tag, e.g. '<null>Undefined</null>'.

Dictionary can be accessed by the set of ‘getDictXXX’ functions, e.g.

select number id,
       dictGetString('currency', 'code', toUInt64(id)) code,
       dictGetString('currency', 'name', toUInt64(id)) name
  from system.numbers where number<1000;

Similarly, it can be used in filters or wherever in SQL queries, including views and materialized views definitions.

Non-standard ID example

Now let’s consider that another database designer decided that it is more convenient to keep currency code straight in the fact table. So there is no need for join or lookup for a code. Can we use the dictionary in this case?

As mentioned above ‘flat’ dictionaries do not support keys other than UInt64. But there is another dictionary layout that works here.

<layout>
  <complex_key_hashed/>
</layout>
<structure>
  <key>
  <attribute>
    <name>code</name>
    <type>String</type>
  </attribute>
  </key>
  ...

Now key consists of attributes that allow to define multi column composite keys as well as use any data types. Here we use single column, but there are use cases when multiple columns make sense. Dictionary call is slightly different due to tuple.

select dictGetString('currency', 'name', tuple('USD'));

Effective Dates

An obvious extension of our example is currency rates. Currency rates depend on currency pair and date. So one approach would be to create a dictionary with composite key (code_from, code_to, rate_date).

<layout>
  <complex_key_hashed/>
</layout>
<structure>
  <key>
    <attribute>
      <name>code_from</name>
      <type>String</type>
    </attribute>
    <attribute>
      <name>code_to</name>
      <type>String</type>
    </attribute>
    <attribute>
      <name>rate_date</name>
      <type>Date</type>
    </attribute>
  </key>
  <attribute>
    <name>rate</name>
    <type>Float64</type>
  </attribute>
  ...

It can be accessed as:

select dictGetFloat64(‘currency_rate’, ‘rate’, tuple(‘USD’, ‘EUR’, today()));

The problem is that rate are not always updated. E.g. they usually don’t change on weekends and holidays. We will have to be careful populating dates in the dictionary for all currency pairs, or eventually we can get zero rate. However, there is a fancier way to handle such uses cases with ‘range_hashed’ layout.

<name>currency_rates</name>
<layout>
  <range_hashed />
</layout>
<structure>
  <id>
    <name>id</name>
  </id>
  <range_min>
     <name>start_date</name>
  </range_min>
  <range_max>
     <name>end_date</name>
  </range_max>
  <attribute>
    <name>rate</name>
    <type>Float64</type>
  <attribute>
  ...

As you can see there is a start and end date for the rates. If end date is empty (nulls are not yet supported at the time of writing, so empty means ‘toDate(0)’), it is considered as open-ended period. There is no need to populate dates carefully. Unfortunately, it is not possible to combine ‘complex_key_hashed’ and ‘range_hashed’ layouts and use composite key, so currency pair should be encoded in one number, e.g. id_from*1000+id_to. The access function in this case is modified as following:

select dictGetFloat64('currency_rate', 'rate', toUInt64(840978), today());

Here was assume that id_from=840, id_to=978 and dictionary key = 840*1000 + 978 = 840978.

Source and Updates

Once we defined the structure, let’s discuss where we can get the data from.

ClickHouse supports several dictionary sources:

  • Local file
  • ODBC data source, e.g. MySQL or Postresql database
  • ClickHouse table
  • Local executable script
  • HTTP/HTTPS service

All sources make sense, but probably the most frequently used are MySQL and Postresql tables. For mysql source definition may may look like this:

<source>
  <mysql>
    <port>3306</port>
    <user>dw</user>
    <password>topsecrect</password>
    <replica>
      <host>dwds1</host>
      <priority>1</priority>
    </replica>
    <db>ch</db>
    <table>dim_country</table>
    <where>country_key>=0</where>
   </mysql>
</source>

Note, that it supports multiple replicas for fail-over. And it is also possible to apply SQL filter to the source table. One caveat here is that ClickHouse opens a database connection for every dictionary from every cluster node. So database should have enough available connections.

Tips and Tricks

While dictionaries provide the convenient way to access dimension and lookup tables, there are following restrictions and problems that may be annoying at times. Here is my list of favorites:

  • ‘Normal’ keys are only UInt64.
  • No way to list all values for specific attribute
  • No on demand updates
  • Every cluster node has its own copy

We have already talked about the first one — it is possible to use ‘complex_key_hashed’ if different data type is needed. Let’s go through other bullets.

While it may be rarely needed when running reports, listing all values is convenient for analysis and also often required by UI tools. Eventually ClickHouse may propose a way to list all values (as of April 2017 this is in short TODO list), but at the moment we have to invent a workaround. Let’s revisit the first dictionary example at the beginning of the article:

select number id,
       dictGetString('currency', 'code', toUInt64(id)) code,
       dictGetString('currency', 'name', toUInt64(id)) name
  from system.numbers where number<1000;

it uses the fact that all keys are in 1..999 range so system.numbers table does the job. In general case the dedicated table with key column may help:

select currency_key,
       dictGetString('currency', 'code', currency_key) code,
       dictGetString('currency', 'name', currency_key) name
  from dim_currency;

This is quite an elegant approach that also allows to do more efficient filtering sometimes. Compare:

select sum(impressions) 
  from very_big_table 
 where dictGetString(‘currency’, ‘code’, currency_key) =‘USD’

vs.

select sum(impressions) 
  from very_big_table 
 where currency_key in (
        select currency_key from dim_currency
         where dictGetString(‘currency’,‘code’, currency_key) =‘USD’
       )

In the second example ClickHouse applies filtering right on the column level, that is more efficient, especially if column is a part of table primary key.

However such an approach requires to develop an infrastructure to maintain key tables. From the other side it may be useful in order to synchronize dictionaries’ state across cluster nodes..

It is convenient to have every dictionary in its own config file. It can be configured in ‘dictionaries_config’ configuration parameter. This parameter contains the absolute or relative path to the file with the dictionary configuration. A relative path is relative to the directory with the server config file. The path can contain wildcards and ?, in which case all matching files are found. Example: dicts/.xml.

ClickHouse’s standard way to refresh dictionaries is using scheduled refreshes configured per dictionary (see lifetime parameter). It is ok approach if dictionaries are small, but may be undesired if there are many dictionaries of significant size. There is no explicit SQL call to refresh dictionary on demand as well. Fortunately, ClickHouse is smart enough to refresh dictionary if dictionary definition file is changed. This opens up a number of possibilities but requires to develop external scripts dealing with dictionary files.

It worths mentioning, that Yandex team is aware of these limitations and we may expect that one demand or triggered updates will be available in the next few months.

Another problem may arise if you operate a ClickHouse cluster. Every node has its own independent dictionary copy and it is developer’s or DBA’s responsibility to make sure they are in sync. I can suggest a couple of possible solutions but neither one is ideal.

One approach is to create a replicated table and use it for synchronization. The table may have two columns (dict_name, dict_refresh_time). Script running at node can look at this table periodically and compare dict_refresh_table with its local time (in system.dictionaries table, see below). If local time is earlier, than go and touch the dictionary file.

If dictionary has a key table, this table updates can be used to trigger dictionary refresh as well.

Also one can utilize Zookeper to replicate configuration updates, including dictionaries definition files.

Troubleshooting

If something goes wrong with dictionaries there are two places to look into:

  • /var/log/clickhouse-server/clickhouse-server.err.log
  • system.dictionaries table

Log file contains all errors, including dictionary load/refresh ones. system.dictionaries table allows to look at ‘processed’ dictionary definition. Here is an example:

SELECT *
  FROM system.dictionaries 
 WHERE name = 'dim_country'
FORMAT Vertical

Row 1:
??????
name:dim_country
origin:/etc/clickhouse-server/dict/dim_country.xml
type:Hashed
key: UInt64
attribute.names: ['country_code','country_name','country_group','world_region','language','sub_region','sales_region','tier']
attribute.types: ['String','String','String','String','String','String','String','String']
bytes_allocated: 310056
query_count: 0
hit_rate:1
element_count: 255
load_factor: 0.2490234375
creation_time: 2017-04-11 15:42:04
last_exception:
source:MySQL: ch.dim_country, where: country_key>=0

In case of error the reason can be found in ‘last_exception’ column.

Typical problems include inaccessible data source or incompatible types. If flat dictionary size exceeds the 500K rows, dictionary needs to be switched to the different layout. There are two options: ‘hashed’ and ‘cache‘. The choice between them depends on how much memory you can allocate to that. There is also ‘complex_key_cache’ alternative for huge dictionaries with composite keys.

It may be also convenient to create distributed table on top of system.dictionaries across cluster nodes and access everything from a single place. This allows to check consistency very easy. But such DBA tricks is probably a topic for another article.

Share

3 Comments

Comments are closed.