One of our customers recently had a problem using CickHouse: the simple workflow of load-analyze-present wasn’t as efficient as they were expecting. The body of the problem was with loading and presenting IPv4 and IPv6 addresses, which are traditionally stored in ClickHouse as UInt32 and FixedString(16) columns. These types have many advantages, like compact footprint and ease of comparing values. But they also have shortcomings that prompted us to seek a better solution.
For example, imagine that you want to load your dataset with IPv4-values into a table. It generally looks like the following. We just show IPv4, but IPv6 values are treated analogously.
INSERT INTO hits_old (url, from) VALUES ('https://wikipedia.org',
Not too bad, but you can not insert too many values this way. You may prefer to load data from the file instead:
$ head -n3 data.csv
Unfortunately loading this file is not that easy. Before our new feature arrived there were two options:
- Use a staging table and perform type conversion when ‘INSERT SELECT’ from the staging table
- Use materialized columns or materialized views for type conversions
Both approaches require an additional data structure, hence add overhead to processing.
In order to overcome these problems, we introduced two new data types to ClickHouse: IPv4 and IPv6. With these new data types conversions are automatic. User code is now more straightforward and easier to read:
INSERT INTO hits_new (url, from) VALUES ('https://wikipedia.org', '184.108.40.206')('https://clickhouse.yandex',
And the best thing: you can now load your dataset with
INSERT INTO ... FORMAT directly:
$ clickhouse client -q 'INSERT INTO hits_new FORMAT CSV' < data.csv
IPv4 values have nice human-friendly textual representation, which is what you see SELECTing from a client console andor exporting values into a file.
Now let’s check out detailed differences between the traditional and new approach of handling IP address values.
Imagine that you need to analyze user activity. Previously you would have designed the following table:
CREATE TABLE hits_old ( url String, from UInt32 ) ENGINE = MergeTree() ORDER BY url;
The new way of doing exactly the same is:
CREATE TABLE hits_new ( url String, from IPv4 ) ENGINE = MergeTree() ORDER BY url;
Now let’s check out how third-party software sees the old table:
DESCRIBE TABLE hits_old; url String from UInt32
The new way looks as follows:
DESCRIBE TABLE hits_new; url String from IPv4
The new data type has a distinct type name. Now any third-party software has a chance of recognizing the column as IPv4 right away, without even looking at stored values.
Let’s fill our table with data. As you have already seen, inserting data into a table with new IPv4 columns is easy:
INSERT INTO hits_new (url, from) VALUES ('https://ya.ru', '220.127.116.11');
You can even load data files directly that contain CSV or any other format supported by ClickHouse:
$ clickhouse client -q 'INSERT INTO hits_new FORMAT CSV' < data.csv
This is made possible with automatic conversion of the textual representation at load time, as if you were wrapping each and every value with the
IPv4StringToNum() function call.
Please note that loading old-style datafiles with int-like IP addresses into table with IPv4 column does not work. The process will fail with:
Code: 441. DB::Exception: Invalid IPv4 value.
Instead you can use a UInt32-backed table and convert it later.
Working with values
You can use any function that takes a UInt32 argument and call it with an IPv4 value. In fact, you can even use basic arithmetic on IPv4 values, even though this doesn’t make much sense. It would be more meaningful to use specialized functions, like
SELECT IPv4NumToStringClassC(from) FROM hits_new LIMIT 1; ┌─IPv4NumToStringClassC(from)─┐ │ 183.247.232.xxx │ └─────────────────────────────┘
Please note that no other type conversions (other than to UInt32) are performed implicitly, so if you want to turn your IPv4 value into a string you have to do this explicitly with
SELECT concat(url, ' was accessed from ', IPv4NumToString(from)) FROM hits_new LIMIT 1; ┌─concat(url, ' was accessed from ', IPv4NumToString(from))──┐ │ https://clickhouse.yandex was accessed from 18.104.22.168 │ └────────────────────────────────────────────────────────────┘
After you have loaded all the data and performed your magic on it, I assume you want to see the results:
SELECT * FROM hits_old LIMIT 3; ┌─url────────────────────────────────┬───────from─┐ │ https://clickhouse.yandex │ 3086477370 │ │ https://clickhouse.yandex/docs/en/ │ 1953112818 │ │ https://wikipedia.org │ 1962748037 │ └────────────────────────────────────┴────────────┘
The old table is quite a nuisance, as you have to wrap data in a function call to get a human-friendly textual IP address. By contrast values in the new table look like normal IP addresses straight away:
SELECT * FROM hits_new LIMIT 3; ┌─url────────────────────────────────┬───────────from─┐ │ https://clickhouse.yandex │ 22.214.171.124 │ │ https://clickhouse.yandex/docs/en/ │ 126.96.36.199 │ │ https://wikipedia.org │ 188.8.131.52 │ └────────────────────────────────────┴────────────────┘
Quite a bit more convenient, right? The same happens when you are exporting values from table in bulk. Here’s the old approach.
SELECT * FROM hits_old LIMIT 3 FORMAT CSV;
And now the new approach:
SELECT * FROM hits_new LIMIT 3 FORMAT CSV;
Under the hood
Even though it appears that new-style IPv4 values are stored as strings, this is actually not the case. Let’s dump a value of the hits_old as hex to see the binary representation:
SELECT toTypeName(from), hex(from) FROM hits_old LIMIT 1; ┌─toTypeName(from)─┬─hex(from)─┐ │ UInt32 │ B7F7E83A │ └──────────────────┴───────────┘
Doing the same thing for the hits_new table produces exactly the same result:
SELECT toTypeName(from), hex(from) FROM hits_new LIMIT 1; ┌─toTypeName(from)─┬─hex(from)─┐ │ IPv4 │ B7F7E83A │ └──────────────────┴───────────┘
Not only IPv4 values are stored as efficient as UInt32-counterparts, but both cases have the same binary layout.
To be more precise, IPv6 and IPv6 are not real types but domains. Adding a new type to ClickHouse is an elaborate and difficult task that is not normally recommended. Domains are a sort of “tag” attached to an existing base type that redefines the type name and overrides the way values are parsed from and rendered to textual form. This design resembles user-defined domains from PostgreSQL, except there are no value constraints. That said, any IPv4 value representable as UInt32 is valid and any IPv6 as FixedString(16) is valid too. Consequently there is no need to for constraints, at least for now.
Converting old columns to new ones
If you are convinced and want to convert your existing IPv4-disguised-as-UInt32 columns into true IPv4-columns, you can easily do that with
ALTER TABLE hits_old MODIFY COLUMN from IPv4; SELECT * FROM hits_old LIMIT 1; ┌─url───────────────────────┬───────────from─┐ │ https://clickhouse.yandex │ 184.108.40.206 │ └───────────────────────────┴────────────────┘
How expensive would that conversion be? Let’s find out! I prepared a datafile based on the dataset from The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset – 20171130. It has four million rows, so we can now have some fun.
$ wc -l ./IPv4_test_data_UInt32.tsv 4281966 ./IPv4_test_data_UInt32.tsv $ clickhouse-client -q 'DESCRIBE hits_old' url String from UInt32 $ time clickhouse client -q 'INSERT INTO hits_old FORMAT TSV' < ./IPv4_test_data_UInt32.tsv real 0m23.242s user 0m20.459s sys 0m0.268s $ clickhouse client -q "SELECT * FROM hits_old WHERE url != '' LIMIT 1" host-213-144-115-157.reverse.teknotel.com 3583013789 $ time clickhouse client -q "ALTER TABLE hits_old MODIFY COLUMN from IPv4;" real 0m0.141s user 0m0.099s sys 0m0.021s $ clickhouse client -q "SELECT * FROM hits_old WHERE url != '' LIMIT 1" host-213-144-115-157.reverse.teknotel.com 220.127.116.11 $ clickhouse client -q "SELECT count(*) FROM hits_old" 4281966
The conversion was nearly instant, even on my relatively old laptop! The trick is that the data itself wasn’t touched, only the column headers describing the type of the data. Thus very little work was done and execution time doesn’t depend on the size of data.
As this article showed, there are two issues with the traditional ClickHouse approach to IP address handling:
- To import data from a human-friendly IPv4/6 format you have to use additional temporary storage.
- There is no easy way of providing information about column type other than having contrived name, so integration with third-party software might be complicated.
These are fixed by the new IPv4 and IPv6 domains, which have the following traits:
- Distinct type name
- Light footprint
- Compatibility with legacy IPv4/v6 storage format
- Human-friendly text format for imported and exported values
Altinity contributed the new domain type implementation to ClickHouse. It is available in ClickHouse starting from version 19.3. It was great to see the pull request go in. We are now moving to new contributions. Stay tuned for more blog articles describing our work!