Neat ClickHouse® features more people need to know about

As ClickHouse’s popularity has grown over the years, more and more people have heard about its core features, like its incredible performance, high compression ratio, and ability to read and write nearly all available data formats.
Compared to 2016, when ClickHouse first became open source, it has also become a lot easier to work with: its compliance with standard SQL syntax has greatly improved over the years, meaning new users can just hop on and start working with it.
But in keeping with the principle of “easy to learn, hard to master”, if you want to master ClickHouse, you need to look beyond standard SQL. ClickHouse has a huge number of hidden gems that can help with your day-to-day work, which many people don’t know about. Even though most of them are well documented in the official documentation, you might not come across them if you don’t know what you are looking for.
In this blog post, I’ll cover some of my favorite small features, ones I regularly use when helping people with community questions, that you definitely should know about.
SELECT Modifiers
In a column-oriented DBMS like ClickHouse, star queries like SELECT * FROM table should generally be avoided. At least, that’s true for regular queries issued by your application. However, the typical day-to-day work of a DBA or database developer often includes these types of queries, as it would otherwise take a huge amount of time to type out all X columns of a table manually.
So star queries are all over the place when working manually. But what if you want to execute the same function on multiple columns from your star? Imagine you have the following table:
CREATE TABLE customer (
customerId UInt64,
custom_num_1 UInt64,
custom_num_2 UInt64,
...
custom_num_50 UInt64,
custom_string_1 String,
custom_string_2 String,
...
custom_string_50 String
);
When you want to optimize your table, you might be interested in the average length of the String columns and the maximum value of your numeric columns. If you were to handwrite a query to collect this information, you would have to write out the appropriate function call for all 100 columns.
Some database systems (for example MySQL, but also ClickHouse) allow you to query the INFORMATION_SCHEMA.COLUMNS table with a meta query that generates your final query.
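For example, a quick sketch of such a meta query (using the table and column names from the example above) that builds one aggregate expression per column, which you can then paste into your final query:
-- one aggregate expression per column, ready to paste into the final query
SELECT concat(
    if(data_type = 'String',
       concat('avg(length(', column_name, '))'),
       concat('max(', column_name, ')')),
    ',') AS expression
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'customer' AND column_name != 'customerId'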
This can be convenient for a lot of DBAs as they might already be familiar with this method. However, ClickHouse provides an even faster way to achieve your goal: SELECT Modifiers.
By utilizing a combination of modifiers, our task comes down to a simple query:
SELECT
COLUMNS('.*num.*') APPLY max,
COLUMNS('.*string.*') APPLY length APPLY avg
FROM customer
We use the COLUMNS modifier to apply a regex to the column names, filtering for columns whose names contain “num” or “string”. We then apply the max function to the numeric columns. For the string columns, we first apply the length function and then the avg function. This method gives us the desired results, and takes far less time than building a query via the information schema or manually writing down 100 columns.
Manual test data in various formats
Often, when I help customers or ClickHouse users in the open Slack channel, I need to provide examples based on a test table. Sometimes, writing the table definition for such a test table and filling it with test data involves more code than what I actually want to show.
As you can see in the previous section, the table definition I used to explain SELECT modifiers is longer than the actual code I wanted to present (and it was already abbreviated).
But ClickHouse offers a way to work directly on data as if it were in a table: the format table-function. With this function, you specify which format you want to use (like JSONEachRow) and then enter the data directly, instead of inserting it into a table first:
SELECT *
FROM FORMAT(JSONEachRow, '{"customerId":1,"type":1,"custom_num1":4711}\n{"customerId":2, "type":2,"custom_ips":["127.0.0.1","127.0.0.2"]}')
+-customerId-+-type-+-custom_num1-+-custom_ips----------------+
|          1 |    1 |        4711 | []                        |
|          2 |    2 |        NULL | ['127.0.0.1','127.0.0.2'] |
+------------+------+-------------+---------------------------+
The function generates a result set with two rows, and even complex types like Arrays work. Additionally, you can use a large variety of data formats.
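For instance, a minimal sketch with CSV input; ClickHouse infers the structure from the data, and with no header row the columns should come out auto-named as c1 and c2:
SELECT *
FROM FORMAT(CSV, '1,foo\n2,bar')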
Generate Series with the numbers table-function
Some DBMSs can generate number sequences using table functions. ClickHouse is no exception, providing the numbers() table-function for that.
So far, that’s nothing special, but did you know that you can also use this function to easily generate date or time sequences?
ClickHouse supports simple arithmetic on the Date and DateTime data types, which lets you generate sequences of dates or timestamps:
SELECT
number,
now() - number AS previousTimes,
toDate(now()) + number AS nextDays
FROM numbers(10)
+-number-+-------previousTimes-+---nextDays-+
|      0 | 2024-04-25 13:04:30 | 2024-04-25 |
|      1 | 2024-04-25 13:04:29 | 2024-04-26 |
|      2 | 2024-04-25 13:04:28 | 2024-04-27 |
|      3 | 2024-04-25 13:04:27 | 2024-04-28 |
|      4 | 2024-04-25 13:04:26 | 2024-04-29 |
|      5 | 2024-04-25 13:04:25 | 2024-04-30 |
|      6 | 2024-04-25 13:04:24 | 2024-05-01 |
|      7 | 2024-04-25 13:04:23 | 2024-05-02 |
|      8 | 2024-04-25 13:04:22 | 2024-05-03 |
|      9 | 2024-04-25 13:04:21 | 2024-05-04 |
+--------+---------------------+------------+
By applying multiplication, you can also introduce step sizes. And since the resulting data type of the columns is still Date or DateTime, you can be sure that only valid dates are generated.
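For example, a quick sketch of a weekly sequence (the step size of seven days is just an illustration):
SELECT
    toDate(now()) + (number * 7) AS weeklyDates  -- step of 7 days
FROM numbers(5)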
Data formatting in custom columns
Sometimes you need to partially format your data on output. For example, if you want to push data into a streaming service like Kafka, some columns should remain regular columns, while data from others needs to be combined into a payload column in a given format (typically JSON).
Of course, you can do that in other DBMSs as well, by using string concatenation to build your JSON manually, by using specific JSON functions like toJSONString, or by manually creating JSON objects. ClickHouse has you covered with the function formatRowNoNewline(). This function allows you to format an arbitrary number of columns into any of the output formats ClickHouse has to offer.
And, of course, you can also use SELECT modifiers to specify which columns to format:
SELECT
customerId,
formatRowNoNewline('JSONEachRow', COLUMNS('.*num.*')) AS payload
FROM customer
LIMIT 10
+-customerId-+-payload------------------------------------------------------------------------------+
| 20 | {"custom_num_1":"4503644724578621668","custom_num_2":"156","custom_num_50":"32624"} |
| 111 | {"custom_num_1":"9395348731023764754","custom_num_2":"4","custom_num_50":"8919"} |
| 187 | {"custom_num_1":"4410745110154105282","custom_num_2":"67","custom_num_50":"19015"} |
| 231 | {"custom_num_1":"8206799308850873985","custom_num_2":"151","custom_num_50":"43260"} |
| 262 | {"custom_num_1":"14904510309594397590","custom_num_2":"83","custom_num_50":"2802"} |
| 375 | {"custom_num_1":"14468162691130275987","custom_num_2":"13","custom_num_50":"6910"} |
| 388 | {"custom_num_1":"15191470301382236130","custom_num_2":"110","custom_num_50":"39490"} |
| 434 | {"custom_num_1":"11648353314185268442","custom_num_2":"211","custom_num_50":"52725"} |
| 439 | {"custom_num_1":"8430391284904487000","custom_num_2":"31","custom_num_50":"43376"} |
| 468 | {"custom_num_1":"11265143749859112447","custom_num_2":"41","custom_num_50":"58748"} |
+------------+--------------------------------------------------------------------------------------+
Querying the whole cluster
Sometimes, querying a single node is not enough. Imagine that you are looking for queries that run longer than a specific threshold. You can find that information in ClickHouse’s system.query_log table, but you would have to check every host separately.
Once again, ClickHouse has you covered. The table-function clusterAllReplicas allows you to execute a query on all nodes of a cluster and gives you the combined result as if it were a local table:
SELECT
user,
substring(query, 1, 15) AS query_part,
query_duration_ms
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
LIMIT 3
+-user----+-query_part------+-query_duration_ms-+
| default | INSERT INTO col |                52 |
| default | INSERT INTO col |                55 |
| default | INSERT INTO col |                51 |
+---------+-----------------+-------------------+
Bonus: Working with AggregateStates
Working with Materialized Views and the possibilities of AggregateFunctions could fill multiple blog posts on their own. Therefore, I only want to briefly mention some functionality that not everyone knows about, but which can come in handy.
Let’s assume we have the following two tables, which count unique customers per hour or day:
CREATE TABLE customers_hourly (
eventDate Date,
eventHour UInt8,
uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE=AggregatingMergeTree
ORDER BY (eventDate, eventHour);
CREATE TABLE customers_daily (
eventDate Date,
uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE=AggregatingMergeTree
ORDER BY (eventDate);
Initialize Aggregation
Filling those tables is quite easy with Materialized Views. But what if you want to insert a row manually? For example, for testing purposes, you want to insert the test customerId 4711 at 3 different hours of the same day.
As uniqueCustomers is an AggregateFunction, you cannot directly insert the customerId into that column, so something like this doesn’t work:
INSERT INTO customers_hourly
SELECT eventDate,eventHour,customerId
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
One way to fill this data type is to generate an aggregation state by using a window function, with the window covering only the current row:
INSERT INTO customers_hourly
SELECT eventDate,eventHour,
uniqState(toUInt64(assumeNotNull(customerId))) OVER ()
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
However, the readability is not really optimal. Fortunately, there is a built-in function that achieves exactly this goal: you can just use the initializeAggregation function to insert directly into the table:
INSERT INTO customers_hourly
SELECT eventDate,eventHour,
initializeAggregation('uniqState',toUInt64(assumeNotNull(customerId)))
FROM FORMAT(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)
Finalize Aggregation
Now that we know how to write directly into an AggregateFunction column, how can we read from it? Of course, we can use uniqMerge to get the final result, but that is an aggregate function, which means we need to use GROUP BY, etc., to get the final result. If we want to see the individual intermediate results we added to the table, we would need to use a window function again to prevent the rows from collapsing. Alternatively, we can use finalizeAggregation to make it easier:
SELECT
eventDate,
eventHour,
finalizeAggregation(uniqueCustomers)
FROM customers_hourly
┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
│ 2024-04-26 │ 1 │ 1 │
│ 2024-04-26 │ 2 │ 1 │
│ 2024-04-26 │ 3 │ 1 │
└────────────┴───────────┴──────────────────────────────────────┘
┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
│ 2024-04-26 │ 1 │ 1 │
│ 2024-04-26 │ 2 │ 1 │
│ 2024-04-26 │ 3 │ 1 │
└────────────┴───────────┴──────────────────────────────────────┘
-MergeState
How do we go from the hourly aggregation to the daily aggregation? In cases where you could sum up the results from the hourly table, that would be quite easy to achieve. Unfortunately, with the uniq function, you cannot simply sum up the intermediate results, as the same user could have been active during multiple hours. The daily result is not the sum of all hourly results.
I’ve seen customers resort to recalculating the daily table from the base table, which is of course possible, but if the calculation is very CPU-intensive, this can have a heavy impact on your cluster, and you don’t want to do the same calculation multiple times. Instead, you can just continue the aggregation by using uniqMergeState:
INSERT INTO customers_daily
SELECT eventDate, uniqMergeState(uniqueCustomers)
FROM customers_hourly
GROUP BY eventDate
The logic is as easy as it sounds: it merges the intermediate results, but instead of returning the merged result, it returns the new internal state, which is then stored in the daily aggregation table.
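To read the final daily values back out, you can then use the uniqMerge function mentioned above; a short sketch:
SELECT
    eventDate,
    uniqMerge(uniqueCustomers) AS uniqueCustomers
FROM customers_daily
GROUP BY eventDate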
These are 5 small features (plus 3 bonus features for aggregation) that I think are good to know. If you have other neat features or topics you want to read about, just contact me or my colleagues via Slack or LinkedIn; we love talking about ClickHouse.
And never forget: open source lives through its community!
The original article was published by Stefan while he was at Double.Cloud.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.