Five Ways to Handle as-of Queries in ClickHouse

By Alexander Zaitsev on April 8th, 2020


A common use case in time series applications is to get the measurement value at a given point of time. For example, if there is a stream of measurements, one often needs to query the measurement as of current time or as of the same day yesterday and so on. Financial market data analysis and all sorts of monitoring applications are typical examples. 

Databases have different ways to achieve this task, and ClickHouse is no exception. In fact, ClickHouse offers at least five different approaches. In this article, we will review and compare them.

As-of use case description

We already described this use case in our “ClickHouse Continues to Crush Time-Series” article a year ago. We will use our 500 billion row dataset from “Analyzing 500 billion rows on an Intel NUC” and query a single sensor as of some date in the time range. 

Just to remind you, the table looks as simple as:

CREATE TABLE billy.readings (
    sensor_id Int32 Codec(DoubleDelta, LZ4),
    time DateTime Codec(DoubleDelta, LZ4),
    date ALIAS toDate(time),
    temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);

It stores 525 billion measurements from 1M sensors for the year 2019.

SELECT count()
FROM billy.readings
┌──────count()─┐
│ 525600000000 │
└──────────────┘
1 rows in set. Elapsed: 0.004 sec.

Now, let’s pick sensor_id=12345 and August 1st, 2019, and ask what the temperature was for this sensor on that date. Note that we do not know for sure whether there was even a single measurement on that day or the day before. So the challenge is to find the most recent measurement taken at or before the given time. This is where the name ‘as-of’ comes from.

All queries share a common filter on sensor_id and date; the difference between them is how the filtered data is processed:

sensor_id = 12345 and date <= '2019-08-01'

#1. Join approach

The most traditional SQL approach is to calculate the proper time point for every group in a subquery and then perform a join. This works in ClickHouse as well. Unfortunately, ClickHouse cannot yet use indexes properly for a join (this is going to be fixed soon), so the sensor_id filter has to be repeated in the outer query:

SELECT *
FROM readings
INNER JOIN 
(
    SELECT 
        sensor_id, 
        max(time) AS time
    FROM readings
    WHERE (sensor_id = 12345) AND (date <= '2019-08-01')
    GROUP BY sensor_id
) AS last USING (sensor_id, time)
WHERE sensor_id = 12345
┌─sensor_id─┬────────────────time─┬─temperature─┐
│     12345 │ 2019-08-01 23:59:00 │       40.38 │
└───────────┴─────────────────────┴─────────────┘
1 rows in set. Elapsed: 0.062 sec. Processed 1.98 million rows, 19.82 MB (31.91 million rows/s., 319.07 MB/s.) 

Note that ClickHouse had to scan almost 2 million rows.

#2. Tuple IN approach

A more ClickHouse-friendly way is to replace the JOIN with an IN clause. ClickHouse can use IN with multiple columns grouped into a tuple:

SELECT *
FROM readings
WHERE (sensor_id, time) IN 
(
    SELECT 
        sensor_id, 
        max(time)
    FROM readings
    WHERE (sensor_id = 12345) AND (date <= '2019-08-01')
    GROUP BY sensor_id
)
┌─sensor_id─┬────────────────time─┬─temperature─┐
│     12345 │ 2019-08-01 23:59:00 │       40.38 │
└───────────┴─────────────────────┴─────────────┘
1 rows in set. Elapsed: 0.030 sec. Processed 8.19 thousand rows, 98.30 KB (269.64 thousand rows/s., 3.24 MB/s.) 

Here, the index is used appropriately, so we have decent performance.

#3. Aggregate function approach

ClickHouse users can also apply the ‘argMax’ aggregate function to as-of queries. It returns the value of its first argument for the row with the maximum value of the second:

SELECT 
    sensor_id, 
    max(time), 
    argMax(temperature, time)
FROM readings
WHERE (sensor_id = 12345) AND (date <= '2019-08-01')
GROUP BY sensor_id
┌─sensor_id─┬───────────max(time)─┬─argMax(temperature, time)─┐
│     12345 │ 2019-08-01 23:59:00 │                     40.38 │
└───────────┴─────────────────────┴───────────────────────────┘
1 rows in set. Elapsed: 0.054 sec. Processed 991.23 thousand rows, 11.89 MB (18.26 million rows/s., 219.17 MB/s.)

It is slightly slower than the IN approach, but it can be packaged into an aggregating materialized view if needed. Another difference is that JOIN/IN needed an index that matched our use case perfectly, while ‘argMax’ can work well even if there is no index on the time column.
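For illustration, here is a minimal sketch of such a materialized view, pre-aggregating the last reading per sensor per day with -State combinators. The view name, partitioning, and column names are assumptions for this example, not part of the original article:

```sql
-- Hypothetical materialized view: keeps argMax state per sensor per day
CREATE MATERIALIZED VIEW billy.readings_last_daily
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (sensor_id, day)
AS SELECT
    sensor_id,
    toDate(time) AS day,
    maxState(time) AS time_state,
    argMaxState(temperature, time) AS temperature_state
FROM billy.readings
GROUP BY sensor_id, day;

-- The as-of query then merges the stored states
SELECT
    sensor_id,
    maxMerge(time_state) AS time,
    argMaxMerge(temperature_state) AS temperature
FROM billy.readings_last_daily
WHERE (sensor_id = 12345) AND (day <= '2019-08-01')
GROUP BY sensor_id;
```

With this view in place, the as-of query scans at most one pre-aggregated row per sensor per day instead of the raw measurements.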

#4. ASOF JOIN

In addition to the fairly standard SQL approaches described above, ClickHouse also provides some exotic instruments. One of them is ASOF JOIN, inspired by kdb+, which can join two time series with different time points by matching each point in one series with the nearest preceding point in the other. We only have one time series here, but we can still use ASOF JOIN by creating a dummy one-row series for a given sensor.

SELECT readings.*
FROM 
(
    SELECT 
        toInt32(12345) AS sensor_id, 
        toDateTime('2019-08-01 23:59:00') AS time
) AS base
ASOF INNER JOIN readings USING (sensor_id, time)
WHERE readings.sensor_id = 12345
┌─readings.sensor_id─┬───────readings.time─┬─temperature─┐
│              12345 │ 2019-08-01 23:59:00 │       40.38 │
└────────────────────┴─────────────────────┴─────────────┘
1 rows in set. Elapsed: 0.069 sec. Processed 991.23 thousand rows, 11.89 MB (14.31 million rows/s., 171.76 MB/s.) 

Once again, we had to add an explicit sensor_id filter, since ClickHouse joins do not yet push the filter down into the joined table.
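To see where ASOF JOIN really shines, consider stitching our readings with a second series sampled on its own schedule. The humidity table below is hypothetical, invented purely for illustration; only the readings table exists in the dataset:

```sql
-- Hypothetical second time series with its own timestamps
CREATE TABLE billy.humidity (
    sensor_id Int32,
    time DateTime,
    humidity Decimal(5,2)
) Engine = MergeTree
ORDER BY (sensor_id, time);

-- For every temperature reading, pick the most recent humidity
-- measurement taken at or before its timestamp
SELECT r.sensor_id, r.time, r.temperature, h.humidity
FROM billy.readings AS r
ASOF INNER JOIN billy.humidity AS h USING (sensor_id, time)
WHERE (r.sensor_id = 12345) AND (r.date = '2019-08-01')
```

The last column in the USING clause is the one used for the closest match; the others must match exactly.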

#5. LIMIT BY

The last example uses another feature unique to ClickHouse — LIMIT BY. LIMIT n BY returns up to n rows for each distinct value of the specified columns. In our case, we only need one row per sensor_id:

SELECT *
FROM readings
WHERE (sensor_id = 12345) AND (date <= '2019-08-01')
ORDER BY 
    sensor_id ASC, 
    time DESC
LIMIT 1 BY sensor_id
┌─sensor_id─┬────────────────time─┬─temperature─┐
│     12345 │ 2019-08-01 23:59:00 │       40.38 │
└───────────┴─────────────────────┴─────────────┘
1 rows in set. Elapsed: 0.095 sec. Processed 991.23 thousand rows, 11.89 MB (10.39 million rows/s., 124.63 MB/s.)

It is the slowest of all the options for this use case, but it can do something the others cannot. For example, one can use LIMIT 3 BY and return the last three measurements for a sensor. This is difficult to do with the other approaches.

SELECT *
FROM readings
WHERE (sensor_id = 12345) AND (date <= '2019-08-01')
ORDER BY 
    sensor_id ASC, 
    time DESC
LIMIT 3 BY sensor_id
┌─sensor_id─┬────────────────time─┬─temperature─┐
│     12345 │ 2019-08-01 23:59:00 │       40.38 │
│     12345 │ 2019-08-01 23:58:00 │       40.44 │
│     12345 │ 2019-08-01 23:57:00 │       40.28 │
└───────────┴─────────────────────┴─────────────┘
3 rows in set. Elapsed: 0.095 sec. Processed 991.23 thousand rows, 11.89 MB (10.47 million rows/s., 125.62 MB/s.)

Conclusion

In this short article we demonstrated different approaches to as-of queries, which are typical for time series applications. The fastest, when there is a good matching index, is #2, the IN approach, while #3, with the argMax aggregate function, is more generic and can be combined with aggregating materialized views. The other approaches are more exotic but can be efficient when the as-of query is extended with other requirements, such as stitching two time series together with ASOF JOIN or returning several as-of measurements instead of a single one with LIMIT BY.

While we did not focus on performance here, note that the 500-billion-row dataset was used for all queries in this article, and ClickHouse consistently delivered response times in tens of milliseconds.

ClickHouse is fun. Stay tuned.
