Real-time Moving Average with ClickHouse Live Views

By Vitaliy Zakaznikov on March 9th, 2020

In data analytics, analysts often use moving averages. Moving averages help smooth data series and identify long-term trends. The new Live View tables bring real-time capabilities to ClickHouse.

One of the applications of Live View tables is the calculation of real-time metrics on event data. Readings from IoT sensors, price ticks from a stock exchange, or metrics from your production servers are all examples of event data streams. ClickHouse can store all this data with a good compression ratio and excellent analytical query performance.

In this article, we will show an example of how we can use Live Views to calculate simple real-time statistics on a stream of events that we could continuously load into ClickHouse.

Setup

The ability to use Live View tables to calculate real-time statistics on a stream of events became available in the stable version 20.1.2.4. You can just upgrade your local installation, or if you don’t have ClickHouse already installed, please follow the instructions found in the ClickHouse documentation (https://clickhouse.yandex/docs/en/getting_started/install/).
Make sure you have this version or later if you want to follow along.

$ clickhouse-client
ClickHouse client version 20.1.2.4 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.1.2 revision 54431.

user-node :)

For the examples, we will use a table that contains only value and time columns to store a stream of data. In real-life streams, there will most likely be more columns, but we’ll keep it simple.

CREATE TABLE events(`value` Int32, `time` DateTime DEFAULT now()) 
ENGINE = Memory

Simple Average Value Over The Last N values

One of the simplest statistics that we can calculate from an events stream is the average over the last N values. A query to calculate an average over the last N values, taking N to be 10, could be written as:

SELECT sum(value)/10 FROM (SELECT * FROM events ORDER BY time DESC LIMIT 10)

The query is straightforward. First, we order the rows by descending time and then take the top 10 rows. After that, we calculate the average by summing all the values and dividing the result by 10.
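
The same calculation is easy to sketch outside of ClickHouse. Here is a minimal Python equivalent (the function name and sample data are illustrative): it sorts (value, time) pairs by descending time, keeps the ten most recent, and averages their values.

```python
from datetime import datetime, timedelta

def average_last_n(events, n=10):
    """Average of the values of the n most recent (value, time) events."""
    latest = sorted(events, key=lambda e: e[1], reverse=True)[:n]
    return sum(value for value, _ in latest) / n

# Ten events with values 1..10, one second apart, mirroring the INSERT below.
now = datetime.now()
events = [(v, now - timedelta(seconds=10 - v)) for v in range(1, 11)]
print(average_last_n(events))  # → 5.5
```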

Let’s pre-populate the events table with some values and make sure the query returns the correct result.

INSERT INTO events VALUES 
(1,now()-9),
(2,now()-8),
(3,now()-7),
(4,now()-6),
(5,now()-5),
(6,now()-4),
(7,now()-3),
(8,now()-2),
(9,now()-1),
(10,now())

If we execute the query, the result should be as follows.

SELECT sum(value) / 10
FROM 
(
    SELECT *
    FROM events
    ORDER BY time DESC
    LIMIT 10
)

┌─divide(sum(value), 10)─┐
│ 5.5                    │
└────────────────────────┘

Moving Average Over The Last N values

Having checked that our query to calculate an average value over the last ten rows returns the correct result, we can now create a Live View table to calculate the moving average. Let’s look at how we can create it, and then we will explain how it works.

CREATE LIVE VIEW ma10 AS
SELECT sum(value) / 10
FROM 
(
    SELECT *
    FROM 
    (
        SELECT *
        FROM events
        ORDER BY time DESC
        LIMIT 10
    )
    ORDER BY time DESC
    LIMIT 10
)

Give it a try and create the table. Then, try to query it to make sure we have the same result as before.

SELECT *
FROM ma10

┌─divide(sum(value), 10)─┐
│ 5.5                    │
└────────────────────────┘

You might have noticed that the query we used to create the Live View table is not the same as the one we used to calculate the average value. Instead of just one subquery, we now have two. Most surprising is that the subquery wrapping the original one seems redundant, as it orders rows by descending time again and applies the same limit of 10 again.

How Does It Work?

Having seen a seemingly weird definition of a Live View table to calculate a moving average, we are ready to dive into how Live View tables handle subqueries. It is not that complicated, so let’s walk through it.

Live View tables cache partial query results in memory and then combine the partial result with the new data to produce the final answer. The mergeable blocks store the partial result. These are the same blocks that are used by a distributed table to collect partial results from different remote servers. We call these blocks “mergeable” because we can merge them with new blocks that also contain partial (mergeable) results. In the case of a Live View table, we use the new mergeable blocks to store the latest data and combine them with the cached mergeable blocks obtained from the old data.

So the answer is simple. When a stored query contains a subquery, the mergeable blocks are obtained only for the innermost subquery.

For the example above, the innermost subquery is

SELECT * FROM events ORDER BY time DESC LIMIT 10

and the mergeable blocks contain partial results from this query.

How Does It Really Work?

Let’s take another look at the example above to understand the details. This understanding will help you apply Live View tables to real-world data. First, let’s review the information that we have so far:

  1. Live View tables store mergeable blocks in memory.
  2. They combine mergeable blocks obtained from the old and the new data to produce the final result.
  3. When the stored query, the query that was used to create the Live View table, has a subquery, the innermost subquery is used to obtain the mergeable blocks.
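
The three points above can be modeled in a few lines of Python (a toy sketch, not ClickHouse internals; all names are illustrative). The cache holds the ten most recent rows; each new block is reduced by the innermost query, merged with the cache, and trimmed again before the average is taken:

```python
class LiveViewMA10:
    """Toy model of the ma10 Live View over (value, time) rows."""

    def __init__(self, n=10):
        self.n = n
        self.mergeable = []  # cached partial result of the innermost query

    def on_block(self, block):
        # Innermost query: ORDER BY time DESC LIMIT n on the new block.
        partial = sorted(block, key=lambda r: r[1], reverse=True)[: self.n]
        # Outer subquery: merge old and new partial results, then
        # re-apply ORDER BY time DESC LIMIT n.
        self.mergeable = sorted(
            self.mergeable + partial, key=lambda r: r[1], reverse=True
        )[: self.n]
        # Outermost query: sum(value) / n.
        return sum(value for value, _ in self.mergeable) / self.n

lv = LiveViewMA10()
print(lv.on_block([(v, t) for t, v in enumerate(range(1, 11))]))  # → 5.5
print(lv.on_block([(1, 10)]))  # a new value 1 arrives → 5.5
print(lv.on_block([(1, 11)]))  # another value 1 → 5.4
```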

Here is the stored query again that generates a real-time moving average.

CREATE LIVE VIEW ma10 AS
SELECT sum(value) / 10
FROM 
(
    SELECT *
    FROM 
    (
        SELECT *
        FROM events
        ORDER BY time DESC
        LIMIT 10
    )
    ORDER BY time DESC
    LIMIT 10
)

We will break down how it works step by step.

Step 1

The Live View table first calculates mergeable blocks using the current data in the events table. The new data has not arrived yet, so the events table has all the data there is so far. Because the stored query has subqueries, the Live View table uses the innermost query to obtain mergeable blocks. Therefore, mergeable blocks contain a partial result of the query below.

SELECT * FROM events ORDER BY time DESC LIMIT 10

Step 2

There are no other mergeable blocks as no new data has arrived yet, so we have all the mergeable blocks we need. We obtain the final result by applying the outer query to the mergeable blocks, where the query is

CREATE LIVE VIEW ma10 AS
SELECT sum(value) / 10
FROM 
(
    SELECT *
    FROM 
    (
        ...
    )
    ORDER BY time DESC
    LIMIT 10
)

which first applies ORDER BY time DESC and LIMIT 10 and then calculates the average using the resulting rows.

Note that when there is no new data, the ORDER BY time DESC and LIMIT 10 in the outer query don’t do anything useful as the mergeable blocks already have the same ORDER BY and LIMIT clauses applied. They only come into play when new data arrives, and we combine mergeable blocks from old and new data. There is no new data, so we have the final result.

Step 3

Now a new INSERT comes in, and ClickHouse breaks it into one or more blocks of data. Live Views handle each block separately, so we can consider only one block, as the procedure simply repeats for the other blocks.

With a new block containing the latest data, the Live View has to combine the partial result from the old data with the data from the new block. We already know that the Live View table stores the partial result from the old data as the mergeable blocks, and it combines these blocks with other mergeable blocks. Therefore, the Live View table calculates mergeable blocks from the new block and then combines these mergeable blocks with what it already has.

The mergeable blocks from the old data have ten rows. Remember, it uses the innermost query to obtain them. Similarly, we use the innermost query to get the mergeable blocks from the new block, and it could also contain up to ten rows. When we combine these mergeable blocks from the old and new data, we could end up having up to twenty rows. But we only want ten, and this is the reason why we have another ORDER BY time DESC and LIMIT 10 in the outer subquery to filter out what is not needed!
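
This trim step can be illustrated with a few lines of Python (times modeled as integers): combining the ten cached rows with a one-row new block gives eleven rows, and re-applying ORDER BY time DESC with LIMIT 10 drops the oldest one.

```python
old = [(v, t) for t, v in zip(range(10), range(1, 11))]  # 10 cached rows
new = [(11, 10)]                                         # one newly arrived row

combined = old + new                                     # up to 20 rows in general
window = sorted(combined, key=lambda r: r[1], reverse=True)[:10]

print(len(window))                             # → 10
print(sum(value for value, _ in window) / 10)  # → 6.5
```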

Step 4

The final result is easy to obtain. Once we apply the ORDER BY time DESC and LIMIT 10 clauses to the mergeable blocks, we simply calculate the average as before and get the result that we want. Voilà, we have a real-time moving average, as the procedure just repeats for each new block of data.

The Real-time Stream of Moving Average Values

With the new understanding at hand, we are ready to see our real-time moving average in action. Open another ClickHouse client. We will use it to insert data and execute the WATCH query in the current client to watch the real-time moving average.

client1 :) WATCH ma10

WATCH ma10

┌─divide(sum(value), 10)─┬─_version─┐
│ 5.5                    │ 1        │
└────────────────────────┴──────────┘

On the other client, insert value 1 into the events table.

client2 :) INSERT INTO events (value) VALUES (1)

Observe that the moving average reported by the WATCH query did not change, because the new result, sum([2, 3, 4, 5, 6, 7, 8, 9, 10, 1])/10, is still 5.5. Insert another value of 1, and now you should see

┌─divide(sum(value), 10)─┬─_version─┐
│ 5.4                    │ 2        │
└────────────────────────┴──────────┘

which is the result of sum([3, 4, 5, 6, 7, 8, 9, 10, 1, 1])/10 and is equal to 5.4. If you insert another 1, then you should see the result change to

┌─divide(sum(value), 10)─┬─_version─┐
│ 5.2                    │ 3        │
└────────────────────────┴──────────┘

which is the result of sum([4, 5, 6, 7, 8, 9, 10, 1, 1, 1])/10.
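
These WATCH results are easy to double-check by replaying the inserts in Python (time modeled implicitly by append order; the repeated 5.5 is why the first insert did not change the WATCH output):

```python
window = list(range(1, 11))            # initial values 1..10
results = [sum(window[-10:]) / 10]     # initial result

for _ in range(3):                     # three INSERTs of value 1
    window.append(1)
    results.append(sum(window[-10:]) / 10)

print(results)  # → [5.5, 5.5, 5.4, 5.2]
```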

Add more data, watch your moving average update, and confirm that our real-time moving average using a Live View table does work.

Conclusion

In this article, we have looked at how we can use subqueries in Live View tables to calculate a real-time moving average. You can use a similar technique to calculate other real-time metrics.

The construction of the stored query is not very intuitive, and therefore we dove into the details of how subqueries work with Live View tables. Using this knowledge, we worked through the moving average example step by step.

Now you should be ready to apply this new capability to your data and enjoy how ClickHouse not only helps you explore your data but also makes it come to life. Don’t forget that at Altinity, we are always happy to help you adapt ClickHouse to your needs and push its capabilities to the limits.
