Using Live View Tables with a Real Dataset

By Vitaliy Zakaznikov on November 14th, 2019


My previous blog article introduced Live View tables, a new feature in ClickHouse that enables users to subscribe to query results and get new results as the underlying input changes.

In this article, we will continue to explore Live View tables, but this time using a real dataset. Specifically, we will see how Live View tables can be used as a cache to speed up your most time-consuming and frequently executed queries, as well as how a Live View table performs when new data is loaded into the source table. To make it easier, we will use the same OnTime dataset (USA civil flight data from 1987 to 2015) as Yandex's ClickHouse tutorial, which you can find at https://clickhouse.yandex/tutorial.html.

Setting things up

As in the previous article, I will be using the 19.16.3 stable release.

$ clickhouse-client
ClickHouse client version 19.16.3.6 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.16.3 revision 54427.

Ubuntu-1804-bionic-64-minimal :)

To prepare things, we will take a shortcut and download prepared partitions, avoiding the need to manually create a table, download the dataset, and load it into ClickHouse. The dataset is about 15.5GB, so the download can take some time depending on your location and internet connection.

$ curl -O https://clickhouse-datasets.s3.yandex.net/ontime/partitions/ontime.tar
$ tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
$ # check permissions of unpacked data, fix if required
$ sudo service clickhouse-server restart
$ clickhouse-client --query "select count(*) from datasets.ontime"
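The "check permissions" comment above usually means making sure the unpacked files are owned by the user the ClickHouse server runs as. On a default package installation that user is typically clickhouse, so a fix might look like the following (adjust the path and user for your setup):

```shell
# Make the unpacked partitions readable by the server process.
# Assumes a default package install where the server runs as `clickhouse`.
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse
```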

If everything went well above, then the last command should show the following.

$ clickhouse-client --query "select count(*) from datasets.ontime"
183953732

If you get the same result, you know you are ready to proceed.

Preparing Source Table for The Live View

Because we need to simulate live data, we will create another table, called ontime_live, into which we will add data from the original ontime table one chunk at a time.

Let's create the ontime_live table and pre-populate it with initial data consisting of all flights up to and including the year 2000.

CREATE TABLE ontime_live
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192) AS
SELECT *
FROM datasets.ontime
WHERE Year <= 2000

Ok.

0 rows in set. Elapsed: 99.031 sec. Processed 69.93 million rows, 50.64 GB (706.17 thousand rows/s., 511.38 MB/s.) 

For me it took about 100 sec. Now we are ready to start playing with Live Views again.

Live Views for Caching Results of Most Important Queries

Live Views store query results in memory. Therefore, one use case is to cache the results of your most time-consuming and frequently executed queries.

Let’s create a Live View for the most popular cities of departure using the original dataset table.

CREATE LIVE VIEW lv AS SELECT OriginCityName, count(*) AS flights FROM datasets.ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 10

Let's check the top 10.

SELECT *
FROM lv

┌─OriginCityName────────┬──flights─┐
│ Chicago, IL           │ 11572746 │
│ Atlanta, GA           │  9944238 │
│ Dallas/Fort Worth, TX │  8212951 │
│ Houston, TX           │  6282330 │
│ Los Angeles, CA       │  6187921 │
│ New York, NY          │  5716750 │
│ Denver, CO            │  5583983 │
│ Phoenix, AZ           │  5188429 │
│ Washington, DC        │  4555821 │
│ San Francisco, CA     │  4319419 │
└───────────────────────┴──────────┘

10 rows in set. Elapsed: 1.117 sec. 

We can see that the query took 1.117 sec to execute. If you execute the same query on the original table, it will also take about 1 sec. Let's go ahead and confirm.

SELECT OriginCityName, count(*) AS flights FROM datasets.ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 10
...
10 rows in set. Elapsed: 1.116 sec. Processed 183.95 million rows, 4.08 GB (164.85 million rows/s., 3.65 GB/s.)

But what happens if we try to query our Live View again? Let’s see.

SELECT * FROM lv

┌─OriginCityName────────┬──flights─┐
│ Chicago, IL           │ 11572746 │
│ Atlanta, GA           │  9944238 │
│ Dallas/Fort Worth, TX │  8212951 │
│ Houston, TX           │  6282330 │
│ Los Angeles, CA       │  6187921 │
│ New York, NY          │  5716750 │
│ Denver, CO            │  5583983 │
│ Phoenix, AZ           │  5188429 │
│ Washington, DC        │  4555821 │
│ San Francisco, CA     │  4319419 │
└───────────────────────┴──────────┘

10 rows in set. Elapsed: 0.003 sec. 

Now it only takes 0.003 sec. That is 372 times faster! You can try it again, and the elapsed time should be about the same. Why? Because Live Views cache query results in memory, so after the initial request the result is returned in a matter of milliseconds. If you have hundreds of clients sending the same query to your cluster, this alone can provide a significant performance boost and reduce your CPU and IO utilization, provided the query result fits into memory.
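Note that in 19.16 Live View tables are an experimental feature, so they must be enabled explicitly per session. There is also a temporary variant that can act as a self-expiring cache entry; the exact syntax may differ between releases, so treat the following as a sketch:

```sql
-- Live Views are experimental in 19.16 and must be enabled per session.
SET allow_experimental_live_view = 1

-- A self-expiring cached query: a TEMPORARY Live View is dropped
-- automatically once no client has watched it for
-- `temporary_live_view_timeout` seconds.
CREATE TEMPORARY LIVE VIEW lv_cache AS
SELECT OriginCityName, count(*) AS flights
FROM datasets.ontime
GROUP BY OriginCityName
ORDER BY flights DESC
LIMIT 10
```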

Creating and Watching Live Views

Top 10 Most Popular Cities Of Departure

Now let's see the power of Live Views by watching how the results for the top 10 most popular cities of departure change when new data arrives.

We will drop the old Live View table

DROP TABLE lv

and recreate it but now using ontime_live table as the source

CREATE LIVE VIEW lv AS SELECT OriginCityName, count(*) AS flights FROM ontime_live GROUP BY OriginCityName ORDER BY flights DESC LIMIT 10

Remember that right now our ontime_live table contains all the data up to and including the year 2000. Let's execute the WATCH query to see the current top 10.

WATCH lv

┌─OriginCityName────────┬─flights─┬─_version─┐
│ Chicago, IL           │ 4260533 │        1 │
│ Dallas/Fort Worth, TX │ 3252765 │        1 │
│ Atlanta, GA           │ 3160526 │        1 │
│ Los Angeles, CA       │ 2307178 │        1 │
│ Houston, TX           │ 2178934 │        1 │
│ St. Louis, MO         │ 1994253 │        1 │
│ Phoenix, AZ           │ 1992657 │        1 │
│ New York, NY          │ 1947757 │        1 │
│ Denver, CO            │ 1872315 │        1 │
│ Detroit, MI           │ 1825804 │        1 │
└───────────────────────┴─────────┴──────────┘

While the WATCH query is running in the current client, open a new ClickHouse client in a separate window. Let's now load the data for the year 2001 and see how the Live View performs.

INSERT INTO ontime_live SELECT *
FROM datasets.ontime
WHERE (Year > 2000) AND (Year <= 2001)

Ok.

0 rows in set. Elapsed: 9.483 sec. Processed 6.64 million rows, 4.83 GB (699.91 thousand rows/s., 509.01 MB/s.) 

The output in the client that has the WATCH query running should start to change.

WATCH lv

┌─OriginCityName────────┬─flights─┬─_version─┐
│ Chicago, IL           │ 4260533 │        1 │
│ Dallas/Fort Worth, TX │ 3252765 │        1 │
│ Atlanta, GA           │ 3160526 │        1 │
│ Los Angeles, CA       │ 2307178 │        1 │
│ Houston, TX           │ 2178934 │        1 │
│ St. Louis, MO         │ 1994253 │        1 │
│ Phoenix, AZ           │ 1992657 │        1 │
│ New York, NY          │ 1947757 │        1 │
│ Denver, CO            │ 1872315 │        1 │
│ Detroit, MI           │ 1825804 │        1 │
└───────────────────────┴─────────┴──────────┘
...
┌─OriginCityName────────┬─flights─┬─_version─┐
│ Chicago, IL           │ 4652244 │       17 │
│ Dallas/Fort Worth, TX │ 3564801 │       17 │
│ Atlanta, GA           │ 3412197 │       17 │
│ Los Angeles, CA       │ 2532162 │       17 │
│ Houston, TX           │ 2366931 │       17 │
│ Phoenix, AZ           │ 2176980 │       17 │
│ St. Louis, MO         │ 2156440 │       17 │
│ New York, NY          │ 2125684 │       17 │
│ Denver, CO            │ 2004906 │       17 │
│ Detroit, MI           │ 1974571 │       17 │
└───────────────────────┴─────────┴──────────┘

As the new data is loaded into the ontime_live table, the results of the WATCH query change. Personally, every time I see the WATCH query making my data come to life, I smile, and I hope you can now do the same.
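The WATCH query also has a couple of variants worth knowing about: WATCH ... EVENTS returns only the version number each time the result changes, and WATCH ... LIMIT stops after the given number of updates, which is convenient in scripts. Check the documentation for your release, but the usage looks like this:

```sql
-- Receive only the `_version` column when the result changes.
WATCH lv EVENTS

-- Stop watching after two result updates.
WATCH lv LIMIT 2
```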

Go ahead and load data for 2002. Let’s see the results.

INSERT INTO ontime_live SELECT *
FROM datasets.ontime
WHERE (Year > 2001) AND (Year <= 2002)

Ok.

0 rows in set. Elapsed: 8.238 sec. Processed 6.00 million rows, 4.36 GB (728.61 thousand rows/s., 529.65 MB/s.) 

The last result version should be as shown below.

┌─OriginCityName────────┬─flights─┬─_version─┐
│ Chicago, IL           │ 5032077 │       32 │
│ Dallas/Fort Worth, TX │ 3846100 │       32 │
│ Atlanta, GA           │ 3644125 │       32 │
│ Los Angeles, CA       │ 2714563 │       32 │
│ Houston, TX           │ 2541844 │       32 │
│ Phoenix, AZ           │ 2350922 │       32 │
│ St. Louis, MO         │ 2289005 │       32 │
│ New York, NY          │ 2269666 │       32 │
│ Denver, CO            │ 2125180 │       32 │
│ Detroit, MI           │ 2115252 │       32 │
└───────────────────────┴─────────┴──────────┘

As a sanity check, execute the same query on the ontime_live table itself to see if the results match.

SELECT 
    OriginCityName, 
    count(*) AS flights
FROM ontime_live
GROUP BY OriginCityName
ORDER BY flights DESC
LIMIT 10

┌─OriginCityName────────┬─flights─┐
│ Chicago, IL           │ 5032077 │
│ Dallas/Fort Worth, TX │ 3846100 │
│ Atlanta, GA           │ 3644125 │
│ Los Angeles, CA       │ 2714563 │
│ Houston, TX           │ 2541844 │
│ Phoenix, AZ           │ 2350922 │
│ St. Louis, MO         │ 2289005 │
│ New York, NY          │ 2269666 │
│ Denver, CO            │ 2125180 │
│ Detroit, MI           │ 2115252 │
└───────────────────────┴─────────┘
10 rows in set. Elapsed: 0.519 sec. Processed 81.17 million rows, 1.80 GB (156.54 million rows/s., 3.48 GB/s.) 

They sure do. Now let's query the original datasets.ontime table to triple-check.

SELECT 
    OriginCityName, 
    count(*) AS flights
FROM datasets.ontime
WHERE Year <= 2002
GROUP BY OriginCityName
ORDER BY flights DESC
LIMIT 10

┌─OriginCityName────────┬─flights─┐
│ Chicago, IL           │ 5032077 │
│ Dallas/Fort Worth, TX │ 3846100 │
│ Atlanta, GA           │ 3644125 │
│ Los Angeles, CA       │ 2714563 │
│ Houston, TX           │ 2541844 │
│ Phoenix, AZ           │ 2350922 │
│ St. Louis, MO         │ 2289005 │
│ New York, NY          │ 2269666 │
│ Denver, CO            │ 2125180 │
│ Detroit, MI           │ 2115252 │
└───────────────────────┴─────────┘
10 rows in set. Elapsed: 0.590 sec. Processed 81.17 million rows, 1.97 GB (137.55 million rows/s., 3.33 GB/s.) 

The results still match. As the author of the Live View feature, I smile again.
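If you prefer a single-query sanity check instead of comparing the two result sets by eye, you can compare the row counts directly. This sketch uses scalar subqueries and should return 1 when the tables agree:

```sql
-- Should return 1: the copy and the original (restricted to the same
-- years) must contain the same number of rows.
SELECT
    (SELECT count() FROM ontime_live) =
    (SELECT count() FROM datasets.ontime WHERE Year <= 2002) AS counts_match
```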

Top 10 Most Popular Destinations for 2003

Let’s now look at the top 10 most popular destinations in 2003. First delete the old view

DROP TABLE lv

and let’s create a new Live View as follows.

CREATE LIVE VIEW lv AS
SELECT 
    OriginCityName, 
    DestCityName, 
    count(*) AS flights, 
    bar(flights, 0, 20000, 40)
FROM ontime_live
WHERE Year = 2003
GROUP BY 
    OriginCityName, 
    DestCityName
ORDER BY flights DESC
LIMIT 10

Start the WATCH query in this client while loading the data for 2003 in another, using the query below.

INSERT INTO ontime_live SELECT * FROM datasets.ontime WHERE Year > 2002 AND Year <= 2003

You should see the results of the WATCH query change until all the data is loaded. The final result should be as below.

WATCH lv
...
┌─OriginCityName──┬─DestCityName────┬─flights─┬─bar(count(), 0, 20000, 40)───────────────┬─_version─┐
│ New York, NY    │ Washington, DC  │   20132 │ ████████████████████████████████████████ │       19 │
│ Washington, DC  │ New York, NY    │   20088 │ ████████████████████████████████████████ │       19 │
│ Boston, MA      │ New York, NY    │   17105 │ ██████████████████████████████████▏     │       19 │
│ New York, NY    │ Boston, MA      │   17007 │ ██████████████████████████████████       │       19 │
│ Chicago, IL     │ Minneapolis, MN │   16745 │ █████████████████████████████████▍      │       19 │
│ Minneapolis, MN │ Chicago, IL     │   16528 │ █████████████████████████████████        │       19 │
│ San Diego, CA   │ Los Angeles, CA │   15475 │ ██████████████████████████████▊         │       19 │
│ Los Angeles, CA │ San Diego, CA   │   15462 │ ██████████████████████████████▊         │       19 │
│ Houston, TX     │ Dallas, TX      │   14838 │ █████████████████████████████▋          │       19 │
│ New York, NY    │ Chicago, IL     │   14199 │ ████████████████████████████▍           │       19 │
└─────────────────┴─────────────────┴─────────┴──────────────────────────────────────────┴──────────┘

Let’s again double check the result by running the same query directly on the ontime_live table.

SELECT 
    OriginCityName, 
    DestCityName, 
    count(*) AS flights, 
    bar(flights, 0, 20000, 40)
FROM ontime_live
WHERE Year = 2003
GROUP BY 
    OriginCityName, 
    DestCityName
ORDER BY flights DESC
LIMIT 10

┌─OriginCityName──┬─DestCityName────┬─flights─┬─bar(count(), 0, 20000, 40)───────────────┐
│ New York, NY    │ Washington, DC  │   20132 │ ████████████████████████████████████████ │
│ Washington, DC  │ New York, NY    │   20088 │ ████████████████████████████████████████ │
│ Boston, MA      │ New York, NY    │   17105 │ ██████████████████████████████████▏     │
│ New York, NY    │ Boston, MA      │   17007 │ ██████████████████████████████████       │
│ Chicago, IL     │ Minneapolis, MN │   16745 │ █████████████████████████████████▍      │
│ Minneapolis, MN │ Chicago, IL     │   16528 │ █████████████████████████████████        │
│ San Diego, CA   │ Los Angeles, CA │   15475 │ ██████████████████████████████▊         │
│ Los Angeles, CA │ San Diego, CA   │   15462 │ ██████████████████████████████▊         │
│ Houston, TX     │ Dallas, TX      │   14838 │ █████████████████████████████▋          │
│ New York, NY    │ Chicago, IL     │   14199 │ ████████████████████████████▍           │
└─────────────────┴─────────────────┴─────────┴──────────────────────────────────────────┘

10 rows in set. Elapsed: 0.130 sec. Processed 6.49 million rows, 301.73 MB (49.93 million rows/s., 2.32 GB/s.) 

Notice that the query is pretty fast and executes in 0.130 sec. However, remember that a SELECT on the Live View returns the same result even faster.

SELECT *
FROM lv

┌─OriginCityName──┬─DestCityName────┬─flights─┬─bar(count(), 0, 20000, 40)───────────────┐
│ New York, NY    │ Washington, DC  │   20132 │ ████████████████████████████████████████ │
│ Washington, DC  │ New York, NY    │   20088 │ ████████████████████████████████████████ │
│ Boston, MA      │ New York, NY    │   17105 │ ██████████████████████████████████▏     │
│ New York, NY    │ Boston, MA      │   17007 │ ██████████████████████████████████       │
│ Chicago, IL     │ Minneapolis, MN │   16745 │ █████████████████████████████████▍      │
│ Minneapolis, MN │ Chicago, IL     │   16528 │ █████████████████████████████████        │
│ San Diego, CA   │ Los Angeles, CA │   15475 │ ██████████████████████████████▊         │
│ Los Angeles, CA │ San Diego, CA   │   15462 │ ██████████████████████████████▊         │
│ Houston, TX     │ Dallas, TX      │   14838 │ █████████████████████████████▋          │
│ New York, NY    │ Chicago, IL     │   14199 │ ████████████████████████████▍           │
└─────────────────┴─────────────────┴─────────┴──────────────────────────────────────────┘

10 rows in set. Elapsed: 0.003 sec. 

Top 10 Cities of Departure Which Offer Maximum Variety of Destinations

Similarly, we can create a Live View to see the top 10 cities of departure which offer the maximum variety of destinations. Let’s drop our table

DROP TABLE lv

and create a new Live View as shown below.

CREATE LIVE VIEW lv AS
SELECT 
    OriginCityName, 
    uniq(Dest) AS u
FROM ontime_live
GROUP BY OriginCityName
ORDER BY u DESC
LIMIT 10

Start the WATCH query and then load the data for 2004.

INSERT INTO ontime_live SELECT *
FROM datasets.ontime
WHERE (Year > 2003) AND (Year <= 2004)

The last result of the WATCH query should be:

WATCH lv
...
┌─OriginCityName────────┬───u─┬─_version─┐
│ Atlanta, GA           │ 156 │       14 │
│ Chicago, IL           │ 155 │       14 │
│ Dallas/Fort Worth, TX │ 153 │       14 │
│ Cincinnati, OH        │ 137 │       14 │
│ Houston, TX           │ 132 │       14 │
│ Minneapolis, MN       │ 114 │       14 │
│ Newark, NJ            │ 112 │       14 │
│ Denver, CO            │ 112 │       14 │
│ New York, NY          │ 108 │       14 │
│ Pittsburgh, PA        │ 107 │       14 │
└───────────────────────┴─────┴──────────┘

Let's again run the same query manually to compare the results.

SELECT 
    OriginCityName, 
    uniq(Dest) AS u
FROM ontime_live
GROUP BY OriginCityName
ORDER BY u DESC
LIMIT 10

┌─OriginCityName────────┬───u─┐
│ Atlanta, GA           │ 156 │
│ Chicago, IL           │ 155 │
│ Dallas/Fort Worth, TX │ 153 │
│ Cincinnati, OH        │ 137 │
│ Houston, TX           │ 132 │
│ Minneapolis, MN       │ 114 │
│ Newark, NJ            │ 112 │
│ Denver, CO            │ 112 │
│ New York, NY          │ 108 │
│ Pittsburgh, PA        │ 107 │
└───────────────────────┴─────┘

10 rows in set. Elapsed: 0.948 sec. Processed 94.79 million rows, 2.58 GB (99.97 million rows/s., 2.72 GB/s.)

We have seen a couple of example queries that we can turn into a Live View. If you would like to try more, go to https://clickhouse.yandex/tutorial.html for a list of queries that you can try yourself using the same approach as above.
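If you want to keep a WATCH query busy while you experiment, one simple approach is to feed the remaining years into ontime_live one at a time from the shell. This is just a sketch; adjust the year range to wherever you stopped loading:

```shell
# Load the remaining years one at a time so the WATCH query
# has something to react to. Assumes data up to 2004 is loaded.
for year in $(seq 2005 2015); do
    clickhouse-client --query \
        "INSERT INTO ontime_live SELECT * FROM datasets.ontime WHERE Year = ${year}"
done
```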

Conclusion

In this article, we saw how Live View tables can be used to cache query results. Because Live View tables cache query results in memory, they can provide a simple caching mechanism for your most time-consuming and frequently executed queries. We also watched live results pushed to us in real time using the WATCH query on a real dataset and confirmed that the results matched the original data. So go ahead, keep playing around with Live Views and the OnTime dataset, and keep discovering the fun of using Live View tables for yourself.

As always, until next time, happy viewing using Live View tables! If you find problems or have ideas for new ways to extend Live Views, log an issue in the ClickHouse repository. I'm always open to useful ways to improve this handy feature.

  1. Does the Live View have an impact on the insertion rate into the source table, especially if insertions into the source table are intensive and continuous?

  2. Yes, Live Views do have an impact on the insertion rate into the source table. Similar to Materialized Views, the stored query associated with a Live View is evaluated on each insert block, and a new query result is computed based on the cached mergeable result blocks. Therefore, the larger the number of Live Views looking at the source table, the slower the insertion rate into the source table will be. The actual performance impact depends on the stored query and the insertion rate.
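One way to reduce this overhead is to insert data in fewer, larger blocks, since the stored query is re-evaluated per insert block rather than per row. A sketch using the min_insert_block_size_rows setting (the value is illustrative; tune it for your workload):

```sql
-- Squash inserted data into larger blocks so each Live View's stored
-- query is re-evaluated less often. The value is illustrative only.
SET min_insert_block_size_rows = 1048576

INSERT INTO ontime_live SELECT *
FROM datasets.ontime
WHERE Year = 2005
```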

