Size Matters: Best Practices for Trillion Row Datasets on ClickHouse

RECORDED: Wednesday, August 10, 2022
SPEAKERS: Robert Hodges & Altinity Support Engineering

ClickHouse is so fast that virtually any developer can get a sub-second response on tables running into billions of rows. It’s different once you reach data sizes in the hundreds of billions or trillions of rows. This webinar walks you through best practices for designing a schema, loading data, and running queries on very large datasets. Expert tricks like combining events in a single fact table, using aggregation to simulate joins, and using materialized views to “index” interesting events in large fact tables are all covered. We’ll even demonstrate the ideas on a trillion-row test data set. Want to scale your data? This webinar is the place to start.

Webinar Slides:

You can download a copy of the webinar slides here.

Webinar Transcript:

Hi everybody! This is Robert Hodges. I will be the presenter today at our webinar called Size Matters: Best Practices for Trillion Row Datasets on ClickHouse.

0:26 Housekeeping

Before I get started I’d like to just do a quick confirmation that the sound is good. Thank you very much to the person who answered. We can dive in. Let me tell you just a few things about this webinar — 1) it’s being recorded and I will send you links to the recording as well as the slides afterward, 2) we have time for questions, and there are a couple of ways to ask them — you can put them in the Q&A box or you can just type them into the chat. There should be some time at the end to cover questions.

1:35 Introduction

With that what I’d like to do is go ahead and dive in, and I’ll do some short introductions here. So I’m Robert Hodges. My day job is the CEO of Altinity. I’m a database person; I’ve been working on databases for almost 40 years starting with something called M204. I’ve worked with about 20 or more databases, some of them for many years and other ones like Db2 for one day. I just like databases. The experience that I’ll be describing in this webinar comes from all of Altinity. We have a number of very talented engineers; when we add up all their experience, it runs into centuries.

They have helped people build and operate very large datasets on ClickHouse, and in some cases, they’ve also operated these clusters themselves, so we have a lot of experience in this area. What this talk does is collect some of the nuggets from that which we’d like to share with you. A little bit about Altinity: we’re an enterprise provider for ClickHouse. We help people build real-time analytic applications on top of ClickHouse through 1) software, including Altinity.Cloud and things like the Altinity Kubernetes operator for ClickHouse (that’s the thing that enables ClickHouse to run cloud natively on Kubernetes), and 2) support. We basically provide everything that you need to get started and become productive quickly on ClickHouse and stay that way. All right, that’s enough marketing…

Introduction to ClickHouse

3:25 What is ClickHouse?

What I’d like to do is begin by covering some of the basics. It is possible that there are people on this call who have never heard of ClickHouse. ClickHouse is a SQL data warehouse. A data warehouse is a database that is designed to answer questions about large amounts of data very quickly.

3:58 ClickHouse Characteristics: Understands SQL and Portable

ClickHouse characteristics: It understands SQL, like most data warehouses these days, and it’s very portable. It runs on everything from bare metal to the cloud. It runs great in Kubernetes. ClickHouse has what’s called a shared-nothing architecture — this is a traditional data warehouse architecture that’s been around for many years. Basically, ClickHouse for the most part operates as a process with attached storage, and those processes are connected over a network, so if you’re putting them together in a cluster, all they need is a network connection between them.

4:35 ClickHouse Characteristics: Columnar Structure

Like most data warehouses, it stores data in columns which is much more efficient. I will dig into this a little bit, but basically, it means that when you’re doing selects, if you have 100 columns and you read three, you will only need to do I/O in those three columns and not the rest of the data.

4:54 ClickHouse Characteristics: Parallel and Vectorized Execution

ClickHouse also uses, like many data warehouses, both parallel and vectorized execution. Parallel in the sense that we’re very good at parallelizing a query across cores, across CPUs, and then across the nodes in a network that comprise a ClickHouse cluster. Vectorized execution means that we treat data as arrays and try to push them into the CPU as efficiently as possible — reading them directly through memory, ideally matching the structure of the L1/L2/L3 caches, and then, of course, where possible using SIMD instructions (single instruction, multiple data). These are all important characteristics of modern data warehouses.

5:42 ClickHouse Scaling

ClickHouse scales to petabytes, it’s open-source, and it’s extremely popular. ClickHouse is probably the largest analytic project now on GitHub. It’s bigger than Elasticsearch — it has more contributors and is growing very rapidly.

6:02 ClickHouse Demo

What I’d like to do is show you some quick demos before we get into this talk just to show you what ClickHouse can actually do. [Jump to 9:55 to see a demo on ClickHouse Materialized View.]

11:09 Some Definitions to Guide Our Webinar Discussion

So what we’re going to be looking at in the rest of this presentation is how to enable fast, cost-efficient end-user access to trillion row data sets. This is a key thing because a lot of times people will frame the problem as ‘I just need aggregates like daily aggregates or hourly numbers’. We’re actually going to be focused on enabling people to find the actual source data and then to ask questions directly on it.

When we say fast, we mean consistent sub-second response and we want it to scale linearly with resources. This is a really important point that means that as your data set gets larger, as you add more tenants, as you have longer time stretches of data, you can just add more resources and preserve your response time.

Additionally, we want to deliver the query results at costs that are low and predictable. With these large data sets, they’re often used for multi-tenant systems, and you can have sort of random serverless mystery billing where you don’t really know how long things are going to take to run, and you also don’t know how much they’re necessarily going to cost. That’s very difficult to use for many use cases. And then finally, trillion row data sets — there are many types of these: market TICK data, CDN telemetry. There are countless data sets that are very large and fit well into SQL data warehouses.

12:48 Why We Focus on Source Data Access

This picture sort of shows you why this is important. Let’s say that we have a question like we’re seeing temperature sensors fail intermittently but we don’t know why. What we’re going to do is we’re going to start going in and doing very quick explorative queries on this sensor data where we may look at particular sensors. We may look at particular regions. We may look at particular stretches of time to try and find patterns that enable us to zero in on the root cause. If we can do this effectively, we might find that there’s a bad firmware upgrade that affected certain devices in certain locations. BANG! We solved the problem.

That’s why you want this access to the data and you want to be able to do these queries very fast. By fast, we are referring to sub-second or better.

Principles of Large Datasets in ClickHouse

With ClickHouse, we as a community have a lot of experience now about how to build data sets that can solve these problems. Here are just some of the basic principles that we follow. 

14:07 Reduce The Queries to A Single Scan

When we’re either building these systems ourselves or advising people how to do it, the first one is you want to reduce the queries to a single scan. Why is that? If it’s a single scan, it’s linear. If you’re just jamming through some amount of data no matter how much it is, your response time is going to tend to be pretty predictable. It’s just a factor of how much data you are scanning.

14:34 Reduce I/O

Another basic principle is to make that data as small as possible so you reduce I/O. 

14:40 Parallelize The Query

The third thing you could do is parallelize the query so once we’ve got it as small as possible and we’ve got basically a linear process, we can now parallelize it. ClickHouse is very good at this.

14:53 Lean on Aggregation

The fourth thing we can do is we can lean on aggregation, and this is one of the secret powers of ClickHouse. We’ll show you how to use it in a way that, for example, allows you to avoid doing joins.

15:04 Index Information

And then finally, we’ll index information using ClickHouse materialized views. So I showed you an example in the demo of how I was pre-aggregating, but we could also use materialized views to find interesting events inside the dataset and we’ll show you an example of that. 

15:28 The Key: One Table to Rule Them All

The key here… in these very large datasets, the closer you get to having a big fact table, the easier it is to achieve your goals of fast, consistent, cost-efficient query. This [image] summarizes what we’re going to be looking for: you have an enormous table of unaggregated data (so we’re not pre-aggregating, it’s just the source data), and we’re going to do things like maximize applied CPU, minimize the I/O by making things smaller, parallelize the daylights out of it, and keep intermediate results in RAM, and this will make the queries run fast and predictably.

Now I want to stress that it’s fine to have joins. There’s no problem having joins in ClickHouse. Contrary to popular belief, ClickHouse actually does joins pretty efficiently — you just don’t want them to be big. So you want a fact table and then, ideally, much smaller tables alongside it.

ClickHouse is introducing support for large table joins, but they’re relatively slow, and they have a problem: in very large datasets, doing them effectively requires moving data. ClickHouse can’t do that as efficiently as some other databases, and even in databases that can, it’s relatively slow. This is basically the system that we’re focused on building, and there are many of these out there, and they have really outstanding results.

Basic Design for Trillion-Row Tables

17:07 ClickHouse Server Architecture

Let’s go ahead and dig into basic design. These are design principles that you would apply no matter how big your dataset is. Let’s have a quick look at the ClickHouse server architecture. Basically, ClickHouse has a three-stage process for handling queries. There’s Query Parser, Query Interpreter, and then Query Pipeline which actually executes them. ClickHouse can read data from a couple of different locations: it can use columnar data in block storage that could be either local or, in the case of the demo we have here, we’re using elastic block storage on Amazon which is accessed over SAN, and you can also read data out of object storage.

Block storage reads go through the OS page cache — once data are cached (i.e., you’ve seen the blocks before), the I/O is faster. Then we have a bunch of caches in memory: for example, we store the primary key indexes, scanned column blocks from storage, joined data (hash tables), and intermediate query results. We won’t talk about memory in this talk, but it’s definitely something to think about, and we have other webinars that cover it in detail.

18:40 Different Ways ClickHouse Achieves High Performance

ClickHouse has an amazing number of ways to achieve high performance both in this use case of large data sets but also many others: codecs, compression, tiered storage where we have different types of storage, and everything you can think of that will make a data warehouse run fast — there’s a pretty good chance that it is implemented somewhere in ClickHouse.

What we’re going to do here is focus on a couple of key areas which include things like compression, use of indexes, correct data types, and some basic parallelization. Let’s go ahead and just dig into that.

19:28 MergeTree Table Organization in ClickHouse

The first thing to think about when you load data is to understand the organization of the tables. There’s a family of table engines — access methods for data, if you will — called MergeTree. This is the standard table for large amounts of data. There’s also ReplicatedMergeTree and a number of other variants that we won’t go into here. The basic idea behind MergeTree is that it breaks up your data into parts that can then be accessed independently in a parallel fashion.

Within each part, there’s a very particular structure that includes a primary key index. It’s what we call a sparse index that has entries to allow you to access particular rows within the data. Then there are the columns — this is a column store, so instead of storing things in row format, for each column you’ll typically have a pair of files (it depends somewhat on what version you’re using, whether you have nulls, things like that). One of the files is basically a bunch of offsets into the actual data, which is stored in the other file. For example, if you’re looking for something that happened on a particular day and your index is correctly set up, it will enable you to very efficiently decide which blocks to read out of these columns.

ClickHouse also has skip indexes — minmax and bloom filter indexes are two examples [in the picture]. These are things that you can use to help ClickHouse decide whether it needs to read a particular block. We’ll show an example of that in a minute. Getting that structure on disk correct means you’re going to get a lot more efficient I/O.

21:47 Making An Experiment Table

When I started to put this data set together, I just started out with creating an unoptimized table, and because this is a sample, I used generated data. It’s actually generated within ClickHouse, but what I do is create a table that doesn’t contain all the data that I’m actually interested in but contains enough that I can begin to understand what it’s going to look like and what the size is going to be in storage. It has sub-optimal data types and it doesn’t use codecs; we’ll show examples of those in a minute. It doesn’t bother to partition the table: the PARTITION BY tuple() basically tells ClickHouse not to partition the data at all, and we likewise don’t bother about ordering — just store the rows in the order that they arrive.

This is a terrible table but the thing is that it will work no matter what your data looks like. You can get the data and start to play around with it and understand how to make it better.
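For reference, an unoptimized starting point of that kind might look something like the following sketch — the table and column names here are illustrative, not the exact slide DDL:

-- Hypothetical "quick start" table: generic types, no codecs,
-- no partitioning, and no meaningful sort order.
CREATE TABLE sensor_data_raw (
    sensor_id   Int64,
    time        DateTime,
    date        Date,
    temperature Float64
)
ENGINE = MergeTree
PARTITION BY tuple()   -- no partitioning at all
ORDER BY tuple();      -- rows are stored in arrival order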

23:00 A Better Experimental Table with A Lower I/O Cost

Here’s an example of where you might go with this data set and in fact, mirrors the process that I went through in setting up this sample. In this case, what I’m doing is optimizing the data type. 

I’m making them as small as possible by using ClickHouse codecs where possible. A codec is a function that squeezes the air out of the data before it gets handed over to compression. DoubleDelta is a great example of a codec: instead of storing the full Int32 value of a sensor ID, it stores not the change between successive values but the change of the change. So for integer values that are increasing slowly and in a monotonic fashion, this is an extremely efficient encoding.

Then I squeeze even more air out of it using ZSTD compression. ClickHouse can do LZ4, but ZSTD is particularly good (i.e., we’ll squeeze even more air out of this column).

Other things we can do: dates can be computed trivially from the time, so we use what’s called an ALIAS column — instead of storing the date, we just say ‘Hey, this is something you can compute.’

We use time-based partitioning. That’s very common in ClickHouse in large data sets. This will now cause our data to be broken up into nice blocks that are based on, in this case, a month. 

Finally, we have an ORDER BY which is designed to give us the optimal location for data when we’re doing queries. If you get this right, it not only makes your queries faster but also enables you to use less space, because good orderings tend to help with compression.
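Putting those pieces together, a sketch of this kind of optimized definition might look as follows; the specific types, codecs, and ORDER BY columns are illustrative rather than the exact slide DDL:

-- Illustrative optimized layout: smaller types, codecs feeding the
-- compressor, an ALIAS date column, monthly partitions, and a sort order
-- that matches common query patterns.
CREATE TABLE sensor_data (
    sensor_id   UInt32 CODEC(DoubleDelta, ZSTD(1)),
    time        DateTime CODEC(DoubleDelta, ZSTD(1)),
    date        Date ALIAS toDate(time),   -- computed, not stored
    temperature Float32 CODEC(ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (sensor_id, time);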

25:00 On-disk Table Size for Different Schemas

This is the first part of the ClickHouse schema design, and what we see is substantially different on-disk data sizes. Given that we’re going to store a trillion rows, we want to get one that’s pretty efficient. We have a baby bear, a mama bear, and a papa bear size. The best one, let’s say the papa bear size, is over on the right. That’s the ZSTD compression and that gives us the smallest amount of data. 

The ZSTD is a little bit more expensive, so in fact what I ended up doing was going with optimized data types and codecs, but using LZ4 compression. What I get here is 1.34 bytes per row, which is really great because it means that this sample dataset will fit on a single VM.
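If you want to check the bytes-per-row numbers on your own tables, a quick way is to query system.parts; here’s a minimal sketch (the table name is assumed):

-- Compressed size per row for one table, from system.parts.
SELECT
    table,
    sum(rows) AS rows,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
    sum(data_compressed_bytes) / sum(rows) AS bytes_per_row
FROM system.parts
WHERE active AND database = currentDatabase() AND table = 'sensor_data'
GROUP BY table;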

Getting the storage right and well-optimized is a really important step for any dataset, but when they’re big it’s particularly critical.

26:00 ClickHouse Single Node Query

We can also look at the effects of parallelization and understand whether we can add more CPUs. This helps us size VMs. This here [image] shows the basic steps that we go through when we’re processing a query with parallelization. The key thing to notice is we have that parse and plan step. Those are the first two steps. Then we actually run the query. Most of the time, it is going to be a vectorized scan, so we’re going to farm this out to some number of threads. They’re going to be jamming through storage as quickly as possible. They will hold intermediate results in memory and then at the end, they will merge those results, sort them if it’s required, and hand them back. So that’s the query model.

26:49 Exploring Linear Local CPU Scaling

If you get this right, the ClickHouse scaling is excellent. We can now test the scaling. What this is right here [on the screen] is a query on just a billion rows of this sample data. The query itself is not important, but what’s interesting is that over here we have a graph that shows the number of threads. We can set that with the max_threads parameter, a ClickHouse setting that controls the number of threads that will be applied, and what we can see here is that we actually get linear scaling up to 16 threads, and then it drops off a little bit at 32.

This is a log-linear graph and it enables us to see the scaling. So ClickHouse has excellent scaling if you put things into a single table like this and ask questions that can be parallelized well. These are basic things we do to get started. 
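To reproduce that kind of test yourself, you can run the same single-scan query with different max_threads values and compare timings; the query below is just a stand-in, not the one from the slide:

-- Any single-scan aggregation will do; vary max_threads and compare timings.
SELECT count(), avg(temperature)
FROM sensor_data
WHERE temperature > 20
SETTINGS max_threads = 16;   -- try 1, 2, 4, 8, 16, 32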

At this point, we have a general sense of how to store the data, how to query it, and the scaling that we can expect.

28:01 Ingesting Data into Large Tables

Let’s talk a little bit about ingesting data into large tables and show a couple of techniques that people use.

28:13 Pattern: Multiple Entities in a Single Table

One thing that’s really important is that as we’re thinking about the final schema — we talked about having a single table which means that if we have interesting data, we’re likely to have multiple entity types. And what we’re going to do is store them in a single table. This is all because we’re trying to avoid large table joins. They’re basically an anti-pattern. So here’s the schema of the data that we’re reading. These are different types of events that may be arriving related to a particular sensor ID. Let’s just go ahead and look at how we might store that. 

28:58 Many Apps Keep Entity Sources for Future Flexibility

One really interesting thing that many systems do is that they don’t just store the parsed data in columns the way I showed you a minute or two ago. They will include the source data as well. What’s the reason for this? Well, it’s simple. With these very large ClickHouse data sets, you may have event types where you don’t really know yet what questions you want to ask about them.

What you can do, if you have enough space, is just go ahead and store the JSON data. If you apply compression to it, you can squeeze most of the air out of it. In this particular case, the type of event you see right here compresses by about 96% with ZSTD. You still pay the cost of the additional columns, but they are very small, as we just saw. But even storing the unformatted source data is relatively efficient.

One thing you might be wondering about when you see this — because we’re actually going to store this as a string — is why don’t we use the new JSON data type in ClickHouse? The answer is a simple one. It’s still experimental, and for very large ClickHouse data sets, storing the strings has been done for years and works really well. But there’s another issue: the JSON data type still does not handle multiple entity types. When the JSON schema changes, it may not handle it properly, and there are cases where the data can get sort of munged together. So we don’t recommend doing that yet.

30:52 Schema for a Table Based on Multiple-entity JSON

We store as strings, and in fact, what you get is a schema that looks like the following. This is pretty close to our final schema, which is basically what you saw before. We’re going to introduce a discriminator column because we have multiple event types, so we can tell which kind of event we’re dealing with: a reading, a restart, or an error. There’s the temperature, because that’s the main thing we’re interested in, an error message if we get one, and then the JSON data here, which is just a string with a default. (There’s an error here [on the slide]: that column should also have a codec to apply ZSTD compression.) And then a reasonable ORDER BY on message type and a PARTITION BY on time.
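As a sketch, the multi-entity version of the table might look like this, with the discriminator column, the error message, and the raw JSON stored as a compressed string (adding the ZSTD codec the slide omits); the exact names and types are illustrative:

-- Illustrative multi-entity fact table with a discriminator column and
-- the raw JSON kept as a compressed string.
CREATE TABLE sensor_events (
    sensor_id   UInt32 CODEC(DoubleDelta, ZSTD(1)),
    time        DateTime CODEC(DoubleDelta, ZSTD(1)),
    msg_type    Enum8('reading' = 1, 'restart' = 2, 'error' = 3),
    temperature Float32 CODEC(ZSTD(1)),
    message     String CODEC(ZSTD(1)),
    json        String DEFAULT '' CODEC(ZSTD(3))   -- raw source event
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (msg_type, sensor_id, time);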

When you do this, you can start loading the data into ClickHouse.

31:48 Loading Raw Data into Large Systems

I’ll just very briefly say that there are many ways to load data. In this particular case, we have the sensor data coming in as JSON. You can read it into an ETL table — just a dummy table that uses the Null engine — and then, because you’re going to have to do some transformation on it to pull the data out and put it into columns, we use a materialized view for that. As the data is inserted into this initial table, it causes the materialized view to fire, which selects from that table and puts the data into the final destination. This is a very common pattern. We’ve covered this in other webinars around JSON that you can look up.
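A minimal sketch of that loading pattern, assuming the sensor_events table above and hypothetical JSON key names, might look like this:

-- Landing table: the Null engine keeps nothing itself; it only exists so
-- that materialized views can see the inserted rows.
CREATE TABLE sensor_events_etl (
    json String
)
ENGINE = Null;

-- The materialized view fires on every insert into the landing table,
-- extracts columns from the JSON, and writes rows to the destination.
CREATE MATERIALIZED VIEW sensor_events_mv TO sensor_events AS
SELECT
    JSONExtractUInt(json, 'sensor_id')  AS sensor_id,
    parseDateTimeBestEffort(JSONExtractString(json, 'time')) AS time,
    JSONExtractString(json, 'msg_type') AS msg_type,
    JSONExtractFloat(json, 'temperature') AS temperature,
    JSONExtractString(json, 'message')  AS message,
    json
FROM sensor_events_etl;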

32:45 ClickHouse Makes it Easy to Materialize Columns

So one of the things that I wanted to talk about is how you can get those columns out later. It turns out that even for large datasets, ClickHouse makes this very easy. At any time you want, you can alter a table and add columns. The operation is instant.

What we do here is pull out the firmware data and make it a column. We add the column if it doesn’t exist yet, call it firmware (it’s a string), and then we give it a default value. You can see this JSONExtract function that pulls it out of the JSON column — it’s looking for the value of the key firmware — and that completes instantly. At that point, any query you do can reference the column, but the data isn’t actually stored in it yet; ClickHouse will simply run this function to go ahead and compute it.

We can force the data to be materialized by using this ALTER TABLE. What this will do is rewrite that one column, and you will have the actual properly materialized data. 
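In code, those two steps look roughly like the following (column and key names assumed, and the exact extraction function on the slide may differ slightly):

-- Step 1: add the column with a DEFAULT expression that extracts the value
-- from the stored JSON. This is a metadata-only change and completes instantly.
ALTER TABLE sensor_events
    ADD COLUMN IF NOT EXISTS firmware String
    DEFAULT JSONExtractString(json, 'firmware');

-- Step 2 (optional): rewrite just that one column so the values are
-- physically stored instead of computed at query time.
ALTER TABLE sensor_events
    MATERIALIZE COLUMN firmware;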

33:58 You Can Also Index and Query JSON Directly

There’s one other thing that’s kind of interesting, and this is an advanced trick that I have seen used successfully: not only can you query the column, you can, of course, query the JSON data directly. You can even index it. Here’s an example! Take that JSON string column and add a bloom filter index on it. This is called a token bloom filter index, and it basically tokenizes the data in that JSON and then applies an index.

The settings here are tricky. The first setting is the size of the filter and the second one is the number of hash functions. You have to play around with this a little bit to get the tuning right. But once you do that, you can actually run queries very efficiently and directly on the JSON. Here’s an example for just counting matches for a particular firmware type: you see the query on the column, you see the query directly on the JSON, and the performance difference is actually not that bad.
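Here is a hedged sketch of what that looks like; the index parameters and the firmware value are placeholders you would need to tune and adapt:

-- Token bloom filter index on the raw JSON string.
-- Parameters: filter size in bytes, number of hash functions, random seed.
ALTER TABLE sensor_events
    ADD INDEX json_bf json TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4;

-- Build the index for data that is already loaded.
ALTER TABLE sensor_events MATERIALIZE INDEX json_bf;

-- Query the extracted column...
SELECT count() FROM sensor_events WHERE firmware = 'A52X9';

-- ...or query the JSON directly; the index lets ClickHouse skip granules
-- that cannot contain the token.
SELECT count() FROM sensor_events WHERE hasToken(json, 'A52X9');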

35:14 Results Are Good if You Have High Cardinality Values

This is a test that shows the effect of tuning the bloom filters (orange color). In the middle one [in the graph], we can see that the bloom filter actually performs pretty well — almost as well as the column. So this is a cool trick that you can use in large ClickHouse datasets, which allows you to do needle-in-the-haystack searches.

Unique ClickHouse Tricks for Large Datasets

Let’s get to the final part of the talk: unique ClickHouse tricks for large datasets. Many of these things, for example parallelization, are things that [other databases like] Snowflake do as well. But ClickHouse has some really interesting features that are particularly well suited for these very large data sets.

So returning to our basic question about how we make queries fast on these large datasets: the key is to create queries that work in a single scan without large table joins. You just want to have one linear scan and do it once. Let’s have a look then at some of the ways that we can do that.

So how would we actually do that? What tools do we have available?

36:35 Aggregation Runs in a Single Pass

The tool that we can lean on is aggregation in ClickHouse. Aggregation, as it turns out, has a bunch of really favorable performance properties. This is an example here… you can look at it! This is how averages are computed inside ClickHouse. First of all, there’s no need to move the data. You just scan the data once, keeping sums and counts as you go down each thread. The second thing is that the same algorithm parallelizes extremely well — not only across cores but across machines and even across time, in the sense that you can compute things up to a point, store the results, add more to them later, and still get a correct result. And the third thing is that these intermediate results are reusable.
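A small illustration of that reusability, assuming the sensor_events table from earlier: the -State and -Merge combinators let you store partial aggregation states and combine them later, which is exactly how averages decompose into sums and counts.

-- Compute partial average states per month, then merge them into one result.
SELECT avgMerge(partial_state) AS overall_avg_temperature
FROM
(
    SELECT avgState(temperature) AS partial_state
    FROM sensor_events
    GROUP BY toYYYYMM(time)
);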

Aggregation is a tremendously powerful feature of ClickHouse and in fact, it is the fundamental mechanism for getting information out in a hurry that’s useful to human beings. So how can we use it? 

If you want to read more about aggregation, check out our blogs: ClickHouse Aggregation Fun, Part 1: Internals and Handy Tools and ClickHouse Aggregation Fun, Part 2: Exploring and Fixing Performance.

37:43 What About Queries Over All Entities?

Well, how do we do a query over all entities? Normally, if you’re thinking in traditional SQL, you would end up doing a query on each of the entities and then you would join them together. In this case, what we’re looking for is monthly stats on a particular sensor. Like, what happened in January or March? How many readings? How many restarts? What were the minimum, average, and maximum values, etc.?

You’d probably have to use something like a join in many databases. ClickHouse has this incredibly cool feature called conditional aggregation — other databases have it too, but conditional aggregation in ClickHouse means something slightly different.

You have what’s called an If combinator, and basically what it means is we’re going to do a count if this condition is matched. So if it’s a reading, we’re going to count it in the readings column. If the message type or the event type is a restart, we’ll count it as a restart, and then we have some other things that we’re looking at. I could have put If combinators on them as well, but I just left them as they are.

So the results are slightly off, but the point is you can do this, and it’s incredibly fast. It’s a single scan. We don’t have to do a join. We don’t have to move data. It completely gets around this problem of combining data of different types.
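A sketch of that kind of conditional-aggregation query is below. It is not the exact slide query (for instance, it also puts -If combinators on the temperature aggregates, which the speaker notes he skipped), and the sensor ID is just an example:

-- Monthly stats for one sensor in a single scan, no joins:
-- the -If combinator lets each aggregate count only matching rows.
SELECT
    toStartOfMonth(time) AS month,
    countIf(msg_type = 'reading') AS readings,
    countIf(msg_type = 'restart') AS restarts,
    countIf(msg_type = 'error')   AS errors,
    minIf(temperature, msg_type = 'reading') AS min_temp,
    avgIf(temperature, msg_type = 'reading') AS avg_temp,
    maxIf(temperature, msg_type = 'reading') AS max_temp
FROM sensor_events
WHERE sensor_id = 236
GROUP BY month
ORDER BY month;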

39:20 What About Joins on Distributed Data? (Traditional Databases)

What’s really interesting, since we’re talking about joins, is that you can ask: ‘Can we do this more generally? Can we think about joins and how would we handle them?’ Here’s an example of a join that you might need to do, which is to join restarts with temperature readings. Let’s say that you’re interested in what happens after a device restarts — what do the temperature readings look like? What we’re going to do here is use a join key — that’s sensor_id — and then we’re basically going to filter it so that we look at it over a bounded period of time.

In a traditional database, this is what it would look like: you’d have the restart times, you’d have the temperature readings, you join on the sensor_id, you filter appropriately, and then you could get temperatures after restart. And, in fact, you could even compute this field uptime which is the amount of time that these sensors have been up since their last restart. You can do this because you’ve joined the data.

40:29 What About Joins on Distributed Data? (ClickHouse)

So let’s see how to do this in ClickHouse. First, we’ll start with a picture.

Because we’re going to use aggregation for this, instead of a JOIN key, we’re going to use the GROUP BY key which is your fundamental way of pulling information together. 

So we have the different message types (one is the restart and the second one is the reading), and what we’re going to do is group them. 

We group by the sensor_id — that’s our key — and we can pick up the restart time. And what’s really interesting is that ClickHouse can actually collect this information into arrays: the time that it did the reading, and the temperature that was associated with that reading. There’s a special aggregate function in ClickHouse that collects values into an array in the order that they arrive.

Then there is a feature in ClickHouse called ARRAY JOIN which will pivot the data along the values of these arrays so that we can basically reconstruct a table that looks like the following. We get the sensor_id. We see the time, we see the temperature, we see the uptime. This is all done using aggregation. 

If you want to read more about ClickHouse arrays, check out our blogs: Harnessing the Power of ClickHouse Arrays – Part 1, Harnessing the Power of ClickHouse Arrays – Part 2, and Harnessing the Power of ClickHouse Arrays – Part 3.

42:11 Here’s The Code

I think it’s fair to say that this code is not everybody’s cup of tea. You may look at this in horror. It is easy for me to say that I can do this — I actually teach a class on arrays. But the point is that ClickHouse allows you to do this. The code is complicated in this particular case, but it does work. 
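For readers following along in text form, here is a sketch in the spirit of that slide — not the exact code, and the table, column names, and time bounds are illustrative:

-- Simulate the join with aggregation: per sensor, pick up the restart time
-- and collect reading times and temperatures into arrays, then let
-- ARRAY JOIN pivot the arrays back into rows.
SELECT
    sensor_id,
    reading_time,
    temperature,
    reading_time - restart_time AS uptime_seconds
FROM
(
    SELECT
        sensor_id,
        maxIf(time, msg_type = 'restart') AS restart_time,
        groupArrayIf(time, msg_type = 'reading') AS reading_times,
        groupArrayIf(temperature, msg_type = 'reading') AS temps
    FROM sensor_events
    WHERE time BETWEEN '2022-08-01 00:00:00' AND '2022-08-02 00:00:00'
    GROUP BY sensor_id
)
ARRAY JOIN reading_times AS reading_time, temps AS temperature
WHERE reading_time >= restart_time;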

Let me just demonstrate that. [Jump to 42:46 for a quick demo.]

44:00 What About Locating Key Events in Tables?

Let’s look at one final thing which is indexing data. As we get these very large tables we can ask questions like ‘Hey, when was the last restart on sensor 236?’ What I just showed you is a traditional way of getting this in SQL databases. We have an IN clause with a sub-query. Some databases have real difficulty doing this. ClickHouse has a very neat feature where the IN clause can return multiple columns. But even with that, this type of query where you just on the fly ask ‘When was the last restart on sensor 236?’ can become very expensive to compute. If you do it often, it’s going to make things very slow. 
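As a sketch, the traditional query might look something like this (names illustrative):

-- Traditional approach: a subquery inside IN. ClickHouse allows the IN
-- clause to match on multiple columns at once.
SELECT sensor_id, time, message
FROM sensor_events
WHERE (sensor_id, time) IN
(
    SELECT sensor_id, max(time)
    FROM sensor_events
    WHERE msg_type = 'restart' AND sensor_id = 236
    GROUP BY sensor_id
);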

44:52 Finding the Last Restart is an Aggregation Task

Once again, this kind of problem of finding the last restart is an aggregation task. Basically, we can think of this as using the sensor_id as the GROUP BY key. We’re going to look for the maximum time in events of type restart. And we can capture as many matching values as we want; in this case, we’re just going to pull in the message type. This basically allows us to construct what’s called a last point index.
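As an ad-hoc query over the fact table, that last point lookup might be sketched like this (names illustrative):

-- Last restart per sensor as a plain aggregation:
-- GROUP BY the sensor, take the maximum time, and use argMax to pull in
-- whatever other columns you want from that same row.
SELECT
    sensor_id,
    max(time) AS last_restart,
    argMax(message, time) AS last_message
FROM sensor_events
WHERE msg_type = 'restart'
GROUP BY sensor_id;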

45:45 Use Materialized Views to “Index” Data

We can do this through ClickHouse materialized views. The way they work: a materialized view is like a trigger. We’re going to set one up so that when a block lands in the source table, it will automatically run a query like the following over it. Then it will stick the result in a materialized view target table. We’re going to use the AggregatingMergeTree engine — we’ve covered this in many other webinars and blog articles — and this will basically allow us to capture these last restart times automatically in a much smaller table. What does it look like?

If you want to read more about materialized views, check out our blogs and webinars on this topic: ClickHouse Materialized Views Illuminated, Part 1, ClickHouse Materialized Views Illuminated, Part 2, Webinar. ClickHouse Materialized Views. The Magic Continues, and Using Joins in ClickHouse Materialized Views.

46:16 Code for the Materialized View

This is the code for the materialized view. Here’s the last restart table — that’s the target table. There’s a materialized view that feeds that table, so every time new data hits our source table, it will get read in, and we’ll basically be able to get that answer much faster. Now, in this particular example, I’m just partitioning by tuple — that’s a kind of dubious choice, for a couple of reasons.
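Since the slide itself isn’t reproduced here, the following is a rough sketch of the pattern being described — an AggregatingMergeTree target table plus a materialized view that feeds it — with illustrative names:

-- Target table stores partial aggregation states per sensor.
CREATE TABLE last_restart (
    sensor_id    UInt32,
    restart_time AggregateFunction(max, DateTime),
    message      AggregateFunction(argMax, String, DateTime)
)
ENGINE = AggregatingMergeTree
PARTITION BY tuple()   -- the "dubious choice" mentioned above
ORDER BY sensor_id;

-- The materialized view fires on each insert into the source table.
CREATE MATERIALIZED VIEW last_restart_mv TO last_restart AS
SELECT
    sensor_id,
    maxState(time) AS restart_time,
    argMaxState(message, time) AS message
FROM sensor_events
WHERE msg_type = 'restart'
GROUP BY sensor_id;

-- Query the "index" with the matching -Merge combinators.
SELECT sensor_id, maxMerge(restart_time) AS last_restart
FROM last_restart
WHERE sensor_id = 236
GROUP BY sensor_id;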

As you saw when I was querying the materialized view (that was the third demo query), ClickHouse is still exceedingly fast. One of the big reasons for that is when you construct these materialized views, they’re vastly smaller.

47:11 Comparison of Source Table to Typical Materialized View

This is an example that shows the storage size. It compares the materialized view that we were using previously with the source table: the materialized view is only 3GB compared to over a terabyte of data in the source table, so it’s many times smaller. The other thing that’s kind of interesting is that it has a smaller number of parts, and this is an example of why not partitioning that materialized view correctly (the PARTITION BY tuple()) could bite you.

One of the reasons is that as the data is being added from the source table when blocks are inserted, it will then just get randomly thrown into different partitions and it causes ClickHouse to rewrite the data. It’s also somewhat less efficient on queries. It’s better if you partition things properly so the data is located in longer sections. But as you saw from that demo, it came back in 14 milliseconds. Sometimes it just doesn’t matter. ClickHouse is so fast. This is an example of where you can kind of get stuff wrong, but ClickHouse still gives good performance. 

48:22 Wrapping Up!

This is a really deep subject, and we’ve just scratched the surface. Let me talk about a few other basic learnings from this, as well as some things you might want to look at to build your own tables. First, using a single large table to hold all entities is not a bad thing. This is not a question of ‘Can we do joins?’ — it’s not a question of limitations in ClickHouse. The question is ‘How do we make this query behavior become linear?’

The way to do that is to put it in a single table. The second thing is to make good implementation choices. ClickHouse is a great database, but you have to do some work to make sure that your schema is correctly defined and that you’re using features like data types, codecs, compression, partitioning, ORDER BY, and that you’re using those most efficiently. If you want to be flexible and you have space, include the source data — many people do that. Then you can materialize out the columns or at least have it available to query if you change your mind.

Once you load ClickHouse data sets like this, it is really hard to change them. In this particular example, I generated the data within ClickHouse itself — it still took four hours to load a trillion rows. If you’re reading it from some other location, you can imagine just how hard it would be to update the data if you realized you missed something.

Finally, aggregation is the secret ClickHouse power. It is a little weird when you first use it but these additional things like the if combinators, the ability to simulate joins, the ability to use materialized views — these are features that are really the things that allow ClickHouse to be incredibly fast. If you use them, you’ll be happy. The reward is linear scaling, good cost efficiency, and of course, happy users because they can actually see raw data and get useful problems solved.

50:40 Other Techniques for Big Data

Some other things that you might think about — obviously sharding and replication. Some of you probably thought I was going to talk about that. But it just happens that with ClickHouse, if you set it up correctly there is no problem doing a trillion rows on a single VM.

Tiered storage and object storage. Probably the most interesting topic to think about is how to save memory. We didn’t touch on that because it’s a deep topic and very use case-specific. But one of the common things we do is that, as we’re building these large queries, since they’re holding results in memory, we will often use approximation techniques, including sampling as well as what are called approximate uniqs.

Uniqs come up when you’re doing things like counting the number of unique visitors to a website over a period of time, or the number of unique trades or unique GitHub contributors, or something like that. Those aggregates can be very expensive both in terms of processing time and memory. ClickHouse has lower-cost versions of them, and we constantly use them to make things run faster.
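For example, the approximate distinct-count functions are drop-in replacements for the exact one; a quick sketch:

-- Exact distinct counts are memory-hungry; the approximate versions trade
-- a small error for much lower CPU and memory cost.
SELECT
    uniqExact(sensor_id)    AS exact_sensors,    -- precise, expensive
    uniq(sensor_id)         AS approx_sensors,   -- approximate, much cheaper
    uniqCombined(sensor_id) AS approx_sensors_2  -- another approximate variant
FROM sensor_events;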

51:50 Additional Resources

There’s documentation for all of this. Obviously there’s the official ClickHouse documentation, which is your basic reference. We have a number of things like the Altinity Blog [and] YouTube channel.

Then we have the ClickHouse knowledge base. This is just a gold mine of raw data that comes out of our support. We have about 200+ customers we’re helping, and every time our support engineers find something interesting, they just toss it in there. I use it constantly, and in fact, a number of things in this talk came out of the knowledge base.

Then of course there are meetups, other people who have blogs, and external resources — if you search, you will find it. 

I hope this has been helpful, and I would be delighted to answer any questions that you have about this.

[End of Webinar & Q&A]
