Webinars

ClickHouse® Unleashed 2020: Our Favorite New Features for Your Analytical Applications

RECORDED: August 6, 2020
SPEAKERS: Alexander Zaitsev, CTO, Altinity & Robert Hodges, CEO, Altinity

In this webinar, the Altinity team reviews their favorite new ClickHouse features for analytic applications, spanning storage, security, and SQL. Robert Hodges (Altinity CEO) hosts, with CTO and co-founder Alexander Zaitsev, who has directed much of Altinity’s ClickHouse feature work, QA architect Vitaliy Zakaznikov, author of the TestFlows framework and the live views feature, and Kafka engine maintainer Mikhail Filimonov joining for the Q&A. The talk is organized into three areas, storage improvements, security, and a set of SQL and server features, with many other contributions from the community acknowledged along the way.

On storage, Robert shows how storage policies let you attach multiple disks organized into volumes, including JBOD configurations that multiply bandwidth on cloud block storage like Amazon EBS, and tiered storage that matches data to NVMe SSD, dense disks, and now object storage as it ages. He introduces S3 disk support, where parallelization is key to closing the gap with block storage, and Alexander explains compact parts and the new in-memory parts with a write-ahead log, which dramatically reduce file counts and speed up small inserts, plus TTL extensions for moving, conditionally deleting, and rolling up data with GROUP BY.

On security, Vitaliy walks through role-based access control with full SQL statements for users, roles, row policies, settings profiles, and quotas, and LDAP authentication that removes hard-coded passwords from users.xml, while Robert previews AES encryption functions as a first step toward transparent data encryption. Alexander then covers dictionary improvements, including CREATE DICTIONARY DDL, automatic join-to-dictGet rewriting, and new direct and ssd_cache layouts for large key-value sources like Redis, MongoDB, and Cassandra, along with the JSONAsString format for messy JSON logs and cluster features such as parallel distributed INSERT SELECT and the Replicated database engine. Robert closes by previewing a multi-year push toward separating storage from compute and improving SQL conformance, followed by a Q&A.

Key Moments (Timestamps)

Key moments generated with AI assistance.

  • 0:09 – Introduction and housekeeping
  • 1:25 – Speaker introductions
  • 2:48 – Storage policies, JBOD, and tiered storage
  • 9:18 – S3 object storage support
  • 12:23 – Compact parts and memory parts
  • 18:36 – TTL: move, delete WHERE, and GROUP BY
  • 21:13 – Security features and AES encryption
  • 22:51 – Role-based access control
  • 31:36 – LDAP authentication
  • 35:48 – Dictionaries: DDL, join rewrite, and new layouts
  • 42:13 – JSON processing, cluster features, and the road ahead
  • 52:54 – Q&A

Webinar Transcript

0:09 — Introduction and Housekeeping

Robert Hodges: Hello everybody, and welcome to our webinar, ClickHouse Unleashed 2020: we’ll be reviewing our favorite new features for your analytic applications, brought to you by ClickHouse and the ClickHouse community. Before we dive in, I’d like to bring a couple of things to your attention in case you’re a new attendee. This webinar will be recorded; we will send you a link with both the recording and the slides, so you don’t need to take frantic notes. We also have a number of people on the call able to answer questions, so if you have questions as we proceed, feel free to pop them into the question and answer box in the control panel, and we’ll answer some in the background, but we also have time at the end to discuss them. Finally, we will run a very short poll with three questions at the end; it takes about 15 seconds, and we’d appreciate it if you stick around and fill it out, as it helps us understand how we’re doing and what topics you’d like to see in the future. With that, I’d like to dive in.

1:25 — Speaker Introductions

Robert Hodges: On today’s webinar, we have Alexander Zaitsev, the Altinity CTO and also a founder of Altinity itself. He has decades of experience designing and operating petabyte-scale analytic systems, and Alexander has been directing almost all of the work at Altinity related to new ClickHouse features, so he’s definitely an authority on the subject. Second, it’s my great pleasure to introduce Vitaliy Zakaznikov, our QA architect. He has great experience testing hardware and software, and he’s the author of TestFlows, an open source framework for testing that was just added to the build system in ClickHouse, something we use for all our tests. Vitaliy is also the author of the live views feature, so he’s not just testing the interfaces, he’s building them as well. Finally, for me, I’m the Altinity CEO. I’ve been working for over 30 years on databases, and ClickHouse is database number 20. I think it’s fair to say that for the purposes of this talk, we all love databases.

What we’re going to do today is talk about three general areas: storage improvements, security, and then a couple of specific features at the level of SQL and the ClickHouse server itself. There are of course many more things coming in ClickHouse; over 500 people have contributed features or code of one kind or another, so it’s really hard to pick our favorites, and we’re happy to answer questions about the others as well. With that, let’s dive in and talk about storage improvements.

2:48 — Storage Policies, JBOD, and Tiered Storage

Robert Hodges: The way to start is just to say, what a difference a year makes. This time last year, if you were running ClickHouse, you were basically going to have a single mount point for your data, so a single device that would contain all the data for your tables on each ClickHouse node. That has changed enormously over the last 12 months. One of the big things that has enabled this is the introduction of storage policies, which allow us to attach additional disks to ClickHouse, additional mount points organized into volumes. This has a terrific impact on both performance and efficiency.

Let me talk about how these are configured. If you use ClickHouse, you know that ClickHouse developers love XML, so the storage configurations are described using a storage_configuration tag that splits things into disks, which are your mount points, your locations for data. There’s one called default, so that if you don’t make any changes, ClickHouse continues to work as before, and then you can add as many additional mount points as you like. The second part is policies, which let us combine these disks flexibly into volumes and specify how the volumes relate to each other. In this example I’m showing what we call JBOD, just a bunch of disks, where we combine four disks together to spread the data across each of them. The way you actually use these is through a new setting called storage_policy on CREATE TABLE; here’s an example with a table called trip_data, where you see SETTINGS storage_policy equals EBS_JBOD_4, which refers back to that XML configuration. That tells ClickHouse that when it writes data to or reads data from these tables, it needs to look at the storage policy to figure out which mount points things are located on.

When this first appeared, I was a little unimpressed; I thought, how useful could JBOD be? Well, fortunately there’s the Amazon cloud. If you use Amazon and are familiar with block storage, you may know there’s a limitation in the bandwidth available for each EBS volume, so particularly with small instances it’s difficult to scale up performance. One thing you can do with JBOD is attach multiple EBS volumes to your ClickHouse instances, and this graph shows the effect. These are hot and cold runs; the blue runs are cold, going directly to storage for queries, and we see one, two, and four disks, with a drastic improvement in performance particularly as you jump from one to two. Once you’re hot, meaning the data is in cache, it makes no difference because it’s already reading out of the OS page cache, so the hot runs are basically the same across one, two, and four disks. This is an example of scaling performance in a way that’s particularly useful for public clouds.

JBOD is just one type of storage policy. One of the big end goals of this work is to get to something that’s not just a bunch of disks you write to, but what we call tiered storage. The basic idea is that when you look at data, particularly time-ordered data, a very large number of your queries run on the most recent data, and as you go from days to weeks to months to years, the number of queries falls off almost exponentially. So tiered storage lets you match storage to the level of access you need: you can use NVMe SSD to host your most recent data, say the last day or last 12 hours, it then flows to high-density disks that are cheaper but store much more, and then finally, the holy grail for very old data and very large volumes, you can store on object storage. All of these are available right now, although object storage is still in development, and I’ll dig into it in a minute. We won’t go into the details of configuring tiered storage; we have blog articles that cover this nicely.

9:18 — S3 Object Storage Support

Robert Hodges: Let me jump in and talk about S3, the newest storage type we’re working on adding together with others in the community. The slide shows an example of the S3 disk configuration. If you’re an Amazon person, you know S3 is Amazon’s object storage, and it uses an API supported by many other products as well. Here’s how it’s configured: we call the type s3, we give the endpoint pointing to a location in an S3 bucket with a prefix, and then we give the access key ID and the secret key for Amazon, that’s it. At that point you’re able to have ClickHouse reading and writing directly from object storage instead of local storage.

Let’s look at the effects. One thing that’s interesting is that when you start to write to object storage, you quickly realize it’s very different from the normal block storage database engineers are used to. Metadata operations are very slow, inserts proceed slowly, and there’s a tremendous difference from block storage, where you can write very fast, particularly to SSDs, using single threads. With object storage, parallelization is really important. This graph shows the difference between a couple of runs: things are fairly slow when writing with single threads, and one of the things we’ve focused on with our recent S3 work is to enable multiple threads for parallel reading and writing to S3. So you see the difference between two threads and eight threads, where the blue is the reference implementation writing to block storage, in this case EBS, and the red is the S3 performance. As you go between these graphs, the S3 performance is starting to become, in some cases, very close to EBS. This is a really big breakthrough, because object storage can hold enormous amounts of data and has a much different cost profile from block storage; it can be as much as 10x cheaper for typical usage. This is something we expect will hit production readiness in the 20.6 release.

12:23 — Compact Parts and Memory Parts

Robert Hodges: There are more things coming related to storage. Another particularly interesting option is compact parts. If you’re familiar with the way ClickHouse MergeTree tables are organized, you know there is a .idx file, your sparse primary key index defining the rows, and then .bin and .mrk2 files for each column, where the .bin holds the compressed blocks and the .mrk2 holds the offsets into those blocks corresponding to the index granules. What compact parts do is store the same information but vastly reduce the number of files necessary, which turns out to have major performance impacts. To dig into this more deeply, I’d like to turn it over to Alexander.

Alexander Zaitsev: Thank you, Robert. Hello everybody. The idea of compact parts is reducing the number of files, but internally we still try to keep the same columnar structure, so data is still stored in columns inside the big compact part. There are mark files and binary files, and ClickHouse just knows the offsets of the proper file chunks. What this means is that it’s much faster for inserts, and the original design idea was to make inserts, especially small inserts, faster. If you’re experienced with ClickHouse, you know that when you insert data too fast and in small chunks, ClickHouse doesn’t perform well and you can get into some problems, so compact parts were focused on fixing that. On the other side, when we need to query data, it performs worse than the original columnar design, because it has to do file seeks and isn’t as efficient as reading separate files. What’s interesting is that when we were testing S3 performance, we figured out that using compact parts makes a lot of sense, because it vastly reduces the number of files and file operations, so with compact parts you can insert data much faster compared to the original wide parts, though the query side is still a concern and we’re still working on improving it.

This is an example of the file structure for wide parts and compact parts. For a table we used for tests, the on-time data set, we had 224 files in the wide part and only eight files in the compact part. This is a huge difference, and as I mentioned, it shines on inserts, especially small inserts. ClickHouse moves the data between different part formats, and quite recently we introduced memory parts, a third type of part which is data in memory with a write-ahead log to make things reliable. The idea is that if you have a very small insert, it goes to memory, then the merge process takes it and puts it directly into the wide part, or into an intermediate compact part if necessary, thus reducing the amount of file operations. With this design, for example, you don’t need to use buffer tables at all; you can use memory parts as a more convenient replacement.

Here are some performance figures. Unfortunately I couldn’t do a very good test, it’s kind of ugly, but it still shows the difference. There are three tables created to enable the different part types internally, using the min_rows_for_wide_part and min_rows_for_compact_part parameters, and I’m inserting a single row. If you look at the insert statement, there’s a setting, max_block_size, that forces ClickHouse to do inserts in single rows, so we emulate the amount of insert iterations per second; it’s not 100 percent true but it’s close enough. As you can see, once we go from a wide part to compact we have a performance improvement, probably not as dramatic as I would expect but still notable, and when we go to memory parts the performance improvement is something like 70 percent, which is even better. This is new functionality whose purpose is to enable frequent and small inserts, and I appreciate any feedback on it, so go and use it.

18:36 — TTL: Move, Delete WHERE, and GROUP BY

Alexander Zaitsev: Another interesting thing that already exists in ClickHouse is TTL. TTL started with delete, where ClickHouse could automatically delete parts or partitions that were old enough using time constraints, and we started to extend this functionality with move, which is how tiered storage works, so you can move data between volumes automatically with no manual intervention required. The community picked up this feature and continued to add interesting new flavors of TTL expressions. For example, in 20.5 we have TTL DELETE WHERE, which lets you specify an extra condition for the rows that need to be deleted by the TTL expression. Last but not least is a very interesting feature, TTL GROUP BY.

The idea of TTL GROUP BY is to perform automatic roll-up of data once the time condition has been reached. For example, you may have detailed data for the last day, then roll up to minutes, or after 30 days roll up to days, losing the minute granularity. To do that, you define a TTL expression; in this example I define a one-day TTL that groups data by three columns, and those three columns have to be the prefix of the primary key, and then it applies an aggregation function that you specify, which could be sum, count, or any aggregation function. So this is a cool feature that lets you reduce the accuracy of old data automatically without difficult and complicated coding.

21:13 — Security Features and AES Encryption

Robert Hodges: Thank you, Alexander, that was great. I’d like to switch gears and talk about security features. As ClickHouse has gotten more enterprise usage, outside the realm of the early adopters, people are asking for a number of features that help improve security and access to the data. There’s a long list here; I won’t go through it in too much detail, just highlight a couple. We’re going to do a deep dive on LDAP and RBAC in the slides that follow. Another interesting thing coming in is AES encryption functions, and I’m happy to say at least one of the folks who helped sponsor this is on this call. We’ve added functions very similar to the MySQL AES encrypt and decrypt, giving you the ability to add high-quality encryption to values on individual columns. This is the first step toward something like transparent data encryption, where ClickHouse will manage keys and implement encryption regimes suitable not just for protecting data in general but, more importantly, for multi-tenant situations where you have different keys that get rolled over. This is the tip of the spear, starting to appear with the AES encryption functions, and we expect to see it built out beyond that. What I’d like to do now is turn things over to Vitaliy Zakaznikov, who’s going to dive into RBAC followed by LDAP.

22:51 — Role-Based Access Control

Vitaliy Zakaznikov: Hello everyone. We’ll quickly touch on the new feature that came to version 20.4, which is role-based access control. It’s a very exciting feature, as it provides SQL statements to control users, roles, row policies, settings profiles, and quotas. As you can imagine, having SQL statements greatly simplifies access control and user management, especially for multi-tenant systems. With the row policies that came with RBAC, that brought support for row-level security, which enables you to provide access to users or roles to a specific part of the table based on a filtering expression. And with RBAC, the feature provides fine-grained control over privileges, so you can control which users or roles have a specific privilege to execute which statements, and it’s a hierarchical structure to enable easier control.

How would you enable RBAC? By default RBAC is disabled, so it’s pretty easy. There are two parts: first, if you want, you can specify a directory where configurations will be stored, done inside your config.xml using the access_control_path setting, which by default is set to /var/lib/clickhouse/access. Then, for a user you would like to have access management, you need to enable the access_management setting. As you can see in the slide, if we want to allow the default user to create other users and grant rights to them, you set the access_management setting to one and restart your server, and then if you log in with the default user you’ll be able to create new users and grant rights to them.

So, the RBAC SQL statements. I broke them into some groups. For user management, you now have CREATE, ALTER, and DROP USER statements, and a special SHOW CREATE USER statement to see how the user was created. The same is available for roles: you can CREATE, ALTER, DROP, and SET ROLE, as well as set a default role for the current user, so if the user has multiple roles you can limit your privileges by setting a specific role, and SHOW CREATE ROLE lets you see how the role was created. Roles are really nice because they let a group of users share the same role, so you don’t have to assign the same privileges to each specific user; you assign privileges to your role and then assign roles to your users. For access control, given that we have privileges, you grant and revoke them using the GRANT and REVOKE statements. Then there are row policies, which bring row-level security, that you can CREATE, ALTER, and DROP, and view with SHOW CREATE ROW POLICY. In addition, there’s now the ability to create settings profiles that you can associate with users and roles, with CREATE, ALTER, DROP, and SHOW CREATE SETTINGS PROFILE statements, and you can limit or assign quotas to your roles and users with the corresponding CREATE, ALTER, DROP, and SHOW CREATE QUOTA statements. All these new statements really let you manage your users, roles, and access, so your admins can easily manage the ClickHouse cluster and the tenants inside your database.

The main point of RBAC is to provide privileges, and privileges provide permission to execute specific kinds of queries. There are a lot of them, and you can see the whole list in the ClickHouse documentation, but the important part is that privileges have a hierarchical structure, and there are a few special privileges. One is the ALL privilege, used to grant all privileges to a role or user; there’s also a special NONE privilege, so when creating a user or role you can specify that no privilege is assigned. The last two are the ADMIN OPTION and the GRANT OPTION: the ADMIN OPTION privilege allows a user to grant its role to another user, which otherwise only a user with access_management set to one can do, and if you specify GRANT OPTION when granting a privilege, that user will be able to grant that privilege, of the same or smaller scope, to another user.

Here we have an example of the hierarchical structure of RBAC privileges. Because there are a lot of them and it would be tedious to manage them individually, the hierarchy was created, so for example the ALTER COLUMN privilege has sub-privileges like ALTER ADD COLUMN, ALTER DROP COLUMN, MODIFY, CLEAR, and RENAME, which map to corresponding commands you can execute. You can control all those sub-privileges using the ALTER COLUMN privilege, which itself belongs to the ALTER TABLE privilege, which belongs to ALTER. So if you want to give a user or role control to execute any ALTER commands, you just grant the ALTER privilege generically, and if you want to be specific and only enable them to add a column, you just grant the ALTER ADD COLUMN privilege.

Here’s a slide with examples of how these statements can be used; there are many more variations, and we encourage you to try this feature. For example, the first statement is how you create a user: you can create a user account, say “mirror,” with a password, just by executing CREATE USER, identified with, if you want to use a SHA256 password, the password. It’s very easy, you can do it on your ClickHouse client. Similar for altering a role, if you have one role and want to rename it, you use ALTER ROLE. For granting and revoking, we use the GRANT and REVOKE statements; for example, there is a SELECT privilege that can specify access to a specific column or tuple of columns, so GRANT SELECT(x, y) will grant access to the x and y columns on some table to a user, for example John, with a GRANT OPTION, which again will allow John to potentially grant this privilege to another user. If your admin wants to revoke a role from a user, they would use the REVOKE statement specifying either a role or a user. Similar for CREATE ROW POLICY and CREATE SETTINGS PROFILE, you use the corresponding statements and can then assign those settings or row policies to a specific user or role. That was a quick introduction to RBAC; if you want more information, we recommend checking out the documentation, which has a page listing all the privileges and how they map to the SQL statements.

31:36 — LDAP Authentication

Vitaliy Zakaznikov: Another exciting feature related to security is that now you can authenticate users using LDAP. Before, you would have your users defined in your users.xml, where you would define passwords, which is very cumbersome when you have a large organization, because now users must have specific passwords just to enter the ClickHouse database. With LDAP authentication, that process becomes easier, because now you have the option to have an LDAP server and, instead of specifying a password, specify which LDAP server ClickHouse should use to authenticate the user. That feature appeared in 20.7 and is still in the testing stage. It really simplifies user-password management; you no longer need to store passwords in your users.xml. If you have an LDAP server in your organization, you just update your user definition to specify which LDAP server to use instead of a hard-coded password, and the implementation supports secured connections, so passwords are not transmitted in plain text. You have the ability to specify one or more servers, and different users can be authenticated using different LDAP servers, so the implementation is very flexible.

Let’s see how you would switch to using LDAP user authentication. First you need to define your LDAP server; the descriptions go into the config.xml, or you can create a separate XML configuration inside the config.d directory, which is the recommended way. There you define an ldap_servers section, and in it you define your server. In this slide we provide an example of how you can define my_ldap_server, where you specify the host, the port, whether you want to enable a TLS connection (a secure connection, which by default is set to yes), and then the DN prefix and suffix that would be used to contact the LDAP server and check the password provided to ClickHouse. Once you have your servers defined, it’s very easy to switch users to use LDAP. You go to your users.xml, and for a given user you remove the password section and add an ldap section where you specify which LDAP server to use to authenticate that user, then restart your server. After that, anytime users specify a password, a request will be made to the LDAP server, which will either verify or reject it; if rejected, the user will not have access to ClickHouse.

So the simple workflow: first you define LDAP servers inside your config.xml, then convert your user definitions to use those servers, then restart your ClickHouse server, and now your users can be authenticated using LDAP, and you no longer have to store passwords explicitly just for ClickHouse inside the users.xml file.

Robert Hodges: Thank you very much, Vitaliy. One thing worth adding on LDAP is that this is just the beginning of LDAP support; we do expect to eventually be able to hold the users in their entirety in LDAP, as well as other types of identity managers, so definitely keep an eye on this, there’s more on the way.

35:48 — Dictionaries: DDL, Join Rewrite, and New Layouts

Robert Hodges: For the final section, we’d like to discuss a couple of new ClickHouse features, and for this I’m going to turn it back over to Alexander Zaitsev.

Alexander Zaitsev: Thank you, Robert. It’s very hard to pick some new features, because there are too many, and I picked some related to dictionaries, because it opens up some interesting new capabilities, and later we’ll talk about some JSON processing improvements. There’s been quite a lot of work around dictionaries, but recently, starting from dictionary DDLs later in 2019, you are now able to create dictionaries just like tables with the CREATE DICTIONARY SQL statement, so they’re first-class citizens. Since then there have been some interesting changes. First, starting from ClickHouse 20.4, ClickHouse can now automatically rewrite joins; if you join a dictionary, because it looks like a table you can join it, and ClickHouse automatically rewrites that to the traditional calls which are more efficient. This is a cool feature because it lets you define dictionaries but use normal SQL joins, which is very friendly to users, BI tools, and everything. This is an example of joins: if you have a query and you join a dictionary to extract some attributes, it gets automatically rewritten to dictGet function calls internally, you don’t need to do it yourself, ClickHouse can do it automatically starting from 20.4. You can find details in Altinity blog articles back in May.

More interesting are the new layouts, and we’ll start from ssd_cache. Consider the case where your dictionary is too big to fit in RAM but your dictionary source is too slow to use a cache layout, so if you have a cache miss you have to go to some external system to fetch the value, and it may be slow, so a cache layout may not give you enough performance. What you can try now is the ssd_cache dictionary layout, which lets you do a kind of two-layer cache, both in RAM and on local SSD storage if you have it. ClickHouse will try to store as much data as specified in memory, cache everything else in fast SSD, and still go to your original source if there’s nothing in the local SSD cache or the in-memory cache, or if the data in cache has expired, since you have an expiration strategy. With this two-layer approach you can get much better performance compared to a cache layout or some other techniques.

But it’s possible that your data is even bigger and you can’t even use ssd_cache, or your data changes too often so you can’t use a reasonable invalidation strategy. For that use case there is a new layout called direct. With this layout, ClickHouse always calls the external source; it doesn’t do any caching or validation, so if you call a dictionary with the get or any method, ClickHouse always calls your external system, and you rely on that system’s performance. You have plenty of things to work with: there’s Redis, which is supported quite well, there’s MongoDB support even from 2018, which I was surprised existed for so long, there was Cassandra support added in the 20.5 release very recently, Aerospike support was planned and somebody worked on it, but I checked before the webinar and the pull request is kind of abandoned, so I hope somebody picks it up, because Aerospike is a good technology. Or you can even use MySQL; you can do wonders with MySQL, especially if you read their corner blog, they claim something like enormous read performance, millions of key lookups per second, if you have everything in the InnoDB cache.

42:13 — JSON Processing, Cluster Features, and the Road Ahead

Alexander Zaitsev: Now let’s talk a bit about JSON processing improvements. We had a webinar earlier on how to do log processing, and the typical workflow is to take some logs, insert them into a staging or ETL table without a lot of processing, and then apply a SQL materialized view to extract all the necessary information from your raw data. This is a very neat approach and it works fine, but in reality, if you need to store JSON data, and it’s quite often that your logs are in JSON, it’s not well-formed JSON. Your JSON logs are typically built from JSON messages separated by commas; every message is well-formed JSON, but all together, if you tried to load them into ClickHouse, it wasn’t very easy, because you cannot use any of the JSON formats since it’s not well-formed JSON, and you cannot use CSV because you have CSVs inside the JSON. Before recently, you had to do some tweaking, some extra processing in bash, to convert this into a tab-separated format, or try to load it as a JSON array and do some extra processing, which was not very convenient. Now we have a solution: a new format called JSONAsString, which takes the input and considers it as a sequence of comma-separated JSON messages, and voila, we have the data loaded in a table with every JSON in a string, and now we can apply all the rich JSON processing functions ClickHouse has to extract the data.

There were a lot of cluster improvements, and it would take a full webinar to talk about them in detail, but I want to highlight some, so you know this functionality is already available. Distributed-over-distributed tables are available, so you can define a distributed table that refers to other distributed tables. One long-awaited feature is parallel distributed INSERT SELECT: if you work with a cluster and need to copy data from one distributed table to another, you may have noticed it works slowly, because ClickHouse had to collect the data from all shards to the node where you’re running the query and then ingest it, so you had a bottleneck on a single server. Now, with parallel distributed INSERT SELECT, ClickHouse can do it in parallel on every shard, just moving data locally from shard to shard, which is much more efficient, especially with a bigger cluster. One more interesting feature is the system.distribution_queue table, which lets you see what data has been inserted into the distributed table but not yet propagated to your shards; before, you had to go to the file system to see this, and now you have it in ClickHouse itself.

Last, I’d highlight three features coming in the next several months. One is the Replicated database engine, which greatly simplifies management of replicated tables: in essence you put your logic about how to replicate the tables inside the database, and then you just create a replicated table and it gets automatically propagated with the proper path and created on all replicas for that database automatically, which makes replicated-table management much easier. Two things about ZooKeeper are also coming. We started to work on replica recovery, because if you work with ZooKeeper, I think everybody with ClickHouse experience has lost ZooKeeper at least once, and if you lose it, the replicated tables turn into a kind of read-only mode and are not very useful, and it’s a very long and painful process to recover. So we started to work on a feature that would let you do it with a single SQL statement. In parallel, there are a number of changes going on inside ClickHouse, driven by Altinity, that should eventually let us continue inserts and writes even if ZooKeeper is not available; right now if ZooKeeper is not available you’re in read-only mode, which is not good for insert-intensive real-time workloads, and with this change ClickHouse can continue to operate almost normally, it just cannot merge old parts with new parts inserted while ZooKeeper is down, but everything else continues to work.

Robert Hodges: Thank you very much, Alexander. We’re going to do a short wrap-up, and I’d like to encourage you to ask more questions about these features or anything else in ClickHouse; we have four people here ready to answer. At a high level, the focus of recent development has been particularly around storage, an enormous amount of work, and around security as we thread in enterprise-level features like RBAC, which has been a community request for years and is now getting the level of testing it needs for people to count on it, plus feature improvements around query optimizations.

For the future, a couple of things to highlight. The first is to make ClickHouse more cloud-friendly; you saw the thin edge of this, which is the S3 storage support, but what we’re really after is the ability to separate storage from compute, the holy grail for cloud databases. The goal of the community is to get there in incremental steps, and it also includes things like spending less time explicitly managing shards and replicas and more time just putting your data in, where ClickHouse figures out the right place to put it. This is a multi-year project, but you’ll start to see these features coming. Another area of great interest is SQL conformance, things like full support for common table expressions, all the things necessary to become a fully compliant SQL implementation. We’ve never really worried too much about SQL conformance on ClickHouse, we rather worry about doing things good for our users, but in the long run, if you want to integrate with BI tools, there are a lot of standard SQL features people expect, and it’s our goal to put them in. With that, we’re done for today; this is the best part of my job, getting to talk about stuff like this. Please fill out the poll, and then we’ll transition to questions. If you’re interested in learning more about ClickHouse, we offer a free one-hour consultation, and if you’d like to work on this stuff, we’re also hiring.

52:54 — Q&A

Robert Hodges: We’ve also got Mikhail Filimonov, another one of our experts, who is the maintainer of the Kafka engine, joining us to answer questions. Question one: Mikhail, it looks like you want to answer this, if ZooKeeper is down, how do you rebalance data once it’s back online?

Mikhail Filimonov: Can you hear me? Usually, if your ZooKeeper is down during the downtime, you can’t insert data to that node, so that node doesn’t do anything with your data; the data stays safe, nothing is happening. Once you get your ZooKeeper back, everything just continues working, you don’t need to do any rebalances or anything like that. The thing here is just to try not to lose the data in ZooKeeper, because otherwise recovery is more painful.

Robert Hodges: Great. We had a couple of previous questions on performance and insertion rates that were already covered. One question that comes up really commonly, and Alexander, this is one for you: what about rebalancing of shards in general? It’s such a common ask in the community.

Alexander Zaitsev: Thank you, Robert. This is definitely an interesting question, especially if you go to the cloud and you don’t know your workload in advance and you just want to be able to scale up when you need to. Right now there is no good solution for rebalancing; if you need to, you use clickhouse-copier to move your data between clusters, smaller ones or bigger ones, or move parts manually in the background, which is kind of painful and error-prone. We do have a design proposal internally at Altinity, and we discussed it, for how we can make rebalances easier to do, not automatically yet, but with something like a SYSTEM REBALANCE TABLE SQL statement. It’s probably not a very complicated project; we’re just looking for companies who can fund it right now, because our resources are busy with other things, but if we find funding we’ll definitely do it in two or three months, it’s not complicated. In the long term, there’s a concept of cloud tables that was published two years ago in a Yandex blog, and that concept is still true, so there are plans to do that, but for ClickHouse to build those cloud tables it has to undergo a lot of internal evolution, which is currently happening as a set of small steps, so it’s not very visible to users or many developers. We expect those cloud tables to be available something like 2022.

Robert Hodges: Great. I’ve got a question that just popped up: I noticed that ZooKeeper doesn’t delete the data after being inserted on all nodes, did I configure it wrong? Anybody want to grab that one?

Mikhail Filimonov: What ZooKeeper stores is not the data, it stores the meta information about parts. Every time you insert data into ClickHouse it creates a new part, and the information about which parts exist is in ZooKeeper, not just right after insertion but persistently, so all the parts registered in the system should be registered in ZooKeeper, and that’s the source of truth, the source of knowledge about the state of our replica. So if you just join one more replica, it reads that list of parts that should be there and downloads them from healthy replicas. That information persists; it doesn’t get removed after it’s been synced.

Alexander Zaitsev: This is, by the way, something that we’re going to change in the future, because right now ClickHouse heavily depends on ZooKeeper being available, and nobody likes that, so we have a sort of campaign.

Robert Hodges: I think that’s a good point, and we’ve hit the top of the hour, so we’re going to have to close up. There is definitely a campaign in the community, which we’re involved with, to make ZooKeeper kind of go away as an operational issue, so you’ll see continued improvements. One final thing in closing: as you can tell, we’re deeply involved in the development of these features, and we’d like to thank the members of the community who have helped fund the development. If you have ClickHouse features of interest that you’d like to see, come and talk to us; we’re always interested in adding new features, particularly if you have enterprise problems that can be solved by them, so that’s another reason to contact us and see how you can help add new features to ClickHouse to make it better for everybody. With that, I’d like to thank everyone who helped on this presentation, Alexander, Vitaliy, and Mikhail, and for our audience, thank you so much for attending. You’ll get an email shortly with links to the recorded presentation and the slides. Thanks and have a great day.

FAQ

What are storage policies, and how do JBOD and tiered storage work? Storage policies, introduced over the year before this webinar, let you attach multiple disks (mount points) to ClickHouse and organize them into volumes, selected per table with the storage_policy setting. A JBOD (“just a bunch of disks”) policy spreads data across several disks, which on cloud block storage like Amazon EBS multiplies the bandwidth available to small instances. Tiered storage goes further by matching data to its access pattern, keeping recent data on fast NVMe SSD, moving older data to cheaper high-density disks, and ultimately to object storage, with TTL rules driving the moves automatically.

How does ClickHouse use S3 object storage? You configure an S3 disk by giving its type as s3, an endpoint pointing to a bucket and prefix, and the access key and secret, after which ClickHouse can read and write directly from object storage. Object storage behaves very differently from block storage, with slow metadata operations and inserts, so parallelization across multiple threads is essential; Altinity’s work brought S3 read and write performance close to EBS in some cases. Because object storage can be roughly 10x cheaper for typical usage, it’s a major development, expected to reach production readiness in the 20.6 release.

What are compact parts and memory parts? A normal “wide” MergeTree part stores each column in separate .bin and .mrk2 files plus a primary key index, which can mean hundreds of files per part. Compact parts store the same information in far fewer files (in one test, eight files instead of 224), which makes inserts much faster, especially small ones, at some cost to column-selective query efficiency, and helps a lot on object storage. Memory parts are a newer third type that hold small inserts in memory with a write-ahead log for reliability, then merge them into compact or wide parts, and can replace buffer tables.

What new TTL capabilities does ClickHouse have? TTL began as a way to automatically delete old parts or partitions based on time, then was extended with TTL MOVE, which powers tiered storage by moving data between volumes automatically. The community then added TTL DELETE WHERE, which adds an extra condition for the rows to be deleted, and TTL GROUP BY, which automatically rolls up older data, for example reducing minute-level detail to coarser granularity after a set period, by grouping on a prefix of the primary key and applying an aggregation function.

What does role-based access control add, and how do you enable it? RBAC, added in 20.4, provides SQL statements to manage users, roles, row policies, settings profiles, and quotas, with a hierarchical privilege structure and special privileges like ALL, NONE, ADMIN OPTION, and GRANT OPTION. Row policies bring row-level security based on a filtering expression. To enable it, you optionally set access_control_path in config.xml (defaulting to /var/lib/clickhouse/access) and set access_management to one for a user, then restart, after which that user can create users and grant rights. LDAP authentication, added in 20.7, complements this by letting you authenticate against an LDAP server instead of storing passwords in users.xml.

What improvements came to dictionaries and clusters? Dictionaries became first-class objects with CREATE DICTIONARY DDL, and from 20.4 ClickHouse automatically rewrites joins against a dictionary into efficient dictGet calls. New ssd_cache and direct layouts handle dictionaries too large for RAM, with direct always calling the external source, useful for key-value sources like Redis, MongoDB, and Cassandra. On clusters, the JSONAsString format simplifies loading messy comma-separated JSON logs, while parallel distributed INSERT SELECT, the system.distribution_queue table, and the Replicated database engine improve distributed operations and replicated-table management.


© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

Join our Slack

ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.