What’s a Data Lake and What Does It Mean For My Open Source ClickHouse® Stack?

Recorded: January 22 @ 08:00 am PT
Presenters: Robert Hodges

In this webinar, Altinity CEO Robert Hodges introduces data lakes from first principles and explains why they are increasingly important for organizations managing rapidly growing analytical data sets. He begins with a clear definition of what a data lake actually is, tracing its origins from the internal storage architecture of data warehouses like ClickHouse® and explaining how data lakes simply make that previously hidden storage accessible to any application via open APIs.

The session then focuses on Apache Iceberg as the leading open table format, walking through its four layers: S3-compatible object storage, Parquet data files, Iceberg metadata files, and a catalog service that ties everything together. A live Python demo using PyIceberg, PyArrow, and Pandas creates an Iceberg table, populates it, and reads it back, illustrating the full access flow without a database server in the middle.

Robert then shifts to ClickHouse and walks through three ways to read Iceberg data: the S3 table function (direct file access using wildcards), the Iceberg table function (metadata-aware file resolution), and a fully integrated database using the Iceberg REST catalog (which makes Iceberg tables appear as native ClickHouse tables). A second demo shows all three approaches live, then uses Apache Spark to delete rows and confirms that the two Iceberg-aware access methods reflect the updated data. The raw S3 approach returns stale results because it also reads files that are no longer part of the current table state and cannot distinguish valid from invalid data.

The talk then covers advanced ClickHouse capabilities for data lake workloads: clustered reads across multiple nodes using the Iceberg cluster function, Parquet-level optimizations including Bloom filter indexes and minmax partition pruning, filesystem caching for S3, and in-progress work on caching Iceberg and Parquet metadata. The Merge table engine technique for combining hot MergeTree data with cold Iceberg data using a single SQL query is demonstrated.

The roadmap section covers what ClickHouse cannot yet do (write to Iceberg, tiered storage archiving, compaction, and autoscaled parallel query) and what is coming from both ClickHouse Inc. and Altinity in 2025. The session closes with a discussion of medallion architectures, comparing a pure MergeTree approach, a raw S3 approach, and the emerging ClickHouse-plus-data-lake approach that combines Kafka ingest, Spark for cleaning, and ClickHouse for fast aggregated queries on shared Iceberg data.

Altinity CTO, Alexander Zaitsev, answers audience questions throughout the session.

Here are the slides:

What’s a Data Lake and What Does It Mean for My Open Source ClickHouse Stack-2025-01-22 Download

Key Moments (Timestamps)

Key moments generated with AI assistance.

00:04 – Welcome and housekeeping
01:00 – Speaker introductions and Altinity background
02:23 – Why data lakes matter: 10x annual data growth in real-time warehouses
03:58 – What is a data lake? From hidden warehouse storage to shared open APIs
07:53 – Data lake components: object storage, Parquet, Iceberg metadata, and catalogs
10:41 – How a Python data science application reads Iceberg data
11:35 – Reference implementation: MinIO, Iceberg REST catalog, PyIceberg, PyArrow, Pandas
12:50 – Demo 1: Creating, populating, and reading an Iceberg table in Python
15:29 – What the demo created: buckets, metadata files, Parquet data files under MinIO
17:35 – Pandora’s box of data lake choices: formats, catalogs, languages, storage
19:28 – ClickHouse architecture overview: MergeTree, S3, vectorized query, Keeper
21:18 – Three ways ClickHouse can access a data lake
22:13 – Method 1: S3 table function with wildcard paths
23:10 – Method 2: Iceberg table function for metadata-aware reads
23:40 – Method 3: Iceberg REST catalog as a native ClickHouse database
25:11 – Demo 2: All three access methods live, plus Spark delete and consistency check
32:26 – Why direct S3 reads return wrong answers on Iceberg data
32:44 – Building data lake applications: clustered reads with the Iceberg cluster function
35:09 – Parquet read optimizations: Bloom filters, minmax, PREWHERE, filesystem cache
38:03 – Combining hot MergeTree and cold Iceberg data with the Merge table engine
40:46 – What ClickHouse has and has not yet in data lake support
43:25 – ClickHouse Inc. and Altinity 2025 roadmaps for Iceberg and Parquet
45:52 – Medallion architecture: how to design it with today’s ClickHouse stack
49:14 – Design options: pure MergeTree vs. raw S3 vs. ClickHouse plus data lake
51:35 – Managed data lakes: EKS-style hosting, AWS S3 Table Buckets, Databricks, Snowflake Polaris
54:47 – Summary and wrap-up

Webinar Transcript

[00:04] – Welcome and Housekeeping

Robert: Welcome. We are going to be presenting today on what’s a data lake and what does it mean for my open-source ClickHouse® stack.

Before we get started, a little bit of housekeeping. This talk is being recorded. We will provide the recording, or a link to it, as well as the slides after the talk. You will most likely get them within a few hours of the talk being completed. This talk also has a couple of demos based on a reference implementation of Iceberg using Docker Compose. There is a link to that at the end of the slides so you can find the code we are running if you want to try it yourself.

[01:00] – Speaker Introductions

Robert: My name is Robert Hodges. I am CEO of Altinity, and I have been working on databases for about 40 years. I have my friend and colleague Alexander Zaitsev here as well. He is CTO of Altinity, and he will be around to help answer questions and may chip in on some of the topics.

This talk is based on our experience in Altinity Engineering. We have about 50 people in the company spread worldwide, most of them engineers. We have extensive experience with databases, particularly ClickHouse, and increasingly with data lakes.

A little more about Altinity: many of you on this call may know us as the authors of the Altinity Kubernetes Operator for ClickHouse®. We do the Altinity community plugin for Grafana with close to 20 million downloads. We do the Altinity Backup for ClickHouse® project. We do a bunch of work on ClickHouse itself and have what are called Altinity Stable® Builds for ClickHouse®. We have done hundreds of PRs on ClickHouse over the years, and we are very invested in this technology.

[02:23] – Why Data Lakes Matter: 10x Annual Data Growth

Robert: What I want to do in this talk is start with the basics: what is a data lake, and why is this important?

The graph I am showing gives you an illustration of the increases in data size over the last five years or so. Altinity has been in business working on ClickHouse since 2017, and like everybody else, we have seen enormous growth in the amount of data being managed in real-time data warehouses. Back in 2019 when we would get people coming on board, they would often have, say, 100 GB of data. That has rapidly increased to data sets where we are dealing with many petabytes, particularly in large observability data sets. This increase in growth, on the order of 10x per year in some cases, is putting pressure on data warehouses because it is very difficult to stuff all this data into a data warehouse and both manage it and make it accessible to users.

Data lakes have emerged as an answer to this problem of being able to store very large amounts of data cheaply.

[03:58] – What Is a Data Lake?

Robert: One way to understand data lakes is to think about what happens inside a data warehouse today. ClickHouse is a query engine, and your applications connect to the ClickHouse server. ClickHouse manages storage, but the way that storage works and the access methods are completely hidden to you. That is actually a benefit in most cases. Your application connects, you run your queries, ClickHouse takes care of the rest.

What is going on under the covers? ClickHouse has a notion of tables: getting your data organized in a tabular form, having a schema for those tables, a format for the files that store the data, being able to break the data into parts, holding statistics like minmax indexes or primary key indexes, some understanding of transactions, and time travel, which is a feature ClickHouse does not really have but is present in many other data warehouses.

What a data lake does is take that storage, which was previously hidden, and make it accessible to other applications using APIs that everybody can read and write. You can think of it as pulling out the storage layer. Query engines can still get to it, and your applications can still connect to the query engines, but other applications can connect to it as well. This basically allows a wide range of applications to work on a single copy of data.

This type of organization, where you have transactions or schema enforcement alongside cheap object storage, is often called a data lakehouse. That is a marketing term developed by the Databricks folks in a seminal paper they wrote in 2021. It describes combining the ability to store things cheaply with the controls we expect when managing large amounts of data in a data warehouse. From here on out, when we say data lake, that is what we mean.

[06:47] – Data Lakes Are Already Pervasive

Robert: Data lakes are already pervasive in certain workloads, even if they are not yet common in data warehouses. A bunch of applications have begun using them very heavily: model training, data science, and large batch operations like Spark jobs that manipulate very large amounts of data. They build on Open Table Formats such as Iceberg, Delta Lake, and Hudi. They use file formats like Parquet or ORC. They are based on S3-compatible object storage. These solutions offer global access, and critically, they run in user accounts. Data sets are so large that people typically do not want to push them into a cloud service that somebody else owns, because they lose control and it is really expensive. This is already allowing multiple applications to work off a single copy of data.

[07:53] – Iceberg Architecture: The Four Layers

Robert: Let us dig into the actual plumbing. From here we are going to focus fully on Iceberg. Iceberg looks like the winner in this race among the various table formats. When you build a data lake on Iceberg, what do you actually have?

First, you have S3-compatible object storage. Second, you have data most commonly stored in Parquet files. Parquet, for a ClickHouse person, is an analog to the MergeTree file format. It stores data in columns, compresses it, has indexes built in, has metadata and schema. It does all the things MergeTree does; it is just an open format that anybody can read or write.

Third, you have metadata. The metadata in Iceberg manages the tables: what is the table, what is its name, how is it partitioned, where are the files. All of that is stored in the metadata layer. Finally, because you do not want to have to manipulate these metadata files yourself, there is what is called a catalog. A catalog holds the list of metadata files, their locations, and instead of having to go read metadata files individually, you can connect to the catalog and get all the information from one place.

With these four parts: the S3-compatible object storage, the Parquet file format, the Iceberg metadata files, and a catalog, you have all the pieces necessary to allow a Python data science application to connect to your data lake.

[10:41] – How a Python Application Reads Iceberg Data

Robert: When this Python data science application spins up and you want to do selects, the first thing it does is connect to the catalog service and ask for metadata for a particular table. The metadata gets pulled down through the catalog, the Python libraries know where the actual files are, and then they go read them directly. This is a really important point: unlike using a database like ClickHouse, there is no server between you and the data. Applications go and read it directly, and this is what allows many applications to read the data simultaneously because they can all see and access it.
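The read path described above can be sketched with a toy model in plain Python. This is not the PyIceberg API: the dictionaries stand in for the catalog, the metadata files, and object storage, and every name, path, and value in it is hypothetical. It only illustrates the order of operations: ask the catalog, read the metadata, then read the data files directly.

```python
# Toy model of the Iceberg read path: catalog -> table metadata -> data files.
# All names, paths, and rows here are hypothetical stand-ins.

# The catalog maps a table name to the location of its current metadata file.
CATALOG = {"data.bids": "warehouse/data/bids/metadata/v2.metadata.json"}

# The metadata file lists the data files that make up the current table state.
METADATA_FILES = {
    "warehouse/data/bids/metadata/v2.metadata.json": {
        "schema": ["day", "symbol", "bid"],
        "data_files": [
            "warehouse/data/bids/data/day=2025-01-21/part-0.parquet",
            "warehouse/data/bids/data/day=2025-01-22/part-0.parquet",
        ],
    }
}

# Stand-in for object storage: file path -> rows (one file per daily partition).
OBJECT_STORE = {
    "warehouse/data/bids/data/day=2025-01-21/part-0.parquet": [
        ("2025-01-21", "AAPL", 190.0),
        ("2025-01-21", "MSFT", 410.0),
    ],
    "warehouse/data/bids/data/day=2025-01-22/part-0.parquet": [
        ("2025-01-22", "AAPL", 195.0),
        ("2025-01-22", "MSFT", 415.0),
    ],
}


def read_table(table_name):
    """Resolve a table through the catalog, then read its files directly."""
    metadata_location = CATALOG[table_name]        # 1. ask the catalog
    metadata = METADATA_FILES[metadata_location]   # 2. read the table metadata
    rows = []
    for path in metadata["data_files"]:            # 3. read data files, no server in between
        rows.extend(OBJECT_STORE[path])
    return rows


rows = read_table("data.bids")
print(len(rows), "rows read from", len(METADATA_FILES[CATALOG["data.bids"]]["data_files"]), "files")
```

Because step 3 goes straight to storage, any number of applications can run the same three steps at once without a database server in the middle.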

[11:35] – Reference Implementation

Robert: For this particular example, we are going to use MinIO to provide our S3 storage. We will have our metadata managed in Iceberg format and our data in Parquet. We are going to use an Iceberg REST catalog, which provides a REST API we can call to get information about tables. And since we are in Python, we will use three libraries: PyIceberg, which knows how to find and select from Iceberg tables; PyArrow, which knows how to read and write Parquet; and Pandas, because everybody uses Pandas for data frames. These are all the parts of the application plus the data lake.

[12:50] – Demo 1: Creating, Populating, and Reading an Iceberg Table in Python

Robert: Let us actually look at an example. If you have questions as we go along, please feel free to put them in the Q&A box or the chat and we will answer them as we go.

Here is what a Python application looks like when it talks to a data lake. There is a lot of code. We are connecting to the catalog, creating a namespace, listing our tables, using somewhat complicated code to define the schema of a table, creating the table, putting data in it, and so on. This is a very low-level approach to accessing data, and you will notice there is not a drop of SQL in it.

Let us run this script. It is connecting to the catalog, creating the namespace, listing the namespaces, adding some data, adding more data. Very simple: we are adding four rows. Now, let us run another script to read this back. The reading code is much simpler. We connect to the catalog, find the table we just created, scan the table to read all the data into a Pandas data frame, and print it. There are our four rows. As you will see later, the data is in two partitions because we are partitioning by day.

[15:29] – What Was Created: Buckets, Metadata, and Parquet Files in MinIO

Robert: As we connected to the catalog and created the table, the Python libraries created both the metadata and the actual Parquet data files. If you go into MinIO and use the console, you would see a bucket called Warehouse, and under that a path called Data. Under that you would see what amount to two directories. Under the metadata path you would see all the Iceberg metadata files, a mix of Avro and JSON. Under the data path you would find the actual Parquet data files, one per partition. This data is completely accessible to any application, and if you have a console you can go through and look at it directly. The point is that everything is just S3 files.

[17:35] – Pandora’s Box: Choices When Adopting a Data Lake

Robert: One thing about adopting this technology: it is somewhat mind-expanding, not entirely in a good way. There is a Pandora’s box of choices. We are using Iceberg, but you have other table formats. Iceberg looks like the winner here. There is the Iceberg REST catalog, but that is not the only catalog. There are others: Unity, AWS Glue, Nessie, the Hive Metastore, and more. You need to pick one. The Iceberg REST catalog is a reference implementation, simple to use, and pops up quickly.

Parquet is a good format. It is not the only one that can be used, but it is a good bet because it is so widely supported and there is already a lot of data written in it. You will have to learn more about language ecosystems. We are using PyIceberg, so that is Python, but there are other ways. In fact, if you are using Iceberg, some things are very difficult to do outside of Java because the basic Iceberg libraries are written in Java, so you sometimes have to go back to those. And of course you have lots of options for storage.

One question that popped up: does Iceberg provide any data backup or disaster recovery strategy? The answer is an emphatic no. That is something you as an application user need to put together. In the same way that a ClickHouse MergeTree table does not provide built-in backup, that is an application-level concern.

[19:28] – ClickHouse Architecture: A Quick Review

Robert: Let us shift from talking about data lakes and start bringing ClickHouse® into the picture. Most of you here know ClickHouse, but a quick review does not hurt.

ClickHouse is a real-time analytic database. The ClickHouse servers run as separate processes, typically on different VMs. They have their own columnar storage built in. The workhorse storage engine is called MergeTree, with a bunch of variants, and that stores data on block storage. ClickHouse is increasingly capable of reading and writing S3 object storage. It has very powerful caching to make reads as efficient as possible, and a vectorized parallel query engine that spreads queries across multiple servers on the network, and within a single server is very good at breaking data into pieces and pushing them onto as many cores as possible. It also uses ZooKeeper or Keeper for coordination, which is particularly important for replication.

This is the basic architecture, and what is happening right now is it is being extended to work with data lakes.

[21:18] – Three Ways ClickHouse Can Access a Data Lake

Robert: How can ClickHouse access a data lake? There are basically three options. One is to go read the Parquet files directly, skipping all the catalog bureaucracy and going straight into object storage. Another option is to go straight to the metadata files, cutting out the catalog but still reading the Iceberg metadata to find the right files. The third option is to go to the catalog and let that tell you what the tables are and access them that way. We are going to try all three and comment on which is most appropriate.

Method 1: S3 table function. To read data directly from S3 storage, you just have to know where the files are. Using the S3 table function, you give the path with wildcards that say “under these directories you are going to find a bunch of Parquet files, just read them all.” Very powerful and super fast. There is also a clustered form called S3Cluster. The downside is you have to know exactly where the data is, and if some data belongs to different versions of files, ClickHouse would not know the difference and would read the wrong files.

Method 2: Iceberg table function. This allows ClickHouse to go directly to the Iceberg metadata. ClickHouse will grab the metadata files, figure out what the table definition looks like, and read the correct files. This is more accurate than raw S3 access because it respects Iceberg’s versioning.

Method 3: Iceberg REST catalog as a ClickHouse database. This is a little more complicated. Right now it is an experimental feature that you need to turn on. You create a database in ClickHouse backed by the Iceberg REST catalog, and once you have done that you can use it as if it were a native ClickHouse database. This is very powerful: all your operations are just like working on tables inside ClickHouse itself. There is one small difference: because Iceberg and ClickHouse do not have completely consistent notions of table namespace nesting, you get some double-barreled names in backticks. That is on the ClickHouse roadmap to fix.
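As a rough illustration, the three methods might translate into ClickHouse queries like the ones built below. The Python only assembles SQL strings so they can be passed to whatever client you use. The endpoint, credentials, bucket layout, and the table name `data.bids` are assumptions, not values from the demo, and the third query assumes a database named data_lake has already been created against the REST catalog. Check the exact arguments of the `s3` and `iceberg` table functions against the documentation for your ClickHouse version.

```python
# Hypothetical endpoint, credentials, paths, and names; replace with your own.
S3_ENDPOINT = "http://minio:9000"
ACCESS_KEY = "minio_user"
SECRET_KEY = "minio_password"
TABLE_ROOT = f"{S3_ENDPOINT}/warehouse/data/bids"


def access_queries():
    """Return one ClickHouse SQL string per access method."""
    creds = f"'{ACCESS_KEY}', '{SECRET_KEY}'"
    return {
        # Method 1: read every Parquet file under the data path.
        # Iceberg metadata is ignored, so stale files are read too.
        "raw_s3": (
            f"SELECT count() FROM s3('{TABLE_ROOT}/data/**/*.parquet', "
            f"{creds}, 'Parquet')"
        ),
        # Method 2: point at the table root so ClickHouse reads the
        # Iceberg metadata first and then only the files it references.
        "iceberg_function": f"SELECT count() FROM iceberg('{TABLE_ROOT}', {creds})",
        # Method 3: query through a database backed by the REST catalog.
        # The namespace-qualified table name has to be quoted in backticks.
        "rest_catalog": "SELECT count() FROM data_lake.`data.bids`",
    }


for method, sql in access_queries().items():
    print(f"{method}: {sql}")
```

Only the first query depends on knowing the physical file layout; the other two let the Iceberg metadata or the catalog decide which files belong to the table.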

[25:11] – Demo 2: All Three Methods Live, Plus Spark and Consistency Check

Robert: Now let us go ahead and try all three approaches live.

Raw S3 approach. We have done no preparation in ClickHouse. We just happen to know where the data is. We do a SELECT COUNT on those Parquet files using the S3 table function, then SELECT *, and we get our data. Four rows.

Iceberg table function approach. Providing the path to the Iceberg metadata, ClickHouse resolves the table and reads the files. By the way, you see two blocks in the result because there are two files underneath and ClickHouse treats them as separate partitions. Same four rows.

REST catalog approach. I have gone ahead and set up the database already to save time. You can see the database called data_lake in the SHOW DATABASES output. SHOW CREATE DATABASE data_lake reveals it is an Iceberg database that knows where the storage is and where the REST catalog is. SHOW TABLES FROM data_lake shows exactly one table. We can do SHOW CREATE TABLE on it, which shows it is a Data Lake table using the Iceberg table engine created automatically. And running a SELECT COUNT and SELECT * confirms the same four rows.

Spark delete and consistency check. ClickHouse cannot yet write to Iceberg, but Spark can. Let us fire up Spark and read the data first. And yes, Spark is very slow compared to ClickHouse. Same data. Now let us delete some rows in Spark: we wipe out anything where the bids are below a certain threshold. That leaves two rows. Now let us check: Python sees two rows. ClickHouse via the Iceberg table function sees two rows. ClickHouse via the REST catalog also sees two rows. They are all looking at the same data.

Why direct S3 reads return wrong answers. Now let us go back and read the data using the raw S3 table function. It still returns four rows. Here is the problem: Iceberg keeps the old data files around because it has a notion of snapshots, and you can ask to read an older version of the data. But the S3 table function does not know about versions. It does not know which files are valid at any particular time and so returns the wrong answer. This is the reason why, even though it is convenient to read data directly from S3, you cannot use it reliably for Iceberg. It will give you wrong answers. You need to use either the Iceberg table function or the catalog integration.
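The stale read can be reproduced with a small toy model. This is a sketch under stated assumptions, not how Spark or Iceberg implement deletes internally: the store, the snapshot list, and the helper names are hypothetical, and the data is arranged so that the delete drops one whole file from the current snapshot while the file itself stays in storage.

```python
import fnmatch

# Stand-in for object storage: path -> rows of (symbol, bid). Hypothetical data.
OBJECT_STORE = {
    "warehouse/bids/data/day=2025-01-21/a.parquet": [("AAPL", 90.0), ("MSFT", 80.0)],
    "warehouse/bids/data/day=2025-01-22/b.parquet": [("AAPL", 195.0), ("MSFT", 410.0)],
}

# Each snapshot is the list of files that are valid for that table version.
snapshots = [sorted(OBJECT_STORE)]


def delete_where(predicate):
    """Add a new snapshot without the deleted rows; old files stay in storage."""
    new_files = []
    for path in snapshots[-1]:
        kept = [row for row in OBJECT_STORE[path] if not predicate(row)]
        if len(kept) == len(OBJECT_STORE[path]):
            new_files.append(path)                 # untouched file is reused
        elif kept:
            new_path = path.replace(".parquet", "-rewritten.parquet")
            OBJECT_STORE[new_path] = kept          # partially deleted file is rewritten
            new_files.append(new_path)
        # a fully deleted file is simply left out of the new snapshot
    snapshots.append(new_files)


def read_via_metadata():
    """Metadata-aware read: only files in the current snapshot."""
    return [row for path in snapshots[-1] for row in OBJECT_STORE[path]]


def read_via_glob(pattern):
    """Raw read: every file whose path matches the wildcard."""
    return [
        row
        for path in sorted(OBJECT_STORE)
        if fnmatch.fnmatchcase(path, pattern)
        for row in OBJECT_STORE[path]
    ]


delete_where(lambda row: row[1] < 100.0)
print("metadata-aware:", len(read_via_metadata()), "rows")
print("raw wildcard:  ", len(read_via_glob("warehouse/bids/data/*/*.parquet")), "rows")
```

The wildcard read has no way to tell that the first file no longer belongs to the table, which is the same failure the raw S3 table function shows in the demo.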

[32:44] – Clustered Reads: Distributing Iceberg Queries Across Multiple Nodes

Robert: Let us talk about building data lake applications with ClickHouse. More features are coming.

When you are reading from S3, it is really important to be able to do reads in parallel not just across the vCPUs of a single server but across servers. There is an Iceberg cluster function that works very much like S3Cluster. You give it a cluster name, which is a list of servers, and the path, and it uses the resources of all of those servers simultaneously to do reads.

Here is how it works internally. You do a SELECT, it comes in, and the server it hits is what we call the initiator node. That initiator node reads the Iceberg metadata, finds the file locations, and then parcels those file locations out to the nodes in the cluster that actually do the reads. At this level you can also do partition pruning: if you are only looking for data from today, you can skip all files that are not from today. The files get distributed to the worker nodes, they do their reads, and the results get streamed back to the initiator for the final result. This is very powerful because it lets you throw a lot of compute at a read and also gives each worker its own network connection, increasing total network bandwidth.
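A minimal sketch of the initiator's planning step, assuming each data file carries its partition value in metadata. The function name, the file list, and the worker names are hypothetical, and the round-robin assignment is a simplification chosen for brevity; it is not a description of how ClickHouse actually schedules files across nodes.

```python
from itertools import cycle


def plan_cluster_read(data_files, workers, wanted_day):
    """Prune files by partition value, then spread the survivors across workers."""
    # Partition pruning: skip files that cannot contain the requested day.
    pruned = [f for f in data_files if f["day"] == wanted_day]
    # Simplified distribution: hand files to workers in round-robin order.
    plan = {worker: [] for worker in workers}
    for worker, data_file in zip(cycle(workers), pruned):
        plan[worker].append(data_file["path"])
    return plan


# Hypothetical file list as it might be extracted from Iceberg metadata.
FILES = [
    {"path": "day=2025-01-21/a.parquet", "day": "2025-01-21"},
    {"path": "day=2025-01-22/a.parquet", "day": "2025-01-22"},
    {"path": "day=2025-01-22/b.parquet", "day": "2025-01-22"},
    {"path": "day=2025-01-22/c.parquet", "day": "2025-01-22"},
    {"path": "day=2025-01-22/d.parquet", "day": "2025-01-22"},
    {"path": "day=2025-01-21/b.parquet", "day": "2025-01-21"},
]

plan = plan_cluster_read(FILES, ["ch-1", "ch-2"], "2025-01-22")
for worker, paths in plan.items():
    print(worker, paths)
```

Each worker then reads only its own files over its own network connection and streams partial results back to the initiator.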

[35:09] – Parquet Read Optimizations: Bloom Filters, minmax, PREWHERE, and Filesystem Cache

Robert: There are a lot of optimizations available for Iceberg Parquet reads. At the lowest level, there has been quite a bit of recent work on making Parquet fast. Bloom filter indexes in Parquet are now supported, and we contributed that work. Parquet has a notion of row groups, and you can read minmax values to do pruning so you do not read all of the Parquet data. Work is ongoing on PREWHERE support for Parquet: if you do not have an index available but you are filtering on a column, ClickHouse can scan that column first, much like an index scan, and use it to filter row groups before reading them.
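The minmax idea can be shown in a few lines. This is a toy model with hypothetical statistics, not the Parquet reader in ClickHouse: each row group records the minimum and maximum of a column, and a range filter only needs to read the row groups whose range overlaps the filter.

```python
def row_groups_to_read(row_groups, column, low, high):
    """Return indexes of row groups whose [min, max] range overlaps [low, high]."""
    selected = []
    for index, row_group in enumerate(row_groups):
        stats = row_group["stats"][column]
        # Two closed ranges overlap when each one starts before the other ends.
        if stats["max"] >= low and stats["min"] <= high:
            selected.append(index)
    return selected


# Hypothetical per-row-group statistics for a "bid" column.
ROW_GROUPS = [
    {"stats": {"bid": {"min": 10, "max": 90}}},
    {"stats": {"bid": {"min": 80, "max": 150}}},
    {"stats": {"bid": {"min": 200, "max": 400}}},
]

# WHERE bid BETWEEN 100 AND 250 can skip the first row group entirely.
print(row_groups_to_read(ROW_GROUPS, "bid", 100, 250))
```

Row groups that survive this check may still contain no matching rows; the statistics only prove which groups can be skipped safely.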

ClickHouse has very good caching for S3. You can set up a filesystem cache, and as your queries pull from Parquet data, they can take advantage of this cache. If you touch the same Parquet files from the same server, the reads will be vastly faster because they are on local disks. And once they are on the filesystem, they will go into the OS buffer cache, so you are effectively reading from memory.

There are two other kinds of caching still in progress that are necessary to make Parquet reads on Iceberg really fast. The first is caching Iceberg metadata, so you do not have to go to the REST catalog server on every single query. The second is caching Parquet metadata, which includes schema information and row group statistics, analogous to how ClickHouse already caches index contents so it does not have to re-read them every time. This work is ongoing and should be available soon. Parquet reads are becoming very capable and will quickly become competitive with MergeTree for cold data.

Here is how to set up the filesystem cache today. You add the XML configuration to teach the ClickHouse server how to set up the cache and then use a couple of settings to invoke it for your queries. It can be smoothed out over time, but this is how you can do it now.

[38:03] – Combining Hot MergeTree and Cold Iceberg Data with the Merge Table Engine

Robert: One of the things about ClickHouse is how much it can already do with data lakes. Here is a great example. It is very common to want to have hot data inside MergeTree because it is super fast for a variety of reasons, but keep the cold data on object storage in Iceberg. You can actually do that today.

First, create a local MergeTree table using the column definitions from your data lake table. Then use the Merge table engine. The Merge table engine is a specialized engine that says: here is a collection of tables, just read them as if they were a single table and union everything back to me. In this example, we CREATE TABLE all_bids with a regular expression that picks up both the local MergeTree table and the Iceberg-backed table, and when you SELECT you get both your hot data and your cold data back.
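Conceptually, the Merge engine behaves like a union over every table whose name matches a regular expression. The toy model below illustrates only that behavior. The table names, rows, and helper function are hypothetical, and it does not show the actual ClickHouse DDL from the slides.

```python
import re

# Hypothetical tables: one hot (local MergeTree) and one cold (Iceberg-backed).
TABLES = {
    "bids_local": [("2025-01-22", "AAPL", 195.0)],
    "bids_iceberg": [("2024-06-01", "AAPL", 170.0), ("2024-06-01", "MSFT", 380.0)],
    "trades_local": [("2025-01-22", "AAPL", 100)],
}


def select_from_merge(tables, name_regex):
    """Union the rows of every table whose name matches the regular expression."""
    pattern = re.compile(name_regex)
    rows = []
    for name in sorted(tables):
        if pattern.search(name):
            # Keep the source table name next to each row for illustration.
            rows.extend((name, row) for row in tables[name])
    return rows


all_bids = select_from_merge(TABLES, r"^bids_")
for source, row in all_bids:
    print(source, row)
```

The caller sees one result set and does not need to know which rows came from local storage and which came from the data lake.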

This is a very powerful technique for what you might call poor man’s tiered storage. The key thing is you get a single SQL interface and you can see everything: the local data and the Iceberg data through one query. This is something we plan to leverage going forward to make tiered storage to Iceberg much easier to use.

[40:46] – What ClickHouse Has and Has Not Yet in Data Lake Support

Robert: Let us get real about what ClickHouse can and cannot do today, and then look at the roadmap.

What ClickHouse has already: Reading and writing to S3 storage is very capable. We have performance-tested this over the years and it is excellent. Reading in parallel across many S3 objects is fast and efficient. Reading and writing Parquet has gotten much better over the last two years. Connecting to Iceberg catalogs was added in December of last year by a contributor at ClickHouse Inc. It is a great step forward: Iceberg tables now just look like ClickHouse tables in a database. And there is a whole raft of optimizations coming.

What ClickHouse cannot yet do: It cannot write to Iceberg or other open table formats from within ClickHouse. It cannot archive data from MergeTree to data lakes using native tiered storage. It cannot do compaction, which is important for large tables where you want to coalesce small blocks written during initial insertion into larger blocks. And it does not yet have autoscaled parallel query: the ability to spin up and tear down compute resources elastically using something like Kubernetes to match query demand.

[43:25] – Roadmap: ClickHouse Inc. and Altinity 2025

Robert: Here is what the ClickHouse Inc. roadmap looks like for 2025: PREWHERE support for Parquet, partition pruning for Iceberg, writing to Iceberg tables, and compaction. There is very active work here. Not only are there people at ClickHouse Inc. putting this stuff in but also a wide range of people from different organizations contributing. This stuff is getting better fast.

On the Altinity side, our focus areas this year include flexible scaling of S3 and Iceberg queries. We will have something fun to show in February on this. Efficient writes to Parquet, so you can export data out of ClickHouse into Parquet efficiently. Open-source compaction and transformation, so you can read data, transform it in various ways, and put it back into the data lake. Tiered storage: we contributed some key parts of the existing TTL move functionality, and we are going to focus on getting those to work for archiving MergeTree tables out to Iceberg data lakes. And we are planning to host an Iceberg catalog service in Altinity.Cloud so you do not have to run the catalog yourself.

[45:52] – Medallion Architecture: Designing on the Emerging ClickHouse Data Lake Stack

Robert: A couple of people mentioned medallion architectures when I was talking with them before setting up this talk. A medallion architecture is a way of dealing with data coming in from sources that may be raw or dirty.

You have a pool of raw data called the bronze layer: just raw data dumps. From those you do some cleanup to get a filtered and clean silver layer. Then you build your gold layer: the aggregates that allow people to get fast answers. You can think of the gold layer as your materialized views in ClickHouse.

Can you build this system right now in ClickHouse? The answer is yes, kind of, but not all the parts are there. You are going to have to combine with other types of applications that can also see the data lake.

Kafka can be combined with a Kafka-to-Iceberg connector, such as the one built by the Tabular folks, which will dump event stream data into Iceberg. That handles the ingest into the bronze layer. Filtering and cleaning involves both reading and writing, which ClickHouse cannot yet do on the writing side, so you would use Spark for that silver stage. But then it gets interesting. Once the data is cleaned, ClickHouse can read it. And you could pull that cleaned data down and build your own aggregates, your hand-crafted gold layer materialized views, by running scheduled jobs that select from Iceberg and insert into MergeTree.

Over time, as ClickHouse gets write support for Iceberg and compaction, you will be able to drop Spark and replace it with ClickHouse for the bronze-to-silver transformation. And what you will end up with is: raw source data out on Iceberg, cleaned-up data out on Iceberg, and aggregates stored inside ClickHouse MergeTree because that makes them fast. That is a really capable system and opens up interesting new horizons for using ClickHouse speed while taking advantage of shared data that everybody else can see.
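The three layers can be illustrated with a small in-memory pipeline. This is a sketch of the data flow only: in the architecture described here, bronze and silver would live in Iceberg and the gold aggregates in MergeTree, while the records, field names, and cleaning rules below are hypothetical.

```python
# Bronze: raw event dumps, including malformed records.
BRONZE = [
    {"symbol": " aapl", "bid": "190.5"},
    {"symbol": "MSFT", "bid": "n/a"},     # unparsable bid
    {"bid": "10"},                        # missing symbol
    {"symbol": "AAPL", "bid": 195},
    {"symbol": "msft", "bid": "410"},
]


def to_silver(bronze):
    """Clean step: drop malformed records and normalize names and types."""
    silver = []
    for record in bronze:
        try:
            cleaned = {
                "symbol": record["symbol"].strip().upper(),
                "bid": float(record["bid"]),
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # skip records that cannot be cleaned
        silver.append(cleaned)
    return silver


def to_gold(silver):
    """Aggregate step: per-symbol bid count and maximum bid."""
    gold = {}
    for record in silver:
        entry = gold.setdefault(record["symbol"], {"count": 0, "max_bid": record["bid"]})
        entry["count"] += 1
        entry["max_bid"] = max(entry["max_bid"], record["bid"])
    return gold


silver = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)
```

The gold step is the part that maps to scheduled jobs selecting from Iceberg and inserting into MergeTree; the silver step is what Spark handles until ClickHouse can write to Iceberg.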

[49:14] – Design Options: Pure MergeTree vs. Raw S3 vs. ClickHouse Plus Data Lake

Robert: These are your options today.

Pure MergeTree: You can build a medallion architecture entirely inside ClickHouse MergeTree. There is a nice article on the ClickHouse blog that shows how to do this. It is integrated and fast, but the data is not shared. If you have very large data sets where other people are putting the data in, it will not work because you do not have the sharing.

Raw S3: You can kind of cobble something together on S3. Fast read and write, sharing is possible, but as we saw, it has to be done very carefully. The S3 table function does not know which file versions are valid and you can get wrong answers. You are also basically building without Iceberg, so you would be using Hive-style partitioning at best and you do not have the metadata.

ClickHouse plus data lake: Available now or emerging now. You can have local data on MergeTree, shared data in Iceberg, and it is integrated: you can treat Iceberg tables as ClickHouse databases. The sharing is excellent as the demos showed. Reads are fast and will get a lot faster. The integration with ClickHouse is great, so your users do not have to think about file paths and storage locations.

What we will see is that this ClickHouse-plus-data-lake option is going to open up further over the next year. Writes and tiered storage are coming, and at that point it gets very interesting.

[51:35] – Managed Data Lakes: AWS S3 Table Buckets, Databricks, Snowflake Polaris

Robert: One final thing on your radar: managed data lakes. This is what we saw with Kubernetes a few years ago. When Kubernetes first emerged, everybody ran it themselves using tools like kops. Pretty quickly the cloud providers realized managed Kubernetes was a great business because it helps people consume a lot of compute and storage. They built things like EKS, which you can run for about $70 a month on Amazon, with everything else just being the cost of the VMs. It is now relatively rare for people running on Amazon or Google to run Kubernetes themselves.

The same thing is now happening with data lake catalogs. Catalogs are an obvious thing to manage. This is developing very quickly because some catalogs already existed, like AWS Glue. Databricks is building a hosted catalog based on Unity. Snowflake has Polaris. Rather than running it yourself, your cloud provider can do it, and we are also planning to do that in Altinity.Cloud very shortly.

The more profound development is that the actual data lake storage itself will be managed. Databricks, who invented the data lakehouse concept, is very focused on managing this for you. Onehouse is another new provider. The most interesting development for my money is AWS S3 Table Buckets: a new kind of S3 bucket specifically designed to hold Iceberg data. It knows about Iceberg metadata and will not let you use the bucket in ways that violate Iceberg rules. For example, you cannot just list files and read them. It also has performance optimizations for the type of API access patterns typical in Iceberg. We will probably see every cloud provider do something similar.

[54:47] – Summary

Robert: A few summary points.

There is an emerging solution now for a single copy of data on cheap storage. People talked about data lakes ten years ago when Hadoop was big and it was a mess. This stuff actually works. We have seen enough applications built on it and there is enough momentum that it looks like it is now becoming viable.

Iceberg and Parquet look really solid. It does not really matter whether they are the best choices. A lot of people are using them, and there is a network effect. Databricks for example is betting big on Iceberg: that is why they bought Tabular, and Snowflake is doing the same.

Iceberg and Parquet reads will be competitive with MergeTree for cold data over the course of this year. There are some things they do not do so well, but for cold, shared data, they have enough advantages that they are something you will want to use. Keep your cold data on the data lake, let everybody else look at it, and use ClickHouse for the fast, hot analytical layer.

Writes and tiered storage will emerge over the course of this year.

There is just way more in ClickHouse already than you might think. Things like the way threads are used, max_insert_threads, and other options continue to work exactly as they do in the rest of ClickHouse and will work when talking to data lakes. You can combine things in powerful ways.

Managed data lakes and storage are coming from every direction, including Altinity.Cloud.

Stay up to date. You want the latest builds to have these features. We at Altinity are going to be introducing a new type of build called an edge build that lets you get features much closer to the ClickHouse head, much faster.

References: The Altinity blog will be covering this material in depth. I highly recommend the Databricks 2021 Lakehouse paper for understanding the architecture: it is an excellent reference even four years on and really laid out the principles of this approach.

FAQ Section

Q: What is a data lake and how does it differ from a data warehouse like ClickHouse?

A: A data lake is a storage architecture that takes the internal storage mechanisms of a data warehouse, such as tabular schema, file formats, partitioning, and metadata, and makes them accessible to any application via open APIs. In a data warehouse like ClickHouse, all of this is hidden. The query engine manages the storage, and applications connect through the database server. In a data lake, the storage layer is separated out: data lives in open formats like Parquet on S3-compatible object storage, organized by an open table format like Iceberg. Any application, whether it is Python, Spark, ClickHouse, or a machine learning framework, can read or write the same data directly. The key benefit is a single shared copy of data that multiple workloads can use simultaneously, without pushing everything into a proprietary cloud service.

Q: Why is Apache Iceberg the recommended table format for data lakes?

A: Iceberg has emerged as the leading open table format because of broad industry adoption and momentum. Databricks acquired the company Tabular specifically to double down on Iceberg. Snowflake supports it through Polaris. AWS has built Iceberg-specific support into S3 Table Buckets. It provides important features including schema evolution, partition evolution, time travel (reading historical snapshots), and ACID-style transaction support for data lake operations. The Parquet file format, which Iceberg uses for data storage, is also extremely widely supported across Python, Java, and Spark ecosystems. While Delta Lake and Hudi are also viable, the network effect around Iceberg makes it the safest bet for new data lake projects.

Q: What are the three ways ClickHouse can read Iceberg data, and when should I use each?

A: The first way is the S3 table function with wildcard paths, which reads Parquet files directly from object storage without consulting any Iceberg metadata. It is fast and requires no setup, but it cannot distinguish valid from invalid file versions in an Iceberg table, so it can return wrong results if data has been deleted or updated. Use it only when you control the files entirely and Iceberg semantics are not involved. The second way is the Iceberg table function, which reads the Iceberg metadata files to identify exactly which Parquet files belong to the current table snapshot. It is accurate and metadata-aware, and does not require catalog setup. The third way, and the most powerful, is creating a ClickHouse database backed by an Iceberg REST catalog. Once configured, Iceberg tables appear as ordinary ClickHouse tables and you can query them with standard SQL without specifying any file paths. For production Iceberg workloads, the catalog-integrated approach is the right choice.

Q: What can ClickHouse not yet do with Iceberg data lakes, and what is coming?

A: As of this webinar, ClickHouse cannot write to Iceberg tables, cannot archive MergeTree data to Iceberg using native tiered storage, cannot perform compaction of small Iceberg parts into larger ones, and does not have autoscaled parallel query that can elastically provision and release compute resources via Kubernetes. These are all on the roadmap. ClickHouse Inc. is working on writing to Iceberg tables, PREWHERE support for Parquet, and partition pruning. Altinity is working on flexible scaling of S3 and Iceberg queries, efficient writes to Parquet, open-source compaction, tiered storage archiving from MergeTree to Iceberg, and hosting an Iceberg catalog service in Altinity.Cloud.

Q: How does the Merge table engine help combine hot MergeTree and cold Iceberg data?

A: The Merge table engine is a ClickHouse table type that takes a collection of tables and presents them as a single unified table, unioning all the data. You can create a local MergeTree table for your hot, recently ingested data, and point the Merge table at both the local MergeTree table and an Iceberg-backed table for cold historical data. A single SELECT query against the Merge table returns data from both storage locations transparently. This is an effective way to do tiered storage today: keep hot data fast and local, keep cold data cheap and shared on Iceberg, and give users a single SQL interface that sees everything.
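
A minimal sketch of that setup, written as a Python helper that only generates the DDL text. It assumes two hypothetical tables in the same database with compatible columns: `events_hot` (MergeTree) and `events_cold` (a table over the Iceberg data). These names are illustrative and not from the webinar. The Merge engine takes a database name and a regular expression over table names, so the sketch also checks locally which names the pattern would pick up.

```python
import re


def build_merge_table_ddl(name: str, template_table: str, table_regex: str) -> str:
    """Build DDL for a Merge table that unions every table in the current
    database whose name matches table_regex, copying the column
    definitions from template_table.
    """
    return (
        f"CREATE TABLE {name} AS {template_table}\n"
        f"ENGINE = Merge(currentDatabase(), '{table_regex}')"
    )


if __name__ == "__main__":
    regex = "^events_(hot|cold)$"
    # Sanity-check the pattern before creating anything: it should select
    # the hot and cold tables but not the Merge table itself.
    for table in ["events_hot", "events_cold", "events_all", "users"]:
        print(table, bool(re.search(regex, table)))
    print(build_merge_table_ddl("events_all", "events_hot", regex))
```

Anchoring the pattern with `^` and `$` keeps it from accidentally matching other tables, including the Merge table's own name.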

Q: Does Iceberg provide backup or disaster recovery?

A: No. Iceberg is a table format that manages metadata and data file organization on object storage. It does not provide backup or disaster recovery. Those are responsibilities of the application or infrastructure layer on top of it. Iceberg does provide snapshots, which enable time travel reads of historical data versions, but this is not a substitute for a proper backup strategy. You need to implement backup separately, for example by using object storage versioning, replication to another bucket or region, or a dedicated backup tool.

© 2026 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
