September Project Antalya Roundup: Fresh Features to Make ClickHouse® Run Faster and Cheaper on Data Lakes

Recorded: September 10 @ 08:00 am PDT
Presenters: Alexander Zaitsev & Robert Hodges

In this webinar, Altinity CEO Robert Hodges and CTO Alexander Zaitsev deliver a September roadmap update for Project Antalya, Altinity’s open source initiative to extend ClickHouse with Apache Iceberg and Parquet as the primary storage layer. Robert frames the problem: as data volumes at modern companies reach massive scale, the shared-nothing architecture of traditional ClickHouse clusters becomes both expensive and difficult to scale, with block storage replication running up to 10x the cost of S3 object storage. Project Antalya addresses this by moving shared data onto Iceberg, enabling stateless swarm clusters for scalable query compute, and building in a catalog to manage the metadata efficiently.

Alexander then walks through three major new features. First, full support for AWS S3 Table Buckets, a managed Iceberg-over-S3 service from AWS that includes built-in compaction, exposed through the Altinity ice REST catalog acting as an authentication proxy. Second, a set of catalog management operations now available in the ice REST catalog, including manifest compaction, data compaction, snapshot cleanup, and orphan cleanup, which can be run either as part of the catalog service or on an external schedule. Third, current approaches and near-term plans for writing data from ClickHouse into Iceberg, including the ice command-line tool, ice watch mode (event-driven automatic snapshot commits), and the upcoming ALTER TABLE EXPORT PARTITION command planned for the Antalya 25.8 build. The session closes with a broader roadmap preview covering the upcoming 25.6-based Antalya build, infinity tables (seamless MergeTree and Iceberg tiering from a single table interface), and ingest swarms.

Key Moments (Timestamps)

Key moments generated with AI assistance.

0:12 – Welcome and session overview
1:01 – About Altinity: what we do and our open source projects
3:14 – Why ClickHouse needs a new storage model: cost and scaling challenges
8:04 – What is Project Antalya: shared Iceberg storage and swarm clusters
13:26 – Alexander introduces recent and upcoming Antalya features
15:46 – Feature 1: Full support for AWS S3 Table Buckets
22:00 – Feature 2: Catalog management operations in the ice REST catalog
26:06 – Feature 3: Writing from ClickHouse to Iceberg, current approaches and next steps
33:20 – Upcoming: ALTER TABLE EXPORT PARTITION
36:23 – Roadmap: 25.6 build, infinity tables, and ingest swarms
44:43 – Q&A: Delta vs. Iceberg, catalog choices, ice watch, GCP BigLake, aggregation in Iceberg

Webinar Transcript

[0:12] – Welcome and Session Overview

Robert: Hi, everybody, and welcome to our webinar on fresh features to run ClickHouse faster and cheaper on data lakes. We’re going to give everyone one minute to come in, and then Alexander and I will kick things off.

Okay, let’s get going. I’d like to welcome you all to our September Project Antalya roadmap, where we’ll be talking about fresh features to run ClickHouse faster and cheaper on data lakes. This is a snapshot of current development. We’ll give you some context on what Project Antalya is, and then talk about some really cool new features that have landed over the last couple of months. My name is Robert Hodges. I’m CEO of Altinity. With me today is Alexander Zaitsev, who is our CTO and is driving everything.

[1:01] – About Altinity: What We Do and Our Open Source Projects

Robert: A little bit about Altinity, in case you don’t know us. We are a vendor for ClickHouse. We started in 2017. We provide enterprise support for ClickHouse, however you choose to run it: everything from small servers to massive clusters in public clouds. That’s about half our customers. The other half run in Altinity.Cloud, which is a managed ClickHouse solution. It can run in our accounts or in your accounts with a bring-your-own-cloud (BYOC) model. The BYOC model is very popular, pretty easy to set up, and works across four different clouds, from Amazon to Hetzner.

Just a couple of housekeeping items. First: we are not ClickHouse, Inc. People sometimes get confused. We are actually a competitor to ClickHouse, Inc., although since this is open source, we collaborate on many things to make ClickHouse better. That includes making hundreds of contributions to upstream ClickHouse over the years. We are the authors of the Altinity Kubernetes Operator for ClickHouse. We work on Altinity Backup for ClickHouse, the Grafana community plugin, and a bunch of other ecosystem tools that make ClickHouse systems run better.

Second: this webinar is being recorded. Shortly after it ends, you will get a link to the recording and the slides in PDF form. And finally, we have plenty of time for questions. You can pop them into the chat or the Q&A box. Alexander and I are tag-teaming this, and we’ll have time at the end to go deeper on anything we’ve discussed or on ClickHouse in general.

[3:14] – Why ClickHouse Needs a New Storage Model: Cost and Scaling Challenges

Robert: This talk assumes you are here because you love ClickHouse and want to see it get better. Let’s level set. When ClickHouse was first developed, it came out with what’s called a shared-nothing architecture, which is how most analytics systems were built starting in the late 1990s. The way it works: you have ClickHouse servers with attached block storage mounted off the file system. Those servers communicate over a network, and in order to implement sharding and replication, you need some record of who owns which parts of tables. That job belongs to ClickHouse Keeper, which helps manage replication and sends tasks across the cluster.

Over time, ClickHouse opened up the ability to store data in S3-compatible object storage, as well as the native object storage of Azure and GCP. It started with reading and writing from object storage, but increasingly led to people building systems with shared data on S3 itself. That’s a lot of what we’ll be talking about today.

This architecture became phenomenally successful, with thousands of users, but there are now some real problems. The biggest is that data sizes have become enormous, even at small companies. We have a customer that puts in about 20 petabytes of uncompressed data every day. They run hundreds of servers to absorb that load, and it is putting serious strain on the shared-nothing design of ClickHouse. Let’s look at what those strains are.

The key problem is cost. If you have very large amounts of data, storage becomes very expensive. On a shared-nothing model where all tables are stored on block storage replicated two, three, or four times, storage can cost up to 10x more than keeping the same data in S3. The second problem is that ClickHouse, as most of you know, is a single binary that handles everything: ingesting data through inserts or from Kafka, merging data in the background, applying mutations, and processing SELECT queries. Because all of these run together, you cannot scale them independently. People end up overprovisioning servers, which wastes capacity. And even with overprovisioning, heavy load days, like a busy trading session where every analyst hits the database simultaneously, can exhaust capacity before you can scale up. This creates instability in addition to the cost problem. This is the problem we want to solve.
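To make the cost argument concrete, here is a back-of-the-envelope sketch in Python. The per-GB prices are illustrative assumptions, roughly in line with public list prices for general-purpose block storage and standard object storage. They are not figures from the talk, so substitute your own.

```python
# Rough monthly storage cost comparison: replicated block storage vs. object storage.
# All prices are illustrative assumptions, not numbers quoted in the webinar.

BLOCK_PRICE_PER_GB_MONTH = 0.08    # assumed block storage price (USD)
OBJECT_PRICE_PER_GB_MONTH = 0.023  # assumed object storage price (USD)


def monthly_cost(data_gb: float, price_per_gb: float, copies: int = 1) -> float:
    """Cost of keeping `copies` full copies of `data_gb` for one month."""
    return data_gb * price_per_gb * copies


def cost_ratio(data_gb: float, replicas: int) -> float:
    """How many times more replicated block storage costs than one object-storage copy."""
    block = monthly_cost(data_gb, BLOCK_PRICE_PER_GB_MONTH, copies=replicas)
    obj = monthly_cost(data_gb, OBJECT_PRICE_PER_GB_MONTH)
    return block / obj


if __name__ == "__main__":
    for replicas in (2, 3, 4):
        print(f"{replicas} replicas: {cost_ratio(100_000, replicas):.1f}x")
```

Under these assumed prices, three block-storage replicas land at roughly ten times the cost of a single object-storage copy. The sketch ignores request charges, compression differences, and compute.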

[8:04] – What Is Project Antalya: Shared Iceberg Storage and Swarm Clusters

Robert: The solution that allows you to decouple compute from storage is shared object storage, the approach that analytics systems have converged on. You put the shared table data on object storage, and then different groups of servers pull the data and work on it independently. Snowflake, for example, calls those groups virtual data warehouses. You can have one virtual data warehouse for loading data, one for fast dashboard queries, and another for long batch reports. That model is implemented in ClickHouse Cloud, but it is not open source.

The question is: how do we offer this in open source ClickHouse? That is what Project Antalya is about. After a lot of research and work on Parquet, our conclusion was to build shared object storage on Iceberg. Iceberg is a very popular open table format that lets you define, query, and update tables living on S3. The files themselves are mostly Parquet, another open format that has become very prominent in recent years. For an open source project, this is a natural choice: Iceberg has a large community behind it, and the question is just how to make it work well as storage for ClickHouse.

The way we do that is to first invest in reading from this storage as efficiently as possible, including fast Parquet reads. On top of that, we added swarm clusters. Swarm clusters are stateless ClickHouse servers that work together as a group. You can delegate query processing on Iceberg out to the swarm and have as much or as little compute as you need. This increases your bandwidth to object storage so you get answers back more quickly. It also gives you reserves of CPU for compute-intensive analytics, and just as importantly, reserves of memory, which is what makes things fast: keeping Parquet metadata and hot data blocks in memory. This is a 100% open source feature, first introduced in the Antalya builds. It was announced in April and has been available in builds ever since.

We also run Altinity.Cloud. We have users running the Antalya stack there today. It is completely compatible with standard ClickHouse: switching to an Antalya build and then building out swarm clusters can be done in a minute or two using our cloud tooling. An important part of the Antalya stack in Altinity.Cloud is a built-in catalog. Iceberg stores table metadata next to the data in object storage by default, but that is not efficient for quickly enumerating tables or finding the files for a query. A catalog solves this by providing a REST API that answers questions like: what tables do I have, give me the metadata for a particular table, and what files need to be scanned for this query. That catalog is built in. Now I’ll hand it over to Alexander.
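The catalog questions Robert lists map onto routes in the Iceberg REST catalog specification. The sketch below only builds the request URLs for "what tables do I have" and "give me the metadata for a table". The base URL is a placeholder, the helper names are hypothetical, and single-level namespaces are assumed for simplicity.

```python
from urllib.parse import quote


def _root(base_url: str, prefix: str = "") -> str:
    """Return the versioned API root, with the optional catalog prefix appended."""
    root = base_url.rstrip("/") + "/v1"
    return f"{root}/{quote(prefix, safe='')}" if prefix else root


def list_namespaces_url(base_url: str, prefix: str = "") -> str:
    """URL for listing namespaces (GET)."""
    return f"{_root(base_url, prefix)}/namespaces"


def list_tables_url(base_url: str, namespace: str, prefix: str = "") -> str:
    """URL for listing the tables in one namespace (GET)."""
    return f"{_root(base_url, prefix)}/namespaces/{quote(namespace, safe='')}/tables"


def load_table_url(base_url: str, namespace: str, table: str, prefix: str = "") -> str:
    """URL for loading one table's metadata (GET)."""
    return f"{list_tables_url(base_url, namespace, prefix)}/{quote(table, safe='')}"


if __name__ == "__main__":
    # "http://localhost:5000" is a placeholder, not a documented default.
    print(load_table_url("http://localhost:5000", "nyc", "taxis"))
```

A real client would also send an authorization header, for example a bearer token, with each request.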

[13:26] – Alexander Introduces Recent and Upcoming Antalya Features

Alexander: Thank you, Robert. Hello, everybody. Let me briefly recap what we’ve done so far and then tell you about some new features that are now available as part of Project Antalya.

We started with the query story: how to run queries in the most efficient way on Iceberg and Parquet data. That’s where we introduced the swarm execution model. We also introduced a number of extra caches on Parquet and Iceberg, like the Parquet metadata cache, which don’t always exist in upstream ClickHouse. After we launched this project publicly, we added support for SELECT queries on Iceberg, which required changes to ClickHouse itself. We also worked on cache locality: when you have a lot of nodes serving queries from S3, you need to make sure that if data is cached locally, it is always requested from the same node. And finally, we added the ice REST catalog, which I will return to several times because it is a very important and integral part of the Antalya stack.
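One common way to get the cache locality described above is rendezvous (highest-random-weight) hashing: every file is scored against every node, and the highest score wins. The sketch below illustrates that general technique. It is an assumption for illustration, not a description of how the Antalya builds assign files to swarm nodes, and the node and file names are made up.

```python
import hashlib


def _score(node: str, file_path: str) -> int:
    """Deterministic pseudo-random score for a (node, file) pair."""
    digest = hashlib.sha256(f"{node}|{file_path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(nodes: list[str], file_path: str) -> str:
    """Return the node that should read, and therefore cache, this file."""
    if not nodes:
        raise ValueError("swarm has no nodes")
    return max(nodes, key=lambda node: _score(node, file_path))


if __name__ == "__main__":
    swarm = ["swarm-0", "swarm-1", "swarm-2"]  # hypothetical node names
    for part in ("part-0001.parquet", "part-0002.parquet"):
        print(part, "->", pick_node(swarm, f"s3://bucket/table/data/{part}"))
```

The useful property is stability: the same file always maps to the same node, and when a node leaves, only the files that node owned are reassigned.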

What we have recently done, and what I will focus on today, are three new features added in the last couple of months: full support for AWS S3 Table Buckets, new catalog management features in the ice catalog, and the current state of, and path toward, writing data from ClickHouse into Iceberg.

[15:46] – Feature 1: Full Support for AWS S3 Table Buckets

Alexander: Let’s start with S3 Table Buckets. This is a new AWS service introduced at the end of 2024. In a nutshell, it is a service that uses S3 as storage, with Parquet as the primary data format, and exposes this data using either the Iceberg REST API or the Glue API. Glue is specific to AWS; Iceberg REST is a completely vendor-independent protocol. So this is effectively a managed Iceberg service inside AWS.

What makes S3 Table Buckets especially interesting is that they include catalog management features from the very beginning. The most important is compaction. If you insert a lot of data into Iceberg very frequently, say every minute for an observability system, you end up with many thousands of small Parquet files. In ClickHouse, the merge process takes care of this for you: it combines small parts into bigger ones. In Iceberg, the catalog does nothing on its own. You need external servers to perform compaction. S3 Table Buckets provide that as a managed service. If you look around at other catalog implementations you can deploy yourself, not many include compaction out of the box. Usually you have to run Spark. S3 Table Buckets are one of the few that do, and Altinity’s ice catalog is another.

When we started working with S3 Table Buckets, we ran into a couple of problems. First, they require special authentication parameters that are specific to this service and that ClickHouse has no way to supply. We could have changed ClickHouse itself, but given the release cycle, that could take a while. So instead, we implemented a proxy inside the ice catalog, which is part of the Antalya stack. The catalog authenticates correctly to AWS and then acts as a proxy: any call ClickHouse makes to the catalog, whether listing tables or fetching the list of files for a query, gets proxied to the AWS service and the results are returned to ClickHouse.

We also discovered that the REST API exposed by S3 Table Buckets does not fully implement the Iceberg REST specification. Some operations that ClickHouse needs to perform are simply not implemented. So internally, the ice REST catalog does not use the S3 Tables REST API. It uses the direct Java AWS API, which is what Spark uses, and this way we have no limitations from the incomplete REST implementation.

From the user perspective, if you have the ice catalog deployed, you just configure it, connect to it as a normal REST catalog using a bearer token or other authentication parameters, and set the warehouse to the path to your bucket. That’s it.

The second problem we ran into is that upstream ClickHouse calls to S3 Table Buckets just did not work. In a regular Iceberg catalog, all file paths are relative to the warehouse base location. That is not true for S3 Table Buckets: the S3 paths returned are essentially random. AWS services understand these paths perfectly, but they do not match the warehouse location in any way. This limitation still exists in upstream ClickHouse and the core team has not addressed it yet. So to fix it, we made a change in the Antalya builds. For now, if you want to work with S3 Table Buckets, you need to use Antalya builds. We may contribute this fix upstream in the future.

We have also added support in Altinity.Cloud. If you are an Altinity.Cloud user and want to use S3 Table Buckets, tell us and we will enable it for your environment, whether it is SaaS or bring-your-own-cloud. Once enabled, you can connect and run queries.

[22:00] – Feature 2: Catalog Management Operations in the ice REST Catalog

Alexander: I mentioned that S3 Table Buckets have catalog management features built in, which is really valuable. For our ice catalog, which we provide both as a managed service in Altinity.Cloud and as an open source project, we have also implemented four key catalog management operations.

The first is manifest compaction. When you insert into Iceberg multiple times, every insert or deletion generates a new manifest. Over time, you can end up with many manifests referenced from the root manifest. This is a lot of metadata to read. Manifest compaction collects all the individual manifests and merges them into one, which is much more efficient to read from S3.

The second is data compaction. This is directly analogous to the ClickHouse merge: if you have many small Parquet files belonging to the same partition, it makes sense to merge them into fewer, larger files. Data compaction does exactly this, and it is currently supported in the ice catalog.
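The planning half of data compaction can be sketched as a grouping problem: collect the small files of each partition and pack them into batches of roughly a target size. This is a simplified illustration with hypothetical names, not the ice catalog's implementation. Rewriting the Parquet files and committing the new snapshot are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    size_bytes: int


def plan_compaction(files: list[DataFile], target_bytes: int) -> list[list[DataFile]]:
    """Group small files of the same partition into merge batches of about target_bytes."""
    by_partition = defaultdict(list)
    for f in files:
        if f.size_bytes < target_bytes:  # files already at target size are left alone
            by_partition[f.partition].append(f)

    batches = []
    for partition in sorted(by_partition):
        current, current_size = [], 0
        for f in sorted(by_partition[partition], key=lambda x: x.size_bytes):
            # Close the current batch when adding this file would exceed the target.
            if current and current_size + f.size_bytes > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(f)
            current_size += f.size_bytes
        if current:
            batches.append(current)

    # A batch with a single file would be a no-op rewrite, so drop it.
    return [b for b in batches if len(b) > 1]
```

Each returned batch would become one larger Parquet file. Files never cross partition boundaries, which matches the "same partition" condition in the description above.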

The third is snapshot cleanup. Every time you add new data, a new snapshot is generated. Old snapshots may reference files that are no longer in the current snapshot. Those files can only be deleted once the old snapshots referencing them are removed. Snapshot cleanup finds old snapshots, based on a configurable age threshold, and deletes them automatically.
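The dependency between old snapshots and deletable files can be sketched as follows. The model is deliberately simplified and the names are hypothetical: each snapshot is just an ID, a commit time, and the set of data files it references.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    files: frozenset


def expire_snapshots(snapshots, max_age: timedelta, now: datetime):
    """Return (kept, expired, deletable_files) for a configurable age threshold."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    if not ordered:
        return [], [], set()

    cutoff = now - max_age
    # The newest snapshot is the current table state and is always kept.
    expired = [s for s in ordered[:-1] if s.committed_at < cutoff]
    kept = [s for s in ordered if s not in expired]

    still_referenced = set().union(*(s.files for s in kept))
    expired_files = set().union(*(s.files for s in expired))
    # A file may be deleted only if no surviving snapshot references it.
    return kept, expired, expired_files - still_referenced
```

A file shared between an expired snapshot and a surviving one stays in place. Only files referenced exclusively by expired snapshots become deletable.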

The fourth is orphan cleanup. This removes unreferenced files that can appear during failures. If you start uploading data and fail partway through, or if there are inconsistencies in the logic, you end up with files that are not referenced in any snapshot. If you are familiar with the orphan file problem in zero-copy replication for ClickHouse, this is the same concept. Orphan cleanup finds all referenced files in current snapshots and deletes everything else.
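At its core, orphan cleanup is a set difference between what is stored and what is referenced. The sketch below adds one safeguard that is an assumption on our part rather than something stated in the talk: files newer than a minimum age are skipped, so that an upload still waiting to be committed is not mistaken for an orphan.

```python
from datetime import datetime, timedelta


def find_orphans(
    stored: dict[str, datetime],   # object path -> last-modified time
    referenced: set[str],          # paths referenced by any current snapshot
    now: datetime,
    min_age: timedelta = timedelta(days=1),
) -> list[str]:
    """Return stored paths that no snapshot references and that are older than min_age."""
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in stored.items()
        if path not in referenced and modified < cutoff
    )
```

The returned list is only a candidate list. A real job would delete the objects afterwards and should tolerate files that disappear in the meantime.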

These management operations can be configured and executed in two ways. You can configure them as part of the catalog service itself and let the service run them automatically. The caveat is that data compaction in particular can be memory-intensive and CPU-intensive, and since you do not need to run it very often, always-on execution may mean overprovisioning. So we also support running these operations externally on a schedule, for example as a cron job running every hour or every day, provisioning only the resources you need for the job at that time.

[26:06] – Feature 3: Writing from ClickHouse to Iceberg, Current Approaches and Next Steps

Alexander: The third major feature I want to discuss is writing data from ClickHouse into Iceberg. There is a partial implementation in upstream ClickHouse as of version 25.7 that allows writing to local Iceberg tables. In ClickHouse, you can work with Iceberg in two ways: through a catalog, which is what we are doing, or directly with data in a bucket, interacting with data and metadata files directly without any catalog. The direct approach lacks concurrency control and only lets you work with a single table at a time. It is the only use case for which the ClickHouse core team has implemented write support. Without catalog integration, it is limited in scope, and the issues requesting full catalog-aware write support have been closed for reasons that are unclear. So we need a different approach, and we have one.

We have the ice command-line tool, and I will show you two ways to use it to write data from ClickHouse into Iceberg. The first approach is straightforward: you insert data from ClickHouse, in my example even from an S3 bucket but it could be a MergeTree table as well, directly into the correct location inside your catalog warehouse. If you use an Iceberg catalog, the warehouse directory has a logical structure: you have a namespace as part of the path, a table name, and then data and metadata directories. What we do is insert the data directly into the data directory, so the Parquet files from ClickHouse go directly to the right location inside the catalog. Then we run a single ice command locally to register this data into the catalog snapshot.

We have a --no-copy flag that skips the double-copy step if the data is already in the correct warehouse location. If the data is somewhere else, you can omit this flag and ice will copy it over the network into the catalog location.
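The warehouse layout Alexander describes (namespace, then table name, then data and metadata directories) can be modeled with a small helper. The exact path template, the bucket name, and the class itself are assumptions for illustration. Check your own catalog's warehouse before relying on this layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLocation:
    """Hypothetical helper describing where one Iceberg table lives in a warehouse."""

    warehouse: str   # e.g. "s3://my-bucket/warehouse" (placeholder)
    namespace: str
    table: str

    @property
    def root(self) -> str:
        return f"{self.warehouse.rstrip('/')}/{self.namespace}/{self.table}"

    @property
    def data_dir(self) -> str:
        """Directory that ClickHouse would write Parquet files into."""
        return f"{self.root}/data/"

    @property
    def metadata_dir(self) -> str:
        return f"{self.root}/metadata/"

    def is_in_place(self, file_path: str) -> bool:
        """True if the file already sits under the table's data directory."""
        return file_path.startswith(self.data_dir)


if __name__ == "__main__":
    loc = TableLocation("s3://my-bucket/warehouse", "nyc", "taxis")
    print(loc.data_dir)
```

A check like `is_in_place` captures the decision behind the no-copy option: files already under the data directory only need to be registered, while anything else has to be copied first.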

The second approach automates this using what we call ice watch mode. In this mode, you set up an SQS service, which is straightforward to configure with Terraform, to deliver S3 bucket events. Ice can listen to those SQS events and automatically commit new files to the catalog snapshot as they arrive. Instead of running the ice command on every insert, you run it in watch mode, and it hangs waiting for events. As soon as new data appears in the watched location, it is automatically committed to the snapshot and catalog. From the ClickHouse user’s perspective, they just insert data to the right location and it automatically appears as part of the Iceberg table. No other actions needed.
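The event-driven part of watch mode can be sketched by parsing an S3 event notification as it arrives in an SQS message body. This assumes S3 delivers events directly to SQS (no SNS wrapper) and uses a hypothetical function name. It is not the ice tool's own code, and it stops at "which new Parquet files appeared", leaving the catalog commit out.

```python
import json
from urllib.parse import unquote_plus


def new_parquet_files(sqs_body: str, watched_prefix: str) -> list[str]:
    """Extract s3:// paths of newly created Parquet files under watched_prefix."""
    event = json.loads(sqs_body)
    paths = []
    for record in event.get("Records", []):
        # Only object-creation events add data; deletions and others are ignored.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        path = f"s3://{bucket}/{key}"
        if path.startswith(watched_prefix) and path.endswith(".parquet"):
            paths.append(path)
    return paths
```

A watcher loop would poll the queue, call a function like this for each message, commit the returned files to the table, and delete the message only after the commit succeeds.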

In Altinity.Cloud, we have automated this further. You do not need to run the ice command directly. If you specify a flag and which tables to watch, it runs in the background for you.

We have ready-to-run examples with Terraform scripts in our GitHub repository. However, this approach is not ideal for all cases. When you export a full month of data with automatic partitioning, using a PARTITION BY expression and the s3 table function, ClickHouse can run out of RAM because its memory management when writing Parquet is not yet very efficient. There are also some data type compatibility issues. Parquet supports nanosecond timestamp precision, but Iceberg does not. To write such Parquet files into Iceberg, you have to convert the timestamps to microsecond precision. Additionally, ClickHouse has two different Parquet implementations: the original Arrow-based one and a newer native one. The native implementation is excellent for reads, but for writes it is sometimes not fully compatible with the Parquet spec. For writing, you may want to disable the native writer. This is something we will fix.
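The timestamp conversion mentioned above amounts to dropping the last three decimal digits of a nanosecond value. A minimal sketch on raw epoch integers, with hypothetical function names (in ClickHouse itself the equivalent would be a cast to a microsecond-precision type such as DateTime64(6)):

```python
def ns_to_us(ts_ns: int) -> int:
    """Convert nanoseconds since the epoch to microseconds, rounding toward negative infinity."""
    return ts_ns // 1_000


def loses_precision(ts_ns: int) -> bool:
    """True if the value has sub-microsecond digits that the conversion discards."""
    return ts_ns % 1_000 != 0


if __name__ == "__main__":
    ts = 1_757_516_400_123_456_789
    print(ns_to_us(ts), loses_precision(ts))
```

Checking `loses_precision` before an export is one way to find out whether the conversion would silently change any values.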

[33:20] – Upcoming: ALTER TABLE EXPORT PARTITION

Alexander: To fix the memory problem and make writing more user-friendly, we are working on a new ClickHouse extension called ALTER TABLE EXPORT PARTITION. This command will allow users to write parts or partitions directly to S3 storage or to Iceberg. It is better than SELECT … INTO for several reasons. First, there is no need to re-sort or re-serialize the data: we move data from MergeTree to Parquet, preserving the same ordering and everything else that MergeTree does. Second, we take care of atomicity: the command either fully succeeds or fully fails. It is also retriable: if the server restarts during an export, the export will finish after the server comes back online. Third, it will be memory-efficient. We do not want the server to run out of RAM when someone exports a partition to S3.

We are currently working on this, and it should be available in the Antalya 25.8 build, targeting October.

Comparing what is available in upstream ClickHouse versus Antalya: some features, like swarm discovery and catalog integration, are available in both because we were able to push them upstream. Others, like swarm queries, we could not commit upstream because they conflict with the ClickHouse core team’s current direction. S3 Table Bucket support is another example where we have not yet pushed to upstream; the teams appear to be taking different architectural approaches to the path location issue. Our general approach is to push as much of our work upstream as possible, but it typically arrives in Antalya builds first.

[36:23] – Roadmap: 25.6 Build, Infinity Tables, and Ingest Swarms

Alexander: Looking at the roadmap: we are currently working hard on a new Antalya release based on upstream ClickHouse 25.6. The previous Antalya release was based on 25.3 with some backports from newer versions. This one follows the same pattern: 25.6 as the base, with backports and all Antalya features included. This release is coming in September. The most important things expected in it are much better support for AWS Glue, support for joins in swarm queries, and other improvements.

By the end of Q3 or early Q4, we will release the writing-to-Iceberg capability as part of the Antalya 25.8 build. The ALTER TABLE EXPORT PARTITION command is a major step toward this.

Later this year, we plan to release support for what we call infinity tables. From the user perspective, an infinity table is just a table. Underneath, it is a smart structure that routes queries to either MergeTree or Iceberg, or even to the swarm if you have a lot of data, depending on a configurable watermark condition. Part of the data sits in a fast MergeTree for hot data; the rest lives in Iceberg for cold data. The user sees a single table and runs queries against it without thinking about where the data lives. We are actively developing this feature now.
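The watermark routing can be pictured as splitting a query's time range in two. The sketch below is purely illustrative, since the feature is still in development. It assumes a time-based watermark, half-open ranges, and hypothetical names, and it ignores the option of sending the cold part to a swarm.

```python
from datetime import datetime


def route(query_start: datetime, query_end: datetime, watermark: datetime) -> dict:
    """Split the half-open range [query_start, query_end) at the watermark.

    Data older than the watermark is assumed to live in Iceberg (cold),
    data at or after it in MergeTree (hot).
    """
    plan = {}
    cold_end = min(query_end, watermark)
    hot_start = max(query_start, watermark)
    if query_start < cold_end:
        plan["iceberg"] = (query_start, cold_end)
    if hot_start < query_end:
        plan["mergetree"] = (hot_start, query_end)
    return plan


if __name__ == "__main__":
    wm = datetime(2025, 9, 1)
    print(route(datetime(2025, 8, 15), datetime(2025, 9, 10), wm))
```

A query that straddles the watermark touches both layers, and the two partial results would then be combined, which is what lets the user see a single table.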

Robert: Let me add that there is also work coming on ingest swarms. We have query swarms that help you scale query performance, but a similar problem can arise on the ingest side when you have a lot of data coming into ClickHouse. Much of the processing there is stateless: parsing data, producing Parquet files, producing merged parts. We plan to offload this stateless processing into ingest swarms. An ingest swarm could, for example, read from Kafka, run materialized view-style processing, and produce merged parts or Iceberg files, allowing you to scale ingest independently from your query system. If you have more incoming traffic, you just add more nodes to the ingest swarm without affecting the main analytics cluster.

[44:43] – Q&A: Delta vs. Iceberg, Catalog Choices, ice watch, GCP BigLake, and Aggregations in Iceberg

Robert: Before we turn to questions, I want to mention that we are doing a series of in-person meetups on real-time data lakes in October. The first is October 2nd in New York City, followed by London, Atlanta, and San Francisco. These are put on by Altinity but also include Snowflake, StarRocks, and the Amazon team to talk about S3 Table Buckets. We would love to meet you in person. This is completely open source, and we want people to join us, contribute, and try it out.

Alexander: Jeff asks about the Delta table format and the trade-offs between our catalog and Polaris and Unity. We considered Delta at the beginning, but we saw Iceberg gaining more and more traction, especially after Databricks acquired Tabular. That was a signal to us to invest more in Iceberg support, and it turned out to be the right call, since S3 Table Buckets are Iceberg-compatible but not Delta-compatible.

That said, ClickHouse can work with Unity and with Delta, and the ClickHouse core team has been making improvements to Delta support. The swarm functionality and most Antalya features are more or less catalog-agnostic at the Parquet level. We have more caching for Iceberg, but basic Parquet reading and writing is catalog-agnostic, and we have caches at the Parquet metadata level that help regardless of catalog. If you use another catalog and run into performance issues, we would be happy to look at them together.

Robert: On catalog choice more broadly: we talked to a huge number of people about this topic and found that a lot of people are thinking about data lakes, but far fewer have actually implemented them. We wanted something that would just pop up and run anywhere, obviously in our cloud but also for the many self-managed ClickHouse users. The ice catalog is very simple and portable. As for other catalogs: ClickHouse is adding increasingly good support for AWS Glue, which is the most common one we hear about from customers. Unity and Polaris are also important. LakeKeeper, which Alexander mentioned, is an interesting new option. We know the people working on it and they are coming on our podcast this Friday. Our catalog is mostly for ease of use: it is a wrapper on top of the standard Java Iceberg libraries, so its behavior is very consistent with the open source Java ecosystem.

Alexander: Omar asks about ice watch. Ice watch can watch any bucket, but in order to set it up, you need some control over cloud infrastructure. Specifically, you need to configure S3 events to be delivered to SQS, and the catalog needs sufficient credentials to load data from the external location. If you want to use the no-copy option, the catalog still needs read access so ClickHouse can read the data.

Alexander: On GCP BigLake Metastore: we have a customer asking us to look into it, and it is on our Q4 list. We do not yet have evidence either way on whether it works. If anyone has time to test it, we would be very grateful. We will turn to BigLake Metastore ourselves later this year.

Robert: There is an interesting edge case worth mentioning: some users, particularly in HPC environments and large financial services firms, store metadata in object storage while the actual data files live on a distributed file system on-premises. That is another creative storage variation we are tracking. Our goal is to make all of these just work through ClickHouse so that your applications connect and get a single view of data no matter where it lives.

Robert: Tim asks about tables using aggregation functions and whether aggregate results can be stored directly in Iceberg. You are absolutely right that there is no way to do that today. In the short to medium term, the approach will be to keep your source data in Iceberg but keep your materialized views in ClickHouse. We are planning to add features that allow materialized views to subscribe to changes in Iceberg and use those changes to populate ClickHouse-side materialized views.

Robert: Omar also raises the scenario of external compute inserting into Iceberg, and asks whether we have plans to pre-cache metadata and recent data for fast access. Yes, we have been thinking about this. One of our close design partners is a high-frequency trader with exactly this problem. One approach is a lambda architecture where data takes two paths on ingest: hot data goes into MergeTree immediately for fast queries, and a separate ingest path stuffs the same data into Iceberg as quickly as possible, enabling tools like Looker, which has strong Iceberg support, to see data with minimal latency. The infinity table can then decide which layer to use based on the query. Ideas around ingest swarms caching hot data in memory are more speculative, but we are thinking about them.

We are at the top of the hour, so let’s call it there. Thank you all for joining and for the great questions. Please come to altinity.com, join our Slack, reach out on LinkedIn, or if you are a customer, send questions through support tickets. Thank you all, and have a wonderful day.

FAQ Section

What is Project Antalya, and how does it extend ClickHouse? Project Antalya is Altinity’s open source initiative to make Apache Iceberg and Parquet the primary storage layer for ClickHouse rather than just a source or destination. It introduces stateless swarm clusters for scalable parallel query on Iceberg data, additional Parquet and Iceberg caches, and a built-in ice REST catalog for metadata management. The goal is to reduce storage costs by up to 10x compared to replicated block storage while keeping query performance at real-time speeds.

What are AWS S3 Table Buckets, and why do they require special Antalya support? AWS S3 Table Buckets, introduced at the end of 2024, are a managed Iceberg-over-S3 service that organizes Parquet files into tables and exposes them via the Iceberg REST or AWS Glue API, with built-in compaction. They require special authentication parameters that upstream ClickHouse cannot supply. The Altinity ice REST catalog solves this by acting as an authenticated proxy: ClickHouse talks to the ice catalog over a standard REST API, and the catalog handles the AWS-specific authentication and proxies all requests. A separate Antalya-specific ClickHouse fix is also required because S3 Table Buckets return file paths that do not conform to the standard warehouse-relative path assumption that upstream ClickHouse enforces.

What catalog management operations are available in the Altinity ice REST catalog? The ice REST catalog now supports four operations: manifest compaction (merging many small Iceberg manifests into one), data compaction (merging small Parquet files within a partition into larger ones), snapshot cleanup (removing snapshots older than a configurable age, which in turn allows data files referenced only by those snapshots to be deleted), and orphan cleanup (removing unreferenced files left behind by failed operations). These can be run automatically as part of the catalog service itself or externally on a cron schedule to avoid placing extra load on the catalog.

How do I write data from ClickHouse into an Iceberg catalog today? There are currently two approaches. The first is to insert data from ClickHouse directly into the correct data directory inside the catalog warehouse, then run the ice command-line tool to register the new files as a snapshot. The --no-copy flag skips double-copying if the data is already in the catalog location. The second is ice watch mode: configure SQS to deliver S3 bucket events, run ice in watch mode, and it will automatically commit new files to the catalog as they arrive. In Altinity.Cloud, this watch mode runs in the background automatically if you enable it. A more capable ALTER TABLE EXPORT PARTITION command is under development and targeted for the Antalya 25.8 build.

What are infinity tables in Project Antalya? Infinity tables are an upcoming feature that will present a single ClickHouse table interface over both MergeTree and Iceberg storage layers. The table transparently routes queries to MergeTree for hot recent data and to Iceberg for older cold data, based on a configurable watermark. Users query a single table and do not need to think about which storage layer holds which data. Altinity plans to ship an MVP of infinity tables by the end of the year.

Why did Altinity choose Iceberg over Delta Lake for Project Antalya? Iceberg was chosen because it is gaining broader community traction. A key signal was Databricks acquiring Tabular, the company founded by Iceberg’s original creators. S3 Table Buckets, which represent a major direction from AWS, are also Iceberg-compatible but not Delta-compatible. That said, ClickHouse can work with Delta and Unity catalogs, and most Antalya features are Parquet-level and largely catalog-agnostic. The ice REST catalog is specifically built around Iceberg because it is the primary focus of Project Antalya.

© Altinity, Inc. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
