The Future Has Arrived: Parquet on Iceberg Finally Outperforms MergeTree

Alien they seemed to be:
No mortal eye could see
The intimate welding of their later history.
– Thomas Hardy, The Convergence of the Twain
The Altinity Antalya project introduced last month grew out of our work on Parquet performance that started a year ago with the “Reading External Parquet Data from ClickHouse®” research. It quickly became apparent that Parquet combined with the open Iceberg table format is a good alternative to ClickHouse® native MergeTree when data is stored on object storage. To prove the point and open up usage in real applications, we needed performance data. We now have the first results, and they look very promising.
Iceberg and Parquet allow us to implement the main performance features of MergeTree while adding benefits of their own. They provide the foundation for compute-storage separation, which lets us apply cheap external swarms of compute for query processing. In this article, we dive deeper into the Altinity Antalya foundation features and demonstrate that queries on Iceberg data can already match or even outperform ClickHouse MergeTree!
From MergeTree to Iceberg
ClickHouse MergeTree has been fundamental to ClickHouse performance over the years. It was originally designed for block storage, where file operations are almost free, and therefore has some shortcomings that become apparent once data moves to object storage.
In short, every MergeTree table consists of many files, sometimes thousands, and a reasonably big ClickHouse installation may have hundreds of thousands of data files, sometimes even hitting the inode limit of the file system! When running a query, ClickHouse typically opens and reads many of them.
When MergeTree data is stored in object storage like S3, every file is stored as an object. So instead of hundreds of thousands of files we end up with hundreds of thousands of objects. This is a big problem. For instance, in order to access an object, ClickHouse has to perform an HTTPS call. This is slow. Even on the fastest network, the latency of accessing S3 objects is orders of magnitude bigger than the latency of accessing a file on block storage. The icing on the cake is that AWS gladly charges users for such calls! There are other problems as well; if we start digging deeper we may never get to the performance results.
The ClickHouse development team has put a lot of effort into adapting MergeTree for efficient use with object storage. Some features, like compact parts, compact metadata, various caches, lazy table loading, parallel reads, and so on, are available in upstream ClickHouse versions. Others, like SharedMergeTree or the distributed cache, are closed-source ClickHouse Cloud features only. These mitigate the shortcomings of the original MergeTree design to some extent. Unfortunately, the most useful features are not available to open source ClickHouse users.
There is also an important limitation that cannot be addressed by improving MergeTree at all. Once the data is on object storage it can be shared, and it is tempting to access it from other systems, not only from ClickHouse. Unfortunately, MergeTree is understood only by ClickHouse itself, which greatly restricts sharing options.
This is why we have chosen the Iceberg table format combined with Parquet files as the storage for our Antalya project. The combination shares a lot of attractive design and performance properties with ClickHouse MergeTree tables while addressing MergeTree shortcomings. It is fully open and can be read and written by any external system! It is also backed by a big community and industry leaders, including AWS, Databricks, and Cloudflare.
We have spent the last year improving query performance for Parquet. ClickHouse developers added support for Iceberg and other catalogs. The combined efforts already deliver very good results. Let’s see how ClickHouse over Iceberg performs. But first we need to set up an Iceberg catalog.
Iceberg Catalog
Setting up an Iceberg catalog has never been easier. In Altinity.Cloud it is just a single button click. If you are running ClickHouse by yourself, we have released an open source Iceberg REST catalog for Kubernetes that makes it almost as easy.
Once you have a catalog enabled in Altinity.Cloud, a wizard helps you create a catalog connection in ClickHouse. It produces a SQL CREATE DATABASE statement with all connection parameters already filled in:
CREATE DATABASE "ice"
ENGINE = DataLakeCatalog('https://<iceberg_catalog_url>')
SETTINGS catalog_type = 'rest',
auth_header = 'Authorization: Bearer [TOKEN]',
warehouse = 's3://<catalog_warehouse_bucket>';
If you are running a catalog yourself, it is pretty much the same, but you will need to specify the connection parameters manually.
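A quick way to verify that the connection works is to check that the database is visible and to list the tables the catalog exposes. This is a minimal sketch; the database name ice matches the CREATE DATABASE statement above, and the table list will be empty until tables are registered:
-- Confirm that the catalog database exists
SELECT name, engine FROM system.databases WHERE name = 'ice';
-- List the tables exposed by the catalog (empty until something is registered)
SHOW TABLES FROM ice;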
Note: ClickHouse versions 25.2 and earlier used the ‘Iceberg’ database engine, but in 25.3 it was renamed to ‘DataLakeCatalog’. A direct upgrade from 25.2 to 25.3 is not possible without fixing the schema files directly. Many Iceberg performance features were added in ClickHouse 25.3 and 25.4 and backported to the latest Altinity Antalya build, so we recommend using the latest upstream ClickHouse version or the latest Altinity Antalya build for the best results.
Icing the Test Data
We start with the popular New York Taxi Rides dataset first introduced by Mark Litwintschik a few years ago. There is an excellent summary page that compares multiple databases, including ClickHouse. The dataset is available in multiple variants; we will use the one stored in the s3://altinity-clickhouse-data public bucket. It is described, for example, in Ultra-Fast Data Loading and Testing in Altinity.Cloud.
Here is a ClickHouse table:
CREATE TABLE tripdata (
pickup_date Date DEFAULT toDate(pickup_datetime) CODEC(Delta, LZ4),
id UInt64,
vendor_id String,
pickup_datetime DateTime CODEC(Delta, LZ4),
dropoff_datetime DateTime,
passenger_count UInt8,
trip_distance Float32,
pickup_longitude Float32,
pickup_latitude Float32,
rate_code_id String,
store_and_fwd_flag String,
dropoff_longitude Float32,
dropoff_latitude Float32,
payment_type LowCardinality(String),
fare_amount Float32,
extra String,
mta_tax Float32,
tip_amount Float32,
tolls_amount Float32,
improvement_surcharge Float32,
total_amount Float32,
pickup_location_id UInt16,
dropoff_location_id UInt16,
junk1 String,
junk2 String)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (vendor_id, pickup_location_id, pickup_datetime)
It can be loaded with the following query:
INSERT INTO tripdata
SELECT * FROM s3('s3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/*.parquet', NOSIGN)
SETTINGS max_threads=32, max_insert_threads=32, input_format_parallel_parsing=0;
The table contains 1.3B rows and takes up 37GB of local storage.
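As a quick sanity check, both numbers can be confirmed from system tables. This is a minimal sketch using standard system.tables columns; the exact byte count will vary with merges and compression settings:
-- Row count and on-disk size of the local MergeTree table
SELECT total_rows, formatReadableSize(total_bytes) AS size
FROM system.tables
WHERE name = 'tripdata';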
Now we need to load it into Iceberg. Since ClickHouse does not support direct writes to Iceberg yet, we will use ice. This is an open source client for Iceberg REST catalogs that we developed in order to load tables into Iceberg until native ClickHouse support is available. It can also load data from existing S3 buckets. If you are interested in learning more, read the Introduction to Ice article.
First, let’s write data from ClickHouse into an S3 bucket managed by the Iceberg catalog, preserving the path structure that the catalog would create:
INSERT INTO FUNCTION s3('s3://<catalog_warehouse_bucket>/nyc/tripdata/data/{_partition_id}/data.parquet')
PARTITION BY toYYYYMM(pickup_date)
SELECT * FROM tripdata
ORDER BY vendor_id, pickup_location_id, pickup_datetime
Then run ice with the --no-copy flag:
ice insert nyc.tripdata -p --no-copy s3://<catalog_warehouse_bucket>/nyc/tripdata/data/*/data.parquet \
--partition='[{"column":"pickup_date","transform":"month"}]' \
--sort='[{"column":"vendor_id"},{"column":"pickup_location_id"},{"column":"pickup_datetime"}]'
The ice insert command takes data ‘as-is’, registering it properly in the catalog. If you are using different table names, make sure the table name corresponds to the path inside the warehouse. Note that we also specify partitioning and a table sort order that match the MergeTree ones.
We can now run a count query to confirm the row count:
select count() from ice."nyc.tripdata"
┌────count()─┐
1. │ 1310903963 │ -- 1.31 billion
└────────────┘
The count query is almost instant. So let’s try something a little more “stressful.”
Climbing the Iceberg with Test Queries
We will use 5 test queries:
Q1. Group by a single column.
SELECT
passenger_count,
avg(total_amount)
FROM tripdata
GROUP BY passenger_count
Q2. Group by two columns.
SELECT
passenger_count,
toYear(pickup_date) AS year,
count(*)
FROM tripdata
GROUP BY passenger_count, year
Q3. Group by three columns.
SELECT
passenger_count,
toYear(pickup_date) AS year,
round(trip_distance) AS distance,
count(*)
FROM tripdata
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC
Q4. Query with a JOIN to the taxi_zones table.
SELECT
tz.zone AS zone,
count() AS c
FROM tripdata AS td
LEFT JOIN taxi_zones AS tz ON td.pickup_location_id = tz.location_id
GROUP BY zone
ORDER BY c DESC LIMIT 10
Q5. Where condition on the joined table.
SELECT count(*)
FROM tripdata AS td
INNER JOIN taxi_zones AS tz ON td.pickup_location_id = tz.location_id
WHERE tz.zone = 'Midtown East'
The Iceberg queries differ only by the table name: ice."nyc.tripdata" instead of tripdata. The taxi_zones table is always local; it is small, and there is no reason to put it into an Iceberg table.
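For reference, here is a minimal sketch of the local taxi_zones dimension table. It assumes only the columns used by Q4 and Q5 plus the borough; the public NYC TLC taxi zone lookup also includes a service zone column, and the exact schema used in the benchmark may differ:
CREATE TABLE taxi_zones (
    location_id UInt16,  -- joins to tripdata.pickup_location_id
    borough String,
    zone String          -- e.g. 'Midtown East'
)
ENGINE = MergeTree()
ORDER BY location_id;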
Finally, we enabled the following optimizations (a combined example follows the list):
- Iceberg metadata files cache (use_iceberg_metadata_files_cache=1) – this cache stores the internal representation of Iceberg metadata, which contains the list of files representing an Iceberg table. Otherwise we would need to read from object storage and parse a huge JSON on every query. This can add tens of milliseconds (or even hundreds of milliseconds) for large tables.
- Parquet metadata cache (input_format_parquet_use_metadata_cache=1) – this cache stores Parquet metadata for row groups. It helps to skip filtered Parquet files without reading them.
- S3 list cache (use_object_storage_list_objects_cache=1) – this is essential when working with Hive-partitioned S3 buckets, but it may save a few network calls when working with Iceberg as well.
- Filesystem cache (enable_filesystem_cache=1) – this is the standard ClickHouse local storage cache for external data.
- Iceberg partition pruning (use_iceberg_partition_pruning=1) – this is essential for efficient filtering of Iceberg data.
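For convenience, here is a sketch of enabling all of the above for a session; the same settings can equally be placed in a settings profile:
SET use_iceberg_metadata_files_cache = 1;
SET input_format_parquet_use_metadata_cache = 1;
SET use_object_storage_list_objects_cache = 1;
SET enable_filesystem_cache = 1;
SET use_iceberg_partition_pruning = 1;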
Every query was executed 3-5 times, and we recorded the shortest response time. We used c7g.8xlarge AWS instances with 32 vCPUs.
Results are shown in the table below:
| Query | MergeTree | Iceberg |
|-------|-----------|---------|
| Q1    | 1.5s      | 0.7s    |
| Q2    | 0.7s      | 0.7s    |
| Q3    | 2.0s      | 1.2s    |
| Q4    | 5.5s      | 1.4s    |
| Q5    | 1.5s      | 0.4s    |
The results were quite surprising! Iceberg queries consistently outperformed MergeTree! Why is that? Experienced ClickHouse users might suppose that the MergeTree table is fragmented into several parts per partition, while the Iceberg table stores one Parquet file per partition. We executed OPTIMIZE TABLE FINAL on the MergeTree table and re-ran the queries – there was no difference.
Even when we disabled the data cache, which can be done with min_bytes_to_use_direct_io=1 for MergeTree and enable_filesystem_cache=0 for Iceberg, Iceberg queries were still two times faster.
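For reference, this is roughly how the cache-free comparison can be reproduced for Q1. It is a sketch using the settings named above, not the exact benchmark harness:
-- MergeTree: force direct I/O so the local data cache is bypassed
SELECT passenger_count, avg(total_amount)
FROM tripdata
GROUP BY passenger_count
SETTINGS min_bytes_to_use_direct_io = 1;

-- Iceberg: disable the local filesystem cache for external data
SELECT passenger_count, avg(total_amount)
FROM ice."nyc.tripdata"
GROUP BY passenger_count
SETTINGS enable_filesystem_cache = 0;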
Iceberg query results demonstrate a huge improvement over what ClickHouse could do with Parquet a year ago, when we were puzzled by ‘What’s Up With Parquet Performance in ClickHouse?’ Queries Q1, Q2 and Q3 correspond to queries 2, 3 and 4 in Mark Litwintschik’s tests. So if we compare against other databases in the summary table, ClickHouse Iceberg queries would be among the top 10 performers!
But hold on, what’s up with MergeTree? Isn’t MergeTree the fastest native storage format for analytics?! We were very puzzled and decided to take a look at why MergeTree was slow in the initial tests.
Looking Back to MergeTree
We were especially surprised by the slow MergeTree query times for Q4 and Q5, so we tried disabling the analyzer with the enable_analyzer=0 setting. That helped dramatically: the Q4 MergeTree query time went down to 1.2s, and Q5 ran in 0.5s. While the new analyzer has been production ready for many months already, there are still cases when it fails to plan queries properly. Unfortunately, disabling the analyzer did not improve results for the first three queries.
Our next theory was that the difference in query response times was caused by better data compression in Parquet compared to MergeTree. Parquet uses per-column codecs that are adjusted for every row group automatically. Additionally, it is compressed with ZSTD, which is controlled by the output_format_parquet_compression_method setting when writing Parquet from ClickHouse. On the other hand, ClickHouse with default settings uses LZ4 compression for MergeTree tables. We can prove the difference in compressed data size by looking into the Parquet metadata. Don’t be frightened by the complex query below; it just does the job of collecting statistics:
SELECT column, type, parquet_compressed, ch_compressed, round(ch_compressed / parquet_compressed, 2) AS ratio
FROM
(
    SELECT column, sum(num_rows), sum(column_uncompressed_size) AS parquet_uncompressed, sum(column_compressed_size) AS parquet_compressed
    FROM
    (
        WITH arrayJoin(columns) AS c
        SELECT _path, num_rows, metadata_size, total_uncompressed_size, total_compressed_size,
               tupleElement(c, 'name') AS column,
               tupleElement(c, 'total_uncompressed_size') AS column_uncompressed_size,
               tupleElement(c, 'total_compressed_size') AS column_compressed_size
        FROM s3('s3://<catalog_warehouse_bucket>/nyc/tripdata/data/*/data.parquet', 'ParquetMetadata')
    ) t
    GROUP BY column
) a
LEFT JOIN (SELECT name AS column, type, data_compressed_bytes AS ch_compressed, data_uncompressed_bytes AS ch_uncompressed FROM system.columns WHERE table = 'tripdata') b
USING (column)
ORDER BY ratio
FORMAT PrettyCompactNoEscapesMonoBlock
┌─column────────────────┬─type───────────────────┬─parquet_compressed─┬─ch_compressed─┬───ratio───┐
1. │ pickup_datetime │ DateTime │ 728003899 │ 14876 │ 0 │
2. │ dropoff_datetime │ DateTime │ 2815366997 │ 19662 │ 0 │
3. │ id │ UInt64 │ 132763 │ 19662 │ 0.15 │
4. │ junk2 │ String │ 106863 │ 19662 │ 0.18 │
5. │ junk1 │ String │ 106863 │ 19662 │ 0.18 │
6. │ tolls_amount │ Float32 │ 100761469 │ 86897084 │ 0.86 │
7. │ dropoff_location_id │ UInt16 │ 59835410 │ 90676470 │ 1.52 │
8. │ dropoff_longitude │ Float32 │ 2436064306 │ 4062369546 │ 1.67 │
9. │ pickup_longitude │ Float32 │ 2380018092 │ 4002029096 │ 1.68 │
10. │ dropoff_latitude │ Float32 │ 2616572327 │ 4545198969 │ 1.74 │
11. │ pickup_latitude │ Float32 │ 2549710411 │ 4492131060 │ 1.76 │
12. │ total_amount │ Float32 │ 1631431696 │ 3640707576 │ 2.23 │
13. │ trip_distance │ Float32 │ 1564198984 │ 3762184282 │ 2.41 │
14. │ tip_amount │ Float32 │ 952820724 │ 2316893979 │ 2.43 │
15. │ passenger_count │ UInt8 │ 293373924 │ 730827155 │ 2.49 │
16. │ fare_amount │ Float32 │ 1238104795 │ 3640683882 │ 2.94 │
17. │ rate_code_id │ String │ 44627379 │ 138066010 │ 3.09 │
18. │ payment_type │ LowCardinality(String) │ 180132201 │ 638007985 │ 3.54 │
19. │ extra │ String │ 29879569 │ 108049577 │ 3.62 │
20. │ store_and_fwd_flag │ String │ 17538744 │ 66921895 │ 3.82 │
21. │ pickup_location_id │ UInt16 │ 134890 │ 579807 │ 4.3 │
22. │ mta_tax │ Float32 │ 9102651 │ 58587786 │ 6.44 │
23. │ improvement_surcharge │ Float32 │ 602351 │ 6201352 │ 10.3 │
24. │ pickup_date │ Date │ 331723 │ 13049273 │ 39.34 │
25. │ vendor_id │ String │ 117996 │ 21085689 │ 178.7 │
└───────────────────────┴────────────────────────┴────────────────────┴───────────────┴───────────┘
As you can see, the columns used in the queries above – pickup_date, passenger_count, trip_distance, total_amount, pickup_location_id – all consume much less space in Parquet than in MergeTree. That plays an important role in query performance.
So, let’s change the compression for the MergeTree table to ZSTD. Starting with ClickHouse version 25.4 it is possible to modify the default compression at the table level. In earlier versions we can add a configuration snippet:
<clickhouse>
<compression>
<case>
<method>zstd</method>
<level>1</level>
</case>
</compression>
</clickhouse>
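On versions that predate table-level compression defaults, another common option (not the approach we used here) is to change the codec per column, for example:
-- Hypothetical per-column change; repeat for the columns of interest
ALTER TABLE tripdata MODIFY COLUMN total_amount Float32 CODEC(ZSTD(1));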
Once the new compression settings are applied, we need to run OPTIMIZE TABLE tripdata FINAL again. Let’s check the data sizes now:
MergeTree:
select total_bytes from system.tables where name = 'tripdata'
┌─total_bytes─┐
1. │ 22069192362 │ -- 22.07 billion
└─────────────┘
Parquet in Iceberg:
select sum(size) from (select _path, any(_size) as size from ice."nyc.tripdata" group by 1)
┌───sum(size)─┐
1. │ 21132090256 │ -- 21.13 billion
└─────────────┘
Data sizes are now very close. Let’s re-run test queries on MergeTree and compare results:
| Query | MergeTree (defaults) | MergeTree ZSTD | Iceberg |
|-------|----------------------|----------------|---------|
| Q1    | 1.5s                 | 1.1s           | 0.7s    |
| Q2    | 0.7s                 | 0.7s           | 0.7s    |
| Q3    | 2.0s                 | 1.7s           | 1.2s    |
| Q4    | 1.2s                 | 1.1s           | 1.4s    |
| Q5    | 0.5s                 | 0.5s           | 0.4s    |
Results are slightly better, but MergeTree is still slower than Iceberg for most queries.
We had to call our brilliant support engineers, who pointed out the local_filesystem_read_method setting. If we change it to ‘pread’, MergeTree finally takes the lead.
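For reference, here is a sketch of applying both tweaks at the session level; they can also be placed in a settings profile:
SET local_filesystem_read_method = 'pread';
SET enable_analyzer = 0;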
| Query | MergeTree (defaults) | MergeTree (ZSTD, pread, analyzer disabled) | Iceberg |
|-------|----------------------|--------------------------------------------|---------|
| Q1    | 1.5s                 | 0.4s                                       | 0.7s    |
| Q2    | 0.7s                 | 0.4s                                       | 0.7s    |
| Q3    | 2.0s                 | 0.7s                                       | 1.2s    |
| Q4    | 1.2s                 | 1.1s                                       | 1.4s    |
| Q5    | 0.5s                 | 0.5s                                       | 0.4s    |
While we made MergeTree fast again, it was not the best user experience. Most users know very little about compression and the analyzer, and very few have ever heard of the magical local_filesystem_read_method setting. ClickHouse used to be known for delivering the best performance out of the box, and it is thought-provoking that this is no longer the case. (At least in this particular benchmark.)
So far we have used just a single ClickHouse node. One of the main features of the Antalya project is swarm execution, which allows users to easily add external compute to an existing ClickHouse cluster. Let’s take a minute to review how swarms work. (You can skip the next section if you already know this.)
Swarm Execution Model
Antalya swarms are separate ClickHouse clusters of stateless, compute-only nodes that existing ClickHouse clusters can use to read data stored directly in object storage or in Iceberg tables. Swarms can be scaled up and down independently, thus fulfilling the dream of compute and storage separation. Since swarm nodes are stateless, you can also use cheap spot instances, which makes the swarm very cost effective as well.
The swarm execution model is designed around two distinct features: swarm discovery, and query execution delegation.
Swarm discovery allows ClickHouse clusters to connect to swarms with minimal configuration. To support it, we improved the existing ClickHouse cluster auto-discovery feature, combining it with an auxiliary Keeper.
Swarms register themselves in a dedicated Registry Keeper instance using a unique swarm name. ClickHouse clusters that need to run queries on a swarm are configured to use the Registry as an auxiliary Keeper. The auxiliary Keeper extends any existing configuration stored in other Keeper servers. With the Registry properly configured, ClickHouse clusters will see swarms in the system.clusters table and will be able to connect and run queries. If swarm nodes are added or removed, ClickHouse handles registration and deregistration automatically, without any external configuration.
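A simple way to check that a swarm has registered is to look it up in system.clusters; the swarm name below is just an example:
SELECT cluster, host_name, port
FROM system.clusters
WHERE cluster = 'swarm1';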
In order to implement the swarm execution model we had to rethink how ClickHouse runs distributed queries on externally stored data in general. The traditional ClickHouse approach required every table function that worked with object storage to have a “-Cluster” version, e.g. s3 – s3Cluster, iceberg – icebergCluster, etc. This is not convenient since it obliges a user to change queries when moving from a single node to a cluster. Instead, we have moved the cluster delegation instruction into settings, so it can be defined in a profile or in the query itself. Where in ClickHouse you would normally write:
SELECT ... FROM s3Cluster('swarm1', ...)
In Antalya we write it as:
SELECT ... FROM s3(...) settings object_storage_cluster='swarm1'
If the object_storage_cluster setting is not set, the query runs locally. If it is set to a particular swarm, the query is executed on that swarm! This is much more convenient.
Using query settings as an interface for query routing instructions allows us to add other features while keeping the regular SQL syntax, for example (see the sketch after this list):
- Specify the number of swarm nodes to use (object_storage_max_nodes) – this is similar to max_threads, but applied to limit the number of nodes.
- Use a swarm node as the query initiator (object_storage_remote_initiator).
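Here is a sketch of how these settings compose with the delegation setting shown above; the swarm name and node limit are illustrative:
SELECT count()
FROM ice."nyc.tripdata"
SETTINGS object_storage_cluster = 'swarm1', object_storage_max_nodes = 2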
Going forward, this will allow us to offload heavy system workloads completely to the swarm. The obvious example is running OPTIMIZE for Parquet data.
But that’s enough theory. Let’s see it in practice.
Swarm Lands on Iceberg
For the next test we will downscale our initial ClickHouse to a 4vCPU instance, and start a single-node swarm extension using a 32vCPU machine. In Altinity.Cloud this can be easily done from the UI. If you are running ClickHouse by yourself, refer to the Antalya Examples repository to see how to do it in Kubernetes.
Once the swarm is configured, we can run queries providing the swarm name as a query setting, like this:
SELECT
passenger_count,
avg(total_amount)
FROM ice."nyc.tripdata"
GROUP BY passenger_count
SETTINGS object_storage_cluster='my-swarm'
Here are the query times:
| Query | MergeTree (defaults) | MergeTree (ZSTD, pread, analyzer disabled) | Iceberg | Iceberg Swarm 1 node |
|-------|----------------------|--------------------------------------------|---------|----------------------|
| Q1    | 1.5s                 | 0.4s                                       | 0.7s    | 0.9s                 |
| Q2    | 0.7s                 | 0.4s                                       | 0.7s    | 0.8s                 |
| Q3    | 2.0s                 | 0.7s                                       | 1.2s    | 1.3s                 |
| Q4    | 1.2s                 | 1.1s                                       | 1.4s    | 0.4s*                |
| Q5    | 0.5s                 | 0.5s                                       | 0.4s    | 0.4s*                |
So the initiator adds just a small overhead; the rest of the workload is effectively offloaded to the swarm.
*) Note Q4 and Q5. Those queries did not work out of the box, since as of this writing swarm queries do not support joins properly. Here is a bug report; it will be fixed shortly. So we rewrote the queries as follows in order to move the JOIN operations to the initiator node.
Q4.
SELECT zone, sum(c) c
FROM (
SELECT
pickup_location_id,
count() AS c
FROM ice."nyc.tripdata"
GROUP BY pickup_location_id
) td
LEFT JOIN taxi_zones AS tz ON td.pickup_location_id = tz.location_id
GROUP BY zone
ORDER BY c DESC LIMIT 10
SETTINGS object_storage_cluster='my-swarm'
Q5.
SELECT sum(c) c
FROM (
SELECT pickup_location_id, count(*) c
FROM ice."nyc.tripdata"
GROUP by pickup_location_id
) td
INNER JOIN taxi_zones AS tz ON td.pickup_location_id = tz.location_id
WHERE tz.zone = 'Midtown East'
SETTINGS object_storage_cluster='my-swarm'
We already tested a single-node swarm, so let’s see if we can improve performance by adding 3 extra nodes. As soon as the new nodes are up, they register automatically. No changes to the queries are needed.
| Query | MergeTree (defaults) | MergeTree (ZSTD, pread, analyzer disabled) | Iceberg | Iceberg Swarm 1 node | Iceberg Swarm 4 nodes |
|-------|----------------------|--------------------------------------------|---------|----------------------|-----------------------|
| Q1    | 1.5s                 | 0.4s                                       | 0.7s    | 0.9s                 | 0.4s                  |
| Q2    | 0.7s                 | 0.4s                                       | 0.7s    | 0.8s                 | 0.4s                  |
| Q3    | 2.0s                 | 0.7s                                       | 1.2s    | 1.3s                 | 0.6s                  |
| Q4    | 1.2s                 | 1.1s                                       | 1.4s    | 0.4s                 | 0.25s                 |
| Q5    | 0.5s                 | 0.5s                                       | 0.4s    | 0.4s                 | 0.3s                  |
As you can see, performance scales as more swarm resources are added. If we subtract the roughly 0.2s overhead on the initiator node, the remaining part of the query scales almost linearly.
Attaching extra compute allows us to outperform MergeTree again!
The Tip of the Iceberg
These simple results are very promising. That said, we foresee many improvements to speed up performance.
- Some datatypes supported by ClickHouse are not properly supported by the Parquet implementation in ClickHouse, or in Iceberg itself. For example, Enums do not work currently, and unsigned ints cannot be natively inserted using Iceberg libraries.
- There is unnecessary overhead on both the query initiator and the swarm nodes when planning a query. The list of files is already cached in the Iceberg metadata file cache, but ClickHouse still “checks” that the files are there with an extra S3 HEAD request. While this makes sense for s3 table function queries, it is not needed for Iceberg: the Iceberg catalog metadata is the source of truth.
- The algorithm for maintaining cache locality for swarm nodes can be improved. Currently, if one node is faster than others on a particular query, it may start processing files that are cached on other nodes, which often results in unstable query times.
- Finally, joins are still a work in progress.
Our engineers are working on addressing all of these. If you find any other problem or have a performance improvement idea – please file an issue on GitHub. Or better yet, submit a PR. We work on this together!
In the end, it is likely that Iceberg will be just as fast as MergeTree, and faster in cases where you can trade off cost for performance by applying a lot of compute. However, MergeTree still has many advantages:
- MergeTree is instantly queryable without compaction
- MergeTree compacts automatically over time
- MergeTree can be changed in ways that are difficult to replicate in Parquet. For example, many ALTER TABLE operations use hard links in order to avoid rewrites of unchanged data. This is not possible with Parquet.
- MergeTree is ideal for downsampling/indexing Parquet sources using materialized views (see the sketch after this list).
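One possible pattern for the last point is a refreshable materialized view that periodically rebuilds a small local MergeTree summary from the Iceberg table. This is a sketch only; refreshable materialized views are available in recent ClickHouse versions (older releases require an experimental setting), and the schedule and aggregation are illustrative:
-- Rebuilds a small local MergeTree summary from the Iceberg table once an hour
CREATE MATERIALIZED VIEW tripdata_daily
REFRESH EVERY 1 HOUR
ENGINE = MergeTree()
ORDER BY pickup_date
AS SELECT
    pickup_date,
    count() AS trips,
    sum(total_amount) AS total_revenue
FROM ice."nyc.tripdata"
GROUP BY pickup_date;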
Our goal is to make these two storage types work smoothly together, leveraging their respective strengths.
The next big step is to take the proven design of ClickHouse TTL moves and extend it with an option to move data to Parquet and Iceberg. Once fully implemented, it will allow users to work transparently with hot data in local MergeTree and historical data in Parquet and Iceberg, maintaining reasonable query performance and allowing this data to be accessed by other systems, e.g. AI tools. Since Iceberg tables can be queried with swarms, it also allows users to manage compute for historical data independently. That way historical queries do not affect performance on the main ClickHouse clusters.
Conclusion
The recent developments in ClickHouse and Project Antalya bring Iceberg support in ClickHouse to the next level. It is becoming a viable alternative to MergeTree, matching MergeTree performance characteristics while also adding cheap object storage and shared data capabilities. It opens the door to building systems that combine the strengths of MergeTree and Iceberg and are far more scalable and cost-performant than previously possible.
We’ve known for some time that ClickHouse overall has become much more performant on Iceberg/Parquet data and that Project Antalya features accelerate that performance. That said, the results presented in this article were a pleasant surprise. Our next step is to test against the ClickBench dataset, which has a more representative set of queries that include use of primary key indexes. We believe this will probe additional query paths and expand understanding of differences in performance. Stay tuned, the future has just arrived!
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.