Real-time Queries on AWS S3 Table Buckets in ClickHouse®

The gifts of earth are brought and prepared, set on the table.
So it has been since creation, and it will go on
– Joy Harjo, Perhaps the World Ends Here
AWS S3 table buckets are a unique new product from AWS introduced at the end of 2024. In a nutshell, table buckets organize files stored on S3 into tables. Supported file formats include Parquet. There are table management features on top that make S3 look like analytic database storage. Most importantly, S3 table buckets expose tables in Apache Iceberg format, which means that ClickHouse can query them easily.
Since Iceberg is one of the foundational features of the Altinity Antalya project, we could hardly overlook this important advance in S3 technology. We built an open source integration from ClickHouse to S3 table buckets that you can access easily in Altinity.Cloud. Let’s try it out!
The Hidden Underside of the Iceberg
Before showing how to use S3 Table Buckets, let me remind you of one of the biggest problems with Apache Iceberg today: compaction.
Imagine a service that collects observability data and pushes a batch of rows every 10 seconds. In 24 hours it will produce 6 * 60 * 24 = 8640 batches. If there are multiple services, the number will multiply accordingly. Inserting such data into a database is not a problem. Databases are designed to process inserts at a high rate and volume. However, this is not true for S3 buckets (not tables!) nor for Iceberg.
In Iceberg, every batch is one or more new Parquet files. As the number of Parquet files increases, the performance of accessing those files for queries will degrade dramatically. Even the metadata access becomes slow! In order to keep the query performance good as new data arrives, Parquet files need to be compacted. It is similar to the merge process in ClickHouse that compacts new parts produced by inserts into bigger ones, but applied to Parquet files.
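As an aside, you can watch the analogous process in ClickHouse itself. Here is a minimal sketch that counts active data parts for a hypothetical MergeTree table named 'events' (not part of this setup); the count drops as background merges compact small parts produced by frequent inserts:
-- Count active parts and total rows for a hypothetical MergeTree table 'events'.
-- Background merges reduce the number of active parts over time.
SELECT count() AS active_parts, sum(rows) AS total_rows
FROM system.parts
WHERE database = currentDatabase() AND table = 'events' AND active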
Picture from AWS blog
Apache Iceberg makes it possible to solve the compaction problem through features like snapshots and optimistic concurrency, but it leaves the actual work to external tools. Open source implementations most commonly use Apache Spark for this purpose, but many users find it heavyweight and operationally complex. Topics like automation, upgrades, monitoring, and the like become the user’s problem.
This is one of the main reasons why users were so excited about S3 Table Buckets. AWS adds data compaction as well as other maintenance features like old snapshot removal. That makes it a ready-to-use service without the need for extra management tools.
Connecting ClickHouse to S3 Table Buckets Using ice-rest-catalog
S3 Table Buckets, as well as Iceberg, start with a catalog. A catalog exposes information about tables and files stored in S3 table buckets so that other systems, including ClickHouse, can read them. ClickHouse can connect to the most common catalog types, including AWS Glue and Iceberg REST. We have found that Glue requires quite a lot of management overhead, so let’s take a look at REST.
Unfortunately, ClickHouse cannot connect directly to S3 table bucket REST catalogs due to the use of specific authentication parameters that ClickHouse does not support yet. There are also limitations in the REST implementation itself. So we decided to take a different route.
Altinity.Cloud uses the open source ice-rest-catalog that we developed to simplify Iceberg usage in Kubernetes. It allowed us to add a proxy for S3 table buckets without changing ClickHouse. The proxy connects to S3 Table Buckets using the AWS Java API, which does not have the REST limitations. Problem solved!
If you want to try it out by yourself, check out this very simple example that uses the AWS CLI and ice. In Altinity.Cloud we can deploy a catalog for a new or an existing S3 table bucket. Please reach out to our support to configure it for your Altinity.Cloud environment.
Once the catalog is configured, ClickHouse can connect to it using the DataLakeCatalog database engine. Users may use the Altinity.Cloud Configure Data Lake Catalogs Wizard for that, or look up the credentials on the environment dashboard page. In the end, the CREATE DATABASE statement will look as shown below.
CREATE DATABASE s3tables1
ENGINE = DataLakeCatalog('http://ice-rest-catalog-s3table1:5000')
SETTINGS catalog_type = 'rest',
auth_header = '[HIDDEN]',
warehouse = 'arn:aws:s3tables:us-east-1:313342380333:bucket/[HIDDEN]'
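To confirm that the connection works, you can list the tables the catalog exposes; this is just a quick check against the database created above:
SHOW TABLES FROM s3tables1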
The database is empty, so we need to load it with some data. We will use ice for that.
Loading Data Into S3 Table Buckets Using Ice
Ice has a downloadable binary; you can find install instructions on the Ice project releases page. Once downloaded, it can connect to any Iceberg REST catalog. The URI for external connections and the credentials for the Altinity.Cloud catalog can be extracted from Altinity.Cloud and used in the .ice.yaml configuration file as follows:
.ice.yaml
---------
uri: https://iceberg-catalog-s3table1.altinity-antalya.altinity.cloud
bearerToken: [YOUR_TOKEN HERE]
Now we can run ice and confirm it can connect to the catalog:
#:/ ice check
OK
We are ready to load data. For the initial test, let’s use the BTC public AWS dataset. It can be ice’d with a one-liner:
ice insert btc.transactions -p --s3-region=us-east-2 --s3-no-sign-request s3://aws-public-blockchain/v1.0/btc/transactions/date=2025-01-*/*.parquet
This will load 31 files for January 2025. If you want to load more data, adjust the date pattern accordingly.
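Once the load finishes, a quick sanity check from ClickHouse (using the catalog database created earlier) confirms the rows are visible; the exact count depends on how much data you loaded:
SELECT count()
FROM s3tables1."btc.transactions"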
Querying S3 Table Buckets From ClickHouse
Once the catalog is connected, querying AWS S3 tables is no different from any other query on ClickHouse tables:
SELECT date, sum(output_count)
FROM s3tables1."btc.transactions"
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
Moreover, if you have a swarm cluster available, the same query works there as well, scaling performance out across the swarm nodes:
SELECT date, sum(output_count)
FROM s3tables1."btc.transactions"
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS object_storage_cluster='my-swarm'
Read more about Antalya compute swarms in our blog post.
Setting Up a Real-Time Pipeline
The example above is not very exciting since it does not allow users to benefit from the data management features of S3 Table Buckets, most notably compaction. To see compaction in action, let’s set up a real-time pipeline that constantly adds new data.
In order to set up a real-time ingestion pipeline, we configured Amazon Data Firehose to stream CloudWatch metrics into S3 table buckets. The details of this setup are covered in the AWS blog and documentation. Here is the high-level idea:

In the end, we’ve got CloudWatch metrics streamed to S3 Table Buckets in Iceberg format. Let’s check it out from the ClickHouse side. The catalog with S3 Table Buckets is connected as a database named ice_s3t.
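It was connected the same way as before, with the DataLakeCatalog engine pointing at an ice-rest-catalog proxy; the sketch below uses placeholder values for the endpoint and the warehouse ARN:
CREATE DATABASE ice_s3t
ENGINE = DataLakeCatalog('http://ice-rest-catalog-s3t:5000') -- hypothetical proxy endpoint
SETTINGS catalog_type = 'rest',
auth_header = '[HIDDEN]',
warehouse = 'arn:aws:s3tables:us-east-1:[ACCOUNT_ID]:bucket/[HIDDEN]'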
Here is the table structure:
DESCRIBE ice_s3t."cloudwatch.metrics"
metric_stream_name  Nullable(String)
account_id          Nullable(String)
region              Nullable(String)
namespace           Nullable(String)
metric_name         Nullable(String)
timestamp           Nullable(Int64)
dimensions          Map(String, Nullable(String))
value               Tuple(
                      min Nullable(Float64),
                      max Nullable(Float64),
                      count Nullable(Float64),
                      sum Nullable(Float64))
unit                Nullable(String)
You can see that the timestamp column is Int64 – Iceberg does not support the DateTime64 datatype yet, so we had to store it as an integer. (Good news: it’s a feature in Iceberg V3, so a fix is on the way.) Meanwhile, we can convert values to Date or DateTime at query time, like this:
SELECT toDate(fromUnixTimestamp64Milli(timestamp)), count()
FROM ice_s3t."cloudwatch.metrics"
GROUP BY 1 ORDER BY 1
┌─toDate(fromU⋯timestamp))─┬─count()─┐
1. │ 2025-08-19 │ 8519 │
2. │ 2025-08-20 │ 34560 │
...
8. │ 2025-08-26 │ 23003 │
9. │ 2025-08-27 │ 10421 │
└──────────────────────────┴─────────┘
With this data, we can quickly see what’s going on inside the cloud account, for example, checking the top failed resources:
SELECT dimensions, metric_name, sum(value.count)
FROM ice_s3t."cloudwatch.metrics"
WHERE metric_name like 'StatusCheckFailed%'
GROUP BY 1, 2 ORDER BY 3 DESC
LIMIT 1
FORMAT Vertical
Row 1:
──────
dimensions: {'InstanceId':'i-0bf059f5cd0548854'}
metric_name: StatusCheckFailed_AttachedEBS
sum(value.count): 11117
But what about compaction? Is it really happening? We can check by observing the number of unique files for each day:
SELECT toDate(fromUnixTimestamp64Milli(timestamp)), count(), uniq(_file)
FROM ice_s3t."cloudwatch.metrics"
GROUP BY 1 ORDER BY 1
┌─toDate(fromU⋯timestamp))─┬─count()─┬─uniq(_file)─┐
1. │ 2025-08-19 │ 8519 │ 1 │
2. │ 2025-08-20 │ 34560 │ 1 │
...
8. │ 2025-08-26 │ 23003 │ 1 │
9. │ 2025-08-27 │ 10482 │ 29 │
└──────────────────────────┴─────────┴─────────────┘
As you can see, all previous days are pretty much compacted. If we look into today’s data:
SELECT _file, arraySort(groupUniqArray(toHour(fromUnixTimestamp64Milli(timestamp))))
FROM ice_s3t."cloudwatch.metrics"
WHERE toDate(fromUnixTimestamp64Milli(timestamp)) = today()
GROUP BY 1 ORDER BY 2
┌─_file───────────────────────────────────────────────────────────┬─arraySort(group⋯(timestamp))))─┐
1. │ 00000-0-848d2dbe-e7b2-4661-afcd-4973893edf64-0-00001.parquet │ [0,1,2,3,4,5,6,7,8,9,10,11,12] │
2. │ 00006-159518-0eeaf657-30b3-4a31-91da-1ff1afa2f0a2-00001.parquet │ [12] │
3. │ 00008-159627-a0a3ed77-5aee-412e-8dce-c3cece1fd5d6-00001.parquet │ [12] │
4. │ 00004-159353-fa3a2d6c-937b-42d7-a052-b263fef3d8b0-00001.parquet │ [12] │
5. │ 00000-159238-68f9be81-432b-4dc8-b8a7-bdc4ee6619e8-00001.parquet │ [12] │
6. │ 00000-159573-e32986f4-9519-44c3-bd4e-71f92b47092a-00001.parquet │ [12] │
...
We will see that there is one ‘merged’ file and a number of new files for the current hour. Voilà!
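To put a number on it, a follow-up query along the same lines shows the average number of rows per file for each day; compacted days end up with far fewer, much larger files:
SELECT toDate(fromUnixTimestamp64Milli(timestamp)) AS day,
    uniq(_file) AS files, count() AS total_rows,
    intDiv(count(), uniq(_file)) AS rows_per_file
FROM ice_s3t."cloudwatch.metrics"
GROUP BY day ORDER BY day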
Final Words
The new S3 Table Buckets service is a testament to the power of Iceberg that also makes it much easier to use. Built-in compaction and catalog management are game changers. Combined with ClickHouse query performance, it allows users to build real-time data pipelines using open data formats like Parquet with minimal effort. This level of convenience is yet another step in the accelerating adoption of Apache Iceberg.
S3 Table Buckets are an AWS-only service. Start an Altinity.Cloud trial to try them out today. If you are an existing Altinity.Cloud user, contact Altinity support to get data lake support enabled. What if you run your data lakes on GCP, Azure or on-prem? We have a solution coming for other clouds as well, so stay tuned!
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.