Size Matters: Best Practices for Trillion Row Datasets on ClickHouse®

Recorded: Wednesday, August 10, 2022
Presenters: Robert Hodges & Altinity Support Engineering
In this advanced tutorial, Altinity CEO Robert Hodges demonstrates how to design ClickHouse schemas and write queries that deliver consistent sub-second response times on trillion-row datasets, using a live sensor dataset on a single cloud VM. He opens with a demo on over a trillion rows running on one Amazon c5.9xlarge (36 vCPUs, ~1.4 TB storage): MAX() returns in 0.4 seconds, argMax() in 0.8 seconds, and a materialized-view-backed last-point query in 12 milliseconds—all on a single machine. The goal he frames is fast, direct access to source data, not just pre-aggregated summaries, which is what makes iterative exploration on production datasets possible.
The session is organized around five principles for large datasets: reduce queries to a single scan, reduce I/O through schema optimization, parallelize the query, favor aggregation over joins, and index interesting events with materialized views. The schema section shows how smaller data types, the DoubleDelta codec, ZSTD compression, time-based PARTITION BY, and a careful ORDER BY shrink storage to 1.34 bytes per row—enough to fit a trillion rows on one VM—while a benchmark shows roughly linear parallel scaling from 1 to 16 threads.
The multi-entity ingest section uses a Null-engine ETL table with a materialized view to pull sensor events from incoming JSON into a single optimized fact table, and demonstrates storing the raw JSON string alongside structured columns at negligible cost (~96% compression with ZSTD) plus a token bloom filter index for needle-in-haystack searches. Finally, Robert covers three signature ClickHouse tricks—countIf/avgIf combinators for multi-entity stats in one scan, groupArray + ARRAY JOIN to simulate cross-entity joins through aggregation, and an AggregatingMergeTree last-point index using argMaxState—before closing with sharding, tiered storage, and approximate aggregates, plus Q&A.
Here are the slides:
Key Moments (Timestamps)
Key moments generated with AI assistance.
- 0:13 – Introduction: Robert Hodges, Altinity
- 1:42 – Altinity overview: enterprise ClickHouse provider, Altinity.Cloud, Kubernetes Operator
- 3:25 – ClickHouse overview: SQL data warehouse, columnar, parallel, vectorized, open source
- 6:02 – Live demo: trillion-row sensor data on a single c5.9xlarge VM
- 6:26 – Demo query 1: count(1) on a trillion-row table
- 7:01 – Demo query 2: MAX(temperature) for sensor_id 2555 — 0.4 seconds
- 8:01 – Demo query 3: argMax(date, temperature) — 0.8 seconds
- 9:07 – Demo query 4: materialized view last-point query — 12 milliseconds
- 11:09 – Goal: sub-second access to source data (not just aggregates) for iterative exploration
- 13:28 – Why fast source data access matters: iterative exploratory queries to find root causes
- 14:04 – Five principles for large datasets
- 14:10 – Principle 1: Reduce to a single scan
- 14:33 – Principle 2: Reduce I/O (smaller data)
- 14:44 – Principle 3: Parallelize
- 14:52 – Principle 4: Lean on aggregation instead of joins
- 15:04 – Principle 5: Index with materialized views
- 15:41 – The key: one unaggregated fact table to rule them all
- 17:07 – Basic design for trillion-row tables
- 17:13 – ClickHouse server architecture: query pipeline, page cache, primary key index, hash tables
- 18:43 – Performance levers: codecs, compression, tiered storage, indexes, projections
- 19:25 – MergeTree table organization: parts, sparse index, columns as file pairs
- 21:47 – Unoptimized baseline table: sub-optimal types, no codecs, no partitioning
- 22:49 – Optimized table: DoubleDelta codec, ZSTD, alias date column, time-based partitioning, ORDER BY
- 24:55 – Storage comparison: unoptimized vs lz4 vs ZSTD → 1.34 bytes/row with LZ4 + codecs
- 25:59 – Parallelization benchmark: linear scaling 1–16 threads, slight drop at 32
- 28:08 – Multi-entity ingest: sensor readings, restarts, errors in a single table
- 28:55 – Storing source JSON alongside structured columns: ~96% ZSTD compression ratio
- 30:15 – Why not use the new JSON data type yet: experimental, multi-entity schema challenges
- 30:52 – Final schema: discriminator column, temperature, error message, raw JSON string
- 31:26 – Loading pattern: Null engine ETL table + materialized view transformation
- 32:41 – Adding a materialized column from JSON: instant ALTER TABLE, optional MATERIALIZE
- 34:05 – Indexing raw JSON with a token bloom filter index
- 35:17 – Bloom filter performance: column vs JSON with tuned bloom filter on high-cardinality values
- 36:00 – Unique ClickHouse tricks for large datasets
- 36:25 – Aggregation properties: single pass, parallelizes across cores and machines, intermediate results reusable
- 37:36 – Pattern 1: countIf/avgIf combinators — multi-entity stats in one scan
- 39:14 – Pattern 2: groupArray + ARRAY JOIN — simulating a join through aggregation
- 40:46 – Live demo: array join query returning sensor readings with uptime since last restart
- 43:51 – Pattern 3: AggregatingMergeTree + argMaxState — last-point index via materialized view
- 46:17 – Storage comparison: 1 TB source table vs 3 GB materialized view
- 47:09 – Why incorrect partitioning on the materialized view causes too many parts
- 48:22 – Summary: single table, good schema choices, store source data, aggregation is the secret power
- 50:37 – Other techniques: sharding/replication, tiered storage, approximate aggregates (sampling, approximate uniques)
- 51:47 – Resources: ClickHouse docs, Altinity blog, Altinity Knowledge Base
- 52:36 – Q&A begins
- 54:05 – Q&A: single fact table even with column duplication? Yes — compression handles it
- 56:10 – Q&A: groupArray/ARRAY JOIN = GROUP BY self-join
- 56:29 – Q&A: filtering arrays efficiently with hasAny
- 57:39 – Q&A: bloom filters didn’t work — tuning tips, high-cardinality requirement
- 59:12 – Q&A: pre-WHERE — useful but can slow complex queries; sometimes turn it off
Webinar Transcript
[0:13] — Introduction and Housekeeping
Robert: Hi everybody, this is Robert Hodges and I will be the presenter today at our webinar called Size Matters: Best Practices for Trillion Row Datasets in ClickHouse. The webinar is being recorded. I’ll send you links to the recording and the slides after this. There should be at least 10 minutes for questions at the end; you can post them into the Q&A box or the chat.
[1:42] — About Altinity
Robert: I’m the CEO of Altinity but for this webinar I’m a database person. I’ve been working on databases for almost 40 years, starting with M204, and have worked with about 20 or more databases. The experience I’ll describe comes from all of Altinity — we have very talented engineers whose experience runs into centuries, and they have helped people build and operate very large data sets on ClickHouse and in some cases operated these clusters themselves. Altinity is an enterprise provider for ClickHouse. We help people build real-time analytic applications through software including Altinity.Cloud and the Altinity Kubernetes Operator for ClickHouse — the thing that enables ClickHouse to run cloud-natively on Kubernetes — as well as through support.
[3:25] — ClickHouse Overview
Robert: ClickHouse is a SQL data warehouse — a database designed to answer questions about large amounts of data very quickly. It understands SQL, runs on everything from bare metal to the cloud, runs great in Kubernetes, uses a shared-nothing architecture (a traditional data warehouse architecture where ClickHouse operates as a process with attached storage and nodes connected over a network). Like most data warehouses it stores data in columns, uses parallel and vectorized execution (both across cores and across nodes, treating data as arrays and using SIMD instructions), scales to petabytes, and is open source. It’s probably the largest analytic project on GitHub — bigger than Elasticsearch, with more contributors and growing very rapidly.
[6:02] — Live Demo: Trillion-Row Sensor Data
Robert: Let me do some quick demos before getting into the talk, to show you what ClickHouse can actually do. I’m running on Altinity.Cloud on a c5.9xlarge — 36 CPUs, about 1.3–1.4 TB of EBS storage. I’m using the built-in ClickHouse web UI, which is a great feature. This is sensor data — temperature readings on sensors, generated within ClickHouse from a sample data set.
Let me count the rows:
SELECT count() FROM readings_multi;
That’s slightly over a trillion rows of data.
Now let me find the maximum temperature ever recorded on sensor 2555 (message type = ‘reading’):
SELECT max(temperature)
FROM readings_multi
WHERE sensor_id = 2555 AND msg_type = ‘reading’;
That came back in 0.4 seconds — pretty fast for a trillion-row table. It was able to locate the relevant rows using the primary key index.
Now let me also find what day that maximum temperature occurred:
SELECT max(temperature), argMax(time, temperature)
FROM readings_multi
WHERE sensor_id = 2555 AND msg_type = ‘reading’;
That’s 0.8 seconds — slower because we’re also pulling the date associated with the maximum temperature. Whenever we see things get slower in ClickHouse we ask: can we make it faster?
With ClickHouse there’s a very powerful way to optimize frequently-run queries: materialized views. I have one that pre-aggregates the temperature data to daily readings:
SELECT max(temperature), argMaxMerge(last_time)
FROM readings_multi_daily_mv
WHERE sensor_id = 2555;
Same data comes back — same day — but it ran in 12 milliseconds. This is drawing off data that comes out of a trillion-row source. Not a particularly powerful VM, but incredibly fast results.
[11:09] — The Goal: Fast Direct Access to Source Data
Robert: What we’re looking at in the rest of this presentation is how to enable fast, cost-efficient, end-user access to trillion-row data sets. A lot of times people frame the problem as “I just need daily aggregates.” We’re actually focused on enabling people to find the actual source data and ask questions directly on it.
When we say fast, we mean consistent sub-second response that scales linearly with resources. As your dataset gets larger, as you add more tenants, as you have longer time stretches of data, you can just add more resources and preserve your response time. We also want costs that are low and predictable — not serverless mystery billing where you don’t know how long things will take or what they’ll cost.
Why do we need fast access to source data? Let’s say you’re seeing temperature sensors fail intermittently but don’t know why. You go in and do quick, iterative exploratory queries on the data: look at particular sensors, particular regions, particular time stretches to find patterns that help you zero in on the root cause. If you can do this effectively you might find that a bad firmware upgrade is the common factor — bang, problem solved. That’s why you want this access, and you want it fast.
[14:04] — Five Principles for Large Datasets
Robert: As a community we’ve developed a lot of experience building data sets that solve these problems. Here are the basic principles we follow when building these systems or advising others:
- Reduce queries to a single scan. If it’s a single scan, it’s linear. Your response time will be predictable — just a factor of how much data you’re scanning.
- Reduce I/O: make the data as small as possible. Less data to read from storage = faster queries.
- Parallelize. Once we have a small linear process, parallelize it. ClickHouse is very good at this.
- Lean on aggregation instead of joins. Aggregation is one of ClickHouse’s secret powers — we’ll show how it lets you avoid large table joins entirely.
- Index information with materialized views. I showed an example in the demo of pre-aggregating data. You can also use materialized views to find and index interesting events inside a large data set.
The key insight is that the closer you get to having a big fact table, the easier it is to achieve fast, consistent, cost-efficient queries. One enormous unaggregated table — maximize applied CPU, minimize I/O overhead, parallelize everything, keep intermediate results in RAM. Joins are fine in ClickHouse, but not big ones.
[17:13] — ClickHouse Server Architecture
Robert: Understanding the architecture helps you make good design decisions. ClickHouse has a three-stage process: query parsing, query interpretation, and a pipeline that executes queries. It can read columnar data from block storage (local SSD or EBS over SAN in the demo) or from object storage.
Block storage goes through filters to the OS page cache — once data is cached, repeated access is faster. In memory, ClickHouse stores primary key indexes (so they always fit in RAM), scanned column blocks, hash tables for joins, and intermediate query results. Memory is a critical resource for large queries.
[18:43] — ClickHouse Performance Levers
Robert: ClickHouse has an amazing number of ways to achieve high performance: codecs, compression, tiered storage, skip indexes, projections, sharding, distributed query, data types, read replicas. If you can think of something that will make a data warehouse run fast, there’s a pretty good chance it’s implemented somewhere in ClickHouse.
For this talk we’ll focus on: compression and codecs, use of indexes, correct data types, and basic parallelization. For further reading on aggregation internals and performance, see ClickHouse aggregation fun, Part 1 and Part 2 on the Altinity blog.
[19:25] — MergeTree Table Organization
Robert: The MergeTree family is the standard for large ClickHouse tables (including ReplicatedMergeTree and other variants). MergeTree breaks data into parts — chunks that can be accessed independently in parallel. Within each part:
- A sparse primary key index (primary.idx) has entries for roughly every 8,000 rows
- For each column, there’s typically a pair of files: a marks file with byte offsets, and a .bin file with chunks of compressed data
ClickHouse also supports skip indexes (min/max, bloom filter) that help decide whether to read a particular block. Getting this structure right means efficient I/O.
[21:47] — Starting with an Unoptimized Table
Robert: When I started putting this dataset together I created an unoptimized table first:
CREATE TABLE readings_unopt (
sensor_id Int64,
sensor_type Int32,
location String,
time DateTime,
date Date DEFAULT toDate(time),
reading Float32
) ENGINE = MergeTree()
PARTITION BY tuple()
ORDER BY tuple();
This has sub-optimal data types, no codecs, and PARTITION BY tuple() and ORDER BY tuple() — which basically tell ClickHouse “partition randomly, don’t bother sorting.” This is a terrible table, but it will work with any data shape and gives you a starting point to understand what you’re working with.
[22:49] — Optimizing the Table
Robert: After experimenting I arrived at an optimized schema. The key changes:
DoubleDelta codec: instead of storing the raw sensor ID integer, store the change-on-the-change (the second derivative of successive values). For slowly-increasing integer values like sensor IDs this is extremely efficient encoding:
sensor_id Int32 CODEC(DoubleDelta, ZSTD)
Alias column for date: a date can be trivially computed from a timestamp, so instead of storing it, define it as an alias:
date Date ALIAS toDate(time)
Time-based partitioning: very common in large ClickHouse datasets — breaks data into nice month-sized blocks:
PARTITION BY toYYYYMM(time)
Optimized ORDER BY: designed to give optimal data locality for your query patterns. Good ordering also helps with compression because sorted data tends to have smaller differences between consecutive values.
Applying these optimizations brings storage down dramatically. The storage comparison:
| Version | Bytes/row |
| Unoptimized | ~8+ bytes |
| LZ4 only | ~4 bytes |
| Codecs + LZ4 | 1.34 bytes |
| Codecs + ZSTD | ~1.0 bytes |
I chose codecs + LZ4 at 1.34 bytes/row because ZSTD is slightly more expensive at query time. At 1.34 bytes/row, a trillion-row dataset fits on a single VM. This optimization step is particularly critical for very large datasets.
[25:59] — Parallelization Scaling
Robert: Once you have a well-optimized single table, ClickHouse’s parallelization scales very well. In a benchmark on one billion rows from this dataset, varying the max_threads setting from 1 to 32 shows linear scaling up to 16 threads and a slight drop at 32 (likely hitting I/O bandwidth limits). If you structure your queries correctly around a single table, you get excellent, predictable parallelization.
[28:08] — Multi-Entity Ingest: Loading a Single Fact Table
Robert: One important realization when building these very large data sets: if you have interesting data you’ll likely have multiple entity types. In our sensor example: readings, restarts, and errors — all arriving for the same sensor IDs. The pattern to avoid is large table joins, which are expensive. Instead, we store them all in a single fact table.
[28:55] — Storing Raw JSON Alongside Structured Columns
Robert: Many large systems don’t just store data in optimized columns. They also include the source data — the raw JSON — in the table. Why? Because with very large datasets you may have event types you don’t yet know what questions to ask about. By storing the compressed JSON string, you preserve full flexibility. In ZSTD compression this particular event type compresses by about 96%, so the storage cost is very small.
Why not use the new JSON data type? Two reasons: it’s still experimental, and more importantly, it doesn’t yet handle multiple entity types well. When the JSON schema changes across event types the data can get mixed up. We recommend storing JSON as a plain String column with compression applied, which has worked reliably for years.
[30:52] — Final Schema
Robert: The optimized schema looks like:
CREATE TABLE readings_multi (
sensor_id UInt32 CODEC(DoubleDelta, LZ4),
msg_type LowCardinality(String),
time DateTime CODEC(DoubleDelta, LZ4),
temperature Nullable(Float32),
error_msg Nullable(String),
json String CODEC(ZSTD) — raw source JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (msg_type, sensor_id, time);
The discriminator column (msg_type) tells us which kind of event we’re dealing with: reading, restart, or error.
[31:26] — Loading Pattern: Null Engine + Materialized View
Robert: A very common loading pattern for multi-entity JSON data: use a Null engine table as an ETL entry point:
CREATE TABLE readings_etl (json String) ENGINE = Null;
Then a materialized view fires on every insert to readings_etl, runs a transformation query that parses the JSON and extracts fields, and writes the result to readings_multi:
CREATE MATERIALIZED VIEW readings_etl_mv TO readings_multi AS
SELECT
JSONExtractUInt(json, ‘sensor_id’) AS sensor_id,
JSONExtractString(json, ‘msg_type’) AS msg_type,
…
FROM readings_etl;
This is a very common pattern covered in many Altinity materialized views tutorials.
[32:41] — Adding Columns Instantly from JSON
Robert: Even after loading a trillion-row table, you can add new columns at any time. The operation is instant:
ALTER TABLE readings_multi
ADD COLUMN IF NOT EXISTS firmware String
DEFAULT JSONExtractString(json, ‘firmware’);
This completes instantly. The column is immediately queryable — ClickHouse runs the extract function on-the-fly for queries. To force the data to be physically materialized:
ALTER TABLE readings_multi UPDATE firmware = firmware WHERE 1;
This rewrites just that one column. One consequence: once data is loaded into a very large table it’s hard to change. This trillion-row test dataset — generated within ClickHouse itself — still took four hours to load. Plan your schema carefully.
[34:05] — Indexing Raw JSON with a Token Bloom Filter
Robert: An advanced trick: add a token bloom filter index to the raw JSON column. This tokenizes the JSON and applies a probabilistic bloom filter that can tell ClickHouse which blocks don’t contain a particular string:
ALTER TABLE readings_multi
ADD INDEX jsonbf json
TYPE tokenbf_v1(8192, 3, 0)
GRANULARITY 1;
ALTER TABLE readings_multi MATERIALIZE INDEX jsonbf;
The parameters (index size, number of hash functions) require tuning. But when correctly set up:
— Query on dedicated column
SELECT count() FROM readings_multi WHERE firmware = ‘frx23ID0000x2532’;
— Query directly on JSON using hasToken
SELECT count() FROM readings_multi WHERE hasToken(json, ‘frx23ID0000x2532’);
The performance difference between the column query and the tuned bloom filter on JSON is surprisingly small. We see bloom filters knocking out 90–97% of blocks when tuned correctly.
Important caveat: bloom filters only work well on high-cardinality data. For a firmware ID (frx23ID0000x2532) — yes. For common words like “the” or columns with only a few distinct values — no, because every block will match and the filter provides no benefit.
[36:00] — Unique ClickHouse Tricks for Large Datasets
Robert: The real question is: how do we make queries fast on these enormous tables while avoiding large-table joins? The answer is aggregation.
[36:25] — Why Aggregation Is ClickHouse’s Secret Power
Robert: Aggregation has several performance-favorable properties:
- Single pass: scan the data once, maintaining running counts and sums per GROUP BY key in hash tables.
- Parallelizes across cores, machines, and time: each thread independently aggregates its portion. Results can be stored and added to later and still produce correct results.
- Intermediate results are reusable: compute partial aggregates, store them, accumulate more later — still correct.
This is the fundamental mechanism for getting useful information out of large data sets quickly.
[37:36] — Pattern 1: countIf / avgIf Combinators
Robert: Traditional SQL would require multiple queries or a self-join to get per-entity-type statistics. ClickHouse’s If combinators do it in a single scan.
Suppose you want monthly statistics for all event types — how many readings, how many restarts, min/max/avg temperature — from a single table containing all event types:
SELECT
toYYYYMM(time) AS month,
countIf(msg_type = ‘reading’) AS num_readings,
countIf(msg_type = ‘restart’) AS num_restarts,
minIf(temperature, msg_type = ‘reading’) AS min_temp,
avgIf(temperature, msg_type = ‘reading’) AS avg_temp,
maxIf(temperature, msg_type = ‘reading’) AS max_temp
FROM readings_multi
WHERE sensor_id = 2555
GROUP BY month
ORDER BY month;
This is a single scan. No self-join, no data movement, no secondary table. The If combinator can be applied to almost any aggregate function in ClickHouse.
[39:14] — Pattern 2: groupArray + ARRAY JOIN — Join Simulation Through Aggregation
Robert: Sometimes you need to join data across entity types. For example: what do temperature readings look like after each device restart? In a traditional database you’d join the restarts table with the readings table on sensor_id.
In ClickHouse you can simulate this entirely through aggregation using groupArray (which collects values into an array in the order they arrive) and ARRAY JOIN (which unpivots arrays back into rows):
SELECT
sensor_id,
restart_time,
arrayJoin(reading_times) AS reading_time,
arrayJoin(temperatures) AS temperature,
reading_time – restart_time AS uptime
FROM (
SELECT
sensor_id,
anyIf(time, msg_type = ‘restart’) AS restart_time,
groupArrayIf(time, msg_type = ‘reading’) AS reading_times,
groupArrayIf(temperature, msg_type = ‘reading’) AS temperatures
FROM readings_multi
WHERE sensor_id = 2555
AND time BETWEEN ‘2022-01-01’ AND ‘2022-01-31’
GROUP BY sensor_id
)
ORDER BY reading_time;
Let me run this live on the trillion-row table.
[Demo runs] There it is — sensor 2555, reading times offset by a minute at a time, joined on the restart time. We’re computing uptime (time since last restart) and seeing temperatures. This is what amounts to a join done entirely through aggregation on a single table.
I’ll be honest: this code is complicated and not everyone’s cup of tea. But it works, it’s fast, and it’s one of the coolest SQL tricks I’ve ever learned — it came from one of our users.
[43:51] — Pattern 3: AggregatingMergeTree + argMaxState — Last-Point Index via Materialized View
Robert: Another very common need: “when was the last restart on sensor 236?” This is a traditional “last-point query.” Using a subquery it looks like:
— Traditional: subquery approach (can be expensive at scale)
SELECT message
FROM readings_multi
WHERE sensor_id = 236
AND msg_type = ‘error’
AND time = (SELECT max(time) FROM readings_multi
WHERE sensor_id = 236 AND msg_type = ‘error’);
On a trillion-row table, computing this on the fly for every query can be very slow. The solution: build a last-point index using a materialized view with AggregatingMergeTree.
A materialized view fires like a trigger every time a block is inserted into the source table. We set one up to capture, per sensor, the most recent event per event type:
— Target table using AggregatingMergeTree
CREATE TABLE readings_last_restart (
sensor_id UInt32,
last_time SimpleAggregateFunction(max, DateTime),
last_msg AggregateFunction(argMax, String, DateTime)
)
ENGINE = AggregatingMergeTree()
PARTITION BY tuple()
ORDER BY sensor_id;
— Materialized view that feeds it
CREATE MATERIALIZED VIEW readings_last_restart_mv
TO readings_last_restart AS
SELECT
sensor_id,
maxSimpleState(time) AS last_time,
argMaxState(message, time) AS last_msg
FROM readings_multi
WHERE msg_type = ‘restart’
GROUP BY sensor_id;
Each time a new block arrives in readings_multi, this fires, aggregates just that block’s restarts, and merges the result into readings_last_restart. The target table stores partial aggregate states (argMaxState) that the AggregatingMergeTree engine merges automatically in the background.
The result of the demo: this last-point query came back in 12 milliseconds from a trillion-row source table. The key reason: the materialized view target table is only 3 GB compared to the 1+ terabyte source table.
One caution: for the materialized view I used PARTITION BY tuple() (no partitioning), which is a somewhat dubious choice. As data keeps arriving, inserts create new parts and ClickHouse has to rewrite them more often. You get better behavior by partitioning the materialized view table correctly so data lands in longer runs. But as the demo shows, ClickHouse was still incredibly fast even with this sub-optimal choice.
For deeper coverage of ClickHouse materialized views, see the Altinity Knowledge Base.
[48:22] — Summary
Robert: The core principles from this talk:
Using a single large fact table to hold all entities is not a limitation — it’s a design choice that makes queries linear and predictable. The question isn’t “can ClickHouse do joins” — it can. The question is “how do we make query behavior linear,” and the answer is a single table.
Make good implementation choices: correct data types, codecs, compression, partitioning, ORDER BY. ClickHouse is a great database but you do some work to get the best from your schema.
If you have space, include the source data as a compressed JSON string. You can always materialize columns from it later, and it preserves flexibility. Once a trillion-row table is loaded, it’s very hard to change — generating this test dataset within ClickHouse took four hours.
Aggregation is the secret power. The If combinators, the groupArray/ARRAY JOIN pattern for join simulation, and materialized views with AggregatingMergeTree for pre-indexing — these are the features that allow ClickHouse to be incredibly fast and beat just about all comers at this scale. Use them and you’ll have linear scaling, cost efficiency, and users who can actually find answers in the raw data.
[50:37] — Other Techniques for Big Data
Robert: A few more things to consider:
Sharding and replication: with ClickHouse, if you set things up correctly, no problem doing a trillion rows on a single VM. Sharding is available when you need to grow beyond one node.
Tiered storage and object storage: an increasingly important topic for large datasets — how to move older data to cheaper storage while keeping recent data on fast SSD.
Approximate aggregates: as queries grow complex and hold large intermediate results in RAM, approximation can help. ClickHouse has lower-cost versions of uniques (approximate cardinality counting like HyperLogLog variants) and supports SAMPLE clauses for random sampling. We use approximate uniques constantly to make large queries faster.
[51:47] — Resources
Robert: The Altinity Knowledge Base is a gold mine of raw data from production experience — about 200+ customers, and every time our support engineers find something interesting they write it up. Many things in this talk came out of the Knowledge Base. Also see the ClickHouse official docs, the Altinity blog, and the Altinity YouTube channel where this talk will land.
[52:36] — Q&A
Robert: Happy to take questions.
Q: Do you still recommend a single fact table if it results in a lot of column duplication?
Yes, definitely. I think you’re imagining something like a snowflake schema where you spread data across multiple tables to save storage. ClickHouse is sufficiently efficient on compression — including codecs and controlled sort orders — that there’s usually not much storage cost to adding more columns. In this example I was able to reduce an eight-column table to about 1.34 bytes/row — just really amazing. For sparse columns (where many rows have NULL or a default value), ClickHouse compresses especially well. Also, denormalizing means you preserve “the then-current value” of dimension data at event time, which a join can’t give you. Single fact table: yes.
Q: Using bloom filters did not work for us.
Very common experience. Bloom filters are really hard to tune. You have to set the index size (size of the bloom filter per block) and the number of hash functions. You need to play around with it. More importantly: bloom filters only help if you have high-cardinality data. If you search on a common word or a column with only a few distinct values, every block will match and the filter provides no benefit. Where I’ve seen them work: transaction IDs, firmware IDs, serial numbers — high-cardinality by definition. We see them knocking out 90–97% of blocks in those cases. For tangled JSON with many overlapping keys, they won’t work.
Q: What would you suggest for a low-cardinality column?
Compress the daylights out of it and don’t worry about trying to skip values with an index. Even with a set index, if your data has values like 0 and 1 scattered randomly across blocks, the index says “which blocks contain 0” — and every block contains 0 — so it’s worthless. It’s just faster to scan with good compression.
Q: What about pre-WHERE?
Pre-WHERE is a powerful feature that lets ClickHouse use columns as rough-and-ready indexes even when they’re not in the primary key index. It will try to scan and determine which blocks need to be touched before doing the full column scan. However, as this question correctly points out, for complex queries you can actually pay more with pre-WHERE than without it. We’ve had exactly that experience. For straightforward single-column filters it’s great; for complex multi-column conditions you may want to turn it off and benchmark.
Robert: Thank you all very much. I hope this has been helpful. Contact us at altinity.com, join our Slack channel, and we’d be happy to talk about these topics in greater depth. Have a great day.
FAQ
Why is a single fact table recommended over a normalized schema for trillion-row datasets?
Normalization with multiple tables requires joins, and joins in ClickHouse on very large tables require moving data, storing large hash tables in RAM, and often doing multiple scans. A single unaggregated fact table turns queries into linear scans that parallelize perfectly across CPU threads. ClickHouse’s compression, codecs, and column storage make wide denormalized tables extremely storage-efficient — often achieving 1–2 bytes per row even for tables with 10+ columns — so the storage cost of adding dimension columns is minimal.
What is the groupArray + ARRAY JOIN pattern and when should I use it?
This pattern simulates a JOIN between different event types stored in the same table, using aggregation instead of a traditional join. Use groupArrayIf(value, condition) in a GROUP BY query to collect one event type’s values into arrays keyed by the join attribute (e.g., sensor_id). Use argMaxIf or anyIf to collect the anchor value from the other event type. Then ARRAY JOIN (or arrayJoin()) unpivots the arrays back into rows, reconstructing the joined table. This avoids data movement, runs in a single scan, and parallelizes across threads. Use it when you need to join data from different event types in a single fact table.
What is a last-point query and how do you implement it efficiently in ClickHouse?
A last-point query asks for the most recent value of some field per key — for example, the most recent error message per sensor. Computing this on-the-fly with a subquery (WHERE time = (SELECT max(time) FROM … WHERE …)) is expensive at trillion-row scale. The efficient approach is to pre-build a last-point index using a materialized view with an AggregatingMergeTree target table and argMaxState aggregate functions. The MV fires on every insert, incrementally maintains the max-time-and-associated-value per key in a tiny summary table (typically orders of magnitude smaller than the source), and queries against the summary table complete in milliseconds.
How do bloom filter indexes work on raw JSON columns and when are they effective?
A tokenbf_v1 bloom filter index tokenizes all string data in a column and records which tokens are present in each granule (block of ~8,000 rows). When you query with hasToken(json, ‘value’) or a column equality filter, ClickHouse uses the bloom filter to skip blocks that can’t possibly contain the value — without reading the actual column data. This is very effective for high-cardinality data like transaction IDs, firmware identifiers, or serial numbers, where each block is likely to contain only a small subset of all values. It does not help for low-cardinality data where every block contains every value, or for very tangled JSON where tokens overlap unpredictably.
Can a trillion-row dataset really run on a single server?
Yes. ClickHouse on a well-optimized schema with correct codecs, compression, and ORDER BY can achieve 1–2 bytes per row for many real-world datasets, making a trillion rows fit in 1–2 TB of storage — within reach of a single large cloud VM. With good parallelization across 16–36 CPU cores, max queries return in under a second and materialized view-backed queries return in milliseconds. Sharding across multiple nodes is available when you need to grow beyond single-node capacity or write throughput, but it adds complexity and is not necessary until you’ve exhausted single-node capacity.
© 2022 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.