Webinars

All about JSON and ClickHouse®: Tips, tricks, and new features!

Recorded: Tuesday, July 26 | 12:00 pm PT
Presenters: Robert Hodges & Diego Nieto

In this two-part webinar, Altinity CEO Robert Hodges and support engineer Diego Nieto cover every major approach to handling JSON in ClickHouse, from classic techniques that work in all versions to the new experimental JSON data type introduced in 22.3—with concrete performance numbers to help developers choose between them. Robert opens with the easiest workflow: using the JSONEachRow format to load JSON directly into a structured table where each key maps to a typed column, along with the matching output format for writing data back out.

He then covers the most common real-world pattern—storing the full document as a String “blob”—and shows how materialized columns using JSONExtract functions enable sub-millisecond queries on data that previously took nearly 10 seconds to scan. He explains the trade-off between JSONExtract (full nested parser, typed, slower) and the simpleJSONExtract/visitParam family (fast, but flat JSON only), then closes the first half with two faster alternatives to string storage: paired key/value arrays and the Map data type, both roughly 4–5x faster than querying raw JSON.

Diego then introduces the new JSON data type, which differs fundamentally from the string approach: JSON is parsed at INSERT time rather than query time, fields are stored in native columnar format with automatic type inference, and they’re accessed via simple dot notation instead of function calls. On an 11-million-row Stack Overflow dataset, a topK query drops from 2.1 seconds with JSONExtract to 0.3 seconds with the new type. He covers how it handles evolving schemas (new keys create new columns, merged via OPTIMIZE TABLE FINAL), the wide vs compact part storage model for sparse columns, and a codec comparison showing ZSTD shrinks storage from 1.3 GB to 800 MB with virtually no query-time penalty.

Here are the slides:

Key Moments (Timestamps)

Key moments generated with AI assistance.

  • 0:05 – Introduction: Robert Hodges and Diego Nieto
  • 1:26 – Speaker introductions: Robert Hodges (Altinity CEO), Diego Nieto (Altinity support engineer)
  • 2:02 – Altinity overview: enterprise ClickHouse provider, Altinity.Cloud, Kubernetes Operator
  • 2:34 – Talk structure: classic JSON features first, then new JSON data type
  • 3:07 – The JSON problem: nginx log data, loading and publishing JSON
  • 4:08 – Approach 1: JSONEachRow input format — structured table with one column per JSON key
  • 5:11 – Loading nginx logs with JSONEachRow format
  • 6:27 – JSONEachRow output format: writing data back out as JSON
  • 7:14 – Many JSON input/output format variants: 12–15 available
  • 7:50 – Approach 2: Storing JSON as a String blob
  • 9:27 – Creating a basic string blob table
  • 10:01 – Loading JSON into the blob table (CSV with file and message columns)
  • 11:01 – JSONExtract functions: typed, handles nested JSON, full parser
  • 12:18 – visitParam / simpleJSONExtract functions: fast, flat JSON only
  • 13:44 – Adding an ordered table with a materialized timestamp column
  • 15:07 – Using ALTER TABLE to add materialized columns from JSON fields
  • 17:01 – Forcing full materialization with ALTER TABLE UPDATE
  • 18:02 – Query simplicity after materialization: plain SQL, no functions
  • 18:34 – Performance comparison: 9.8s (JSON string) vs 14ms (materialized columns)
  • 19:20 – Approach 3: Paired arrays — keys in one String array, values in another
  • 21:25 – Loading JSON into paired arrays
  • 22:55 – Querying paired arrays: 4–5x faster than raw JSON string
  • 23:34 – Approach 4: Map data type — map(String, String) column
  • 24:33 – Loading JSON into a Map column using JSONExtractKeysAndValues
  • 25:30 – Materialized views for automatic JSON transformation
  • 26:07 – Handoff to Diego Nieto: the new JSON data type
  • 26:39 – How other databases handle JSON: PostgreSQL JSONB, MySQL JSON
  • 27:56 – How ClickHouse stored JSON until version 22.3: string-based parsers
  • 29:00 – Two ClickHouse JSON parsers: simpleJSONExtract (fast/flat) vs JSONExtract (full/nested)
  • 30:37 – The classic query approach: JSONExtract in SELECT vs CTE with tuple
  • 33:46 – Performance of tuple approach vs direct JSONExtract: two passes vs N passes
  • 35:31 – The new JSON data type: parses at INSERT time, columnar native format
  • 37:07 – Type inference: ClickHouse infers column types from JSON values
  • 37:53 – Dot-notation access: raw.tag instead of JSONExtract(raw, ‘tag’, …)
  • 38:28 – Enabling the new JSON type: allow_experimental_object_type = 1
  • 39:41 – Creating the Stack Overflow table with JSON data type
  • 40:17 – Loading with JSONAsObject input format
  • 41:54 – DESCRIBE TABLE with and without describe_include_subcolumns setting
  • 43:45 – topK benchmark: 2.1s (JSONExtract / string) vs 0.3s (new JSON type)
  • 46:02 – Heterogeneous schemas: what happens when JSON structure changes
  • 50:07 – Wide vs compact part format: handling tables with 1K+ sparse columns
  • 53:34 – Codec comparison: LZ4 vs ZSTD on new JSON type — 1.3 GB to 800 MB
  • 55:04 – Summary: basic JSON secrets and future topics (MVs, skip indexes, bloom filters)
  • 58:40 – Q&A: heterogeneous schema with many event types and thousands of sparse columns
  • 1:04:06 – Closing remarks; new JSON type expected to stabilize in 22.8 LTS

Webinar Transcript

[0:05] — Introduction and Housekeeping

Robert: Hello everybody and welcome to “All About JSON and ClickHouse: Tips, Tricks and New Features.” My name is Robert Hodges; I’ll be one of your hosts today along with my colleague Diego Nieto. We’re going to be digging deep into JSON. We have 43 action-packed slides.

This webinar is being recorded. Everyone who signed up will get a link to both the recording and the slides about 12 hours after we finish. For questions, you can put them in the chat, but the best way is the Q&A box in the Zoom menu. We’ll answer them as we go or at the end.

[1:26] — Speaker Introductions

Robert: I’ve been working on databases since 1983. This includes about 20 different database types, the last of which is ClickHouse and is my current favorite. My day job is CEO of Altinity. Diego Nieto is one of our support engineers for ClickHouse, focused on ClickHouse, PostgreSQL, and database applications, with a long history working with different database types.

Altinity is an enterprise provider for ClickHouse — we provide everything people need to get real-time analytics up, running, and operating efficiently. We have the Altinity.Cloud platform and are also authors of the Altinity Kubernetes Operator for ClickHouse as well as many other open-source projects.

[2:34] — Talk Structure

Robert: One thing we advertised in this talk is the new JSON data type — that’s roughly the second half. But first we want to cover the features that already existed in ClickHouse to deal with JSON, because understanding the foundations makes it easier to see how the new JSON type extends them in useful ways.

[3:07] — The JSON Problem

Robert: Let’s start with the very basics. If you deal with JSON data, you probably see things like nginx log data — structured JSON documents from a web server. The first question is: how do we get this data into a table as efficiently as possible? And if we have it in a table, can we write it back out again for another application to consume?

[4:08] — Approach 1: JSONEachRow — Direct Structured Loading

Robert: The easiest case is when your JSON has a stable structure where keys don’t change. For nginx logs, for example, the format is essentially fixed.

Create a table with one column per JSON key:

CREATE TABLE http_logs (

    timestamp  DateTime,

    client_ip  IPv4,

    status     UInt16,

    request    String,

    size       UInt32

)

ENGINE = MergeTree()

PARTITION BY toStartOfDay(timestamp)

ORDER BY timestamp;

Then load the nginx JSON log file directly:

cat access.log | clickhouse-client \

    –query=”INSERT INTO http_logs FORMAT JSONEachRow”

JSONEachRow is an input format — ClickHouse sees one JSON document per line, parses it, matches each key to the corresponding column by name, and inserts the data. Super simple.

Writing it back out is equally simple. Use JSONEachRow as an output format:

SELECT * FROM http_logs FORMAT JSONEachRow;

You get well-formed JSON output with proper escaping. ClickHouse has 12–15 different JSON format variants (some input-only, some output-only). If you’re dealing with JSON in any standard form there’s a very good chance ClickHouse can already parse or produce it. For the full list of JSON-related input/output formats, see the ClickHouse schema design section of the Altinity Knowledge Base.

[7:50] — Approach 2: Storing JSON as a String Blob

Robert: The most common real-world approach — particularly when JSON schemas change or you’re not sure what you’ll need to query — is to store each JSON document as a string in a single column.

We call this a “blob.” Create a simple table:

CREATE TABLE http_logs_raw (

    file     String,

    message  String    — stores the full JSON document

)

ENGINE = MergeTree()

PARTITION BY file

ORDER BY tuple();

We can’t order by timestamp yet because the timestamp is buried inside the JSON string. Load the data using CSV format where each row has the filename and the full JSON document.

Once the data is loaded, you can query the JSON using JSONExtract functions:

SELECT

    JSONExtractString(message, ‘request’)  AS request,

    JSONExtractUInt(message, ‘status’)     AS status,

    countIf(status > 400)                  AS errors

FROM http_logs_raw

WHERE JSONExtractString(message, ‘timestamp’) BETWEEN ‘…’ AND ‘…’;

[11:01] — JSONExtract vs simpleJSONExtract Functions

Robert: There are two families of JSON parsing functions with different tradeoffs.

JSONExtract functions (also called jsonExtract variants): these are a full JSON parser. They correctly handle nested JSON — if you have a key request nested inside another object, this parser won’t get confused. They are typed: JSONExtractString, JSONExtractUInt, JSONExtractFloat, JSONExtract (with explicit type specification). They’re slightly slower because they do full parsing.

simpleJSONExtract / visitParam functions: a fast, simple parser. They work only for flat (non-nested) JSON. They’re named simpleJSONExtractString, simpleJSONExtractUInt, etc. (the visitParam name is being renamed to simpleJSONExtract to make the distinction clearer). They’re faster but will give incorrect results if a key name also appears in nested sub-documents.

The choice depends on your JSON structure. Both families are widely used.

[13:44] — Improving the Blob Table with a Materialized Timestamp

Robert: To make the blob table more useful, we can extract the timestamp as a proper column at load time:

CREATE TABLE http_logs_ts (

    file       String,

    message    String,

    timestamp  DateTime MATERIALIZED toDateTime(JSONExtractInt(message, ‘time’))

)

ENGINE = MergeTree()

PARTITION BY toYYYYMM(timestamp)

ORDER BY timestamp;

INSERT INTO http_logs_ts (file, message)

SELECT file, message FROM http_logs_raw;

Now the table has a proper PARTITION BY and ORDER BY timestamp, which:

  1. Enables efficient time-range queries using the primary key index
  2. Groups data by month for easier management
  3. Improves compression because sorted data has better compression ratios

[15:07] — Adding Materialized Columns Later with ALTER TABLE

Robert: As you discover what fields users care about, you can add materialized columns at any time — the operation is instant:

— Add a status column extracted from JSON, with ZSTD compression

ALTER TABLE http_logs_ts

ADD COLUMN status UInt16

DEFAULT JSONExtractUInt(message, ‘status’)

CODEC(ZSTD);

This completes instantly. For any new data inserted, the value is properly extracted and stored. For old parts that don’t have the data materialized, ClickHouse computes the expression on-the-fly at query time. If you want to force full materialization:

ALTER TABLE http_logs_ts UPDATE status = status WHERE 1;

This rewrites only those columns that changed. In a 30-day rolling log table you might not even need this — the old parts roll off naturally.

Adding a request column:

ALTER TABLE http_logs_ts

ADD COLUMN request String

DEFAULT JSONExtractString(message, ‘request’);

For the JSONAsString and materialized view as JSON parser pattern — using a Null engine source table and a materialized view to parse JSON on insert — the Altinity Knowledge Base has a detailed step-by-step example.

[18:02] — Query Simplicity and Performance After Materialization

Robert: After adding materialized columns, queries become plain SQL:

SELECT status, count() AS cnt

FROM http_logs_ts

WHERE timestamp BETWEEN ‘2022-01-01’ AND ‘2022-02-01’

GROUP BY status

ORDER BY status;

No extra function calls, no special syntax.

Performance comparison with ~1 million rows:

ApproachQuery time
Parsing JSON in SELECT with JSONExtract9.8 seconds
Materialized columns + primary key index14 milliseconds

That’s nearly a 700x improvement. Two reasons: less I/O (reading narrow typed columns instead of wide string JSON blobs), and the primary key index on timestamp lets ClickHouse skip blocks outside the time range entirely.

[19:20] — Approach 3: Paired Arrays

Robert: For JSON with variable keys or schemas that change over time, another option is paired arrays: store keys in one Array(String) column and the corresponding values in another:

CREATE TABLE http_logs_arrays (

    file       String,

    keys       Array(String),

    values     Array(String),

    timestamp  DateTime MATERIALIZED toDateTime(…)

)

ENGINE = MergeTree()

PARTITION BY toYYYYMM(timestamp)

ORDER BY timestamp;

Loading requires extracting keys and values using array functions:

INSERT INTO http_logs_arrays

SELECT file,

    JSONExtractKeys(message)   AS keys,

    — arrayMap to extract all values as strings

    arrayMap(k -> JSONExtractString(message, k), JSONExtractKeys(message)) AS values

FROM http_logs_raw;

Querying uses array lookup:

SELECT values[indexOf(keys, ‘status’)] AS status, count()

FROM http_logs_arrays

GROUP BY status

ORDER BY status;

Performance: about 4–5x faster than querying raw JSON strings. Still not as fast as fully materialized columns, but works for any key set without needing to know the schema in advance.

For patterns combining paired arrays with JSONEachRow, tuples, maps, and materialized views see the Altinity Knowledge Base.

[23:34] — Approach 4: Map Data Type

Robert: A cleaner alternative to paired arrays is the Map data type:

CREATE TABLE http_logs_map (

    file       String,

    message    Map(String, String),  — keys → values as a proper map type

    timestamp  DateTime MATERIALIZED toDateTime(…)

)

ENGINE = MergeTree()

PARTITION BY toYYYYMM(timestamp)

ORDER BY timestamp;

Loading uses JSONExtractKeysAndValues:

INSERT INTO http_logs_map (file, message)

SELECT file,

    JSONExtractKeysAndValues(raw_message, ‘String’) AS message

FROM http_logs_raw;

Querying is simpler:

SELECT message[‘status’] AS status, count()

FROM http_logs_map

GROUP BY status

ORDER BY status;

Performance: also about 4–5x faster than raw JSON string queries. Map lookup is syntactically cleaner than the paired array indexOf pattern.

[25:30] — Materialized Views for Automatic Transformation

Robert: In real production pipelines you wouldn’t want to manually run SELECT/INSERT to populate the structured tables. The standard technique is a materialized view. In the interest of time we’re not showing the full example here, but the idea is: data lands in http_logs_raw and a materialized view automatically fires to populate http_logs_ts (or the map or arrays variant). This is a completely standard pattern in ClickHouse — see the ClickHouse materialized views Knowledge Base article for details.

[26:07] — Handoff to Diego Nieto: The New JSON Data Type

Robert: That’s the classic features for JSON in ClickHouse — features that have been available and incrementally improving over time. As of version 22.3 there’s a new data type specifically designed to hold JSON data. I’ll hand over to Diego to explain how this works.

Diego: Thank you Robert. Hello everyone.

[26:39] — How Other Databases Handle JSON

Diego: Before diving in — other databases handle JSON in a variety of ways. PostgreSQL has a json type (stores text, validates on insert) and jsonb (binary, stores parsed). MySQL has a native JSON type with its own path syntax. These all basically store JSON as a blob and parse at query time when you access specific fields.

[27:56] — How ClickHouse Stored JSON Until 22.3

Diego: Until recently, ClickHouse stored JSON in String data types with two families of parsers as Robert described. The key problem: parsing overhead happens at query time. For every query that reads a JSON field, the parser runs again.

ClickHouse developed two parsers: a simple one for flat key-value JSON (the simpleJSONExtract / visitParam family), and a full fledged one for nested JSON (the JSONExtract family). The visitParam name is being renamed to simpleJSONExtract to make the distinction clearer.

The classic query approach shows another inefficiency:

— Option A: three separate JSONExtract calls (parser runs 3 times)

SELECT JSONExtractString(json, ‘a’), JSONExtractUInt(json, ‘b’), JSONExtractString(json, ‘c’)

FROM …;

— Option B: parse once with JSONExtract + tuple structure (parser runs 2 times)

WITH JSONExtract(json, ‘Tuple(a String, b UInt32, c String)’) AS parsed

SELECT tupleElement(parsed, ‘a’), tupleElement(parsed, ‘b’), tupleElement(parsed, ‘c’)

FROM …;

Option B is better — parse the JSON once, then reference the parsed structure. But with the new JSON data type you forget about this entirely.

[35:31] — The New JSON Data Type

Diego: The new JSON data type, introduced experimentally in 22.3 and tested on 22.6, fundamentally changes the approach. Enable it with:

SET allow_experimental_object_type = 1;

Key properties:

  1. Parses at INSERT time, not query time. All the parsing overhead moves to ingest, where it’s less noticeable and doesn’t block user queries.
  2. Stores in columnar native format. Each JSON key becomes a column. A JSON document with 20 keys creates 20 columns, with all the benefits of ClickHouse’s columnar storage: compression, vectorized reads, parallelism.
  3. Type inference. ClickHouse automatically infers the data type of each key — integers, strings, arrays, nested objects.
  4. Dot-notation access. Instead of JSONExtractString(raw, ‘tag’) you write raw.tag. For nested paths: raw.user.id.
  5. ARRAY JOIN compatibility. Array fields in JSON work directly with ARRAY JOIN, making it easy to flatten nested arrays.

[38:28] — Creating a Table with the JSON Data Type

Diego: For the Stack Overflow dataset (11 million rows of Q&A data with nested fields like tags, user info, etc.):

SET allow_experimental_object_type = 1;

CREATE TABLE stack_overflow_js (

    raw JSON  — the entire document goes in one column

)

ENGINE = MergeTree()

ORDER BY tuple();

Loading with the JSONAsObject input format:

INSERT INTO stack_overflow_js

SELECT * FROM url(‘https://…’, JSONAsObject);

Or from S3:

INSERT INTO stack_overflow_js

SELECT * FROM s3(‘s3://bucket/stack_overflow_*.jsonl’, JSONAsObject);

The count confirms 11 million rows loaded.

[41:54] — Inspecting the Underlying Column Structure

Diego: To see what ClickHouse created under the covers:

— Default: shows one column called ‘raw’ of type JSON

DESCRIBE TABLE stack_overflow_js;

— With this setting: shows all sub-columns

SET describe_include_subcolumns = 1;

DESCRIBE TABLE stack_overflow_js;

— Shows: raw.answers (nested), raw.answers.date (String), raw.qid (String), raw.tag (Array(String)), etc.

— See each column on its own line

DESCRIBE TABLE stack_overflow_js

INCLUDE SUBCOLUMNS = 1

FORMAT Vertical;

— Shows: raw.qid String, raw.user String, raw.title String, raw.tag Array(String), raw.answers Tuple(…), …

To reconstruct the original JSON from the table:

SELECT raw FORMAT JSONEachRow;

— Returns the exact JSON you ingested, reconstructed from the columnar storage

[43:45] — Performance Benchmark: New JSON Type vs String + JSONExtract

Diego: Let’s compare. The query: get the top 100 most common tags across all Stack Overflow questions.

Old approach with JSONExtract on a string column:

SELECT topK(100)(tag)

FROM (

    SELECT arrayJoin(JSONExtract(raw, ‘tag’, ‘Array(String)’)) AS tag

    FROM stack_overflow_str

);

— Result: 2.1 seconds, 11.2 million rows, 3.73 GB processed

New JSON data type with dot notation:

SELECT topK(100)(tag)

FROM (

    SELECT arrayJoin(raw.tag) AS tag

    FROM stack_overflow_js

);

— Result: 0.3 seconds, 11.2 million rows, 642 MB processed

The new JSON type is 7x faster and reads 6x less data for this query. The reason: the tag array is stored as a proper columnar type, so ClickHouse only reads that one column, already typed and compressed, with no JSON parsing overhead at query time.

For the full benchmark write-up and SQL examples, see the ClickHouse JSON data type, version 22.6 blog post on the Altinity blog.

[46:02] — Handling Evolving Schemas with the New JSON Type

Diego: What if you insert two different JSON structures into the same table?

For example, first insert Stack Overflow data with qid, user, title, tag. Then insert something completely different with foo and bar keys.

ClickHouse creates a new part with the foo/bar structure. Running:

OPTIMIZE TABLE stack_overflow_js FINAL;

causes ClickHouse to merge the parts together, creating a combined structure with all known keys. The merged result has both the original Stack Overflow columns and the new foo/bar columns — older rows simply have NULL (or default values) in the new columns.

This is both a feature and a potential problem:

Feature: your table can evolve naturally as you insert new JSON schemas. You can query any key from any event type using raw.field_name.

Potential problem: schema pollution. If you accidentally insert different types of data, you’ll accumulate columns you didn’t intend. There’s no warning or error. You might end up with unexpected columns mixed into your data. Be deliberate about what you insert into a JSON-typed column.

[50:07] — Wide vs Compact Parts: Handling Large Column Counts

Diego: What about JSON with 1,000 keys? Wouldn’t that create 1,000 columns?

ClickHouse has two part storage formats: wide (one file per column) and compact (multiple columns packed together).

For the new JSON type, ClickHouse uses hashing. If it detects multiple columns with similar data types — for example, many sparse String columns — it hashes them together and stores them as a compact block. This can reduce effective column count from 1,000 to, say, 100 blocks. The data is still queryable by name, but the storage is much more efficient for wide, sparse schemas.

This is controlled by thresholds in ClickHouse settings. When you reach a certain size or column count, it automatically shifts from wide to compact format. You don’t need to configure this manually in most cases — it works flawlessly with ClickHouse’s normal columnar access patterns.

[53:34] — Codec Comparison: LZ4 vs ZSTD on the New JSON Type

Diego: Compression with the new JSON type is worth experimenting with. To change the codec on a JSON column:

ALTER TABLE stack_overflow_js

MODIFY COLUMN raw JSON CODEC(ZSTD);

OPTIMIZE TABLE stack_overflow_js FINAL;

Benchmark results on 11M rows:

QueryOld string + JSONExtractNew JSON type + LZ4New JSON type + ZSTD
topK(100)2.1s0.3s0.31s
Storage1.3 GB800 MB

ZSTD adds only about 1 millisecond of query overhead but reduces storage by about 40% (from 1.3 GB to 800 MB). For large datasets, this storage saving is significant.

[55:04] — Summary and What to Explore Next

Robert: Here’s a quick summary of the secrets to processing JSON in ClickHouse:

  • Input and output formats: check them out — there are 12–15 JSON variants covering almost any standard format.
  • simpleJSONExtract vs JSONExtract: simple is faster for flat JSON; full is required for nested.
  • String blob + materialized columns: the most common production pattern — store the raw JSON, add typed columns as you discover what you need. Instant ALTER TABLE, 700x faster queries.
  • Paired arrays and Map: good for variable-key JSON without full materialization; 4–5x faster than raw string parsing.
  • New JSON data type: clearly the path forward. Parse at insert time, columnar storage, dot notation, no function calls. Still experimental as of 22.6 — recommended to test on the latest build.

Topics to explore further (not covered today due to time):

  • Using materialized views to populate JSON data type columns
  • Indexing JSON data type columns (skip indexes on sub-columns)
  • Bloom filters on string blob columns
  • More compression and codec tricks

For documentation: ClickHouse official docs have good JSON coverage, the Altinity Knowledge Base covers JSON schema design patterns extensively, and for new features the ClickHouse source code tests are the best place to see exactly what currently works.

[58:40] — Q&A: Heterogeneous Schemas and Sparse Columns

Robert: There’s a great question from Marcio: how does the new JSON feature work with heterogeneous schema rows — for example, an events table where the JSON schema is completely different for different event types? And how does it scale with potentially thousands of sub-columns, where each sub-column is only used for a certain type of event?

Diego: If schemas are different, as I showed, it will just insert a new schema. You’ll get a big tuple with all the different columns combined. You can reference the new event types because you know the structure in advance — just use json_column.new_key. The key concern is when you have 1,000+ columns.

That’s where the wide vs compact format comes in. ClickHouse hashes all the columns and if it detects columns with more or less the same data type, it groups them into a compact packed block. This reduces from 1,000 effective columns to maybe 100 compact blocks. The query behavior is the same — you still reference columns by name — but storage is far more efficient.

Robert: I’d add: there isn’t a discriminator yet in the new JSON data type — there’s no mechanism to tell ClickHouse “this is event type A so these columns belong together and those don’t.” That may appear in future versions. For now, if you have a really large number of different event types with very sparse schemas, you may still be better off using string storage with JSONExtract and managing your own discriminator column. But over time I’d expect the new JSON type to get much more efficient for this use case — it’s one of the most requested features in ClickHouse and has a lot of contributors.

Diego: In the 22.8 LTS we expect the new JSON type to be mature enough to use in production. Until then: try it out, test it on 22.6 or later, and report issues. The more feedback it gets now the better it will be when 22.8 ships.

Robert: Thank you all for attending. Feel free to contact us at altinity.com, join our Slack workspace, and definitely try out the new JSON data type and report back. See you soon.

Diego: Thanks, everyone!

FAQ

What is the difference between JSONExtract and simpleJSONExtract / visitParam functions in ClickHouse?

JSONExtract functions (also called json_extract_*) are a full JSON parser that correctly handles nested documents and type-safely extracts values. They are slightly slower because they parse the entire JSON structure. The simpleJSONExtract / visitParam functions (being renamed from visitParam for clarity) are a faster, simpler parser designed only for flat key-value JSON. They do not correctly handle nesting — if your JSON has nested sub-documents, use JSONExtract. If your JSON is flat and you need maximum speed, use simpleJSONExtract.

What are the main approaches to storing and querying JSON in ClickHouse before version 22.3?

There are four main approaches. First, use JSONEachRow input format to load structured JSON directly into a table with one column per key — best for fixed schemas. Second, store each JSON document as a String column (blob) and use JSONExtract functions to pull out values at query time — universal but slow. Add materialized columns extracted from the JSON to dramatically improve query speed (up to 700x in examples). Third, use paired arrays (one Array(String) for keys, one for values) — works for variable schemas, about 4–5x faster than raw string parsing. Fourth, use the Map(String, String) data type — similar performance to paired arrays with cleaner query syntax.

What is the new JSON data type in ClickHouse and how does it differ from storing JSON as a String?

The new JSON data type (experimental since 22.3) parses JSON at INSERT time rather than query time. Each JSON key is stored as a separate column in ClickHouse’s native columnar format, with types inferred automatically. Accessing fields uses simple dot notation (raw.tag) instead of JSONExtract function calls. This eliminates parsing overhead from queries, stores data more compactly in native format, and enables all ClickHouse columnar optimizations (compression, vectorized execution, primary key indexes on sub-columns). In benchmarks on 11M rows, a topK query on a nested array field was 7x faster and read 6x less data compared to the string+JSONExtract approach.

How does the new JSON data type handle schemas that change between inserts?

When you insert JSON with a different set of keys than what was previously inserted, ClickHouse creates a new part with the new structure. Running OPTIMIZE TABLE … FINAL merges the parts, combining both schemas into a single structure where older rows have default/null values for columns they don’t have. This is flexible but can lead to unexpected schema accumulation if you accidentally mix unrelated JSON schemas. There is currently no discriminator mechanism to keep different event schemas isolated within a JSON column.

How does the new JSON type handle tables with thousands of sparse columns?

ClickHouse uses two part storage formats: wide (one file per column) and compact (multiple columns packed together). For tables with many sparse columns created by the new JSON type, ClickHouse can detect columns with similar data types, hash them together, and store them in compact format. This reduces the effective file count dramatically — potentially from 1,000 logical columns to 100 compact blocks — while preserving full queryability by field name. The threshold for switching between wide and compact is controlled by ClickHouse settings.

When should I use string+materialized columns vs the new JSON data type?

Use string blob with materialized columns when: you need production stability (22.3 and later versions where the JSON type is still experimental), you have a very heterogeneous schema with many different event types and thousands of sparse sub-columns, or you need full control over exactly which columns are stored and how. Use the new JSON data type when: you can run on a recent ClickHouse version (22.6+) in a non-critical environment, your JSON has a relatively stable structure, you want simpler query syntax with dot notation, or you need maximum query performance on nested JSON without manually managing materialized columns.


© 2022 Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.

Join our Slack

ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.

Related:

Leave a Reply

Your email address will not be published. Required fields are marked *