{ ClickHouse: “JSON data type”, version: 22.6 }

Starting with the 22.3 LTS release there are many fixes and improvements but also many nice features that will make developer life easier. And behold! ClickHouse also has a new JSON data type! Beware: it is EXPERIMENTAL. But there’s more than enough there to be worth looking at closely. JSON data type is going to be one of the great features of ClickHouse for 2022. 

ClickHouse has for years offered the ability to store JSON objects using the String data type.  Because JSON is a text format, it required complex parsing with specialized functions to get the information out. 

With the new data type ClickHouse parses JSON data at INSERT time. It automatically creates the underlying table structure and stores the JSON in a columnar ClickHouse native format using as named tuples and arrays. 

Now we can benefit from the powerful query notation that we’re used to like `array[x]` and `tuple.element` in a performant way. Using this named tuple notation we can manipulate the JSON object via SQL in a more intuitive manner that is as convenient as languages like Python.

Important note on JSON support and ClickHouse versions

The new JSON data type is experimental but improving quickly. You should use the latest ClickHouse build version to try out JSON data type capabilities. Examples shown here were tested on 22.6. 

Storing JSON the old school way

All ClickHouse versions can store JSON objects in String types. To retrieve a value we have to use two sets of `JSON` functions, each set supported by a parser. The first set is a family of simpleJSON functions that only work for simple non-nested `JSON` files. The second set is a family of JSONExtract functions that can fully parse any JSON object. 

The functions are fast but you have to know the type of the field to extract. If the JSON data is nested, you may need to parse the JSON many times, once for each column. In the query below we showcase how to extract directly a JSON field with the functions JSONExtractUInt/Array and also nested elements with the `tupleElement` function:

WITH JSONExtract(json, 'Tuple(a UInt32, b UInt32, c Nested(d UInt32, e String))') AS parsed_json
SELECT
    JSONExtractUInt(json, 'a') AS a,
    JSONExtractUInt(json, 'b') AS b,
    JSONExtractArrayRaw(json, 'c') AS array_c,
    tupleElement(parsed_json, 'a') AS a_tuple,
    tupleElement(parsed_json, 'b') AS b_tuple,
    tupleElement(parsed_json, 'c') AS array_c_tuple,
    tupleElement(tupleElement(parsed_json, 'c'), 'd') AS `c.d`,
    tupleElement(tupleElement(parsed_json, 'c'), 'e') AS `c.e`
FROM
(
SELECT '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}, {"d":3,"e":"str_1"},  {"d":4,"e":"str_1"}, {"d":7,"e":"str_9"}]}' AS json
)
FORMAT Vertical

Let’s dive into the above difficult-to-read-query. First, we need to know the structure of the JSON parsing it using the `JSONExtract` function and generate a tuple structure. We can use a WITH statement to do this. Next we can use the `tupleElement` function to extract the tuples. As you can see, if the JSON has nested JSONs or arrays we have to call different `tupleElement` on top of another to get the nested tuples. 

We also can use nested JSONExtractUInt/Array functions to extract the values directly, but this approach is slower because for each column the JSON must be parsed. In this case it was parsed 3 times, two ints (a and b) and an array (array_c). 

As you can see,  both methods are cumbersome and ugly.

New JSON data type

Starting in 22.3 ClickHouse introduces a new JSON data type. With this type we don’t need to know the type we’re retrieving and we don’t need to use the JSONExtract functions to get values from fields, no matter if they are nested or not. To showcase the new JSON data type we’re going to use a dump of StackOverflow posts retrieved as of June 10, 2016. You can check here the structure and info of the dataset.

So let’s start by setting the new feature flag and creating a very basic table with only one column:

SET allow_experimental_object_type = 1;

CREATE TABLE json_test.stack_overflow_js
(
    `raw` JSON
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS index_granularity = 8192

Using file() functions we can process the JSON file as a String and let ClickHouse parse and insert rows. 

INSERT INTO stack_overflow_js SELECT json FROM file('stack_overflow_nested.json.gz', JSONAsObject);

SELECT count(*)  FROM stack_overflow_js;
…
11203029 rows in set. Elapsed: 2.323 sec. Processed 11.20 million rows, 3.35 GB (4.82 million rows/s., 1.44 GB/s.)

So how did ClickHouse generate the storage structure? As we commented earlier, it parsed the JSON and broke it down into different tuples and nested structures (arrays) and assigned to each a type. After that it created one column for each element. 

If you’re curious you can check the underlying structure by setting `describe_extend_object_types` like this:

SET describe_extend_object_types = 1;
DESCRIBE TABLE stack_overflow_js
FORMAT Vertical;

Row 1:
──────
name:               raw
type:               Tuple(answers Nested(date String, user String), creationDate String, qid String, tag Array(String), title String, user String)
default_type:       
default_expression: 
comment:            
codec_expression:   
ttl_expression:     

1 rows in set. Elapsed: 0.002 sec. 

If you want to have a very detailed description of each column just set the `describe_include_subcolumns` property and describe the table again:

SET describe_include_subcolumns = 1;
DESCRIBE stack_overflow_js;

Row 1:
──────
name:               raw
type:               Tuple(answers Nested(date String, user String), creationDate String, qid String, tag Array(String), title String, user String)
default_type:       
default_expression: 
comment:            
codec_expression:   
ttl_expression:     
is_subcolumn:       0

Row 2:
──────
name:               raw.answers
type:               Nested(date String, user String)
default_type:       
default_expression: 
comment:            
codec_expression:   
ttl_expression:     
is_subcolumn:       1

Using a more visually appealing schema, we can illustrate what ClickHouse has done internally:

Now let’s see how to get some interesting info using the new notation:

SELECT raw FROM stack_overflow_js LIMIT 1 FORMAT Vertical

Row 1:
──────
raw: ([('2009-06-16T09:55:57.320','Elton John (22595)'),('2009-06-17T12:34:22.643','Jack Black (77153)')],'2009-06-16T07:28:42.770','1000000',['vb6','progress-bar'],'Display Progress Bar at the Time of Processing','Jash')

1 rows in set. Elapsed: 0.023 sec.

The older syntax to access the tuple elements by ID still works. 

SELECT raw.1 FROM stack_overflow_js LIMIT 1 FORMAT Vertical

Row 1:
──────
tupleElement(raw, 1): [('2009-06-16T09:55:57.320','Elton John (22595)'),('2009-06-17T12:34:22.643','Jack Black (77153)')]

1 rows in set. Elapsed: 0.016 sec.

Since it is hard to guess the order in the tuple, we are better off using named tuples like `tupleElement(raw, ‘field_name’)` or just `raw.field_name`.

To get the structure of the nested JSON we must simply set the property `output_format_json_named_tuples_as_objects` like this:

SET output_format_json_named_tuples_as_objects = 1;
SELECT raw FROM stack_overflow_js  LIMIT 1 FORMAT JSONEachRow;

{"raw":{"answers":[{"date":"2009-06-16T09:55:57.320","user":"Elton John (22595)"},{"date":"2009-06-17T12:34:22.643","user":"Jack Black (77153)"}],"creationDate":"2009-06-16T07:28:42.770","qid":"1000000","tag":["vb6","progress-bar"],"title":"Display Progress Bar at the Time of Processing","user":"Jash"}}

1 rows in set. Elapsed: 0.007 sec.

Now, let’s use some of these name fields to get some data, using the named tuple notation:

SELECT raw.answers FROM stack_overflow_js LIMIT 1 FORMAT Vertical

The above is the same as:

SELECT raw.1 FROM stack_overflow_js LIMIT 1 FORMAT Vertical;

Row 1:
──────
tupleElement(raw, 1): [('2009-06-16T09:55:57.320','Elton John (22595)'),('2009-06-17T12:34:22.643','Jack Black (77153)')]

1 rows in set. Elapsed: 0.008 sec. 


So what if we wanted to get the topK tags?. How would the new approach compare to the old approach? Let’s find out, using the magic of array functions. But first let’s create a table for storing the JSON object as a `String` and load the JSON file:

CREATE TABLE nested_json.stack_overflow_str
(
    `raw` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS index_granularity = 8192

INSERT INTO stack_overflow_str SELECT json FROM file('fixtures/stack_overflow_nested.json.gz', JSONAsString);

Now we have two tables: `stack_overflow_js` with the new JSON data type and `stack_overflow_str` using the classic JSON approach. `topK` is a function that calculates the top most frequently used tags in order. These tags are stored in an array nested inside the JSON object, their values are double-quoted because of JSON syntax and we need to feed a list of ‘cleaned’ values to the topK function. So how to achieve this?:

  1. `JSONExtract` extracts the array in native format.
  2. Generate a row per each element of the array using `arrayJoin()` so the `topK` function can do its magic. Here we could use `topKArray()` instead of `topK()` function and get rid of `arrayJoin()`. This will be a little more performant but for the sake of didactics let’s keep the topK() form.

This is the final result. Pretty ugly, but it works reasonably well:

SELECT topK(100)(arrayJoin(JSONExtract(raw, 'tag','Array(String)'))) FROM stack_overflow_str;

['javascript','java','c#','php','android','jquery' …..]

1 rows in set. Elapsed: 2.101 sec. Processed 11.20 million rows, 3.73 GB (5.33 million rows/s., 1.77 GB/s.)

Let’s see how to do this using the new approach:

SELECT topK(100)(arrayJoin(raw.tag)) FROM stack_overflow_js

['javascript','java','c#','php','android','jquery',.... ]

1 rows in set. Elapsed: 0.331 sec. Processed 11.20 million rows, 642.07 MB (33.90 million rows/s., 1.94 GB/s.)

That’s 0.3 seconds vs 2.1 seconds for 11M rows. 

What happened using the new JSON data type?. ClickHouse just retrieves the array, converts all the elements into rows and applies the `topK` function. In order to get similar performance using the `JSONExtract` method, we would need to create some materialized columns and extract the tag array and other interesting fields like date or users to columns in a destination table like this:

CREATE TABLE stack_overflow_qa
(
    	`date` Datetime MATERIALIZED parseDateTime32BestEffort(JSONExtractString(json, 'creationDate')),
	`tags` Array(LowCardinality(String)) MATERIALIZED JSONExtract(raw, 'tag','Array(String)'),
`json` JSON Codec(ZSTD)
)
ENGINE = MergeTree
ORDER BY (date)
SETTINGS index_granularity = 8192

Some ClickHouse hints: 

  • Tags array could be redefined as a LowCardinality.
  • If our `JSON` is big we could change the default compression from `LZ4` to ZSTD and reduce the space in storage.
  • The date field looks like a `ISO 8601` timestamp so we would need the `parseDateTime32BestEffort()` function to convert it to `DateTime`.

What’s missing from the JSON data type? 

When we inserted the JSON file we specified a string format (`JSONAsString`) that treats every JSON row as a `String` without structure. After that ClickHouse parses the String and infers the structure. Trying to completely flatten a JSON and insert it into one column for each value is not a very good practice because it will make the ETL process more complex. We could benefit from the simplicity of extracting the values when we need to, because the performance penalty will be negligible.  ClickHouse is very fast, and with this feature, we can get the fields of a JSON object fast as well. 

But there are some drawbacks. What if there are JSON schema changes like new keys or type changes in values etc. Let’s see by inserting a totally different JSON object:

INSERT INTO stack_overflow_js VALUES ('{ "bar": "hello", "foo": 1 }');

It worked! But how does ClickHouse cope with these changes?. We can check with this query:

SELECT table,
    column,
    name AS part_name,
    type,
    subcolumns.names,
    subcolumns.types,
    subcolumns.serializations
FROM system.parts_columns
WHERE table = 'stack_overflow_js'
FORMAT Vertical

Checking the results we can see that there are 2 parts, the first one for the previous INSERTs with the original JSON structure (stack_overflow) and the second one with the new structure:

Row 1:
──────
table:                     stack_overflow_js
column:                    raw
part_name:                 all_12_22_5
type:                      Tuple(answers Nested(date String, user String), creationDate String, qid String, tag Array(String), title String, user String)
subcolumns.names:          ['answers','answers.size0','answers.date','answers.user','creationDate','qid','tag','tag.size0','title','user']
subcolumns.types:          ['Nested(date String, user String)','UInt64','Array(String)','Array(String)','String','String','Array(String)','UInt64','String','String']
subcolumns.serializations: ['Default','Default','Default','Default','Default','Default','Default','Default','Default','Default']

Row 2:
──────
table:                     stack_overflow_js
column:                    raw
part_name:                 all_23_23_0
type:                      Tuple(Bar String, foo Int8)
subcolumns.names:          ['foo','foo']
subcolumns.types:          ['String','String']
subcolumns.serializations: ['Default','Default']

So, what happens when we execute  `OPTIMIZE TABLE FINAL` to consolidate the structure?

Row 1:
──────
table:                     stack_overflow_js
column:                    raw
part_name:                 all_12_23_6
type:                      Tuple(bar String, answers Nested(date String, user String), creationDate String, foo Int8, qid String, tag Array(String), title String, user String)
subcolumns.names:          ['bar','answers','answers.size0','answers.date','answers.user','creationDate','foo','qid','tag','tag.size0','title','user']
subcolumns.types:          ['String','Nested(date String, user String)','UInt64','Array(String)','Array(String)','String','String','String','Array(String)','UInt64','String','String']
subcolumns.serializations: ['Default','Default','Default','Default','Default','Default','Default','Default','Default','Default','Default','Default']

Row 2:
──────
table:                     stack_overflow_js
column:                    raw
part_name:                 all_12_24_7
type:                      Tuple(bar String, answers Nested(date String, user String), creationDate String, foo Int8, qid String, tag Array(String), title String, user String)
subcolumns.names:          ['foo','answers','answers.size0','answers.date','answers.user','creationDate','bar','qid','tag','tag.size0','title','user']
subcolumns.types:          ['String','Nested(date String, user String)','UInt64','Array(String)','Array(String)','String','String','String','Array(String)','UInt64','String','String']
subcolumns.serializations: ['Default','Default','Default','Default','Default','Default','Default','Default','Default','Default','Default','Default']

Well, ClickHouse has created 1 new part updating the existing tuple structure, mixing the stack overflow tuple and the new tuple. If by some accident we insert some malformed JSON, we won’t get any errors and ClickHouse will update the structure polluting our data. Also with this new notation in selects we won’t be able to read slices (raw.answers.*). 

Another question we may ask ourselves is: if ClickHouse creates 1 column per key/subkey and we have a pretty large JSON with thousands of unique keys, is there a maximum number of keys our JSON should have?. Well, columns will be combined almost identical as compact parts but based on their hash. 

Columns with the same hash will form a group. So we can have many unique keys up until less than 1K because they will be converted into compact parts to optimize space and improve performance. In this github PR there is an interesting discussion about how this is solved.

Compression is another interesting place where we can make improvements. We cannot apply any specialized codecs: for example the Delta codec. It is normally used with temporal data because  it calculates the difference between adjacent values to achieve a good compression ratio. In our case having a JSON string without any temporal pattern does not seem like a good fit for the Delta codec. The delta codec could be used in the materialized column `date` defined in the `stack_overflow_qa` table.

As we are working with JSON objects that basically are strings we can use LZ4 or ZSTD generic compression codecs.

So let’s go check if changing a CODEC will improve storage with no query time penalty:

Both `stack_overflow_js` and `stack_overflow_str` tables use the default LZ4 compression. We can change the CODEC on both tables to test it. But first let’s see the actual values for LZ4:

tablecolumncompresseduncompressed
stack_overflow_strraw1.73 GiB3.73 GiB
stack_overflow_jsonraw1.30 GiB 2.29 GiB
ALTER TABLE stack_overflow_str MODIFY COLUMN raw CODEC(ZSTD(3));
ALTER TABLE stack_overflow_js MODIFY COLUMN raw CODEC(ZSTD(3));
OPTIMIZE TABLE stack_overflow_str FINAL;
OPTIMIZE TABLE stack_overflow_js FINAL;

If we query the compression level of both columns:

SELECT
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed
FROM system.parts_columns
WHERE table IN ('stack_overflow_js', 'stack_overflow_str') AND column IN ('raw') AND active = 1
GROUP BY
    table,
    column
tablecolumncompresseduncompressed
stack_overflow_strraw1.23 GiB3.73 GiB
stack_overflow_jsonraw866.77 MiB 2.29 GiB

We can see that the compression level for ZSTD is significant, over 50%. Let’s see if this affects query time by using the `topK` query for each table:

– Query to stack_overflow_js (ZSTD)

SELECT topK(100)(arrayJoin(raw.tag)) FROM stack_overflow_js;

1 rows in set. Elapsed: 0.433 sec. Processed 11.20 million rows, 642.07 MB (25.84 million rows/s., 1.48 GB/s.)


– Query to stack_overflow_str (ZSTD)

SELECT topK(100)(arrayJoin(arrayMap(x -> replaceAll(x, '"', ''), JSONExtractArrayRaw(raw, 'tag')))) FROM stack_overflow_str;

1 rows in set. Elapsed: 2.899 sec. Processed 12.25 million rows, 4.09 GB (4.23 million rows/s., 1.41 GB/s.)

For the `stack_overflow_js` we had 0.3s with LZ4, which now is 0.4s with ZSTD(3).  As for the `stack_overflow_str` example, we moved from 2.1s to 2.8s so there is no big impact in query time and if we apply a generic codec like ZSTD we would get some big storage savings.

JSON is evolving rapidly and there are still a few pitfalls for the unwary. To check the ongoing maturity of this feature you can check the current JSON issues and PRs

Conclusion and further thoughts

The new JSON data type is the most user-friendly ClickHouse feature to arrive in 2022. The old ClickHouse way stores JSON in String columns and processed values using cumbersome JSONExtract* and JSONVisit* functions. It works over a wide range of JSON data but required complex SQL code to parse values.

Using the JSON data type, ClickHouse now automatically parses values, inferring the nested structure. It also provides a much easier query syntax. You can refer to keys generically using Tuple references or use property names. Either way queries are far more readable, and it is easier to work with JSON data.

If you have enjoyed this article and would like to try out the new JSON data type, sign up for a free Altinity.Cloud trial and run the latest ClickHouse version in a few clicks. Feel free to contact us to discuss your analytic needs or join our slack channel. We love to help users build innovative applications on ClickHouse. The new JSON data type will make that even easier.

Share

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.