Harnessing the Power of ClickHouse Arrays – Part 1

ClickHouse contributors regularly add analytic features that go beyond standard SQL. This design approach is common in successful open source projects and reflects a bias toward solving real-world problems in creative ways. Arrays are a great example. Most modern databases have array data types, but their capabilities are often limited. ClickHouse is quite different: it unites arrays creatively with aggregation and functional programming to create an entirely new processing model. 

This multi-part article is an overview of arrays in ClickHouse, driven by specific examples (make sure to check out harnessing the power of ClickHouse arrays part 2 and part 3). We start with basic features like defining and loading array data, using array functions, and unrolling arrays to tabular format.  These enable ClickHouse users to handle time series with arbitrary parameters.  We build from there to explore how aggregation and arrays are related, as well as show examples of functional programming using lambdas and complex array functions. The advanced features enable us to solve problems like building analytic funnels and linear interpolation. The goal is to show what arrays are as well as why you should use them. 

Array basics

In ClickHouse arrays are just another column data type that represents a vector of values. Here is a simple example of array use in a ClickHouse table. 

CREATE TABLE array_test (
  floats Array(Float64),
  strings Array(String),
  nullable_strings Array(Nullable(String))
) ENGINE=TinyLog

Using arrays is easy, especially if you are familiar with popular programming languages like Python or Javascript.  You can create array values using square brackets and select individual values out of them, again with square brackets.  Here is an example of inserting array values and selecting them back again in different ways. Following SQL tradition, ClickHouse array indexes start at 1, not 0. 

INSERT INTO array_test 
VALUES([1.0, 2.5], ['a', 'c'], ['A', NULL, 'C'])

SELECT floats[1] AS v1, strings[2] AS v2, nullable_strings 
FROM array_test

┌─v1─┬─v2─┬─nullable_strings─┐
│  1 │ c  │ ['A',NULL,'C']   │
└────┴────┴──────────────────┘

ClickHouse has a rich library of array functions. You do not even need a table to demonstrate use.  Instead, it is easy to define array constants as the following example shows. This is a great way to test out function behavior quickly. 

SELECT 
    [1, 2, 4] AS array,
    has(array, 2) AS v1,
    has(array, 5) AS v2,
    length(array) AS v3

┌─array───┬─v1─┬─v2─┬─v3─┐
│ [1,2,4] │  1 │  0 │  3 │
└─────────┴────┴────┴────┘

So much for basic array behavior. For more details check out the ClickHouse documentation on the Array data type. Meanwhile we will start to put arrays to work solving analytic problems. 

Modeling variable data structures

Variable data structures like lists of key-value pairs (aka dictionaries or maps) occur constantly in analytic problems, especially those involving time series data. Take for example monitoring VMs running a public cloud. Particular VMs have different properties we want to measure (like values specific to SSD storage) as well as tags that vary by the team operating the VM (like application type). Each monitoring record therefore contains two key-value lists, whose keys may change between different VMs and over time.

We can represent each key-value list using a pair of arrays. One array provides the property name, and the other provides a value at the same array index. Here is how we would model VM monitoring data in a table definition. Since there are two types of key-values, there are two sets of arrays: one for metric data and another for tag data. 

CREATE TABLE vm_data (
  datetime DateTime,  
  date Date default toDate(datetime),
  vm_id UInt32,  
  vm_type LowCardinality(String),  
  metrics_name Array(String),
  metrics_value Array(Float64),
  tags_name Array(String),  
  tags_value Array(String)
)
ENGINE = MergeTree()
PARTITION BY date 
ORDER BY (vm_type, vm_id, datetime)

You can load arrays directly using nested JSON structures like the following, shown in a pretty-printed format. 

[ { 
  "datetime": "2020-09-03 00:00:10",
  "vm_id": 6220,
  "vm_type": "m5.large",
  "metrics_name": [ "usage_idle", "ebs1_cap_gib", "ebs1_used_gib" ],
  "metrics_value": [ 80.2, 10.0, 7.6 ],
  "tags_name": [ "name", "group" ],
  "tags_value": [ "sfg-prod-01", "rtb" ]
},
{
  "datetime": "2020-09-03 00:00:12",
  "vm_id": 6221,
  "vm_type": "m5ad.xlarge",
  "metrics_name": [ "usage_idle", "ssd1_cap_gib", "ssd1_used_gib" ],
  "metrics_value": [ 59.19, 75.0, 21.9 ],
  "tags_name": [ "name", "group", "owner" ],
  "tags_value": [ "mt-prod-65", "marketing", "casey" ]
} ]

We use the following command to load our sample data into ClickHouse.  The jq command strips records out of the JSON array and delivers each record on a single line to match ClickHouse JSONEachRow input format

cat vm_data.json |jq -c .[] | clickhouse-client --database arrays \
--query="INSERT INTO vm_data FORMAT JSONEachRow"

Once the data are loaded we can now operate on them using SQL. We’ll start with selecting back individual records, which is trivial: 

SELECT *
FROM vm_data
LIMIT 1\G

Row 1:
──────
datetime:      2020-09-03 00:00:10
date:          2020-09-03
vm_id:         6220
vm_type:       m5.large
metrics_name:  ['usage_idle','ebs1_cap_gib','ebs1_used_gib']
metrics_value: [80.2,10,7.6]
tags_name:     ['name','group']
tags_value:    ['sfg-prod-01','rtb']

As mentioned already, ClickHouse offers a wealth of array functions to process data directly within arrays. For example, here is a quick way to find any VM that lacks ‘name’,  ‘group’, and ‘owner’ tags.  We can use the hasAll() function, which verifies that the first array argument contains the subset of values defined by the second argument. 

WITH ['name', 'group', 'owner'] AS required_tags
SELECT distinct vm_id
FROM vm_data
WHERE NOT hasAll(tags_name, required_tags)

Note the use of WITH to define an array for use in the query.  This is an example of a Common Table Expression or CTE.  CTEs help reduce query complexity by removing constant expressions from the main query and are a best practice in ClickHouse.  We’ll use them in other examples to keep things more readable. 

ClickHouse array functions are quite varied and cover a wide range of use cases. Here is how to look for the names of VMs with the ‘group’ tag value of ‘rtb’. As you can probably guess, the indexOf() function returns the index for a value. We can use it to reference values in another array This allows us to relate values between the tags_name and tags_value arrays.  

SELECT distinct vm_type
FROM vm_data
WHERE tags_value[indexOf(tags_name, 'group')] = 'rtb'

Joining array and table data

Working directly with array functions to locate and process tag values can be cumbersome, especially when working across arrays in several columns.  Fortunately, ClickHouse has a very handy ARRAY JOIN operator that makes it easy to ‘unroll’ the array values into a table of name value pairs.  Here’s an example of ARRAY JOIN in use. 

SELECT date, vm_id, vm_type, name, value
FROM vm_data
ARRAY JOIN tags_name AS name, tags_value AS value
ORDER BY date, vm_id, name

ARRAY JOIN works as follows.  The columns from the left-side table vm_data (date, vm_id, vm_type) “join” against the values in the arrays listed after ARRAY JOIN (tags_name, tags_value).  ClickHouse creates a column for each listed array and populates values in the same order from each one.  The result looks like the following.  Array values are in bold to show the join effects more clearly.

┌───────date─┬─vm_id─┬─vm_type─────┬─name──┬─value───────┐
│ 2020-09-03 │  6220 │ m5.large    │ group rtb         │
│ 2020-09-03 │  6220 │ m5.large    │ name  │ sfg-prod-01 │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ groupmarketing   │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ name  │ mt-prod-65  │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ ownercasey       │
└────────────┴───────┴─────────────┴───────┴─────────────┘

The ClickHouse docs have a nice write-up on ARRAY JOIN that illustrates its flexibility.  Here’s an example to whet your appetite. The following query adds a sequence number and sorts rows in the array sequence order using the handy arrayEnumerate() function, which returns array index values in ascending order. 

SELECT date, vm_id, vm_type, name, value, seq
FROM vm_data
ARRAY JOIN 
  tags_name AS name, 
  tags_value AS value, 
  arrayEnumerate(tags_name) AS seq
ORDER BY date, vm_id, seq

┌───────date─┬─vm_id─┬─vm_type─────┬─name──┬─value───────┬─seq─┐
│ 2020-09-03 │  6220 │ m5.large    │ name  │ sfg-prod-01 │   1 │
│ 2020-09-03 │  6220 │ m5.large    │ group │ rtb         │   2 │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ name  │ mt-prod-65  │   1 │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ group │ marketing   │   2 │
│ 2020-09-03 │  6221 │ m5ad.xlarge │ owner │ casey       │   3 │
└────────────┴───────┴─────────────┴───────┴─────────────┴─────┘

ARRAY JOIN is helpful for presenting output, since query results containing arrays are hard for humans to read and may require specialized deserialization logic in client applications.  It’s also very helpful to reduce query complexity.  Using ARRAY JOIN we can minimize or even eliminate array function expressions.  The following example is a rewrite of the earlier example to find VM types used by group ‘rtb’. 

SELECT distinct vm_type FROM (
  SELECT date, vm_id, vm_type, name, value
  FROM vm_data
  ARRAY JOIN tags_name AS name, tags_value AS value
  WHERE name = 'group' AND value = 'rtb'
)

It would be remiss to end our introduction of data modeling using arrays without mentioning the arrayJoin() function.  This function can be added within the SELECT column list to generate unrolled results as shown in the following example. 

SELECT 1, 2, arrayJoin(['a', 'b']) AS a1

┌─1─┬─2─┬─a1─┐
│ 1 │ 2 │ a  │
│ 1 │ 2 │ b  │
└───┴───┴────┘

This is exactly equivalent to the following query with ARRAY JOIN. 

SELECT 1, 2 FROM system.one ARRAY JOIN ['a', 'b'] AS a1

There is a critical difference, however.  As we saw above ARRAY JOIN allows multiple arrays and unrolls the values across all arrays in parallel.  The arrayJoin() function behaves differently.  If there are multiple arrayJoin() calls they generate results like the following. 

SELECT  1,  2, 
  arrayJoin(['a', 'b']) AS a1, arrayJoin(['i', 'ii']) AS a2

┌─1─┬─2─┬─a1─┬─a2─┐
│ 1 │ 2 │ a  │ i  │
│ 1 │ 2 │ a  │ ii │
│ 1 │ 2 │ b  │ i  │
│ 1 │ 2 │ b  │ ii │
└───┴───┴────┴────┘

As you can see, the result is a cartesian product of the array values, which may not be what you intended. For the rest of this document we’ll focus on ARRAY JOIN, as it allows us to process arrays with related values. This behavior is critical for more advanced applications of arrays. 

Conclusion and more to come…

The article you just read introduced basic usage of arrays in ClickHouse.  We showed how to represent variable data using paired arrays, how to extract data using array functions, and how to join array values against table rows using ARRAY JOIN and arrayJoin(). 

The array features we have covered already go beyond the capabilities of many SQL databases.  For ClickHouse they are just the beginning. In the next article we will show how arrays and SQL GROUP BY are closely connected. The integration between arrays and aggregation enables users to identify sequences of events as well as build funnels, which track progress toward desired goals in marketing, sales, and other domains. This is an important analytic tool for a wide range of interesting applications. Stay tuned!

Share

2 Comments

  1. Very helpful article. Just a small correction, vm_data json has an extra brace ( “},{” ) before the 2nd json row

Comments are closed.