Harnessing the Power of ClickHouse Arrays – Part 2

ClickHouse Arrays

Our previous article on ClickHouse arrays laid out basic array behavior.  We introduced basic array syntax, use of arrays to model key-value pairs, and how to unroll array values into tables using ARRAY JOIN.  As we noted, these features already offer substantial power to users, but there’s much more to come. 

In the current article we will dig into the integration between arrays and GROUP BY clauses. This integration gives users tools to solve problems like identifying event sequences and performing funnel analysis. Funnel analysis is an important technique that measures progress toward a specific goal, such as getting a user to buy products on a website. We’ll show how it works using arrays, then show an alternative approach using the ClickHouse windowFunnel() function. 

Build sequences with groupArray(), arraySort(), and lambdas

Tracking sequences is a common problem in analytic applications.  It comes up in many use cases, from tracking the path a user takes through an online service to computing aircraft itineraries.  In this section we will explore how to use arrays to track a sequence of events.  We will seek to solve the following question: show the longest itinerary a single commercial aircraft completes in one day.

Our dataset is the popular Airline On-Time Performance Data, which is available for ClickHouse. It can be downloaded following instructions from the ClickHouse docs.  The result is a single table named ‘ontime’ that contains a row for each commercial airline flight between a US origin and destination airport.  

To track the path that an aircraft traverses in a single day, we’ll need to find all flights of that aircraft, put them into order, and then count the resulting number of hops to order them. Aircraft belong to carriers and are identified by their tail number. Let’s start by figuring out the maximum number of hops for any aircraft. This does not require arrays and is easy to do with GROUP BY.  Here is an example to find the aircraft with the most flights on January 15th, 2017.

SELECT Carrier, TailNum, count(*) AS Hops
FROM ontime
WHERE (FlightDate = toDate('2017-01-15')) AND (DepTime < ArrTime)
GROUP BY Carrier, TailNum
ORDER BY Hops DESC, Carrier, TailNum
LIMIT 5

┌─Carrier─┬─TailNum─┬─Hops─┐
│ HA      │ N488HA  │   14 │
│ HA      │ N492HA  │   14 │
│ HA      │ N493HA  │   14 │
│ HA      │ N483HA  │   12 │
│ HA      │ N489HA  │   12 │
└─────────┴─────────┴──────┘

The lucky winner is actually a tie:  N488HA, N492HA, and N493HA. All aircraft belong to Hawaiian Airlines. US airplane buffs will not be surprised: Hawaiian Airlines runs short hop flights between islands, with distances as little as 50 miles between airports.  (For extra fun, you can look up aircraft tail numbers here to find out more about the plane in question.  N488HA is a Boeing 717 jet.) 

That was easy, but we are far from solving the entire problem. What does the winning itinerary actually look like?  At this point we’ll need to introduce arrays.  Let’s start with just a part of the problem by adding the time of departure for each hop. 

SELECT Carrier, TailNum,
  groupArray(DepTime) as Departures,
  length(groupArray(DepTime)) as Hops
FROM ontime
WHERE (FlightDate = toDate('2017-01-15')) AND (DepTime < ArrTime)
GROUP BY Carrier, TailNum
ORDER BY Hops DESC, Carrier, TailNum
LIMIT 5

┌─Carrier─┬─TailNum─┬─Departures───────────────────────────┬─Hops─┐
│ HA      │ N488HA  │ [1157,1813,1921,1042,658,544,170...] │   14 │
│ HA      │ N492HA  │ [718,824,938,1512,1622,1056,1743...] │   14 │
│ HA      │ N493HA  │ [627,728,1012,1647,1801,904,1523...] │   14 │
│ HA      │ N483HA  │ [845,725,1800,1904,1259,1407,113...] │   12 │
│ HA      │ N489HA  │ [2047,1936,1031,1602,1438,1256,9...] │   12 │
└─────────┴─────────┴──────────────────────────────────────┴──────┘

This query introduces groupArray(), which is an aggregate function.  It gathers up values within each GROUP BY subset and adds them to the array in the order that the rows are processed. If you call groupArray() on different columns, it will create an array for each one. Most importantly, the resulting arrays will always be the same size, and have the same order. In other words, groupArray() is the inverse of the ARRAY JOIN operator we studied in the previous blog article.  

There is more work to construct the flight paths. First, we only have departure times but no airports. Second, the departure times are not sorted.  Let’s fix that in the next example, which adds more arrays and sorts them in departure time order. 

SELECT Carrier, TailNum,
  arraySort(groupArray(DepTime)) as Departures,
  arraySort((x,y) -> y, groupArray(Origin), groupArray(DepTime)) as Origins,
  length(groupArray(DepTime)) as Hops
FROM ontime WHERE (FlightDate = toDate('2017-01-15')) AND (DepTime < ArrTime)
GROUP BY Carrier, TailNum
ORDER BY Hops DESC, Carrier, TailNum
LIMIT 5 \G

Row 1:
──────
Carrier:    HA
TailNum:    N488HA
Departures: [544,658,818,930,1042,1157,1322,1443,1556,1708,1813,1921,2025,2126]
Origins:    ['HNL','KOA','HNL','LIH','OGG','LIH','HNL','OGG','LIH','OGG','HNL','OGG','HNL','OGG']
Hops:       14
...

As the example shows we are leaning on a new and very versatile function: arraySort(). The previous query illustrates the different ways it can be used.  In the simplest form, arraySort() takes a single array argument and returns sorted values.  Here is an example. 

SELECT arraySort([1224, 1923, 1003, 745]) AS s

┌─s────────────────────┐
│ [745,1003,1224,1923] │
└──────────────────────┘

We can make things more interesting by adding a lambda expression that directs arraySort how to process each array value it encounters.  The lambda expression is always the first argument if present. The following example uses a lambda expression to reverse the sort order. The lambda is applied to each successive array value, and ClickHouse uses the resulting values as sorting keys for the values in the array.

SELECT arraySort(x -> (-x), [1224, 1923, 1003, 745]) AS s

┌─s────────────────────┐
│ [1923,1224,1003,745] │
└──────────────────────┘

Lambdas are not always necessary. You can also do this another way using the arrayReverseSort() function.  The following query returns exactly the same answer. 

SELECT arrayReverseSort([1224, 1923, 1003, 745]) AS s

Finally, we can add a different key entirely.  The following example has two arrays and uses a lambda expression to pick the second one as the sorting key. ClickHouse will sort the first array using the values of the second array as keys. 

SELECT 
  arraySort((x, y) -> y, [1224, 1923, 1003, 745], [2, 1, 3, 4]) AS s

┌─s────────────────────┐
│ [1923,1224,1003,745] │
└──────────────────────┘

In summary, groupArray() builds arrays from grouped values, and sortArray() sorts the values, using one array as the sorting keys for the other arrays. Now all that remains is to construct the full query.  We’ll use ARRAY JOIN to unroll our result in a more readable format. 

SELECT Carrier, TailNum, Depart, Arr, Origin, Dest
FROM
(
  SELECT Carrier, TailNum, 
    arraySort(groupArray(DepTime)) AS Departures,
    arraySort(groupArray(ArrTime)) AS Arrivals,
    arraySort((x, y) -> y, groupArray(Origin), groupArray(ArrTime)) AS Origins,
    arraySort((x, y) -> y, groupArray(Dest), groupArray(ArrTime)) AS Destinations,
    length(groupArray(DepTime)) AS Hops
  FROM ontime
  WHERE (FlightDate = toDate('2017-01-15')) AND (DepTime < ArrTime)
  GROUP BY Carrier, TailNum
  ORDER BY Hops DESC, Carrier, TailNum
)
ARRAY JOIN Departures AS Depart, Arrivals AS Arr,
    Origins AS Origin, Destinations AS Dest
LIMIT 15

┌─Carrier─┬─TailNum─┬─Depart─┬──Arr─┬─Origin─┬─Dest─┐
│ HA      │ N488HA  │    544 │  632 │ HNL    │ KOA  │
│ HA      │ N488HA  │    658 │  749 │ KOA    │ HNL  │
│ HA      │ N488HA  │    818 │  853 │ HNL    │ LIH  │
│ HA      │ N488HA  │    930 │ 1020 │ LIH    │ OGG  │
│ HA      │ N488HA  │   1042 │ 1126 │ OGG    │ LIH  │
│ HA      │ N488HA  │   1157 │ 1229 │ LIH    │ HNL  │
│ HA      │ N488HA  │   1322 │ 1406 │ HNL    │ OGG  │
│ HA      │ N488HA  │   1443 │ 1535 │ OGG    │ LIH  │
│ HA      │ N488HA  │   1556 │ 1646 │ LIH    │ OGG  │
│ HA      │ N488HA  │   1708 │ 1742 │ OGG    │ HNL  │
│ HA      │ N488HA  │   1813 │ 1857 │ HNL    │ OGG  │
│ HA      │ N488HA  │   1921 │ 1958 │ OGG    │ HNL  │
│ HA      │ N488HA  │   2025 │ 2107 │ HNL    │ OGG  │
│ HA      │ N488HA  │   2126 │ 2207 │ OGG    │ HNL  │
│ HA      │ N492HA  │    501 │  542 │ HNL    │ LIH  │
└─────────┴─────────┴────────┴──────┴────────┴──────┘

It’s nice to see that origin and departure airports line up properly for our N488HA aircraft.  This gives us confidence that the result is correct. 

Creating funnels with arrayFilter

As mentioned in the introduction, funnel analysis is an important technique to track progress toward goals in time series data.  We can use funnels to answer questions like the following:

  1. How many website customers put something in their shopping cart and then buy it in the same session?
  2. How do marketing leads progress from contacts to enrolled customers?  
  3. How many aircraft fly to Chicago and then to Atlanta during the same day?

Since we have flight data handy we will use it to build a funnel that answers the last question. To add interest we will aggregate results by year over the entire dataset. Reading more data makes it easier to evaluate performance.

As with the previous example of tracking sequences we will start with a simpler query and add features until we have a solution.  Our initial query generates a list of daily destinations of each aircraft ordered by time.  To keep things quick we’ll just look at data for a single day.

SELECT FlightDate, Carrier, TailNum,
    arraySort(groupArray(ArrTime)) AS Arrivals,
    arraySort((x, y) -> y, groupArray(Dest), groupArray(ArrTime)) AS Dests
  FROM ontime
  WHERE DepTime < ArrTime AND TailNum != ''
    AND toYYYYMM(FlightDate) = toYYYYMM(toDate('2017-01-01'))
  GROUP BY FlightDate, Carrier, TailNum
  ORDER BY FlightDate, Carrier, TailNum
LIMIT 5\G

Row 1:
──────
FlightDate: 2017-01-01
Carrier:    AA
TailNum:    N001AA
Arrivals:   [1021,2213]
Dests:      ['MIA','ATL']
. . . 

Our initial query gives us properly ordered lists, but we now need to look for flights that pass through Chicago.  We’ll restrict the search to codes for O’Hare (‘ORD’) and Midway (‘MDW’), which are the main airports that serve the city of Chicago.  To check results more quickly, we will add a HAVING clause so that we only look at routes that include a stop in Chicago. 

SELECT FlightDate, Carrier, TailNum,
    arraySort(groupArray(ArrTime)) AS Arrivals,
    arraySort((x, y) -> y, groupArray(Dest), groupArray(ArrTime)) AS Dests,
    arrayFilter((arr, dest) -> (dest in ('MDW', 'ORD')), Arrivals, Dests)[1] as CHI_Arr,
    length(Arrivals) as Hops
  FROM ontime
  WHERE DepTime < ArrTime AND TailNum != ''
    AND toYYYYMM(FlightDate) = toYYYYMM(toDate('2017-01-01'))
  GROUP BY FlightDate, Carrier, TailNum
  HAVING CHI_Arr > 0
  ORDER BY FlightDate, Carrier, TailNum
LIMIT 5 \G

Row 1:
──────
FlightDate: 2017-01-01
Carrier:    AA
TailNum:    N005AA
Arrivals:   [1238,1757,2158]
Dests:      ['JAC','ORD','ATL']
CHI_Arr:    1757
Hops:       3
. . . 

Our sample query depends on a new function: arrayFilter(), which takes two or more arguments.  The first is a lambda expression, and the remaining arguments are arrays. arrayFilter() returns an array of elements from the first array argument where the lambda expression has a non-zero result.  Here’s how to return string array elements that are longer than one character. 

WITH ['a', 'bc', 'def', 'g'] AS array
SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered

┌─filtered─────┐
│ ['bc','def'] │
└──────────────┘

Here is another example that processes paired arrays to return the first value corresponding to tag ‘name’.  There could be multiple such tags but we use the [1] index to return the first one in the resulting array. 

WITH 
    ['name', 'owner', 'account'] AS tags,
    ['joe', 'susie', 'website'] AS values
SELECT arrayFilter((v, t) -> (t = 'name'), values, tags)[1] AS name

┌─name─┐
│ joe  │
└──────┘

We use exactly the preceding approach to find the arrival time that corresponds to a Chicago airport in the example query above. From this point onward, it is a simple matter to add another arrayFilter function to find flights that land in Chicago and have a later arrival time at Atlanta (ATL). 

A query to build a funnel showing flights for each year looks like the following. 

SELECT
  toYear(FlightDate) AS year,
  count(*) AS Total,
  sum(CHI_Arr != 0) AS CHI_tot,
  sum(ATL_Arr != 0) AS ATL_tot
FROM (
  SELECT FlightDate, Carrier, TailNum,
    arraySort(groupArray(ArrTime)) AS Arrivals,
    arraySort((x, y) -> y, groupArray(Dest), groupArray(ArrTime)) AS Dests,
    arrayFilter((arr, dest) -> (dest in ('MDW', 'ORD')), Arrivals, Dests)[1] as CHI_Arr,
    arrayFilter((arr, dest) -> (dest = 'ATL' AND arr > CHI_Arr And CHI_Arr != 0), Arrivals, Dests)[1] as ATL_Arr
  FROM ontime
  WHERE DepTime < ArrTime AND TailNum != ''
  GROUP BY FlightDate, Carrier, TailNum
  ORDER BY FlightDate, Carrier, TailNum
) GROUP BY year ORDER By year

┌─year─┬───Total─┬─CHI_tot─┬─ATL_tot─┐
│ 1995 │ 1079754 │  216539 │    7738 │
│ 1996 │ 1086087 │  217737 │    8911 │
│ 1997 │ 1101177 │  218928 │   10171 │
│ 1998 │ 1098116 │  217532 │    9797 │
│ 1999 │ 1142869 │  220902 │   10396 │
│ 2000 │ 1183406 │  219758 │    9279 │
│ 2001 │ 1257365 │  224794 │    8747 │
│ 2002 │ 1205584 │  226225 │    8970 │
│ 2003 │ 1316343 │  244669 │    9418 │
. . . 
23 rows in set. Elapsed: 11.463 sec. Processed 173.82 million rows, 6.87 GB (15.16 million rows/s., 599.05 MB/s.) 

The full query gets us results but seems slow. It also eats a lot of memory. To get it to run you may need to raise the available memory for the query. The following SET command allows up to 16B bytes of RAM to be allocated. 

SET max_memory_usage = 16000000000

What this shows is that arrays can be used to build funnels but may not be especially performant in every case. Fortunately, funnel analysis is prominent in many analytic use cases, so ClickHouse provides specialized functions that address it more directly.  Let’s have a look. 

Funnel analysis using windowFunnel()

The windowFunnel() function iterates over a set of timestamped rows looking for matches on successive conditions.  In the following example windowFunnel has 3 types of arguments. The first is the “window” or range of seconds to look for condition matches. The next argument is a time value to order the rows. The remaining values are conditions that successive events should match. The function returns how many conditions actually matched.

SELECT windowFunnel(260000)(d, w='Cloudy', w='Sunny') AS steps
FROM
(
  SELECT toDate('2020-01-02') AS d, 'Cloudy' AS w
  UNION ALL
  SELECT toDate('2020-01-03') AS d, 'Rainy' AS w
  UNION ALL
  SELECT toDate('2020-01-04') AS d, 'Sunny' AS w
)

┌─steps─┐
│     2 │
└───────┘

Just in case the preceding code is not obvious, here’s a translation to English. Within a span of three days look for a cloudy day followed by a sunny day.  Return how many successive conditions matched. With this data there were two matches, or “steps,” through the sequence.

It might seem a little obscure at first, but windowFunnel() is a powerful function. It can exactly duplicate what we just did with arrays.  Here’s proof. The query output is identical. 

SELECT
    toYear(FlightDate) AS year, count(*) AS Total,
    countIf(funnel_level > 0) AS CHI_tot,
    countIf(funnel_level = 2) AS ATL_tot
FROM
(
    SELECT FlightDate, Carrier, TailNum,
        windowFunnel(86400)(toUInt32(ArrTime), Dest IN ('MDW', 'ORD'), Dest = 'ATL') AS funnel_level
    FROM ontime
    WHERE DepTime < ArrTime AND TailNum != ''
    GROUP BY FlightDate, Carrier, TailNum
    ORDER BY FlightDate ASC, Carrier ASC, TailNum ASC
)
GROUP BY year ORDER BY year ASC

┌─year─┬───Total─┬─CHI_tot─┬─ATL_tot─┐
│ 1995 │ 1079754 │  216539 │    7738 │
│ 1996 │ 1086087 │  217737 │    8911 │
│ 1997 │ 1101177 │  218928 │   10171 │
│ 1998 │ 1098116 │  217532 │    9797 │
│ 1999 │ 1142869 │  220902 │   10396 │
│ 2000 │ 1183406 │  219758 │    9279 │
│ 2001 │ 1257365 │  224794 │    8747 │
│ 2002 │ 1205584 │  226225 │    8970 │
│ 2003 │ 1316343 │  244669 │    9418 │
. . .
23 rows in set. Elapsed: 5.526 sec. Processed 173.82 million rows, 6.87 GB (31.45 million rows/s., 1.24 GB/s.) 

Aggregation with windowFunnel() is more efficient than arrays in this case. The query is simpler, uses less memory (about 8GB of RAM vs 13GB for the array approach), and runs in about half the time.  Unless we add conditions to the problem it looks like a good solution. 

By contrast, arrays are well suited for problems where you need to perform calculations across elements or visualize graphs, as we did when showing airline flight paths. For cases like finding the maximum number of hops or building funnels, ClickHouse offers alternatives. Try them all and pick the solution that works best for you. 

Conclusion of part 2…

In Part 2 of our series on ClickHouse arrays we showed how arrays connect neatly with GROUP BY and introduced additional array functions.  With these we can visualize sequences and even build funnels. To keep things balanced we also showed a couple examples of alternatives to arrays. 

We will finish our series in the next blog article, where we will delve further into the functional programming model used in array functions. We will finish off with an example of linear interpolation, which arrays are well-suited to solve. We hope you enjoy these articles and look forward to finishing the series. Stay tuned!

Note: Special thanks to my colleague Dmitry Titov for the nice windowFunnel() example.

Share

2 Comments

  1. Hey guys, great article. In the window funnel example, what is the granularity of the FlightDate statement? Is it by second, minute, hour, day, etc? It seems that different granularities effects the result.

Comments are closed.