Blog

Math Games in ClickHouse — Solving Project Euler Problems

ClickHouse is an extremely versatile DBMS. When presenting it to newbies, I sometimes describe it as a combination of four “F”-s: Fast, Flexible, Fun and Free. We often benchmark ClickHouse in different scenarios to demonstrate how fast it is. Case Studies and specific problem cases prove flexibility. Finally, thanks to the Apache 2.0 license it is free to use and modify. But what about fun? In this short article, I will show you a few fun things about ClickHouse that make it look… well, very different from a traditional DBMS.

Prime number generators

Many software developers participate in programming contests before they start to work hard, and do not have enough free time anymore. If you still dream about that time, you may do some practice at Project Euler. It contains a lot of good math and programming exercises. Many of them involve prime numbers. Is it possible to do it in SQL? Of course, if it is ClickHouse.

For example, the query below will print the 1000th prime number:

SELECT number FROM numbers(10000) 
 WHERE number>1 and not arrayExists(n -> number % n = 0, range(2, number))
LIMIT 999,1

What is going on here? We used arrayExists() function that checked if candidate numbers could be divided by any one below it. It is an example of a higher-order function that takes a function as a parameter. We will be using higher-order functions a lot in this article, so it is important to understand how it works:

  • n -> number % n = 0 is a lambda expression or a function that takes argument ‘n’ (the name is arbitrary here) and returns ‘number % n = 0’. 
  • range(2, number) is the source array of numbers in range [2, number-1]
  • arrayExists() returns true if there is at least one element of an array for which lambda expression returns true – in our case if the number has a non-trivial factor.

The query completes in 600ms, not super fast. Let’s make a little bit of optimization. When checking for factors, we actually do not need to test all numbers below the candidate number, but only those that are below the square root of a candidate number.

SELECT number FROM numbers(10000) 
 WHERE number>1 and not arrayExists(n -> number % n = 0, range(2, toUInt16(sqrt(number))+1))
LIMIT 999,1

This is much faster with only 10ms! Let’s try to apply it to the problem: 

Euler Problem #7: Find the 10001st prime number

After adjusting the upper limit we get it:

SELECT number FROM numbers(1000000) 
 WHERE number>1 and not arrayExists(n -> number % n = 0, range(2, toUInt32(sqrt(number))+1))
LIMIT 10000,1

┌─number─┐
│ 104743 │
└────────┘

It takes 335 ms!

As usual with ClickHouse, the first thing I ask myself: can we improve it even more? Let’s note that when testing for factors, we can just use only prime factors, which are of course fewer. So, one of the standard optimizations is to test the next candidate number against previously found primes. That requires us to store the state somewhere. Since late 2023, there is an arrayFold() function that does the job. This one is more complex than other array functions. The lambda expression is used in order to modify the state that is carried over between iterations. Let’s look at this example:

WITH arrayFold( (primes, n) -> 
        if(arrayExists(p -> n % p = 0, primes), primes, arrayConcat(primes, [n])), /* lambda that modifies the state */
        range(2,1000000), /* iterator */
        emptyArrayUInt32() /* initial state */
        ) as primes
SELECT primes[10001]

It starts with the empty array, and adds prime numbers when it finds those. The same array is used to probe candidate numbers.

Unfortunately, it is extremely slow! Why? This is optimal from the linear algorithmic perspective, but it does not take into account parallelization. In the previous queries, ClickHouse could split numbers() into multiple buckets and execute those in parallel using multiple cores. By contrast, arrayFold() has to be executed sequentially. So for the 101st prime number it takes 450ms. For the 10001st it works forever. 

Ok, now let’s try to use a prime number generator to solve some other Project Euler problems.

Euler Problem #3: What is the largest prime factor of the number 600 851 475 143?

WITH 600851475143 as test_number,
     primes as (SELECT number as prime FROM numbers(toUInt32(sqrt(test_number))) 
      WHERE number>1 and not arrayExists(n -> number % n = 0, range(2, toUInt16(sqrt(sqrt(test_number)))+1))
      order by 1 desc)
SELECT * FROM primes
WHERE test_number % prime == 0
LIMIT 1

┌─prime─┐
│  6857 │
└───────┘

We got 6857 in 3 seconds, which is the correct answer! 

Few notes

  1. All examples were run on 32 vCPU Altinity.Cloud demo server, so timings may be different for your environment.
  2. Use the latest ClickHouse 24.3 version. In earlier versions, for example, ClickHouse could not unfold constants inside CTE, so you will need to replace sqrt(test_number) with sqrt(600851475143).

Fibonacci sequence

We have tried arrayFold() for prime numbers, and it was not very efficient. What about the Fibonacci sequence? It seems to be a much better case since the Fibonacci sequence has a natural recursive pattern. Each new term in the Fibonacci sequence is generated by summation of the previous two terms. There is a very elegant way to do it in ClickHouse:

SELECT arrayFold( acc,x -> arrayConcat([acc[1]+acc[2]], acc), range(10), [1::UInt64])

That will return a reversed Fibonacci sequence:

[89,55,34,21,13,8,5,3,2,1,1]

Now let’s look if we can apply it to Euler Problem #2:

Problem #2: By considering the terms in the Fibonacci sequence whose values do not exceed four million,
find the sum of the even-valued terms. 

WITH 4000000 as max_number
SELECT 
      arraySum(
        arrayFilter(x -> x<max_number and x % 2 == 0,
         arrayFold( acc,x -> arrayConcat([acc[1]+acc[2]], acc), range(1000), [1::UInt64])
      )) as euler_2

┌─euler_2─┐
│ 4613732 │
└─────────┘

If you forget about the SELECT statement, it looks like a beautiful functional program, not an SQL at all! It completed very fast with a correct answer. It did a lot of extra work, though, by manipulating long arrays. We will explain the array size a bit later. Now let’s note that we do need to store all Fibonacci numbers, we may only store the last two numbers and a sum of even-valued ones! We can do it as a tuple of 3 integers: first two values will store Fibonacci, and the last one accumulate the sum:

WITH 4000000 as max_number
SELECT arrayFold( acc,x -> 
       (acc.1+acc.2,
        acc.1, 
        acc.3 + if(acc.1 <= max_number and acc.1 % 2 == 0, acc.1, 0)), 
       range(1000), (1::UInt64, 1::UInt64, 0::UInt64)) as euler_2

┌─euler_2───────────────────────────────────────────┐
│ (9897335391534825784,9079565065540428013,4613732) │
└───────────────────────────────────────────────────┘

The last number in the tuple 4613732 is the correct answer to the problem #2.

Now let’s notice that we have calculated a lot of extra numbers in the sequence; 9897335391534825784 is way above 4 million. We used 1000 first Fibonacci numbers, but it is too much. Let’s remember some math. In fact, if F(n) is the n-th number in the Fibonacci sequence, then F(n+1)/F(n) is approximately 1.6. So the sequence will reach 4 million in logarithm of 4 000 000 by base 1.6 steps, which is approximately 32:

WITH 4000000 as max_number
SELECT arrayFold( acc,x -> 
       (acc.1+acc.2, acc.1, acc.3 + if(acc.1 <= max_number and acc.1 % 2 == 0, acc.1, 0)), 
       range(toUInt32(ln(max_number)/ln(1.6))), (1::UInt64, 1::UInt64, 0::UInt64)) as euler_2

That gives a correct result instantly, without unneeded overhead!

┌─euler_2───────────────────┐
│ (5702887,3524578,4613732) │
└───────────────────────────┘

With this efficient algorithm, we can calculate much bigger sums by increasing max_number value. For example, let’s take four billion and get:

┌─euler_2────────────────────────────┐
│ (7778742049,4807526976,1485607536) │
└────────────────────────────────────┘

Or even four trillion:

┌─euler_2─────────────────────────────────────┐
│ (6557470319842,4052739537881,2026369768940) │
└─────────────────────────────────────────────┘

In both cases, results are almost instant.

Lychrel numbers

A Lychrel number is a natural number that cannot form a palindrome through the iterative process of repeatedly reversing its digits and adding the resulting digits. For example, 47 is not a Lychrel number because 47 + 74 = 121. Some numbers require more iterations in order to produce a palindrome. For example, 349 takes 3 iterations:

349 + 943 = 1292
1292 + 2921 = 4213
4213 + 3124 = 7337

For Lychrel numbers the process above will never end with a palindrome. The Euler problem #55 asks: 

Euler Problem #55: How many Lychrel numbers are below 10 000? 

Since this is an example of a never ending process, there is also a hint that if in 50 iterations a palindrome was not found, then we can stop. This is not true in general, but true for numbers below 10 000. Let’s try it with ClickHouse.

This time we will first create a function to reverse the number digits:

CREATE OR REPLACE FUNCTION reverseUInt AS 
  (x) -> toUInt256(reverse(toString(x)))

We used a UInt256 data type that can hold up to 77 digits. Next, let’s try defining a recursive function that checks if a number is the Lychrel one or not:

CREATE OR REPLACE FUNCTION isLychrel AS 
  (x, depth) -> multiIf(depth == 0, 1,
                        reverseUInt(x) == x, 0, 
                        isLychrel(x + reverseUInt(x), depth-1))

Unfortunately, ClickHouse does not allow recursive functions, and throws an error. We need to find a workaround. arrayFold() is coming to the rescue again:

CREATE OR REPLACE FUNCTION isLychrel AS 
  (x) -> arrayFold( x, i -> 
       if(x == reverseUInt(x), 0, x + reverseUInt(x)),
       range(51), x)>0

If it finds the palindrome, it sets the accumulator to 0, and the remaining iterations are all zeros. For Lychrel numbers iterations continue, and we end up with a positive number. For example, here are first ten Lychrel numbers:

SELECT number FROM numbers(10000) WHERE isLychrel(number::UInt256)

┌─number─┐
│    196 │
│    295 │
│    394 │
│    493 │
│    592 │
│    689 │
│    691 │
│    788 │
│    790 │
│    879 │
└────────┘

With this function, we can now solve the problem #55 with a very simple query:

SELECT count() FROM numbers(10000) WHERE isLychrel(number::UInt256)

┌─count()─┐
│     246 │
└─────────┘

Conclusion

Array functions are a very powerful programming tool. With the recent introduction of arrayFold() it is even possible to implement recursive algorithms. That allows us to write functional-style programs in ClickHouse and solve simple algorithmic and math problems from Project Euler. The code is pretty compact. It is not as compact as Scala or Haskell, but it is much more fun to do this in ClickHouse. In fact, I often use ClickHouse as the only tool for query, math and even content generation from the data. Stay tuned to learn more tricks!

Share

Related: