Into the Deep End: Learning SQL by Taking On Project Euler Puzzles (Part 1)

Into the Deep End: Learning SQL by Taking On Project Euler Puzzles (Part 1)

Young man, in mathematics you don’t understand things. You just get used to them.

John von Neumann
in The dancing Wu Li Masters by Gary Zukav

Mathematics teaches us that the intuition of a new idea is easier to grasp after you’ve used it for a while and seen it tie together with everything else. It’s pointless to flail with analogies and book learning without getting your hands dirty, so to speak. So when tasked with learning SQL in ClickHouse and solving a few problems from the Project Euler archives, I got a brief overview of SQL syntax and went straight to work on the problems. I wanted to learn SQL by jumping into the deep end with a project, starting first with an application and letting the understanding happen along the way. Here are the details of how I did it and what I learned from each one; I hope it makes writing SQL in ClickHouse feel more accessible and fun for programmers of all levels and persuasions!

I worked on a MacBook Pro with 2 GHz Quad-Core Intel Core i5 processor with hyper-threading enabled and 16GB memory. I used ClickHouse version to run my SQL queries on my computer and edited my queries in DBeaver. For each of my algorithms I give execution time as the average time of five trials. This article is inspired by a similar one by Alexander Zaitsev at Altinity: “Math Games in ClickHouse®– Solving Project Euler Problems”.

Problem 8: Largest Product in a Series

My first challenge was Problem 8, which reads as follows: “Find the thirteen adjacent digits in the 1000-digit number [shown below] that have the greatest product. What is the value of this product?” For example, we can consider the first two sets of 13 adjacent digits in the big number: {7,3,1,6,7,1,7,6,5,3,1,3,3} and {3,1,6,7,1,7,6,5,3,1,3,3,0}. The elements of the first set multiply out to 5,000,940, and the elements of the second set multiply out to 0. Clearly, the former has the greatest product, but the object of the problem is to repeat this for all other sets of 13 adjacent digits.  

Algorithm 1: splitting the whole number into 13-grams 

This was my first real project in ClickHouse; rated at 5% difficulty on Project Euler, it was a warm-up for the tougher puzzles ahead. After picking the problem, I read through all of the official documentation to find the tools I could put in my pockets. That was how I found the nifty ngrams() function, which takes in a string and returns an array of all substrings of length n. Here’s a simple example with n = 2.

SELECT ngrams('abcd', 2)

┌─ngrams('abcd', 2)─┐
│ ['ab','bc','cd']  │

There’s nothing like ngrams() to use on an integer, which is why I had to enter big_num as a string, split that up into 13-grams, and turn each element of the 13-gram back into an integer before I could start multiplying. At this point, I also learned about arrayMap() and arrayFilter() functions, which work well here with ngrams(). In this problem and the others that I worked on, arrays and array functions were essential to my queries. arrayMap() and arrayFilter() are similar functions; both take in a lambda function func and an array arr to work over. Then arrayMap() applies func to each element of arr and returns this new array, while arrayFilter() returns an array containing only those elements x in arr such that func(x) neq 0.

This simple query demonstrates how they work on range(1,10), which is the array of integers 1 through 9.

SELECT arrayFilter(x -> x > 5, range(1,10)) AS arr_filt

│ [6,7,8,9] │

SELECT arrayMap(x -> x * 2, range(1,10)) AS arr_map

│ [2,4,6,8,10,12,14,16,18] │

In the above example, the arrayFilter() function keeps only those numbers x in the array range(1,10) where x > 5. The arrayMap() function takes all numbers x in the array and multiplies them by 2. With all three functions in hand, I was ready to write my first query solving Problem 8.

WITH   '7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450' AS big_num,
    ngrams(big_num, 13) AS big_num_as_13grams,
    length(big_num_as_13grams) AS num_of_13grams,
    arrayMap(a -> ngrams(a, 1), big_num_as_13grams) AS `13grams_to_1grams`,
    arrayMap(x -> arrayMap(y -> toUInt256(y), x), `13grams_to_1grams`) AS str_arr_to_int_arr,
    arrayMap(w -> arrayProduct(w), str_arr_to_int_arr) AS arr_product
    arrayMax(arr_product) AS max_arr_product,
    argMaxArray(str_arr_to_int_arr, arr_product) AS array_w_max_product

The result is, which is the same as the Project Euler solution:

│     23514624000 │ [5,5,7,6,6,8,9,6,6,4,8,9,5] │

How Algorithm 1 works

The query may look complicated, but it’s easy to understand if we break it up into small pieces.

  • I put most of the lines in the WITH clause to print a shorter result. This was just a formatting choice and had no effect on computational efficiency.
  • ngrams() breaks the string big_num into all possible adjacent groups of 13, and puts these in an array of strings named big_num_as_13grams.
  • To compare with the next algorithm, count the number of ngrams in this array – the result is 988.
  • In big_num_as_13grams, for each string of 13 digits in the array, turn that string into an array of 13 single digits. Now we have 13grams_to_1grams, which is an array of arrays of strings.
  • For each array in 13grams_to_1grams, turn each string (a single digit) into an integer.
  • For each array of integers in 13grams_to_1grams, take the product of the 13 elements.
  • The maximum of these products comes out to 23,514,624,000.
  • The final step is not necessary for the problem, but it shows us which array of integers has the max product. It turns out to be [5,5,7,6,6,8,9,6,6,4,8,9,5]. That’s nice for checking the math.

ClickHouse solved this in 18.8 milliseconds and used 4.20 MiB of memory. Not bad, but since this isn’t too complex of a problem, I figured this could be reduced.

Algorithm 2: removing all zeroes

The big number has lots of zeros. When considering the maximum product of adjacent digits, any 13-gram containing a zero will automatically be unnecessary. I improved the query by splitting up the big number into an array of numbers that don’t contain 0’s, which removes many 13-grams from later calculations in the query.

WITH   '7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450' AS big_num,
    splitByChar('0', big_num) AS split_big_num,
    arrayFilter(x -> (length(x) >= 13), split_big_num) AS filtered_arr,
    arrayMap(x -> ngrams(x, 13), filtered_arr) AS arr_of_arr_of_13grams,
    arraySum(x -> length(x), arr_of_arr_of_13grams) AS num_of_13grams,
    arrayMap(x -> arrayMap(y -> arrayProduct(z -> toUInt8(z), ngrams(y, 1)), x), arr_of_arr_of_13grams) AS arr_of_arr_product,
    arrayMap(x -> arrayMax(x), arr_of_arr_product) AS arr_of_arr_max
SELECT arrayMax(arr_of_arr_max) AS final_arr_max

The result is:

│   23514624000 │

How Algorithm 2 works

This algorithm is similar to Algorithm 1, with a few differences that make it run faster. I found it a little tricky to follow which “level” of the array I was in. Let’s look at what each line does while carefully tracking the data types.

  • Again starting with the big number as a string, we cut up that string anywhere that it contains a zero. Then we get an array of strings called split_big_num, where none of these strings contains a zero. 
  • Of all of the strings in the array split_big_num, we remove the ones that have length <13, and name this new array of strings filtered_arr.
  • For each string in filtered_arr, we use the ngrams() function that gives us an array of all 13-grams out of that string. We have an array of arrays of strings named arr_of_arr_of_13grams
  • Count the number of 13-grams in each array of 13-grams, then sum these to get the total number of 13-grams – the result is 263, which is much fewer than in Algorithm 1. 
  • Take each 13-gram and use the ngrams() function to turn it into an array of 1-grams, one digit each. Turn these 13 1-grams into integers, then take their product. We have an array of arrays of Float64 named arr_of_arr_product.
  • For each of the arrays of Float64 (each float is a product), we take the maximum value. We have an array of Float64 named arr_of_arr_max
  • Finally, we take the max of arr_of_arr_max, that is, the maximum of those previously calculated products. Our final answer is again 23,514,624,000.

ClickHouse computed this in 6.4 milliseconds and used 3.99 MiB of memory, so the optimizations noticeably decreased the execution time. Nice! That one wasn’t too hard, which gave me a pep in my step as I started the next problem. 

Problem 100: Arranged Probability

I liked playing with array functions and found another application in Euler Problem 100. This problem is harder than the first, rated 30% difficulty; on the plus side, we get to play with colors. The problem statement reads: “If a box contains twenty-one coloured discs, composed of fifteen blue discs and six red discs, and two discs were taken at random, it can be seen that the probability of taking two blue discs, P(BB) = \frac{15}{21} \cdot \frac{14}{20} = 12. The next such arrangement, for which there is exactly 50% chance of taking two blue discs at random, is a box containing eighty-five blue discs and thirty-five red discs. By finding the first arrangement to contain over 1,000,000,000,000 = 10^{12} discs in total, determine the number of blue discs that the box would contain.”

Algorithm 1: solution on a smaller scale

There’s a great saying that weeks of coding can save you hours of planning. In an effort to avoid this trap, I tried to optimize this problem by first considering the math. Let b be the number of blue discs in the pot before drawing and let r be the number of red discs in the pot. The desired b and r satisfy \frac{b}{b + r} \cdot \frac{b - 1}{b + r - 1} = \frac{1}{2}. For the very large b and r that I’m looking for in this problem, note that \frac{b}{b + r} and \frac{b - 1}{b + r - 1} will be incredibly close together. Therefore, as b and r increase, \frac{b}{b + r} and \frac{b - 1}{b + r - 1} both get closer and closer to \frac{1}{\sqrt{2}}. Now, this told me \frac{b}{b + r} \approx \frac{1}{\sqrt{2}} \Leftrightarrow \sqrt{2}b \approx b + r \Leftrightarrow r \approx (\sqrt{2} - 1)b, which gave a better picture of what kind of r I needed for a given b. 

To illustrate this, I wrote an elementary and inefficient query, just to see what the first few answers look like, and if my mathematical assumptions would be useful:

┌─────b─┬─r──────┬─b/(b+r)──────────────┬────────────1/sqrt2─┬──────(sqrt2 - 1)*b─┐
│     3 │ [1]    │ [0.75]               │ 0.7071067811865475 │ 1.2426406871192854 │
│    15 │ [6]    │ [0.7142857142857143] │ 0.7071067811865475 │  6.213203435596427 │
│    85 │ [35]   │ [0.7083333333333334] │ 0.7071067811865475 │  35.20815280171309 │
│   493 │ [204]  │ [0.7073170731707317] │ 0.7071067811865475 │ 204.20728624993592 │
│  2871 │ [1189] │ [0.7071428571428572] │ 0.7071067811865475 │ 1189.2071375731562 │
│ 16731 │ [6930] │ [0.7071129707112971] │ 0.7071067811865475 │  6930.207112064255 │

In this table we see that for the increasing pairs (b, r), we get \frac{b}{b + r} very close to \frac{1}{\sqrt{2}}, and r gets very close to (\sqrt{2} - 1)b. I suspected that this trend would continue for larger solutions. Since this was an approximation, I needed to leave a small range in which to find r. For this algorithm, I defined this range as range_for_num_red = [(\sqrt{2} - 1)b, (\sqrt{2} - 1)b), where the lower end of this range rounds \sqrt{2} down at the 32,000th decimal point, and the upper end of the range rounds \sqrt{2} up at the 32,000th decimal point. 32,000 is as high a parameter as the floor() and ceil() functions will take, so I made this window as tiny as possible to optimize my calculations. 

Here’s another mathematical optimization that I found: if I have r \approx (\sqrt{2} - 1)b, and I want only those (b, r) where b + r \geq 10^{12}, then b + (\sqrt{2} - 1)b \gtrapprox 10^{12} \Leftrightarrow \sqrt{2}b \gtrapprox 10^{12} \Leftrightarrow b \gtrapprox \frac{10^{12}}{\sqrt{2}}. This gave me an approximate minimum value for b, but I still checked values slightly below this.

This algorithm relies on the numbers(a, b) function, which returns a table with a number column listing all integers from a through a + b – 1, where a and b are integers. A major obstacle to solving this problem was that the numbers() function takes a lot of time and storage to read 10^{12} rows. Since I didn’t know specifically where to find the target pair (b, r), I tried to run numbers() over the range from min_blue to min_blue + 10^{12}. After 6 minutes of ClickHouse running the query, I cut it off – maybe it would have completed eventually, but not within a desirable time period. However, when I replaced 10^{12} with 10^9 in the code, it completed, as shown below.

    sqrt(2) AS sqrt_2,
    toUInt64(floor(1e9 / sqrt_2)) - 1e4 AS min_blue,
    floor(sqrt_2, 32000) - 1 AS round_down,
    ceil(sqrt_2, 32000) - 1 AS round_up
    arrayFilter(r -> r + b >= 1e9, range_for_num_red) AS filt_by_sum,
    arrayMap(r -> toDecimal64((b / (b + r)) * ((b - 1) / (b + r - 1)), 18), filt_by_sum) AS arr_map_to_product,
    arrayFilter(x -> (abs(x - 0.5) < toDecimal64(1e-18, 18)), arr_map_to_product) AS arr_filt_by_product
        number AS b,
        range(toUInt64(floor(round_down * b)), toUInt64(ceil(round_up * b))) AS range_for_num_red
    FROM numbers(min_blue, 1e10)
) AS range_for_num_red_subquery
WHERE notEmpty(arr_filt_by_product)

The result is:

│ 710477454 │ [294289397]       │ [294289397] │ [0.5]              │ [0.5]               │

How Algorithm 1 works

As I mentioned before, this query doesn’t solve the actual Problem 100, but it does solve the same problem for a pot containing \geq 10^9 discs total. Algorithm 1 was a step in the right direction that I knew wouldn’t get me all the way there. I’ll explain my thoughts behind this query.

  • The WITH clause contains Common Table Expressions (CTE’s) for numbers that are complex to calculate. By defining sqrt_2, round_down, and round_up as CTE’s, ClickHouse only has to calculate these numbers once and call on them later, as opposed to calculating them again for each b value.
  • In the subquery range_for_num_red_subquery, we consider values of b in the range [\frac{10^9}{\sqrt{2}} - 10^4, 10^{10}). For each b, define range_for_num_red as values of r \approx (\sqrt{2} - 1)b
  • In the main query, for each  b, use the arrayFilter() function on range_for_num_red to keep only those r values satisfying r + b \geq 10^9.
  • Apply the arrayMap() function to send each r value in filt_by_sum to the product \frac{b}{b + r} \cdot \frac{b - 1}{b + r - 1}, which is the probability of drawing two blue in a pot containing b + r discs. I found that turning this into a decimal instead of the default float makes the next step more accurate, so I turn this product into a Decimal64 with accuracy up to the 18th decimal point, which is the maximum accuracy for Decimal64.
  • Apply arrayFilter() again to filter out anything from arr_map_to_product that isn’t equal to 0.5. ClickHouse easily confuses something like 0.49999999917781 for exactly 0.5, so we have to be very careful here. We require that the absolute difference between a probability in arr_map_to_product and 0.5 is less than 10^{-18} to consider them equal.
  • Most b candidates in the range [\frac{10^9}{\sqrt{2}} - 10^4, 10^{10}) don’t have an r value satisfying both these conditions, so we get a lot of b values that give us empty arrays for arr_filt_by_product. Use WHERE and LIMIT clauses to show only the first (b, r) pairing that solves the problem. 

This takes 27.21 MiB of memory and 0.103 seconds to run when solving the problem up to 10^9. This query surely had room for improvement, but I could see that numbers() would hold the whole query back from reaching 10^{12}. As a student of pure math, sometimes it’s amazing to see what computers can do that my mind and pencil can’t; there are also times like this where I have to use math to rewrite the problem before I can get anything useful from the computer. At this point, I realized I had overlooked a simpler and more mathematical path to the solution.

Algorithm 2: using the arrayFold() function 

Let’s look again at the table of the first solutions that I showed earlier. 

┌─────b─┬─r──────┬─b/(b+r)──────────────┬────────────1/sqrt2─┬──────(sqrt2 - 1)*b─┐
│     3 │ [1]    │ [0.75]               │ 0.7071067811865475 │ 1.2426406871192854 │
│    15 │ [6]    │ [0.7142857142857143] │ 0.7071067811865475 │  6.213203435596427 │
│    85 │ [35]   │ [0.7083333333333334] │ 0.7071067811865475 │  35.20815280171309 │
│   493 │ [204]  │ [0.7073170731707317] │ 0.7071067811865475 │ 204.20728624993592 │
│  2871 │ [1189] │ [0.7071428571428572] │ 0.7071067811865475 │ 1189.2071375731562 │
│ 16731 │ [6930] │ [0.7071129707112971] │ 0.7071067811865475 │  6930.207112064255 │

When I first started this problem, my hunch was that there might be a recursive pattern between the pairs of b and r values, and then I moved on. When nothing else was working later, I came back to this hunch. By staring for a long time at the values for b and r, I noticed a pattern: r_{n+1} = 2b_n + r_n - 1 and b_{n+1} = 5b_n + 2r_n - 2. This is a much better way to solve this problem that doesn’t require ClickHouse to read 10 trillion rows through the numbers() function. 

While ClickHouse doesn’t exactly support recursive functions, the closest thing is the arrayFold() function, and it’s perfect for this problem. arrayFold() takes in a lambda function, a range to work over, and an accumulator that the function modifies. The accumulator is an n-tuple (acc.1, acc.2, …, acc.n). Here’s a simple example of how arrayFold() works in ClickHouse: 

SELECT arrayFold(
(acc, x) ->(acc.1 + x, acc.2, acc.3 * x, acc.1), 
(3::Int64, 2::Int64, 10::Int64, 0::Int64)
) AS final_result

arrayFold() starts with the accumulator (3,2,10,0) and it iterates for x in range(1,5). The lambda function is (acc, x) -> (acc.1 + x, acc.2, acc.3 * x, acc.1). So, the consecutive iterations of the program look like this:

Initial state of the accumulator(3,2,10,0)

That’s what’s happening under the hood. The final answer that ClickHouse returns is:

│ (13,2,240,9) │

I hope this shows how arrayFold() can be useful for recursive functions. It lets users modify, save, or move each element of the n-tuple while iterating many times. For Problem 100, I needed an arrayFold() that would use the (n-1)-th step to calculate the n-th step using the recursive relation I found.

arrayFold(acc, x -> if((acc.3 + acc.4 <= 1e12), (acc.3, acc.4, 5*acc.3 + 2*acc.4 - 2, 2*acc.3 + acc.4 - 1, acc.5 + 1),
(3::Int64, 1::Int64, 15::Int64, 6::Int64, 2::Int64))
AS "[blue(n-1), red(n-1), blue(n), red(n), iteration]"

This gives the result:

┌─[blue(n-1), red(n-1), blue(n), red(n), iteration]───────┐
│ (129858761425,53789260175,756872327473,313506783024,16) │

How Algorithm 2 works

This algorithm is short and powerful, returning the final answer in just a single call to arrayFold(). It’s a bit more complex than the arrayFold() example I gave above but it works the same, with an accumulator, lambda function, and range.

  • We have an accumulator of the form: (acc.1, acc.2, acc.3, acc.4, acc.5) = (b_{n-1}, r_{n-1}, b_n, r_n, n), where n is the number of solutions found so far.
  • The initial state of the accumulator holds the first and second solutions to the equation \frac{b}{b + r} \cdot \frac{b - 1}{b + r - 1} = \frac{1}{2}, which are (b,r) = (3, 1) and (15, 6), respectively. The solution counter starts at 2, because there are already 2 solutions stored in the accumulator. So for the n-th iteration, we have acc.5 = n + 2. The initial accumulator looks like (3, 1, 15, 6, 2). 
  • For the range, I somewhat arbitrarily started with range(20), which will repeat the lambda function on the accumulator 20 times. The final solution count will show that 20 is large enough. 
  • The lambda function takes the current accumulator and applies the recursive functions for b and r. In the n-th iteration of the function, we move b_{n-1} from the b_n spot to the b_{n-1} spot, move r_{n-1} from the r_n spot to the r_{n-1} spot, set the new b_n as 5b_{n-1} + 2r_{n-1} - 2, and set the new r_n as 2b_{n-1} + r_{n-1} - 1. Then we increase acc.5 by one to mark that we found another solution. 
  • Repeat this 20 times, or while acc.3 + acc.4 = b_n + r_n \leq 10^{12}. Once b_n + r_n > 10^{12}, then the lambda function leaves the accumulator the same for the remaining iterations. 
  • The solution counter stops when  b_n + r_n > 10^{12} and this tells us how many pairs of solutions there are in total: 16, which means there were 14 iterations where arrayFold() modified the accumulator (since we started with two solutions).
  • The answer to Problem 100 is the first b_n satisfying the criteria. ClickHouse returns this under b_n, and we get a final answer of 756,872,327,473. 

Hooray! This uses 4.0 MiB of memory and only 8.2 ms to run, which is orders of magnitude more efficient. I was sorry to part with my first algorithm, but the takeaway from this problem is to step back and look for an easier (and at all feasible) way to solve it. Rethinking the problem allowed me to play to the strengths of ClickHouse – I was shocked by how quickly it solved this complex problem. And just like that, Problem 100 is complete.

A note on correctness

Using this recursive function to solve the problem is a bit cheap unless I prove that it describes all solutions to the problem without including anything extra. Of course, the Project Euler website confirmed that I arrived at the correct solution with this method, so if that’s proof enough for you, feel free to skip this section. But I’m ashamed to admit, I can’t offer you a complete proof. Here’s what I would need to prove for this solution to be watertight for all values: A pair (b_n, r_n) is a solution to Problem 100 if and only if it satisfies the recursive functions. In symbols, that is

\frac{b_n}{b_n + r_n} \cdot \frac{b_n - 1}{b_n + r_n - 1} = \frac{1}{2} \Leftrightarrow r_n = 2b_{n-1} + r_{n-1} - 1, b_n = 5b_{n-1} + 2r_{n-1} - 2.

A constructive proof that modifies the last ClickHouse query can prove one side of this if-and-only-if statement. Suppose r_n = 2b_{n-1} + r_{n-1} - 1, b_n = 5b_{n-1} + 2r_{n-1} - 2 for some pair (b_n, r_n). To show that \frac{b_n}{b_n + r_n} \cdot \frac{b_n - 1}{b_n + r_n - 1} = \frac{1}{2} (\bigstar), I added a line to the query that will alert us if it finds a pair by recursion that does not satisfy (\bigstar).

arrayFold((acc, x) -> multiIf(
round((acc.3 / (acc.3 + acc.4))*((acc.3 - 1) / (acc.3 + acc.4 - 1)), 10) != 0.5, (-100,-100,-100,-100,-100), 
(acc.3 + acc.4 <= 1e12), (acc.3, acc.4, 5*acc.3 + 2*acc.4 - 2, 2*acc.3 + acc.4 - 1, acc.5 + 1),
(acc.1, acc.2, acc.3, acc.4, acc.5)),
(3::Int64, 1::Int64, 15::Int64, 6::Int64, 2::Int64)
) AS "[blue(n-1), red(n-1), blue(n), red(n), iteration]"

What does this mess mean? While if() has the form if(condition, then do x, else do y), multiIf() has the form multiIf(condition 1, then do x, else if condition 2, then do y, … , else do z), for some actions x, y, z. Here, my first condition says to check that b_n = acc.3 and r_n = acc.4 satisfy (\bigstar); if they don’t, then multiIf() sets every element of the accumulator to -100. If a pair (b_n,r_n) sneaks in that shouldn’t be included, this red flag will return negative numbers for any subsequent iterations. The rest of the multiIf() statement is the same as the if() statement I showed earlier, so this query also returns the final answer. I ran this and got the same answer as in my query above – most importantly, no negatives.

┌─[blue(n-1), red(n-1), blue(n), red(n), iteration]───────┐
│ (129858761425,53789260175,756872327473,313506783024,16) │

This proves that the recursive function doesn’t include any pairs that aren’t solutions to (\bigstar) – that’s half of the proof. Now we need to prove the other half: that the recursive function doesn’t skip any solutions to (\bigstar). Then we would be able to conclude that the recursive function includes all the solutions to (\bigstar) and no extras. For the second half, a proof by induction is necessary to prove that any solution to (\bigstar) has the described recursive relation to the steps before and after it. This would require very tricky separation of variables to get r_n and b_n in terms of b_{n-1} and r_{n-1}. I leave this as an exercise for unusually motivated readers. If you get to the point where it starts to look a little too tricky, I invite you to throw mathematical rigor to the wind and accept the hand-wavy recursive relation that “looks right,” because it fits the first few solutions and gives the correct final answer. 


I never had a brilliant “aha” moment about SQL while playing with these problems. To circle back to the quote from John von Neumann, this project was all about “getting used to” the mechanics of SQL in ClickHouse. The puzzles were an opportunity to look at ClickHouse from various angles and learn how to use features I liked, especially array functions, and how to work around ClickHouse weaknesses, like the limitations of ngrams().

Years of studying math have instilled the famous mathematician’s laziness (some prefer to call it “efficiency”), which taught me to use logic to find better paths to the same result. This approach applied well to breaking down complex problems into fast queries in ClickHouse, proving that even abstract math can have applications in computer science.

Math will continue to be useful for my final Project Euler problem, covered in Part 2 of this series, where I added another interesting feature to my understanding of ClickHouse: C++ user-defined functions. It will come out shortly, so watch this space.


Thank you to Alsu Giliazova, Alexander Zaitsev, Arthur Passos, and Robert Hodges at Altinity for your help with writing programs and putting this article together.


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.