Blog

Into the Deep End: Learning SQL by Taking On Project Euler Puzzles (Part 2)

Into the Deep End: Learning SQL by Taking On Project Euler Puzzles (Part 2)
Contents:

We can’t solve problems by using the same kind of thinking we used when we created them.

Albert Einstein

As a math student and SQL rookie, I’ve been absorbing the ins and outs of ClickHouse® in order to solve math problems. Math and programming go hand in hand; math has helped me design queries that ClickHouse can quickly compute once I’ve found the right SQL functions.

This article builds off my first article in the series chronicling how I learned SQL in ClickHouse and used it to solve a few Project Euler problems. Most problems can be solved with a range of tools, some more efficient than others. Sometimes, it’s faster to reframe the problem for the tools you have available and then find the solution. To solve my next problem, Problem 70 from the Project Euler archives, I needed to start with this kind of reframing. As I got more comfortable with ClickHouse, my strategy for Problem 70 evolved from making small optimizations in ClickHouse to making one big optimization through a user-defined function written in C++.

As for the previous article, I worked on a MacBook Pro with 2 GHz Quad-Core Intel Core i5 processor with hyper-threading enabled and 16GB memory. I used ClickHouse version 24.6.1.2493 to run SQL queries on my computer. For each of my algorithms, I give execution time as the average time of five trials. Once again, this article is inspired by a similar one by Alexander Zaitsev at Altinity: “Math Games in ClickHouse®– Solving Project Euler Problems”.

Project Euler Problem 70: Totient Permutation

This problem rated at 20% difficulty, which is in between the other two problems I worked on. It reads as follows: “Euler’s totient function, \phi(n)…is used to determine the number of positive numbers less than or equal to n which are relatively prime to n. For example, as 1, 2, 4, 5, 7, and 8 are all less than 9 and relatively prime to nine, \phi(9) = 6. The number 1 is considered to be relatively prime to every positive number, so \phi(1) = 1. Interestingly, \phi(87109) = 79180, and it can be seen that 87109 is a permutation of 79180. Find the value of n, 1 < n < 107, for which \phi(n) is a permutation of n and the ratio \frac{n}{\phi(n)} produces a minimum.”

In the example above with the pair 87109 and 79180, the ratio \frac{87109}{79180} is about 1.100. Below, I will search for pairs of numbers that produce as small a ratio as possible.

Algorithm 1: counting coprimes

My plan for the first algorithm followed the problem statement pretty directly. 

  • Find the relative primes (coprimes) of each number n that are \leq n
  • Analyze the totient function \phi(n) and decide if it is a permutation of n
  • If so, keep n, and find the n with the smallest ratio \frac{n}{\phi(n)}

As I worked on this first algorithm, I realized that an even number n will never have a small ratio, because n is not coprime with any even numbers below n. So one optimization I added to this algorithm is considering only odd numbers for my_num in a Common Table Expression (CTE). The WITH clause means ClickHouse first makes an array of odd numbers through 44,000 and then the main query only uses numbers from this array odd_numbers.

WITH
odd_numbers AS (
	SELECT number AS my_num
	FROM numbers(44000)
	WHERE my_num > 1 AND my_num % 2 = 1
)
SELECT
my_num,
arrayCount(x -> gcd(x, my_num) = 1, range(1, my_num)) AS totient_func,
my_num / totient_func AS ratio
FROM odd_numbers
WHERE arraySort(ngrams(toString(totient_func), 1)) = arraySort(ngrams(toString(my_num), 1))
ORDER BY ratio
LIMIT 5

ClickHouse returned the following result. In this query running up to 44,000, my_num in the top line is the winner because it has the smallest ratio.

┌─my_num─┬─totient_func─┬──────────────ratio─┐
│  20617 │        20176 │ 1.0218576526566217 │
│  22471 │        21472 │ 1.0465257078986587 │
│  35683 │        33568 │ 1.0630064346997141 │
│   4435 │         3544 │ 1.2514108352144468 │
│  29817 │        19872 │ 1.5004528985507246 │
└────────┴──────────────┴────────────────────┘

How Algorithm 1 works

A few functions do a lot of heavy lifting, so let’s break down what’s happening. See Part 1 of this blog series for another fun application of the ngrams() function!

  • Two numbers n and m are coprime if they have no common factors greater than 1, i.e. the greatest common denominator (gcd) of n and m is 1. 
  • arrayCount() is a higher-order array function that takes in a lambda function func and an array to work over. It returns the number of elements x in the array where func(x) \neq 0. Here, the lambda function gcd(x, my_num) = 1 checks if x is coprime to my_num for all x in the array range(1, my_num), which is the array of integers from 1 through my_num – 1. arrayCount() returns for how many numbers this is true – this is the totient function of my_num.
  • Take the ratio my_num/totient_func.
  • To check if totient_func is a permutation of my_num, turn both numbers into strings, then use ngrams() to split each string into an array of digits (strings of length 1). The arraySort() function orders the elements of each array in ascending order. If the two sorted arrays are equal, then we have a permutation.
  • The results are ordered by ratio.

This query takes 42.939 seconds and 352 KB of memory – not great. It works for odd numbers up to about 44,000, but for 45,000 or more, ClickHouse throws the following error message:

DB::Exception: A call to function range would produce 500036682 array elements, which is greater than the allowed maximum of 500000000: while executing 'FUNCTION range(1_UInt8 : 1, __table1.my_num : 0) -> range(1_UInt8, __table1.my_num) Array(UInt64) : 3'. (ARGUMENT_OUT_OF_BOUND)

This means it runs out of array space on the line that finds the relative primes, where it has to check every number up to my_num. Thus, I was far from the problem goal of getting up to 10^7 – the query didn’t work at all on the scale I needed.

Algorithm 2: Euler’s product formula

I looked around and found a different approach to calculate the totient function. We can calculate the totient function for each number using Euler’s product formula: \phi(n) = n\prod_{p|n}(1 - \frac{1}{p}), where each p is a distinct prime factor of n. Of course, finding coprimes and finding prime factors are similar processes, but I hoped this might be more efficient, because n has fewer prime factors than coprimes. 

I used the arraySort() function again in this algorithm, as well as arrayFilter(), arrayExists(), and arrayProduct(). I explained arrayFilter() in my first article, so I’ll just give an overview of arrayExists() and arrayProduct() here. arrayExists() is a higher-order function that takes a lambda function func and an array arr to work over, similar to other array functions we’ve looked at. It returns 1 if there is at least one element x in arr such that func(x) \neq 0. Otherwise it returns 0. Here is a small example checking for even numbers, first in range(1,10), the array of integers from 1 through 9, and then in an array of odd integers from 1 through 9.

SELECT arrayExists(x -> x % 2 = 0, range(1,10)) AS result1, 
arrayExists(x -> x % 2 = 0, array(1,3,5,7,9)) AS result2

This one isn’t too complex, and it gives the result we expect:

┌─result1─┬─result2─┐
│       1 │       0 │
└─────────┴─────────┘

arrayProduct() is another cool array function that is useful for this query. arrayProduct() can take an array of numbers as an argument, or it can take a lambda function and an array of numbers. In the first case, arrayProduct() simply multiplies the elements of the array. In the second case, it is a higher-order function, so it first applies the function to each element of the array, then multiplies together the results. Here’s an example of both cases:

SELECT arrayProduct(range(1,4)) AS result1,
arrayProduct(x -> x*2, range(1,4)) AS result2

For result1, we get the product 1 \cdot 2 \cdot 3 and for result2, we get the product (1 \cdot 2) \cdot (2 \cdot 2) \cdot (3 \cdot 2).

┌─result1─┬─result2─┐
│       6 │      48 │
└─────────┴─────────┘

Now, using these new array functions, here is my next algorithm for Problem 70, which uses the prime factors of a number to calculate its totient function.

WITH
odd_numbers AS (
	SELECT number AS my_num
	FROM numbers(44000)
	WHERE my_num > 1 AND my_num % 2 = 1
)
SELECT
my_num,
arrayFilter(x -> my_num % x = 0, range(2, my_num+1)) AS nontriv_factors,
arrayFilter(x -> not arrayExists(n -> x % n = 0, range(2, toUInt16(sqrt(x)) + 1)), nontriv_factors) AS prime_factors,
arrayProduct(arrayMap(x -> 1 - (1.0/x), prime_factors)) AS arr_product,
round(my_num * arr_product) AS totient_func,
my_num / totient_func AS ratio
FROM odd_numbers
WHERE arraySort(ngrams(toString(totient_func), 1)) = arraySort(ngrams(toString(my_num), 1))
ORDER BY ratio
LIMIT 5

ClickHouse returns the same list of numbers as for Algorithm 1, which is reassuring.

┌─my_num─┬─nontriv_factors───────┬─prime_factors─┬────────arr_product─┬─totient_func─┬──────────────ratio─┐
│  20617 │ [53,389,20617]        │ [53,389]      │  0.978609885046321 │        20176 │ 1.0218576526566217 │
|  22471 │ [23,977,22471]        │ [23,977]      │ 0.9555426994793289 │        21472 │ 1.0465257078986587 │
|  35683 │ [17,2099,35683]       │ [17,2099]     │ 0.9407280777961494 │        33568 │ 1.0630064346997141 │
│   4435 │ [5,887,4435]          │ [5,887]       │ 0.7990980834272831 │         3544 │ 1.2514108352144468 │
│  29817 │ [3,9,3313,9939,29817] │ [3,3313]      │ 0.6664654391789919 │        19872 │ 1.5004528985507246 │
└────────┴───────────────────────┴───────────────┴────────────────────┴──────────────┴────────────────────┘

How Algorithm 2 works 

This algorithm has a bit more going on than the first one, but hopefully the functions involved are familiar by now.

  • Again, check only odd numbers, which helps a bit with memory. 
  • Find the non-trivial factors of my_num, then use arrayFilter() to keep only the prime factors. A prime factor p has the quality that no integer in between 2 and \sqrt{p} divides p, so arrayFilter() removes any numbers in nontriv_factors where this is not the case. 
  • Use the Euler product formula on the prime factors of my_num to get the totient function. When multiplying my_num * arr_product, we need to apply the round() function so that ClickHouse turns 3544.0000000000005 into 3544, for example. This ensures that the comparison of totient_func and my_num is accurate.
  • The rest of the program works the same as Algorithm 1 to check permutation and take the desired ratio. 

The query completes in 38.687 seconds and uses 352 KB memory. This is slightly faster than the first algorithm, but still slow. It gets stuck at numbers(45,000) when it runs out of array space and gives the same error message as before:

DB::Exception: A call to function range would produce 500036682 array elements, which is greater than the allowed maximum of 500000000: while executing 'FUNCTION range(2_UInt8 :: 4, plus(__table1.my_num, 1_UInt8) :: 3) -> range(2_UInt8, plus(__table1.my_num, 1_UInt8)) Array(UInt64) : 8'. (ARGUMENT_OUT_OF_BOUND)

The issue arises from getting the prime factors of each number. This limitation isn’t a big surprise because the processes of finding prime factors and coprimes should use similar amounts of array space and memory. This is where I ran into a roadblock, as prime factorization looked like the most efficient way to calculate this problem. Unfortunately, I couldn’t find a way to optimize this calculation up to 10^7 in ClickHouse. I had to think outside the box and use the tools that ClickHouse does offer.  

Algorithm 3: fun with UDF’s

A better path forward was to use a User-Defined Function (UDF) inside my query to outsource the prime factorization to C++. With a UDF, users can write code in a separate file, in any language, and instruct ClickHouse how to run it. My goal was for the UDF to take in a series of numbers (this is where ClickHouse feeds in numbers(1e7)) and for each number, return the distinct prime factors in a string. I have a bit of experience with C++, so this was an opportunity to practice another programming language! I had in mind what I wanted my final query to look like. Below, prime_factors_cpp() is the C++ program that computes prime factors.

WITH
odd_numbers AS (
	SELECT number AS my_num
	FROM numbers_mt(1e7)
	WHERE my_num > 1 AND my_num % 2 = 1
)
SELECT
my_num,
splitByNonAlpha(prime_factors_cpp(my_num)) AS prime_factors_arr_str,
arrayMap(x -> toUInt32(x), prime_factors_arr_str) AS prime_factors,
arrayProduct(x -> 1 - (1.0/x), prime_factors) AS arr_product,
round(my_num * arr_product) AS totient_func,
my_num / totient_func AS ratio
FROM odd_numbers
WHERE arraySort(ngrams(toString(totient_func), 1)) = arraySort(ngrams(toString(my_num), 1))
ORDER BY ratio
LIMIT 1

You’ll have to read until the end to see the final answer to this query. First I’ll show the details of the C++ code, how I turned this into a UDF, and then my solution to Problem 70 using the final ClickHouse query.

factors.cpp code

The C++ main() program runs using a few auxiliary functions, primesUpTo() and distinctPrimeFactors(). Since returning the prime factors of every number up to 10^7 is a monstrous effort, I thought it would be more efficient to first make a list of all the possible prime factors, i.e., all primes up to \sqrt{10^7}. Then for each n between 1 and 10^7, I can search that list and see if any of those primes divide n. The code for the primesUpTo() function is based on pseudocode from the Sieve of Eratosthenes Wikipedia site. I put it in a file named factors.cpp.

#include <iostream>
#include <vector>
#include <cmath>

/**************************  Function bank ***************************/
                                   
std::vector<int> primesUpTo(int n)
/* 
Input: integer n.
Description: make a vector of all prime integers up to and including n.
Uses the Sieve of Eratosthenes method: all multiples of primes are themselves not prime.
*/
{
      std::vector<bool> isPrime(n+1, true);
      //isPrime stores whether each number is prime. First assume all are prime.
      std::vector<int> primes;
      isPrime[0] = isPrime[1] = false;
      //description of for loop: if isPrime tells us that a number i is prime, 
      //then add i to the vector primes. when this loop reaches any multiple j of i, it 
      //will then mark j as not prime. Only numbers that aren't multiples of any smaller 
      //number retain "true" in the isPrime vector.
      for (int i = 2; i <= n; i++)
      {
            if (isPrime[i])
            {
                  primes.emplace_back(i);
                  for (int j = i*i; j <= n; j += i)
                  {
                        isPrime[j] = false; 
                  }
            }
      }
      return primes;
}

/*********************************************************************/

std::vector<int> distinctPrimeFactors(int n, const std::vector<int> 
&primes)  
/*
Input: a number n to factorize and a vector of integers. Assumes that this is a vector of primes.
Description: returns a vector of the distinct prime factors of n.
*/
{
    std::vector<int> factors;
    if (n > 1) 
    {
          //description of for loop: runs through all elements up to 
          //sqrt(n) in primes. if the loop finds a primeNumber that is     
          //a factor of n, it adds primeNumber to factors.
          for(const auto primeNumber : primes)
          {
                if (primeNumber * primeNumber > n)
                {
                      break;
                }
                if (n % primeNumber == 0)
                {
                      factors.emplace_back(primeNumber);
                } 
          }
   }
   //if factors is empty, then we know n had no factors <= sqrt(n),
   //therefore n is prime and it is its own prime factor.
   if (factors.empty()) 
   {
        factors.emplace_back(n);
   }
   return factors;
}

/**************************   Main program   *************************/

int main() 
{
   static constexpr std::size_t max_num = 1e7;
   auto max_num_sqrt = static_cast<int>(sqrt(max_num)) + 1;
   std::vector<int> primes = primesUpTo(max_num_sqrt);   
   //for the specific purposes of this problem, we need only the primes    
   //up to(sqrt(1e7)
   std::size_t number;
   while (std::cin >> number)
    { 
          if (number > 1 && number <= max_num)   
          {
               std::vector<int> factors = distinctPrimeFactors(number, 
    primes);
               std::cout << "[" << factors[0];
               for (std::size_t i = 1; i < factors.size(); i++) 
               {
                  std::cout << ", "<< factors[i];
               }
               std::cout << "]";
          }
          std::cout << "\n"; 
    }
    return 0;
}

Making a UDF

It was challenging to turn factors.cpp, the file implementing prime_factors_cpp(), from a C++ file into a UDF that I could use in clickhouse-client. I already had clickhouse-client downloaded into a directory called clickhouse. First, I compiled factors.cpp into an executable called factors, moved factors to the directory clickhouse/user_scripts, and changed the file permission of factors.

% g++ -o factors -Wall -Wextra factors.cpp  
% mv HOME/atom/workspace/factors HOME/clickhouse/user_script
% chmod +x clickhouse/user_scripts/factors

As I explain each step of making the UDF, this diagram may be a helpful bird’s-eye view of the process. If you are using APT or RPM packages, mentally translate $HOME/clickhouse to /etc/clickhouser-server.

Next, I created a function definition in a new file called function_def.xml. The inside of the file looked like this: 

<functions>
	<function>
    		<type>executable</type>
    		<name>prime_factors_cpp</name>
    		<return_type>String</return_type>
    		<argument>
        		<type>UInt64</type>
        		<name>number</name>
    		</argument>
    		<format>TabSeparated</format>
    		<command>factors</command>
	</function>
</functions>

I also defined a function path, which tells ClickHouse where on my computer to find the function definition. Inside a new file called function_def_path.xml I put the text:

<clickhouse>
  <user_defined_executable_functions_config>*_function.xml</user_defined_executable_functions_config>
</clickhouse>

Back in terminal, I moved function_def.xml into clickhouse, and function_def_path.xml into a new directory called clickhouse/config.d: And that’s a finished UDF!

% mv HOME/function_def.xml clickhouse
% mkdir clickhouse/config.d
% mv HOME/function_def_path.xml clickhouse/config.d

Running prime_factors_cpp() in clickhouse-client

At this point, I could run prime_factors_cpp() in clickhouse-client like any other function: 

SELECT number, prime_factors_cpp(number) FROM numbers(20)

ClickHouse returns the prime factors of all numbers up to 20.

┌─number─┬─prime_factors_cpp(number)─┐
│      0 │                           │
│      1 │                           │
│      2 │ [2]                       │
│      3 │ [3]                       │
│      4 │ [2]                       │
│      5 │ [5]                       │
│      6 │ [2, 3]                    │
│      7 │ [7]                       │
│      8 │ [2]                       │
│      9 │ [3]                       │
│     10 │ [2, 5]                    │
│     11 │ [11]                      │
│     12 │ [2, 3]                    │
│     13 │ [13]                      │
│     14 │ [2, 7]                    │
│     15 │ [3, 5]                    │
│     16 │ [2]                       │
│     17 │ [17]                      │
│     18 │ [2, 3]                    │
│     19 │ [19]                      │
└────────┴───────────────────────────┘

Final ClickHouse query

Finally, I implemented prime_factors_cpp() on a larger scale to solve Problem 70. This final ClickHouse query turns the string returned by the UDF into an array using the splitByNonAlpha() function, and from there the query is the same as Algorithm 2. A small adjustment to my query is changing the number of threads of execution my computer runs, with the SQL command

SET max_threads = 8

Then, I called numbers_mt(1e7) as opposed to numbers(1e7). The latter tells ClickHouse to run the UDF on all numbers sequentially (i.e. calculating \phi(3), \phi(4), \phi(5), …), all on a single thread, which doesn’t use all cores available on my laptop CPU. Including “_mt” tells ClickHouse to run on multiple threads at the same time, which speeds up the computation time by 3x. Here is the final ClickHouse query that calls prime_factors_cpp():

WITH
odd_numbers AS (
	SELECT number AS my_num
	FROM numbers_mt(1e7)
	WHERE my_num > 1 AND my_num % 2 = 1
)
SELECT
my_num,
splitByNonAlpha(prime_factors_cpp(my_num)) AS prime_factors_arr_str,
arrayMap(x -> toUInt32(x), prime_factors_arr_str) AS prime_factors,
arrayProduct(x -> 1 - (1.0/x), prime_factors) AS arr_product,
round(my_num * arr_product) AS totient_func,
my_num / totient_func AS ratio
FROM odd_numbers
WHERE arraySort(ngrams(toString(totient_func), 1)) = arraySort(ngrams(toString(my_num), 1))
ORDER BY ratio
LIMIT 1

When I run prime_factors_cpp() on its own through clickhouse-client on all odd numbers in numbers_mt(1e7), it completes in 13.667 seconds and uses 83.24 GiB of memory. The final ClickHouse query, including the function prime_factors_cpp(), completes in 13.748 seconds and uses 83.24 GiB of memory – about the same time as Algorithms 1 and 2, but on a much larger scale. It returns the final answer, 8,319,823.

┌──my_num─┬─prime_factors_arr_str─┬─prime_factors─┬────────arr_product─┬─totient_func─┬──────────────ratio─┐
│ 8319823 │ ['2339','3557']       │ [2339,3557]   │  0.999291451272461 │      8313928 │ 1.0007090511248113 │
└─────────┴───────────────────────┴───────────────┴────────────────────┴──────────────┴────────────────────┘

I’m pretty content with this program. Of course, one part of the algorithm that can be improved is making the C++ program run dynamically and be flexible to any user input. Right now, it has 10^7 hard-coded into the main() function in order to generate the vector of primes up to 10^7. Ideally, that wouldn’t be hard-coded in, and the prime_factors_cpp() function could run through any number the user chooses. I leave this as an exercise for the reader. As you range through the numbers, you may want to store the primes you’ve found so far in a way that helps you find the next ones. This could be a variation of the Sieve of Eratosthenes.

Conclusion 

Problem 70 was a challenging yet rewarding project. I might have saved myself some execution time and a lot of fuss by solving this problem entirely with C++. While I generally don’t believe in making a problem harder than it has to be, restricting myself to ClickHouse made a 20% difficulty problem more interesting. It forced me to understand the mechanics of ClickHouse, C++, and my computer better.

If you enjoyed this article and want to try math problems yourself, Altinity has a treat for you. We have a contest going on right now to test your math skills. You need to use SQL to find prime numbers whose digits also sum to prime numbers. If you can get it to run in under 200 milliseconds, Altinity will send you a cool T-Shirt!

Acknowledgements

Several people helped me to learn and succeed. Thank you to Alsu Giliazova, Alexander Zaitsev, Arthur Passos, and Robert Hodges at Altinity for your help with writing programs and putting this and the previous article together.

Share

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.