Blog

ClickHouse® as a storage engine for MySQL? Let’s start an experiment

I don’t only write blogs myself, I also love reading them, and it’s not always about productive value, but to open your mind into different ways of thinking.

Because of my own MySQL history, I follow the great LeFred regularly, and once I stumbled over his blog about  How to use MySQL FEDERATED engine?

His blog reminded me of an experiment I did myself several years ago, about possibilities to introduce ClickHouse into our production system of my company during that time: Is it possible to use ClickHouse as a storage engine within your MySQL environment?

In short: Yes, you can, but most of the time, you shouldn’t!

Prerequisites

To be able to do the experiments yourself, you need to have a relatively basic setup. I used an t3.large AWS EC2 instance (2 vCPU Cores, 8GB Ram, 256GB gp3 storage) with Ubuntu 22.04 preinstalled.

On top of it, I installed Percona Server for MySQL 8.0, I installed Percona Server for MySQL 8.0, though other MySQL variants or even MariaDB should also be compatible.

Finally, I installed the latest ClickHouse Server Version 23.3 LTS on the same host.

As test data I used my well beloved weather dataset, containing daily weather data since 1900, as described in my (De-)Normalization Blog Post last year. If you want to recreate the experiment yourself, the exact dataset is not that important. It just needs to be big enough to test real use cases and see problems early. With that in mind you should be able to build your own setup with your own data quite easily.

Configure MySQL

As mentioned in LeFreds blog, per default MySQL has not enabled the federated engine, as it is not under active development right now (State: April 2023). But fortunately it’s not hard to enable it.

If you use Percona Server 8 as I do, all you need to do is just add one line to your /etc/mysql/my.cnf file in section [mysqld]:

federated

After a restart of MySQL server you can check if the setting works by issuing the SHOW ENGINES; command within your MySQL client:

SHOW ENGINES\G
...
...
*************************** 7. row ***************************
      Engine: FEDERATED
    Support: YES
    Comment: Federated MySQL storage engine
Transactions: NO
          XA: NO
  Savepoints: NO

Nothing more to prepare for MySQL.

Configure ClickHouse

Next we need to make sure that ClickHouse can be queried via the MySQL wire protocol. Open /etc/clickhouse-server/config.xml and check for this setting:

<mysql_port>9004</mysql_port>

You can double check everything by connecting to ClickHouse via the mysql client:

mysql -h 127.0.0.1 -P 9004 -u default --password
Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 23.3.1.2823-ClickHouse

Copyright (c) 2009-2023 Percona LLC and/or its affiliates
Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Start the experiment

Create structure in MySQL

First of all, we will create our basic structure in MySQL, to have something to compare our later results with:

CREATE DATABASE weather;
USE weather;
CREATE TABLE `sensor_data` (
  `station_id` char(11) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `date` date NOT NULL,
  `tempAvg` int DEFAULT NULL,
  `tempMax` int DEFAULT NULL,
  `tempMin` int DEFAULT NULL,
  `precipitation` int unsigned DEFAULT NULL,
  `snowfall` int unsigned DEFAULT NULL,
  `snowDepth` int unsigned DEFAULT NULL,
  `percentDailySun` tinyint unsigned DEFAULT NULL,
  `averageWindSpeed` int unsigned DEFAULT NULL,
  `maxWindSpeed` int unsigned DEFAULT NULL,
  `weatherType` enum('Normal','Fog','Heavy Fog','Thunder',
      'Small Hail','Hail','Glaze','Dust/Ash', 
      'Smoke/Haze', 'Blowing/Drifting Snow',
      'Tornado','High Winds','Blowing Spray',
      'Mist','Drizzle', 'Freezing Drizzle','Rain','Freezing Rain',
      'Snow','Unknown Precipitation','Ground Fog','Freezing Fog')
      DEFAULT NULL,
  PRIMARY KEY (`date`,`station_id`),
  KEY `station` (`station_id`)
) ENGINE=InnoDB;

After filling in the data (1 billion rows), it results in about 70GB, so nearly ten times of total memory of the instance itself.

We of course don’t expect a high speed, but we can get results in a somewhat reasonable time:

Q1:
SELECT COUNT(DISTINCT date) AS days, COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
  AND station_id LIKE 'US%';

+------+----------+----------+
| days | stations | data     |
+------+----------+----------+
|  731 |32429  | 16020472    |
+------+----------+----------+
1 row in set (26.38 sec)

Q2:
SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade;
+---------+--------+
| avgTemp | dekade |
+---------+--------+
|     8.9 |   1970 |
|     9.9 |   1980 |
|    10.2 |   1990 |
|    10.8 |   2000 |
|    10.7 |   2010 |
|    10.8 |   2020 |
+---------+--------+
6 rows in set (8 min 42.05 sec)

Next we create the same structure in ClickHouse.

Create structure in ClickHouse

We create the structure in ClickHouse quite similar to MySQL, just use some small ClickHouse specific optimizations

CREATE DATABASE weather;
USE weather;
CREATE TABLE sensor_data
(
  `station_id` LowCardinality(String),
  `date` Date32 CODEC(Delta(4), ZSTD(1)),
  `tempAvg` Nullable(Int32),
  `tempMax` Nullable(Int32),
  `tempMin` Nullable(Int32),
  `precipitation` Nullable(UInt32),
  `snowfall` Nullable(UInt32),
  `snowDepth` Nullable(UInt32),
  `percentDailySun` Nullable(UInt8),
  `averageWindSpeed` Nullable(UInt32),
  `maxWindSpeed` Nullable(UInt32),
  `weatherType` Nullable(Enum8('Normal' = 0, 'Fog' = 1, 
        'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 
        'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 
        'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 
        'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 
        'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 
        'Freezing Rain' = 17, 'Snow' = 18, 
        'Unknown Precipitation' = 19, 'Ground Fog' = 21, 
        'Freezing Fog' = 22))
)
ENGINE = MergeTree
PARTITION BY toYear(date)
ORDER BY (station_id, date)
SETTINGS index_granularity = 8192

In ClickHouse because of very good compression, the same data only takes up 3.4GB, so it would even fit into memory of the machine, but as we are running ClickHouse on the same host as MySQL, the memory usage will be shared so we won’t be able to profit from this.

Still, as we all expected, the queries on ClickHouse run a lot faster. Even without the possibility to use multiple cores (as the instance only has 2 vCPU, hence issuing only half of them)

Q1:
SELECT COUNT(DISTINCT date) AS days, 
  COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
AND station_id LIKE 'US%';

Query id: 9587c875-24af-443b-aadd-6c5b3264cdf0

┌─days─┬─stations─┬─────data─┐
│  731 │    32429 │ 16020472 │
└──────┴──────────┴──────────┘

1 row in set. Elapsed: 1.173 sec. Processed 16.03 million rows, 145.68 MB (13.67 million rows/s., 124.22 MB/s.)

Q2:
SELECT
  ROUND(AVG(tempAvg) / 10, 1) AS avgTemp,
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade ASC


┌─avgTemp─┬─dekade─┐
│     8.9 │   1970 │
│     9.9 │   1980 │
│    10.2 │   1990 │
│    10.8 │   2000 │
│    10.7 │   2010 │
│    10.8 │   2020 │
└─────────┴────────┘

6 rows in set. Elapsed: 4.757 sec. Processed 596.29 million rows, 5.37 GB (125.35 million rows/s., 1.13 GB/s.)

Connect MySQL to ClickHouse

As ClickHouse can speak the MySQL wire protocol like we’ve seen before, it can simulate to be a MySQL Server. And as the sensor_data table in ClickHouse only needs 1:20 of space compared to MySQL, we would like to access the data in ClickHouse directly from within MySQL.
That’s where the federated storage engine comes in, as it let us do exactly that.

First of all, we need to create a connection from within MySQL to the ClickHouse server by using the CREATE SERVER command:

CREATE SERVER localch
FOREIGN DATA WRAPPER mysql
OPTIONS (USER 'default', HOST '127.0.0.1', PORT 9004, DATABASE 'weather', PASSWORD '*****');

We gave the connection the name localch and pointed it to our ClickHouse instance based on the same host. Within the instance we named the database weather to be our source database.

Next we only need to create our remote table. As we already have a table in MySQL with the same structure, we can just copy the definition, and just change the engine part:

CREATE TABLE `sensor_data_ch` (
  `station_id` char(11) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `date` date NOT NULL,
  `tempAvg` int DEFAULT NULL,
  `tempMax` int DEFAULT NULL,
  `tempMin` int DEFAULT NULL,
  `precipitation` int unsigned DEFAULT NULL,
  `snowfall` int unsigned DEFAULT NULL,
  `snowDepth` int unsigned DEFAULT NULL,
  `percentDailySun` tinyint unsigned DEFAULT NULL,
  `averageWindSpeed` int unsigned DEFAULT NULL,
  `maxWindSpeed` int unsigned DEFAULT NULL,
  `weatherType` enum('Normal','Fog','Heavy Fog','Thunder',
      'Small Hail','Hail','Glaze','Dust/Ash', 
      'Smoke/Haze', 'Blowing/Drifting Snow',
      'Tornado','High Winds','Blowing Spray',
      'Mist','Drizzle', 'Freezing Drizzle','Rain','Freezing Rain',
      'Snow','Unknown Precipitation','Ground Fog','Freezing Fog')
      DEFAULT NULL,
  PRIMARY KEY (`station_id`, `date`),
  KEY `date` (`date`)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
CONNECTION 'localch/sensor_data';

This actually happens in an instant, as MySQL does not store any real data, only metadata information, all the data remains on the remote MySQL (or in this case remote ClickHouse).

So to check if everything is running, we can issue our first Query directly on the federated table in MySQL:

Q1:
SELECT COUNT(DISTINCT date) AS days, COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data_ch
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
  AND station_id LIKE 'US%';

+------+----------+----------+
| days | stations | data     |
+------+----------+----------+
|  731 |    32429 | 16020472 |
+------+----------+----------+
1 row in set (43.05 sec)

As we can see, the connection from MySQL to our local ClickHouse works perfectly, and we can query data from ClickHouse directly in MySQL. But there is an overhead in the process, as the same query in MySQL took only 26 seconds and now it takes 43 seconds.

This overhead comes because of the implementation of the federated engine in MySQL. It doesn’t send the query itself to the remote server, but just a few easy conditions like ‘=’, ‘>’, ‘<’… Then all rows matching these easy conditions are sent back to the initiator MySQL where all the calculations etc are done.

This can be seen, when we want to execute query 2 on that connection:

SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data_ch
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade;

ERROR 2013 (HY000): Lost connection to MySQL server during query

Actually I needed to greatly reduce the time condition to get any result at all. Reason being is because MySQL simply issues a query like this on the remote server:

SELECT tempAvg, date
FROM weather.sensor_data_ch
WHERE date >= '1970-01-01'

So it really only uses the remote ClickHouse as the storage, but it doesn’t make use of ClickHouse’s power much. It just selects the data and sends it over the ‘internal’ wire.

In this case, it means MySQL tries to get that half billion rows into memory for the later calculation and therefore gets OOM killed quite frankly.

Working use-case

So does this mean, all is lost?

Actually, no. If you’re considering using ClickHouse solely for data storage while keeping all the logic in MySQL, then the federated engine is not the right choice.

I would kindly lead you to TokuDB (unfortunatelly deprecated after Percona Server 8.2) or RocksDB in those cases. Both have compression by default and work quite well when it comes to handling a few billion rows, with a low disk footprint.

However, if you want to leverage ClickHouse’s power while primarily using MySQL for your application, you’ll need to take an additional step.

As we saw, the federated engine on the initiator doesn’t really care about how the table is structured on the remote server. It just sends something like

SELECT a,b FROM source_table WHERE c = ‘x’ 

and then works on the resulting dataset, like you would do with your application

But this quite simple approach can be actually utilized to achieve our goal. If the inquiring MySQL doesn’t care about how the remote server handles the tables internally, would it complain if the tables on the remote server are not even tables?

The answer is: No. It doesn’t care at all.

As long as it gets a result set for those simple SELECTs then all is fine.

That means, it should be possible to create a view on ClickHouse and just expose the view as a federated table.

So within ClickHouse let’s create a new view, which will be doing the hard work for us:

CREATE VIEW weather.avg_temp_per_dekade AS
SELECT 
    ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
    TRUNCATE(YEAR(date), -1) AS dekade  
FROM weather.sensor_data  
GROUP BY dekade  
ORDER BY dekade;

And within MySQL instead of going to the base table we access this view as a federated table:

CREATE TABLE weather.avg_temp_per_dekade (
  dekade INT PRIMARY KEY,
  avgTemp FLOAT
) ENGINE=FEDERATED
CONNECTION 'localch/avg_temp_per_dekade';

Our second Query in MySQL now looks a bit different, but also a lot easier!

Q2:
SELECT * 
FROM weather_view.avg_temp_per_dekade 
WHERE dekade >= 1970;

+--------+---------+
| dekade | avgTemp |
+--------+---------+
|   1970 |     8.9 |
|   1980 |     9.9 |
|   1990 |    10.2 |
|   2000 |    10.8 |
|   2010 |    10.7 |
|   2020 |    10.8 |
+--------+---------+
6 rows in set (4.94 sec)

And as you can see, we are now down to the query-speed of ClickHouse, while still being able to work with the final result in MySQL.

Of course in the end, this approach might lead to quite a lot of created views in ClickHouse, but as those are just metadata, you don’t have to worry about disk space etc.
At least this could be a solution to avoid creating a lot of pre-aggregated tables in MySQL, allowing you to save a lot of disk space.

Final thoughts

In the end, it is possible to use ClickHouse as a storage engine in MySQL, by utilizing the federated engine shipped with MySQL. But it is still missing a lot of polishment and quality to consider it as a good solution.

The fact that during my experiment I got the MySQL instance killed multiple times by a single query, just because it produced too many internal results, does not make me feel comfortable, when I think about any production use cases.

A better approach is to do it the other way around: use ClickHouse to connect to MySQL for OLTP data while performing all OLAP queries within ClickHouse.

ClickHouse is actively being developed, making its MySQL integration features far more mature for connecting from ClickHouse to MySQL than the other way around, whether from MySQL to ClickHouse or another MySQL instance.

Still, if you only have one or two queries in your application, where you need to access ClickHouse, and all others can run on MySQL, it might be a way to go with the federated engine.

Either way, if you want to utilize the federated engine with ClickHouse, want to optimize the code, or just have an idea for a similar blog post, feel free to reach out.

Originally published by Stefan while he was at Double.Cloud.

Share

ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.

Table of Contents:

Leave a Reply

Your email address will not be published. Required fields are marked *