Using Version 2.0 of the Altinity ClickHouse Sink Connector

Altinity recently released version 2.0 of the Altinity ClickHouse Sink Connector. This open-source, Apache 2.0-licensed project lets you mirror any changes from a MySQL or PostgreSQL database into ClickHouse. Our earlier blog post discussed what’s new in the 2.0 release. In this article we’ll start a test environment and put the Sink Connector to work. 

Running the Sink Connector

We’ll use docker compose to get a complete environment up and running. The environment contains a virtual network with containers for MySQL, ClickHouse, Zookeeper, Grafana, Prometheus, and the Sink Connector itself. 

Once the environment is up, we’ll go through some basic tests of the Sink Connector, including a look at fault tolerance. Next we’ll look at a sample application that creates a 5000-record database. With that database created, we’ll run some advanced analytics queries that show the power of using ClickHouse to analyze data from MySQL.

To the command line we go!

First, clone the repo: 

> git clone https://github.com/Altinity/clickhouse-sink-connector.git

There’s a directory with YAML files, shell scripts, and configuration files. Switch to it: 

> cd clickhouse-sink-connector/sink-connector-lightweight/docker

Set the CLICKHOUSE_SINK_CONNECTOR_LT_IMAGE environment variable so Docker knows which version (which tag) of the Sink Connector image to download: 

> export CLICKHOUSE_SINK_CONNECTOR_LT_IMAGE=altinity/clickhouse-sink-connector:2.0.1-lt

(This tag value is current as of April 2024; be sure to check the Sink Connector documentation for the correct tag.)

Now it’s time to let Docker set up the infrastructure for us: 

> docker compose -f docker-compose-mysql.yml up --renew-anon-volumes

This will take a few seconds, but when it’s done, you’ll have six containers running: 

> docker ps
CONTAINER ID   NAMES                                                          
6fb1608f567a   clickhouse                
e878c630ead9   docker-grafana-1 
2c4f302f72af   mysql-master     
05ce287a609f   docker-zookeeper-1 
9932c0dd984d   prometheus         
f7d471e3460a   docker-clickhouse-sink-connector-lt-1 

We’ll interact with the clickhouse and mysql-master containers here, with docker-clickhouse-sink-connector-lt-1 and the other containers working in the background. Our ClickHouse server is on localhost:8123 and the MySQL server is on localhost:3306, both with the username/password of root/root.
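For example, assuming you have the standard mysql and clickhouse-client command-line tools available, you can connect to both servers like this (a sketch; adjust the details if you’ve changed the defaults):

> mysql -h 127.0.0.1 -P 3306 -u root -proot test
> docker exec -it clickhouse clickhouse-client --user root --password root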

NOTE: In this section we focus on SQL statements, query results, and other outputs. Everything we do here can be done from the command line, but there are plenty of graphical tools that make it easier to work with the components we’re using here. (Not that you asked, but Docker Desktop for containers and DBeaver for databases are our favorites.)

Inserting data into MySQL

The MySQL image comes with a database named test; that’s where we’ll work. To get things started, we’ll go to a MySQL client and run the following commands: 

> USE test;
[2024-04-02 09:10:01] completed in 2 ms
> CREATE TABLE foo (id INT PRIMARY KEY, value VARCHAR(100));
[2024-04-02 09:10:49] completed in 16 ms
> INSERT INTO foo VALUES (1, 'Monday');
[2024-04-02 09:11:29] 1 row affected in 5 ms
> SELECT * FROM foo;
id|value |
--+------+
 1|Monday|

So we’ve created a table and inserted some data. Simple as that.

Finding the data in ClickHouse

Now let’s go to ClickHouse and see if our new table and its data are available: 

> USE test;
[2024-04-02 09:12:39] completed in 4 ms
> SELECT * FROM foo FINAL;
id|value |_version           |is_deleted|
--+------+-------------------+----------+
 1|Monday|1775149050623043561|         0|

The table and its data are there, but the ClickHouse table has two additional fields: _version and is_deleted. Those fields let ClickHouse keep track of the latest values of the fields in our database, including whether those fields still exist. 

Which brings us to the FINAL keyword in the ClickHouse SELECT statement above. FINAL returns the final value of each row after all updates and deletes are processed. The Sink Connector creates ClickHouse tables that use the ReplacingMergeTree engine, which processes UPDATEs and DELETEs asynchronously. Without FINAL, you may see values that reflect changes that haven’t been processed yet. (If you’d like to know more, see Alexander Zaitsev’s blog post ClickHouse ReplacingMergeTree Explained: The Good, The Bad, and The Ugly.) 
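By the way, you can see that engine for yourself by running SHOW CREATE TABLE foo in ClickHouse. The exact DDL depends on the Sink Connector’s configuration, but it will look roughly like this:

CREATE TABLE test.foo
(
    id Int32,
    value Nullable(String),
    _version UInt64,
    is_deleted UInt8
)
ENGINE = ReplacingMergeTree(_version, is_deleted)
ORDER BY id;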

So at this point we’ve created a table named foo inside the MySQL database test and put some data in it. The Sink Connector created an equivalent foo table in the ClickHouse database named test, then copied the data from MySQL to ClickHouse. 

Making changes to the data

Now let’s go back to our MySQL client and run a couple more INSERT statements, two UPDATEs, and a DELETE: 

> INSERT INTO foo VALUES (2, 'Maddie');
[2024-04-02 09:40:42] 1 row affected in 7 ms
> INSERT INTO foo VALUES (3, 'Raleigh');
[2024-04-02 09:41:43] 1 row affected in 6 ms
> UPDATE foo SET value='Tuesday' WHERE id=1;
[2024-04-02 09:45:03] 1 row affected in 6 ms
> UPDATE foo SET value='Wednesday' WHERE id=1;
[2024-04-02 09:46:04] 1 row affected in 5 ms
> DELETE FROM foo WHERE id=3;
[2024-04-02 09:46:55] 1 row affected in 6 ms
> SELECT * FROM foo;
id|value    |
--+---------+
 1|Wednesday|
 2|Maddie   |

When we go to ClickHouse, we get the results we expect: 

> SELECT * FROM foo FINAL;
id|value    |_version           |is_deleted|
--+---------+-------------------+----------+
 2|Maddie   |1775262359745462335|         0|
 1|Wednesday|1775262150030262334|         0|

Handling outages

One hard requirement is that the Sink Connector must be robust: if the Sink Connector or ClickHouse goes down, any updates to the MySQL data must still find their way to the ClickHouse database once those components are back up. To illustrate, we’ll simulate an outage for the Sink Connector: 

> docker pause docker-clickhouse-sink-connector-lt-1

Now we’ll add data to MySQL: 

> INSERT INTO foo VALUES (4, 'Orange');
[2024-04-02 13:17:17] 1 row affected in 13 ms
> SELECT * FROM foo;
id|value    |
--+---------+
 1|Wednesday|
 2|Maddie   |
 4|Orange   |

When we go to our ClickHouse server and query the database, our new data is missing, as you’d expect: 

> SELECT * FROM foo FINAL; 
id|value    |_version           |is_deleted|
--+---------+-------------------+----------+
 2|Maddie   |1775262359745462335|         0|
 1|Wednesday|1775262150030262334|         0|

Now let’s unpause the Sink Connector: 

> docker unpause docker-clickhouse-sink-connector-lt-1

Rerunning the query in ClickHouse shows the new data: 

> SELECT * FROM foo FINAL;
id|value    |_version           |is_deleted|
--+---------+-------------------+----------+
 1|Wednesday|1775262150030262334|         0|
 2|Maddie   |1775262359745462335|         0|
 4|Orange   |1775262967919542336|         0|

Similarly, if we leave the Sink Connector up and running but take ClickHouse down with docker pause clickhouse, any changes we make to the MySQL database will be sent to ClickHouse when we use docker unpause clickhouse to bring it back up.
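Sketched out, that second test looks like this (make whatever MySQL changes you like while ClickHouse is paused):

> docker pause clickhouse
  (INSERT, UPDATE, or DELETE rows in MySQL)
> docker unpause clickhouse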

Using a larger dataset

If you cloned the repo, the sink-connector-lightweight/docker/example directory has a sample application and some data to create a larger dataset in the MySQL database. It creates a table, then reads lines of data from a CSV file and adds each one to the new table. 

The CSV file is a set of 5000 sales records from excelbianalytics.com. If you’d like to push your environment further, the site has CSV files of various sizes, the largest being 5 million records. Thanks to the owners of that site for making these files copyright-free to use for any purpose.

Running the sample is straightforward. First of all, we’re assuming that you’ve used docker compose to start the demo environment so you have a MySQL server, a ClickHouse server, and the Altinity ClickHouse Sink Connector.

To start, change to the sink-connector-lightweight/docker/example directory and install the npm packages the code needs:

> npm install

Now it’s time to run the sample:

> node loadCSVData.js

(BTW, you can change some parameters, such as your MySQL server’s hostname, port, username, and password. Running node loadCSVData.js --help will give you all the details. There’s also a complete README file in the example directory.) 

The code includes the following CREATE statement: 

CREATE TABLE IF NOT EXISTS sales_records (
    region VARCHAR(50), 
    country VARCHAR(50),
    item_type VARCHAR(20),
    sales_channel VARCHAR(20),
    order_priority VARCHAR(5),
    order_date DATE,
    order_id BIGINT,
    ship_date DATE,
    units_sold INT,
    unit_price FLOAT,
    unit_cost FLOAT,
    total_revenue FLOAT,
    total_cost FLOAT,
    total_profit FLOAT,
    PRIMARY KEY(order_id)
);

Each entry in the table has the details of a sale. For our analytics queries, we’ll look at things like the region and country for each sale, the day of the week each sale took place (from order_date), and the profit margin (total_profit/total_revenue) for each sale. ClickHouse can calculate the aggregates we need for our queries extremely efficiently. 
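For example, here’s a trivial query that calculates the margin for a handful of individual sales once the data is loaded; it runs on either database and just shows the derived value we’ll aggregate below:

> SELECT order_id, total_profit / total_revenue AS profit_margin FROM sales_records LIMIT 5;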

When you run the example, you’ll see something like this:

Connected to MySQL database
Table created successfully.
All lines processed.

The code creates the following table in MySQL: 

> DESCRIBE sales_records;
Field         |Type       |Null|Key|Default|Extra|
--------------+-----------+----+---+-------+-----+
region        |varchar(50)|YES |   |       |     |
country       |varchar(50)|YES |   |       |     |
item_type     |varchar(20)|YES |   |       |     |
sales_channel |varchar(20)|YES |   |       |     |
order_priority|varchar(5) |YES |   |       |     |
order_date    |date       |YES |   |       |     |
order_id      |bigint     |NO  |PRI|       |     |
ship_date     |date       |YES |   |       |     |
units_sold    |int        |YES |   |       |     |
unit_price    |float      |YES |   |       |     |
unit_cost     |float      |YES |   |       |     |
total_revenue |float      |YES |   |       |     |
total_cost    |float      |YES |   |       |     |
total_profit  |float      |YES |   |       |     |

The order_id field is the primary key. The Altinity Sink Connector creates a ClickHouse table that looks like this: 

> DESCRIBE sales_records;
name          |type             |...|comment|ttl_expression|
--------------+-----------------+---+-------+--------------+
region        |Nullable(String) |...|       |              |
country       |Nullable(String) |...|       |              |
item_type     |Nullable(String) |...|       |              |
sales_channel |Nullable(String) |...|       |              |
order_priority|Nullable(String) |...|       |              |
order_date    |Nullable(Date32) |...|       |              |
order_id      |Int64            |...|       |              |
ship_date     |Nullable(Date32) |...|       |              |
units_sold    |Nullable(Int32)  |...|       |              |
unit_price    |Nullable(Float32)|...|       |              |
unit_cost     |Nullable(Float32)|...|       |              |
total_revenue |Nullable(Float32)|...|       |              |
total_cost    |Nullable(Float32)|...|       |              |
total_profit  |Nullable(Float32)|...|       |              |
_version      |UInt64           |...|       |              |
is_deleted    |UInt8            |...|       |              |

As the primary key, order_id can’t be null. The ClickHouse DESCRIBE output also includes the columns default_type, default_expression, comment, codec_expression, and ttl_expression, but we won’t discuss those here. (To keep the listing simple, we left most of them out of the output above.) 

Now running SELECT * FROM sales_records LIMIT 10 in your favorite MySQL client shows the data: 

> SELECT * FROM sales_records LIMIT 10;
region            |country      |item_type      |sales_channel|order_priority
------------------+-------------+---------------+-------------+--------------
Sub-Saharan Africa|Cote d'Ivoire|Personal Care  |Online       |L             
Sub-Saharan Africa|Mozambique   |Office Supplies|Offline      |L             
Asia              |Uzbekistan   |Household      |Online       |H             
Sub-Saharan Africa|Mali         |Cosmetics      |Online       |M             
Europe            |Norway       |Cereal         |Online       |M             
Asia              |Tajikistan   |Cereal         |Online       |L             
Europe            |Serbia       |Cereal         |Offline      |C             
Asia              |China        |Meat           |Online       |H             
Europe            |Kosovo       |Household      |Online       |M             
Asia              |Singapore    |Fruits         |Offline      |C

(We truncated these results too.)

The changed data from MySQL is copied to ClickHouse almost instantly. Running the same query in your favorite ClickHouse client returns the same data along with the _version and is_deleted fields. 
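For example, this returns equivalent rows in ClickHouse (remember the FINAL keyword; row order may differ, since neither query specifies ORDER BY):

> SELECT * FROM sales_records FINAL LIMIT 10;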

Using ClickHouse for analytic queries on data from MySQL

One of the great high-level features of the Altinity Sink Connector is that it lets you use different databases for the things they do well. MySQL is great for transactions, while ClickHouse is great for analytics. We’ll wrap things up with a couple of analytics queries that would be difficult or expensive to do with MySQL. That lets you use ClickHouse to run analytics on your data without adding to the workload of your production MySQL servers. 

Query: What are the top ten countries with the highest profit margin? 

We can aggregate all of the profit and revenue data for all of the orders for each country to get that country’s average profit margin per order. Here’s the query: 

SELECT
    country,
    CAST((avg(total_profit / total_revenue) * 100) AS Decimal(7,4)) 
        AS avg_profit_margin
FROM
    sales_records
GROUP BY
    country
ORDER BY
    avg_profit_margin DESC
LIMIT 10;

And the results: 

country    |avg_profit_margin|
-----------+-----------------+
The Bahamas|          39.8716|
Oman       |          39.4974|
Estonia    |          39.4241|
Morocco    |          39.2837|
Denmark    |          39.2383|
Montenegro |          39.2379|
Liberia    |          38.8673|
Palau      |          38.8559|
Algeria    |          38.8101|
Guinea     |          38.6938|

Orders from The Bahamas have a robust average profit margin of just under 40 percent. We also cast the results to the Decimal(7,4) type to format them nicely.

Query: For each region, which day of the week generates the most revenue? 

Another thing we might want to know is which day of the week generates the most revenue in each region. We could use that for targeted marketing in specific regions on specific days. Here’s the query: 

SELECT
   region,
   toDayOfWeek(order_date) AS day_of_week,
   CAST(sum(total_revenue) AS Decimal(15,2)) AS total_revenue
FROM
   sales_records
GROUP BY
   region,
   day_of_week
ORDER BY
   region,
   total_revenue DESC
LIMIT 1 BY region;

We’re using ClickHouse’s toDayOfWeek() function to convert the order_date column into an integer, where Monday = 1, Tuesday = 2, and so on through Sunday = 7. Here are our results: 

region                           |day_of_week|total_revenue|
---------------------------------+-----------+-------------+
Asia                             |          5| 150662118.42|
Australia and Oceania            |          7| 132561755.05|
Central America and the Caribbean|          7| 110157156.52|
Europe                           |          1| 275697483.97|
Middle East and North Africa     |          6| 138445299.25|
North America                    |          7|  27106816.23|
Sub-Saharan Africa               |          5| 318555438.49|

Using LIMIT 1 BY region shows us only the day with the most revenue in each region. 

That’s great, but we can make the data look a little better. We’ll use a CASE statement to convert the integer returned by toDayOfWeek() into the English name of the day. Here’s the query:

SELECT
    region,
    CASE toDayOfWeek(order_date)
        WHEN 1 THEN 'Monday'
        WHEN 2 THEN 'Tuesday'
        WHEN 3 THEN 'Wednesday'
        WHEN 4 THEN 'Thursday'
        WHEN 5 THEN 'Friday'
        WHEN 6 THEN 'Saturday'
        WHEN 7 THEN 'Sunday'
    END AS day_of_week_name,
    CAST(sum(total_revenue) AS Decimal(15,2)) AS total_revenue
FROM
    sales_records
GROUP BY
    region,
    day_of_week_name
ORDER BY
    region,
    total_revenue DESC
LIMIT 1 BY region;

And here are the more readable results:

region                           |day_of_week_name|total_revenue|
---------------------------------+----------------+-------------+
Asia                             |Friday          | 150662118.42|
Australia and Oceania            |Sunday          | 132561755.05|
Central America and the Caribbean|Sunday          | 110157156.52|
Europe                           |Monday          | 275697483.97|
Middle East and North Africa     |Saturday        | 138445299.25|
North America                    |Sunday          |  27106816.23|
Sub-Saharan Africa               |Friday          | 318555438.49|
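As an aside, recent ClickHouse versions also have a dateName() function that returns the day name directly. Assuming your server supports it, this shorter version of the query should produce the same results:

SELECT
    region,
    dateName('weekday', order_date) AS day_of_week_name,
    CAST(sum(total_revenue) AS Decimal(15,2)) AS total_revenue
FROM
    sales_records
GROUP BY
    region,
    day_of_week_name
ORDER BY
    region,
    total_revenue DESC
LIMIT 1 BY region;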

Performance

The article Using ClickHouse as an Analytic Extension for MySQL discusses the advantages of running analytics queries with ClickHouse’s column-oriented, multi-threaded architecture. For a quick illustration, we added more sales data until we had 2 million records in our database. (That’s still a tiny database by ClickHouse standards, but it’s large enough to make a difference.) Then we ran a query in both ClickHouse and MySQL. Before looking at timings, let’s make sure the MySQL and ClickHouse databases have the same data: 

MySQL: 

> SELECT COUNT(*) FROM sales_records; 
Name    |Value  |
--------+-------+
count(*)|2000000|

ClickHouse: 

> SELECT COUNT(*) FROM sales_records;
count()|
-------+
2000000|

The Sink Connector has done its job; the same data is in both places. Next, we ran the query “What are the top ten countries with the highest profit margin?” ten times on each database, alternating between MySQL and ClickHouse. 

Here are the results from MySQL: 

[2024-04-08 13:49:16] 10 rows retrieved starting from 1 in 783 ms (execution: 774 ms, fetching: 9 ms)
[2024-04-08 13:49:59] 10 rows retrieved starting from 1 in 786 ms (execution: 778 ms, fetching: 8 ms)
[2024-04-08 13:50:30] 10 rows retrieved starting from 1 in 808 ms (execution: 800 ms, fetching: 8 ms)
[2024-04-08 13:51:05] 10 rows retrieved starting from 1 in 769 ms (execution: 760 ms, fetching: 9 ms)
[2024-04-08 13:51:43] 10 rows retrieved starting from 1 in 779 ms (execution: 769 ms, fetching: 10 ms)
[2024-04-08 13:52:16] 10 rows retrieved starting from 1 in 769 ms (execution: 760 ms, fetching: 9 ms)
[2024-04-08 13:52:49] 10 rows retrieved starting from 1 in 782 ms (execution: 773 ms, fetching: 9 ms)
[2024-04-08 13:53:44] 10 rows retrieved starting from 1 in 769 ms (execution: 759 ms, fetching: 10 ms)
[2024-04-08 13:54:22] 10 rows retrieved starting from 1 in 772 ms (execution: 764 ms, fetching: 8 ms)
[2024-04-08 13:54:59] 10 rows retrieved starting from 1 in 780 ms (execution: 771 ms, fetching: 9 ms)

And from ClickHouse: 

[2024-04-08 13:49:43] 10 rows retrieved starting from 1 in 41 ms (execution: 33 ms, fetching: 8 ms)
[2024-04-08 13:50:15] 10 rows retrieved starting from 1 in 36 ms (execution: 27 ms, fetching: 9 ms)
[2024-04-08 13:50:46] 10 rows retrieved starting from 1 in 32 ms (execution: 23 ms, fetching: 9 ms)
[2024-04-08 13:51:23] 10 rows retrieved starting from 1 in 42 ms (execution: 34 ms, fetching: 8 ms)
[2024-04-08 13:51:58] 10 rows retrieved starting from 1 in 44 ms (execution: 35 ms, fetching: 9 ms)
[2024-04-08 13:52:32] 10 rows retrieved starting from 1 in 41 ms (execution: 31 ms, fetching: 10 ms)
[2024-04-08 13:53:19] 10 rows retrieved starting from 1 in 40 ms (execution: 32 ms, fetching: 8 ms)
[2024-04-08 13:54:05] 10 rows retrieved starting from 1 in 41 ms (execution: 32 ms, fetching: 9 ms)
[2024-04-08 13:54:40] 10 rows retrieved starting from 1 in 24 ms (execution: 16 ms, fetching: 8 ms)
[2024-04-08 13:55:13] 10 rows retrieved starting from 1 in 27 ms (execution: 19 ms, fetching: 8 ms)

The average time for MySQL was 771 ms. ClickHouse, on the other hand, had an average time of 37 ms. This isn’t a scientific test by any means, but even with such a small amount of data ClickHouse ran our analytics query more than 20 times faster. With a real-world data set of hundreds of millions of records or more, the performance difference would be even greater. 
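If you’d like to take rough measurements of your own, clickhouse-client can print a query’s elapsed time with its --time flag; something like this should work in our Docker environment (a sketch, using the profit-margin query from earlier):

> docker exec clickhouse clickhouse-client --time --query "SELECT country, CAST((avg(total_profit / total_revenue) * 100) AS Decimal(7,4)) AS avg_profit_margin FROM test.sales_records GROUP BY country ORDER BY avg_profit_margin DESC LIMIT 10" > /dev/null

The timing goes to stderr, so redirecting the rows to /dev/null leaves just the number.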

At any rate, these queries only scratch the surface of what we can do with ClickHouse. Because they only use a few of the columns in the table, ClickHouse’s column-oriented architecture makes these queries extremely fast and efficient. 

Summary

The Altinity ClickHouse Sink Connector is a great way to add sophisticated analytics to your organization with minimal impact on your production MySQL or PostgreSQL system. With the data in ClickHouse, you get the advantages of a world-class, open-source analytics engine without giving up the transactions and other things that MySQL and PostgreSQL do so well. 

We encourage you to get the code and work with it. The Docker and YAML files make it easy to get started, and you can modify that configuration to work with your own servers. 

For more information

To learn more about the Altinity Sink Connector project, how ClickHouse and MySQL work well together, and the details of the ReplacingMergeTree engine, we highly recommend the articles mentioned above: 

Using ClickHouse as an Analytic Extension for MySQL
ClickHouse ReplacingMergeTree Explained: The Good, The Bad, and The Ugly
