Using Version 2.0 of the Altinity ClickHouse Sink Connector
Altinity recently released version 2.0 of the Altinity ClickHouse Sink Connector. This open-source, Apache 2.0-licensed project lets you mirror any changes from a MySQL or PostgreSQL database into ClickHouse. Our earlier blog post discussed what’s new in the 2.0 release. In this article we’ll start a test environment and put the Sink Connector to work.
Running the Sink Connector
We’ll use docker compose to get a complete environment up and running. The environment contains a virtual network with containers for MySQL, ClickHouse, Zookeeper, Grafana, Prometheus, and the Sink Connector itself.
Here’s where we’re headed: first we’ll run some basic tests of the Sink Connector, including a look at fault tolerance. Next we’ll look at a sample application that creates a 5,000-record database. With that database created, we’ll run some advanced analytics queries that show the power of using ClickHouse to analyze data from MySQL.
To the command line we go!
First, clone the repo:
> git clone https://github.com/Altinity/clickhouse-sink-connector.git
There’s a directory with YAML files, shell scripts, and configuration files. Switch to it:
> cd clickhouse-sink-connector/sink-connector-lightweight/docker
Set the CLICKHOUSE_SINK_CONNECTOR_LT_IMAGE environment variable so Docker knows which version (which tag) of the Sink Connector image to download:
> export CLICKHOUSE_SINK_CONNECTOR_LT_IMAGE=altinity/clickhouse-sink-connector:2.0.1-lt
(This tag value is current as of April 2024; be sure to check the Sink Connector documentation for the correct tag.)
Now it’s time to let Docker set up the infrastructure for us:
> docker compose -f docker-compose-mysql.yml up --renew-anon-volumes
This will take a few seconds, but when it’s done, you’ll have six containers running:
> docker ps
CONTAINER ID NAMES
6fb1608f567a clickhouse
e878c630ead9 docker-grafana-1
2c4f302f72af mysql-master
05ce287a609f docker-zookeeper-1
9932c0dd984d prometheus
f7d471e3460a docker-clickhouse-sink-connector-lt-1
We’ll interact with the clickhouse and mysql-master containers here, with docker-clickhouse-sink-connector-lt-1 and the other containers working in the background. Our ClickHouse server is on localhost:8123 and the MySQL server is on localhost:3306, both with the username/password of root/root.
NOTE: In this section we focus on SQL statements, query results, and other outputs. Everything we do here can be done from the command line, but there are plenty of graphical tools that make it easier to work with the components we’re using here. (Not that you asked, but Docker Desktop for containers and DBeaver for databases are our favorites.)
Inserting data into MySQL
The MySQL image comes with a database named test; that’s where we’ll work. To get things started, we’ll go to a MySQL client and run the following commands:
> USE test;
[2024-04-02 09:10:01] completed in 2 ms
> CREATE TABLE foo (id INT PRIMARY KEY, value VARCHAR(100));
[2024-04-02 09:10:49] completed in 16 ms
> INSERT INTO foo VALUES (1, 'Monday');
[2024-04-02 09:11:29] 1 row affected in 5 ms
> SELECT * FROM foo;
id|value |
--+------+
1|Monday|
So we’ve created a table and inserted some data. Simple as that.
Finding the data in ClickHouse
Now let’s go to ClickHouse and see if our new table and its data are available:
> USE test;
[2024-04-02 09:12:39] completed in 4 ms
> SELECT * FROM foo FINAL;
id|value |_version |is_deleted|
--+------+-------------------+----------+
1|Monday|1775149050623043561| 0|
The table and its data are there, but the ClickHouse table has two additional fields: _version and is_deleted. Those fields let ClickHouse keep track of the latest values of the fields in our database, including whether those fields still exist.
Which brings us to the FINAL keyword in the ClickHouse SELECT statement above. This returns the final value of each row after all updates and deletes are processed. The Sink Connector creates ClickHouse tables that use the ReplacingMergeTree engine, which processes UPDATEs and DELETEs asynchronously. Without FINAL, you may see values that reflect changes that haven’t been processed yet. (If you’d like to know more, see Alexander Zaitsev’s blog post ClickHouse ReplacingMergeTree Explained: The Good, The Bad, and The Ugly.)
So at this point we’ve created a table named foo inside the MySQL database test and put some data in it. The Sink Connector created an equivalent foo table in the ClickHouse database named test, then copied the data from MySQL to ClickHouse.
Making changes to the data
Now let’s go back to our MySQL client and run a couple more INSERT statements, two UPDATEs, and a DELETE:
> INSERT INTO foo VALUES (2, 'Maddie');
[2024-04-02 09:40:42] 1 row affected in 7 ms
> INSERT INTO foo VALUES (3, 'Raleigh');
[2024-04-02 09:41:43] 1 row affected in 6 ms
> UPDATE foo SET value='Tuesday' WHERE id=1;
[2024-04-02 09:45:03] 1 row affected in 6 ms
> UPDATE foo SET value='Wednesday' WHERE id=1;
[2024-04-02 09:46:04] 1 row affected in 5 ms
> DELETE FROM foo WHERE id=3;
[2024-04-02 09:46:55] 1 row affected in 6 ms
> SELECT * FROM foo;
id|value |
--+---------+
1|Wednesday|
2|Maddie |
When we go to ClickHouse, we get the results we expect:
> SELECT * FROM foo FINAL;
id|value |_version |is_deleted|
--+---------+-------------------+----------+
2|Maddie |1775262359745462335| 0|
1|Wednesday|1775262150030262334| 0|
Handling outages
One hard requirement is that the Sink Connector must be robust: if the Sink Connector or ClickHouse goes down, any updates to the MySQL data must still find their way to the ClickHouse database once those components are back up. To illustrate, we’ll simulate an outage for the Sink Connector:
> docker pause docker-clickhouse-sink-connector-lt-1
Now we’ll add data to MySQL:
> INSERT INTO foo VALUES(4, 'Orange');
[2024-04-02 13:17:17] 1 row affected in 13 ms
> SELECT * FROM foo;
id|value |
--+---------+
1|Wednesday|
2|Maddie |
4|Orange |
If we go to our ClickHouse server and query the database, our new data is missing, as you’d expect:
> SELECT * FROM foo FINAL;
id|value |_version |is_deleted|
--+---------+-------------------+----------+
2|Maddie |1775262359745462335| 0|
1|Wednesday|1775262150030262334| 0|
Now let’s unpause the Sink Connector:
> docker unpause docker-clickhouse-sink-connector-lt-1
Now rerunning the query in ClickHouse shows the new data:
> SELECT * FROM foo FINAL;
id|value |_version |is_deleted|
--+---------+-------------------+----------+
1|Wednesday|1775262150030262334| 0|
2|Maddie |1775262359745462335| 0|
4|Orange |1775262967919542336| 0|
Similarly, if we leave the Sink Connector up and running but take ClickHouse down with docker pause clickhouse, any changes we make to the MySQL database will be sent to ClickHouse when we use docker unpause clickhouse to bring it back up.
Using a larger dataset
If you cloned the repo, the sink-connector-lightweight/docker/example directory has a sample application and some data to create a larger dataset in the MySQL database. It creates a table, then reads lines of data from a CSV file and adds each one to the new table.
The CSV file is a set of 5000 sales records from excelbianalytics.com. If you’d like to push your environment further, there are CSV files of various sizes, the largest being 5 million records. Thanks to the owners of that site for making this file copyright-free to use for any purpose.
Running the sample is straightforward. First of all, we’re assuming that you’ve used docker compose to start the demo environment so you have a MySQL server, a ClickHouse server, and the Altinity ClickHouse Sink Connector.
To start, change to the sink-connector-lightweight/docker/example directory and install the npm packages the code needs:
> npm install
Now it’s time to run the sample:
> node loadCSVData.js
(BTW, you can change some parameters, such as your MySQL server’s hostname, port, username, and password. Running node loadCSVData.js --help will give you all the details. There’s also a complete README file in the example directory.)
The code includes the following CREATE statement:
CREATE TABLE IF NOT EXISTS sales_records (
region VARCHAR(50),
country VARCHAR(50),
item_type VARCHAR(20),
sales_channel VARCHAR(20),
order_priority VARCHAR(5),
order_date DATE,
order_id BIGINT,
ship_date DATE,
units_sold INT,
unit_price FLOAT,
unit_cost FLOAT,
total_revenue FLOAT,
total_cost FLOAT,
total_profit FLOAT,
PRIMARY KEY(order_id)
);
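The loader itself is a Node.js script, but its core loop is simple. Here’s the same idea sketched in Python (the %s placeholders assume a MySQL driver such as mysql-connector-python; the column list comes from the CREATE statement above):

```python
import csv
import io

# Column order matches the CREATE TABLE statement above.
COLUMNS = ["region", "country", "item_type", "sales_channel",
           "order_priority", "order_date", "order_id", "ship_date",
           "units_sold", "unit_price", "unit_cost",
           "total_revenue", "total_cost", "total_profit"]

def insert_statements(csv_text):
    """Yield a parameterized INSERT statement and value tuple for each
    data row in the CSV, the way a loader script might build them."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = (f"INSERT INTO sales_records ({', '.join(COLUMNS)}) "
           f"VALUES ({placeholders})")
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        yield sql, tuple(row)

# A one-row sample in the same shape as the real CSV (values made up).
sample = (",".join(COLUMNS) + "\n"
          "Europe,Norway,Cereal,Online,M,2020-01-02,535113847,"
          "2020-02-07,650,205.70,117.11,133705.00,76121.50,57583.50\n")

for sql, params in insert_statements(sample):
    print(params[1])  # the country field: Norway
```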
Each entry in the table has the details of a sale. For our analytics queries, we’ll look at things like the region and country for each sale, the day of the week each sale took place (from order_date), and the profit margin (total_profit/total_revenue) for each sale. ClickHouse can calculate the aggregates we need for our queries extremely efficiently.
When you run the example, you’ll see something like this:
Connected to MySQL database
Table created successfully.
All lines processed.
The code creates the following table in MySQL:
> DESCRIBE sales_records;
Field |Type |Null|Key|Default|Extra|
--------------+-----------+----+---+-------+-----+
region |varchar(50)|YES | | | |
country |varchar(50)|YES | | | |
item_type |varchar(20)|YES | | | |
sales_channel |varchar(20)|YES | | | |
order_priority|varchar(5) |YES | | | |
order_date |date |YES | | | |
order_id |bigint |NO |PRI| | |
ship_date |date |YES | | | |
units_sold |int |YES | | | |
unit_price |float |YES | | | |
unit_cost |float |YES | | | |
total_revenue |float |YES | | | |
total_cost |float |YES | | | |
total_profit |float |YES | | | |
The order_id field is the primary key. The Altinity Sink Connector creates a ClickHouse table that looks like this:
> DESCRIBE sales_records;
name |type |...|comment|ttl_expression|
--------------+-----------------+---+-------+--------------|
region |Nullable(String) |...| | |
country |Nullable(String) |...| | |
item_type |Nullable(String) |...| | |
sales_channel |Nullable(String) |...| | |
order_priority|Nullable(String) |...| | |
order_date |Nullable(Date32) |...| | |
order_id |Int64 |...| | |
ship_date |Nullable(Date32) |...| | |
units_sold |Nullable(Int32) |...| | |
unit_price |Nullable(Float32)|...| | |
unit_cost |Nullable(Float32)|...| | |
total_revenue |Nullable(Float32)|...| | |
total_cost |Nullable(Float32)|...| | |
total_profit |Nullable(Float32)|...| | |
_version |UInt64 |...| | |
is_deleted |UInt8 |...| | |
As the primary key, order_id can’t be null. The ClickHouse version also has the properties default_type, default_expression, comment, codec_expression, and ttl_expression, but we won’t discuss those here. (To keep the listing simple, we left most of those ClickHouse columns out.)
Now running SELECT * FROM sales_records LIMIT 10 in your favorite MySQL client shows the data:
> SELECT * FROM sales_records LIMIT 10;
region |country |item_type |sales_channel|order_priority
------------------+-------------+---------------+-------------+--------------
Sub-Saharan Africa|Cote d'Ivoire|Personal Care |Online |L
Sub-Saharan Africa|Mozambique |Office Supplies|Offline |L
Asia |Uzbekistan |Household |Online |H
Sub-Saharan Africa|Mali |Cosmetics |Online |M
Europe |Norway |Cereal |Online |M
Asia |Tajikistan |Cereal |Online |L
Europe |Serbia |Cereal |Offline |C
Asia |China |Meat |Online |H
Europe |Kosovo |Household |Online |M
Asia |Singapore |Fruits |Offline |C
(We truncated these results too.)
The changed data from MySQL is copied to ClickHouse almost immediately. Running the same query in your favorite ClickHouse client returns the same data along with the _version and is_deleted fields.
Using ClickHouse for analytic queries on data from MySQL
One of the great high-level features of the Altinity Sink Connector is that it lets you use different databases for the things they do well. MySQL is great for transactions, while ClickHouse is great for analytics. We’ll wrap things up with a couple of analytics queries that would be difficult or expensive to do with MySQL. That lets you use ClickHouse to run analytics on your data without adding to the workload of your production MySQL servers.
Query: What are the top ten countries with the highest profit margin?
We can aggregate all of the profit and revenue data for all of the orders for each country to get that country’s average profit margin per order. Here’s the query:
SELECT
country,
CAST((avg(total_profit / total_revenue) * 100) AS Decimal(7,4))
AS avg_profit_margin
FROM
sales_records
GROUP BY
country
ORDER BY
avg_profit_margin DESC
LIMIT 10;
And the results:
country |avg_profit_margin|
-----------+-----------------+
The Bahamas| 39.8716|
Oman | 39.4974|
Estonia | 39.4241|
Morocco | 39.2837|
Denmark | 39.2383|
Montenegro | 39.2379|
Liberia | 38.8673|
Palau | 38.8559|
Algeria | 38.8101|
Guinea | 38.6938|
Orders from The Bahamas have a robust average profit margin of just under 40 percent. We also used a CAST to the highly useful Decimal type to format the results.
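To make the aggregation concrete, here’s the same per-country calculation sketched in Python on a few made-up rows:

```python
from collections import defaultdict

def avg_profit_margin(sales):
    """Mirror of the ClickHouse query: for each country, average
    total_profit / total_revenue across orders, as a percentage,
    sorted from highest margin to lowest."""
    sums = defaultdict(lambda: [0.0, 0])  # country -> [margin_sum, count]
    for country, profit, revenue in sales:
        acc = sums[country]
        acc[0] += profit / revenue
        acc[1] += 1
    margins = {c: round(s / n * 100, 4) for c, (s, n) in sums.items()}
    return sorted(margins.items(), key=lambda kv: kv[1], reverse=True)

# (country, total_profit, total_revenue) for a handful of orders
sales = [
    ("Oman",    40.0, 100.0),   # 40% margin
    ("Oman",    39.0, 100.0),   # 39% margin
    ("Estonia", 38.0, 100.0),   # 38% margin
]
print(avg_profit_margin(sales))  # [('Oman', 39.5), ('Estonia', 38.0)]
```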
Query: For each region, which day of the week generates the most revenue?
Another thing we might want to know is which day of the week has the highest sales. We could use that for targeted marketing in specific regions on specific days. Here’s the query:
SELECT
region,
toDayOfWeek(order_date) AS day_of_week,
CAST(sum(total_revenue) AS Decimal(15,2)) AS total_revenue
FROM
sales_records
GROUP BY
region,
day_of_week
ORDER BY
region,
total_revenue DESC
LIMIT 1 BY region;
We’re using ClickHouse’s toDayOfWeek() function to convert the order_date column into an integer, where Monday = 1, Tuesday = 2, … and Sunday = 7. Here are our results:
region |day_of_week|total_revenue|
---------------------------------+-----------+-------------+
Asia | 5| 150662118.42|
Australia and Oceania | 7| 132561755.05|
Central America and the Caribbean| 7| 110157156.52|
Europe | 1| 275697483.97|
Middle East and North Africa | 6| 138445299.25|
North America | 7| 27106816.23|
Sub-Saharan Africa | 5| 318555438.49|
Using LIMIT 1 BY region shows us only the day with the most revenue in each region.
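If LIMIT 1 BY is new to you, here’s the equivalent logic sketched in Python: sum revenue per (region, day), then keep only each region’s best day:

```python
from collections import defaultdict

def top_day_per_region(rows):
    """Equivalent of GROUP BY region, day_of_week plus LIMIT 1 BY region:
    total the revenue per (region, day), then keep the highest-revenue
    day for each region."""
    totals = defaultdict(float)
    for region, day, revenue in rows:
        totals[(region, day)] += revenue
    best = {}
    for (region, day), revenue in totals.items():
        if region not in best or revenue > best[region][1]:
            best[region] = (day, revenue)
    return best

# (region, day_of_week, total_revenue) for a few made-up orders
rows = [
    ("Europe", 1, 100.0), ("Europe", 1, 50.0), ("Europe", 5, 120.0),
    ("Asia",   5, 300.0), ("Asia",   7, 80.0),
]
print(top_day_per_region(rows))  # {'Europe': (1, 150.0), 'Asia': (5, 300.0)}
```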
That’s great, but we can make the data look a little better. We’ll use a CASE expression to convert the integer returned by toDayOfWeek() into the English name of the day. Here’s the query:
SELECT
region,
CASE toDayOfWeek(order_date)
WHEN 1 THEN 'Monday'
WHEN 2 THEN 'Tuesday'
WHEN 3 THEN 'Wednesday'
WHEN 4 THEN 'Thursday'
WHEN 5 THEN 'Friday'
WHEN 6 THEN 'Saturday'
WHEN 7 THEN 'Sunday'
END AS day_of_week_name,
CAST(sum(total_revenue) AS Decimal(15,2)) AS total_revenue
FROM
sales_records
GROUP BY
region,
day_of_week_name
ORDER BY
region,
total_revenue DESC
LIMIT 1 BY region;
And here are the more readable results:
region |day_of_week_name|total_revenue|
---------------------------------+----------------+-------------+
Asia |Friday | 150662118.42|
Australia and Oceania |Sunday | 132561755.05|
Central America and the Caribbean|Sunday | 110157156.52|
Europe |Monday | 275697483.97|
Middle East and North Africa |Saturday | 138445299.25|
North America |Sunday | 27106816.23|
Sub-Saharan Africa |Friday | 318555438.49|
Performance
The article Using ClickHouse as an Analytic Extension for MySQL discusses the advantages of running analytics queries with ClickHouse’s column-oriented, multi-threaded architecture. For a quick illustration, we added more sales data until we had 2 million records in our database. (That’s still a tiny database by ClickHouse standards, but it’s large enough to make a difference.) Then we ran a query in both ClickHouse and MySQL. First, we’ll make sure the MySQL and ClickHouse databases have the same data:
MySQL:
> SELECT COUNT(*) FROM sales_records;
Name |Value |
--------+-------+
count(*)|2000000|
ClickHouse:
> SELECT COUNT(*) FROM sales_records;
count()|
-------+
2000000|
The Sink Connector has done its job; the same data is in both places. Next, we ran the query “What are the top ten countries with the highest profit margin?” ten times on each database, alternating between MySQL and ClickHouse.
Here are the results from MySQL:
[2024-04-08 13:49:16] 10 rows retrieved starting from 1 in 783 ms (execution: 774 ms, fetching: 9 ms)
[2024-04-08 13:49:59] 10 rows retrieved starting from 1 in 786 ms (execution: 778 ms, fetching: 8 ms)
[2024-04-08 13:50:30] 10 rows retrieved starting from 1 in 808 ms (execution: 800 ms, fetching: 8 ms)
[2024-04-08 13:51:05] 10 rows retrieved starting from 1 in 769 ms (execution: 760 ms, fetching: 9 ms)
[2024-04-08 13:51:43] 10 rows retrieved starting from 1 in 779 ms (execution: 769 ms, fetching: 10 ms)
[2024-04-08 13:52:16] 10 rows retrieved starting from 1 in 769 ms (execution: 760 ms, fetching: 9 ms)
[2024-04-08 13:52:49] 10 rows retrieved starting from 1 in 782 ms (execution: 773 ms, fetching: 9 ms)
[2024-04-08 13:53:44] 10 rows retrieved starting from 1 in 769 ms (execution: 759 ms, fetching: 10 ms)
[2024-04-08 13:54:22] 10 rows retrieved starting from 1 in 772 ms (execution: 764 ms, fetching: 8 ms)
[2024-04-08 13:54:59] 10 rows retrieved starting from 1 in 780 ms (execution: 771 ms, fetching: 9 ms)
And from ClickHouse:
[2024-04-08 13:49:43] 10 rows retrieved starting from 1 in 41 ms (execution: 33 ms, fetching: 8 ms)
[2024-04-08 13:50:15] 10 rows retrieved starting from 1 in 36 ms (execution: 27 ms, fetching: 9 ms)
[2024-04-08 13:50:46] 10 rows retrieved starting from 1 in 32 ms (execution: 23 ms, fetching: 9 ms)
[2024-04-08 13:51:23] 10 rows retrieved starting from 1 in 42 ms (execution: 34 ms, fetching: 8 ms)
[2024-04-08 13:51:58] 10 rows retrieved starting from 1 in 44 ms (execution: 35 ms, fetching: 9 ms)
[2024-04-08 13:52:32] 10 rows retrieved starting from 1 in 41 ms (execution: 31 ms, fetching: 10 ms)
[2024-04-08 13:53:19] 10 rows retrieved starting from 1 in 40 ms (execution: 32 ms, fetching: 8 ms)
[2024-04-08 13:54:05] 10 rows retrieved starting from 1 in 41 ms (execution: 32 ms, fetching: 9 ms)
[2024-04-08 13:54:40] 10 rows retrieved starting from 1 in 24 ms (execution: 16 ms, fetching: 8 ms)
[2024-04-08 13:55:13] 10 rows retrieved starting from 1 in 27 ms (execution: 19 ms, fetching: 8 ms)
The average time for MySQL was 771 ms. ClickHouse, on the other hand, had an average time of 37 ms. This isn’t a scientific test by any means, but even with such a small amount of data ClickHouse ran our analytics query more than 20 times faster. With a real-world data set of hundreds of millions of records or more, the performance difference would be even greater.
At any rate, these queries only scratch the surface of what we can do with ClickHouse. Because they only use a few of the columns in the table, ClickHouse’s column-oriented architecture makes these queries extremely fast and efficient.
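A toy sketch of why the column layout matters (real engines add compression, vectorized execution, and much more on top of this):

```python
# Row-oriented storage: every row carries every field, so reading one
# column means touching (and skipping over) all the others.
rows = [
    {"country": "Oman",    "units_sold": 10, "total_profit": 40.0},
    {"country": "Estonia", "units_sold": 5,  "total_profit": 38.0},
]
row_sum = sum(r["total_profit"] for r in rows)

# Column-oriented storage: each column is a contiguous array, so a query
# that needs only total_profit scans just that one array.
columns = {
    "country":      ["Oman", "Estonia"],
    "units_sold":   [10, 5],
    "total_profit": [40.0, 38.0],
}
col_sum = sum(columns["total_profit"])

assert row_sum == col_sum == 78.0  # same answer, very different work
```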
Summary
The Altinity ClickHouse Sink Connector is a great way to add sophisticated analytics to your organization with minimal impact to your production MySQL or PostgreSQL system. With the data in ClickHouse, you get the advantages of a world-class, open-source analytics engine without losing transactions and other things that MySQL and PostgreSQL do so well.
We encourage you to get the code and work with it. The Docker and YAML files make it easy to get started, and you can modify that configuration to work with your own servers.
For more information
To learn more about the Altinity Sink Connector project, how ClickHouse and MySQL work well together, and the details of the ReplacingMergeTree engine, we highly recommend the following articles:
- Fast MySQL to ClickHouse Replication: Announcing the Altinity Sink Connector for ClickHouse (17 June 2022) – The article that announced the Sink Connector project and explained what it’s all about.
- Replicating Data from PostgreSQL to ClickHouse with the Altinity Sink Connector (25 July 2023) – Shows the project in action with PostgreSQL…we focused on MySQL here, but the principles are the same.
- Using ClickHouse as an Analytic Extension for MySQL (12 October 2022) – Explains the high-level reasons for using ClickHouse and MySQL together.
- ClickHouse ReplacingMergeTree Explained: The Good, The Bad, and The Ugly (10 May 2023) – A great explanation of how ReplacingMergeTree works.
- Finally, the Altinity Knowledge Base has an in-depth discussion of the ReplacingMergeTree engine.