Integrating ClickHouse with MinIO

In November 2020, Alexander Zaitsev described how to use S3-compatible object storage with ClickHouse. In his article ClickHouse and S3 Compatible Object Storage, he provided steps to use AWS S3 with ClickHouse’s disk storage system and the S3 table function. Now, we are excited to announce full support for integrating with MinIO, ClickHouse’s second fully supported S3-compatible object storage service. MinIO is an extremely high-performance, Kubernetes-native object storage service that you can now access through the S3 table function. You may also use it as one of ClickHouse’s storage disks, with a configuration similar to the one used for AWS S3.

MinIO support was originally added to ClickHouse in January 2020, starting with version 20.1.2.4. In this article, we will explain how to integrate MinIO with ClickHouse.

MinIO Using Docker

The easiest way to familiarize yourself with MinIO storage is to run MinIO in a Docker container, as we will do in our examples. For this example, we will use a docker-compose cluster of ClickHouse instances, a Docker container running Apache ZooKeeper to manage our ClickHouse instances, and a Docker container running MinIO. To use this environment, you will need git, Docker, and docker-compose installed on your system. Then you can clone the repository that contains the test environment to your local system.

git clone https://gitlab.com/altinity-public/blogs/minio-integration-with-clickhouse.git

Next, bring up the docker-compose cluster to check that it starts correctly.

Note that you must run all docker-compose commands in the docker-compose directory.

cd minio-integration-with-clickhouse
cd docker-compose
docker-compose up -d
Creating network "docker-compose_default" with the default driver
Creating docker-compose_zookeeper_1 ... done
Creating docker-compose_minio_1     ... done
Creating docker-compose_minio-client_1 ... done
Creating docker-compose_clickhouse1_1  ... done
Creating docker-compose_clickhouse3_1  ... done
Creating docker-compose_clickhouse2_1  ... done
Creating docker-compose_all_services_ready_1 ... done

If the docker-compose environment starts correctly, you will see messages indicating that the clickhouse1, clickhouse2, clickhouse3, minio-client, and minio services are now running.

docker-compose ps
               Name                              Command                  State                    Ports              
----------------------------------------------------------------------------------------------------------------------
docker-compose_all_services_ready_1   /hello                           Exit 0                                         
docker-compose_clickhouse1_1          bash -c clickhouse server  ...   Up (healthy)   8123/tcp, 9000/tcp, 9009/tcp    
docker-compose_clickhouse2_1          bash -c clickhouse server  ...   Up (healthy)   8123/tcp, 9000/tcp, 9009/tcp    
docker-compose_clickhouse3_1          bash -c clickhouse server  ...   Up (healthy)   8123/tcp, 9000/tcp, 9009/tcp    
docker-compose_minio-client_1         /bin/sh -c  /usr/bin/mc co ...   Up (healthy)                                   
docker-compose_minio_1                /usr/bin/docker-entrypoint ...   Up (healthy)   9000/tcp, 0.0.0.0:9001->9001/tcp
docker-compose_zookeeper_1            /docker-entrypoint.sh zkSe ...   Up (healthy)   2181/tcp, 2888/tcp, 3888/tcp

Sanity Checks

Before we proceed, we will perform some sanity checks to ensure that MinIO is running and accessible.

Again, note that you must execute all docker-compose commands from the docker-compose directory.

First, we will check that we can use the minio-client service.

docker-compose exec minio-client mc -v
mc version RELEASE.2021-05-12T03-10-11Z

Next, we will use minio-client to access the minio bucket. In the minio-client.yml file, you may notice that the entrypoint definition connects the client to the minio service and creates a bucket named root. You can find this bucket by listing all buckets.

docker-compose exec minio-client mc ls

Then, we will check that the three ClickHouse services are running and ready for queries.

docker-compose exec clickhouse1 bash -c 'clickhouse-client -q "SELECT version()"'
docker-compose exec clickhouse2 bash -c 'clickhouse-client -q "SELECT version()"'
docker-compose exec clickhouse3 bash -c 'clickhouse-client -q "SELECT version()"'
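
If you run these checks often, you can script them. The following is a minimal Python sketch and is not part of the repository: the helper names version_check_command and check_versions are our own, and the sketch assumes that docker-compose is on your PATH and that you run it from the docker-compose directory. The -T flag disables pseudo-TTY allocation so that the output can be captured.

```python
import subprocess

# Service names taken from the docker-compose environment shown above.
SERVICES = ["clickhouse1", "clickhouse2", "clickhouse3"]


def version_check_command(service):
    """Build the docker-compose command that asks one service for its version."""
    return [
        "docker-compose", "exec", "-T", service,
        "clickhouse-client", "-q", "SELECT version()",
    ]


def check_versions(services=SERVICES):
    """Run the version query on every service and return {service: version}.

    Raises subprocess.CalledProcessError if any service fails to answer.
    """
    versions = {}
    for service in services:
        result = subprocess.run(
            version_check_command(service),
            capture_output=True, text=True, check=True,
        )
        versions[service] = result.stdout.strip()
    return versions
```

Calling check_versions() from the docker-compose directory should return one version string per service, or raise an error for the first service that is not ready.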

Configuring MinIO Disk Storage

To set up a MinIO storage disk, you will first need a MinIO bucket endpoint, either remote or provided through a MinIO Docker container. You also need an access_key_id and secret_access_key, which correspond to the bucket. Here is an example configuration file using the local MinIO endpoint we created using Docker.

config.d/storage.xml

<yandex>
  <storage_configuration>
    <disks>
      <minio>
        <type>s3</type>
        <endpoint>http://minio:9001/root/data/</endpoint>
        <access_key_id>minio</access_key_id>
        <secret_access_key>minio123</secret_access_key>
      </minio>
    </disks>
    <policies>
      <external>
        <volumes>
          <s3>
            <disk>minio</disk>
          </s3>
        </volumes>
      </external>
    </policies>
  </storage_configuration>
  ...
</yandex>

In this configuration file, we have one policy that includes a single volume with a single disk configured to use a MinIO bucket endpoint. Generally, in each policy, you can define multiple volumes, which is especially useful when moving data between volumes with TTL statements. You can also configure multiple disks and policies in their respective sections. However, to keep our example simple, it only contains the minimal structure required to use your MinIO bucket. For a complete guide to S3-compatible storage configuration, you may refer back to our earlier article: ClickHouse and S3 Compatible Object Storage. We have included this storage configuration file in the configs directory, and it will be ready to use when you start the docker-compose environment.
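
If you need the same minimal structure for a different bucket, you could generate the file instead of editing it by hand. The sketch below is one possible approach using Python's standard library and is not part of the repository: the function name build_storage_config and its parameters are our own, the defaults mirror the names used in the configuration above, and the output is not indented.

```python
import xml.etree.ElementTree as ET


def build_storage_config(endpoint, access_key_id, secret_access_key,
                         disk_name="minio", policy_name="external",
                         volume_name="s3"):
    """Return a storage configuration with one S3 disk, one volume, one policy."""
    root = ET.Element("yandex")
    storage = ET.SubElement(root, "storage_configuration")

    # <disks>: a single disk of type s3 pointing at the bucket endpoint.
    disk = ET.SubElement(ET.SubElement(storage, "disks"), disk_name)
    ET.SubElement(disk, "type").text = "s3"
    ET.SubElement(disk, "endpoint").text = endpoint
    ET.SubElement(disk, "access_key_id").text = access_key_id
    ET.SubElement(disk, "secret_access_key").text = secret_access_key

    # <policies>: a single policy with one volume that uses the disk above.
    policy = ET.SubElement(ET.SubElement(storage, "policies"), policy_name)
    volume = ET.SubElement(ET.SubElement(policy, "volumes"), volume_name)
    ET.SubElement(volume, "disk").text = disk_name

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Values from the local docker-compose environment in this article.
    print(build_storage_config("http://minio:9001/root/data/", "minio", "minio123"))
```

Building the document with ElementTree rather than string formatting means that special characters in the credentials are escaped for you.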

As you can see in the repository we have provided, each local configuration file is mounted into the ClickHouse containers in the /etc/clickhouse-server/config.d directory. If you want to add or modify configuration files, you can change them in the local config.d directory, and you can add or remove files by changing the volumes mounted in the clickhouse-service.yml file. If you are not using ClickHouse in docker-compose, you can add this storage configuration file, and all other configuration files, to your /etc/clickhouse-server/config.d directory. You can use our docker-compose environment with your local ClickHouse instance by using the same bucket endpoint and credentials as in our configuration file. If you are using a remote MinIO bucket endpoint, make sure to replace the provided bucket endpoint and credentials with your own.

The storage configuration is now ready to be used to store table data. Now you can connect to one of the ClickHouse nodes or your local ClickHouse instance. Instructions to connect to the docker-compose node are provided below.

docker-compose exec clickhouse1 bash

Then, connect to the ClickHouse client.

clickhouse client

ClickHouse client version 21.4.6.55 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.4.6 revision 54447.

clickhouse1 :) 

Now that you have connected to the ClickHouse client, the following steps will be the same for using a ClickHouse node in the docker-compose cluster and using ClickHouse running on your local machine. You can specify the storage policy in the CREATE TABLE statement to start storing data on the S3-backed disk.

CREATE TABLE minio (
    d UInt64
) ENGINE = MergeTree()
ORDER BY d
SETTINGS storage_policy='external'

Now you are ready to insert data into the table just like any other table.

INSERT INTO minio VALUES (1),(2),(3)

Query id: 4ac85ec5-5e67-4164-9fba-15ec28a28b78

Ok.

3 rows in set. Elapsed: 0.080 sec. 

Once you have stored data in the table, you can confirm that the data was stored on the correct disk by checking the system.parts table.

SELECT disk_name FROM system.parts WHERE table='minio'

Query id: 1d49d414-9dda-4f2b-9d47-f91d0b0bc9ea

┌─disk_name─┐
│ minio     │
└───────────┘

1 rows in set. Elapsed: 0.003 sec. 

Note that two tables using the same storage policy will not share data. To transfer data directly from a MinIO bucket to a table, or vice versa, you can use the S3 table function.

Table Function

MinIO can also be accessed directly using ClickHouse’s S3 table function with the following syntax.

s3(path, [aws_access_key_id, aws_secret_access_key,] format, structure, [compression])

To use the table function with MinIO, you will need to specify your endpoint and access credentials. Note that this time you must omit the / from the end of your endpoint path for proper syntax. Once again, make sure to replace the bucket endpoint and credentials with your own bucket endpoint and credentials if you are using a remote MinIO bucket endpoint.

This query will upload data to MinIO from the table we created earlier.

INSERT INTO FUNCTION s3('http://minio:9001/root/data2', 'minio', 'minio123', 'CSVWithNames', 'd UInt64') SELECT *
FROM minio
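
Because the disk configuration and the table function treat the trailing / differently, it is easy to copy an endpoint from one place to the other and get it wrong. As a hedged illustration, the following Python sketch builds the s3(...) call from an endpoint and strips any trailing slash first. The helper names sql_string and s3_function_call are our own and are not part of ClickHouse or the repository.

```python
def sql_string(value):
    """Quote a value as a SQL string literal, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def s3_function_call(endpoint, access_key_id, secret_access_key, fmt, structure):
    """Build an s3(...) table function call for use in a query.

    The endpoint may be copied from the disk configuration; any trailing
    slash is removed because the table function path must not end in one.
    """
    path = endpoint.rstrip("/")
    args = [path, access_key_id, secret_access_key, fmt, structure]
    return "s3(" + ", ".join(sql_string(arg) for arg in args) + ")"


if __name__ == "__main__":
    call = s3_function_call("http://minio:9001/root/data2/", "minio",
                            "minio123", "CSVWithNames", "d UInt64")
    print(f"INSERT INTO FUNCTION {call} SELECT * FROM minio")
```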

Now, let’s create a new table and download the data from MinIO. Notice that we can still take advantage of the S3 table function without using the storage policy we created earlier.

CREATE TABLE minio2 (
    d UInt64
) ENGINE = MergeTree()
ORDER BY d

This query will download data from MinIO into the new table.

INSERT INTO minio2
SELECT * FROM s3('http://minio:9001/root/data2', 'minio', 'minio123', 'CSVWithNames', 'd UInt64')

Let’s confirm that the data was transferred correctly by checking the contents of each table to make sure they match.

SELECT * FROM minio

Query id: fd26acc7-f105-4388-84b5-80786c61f07b

┌─d─┐
│ 1 │
│ 2 │
│ 3 │
└───┘

3 rows in set. Elapsed: 0.008 sec.

SELECT * FROM minio2

Query id: e41145de-a4b4-41ba-a002-e8dd8dc9a9e1

┌─d─┐
│ 1 │
│ 2 │
│ 3 │
└───┘

3 rows in set. Elapsed: 0.001 sec.
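
For larger tables, comparing the output by eye is not practical. One way to automate the comparison is sketched below in Python. It is an illustration under several assumptions: the helper names query_rows and tables_match are our own, the script runs from the docker-compose directory, and both tables have the column d to order by.

```python
import subprocess


def query_rows(service, sql):
    """Run a query on one docker-compose service and return its output lines."""
    cmd = ["docker-compose", "exec", "-T", service, "clickhouse-client", "-q", sql]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def tables_match(table_a, table_b, service="clickhouse1",
                 order_by="d", run=query_rows):
    """Return True if both tables return identical rows in the same order.

    Table and column names are inserted into the query as-is, so pass only
    trusted identifiers. The run parameter lets you swap in another way of
    executing queries.
    """
    rows_a = run(service, f"SELECT * FROM {table_a} ORDER BY {order_by}")
    rows_b = run(service, f"SELECT * FROM {table_b} ORDER BY {order_by}")
    return rows_a == rows_b
```

With the tables from this article, tables_match("minio", "minio2") should return True if the transfer worked.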

Even though this is a small example, you may notice above that the query against minio (0.008 sec) took longer than the query against minio2 (0.001 sec). Tables that use S3-compatible storage experience higher latency than local tables because their data is stored in a container rather than on a local disk.

You may have noticed that MinIO storage in a local Docker container is extremely fast. Although storage in a local Docker container will always be faster than cloud storage, MinIO also outperforms AWS S3 as a cloud storage bucket. For example, after running a performance benchmark loading a dataset containing almost 200 million rows (142 GB), the MinIO bucket showed a performance improvement of nearly 40% over the AWS bucket!

Conclusion

In this article, we have introduced MinIO integration with ClickHouse. We reviewed how to use MinIO and ClickHouse together in a docker-compose cluster to actively store table data in MinIO, as well as import and export data directly to and from MinIO using the S3 table function. We have also briefly discussed the performance advantages of using MinIO, especially in a Docker container.

Stay tuned for the next update in this blog series, in which we will compare the performance of MinIO and AWS S3 on the cloud using some of our standard benchmarking datasets.
