Adventures in ClickHouse Development: Running Integration Tests, Part 2

Running ClickHouse integration tests by hand is not simple, especially if you attempt to execute not just a few tests but all the integration tests that are run in the ClickHouse CI/CD pipeline. In the first part of this article, we covered the test environment and walked through how to run a few tests by hand. While doing that, we ran into an issue with Docker images that did not exactly match the ClickHouse source code we were using. We also looked at how these tests are executed in the ClickHouse CI/CD pipeline and found that there is a lot of wrapper code involved. It turned out that running all the tests is not simple, and you can’t just start the runner script and expect to match the results you see in CI/CD.

In this second part, we will pick up where we left off and try to run all the integration tests by hand. We will start with the naive approach and then go back to ci-runner.py to get the details of how all the integration tests are executed by this script. After that, we will look into what it takes to solve the problem of the Docker images that are required for the tests. Finally, we will use a helper test program that we have developed for running integration tests, which addresses the pain points we discover along the way. Just like in the first part, we will be using the v23.12.1.1368-stable branch for the integration test source code.

Trying to run all the integration tests

Let’s try to naively run all the integration tests. In the first part, after extracting ClickHouse binaries from the clickhouse/clickhouse-server:23.12.1.1368-alpine Docker image into the 23.12.1.1368-alpine local folder, we used the following command to run a few tests:

./runner --binary 23.12.1.1368-alpine/clickhouse 'test_ssl_cert_authentication'

The last positional argument, 'test_ssl_cert_authentication', specifies the arguments that will be passed to pytest. In this case, it selects only the tests found in the tests/integration/test_ssl_cert_authentication folder.
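Because this argument is forwarded to pytest, standard pytest selection options can be used as well. For example, here is a minimal sketch that narrows the run further with a -k keyword expression (the keyword itself is illustrative and depends on the actual test names in that module):

./runner --binary 23.12.1.1368-alpine/clickhouse 'test_ssl_cert_authentication -k https'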

Consequently, the naive approach to running all integration tests is to run the runner script without any test selectors so that all available tests are collected and executed. We can do this using the following command:

./runner --binary 23.12.1.1368-alpine/clickhouse

If you try that, you will find that it does not work as expected, and you will run into the following issues:

  • some tests might get stuck
  • there will be many more failures than in CI/CD runs
  • the run will take an unreasonable amount of time
  • some tests may fail because the Docker images do not match the source code

As before, we can set aside the issue of matching Docker images and try to address the first three points by looking at how all integration tests are actually executed in CI/CD. For this, we again need to read the code in ci-runner.py, where we see that CI/CD does not run all integration tests at once.

Upon further review of ci-runner.py, we can draw the following conclusions:

  • tests are split into groups, and each group is executed as a separate pytest invocation
  • each group is run with a limited number of parallel workers
  • failed tests are retried, and parallelism is turned off during retries to filter out failures caused by parallel workers

Therefore, we can reasonably conclude that running all integration tests locally is not for the faint of heart: a lot of logic is required to reproduce the test results from CI/CD. We could also consider simply running ci-runner.py directly, but a quick attempt shows that the script was not written with that use in mind.

./ci-runner.py 
Traceback (most recent call last):
  File "/root/ClickHouse/tests/integration/./ci-runner.py", line 1008, in <module>
    params = json.loads(open(params_path, "r").read())
TypeError: expected str, bytes or os.PathLike object, not NoneType

This script relies on a parameter file that is created in CI/CD, so we would have to figure out its format and most likely run into more issues given that we are not inside the CI/CD environment.

Dealing with Docker images

In addition to figuring out how to run all integration tests locally, we also have to address the problem of Docker images. Let’s go back to the problem of using matching Docker images that we identified when trying to run just a few tests. To do things right, we need to build all the images needed by the integration tests, which are defined by the Dockerfiles in the docker/test/integration folder, taking into account any image dependencies defined in docker/images.json.
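For example, here is a sketch of how the image list could be inspected and one of the images built locally. It assumes that docker/images.json maps Dockerfile paths to an object with a name field (as it does in recent ClickHouse versions) and that jq is installed; the specific image shown is only an illustration.

# list the integration-test images defined in the ClickHouse repository (run from the source root)
jq -r 'to_entries[] | select(.key | startswith("docker/test/integration")) | "\(.key) -> \(.value.name)"' docker/images.json

# build one of them locally; the path and image name may differ between versions
docker build -t clickhouse/integration-helper docker/test/integration/helper_container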

We can also look at how these images are built in CI/CD and find that tests/ci/docker_images_check.py is used. But again, this script is CI/CD oriented; if you try to run it locally, you will first have to install its dependencies, such as pygithub and unidiff.

python3 docker_images_check.py 
Traceback (most recent call last):
  File "/root/ClickHouse/tests/ci/docker_images_check.py", line 11, in <module>
    from github import Github
ModuleNotFoundError: No module named 'github'
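The missing modules can be installed by hand, for example as shown below (package names are the ones mentioned above), although more missing dependencies may surface as you go:

pip3 install pygithub unidiff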

You can try to get it working locally, but it is not fun. The more important problem, however, even if we do build all the images locally, is finding a way to make the integration tests actually use them. The tests are executed inside a Docker container, which itself runs Docker, so we have to deal with a Docker-in-Docker environment. This means that locally built images are not available to the Docker daemon running inside the clickhouse/integration-tests-runner container. In CI/CD this is not a problem, as all the images are pushed to Docker Hub and then pulled by the Docker daemon running inside that container. But we do not want to push our local images anywhere, so we need to find a way to make them available inside that container.

The solution is not simple and looks as follows (a command sketch is shown after the list):

  • Build all needed images locally and save them into a tar file using the docker save command.
  • Use the runner script’s --dockerd-volume-dir option, which specifies a local folder to be mounted as /var/lib/docker inside the container, and run the runner with the --command option set to a docker load -i command so that the saved images are loaded into that mounted /var/lib/docker folder.
  • Run the runner script again to execute the tests, with the --dockerd-volume-dir option set to the same local folder that was used when loading the images in the previous step.
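Put together, the workflow looks roughly like the sketch below. This is only a sketch, not the exact CI/CD procedure: the image names, the tar file location, and the path assumed inside the container need to be adapted to your setup, while --dockerd-volume-dir and --command are the real runner options described above.

# 1. Save the locally built images into a tar archive (image names are illustrative)
docker save -o images.tar clickhouse/integration-test clickhouse/integration-helper

# 2. Load the archive into the Docker daemon running inside the runner container
#    (assumes the archive sits in a folder that the runner mounts into the container)
./runner --binary 23.12.1.1368-alpine/clickhouse \
    --dockerd-volume-dir ./dockerd_volume_dir \
    --command 'docker load -i images.tar'

# 3. Run the tests against the same dockerd volume so the loaded images are reused
./runner --binary 23.12.1.1368-alpine/clickhouse \
    --dockerd-volume-dir ./dockerd_volume_dir \
    'test_ssl_cert_authentication'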

Helper test program to the rescue

Having identified all the issues with trying to run all integration tests locally, any reasonable person would just give up and let ClickHouse’s CI/CD pipeline always execute them for us. However, this handicaps local development and the debugging of test failures, especially if you have to deal with different ClickHouse versions.

In order to solve these problems, we have created a convenient test program that addresses these issues and allows our developers and QA engineers to run integration tests locally, taking care of all the subtleties. You can find our test program for running ClickHouse integration tests in our clickhouse-regression repository. The clickhouse-regression project holds the largest third-party test suite for ClickHouse outside the main ClickHouse repository and adds an additional layer of test coverage for our Altinity Stable Builds. The README.md provides instructions for getting the tests running once the necessary prerequisites are installed. In this case, only Docker and TestFlows are needed. Given that we already have Docker installed, we can quickly install TestFlows as follows:

pip3 install testflows==2.2.9

Then we just need to check out the clickhouse-regression repository; a shallow clone is enough:

git clone --depth 1 https://github.com/Altinity/clickhouse-regression.git

After checking out the repository, you will find regression.py, which we can use to run all the integration tests or just a subset of them. It takes care of all the Docker images, runs tests in groups, controls the number of parallel workers, and retries any failed tests (twice by default), turning off parallelism during retries to filter out failures caused by parallel workers.

ls clickhouse-regression/integration/
docker  __pycache__  README.md  regression.py  steps.py  test.log

The instructions in the README.md will provide more details. However, we can build all the images locally and run the first ten tests using the following command:

./regression.py --root-dir ~/ClickHouse/ --binary ~/ClickHouse/tests/integration/23.12.1.1368-alpine/clickhouse --slice 0 10 -l test.log

Any subsequent runs can specify the --skip-build-images option to skip the image building, saving, and loading steps, as the state of the runner container’s /var/lib/docker is preserved in the clickhouse-regression/integration/docker/dockerd_volume_dir folder, which regression.py automatically passes as the value of the runner’s --dockerd-volume-dir option.

Given that I already have the images built, saved, and loaded, I will use the --skip-build-images option to speed up the run, as well as the --output classic TestFlows option to shorten the output.

./regression.py --root-dir ~/ClickHouse/ --binary ~/ClickHouse/tests/integration/23.12.1.1368-alpine/clickhouse --slice 0 10 -l test.log -o classic --skip-build-images

Sweet! The first ten tests were executed in two minutes, and we now know that all Docker images were handled correctly! What about running all the tests? You can simply remove the --slice option and let all the tests run, but be patient: it takes about 3 hours to complete, depending on the number of failures and retries, since retried tests are executed without parallel workers.
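For reference, the full run is the same command as before, with the --slice option removed:

./regression.py --root-dir ~/ClickHouse/ --binary ~/ClickHouse/tests/integration/23.12.1.1368-alpine/clickhouse -l test.log -o classic --skip-build-images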

Conclusion

We have taken a long journey looking at what it takes to run ClickHouse integration tests by hand locally. Running even a few tests correctly is not simple, and running all the tests requires a lot of work unless you use a helper test program like regression.py from our clickhouse-regression repository. This complexity forces most ClickHouse developers to rely exclusively on the CI/CD pipeline to run integration tests. However, in our work ensuring ClickHouse’s overall quality and as part of Altinity Stable Builds, we have to support different versions of ClickHouse. Our developers and QA engineers have to execute and debug any test failures locally to ensure that every failure is understood before a release can be made available to our customers. Therefore, the ability to run tests by hand is a must, and being able to deal with ClickHouse integration tests is just part of our job. Feel free to try running ClickHouse integration tests by hand using our test program. All our ClickHouse tests are open-source; you can find them at https://github.com/Altinity/clickhouse-regression/ and help our efforts to make ClickHouse better for everyone!
