Rescuing ClickHouse from the Linux OOM Killer

From time to time we see a message like this on an Altinity support channel: 

“Help! My ClickHouse server disappeared!!! What happened?” 

In many cases, this is a sign of an out-of-memory condition. ClickHouse began consuming too much memory, and Linux killed it in order to prevent the system from becoming unstable. The Linux process that does this is the Out-of-Memory Killer, or OOM killer for short. 

In this blog article, we will dig into the OOM killer and how to prevent ClickHouse from becoming a victim. Let’s start with a visit to the scene of the crime. 

OOM killer footprints in the wild

The typical symptoms of an OOM killer intervention look like the following. 

  1. ClickHouse suddenly stops or restarts.
  2. In the system journal (viewable via the dmesg command, journalctl, /var/log/kern.log, or /var/log/messages, depending on the OS), you can see messages like the ones below:
[Tue Jan  1 00:00:00 2022] Out of memory: Killed process 109666 (clickhouse-serv) total-vm:100680400kB, anon-rss:65557992kB, file-rss:231920kB, shmem-rss:0kB, UID:0 pgtables:161300kB oom_score_adj:0
...
[Tue Jan  1 00:00:00 2022] Memory cgroup out of memory: Killed process 1234 (clickhouse) total-vm:...kB, anon-rss:...kB, file-rss:...kB, shmem-rss:...kB, UID:0
...
[Tue Jan  1 00:00:00 2022] oom_reaper: reaped process 113333 (clickhouse), now anon-rss: ... kB, file-rss:0kB, shmem-rss:0kB

If you are running ClickHouse directly on a Linux host (including VMs) you can use the following commands to find OOM killer events in the system logs:

dmesg -T | grep -i 'killed process'

# or 
grep -s -i 'killed process' /var/log/{syslog,messages,kern.log}

# or also for the previous boots (for distros using journalctl - see https://www.baeldung.com/linux/which-process-killed )
journalctl --list-boots | \
    awk '{ print $1 }' | \
    xargs -I{} journalctl --utc --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE

In Kubernetes, it’s a little different. When you describe the pod, you see something like the following. 

kubectl describe pod/chi-demo-ch-0-0-0
. . .
    State:          Running
      Started:      Thu, 10 Sep 2022 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 10 Sep 2022 11:04:03 +0200
      Finished:     Thu, 10 Sep 2022 11:14:11 +0200

ClickHouse itself also gives the following additional hints. 

First of all, your server monitoring will show that the amount of used memory grows until it hits the limit. After the OOM killer strikes, it drops to nearly zero.

Second, you may see messages like this:

> 2022.01.01 00:00:01.000001 [ 123123 ] {} <Fatal> Application: Child process was terminated by signal 9 (KILL). If it is not done by 'forcestop' command or manually, the possible cause is OOM Killer (see 'dmesg' and look at the '/var/log/kern.log' for the details).
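
To find that message quickly, you can grep the ClickHouse logs (a simple check, assuming the default log locations):

grep -a 'terminated by signal 9' /var/log/clickhouse-server/clickhouse-server*.log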

Now, once we have the suspect, let’s try to recreate the circumstances of the crime.

Understanding what happened

ClickHouse may need to allocate substantial RAM during normal operations: for queries (sorting / aggregation / joins in RAM), as well as to store caches, dictionaries, buffers, etc.

There is a physical limitation on the amount of RAM available on the underlying system. For example, it is not possible to allocate 200GB on a system with 128GB of RAM. That is, unless swap is enabled, in which case Linux spools memory to storage when there is not enough RAM. (Enabling swap is a very, very bad idea for performance. It can slow ClickHouse to a crawl.)

So what will the operating system do if a process tries to allocate more RAM than is physically available? Linux distros enable the OOM killer by default. If it detects that there is no free RAM left, it simply finds the process that uses the most RAM and kills it without any hesitation!

In order to avoid such a bad scenario and stay in the game, ClickHouse tries to be polite and doesn’t request too much RAM. (Well, nobody wants to be killed.) ClickHouse tracks how much memory it uses. If it sees that it has already allocated ‘a lot’, it stops new allocations and returns a ‘Memory limit (total) exceeded’ exception to the requesting thread. Usually, this is sufficient to avoid a rendezvous with the OOM killer.

But what does ‘a lot’ actually mean? For a node with 16GB of RAM, 20GB is ‘a lot’, while for a node with 128GB of RAM, 20GB is OK. One can set the ClickHouse upper memory limit with the max_server_memory_usage setting. It is zero by default, meaning that the actual value is calculated automatically during clickhouse-server startup. ClickHouse uses the amount of RAM available on the node and another setting called max_server_memory_usage_to_ram_ratio (the default is 0.9). That means that with the default configuration, ClickHouse assumes it’s safe to allocate up to 90% of the physical RAM. You’ll see a message like the following in the log at startup.

2022.06.03 09:32:37.814551 [ 10879 ] {} <Information> Application: Setting max_server_memory_usage was set to 28.05 GiB (31.17 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)

So here you can see that ClickHouse detected that the system has about 31 GiB of available RAM, multiplied it by 0.9 (max_server_memory_usage_to_ram_ratio), and set max_server_memory_usage to 28 GiB.

In this scenario, ClickHouse will allow memory allocations of up to 28 GB in total, and all allocations above the limit will fail with ‘Memory limit (total) exceeded’ exceptions, staying about 3GB away from the dangerous point where the OOM killer starts looking for a victim.
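
If you are not sure what limit your server calculated, you can grep the startup messages in the server log (assuming the default log location):

grep 'max_server_memory_usage was set' /var/log/clickhouse-server/clickhouse-server.log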

So ClickHouse tries its best to be safe, but is there anything that can go wrong here? Of course! There could be the following unexpected circumstances:

  1. Other software runs on the same node and takes more than 10% of RAM. Imagine that in the example above ClickHouse stays within the allowed limit (28GB), but there is also a mysqld running on the same node that takes 6GB. 6+28 is bigger than 32, so the OOM killer is going to be awakened and will kill the biggest process. That is ClickHouse. (The commands after this list show a quick way to check what else consumes RAM on a node.)
  2. ClickHouse may detect the available memory incorrectly. For example, the ClickHouse node has 32GB of physical RAM, but a lower limit is set using cgroups (via Docker or Kubernetes memory limits). ClickHouse versions older than 21.12 were not able to detect cgroup memory limits.
  3. Sometimes ClickHouse also tracks memory inaccurately. Say, the ClickHouse memory tracker thinks that it has allocated only 26GB, so it’s OK to allocate 2GB more, but in reality it has already allocated 31GB. How can that happen? Accurate memory tracking is quite a difficult task in a multi-threaded application with hundreds of components, including external libraries. Allocations and deallocations may happen in different places in the code and in different threads. There were a few bugs like this in old versions of ClickHouse. Many of them were addressed in the last 1-2 years, so if you use version 22.3 or newer the memory tracking should be accurate. But no software is free of bugs, so tracking errors are still possible, though increasingly improbable.

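A quick way to check whether something other than ClickHouse is eating the RAM is to look at the overall memory picture and at the top consumers by resident set size:

# Overall memory picture
free -h

# Top memory consumers on the node, sorted by resident set size
ps aux --sort=-rss | head -n 10
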
To summarize, if you run a recent ClickHouse that is properly configured, and there is no other memory intensive workload running on the same server, the OOM killer should not come after ClickHouse. 

How to stay safe

So what does “properly configured” mean? Here are the main principles.

First of all, analyze carefully how RAM is used on the node, then estimate how much RAM other software and the OS can take. For example, ZFS file systems may use a lot of RAM for the ARC cache. If you see that 10% is not enough for non-ClickHouse processes, then decrease max_server_memory_usage_to_ram_ratio to 0.8 or below, or set max_server_memory_usage to some explicit value. 

If cgroup limits are used (memory limits via Docker / Kubernetes), make sure that the ClickHouse version is up to date and that the limit is detected correctly. If ClickHouse cannot detect the RAM size correctly, set max_server_memory_usage explicitly.
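
A quick sanity check (run inside the container) is to read the cgroup memory limit directly; the path depends on whether the host uses cgroup v1 or v2:

# cgroup v2
cat /sys/fs/cgroup/memory.max

# cgroup v1
cat /sys/fs/cgroup/memory/memory.limit_in_bytes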

And last but not least: do not use old ClickHouse versions, because they may not track memory accurately. Regardless of the version, if you find a problem with memory being tracked incorrectly, please file an issue on GitHub, and it will be fixed in a future ClickHouse release.

Q & A

Let’s answer a few common questions that come up when configuring ClickHouse in order to avoid attracting the attention of the OOM killer.

Q. Is it possible to disable the OOM killer completely?

A. Yes, it is possible but not recommended. The OOM killer performs a useful social function. Without it, it is possible to get a kernel panic and a complete restart of the server. See https://www.percona.com/blog/2019/08/02/out-of-memory-killer-or-savior/.
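
For reference, the knobs involved are the per-process oom_score_adj value and the vm.overcommit_memory sysctl. A minimal sketch, shown for completeness rather than as a recommendation:

# Exempt the running clickhouse-server process from the OOM killer
# (-1000 means "never kill this process"; use with care)
echo -1000 > /proc/$(pgrep -fo clickhouse-server)/oom_score_adj

# Or make the kernel refuse allocations instead of overcommitting (affects the whole host)
sysctl -w vm.overcommit_memory=2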

Q. Can I set max_server_memory_usage_to_ram_ratio > 1? Can I set max_server_memory_usage to a value bigger than the amount of physical RAM?

A. Yes, but it is likewise not recommended. It may work if swap and overcommit are enabled, but that is a terrible idea for performance reasons. Sometimes it makes sense on developer / low-RAM machines in order to run queries that require more RAM, but in most production scenarios memory overcommit leads to big trouble. 

Q. How to configure max_server_memory_usage?

A. Use the following example.

<!-- /etc/clickhouse-server/config.d/max_server_memory_usage.xml -->
<?xml version="1.0"?>
<clickhouse>
   <!-- when max_server_memory_usage is set, max_server_memory_usage_to_ram_ratio is ignored -->
   <max_server_memory_usage>4000000000</max_server_memory_usage>
</clickhouse>

Q. How to configure max_server_memory_usage_to_ram_ratio?

A. Here’s another example. 

<!-- /etc/clickhouse-server/config.d/max_server_memory_usage.xml -->
<?xml version="1.0"?>
<clickhouse>
   <!-- when max_server_memory_usage is set, max_server_memory_usage_to_ram_ratio is ignored -->
   <max_server_memory_usage>0</max_server_memory_usage>
   <max_server_memory_usage_to_ram_ratio>0.75</max_server_memory_usage_to_ram_ratio>
</clickhouse>

Q. Can I set max_server_memory_usage to small values like 2GB?

A. In most cases that’s not the best idea. Almost every query will return the exception ‘Memory limit (total) exceeded.’ See https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#max_server_memory_usage for more information.

Q. I use version 20.3 or older, which does not have max_server_memory_usage yet.

A. Please upgrade! (Check out Altinity Stable releases.) 

Q. I’m getting ‘Memory limit (for query) exceeded’ or ‘Memory limit (for user) exceeded’. Is it related?

A. You’re hitting query-level or user-level limits. They are not related to the ‘total’ memory limit or the OOM killer. Check out the max_memory_usage (per query) and max_memory_usage_for_user (per user) settings.
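
For example, the per-query limit can be relaxed for a single heavy query with the SETTINGS clause (the query and the value below are purely illustrative):

# Allow up to ~20 GB for this one query
clickhouse-client --query "
    SELECT count(DISTINCT number)
    FROM numbers(100000000)
    SETTINGS max_memory_usage = 20000000000"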

Q. Are there other ways to avoid running out of memory? 

A. Other approaches include enabling on-disk spilling for queries. Check out max_bytes_before_external_group_by and max_bytes_before_external_sort in the Altinity Knowledge Base. Or you can rewrite the query to use less memory, for example by using fewer GROUP BY keys or choosing aggregate functions that use less memory. 
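
As a sketch of the on-disk spilling approach, the thresholds can be set per query with the SETTINGS clause. A common rule of thumb is roughly half of max_memory_usage; the query and values below are illustrative:

# Spill aggregation state to disk once it exceeds ~5 GB
clickhouse-client --query "
    SELECT count()
    FROM (SELECT number % 10000000 AS k FROM numbers(100000000) GROUP BY k)
    SETTINGS max_bytes_before_external_group_by = 5000000000,
             max_bytes_before_external_sort = 5000000000"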

Q. What is the relationship between max_memory_usage and max_server_memory_usage?

A. max_memory_usage is the maximum amount of memory allowed for a single query. By default, it’s 10GB. The default value is good across a wide range of use cases. Don’t adjust it unless you need to. 

There are scenarios when you need to relax the limit for particular queries (if you hit ‘Memory limit (for query) exceeded’), or use a lower limit if you need to discipline the users or increase the number of simultaneous queries.

max_server_memory_usage is the maximum memory for the entire server. It includes the constant memory footprint (used by different caches, dictionaries, etc.) plus the sum of memory temporarily used by running queries (a theoretical limit is the number of simultaneous queries multiplied by max_memory_usage).

Q. My ClickHouse server uses only 10% of the RAM, while the limit is 90%.

A. It is OK if your tables are large, which is to say larger than the available RAM. The free RAM will be used by the OS page cache, which speeds up queries significantly if you read the same data often. If your RAM exceeds your data size, or if the page cache together with clickhouse-server cannot use all the available RAM, then maybe your host is too big for your dataset. (If you are running Altinity.Cloud, you can just scale down to a smaller instance size.)

Q. What does memory usage of the ClickHouse server usually look like?

A. It differs significantly depending on the use case. In a well-tuned system, you might see ClickHouse use 10-30% of RAM all the time, with spikes to 50-70% from time to time when large queries are executed. The page cache uses the rest.

Q. How can I debug what is taking so much RAM before the OOM killer strikes?

A. Check /var/log/clickhouse-server/clickhouse-server.log for records containing MemoryTracker. You can trace them back by query_id or thread_id to understand what that memory was needed for. 

grep 'MemoryTracker' /var/log/clickhouse-server/clickhouse-server.log

For a running system, you can use system tables + log tables (if enabled). See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kbc-who-ate-my-memory/.
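
For instance, here is a sketch of a couple of useful queries (assuming you have clickhouse-client access and, for the second one, that the query_log is enabled):

# Memory currently used by running queries
clickhouse-client --query "
    SELECT query_id, user, formatReadableSize(memory_usage) AS mem
    FROM system.processes
    ORDER BY memory_usage DESC"

# Peak memory of recently finished queries
clickhouse-client --query "
    SELECT query_id, formatReadableSize(memory_usage) AS peak_mem, substring(query, 1, 80) AS query_head
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 1 HOUR AND type = 'QueryFinish'
    ORDER BY memory_usage DESC
    LIMIT 10"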

Q. ClickHouse suddenly stopped, but it was not using much RAM and there were no OOM killer events in the system journal.

A. Well, the OOM killer is not the only thing that can kill servers. Check /var/log/clickhouse-server/clickhouse-server.log and find the last lines before the crash; that should help identify the reason. The OOM killer is generally our first suspect, but humans can also stop or restart ClickHouse. In Kubernetes environments, the Altinity Kubernetes Operator for ClickHouse may restart ClickHouse as part of an upgrade or configuration change.

Conclusion

The OOM killer keeps the Linux operating system stable by eliminating processes that use too much memory. It is usually not too hard to detect when this happens to ClickHouse. If ClickHouse suddenly disappears, it’s the first cause one should try to rule out. As this article has shown, there are many ways to manage memory usage and ensure that ClickHouse and the underlying Linux OS stay healthy at all times. 

If you have further questions about ClickHouse memory usage or any other aspect of ClickHouse operation, feel free to contact us or join our public Slack workspace. You can also sign up for Altinity.Cloud, which automatically configures nodes to avoid OOM killer problems as well as many others. See you soon!
