Fixing the Dreaded ClickHouse Crash Loop on Kubernetes

As countless ClickHouse users have learned, Kubernetes is a great platform for data. It’s portable to almost every IT environment. Managed Kubernetes services like Amazon EKS simplify operation. And the Altinity Kubernetes Operator for ClickHouse lets you start complex ClickHouse clusters from a single resource file. 

But there’s still the occasional dark cloud. One of these is the pod crash loop, which occurs when a ClickHouse pod crashes on startup and Kubernetes restarts it over and over. Here’s an example that shows a pod crash loop in progress.

$ kubectl get pods
NAME                      READY   STATUS             RESTARTS      AGE
chi-crash-demo-ch-0-0-0   0/1     CrashLoopBackOff   4 (19s ago)   116s
chi-ok-demo-ch-0-0-0      1/1     Running            0             2m4s

Pod crash loops often arise because of ClickHouse misconfiguration or version upgrade issues. They are generally straightforward to fix once you know the cause. But how can you figure out what’s happening? This blog article walks through the steps to diagnose and fix crash loop problems. 

Dawn of a crash loop

The examples I’m about to provide use Kubernetes 1.22 running on Minikube, ClickHouse 21.8.11.1 (Altinity Stable build), and Altinity Operator for ClickHouse version 0.18.3. To keep things simple, storage definitions, AZ assignments, and other niceties are omitted. 

OK, let’s create a healthy ClickHouse server. We start with a very simple ClickHouseInstallation definition, which we’ll store in file crash.yaml.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
          - name: clickhouse
            image: altinity/clickhouse-server:21.8.11.1.altinitystable

We apply the definition and have a look at the resulting pod. 

$ kubectl apply -f crash.yaml 
clickhouseinstallation.clickhouse.altinity.com/crash-demo created
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
chi-crash-demo-ch-0-0-0   1/1     Running   0          10s

Everything is healthy so far. Now let’s break it by adding a bad configuration file to the resource definition. Here’s the new definition; the offending files: section contains malformed XML.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
    files:
      badconfig.xml: |
        <yandex>
          <foo><baz>
        </yandex>

  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
          - name: clickhouse
            image: altinity/clickhouse-server:21.8.11.1.altinitystable

Let’s apply and see what happens. Note: You may have to wait a minute or two before ClickHouse picks up the change and the pod restarts. 

$ kubectl apply -f crash.yaml 
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME                      READY   STATUS             RESTARTS      AGE
chi-crash-demo-ch-0-0-0   0/1     CrashLoopBackOff   1 (12s ago)   15s

Boom! We have just created a pod crash loop. The pod will continuously restart until we figure out what is wrong and fix it. 

Getting to the root cause

We now have a broken pod to play with. How do we figure out what happened? Let’s go through the steps in order.  

Check pod events using `kubectl describe`

The `kubectl describe` command shows you configuration data and events related to a currently executing pod. This should be your first stop if a pod is not coming up for any reason, including pod crash loops. Here’s how to describe our pod.

$ kubectl describe pod/chi-crash-demo-ch-0-0-0
Name:         chi-crash-demo-ch-0-0-0
Namespace:    default
. . .
(Lots of configuration information)
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m28s                   default-scheduler  Successfully assigned default/chi-crash-demo-ch-0-0-0 to logos2
  Normal   Pulled     3m44s (x4 over 4m27s)   kubelet            Container image "altinity/clickhouse-server:21.8.11.1.altinitystable" already present on machine
  Normal   Created    3m44s (x4 over 4m27s)   kubelet            Created container clickhouse
  Normal   Started    3m44s (x4 over 4m27s)   kubelet            Started container clickhouse
  Warning  BackOff    3m13s (x13 over 4m25s)  kubelet            Back-off restarting failed container

If you made a simple configuration mistake, such as picking a bad pod name, you’ll see it here. The event output also prints useful messages if you can’t allocate storage or don’t have enough resources to schedule the pod, e.g., insufficient memory or CPU. If you see a problem, correct the resource file and apply it again with kubectl.  
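If the `kubectl describe` output is long, you can also pull just the events that reference a single pod. A quick way to do that, using our example pod name:

```shell
# List only the events that reference our pod, oldest first
kubectl get events \
  --field-selector involvedObject.name=chi-crash-demo-ch-0-0-0 \
  --sort-by=.lastTimestamp
```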

In this case there’s nothing useful in the messages, so we’ll have to proceed to the next step. 

Check pod logs with `kubectl logs`

The ClickHouse pod may be crashing, but that does not mean we can’t see the logs from outside. Our next step is to use the `kubectl logs` command, which shows messages logged as ClickHouse starts. Here’s an example. 

$ kubectl logs pod/chi-crash-demo-ch-0-0-0
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Exception: Failed to merge config with '/etc/clickhouse-server/config.d/badconfig.xml': SAXParseException: Tag mismatch in '/etc/clickhouse-server/config.d/badconfig.xml', line 3 column 2 (version 21.8.11.1.altinitystable (altinity build))
. . . 

In this case, the logs show us a useful message right away. There is something wrong with file badconfig.xml, so we now know where to look. We can fix the resource definition and apply it with kubectl. 
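One related trick: once the container has restarted, `kubectl logs` shows the log of the current attempt, which may be empty or cut short. The `--previous` flag retrieves the log of the prior, crashed container instance instead:

```shell
# Show logs from the previous (crashed) instance of the container
kubectl logs pod/chi-crash-demo-ch-0-0-0 --previous
```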

Change the pod entry point and debug using `kubectl exec`

In some cases, it’s not enough to see logs to figure out what’s going on. We need to get into the pod and go mano-a-mano with ClickHouse. The key to debugging the pod is to make it come up but not run ClickHouse. 

To make the pod come up and halt, we need to make two simple changes to the pod template. We’ll change the entrypoint to run a sleep command, and we’ll alter the liveness probe. Here’s the configuration file with the changes marked by comments.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
    files:
      badconfig.xml: |
        <yandex>
          <foo><baz>
        </yandex>

  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
          - name: clickhouse
            image: altinity/clickhouse-server:21.8.11.1.altinitystable
            # Add command to bring up pod and stop.
            command:
              - "/bin/bash"
              - "-c"
              - "sleep 9999999"
            # Fix liveness probe so that we won't look for ClickHouse.
            livenessProbe:
              exec:
                command:
                - ls
              initialDelaySeconds: 5
              periodSeconds: 5

Apply the updated file and wait until the pod starts successfully. Again, this might take a couple of minutes. 

$ kubectl apply -f crash.yaml 
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
chi-crash-demo-ch-0-0-0   0/1     Running   0          5s

OK, success!  We are ready to connect and figure out what’s wrong.  But first, one question: why did we alter the liveness probe? It’s an important trick.

The liveness probe is used by Kubernetes to check whether the pod is working. For ClickHouse, the clickhouse-operator configures a liveness probe that runs an HTTP GET against the ClickHouse /ping URL. If the liveness probe fails (and it will, because ClickHouse isn’t running), Kubernetes will eventually notice and restart the pod. That’s a little disappointing if you are right in the middle of debugging the problem. 
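For reference, the probe the operator generates looks roughly like the following sketch. The exact path, port, and timing values here are illustrative and may differ across operator versions:

```yaml
livenessProbe:
  httpGet:
    path: /ping
    port: 8123          # ClickHouse HTTP interface
  initialDelaySeconds: 60
  periodSeconds: 3
  failureThreshold: 10
```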

Now that the pod is patiently waiting, let’s use kubectl exec to get in and see what’s going on. We enter the following command to get to the bash prompt. 

$ kubectl exec -it chi-crash-demo-ch-0-0-0 -- bash
root@chi-crash-demo-ch-0-0-0:/#

Cool, we’re in and can start poking around to diagnose the problem. The simplest way is to start ClickHouse manually and see what happens. Here’s what we see. 

# clickhouse-server -C /etc/clickhouse-server/config.xml
Processing configuration file '/etc/clickhouse-server/config.xml'.
Merging configuration file '/etc/clickhouse-server/conf.d/chop-generated-macros.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-01-listen.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-02-logger.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-03-query_log.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/01-clickhouse-04-part_log.xml'.
Merging configuration file '/etc/clickhouse-server/config.d/badconfig.xml'.
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Exception: Failed to merge config with '/etc/clickhouse-server/config.d/badconfig.xml': SAXParseException: Tag mismatch in '/etc/clickhouse-server/config.d/badconfig.xml', line 3 column 2, Stack trace (when copying this message, always include the lines below):
...

We now see what’s wrong. The bad configuration file we inserted is biting us, just as we saw from looking at logs using kubectl.
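Since malformed XML embedded in the resource definition is such a common culprit, it’s worth checking well-formedness on your workstation before applying the change. Here is a minimal sketch, assuming python3 is available locally; the /tmp path is just for illustration:

```shell
# Write the suspect snippet to a file and check that it parses as XML.
cat > /tmp/badconfig.xml <<'EOF'
<yandex>
  <foo><baz>
</yandex>
EOF
python3 - /tmp/badconfig.xml <<'PY' || echo "badconfig.xml is not well-formed XML"
import sys
import xml.dom.minidom

xml.dom.minidom.parse(sys.argv[1])  # raises ExpatError on tag mismatch
PY
```

Catching the error here saves a full apply-and-crash cycle.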

Advanced debugging

In some cases we might still not understand why ClickHouse is crashing. Networking issues are a common reason why further work is necessary. In this case you may need more debugging tools than are available on the stripped-down container that runs ClickHouse. 

You can get additional packages using `apt install`. For example, say you need the ping command to diagnose network connectivity. Here’s how to get it. 

$ kubectl exec -it chi-crash-demo-ch-0-0-0 -- bash
root@chi-crash-demo-ch-0-0-0:/# apt update
. . .
root@chi-crash-demo-ch-0-0-0:/# apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libcap2 libcap2-bin libpam-cap
The following NEW packages will be installed:
  iputils-ping libcap2 libcap2-bin libpam-cap
0 upgraded, 4 newly installed, 0 to remove and 21 not upgraded.
Need to get 90.5 kB of archives.
After this operation, 333 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
. . . 
root@chi-crash-demo-ch-0-0-0:/# ping www.yahoo.com
PING new-fp-shed.wg1.b.yahoo.com (98.137.11.163) 56(84) bytes of data.
64 bytes from media-router-fp74.prod.media.vip.gq1.yahoo.com (98.137.11.163): icmp_seq=1 ttl=48 time=31.2 ms

Bear in mind that any tools you install will disappear when the pod restarts. This is a feature, not a bug; you don’t have to worry about cleaning up the debris left from diagnosing problems. 

Fixing the ClickHouse pod

Depending on the crash loop cause there may be different fixes. Here are three cases we often see and how to fix them. 

  1. Configuration file error. Fix the configuration in the Kubernetes resource definition and apply using kubectl. Don’t fix configuration issues on the file system. Your fixes will disappear when the pod restarts.
  2. Bad SQL file after upgrade. Sometimes old table definitions have SQL that is no longer supported in a new ClickHouse version. If the table is not needed, you can fix it by moving the table definition file out of /var/lib/clickhouse/metadata/ to /var/lib/clickhouse. ClickHouse then won’t see it when trying to boot. (Do this fix using a `kubectl exec` session; it will persist when ClickHouse restarts.)
  3. ClickHouse bad upgrade. You have upgraded to a bad version of ClickHouse, for whatever reason. This is rare but happens. Set the version number in the Kubernetes resource definition back to the previous working version and apply using kubectl. 
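For case 2, the fix might look like the following sketch from inside a `kubectl exec` session. The database and table names are hypothetical; substitute the table that the startup log complains about.

```shell
$ kubectl exec -it chi-crash-demo-ch-0-0-0 -- bash
# Inside the pod: move the offending definition out of the metadata tree
root@chi-crash-demo-ch-0-0-0:/# mv /var/lib/clickhouse/metadata/default/broken_table.sql /var/lib/clickhouse/
```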

In our example, the root cause is case 1. We can exit the pod (if we’re using kubectl exec) and go back to the resource definition.  Let’s comment out the bad configuration file plus the commands used to halt the pod on startup.  Here’s the new file.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "crash-demo"
spec:
  configuration:
    clusters:
      - name: "ch"
        templates:
          podTemplate: clickhouse-stable
#    files:
#      badconfig.xml: |
#        <yandex>
#          <foo><baz>
#        </yandex>

  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
          - name: clickhouse
            image: altinity/clickhouse-server:21.8.11.1.altinitystable
#            # Add command to bring up pod and stop.
#            command:
#              - "/bin/bash"
#              - "-c"
#              - "sleep 9999999"
#            # Fix liveness probe so that we won't look for ClickHouse.
#            livenessProbe:
#              exec:
#                command:
#                - ls
#              initialDelaySeconds: 5
#              periodSeconds: 5

Let’s apply it and see what happens. Once again, you may need to wait a minute or two for the pod to restart. 

$ kubectl apply -f crash.yaml 
clickhouseinstallation.clickhouse.altinity.com/crash-demo configured
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
chi-crash-demo-ch-0-0-0   1/1     Running   0          24s

The configuration file does not look very pretty with the extra comments, but no matter. ClickHouse is up and applications are back online.  We can clean up at leisure. 

Conclusion

Pod crash loops are rare but they do arise from time to time. The fact that pods are “closed” can make debugging difficult the first time it happens to you, especially on a production system. Fortunately, there are abundant tools: you can often diagnose issues with `kubectl describe` and `kubectl logs` alone, without even connecting to the pod using `kubectl exec`. I hope this blog article will help you the next time you run into the problem.

At Altinity we are huge fans of running ClickHouse on Kubernetes. Don’t hesitate to contact us if you have further questions. You can use the Contact Us form, join our Slack Workspace, or post issues on the Altinity Operator for ClickHouse project on GitHub. It’s open source and we love to help users as well as make the code even better. See you soon! 
