When a GKE Auto-Upgrade Broke kube-proxy: What Happened and How We Responded

It started with an OpsGenie alert.
It was just another Saturday, or so I thought. One of our ClickHouse® clusters in Altinity.Cloud had suddenly gone silent — no queries, no metrics, not even connection logs from customers trying to connect. The pods were all running and the nodes were healthy, but no traffic was reaching the database.
We were in the middle of a total ingress outage. And not just on one cluster. Others were blinking red as well. Something was blocking external access across several GKE (Google Kubernetes Engine) clusters.
This is the story of how we chased the issue down a rabbit hole involving kernel regressions, IPv6 iptables, and kube-proxy, and how we found both a short-term workaround and a long-term fix, with recommendations for what you can do if you’re affected, as well as some lessons about the inner workings of Kubernetes.
What Went Wrong
Early on May 17, several of our GKE-based Altinity.Cloud clusters began experiencing a complete loss of ingress traffic. ClickHouse services running on these clusters became unreachable, and we quickly traced the issue to newly upgraded worker nodes.
The affected nodes were automatically upgraded to a new ubuntu_containerd node image, which ships with an updated kernel. Shortly after the upgrade, traffic to and from these nodes dropped to zero.
The specific affected image version is ubuntu_containerd 1.32.3-gke.1785003 with kernel version 6.8.0-1022-gke — if you’re using these versions, read on for our fix and recommendations.
Load balancer health checks were failing, and kube-proxy logs showed that it could not apply iptables rules correctly, particularly for IPv6. It was time to roll up our sleeves and fix the issue ASAP.
Digging Into the Root Cause
Clue #1: kube-proxy errors
The key clue was in the logs. kube-proxy was continuously failing with errors like:
Warning: Extension MARK revision 0 not supported, missing kernel module?
ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"
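If you’re triaging a similar outage, this is roughly how the errors surface; the node and pod names below are placeholders, and the grep pattern is only illustrative:
# On GKE, kube-proxy runs as a static pod named kube-proxy-<node-name> in kube-system
kubectl -n kube-system logs kube-proxy-gke-example-pool-1a2b --tail=200 | grep -iE 'mark|ip6tables'
# Reproduce the failure directly on an affected node (via SSH) by requesting the same MARK operation kube-proxy needs
sudo ip6tables -t mangle -N MARK-TEST
sudo ip6tables -t mangle -A MARK-TEST -j MARK --xor-mark 0x4000    # fails on the regressed kernel
sudo ip6tables -t mangle -F MARK-TEST && sudo ip6tables -t mangle -X MARK-TEST    # clean up the test chain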
Clue #2: Rogue IPv6 Rules
Even though our workloads don’t explicitly use IPv6, GKE nodes process IPv6 iptables rules by default. The upgraded kernel lacked support for certain extensions used by kube-proxy, due to a regression in how the nf_tables module was packaged or loaded.
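Both halves of that problem are easy to see from a node shell. A quick sketch, assuming SSH access to a worker node:
# IPv6 KUBE-* chains are programmed even when no workload uses IPv6
sudo ip6tables-save | grep -c KUBE
# The kube-proxy error ("missing kernel module?") points at the xtables MARK extension;
# check the iptables backend and which netfilter modules the node actually has loaded
ip6tables --version                              # e.g. ip6tables v1.8.9 (nf_tables)
lsmod | grep -E 'nf_tables|ip6_tables|xt_mark'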
Suspected Culprit: Kernel Regression in Ubuntu Nodes
This wasn’t a misconfiguration on our side; it’s part of an ongoing issue in the Ubuntu kernel series, already being tracked upstream, including in the issue we filed with Google: https://issuetracker.google.com/issues/418916546
Why This Was Hard to Catch
Several factors made this issue particularly disruptive:
- GKE auto-upgrades nodes, and for clusters on a release channel there is no way to opt out entirely, only to defer upgrades with maintenance windows and exclusions.
- IPv6 iptables rules are evaluated even if you don’t use IPv6.
- The issue was not caught during GKE image validation, meaning broken images reached production clusters.
- There was no working image to roll forward to — newer builds had the same regression.
Mitigation and Recovery
We had identified the likely culprit: the node upgrade. As an immediate first step we added maintenance exclusions to pause all upgrades for unaffected clusters; this prevented the issue from cascading to all of our GKE-hosted clusters.
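If you want to do the same, maintenance exclusions are set per cluster; a rough sketch with a placeholder cluster name, zone, and dates (the allowed scope and maximum duration depend on your release channel):
# Pause all upgrades on this cluster for roughly a month
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --add-maintenance-exclusion-name pause-node-upgrades \
  --add-maintenance-exclusion-start 2025-05-17T00:00:00Z \
  --add-maintenance-exclusion-end 2025-06-16T00:00:00Z \
  --add-maintenance-exclusion-scope no_upgrades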
Then, we started looking for a fix for the affected clusters. During troubleshooting, we attempted to roll forward to another available image version (1.32.3-gke.1927002), but it exhibited the same kernel issue.
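For reference, rolling a node pool to a specific node version looks roughly like this (cluster, zone, and pool names are placeholders):
# Roll the node pool to a specific GKE node version (nodes are recreated)
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --node-pool default-pool \
  --cluster-version 1.32.3-gke.1927002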
In order to get operational again as quickly as possible, we implemented a short-term fix: we modified the load balancer health check policy to ignore the failing kube-proxy health checks. This allowed traffic to reach the nodes even though kube-proxy was still missing the iptables extensions it needed.
This short-term fix restored service without modifying workloads or infrastructure.
The Real Fix
To fully eliminate the risk of recurrence, we began migrating all affected clusters from the ubuntu_containerd image family to cos_containerd, which uses a different kernel that doesn’t have the broken nf_tables behavior.
This switch involved recreating nodes with the new image type, which causes a small amount of downtime, but we considered the stability benefits to be worth it.
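The migration itself is done per node pool; roughly the following, with placeholder cluster, zone, and pool names (nodes are drained and recreated with the new image):
# Move the node pool from the Ubuntu image family to Container-Optimized OS
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --node-pool default-pool \
  --image-type COS_CONTAINERD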
We’re also actively tracking the community issue we filed with Google and waiting on a patched Ubuntu image. As of this writing, GKE has not released a fixed image version.
What We Recommend to Others
If you’re running GKE with ubuntu_containerd, we suggest taking the following steps:
- Disable kube-proxy health checks in your load balancer by switching to externalTrafficPolicy: Cluster. This restores traffic in the short term (a minimal patch example follows this list).
- Switch to cos_containerd as your node image for production workloads.
- Pause auto-upgrades where possible until patched images are confirmed and tested.
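For the first item, the change is a one-line patch to the Service behind your load balancer; the service and namespace names below are placeholders:
# Switch the Service's external traffic policy from Local to Cluster
kubectl patch service my-clickhouse-lb -n my-namespace \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
Keep in mind that with Cluster the client source IP is no longer preserved to the backend pods and traffic can take an extra hop between nodes, so treat this as a stopgap rather than a permanent setting.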
What We’re Doing Going Forward
This incident highlighted a key blind spot in how we handle upstream updates. To avoid similar surprises in the future, we’re:
- Expanding our image pre-testing pipelines to catch regressions earlier.
- Enabling notifications for upcoming auto-upgrades to improve readiness.
- Making GKE maintenance windows required across our fleet to control when upgrades roll out. (A sketch of both settings follows this list.)
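Both of those settings are plain gcloud configuration; a rough sketch, with placeholder cluster, zone, Pub/Sub topic, and window times:
# Publish upgrade notifications to a Pub/Sub topic
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --notification-config pubsub=ENABLED,pubsub-topic=projects/my-project/topics/gke-cluster-notifications
# Restrict automatic upgrades to a recurring weekend maintenance window
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --maintenance-window-start 2025-06-01T03:00:00Z \
  --maintenance-window-end 2025-06-01T07:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'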
Need Help?
If you’re impacted by this, feel free to reach out to our team on Slack. We’re happy to help you work through mitigation or migration options.
Thanks for reading, and thank you for trusting us to run ClickHouse at scale — even when kernels fight back.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.
Thanks for posting this issue.
We came across the same issue in a different form – NLB health-checks fail, causing Kubernetes services to die.
Pretty high severity regression – do you know if Google has an open issue for fixing the regression in UBUNTU_CONTAINERD?
We’re tracking the community issue we submitted here: https://issuetracker.google.com/issues/418916546?pli=1
It looks like Google hasn’t opened any other issues for this yet.