When a GKE Auto-Upgrade Broke kube-proxy: What Happened and How We Responded

It started with an OpsGenie alert.
It was just another Saturday, or so I thought. One of our ClickHouse® clusters in Altinity.Cloud had suddenly gone silent — no queries, no metrics, not even connection logs from customers trying to connect. The pods were all running and the nodes were healthy, but no traffic was reaching the database.
We were in the middle of a total ingress outage. And not just on one cluster. Others were blinking red as well. Something was blocking external access across several GKE (Google Kubernetes Engine) clusters.
This is the story of how we chased the issue down a rabbit hole involving kernel regressions, IPv6 iptables, and kube-proxy, and how we found both a short-term workaround and a long-term fix, with recommendations for what you can do if you’re affected, as well as some lessons about the inner workings of Kubernetes.
What Went Wrong
Early on May 17, several of our GKE-based Altinity.Cloud clusters began experiencing a complete loss of ingress traffic. ClickHouse services running on these clusters became unreachable, and we quickly traced the issue to newly upgraded worker nodes.
The affected nodes were automatically upgraded to a new ubuntu_containerd node image, which ships with an updated kernel. Shortly after the upgrade, traffic to and from these nodes dropped to zero.
The specific affected image version is ubuntu_containerd 1.32.3-gke.1785003 with kernel version 6.8.0-1022-gke — if you’re using these versions, read on for our fix and recommendations.
Load balancer health checks were failing, and kube-proxy logs showed that it could not apply iptables rules correctly, particularly for IPv6. It was time to roll up our sleeves and fix the issue ASAP.
Digging Into the Root Cause
Clue #1: kube-proxy errors
The key clue was in the logs. kube-proxy was continuously failing with errors like:
Warning: Extension MARK revision 0 not supported, missing kernel module?
ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"
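If you’re triaging a similar outage, this is roughly how the errors surface; the node and pod names below are placeholders, and the grep pattern is only illustrative:
# On GKE, kube-proxy runs as a static pod named kube-proxy-<node-name> in kube-system
kubectl -n kube-system logs kube-proxy-gke-example-pool-1a2b --tail=200 | grep -iE 'mark|ip6tables'
# Reproduce the failure directly on an affected node (via SSH) by requesting the same MARK operation kube-proxy needs
sudo ip6tables -t mangle -N MARK-TEST
sudo ip6tables -t mangle -A MARK-TEST -j MARK --xor-mark 0x4000    # fails on the regressed kernel
sudo ip6tables -t mangle -F MARK-TEST && sudo ip6tables -t mangle -X MARK-TEST    # clean up the test chain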
Clue #2: Rogue IPv6 Rules
Even though our workloads don’t explicitly use IPv6, GKE nodes process IPv6 iptables rules by default. The upgraded kernel lacked support for certain extensions used by kube-proxy, due to a regression in how the nf_tables module was packaged or loaded.
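Both halves of that problem are easy to see from a node shell. A quick sketch, assuming SSH access to a worker node:
# IPv6 KUBE-* chains are programmed even when no workload uses IPv6
sudo ip6tables-save | grep -c KUBE
# The kube-proxy error ("missing kernel module?") points at the xtables MARK extension;
# check the iptables backend and which netfilter modules the node actually has loaded
ip6tables --version                              # e.g. ip6tables v1.8.9 (nf_tables)
lsmod | grep -E 'nf_tables|ip6_tables|xt_mark'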
Suspected Culprit: Kernel Regression in Ubuntu Nodes
This wasn’t a misconfiguration on our side; it’s part of an ongoing issue in the Ubuntu kernel series, already being tracked upstream, including in the issue we filed with Google: https://issuetracker.google.com/issues/418916546
Why This Was Hard to Catch
Several factors made this issue particularly disruptive:
- GKE auto-upgrades nodes, and for clusters on a release channel there is no way to opt out entirely, only to defer upgrades with maintenance windows and exclusions.
- IPv6 iptables rules are evaluated even if you don’t use IPv6.
- The issue was not caught during GKE image validation, meaning broken images reached production clusters.
- There was no working image to roll forward to — newer builds had the same regression.
Mitigation and Recovery
We had identified the likely culprit: the node upgrade. As an immediate first step we added maintenance exclusions to pause all upgrades for unaffected clusters; this prevented the issue from cascading to all of our GKE-hosted clusters.
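If you want to do the same, maintenance exclusions are set per cluster; a rough sketch with a placeholder cluster name, zone, and dates (the allowed scope and maximum duration depend on your release channel):
# Pause all upgrades on this cluster for roughly a month
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --add-maintenance-exclusion-name pause-node-upgrades \
  --add-maintenance-exclusion-start 2025-05-17T00:00:00Z \
  --add-maintenance-exclusion-end 2025-06-16T00:00:00Z \
  --add-maintenance-exclusion-scope no_upgrades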
Then, we started looking for a fix for the affected clusters. During troubleshooting, we attempted to roll forward to another available image version (1.32.3-gke.1927002), but it exhibited the same kernel issue.
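For reference, rolling a node pool to a specific node version looks roughly like this (cluster, zone, and pool names are placeholders):
# Roll the node pool to a specific GKE node version (nodes are recreated)
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --node-pool default-pool \
  --cluster-version 1.32.3-gke.1927002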
In order to get operational again as quickly as possible, we implemented a short-term fix: we modified the load balancer health check policy to ignore the failing kube-proxy health checks. This allowed traffic to reach the nodes even though kube-proxy was still missing the iptables extensions it needed.
This short-term fix restored service without modifying workloads or infrastructure.
The Real Fix
To fully eliminate the risk of recurrence, we began migrating all affected clusters from the ubuntu_containerd image family to cos_containerd, which uses a different kernel that doesn’t have the broken nf_tables behavior.
This switch involved recreating nodes with the new image type, which causes a small amount of downtime, but we considered the stability benefits to be worth it.
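The migration itself is done per node pool; roughly the following, with placeholder cluster, zone, and pool names (nodes are drained and recreated with the new image):
# Move the node pool from the Ubuntu image family to Container-Optimized OS
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --node-pool default-pool \
  --image-type COS_CONTAINERD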
We’re also actively tracking the community issue we filed with Google and waiting on a patched Ubuntu image. As of this writing, GKE has not released a fixed image version.
What We Recommend to Others
If you’re running GKE with ubuntu_containerd, we suggest taking the following steps:
- Disable kube-proxy health checks in your load balancer by switching to externalTrafficPolicy: Cluster. This restores traffic in the short term (a minimal patch example follows this list).
- Switch to cos_containerd as your node image for production workloads.
- Pause auto-upgrades where possible until patched images are confirmed and tested.
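For the first item, the change is a one-line patch to the Service behind your load balancer; the service and namespace names below are placeholders:
# Switch the Service's external traffic policy from Local to Cluster
kubectl patch service my-clickhouse-lb -n my-namespace \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
Keep in mind that with Cluster the client source IP is no longer preserved to the backend pods and traffic can take an extra hop between nodes, so treat this as a stopgap rather than a permanent setting.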
What We’re Doing Going Forward
This incident highlighted a key blind spot in how we handle upstream updates. To avoid similar surprises in the future, we’re:
- Expanding our image pre-testing pipelines to catch regressions earlier.
- Enabling notifications for upcoming auto-upgrades to improve readiness.
- Making GKE maintenance windows required across our fleet to control when upgrades roll out. (A sketch of both settings follows this list.)
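Both of those settings are plain gcloud configuration; a rough sketch, with placeholder cluster, zone, Pub/Sub topic, and window times:
# Publish upgrade notifications to a Pub/Sub topic
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --notification-config pubsub=ENABLED,pubsub-topic=projects/my-project/topics/gke-cluster-notifications
# Restrict automatic upgrades to a recurring weekend maintenance window
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --maintenance-window-start 2025-06-01T03:00:00Z \
  --maintenance-window-end 2025-06-01T07:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'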
Need Help?
If you’re impacted by this, feel free to reach out to our team on Slack. We’re happy to help you work through mitigation or migration options.
Thanks for reading, and thank you for trusting us to run ClickHouse at scale — even when kernels fight back.
ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc.
Thanks for posting this issue.
We came across the same issue in a different form – NLB health-checks fail, causing Kubernetes services to die.
Pretty high severity regression – do you know if Google has an open issue for fixing the regression in UBUNTU_CONTAINERD?
We’re tracking the community issue we submitted here: https://issuetracker.google.com/issues/418916546?pli=1
It looks like Google hasn’t opened any other issues for this yet.