Kubernetes stale conntrack during rolling updates: intermittent connection resets

You deploy a new version of a service. The rollout reports healthy. Error budgets look fine. Then you notice sporadic connection resets in application logs, a handful of timeout errors, or brief latency spikes that correlate exactly with pod terminations. The failures are intermittent, last only seconds, and never trigger a full outage. This is the stale conntrack race during rolling updates.

When a backend pod terminates, kube-proxy removes the endpoint from iptables or IPVS rules, but the kernel connection tracking table may still hold state for active TCP flows. Return packets matched against stale entries bypass NAT reversal, reach the client with an unexpected source IP, and trigger a RST. The result is a brief burst of connection failures that operators often misattribute to application bugs or network flapping.

What this means

kube-proxy implements Kubernetes Services by programming DNAT rules that rewrite traffic from a ClusterIP to a backend Pod IP. The kernel’s conntrack subsystem tracks these NAT mappings so return packets are rewritten back to the Service IP.

During a rolling update, a pod enters Terminating. kube-proxy removes the endpoint from the Service backend set, usually by deleting the KUBE-SEP-* iptables chain or removing the real server from IPVS. For active or recently active TCP connections, the conntrack entry often persists after the endpoint is gone and still lists the old Pod IP as the source of the reply direction. When a return packet for one of those flows fails conntrack's checks, its DNAT is not reversed: the source is never rewritten to the Service IP, and the packet is forwarded to the client with the Pod IP as the source.

In many cases the node's kernel has marked the return packet as INVALID, which is why it bypasses NAT reversal. The client's TCP stack opened its connection to the Service IP, so it has no socket that matches a packet arriving from the Pod IP. It responds with a TCP RST, killing the connection. Because this only affects flows that were active when the old pod terminated, the failures are intermittent and brief.

The issue is most visible in clusters with high connection reuse, long-lived requests, or workloads that open many concurrent connections to a Service during a rollout. It is not a kube-proxy crash or a network partition. It is a state synchronization race between the control plane’s view of endpoints and the kernel’s view of active connections.
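To make the NAT mapping concrete, here is a minimal sketch that reads `conntrack -L` text and prints the original destination next to the reply source for every entry where the two differ, which is what a DNATed Service flow looks like. The function name `dnat_mappings` is hypothetical, and the sketch assumes the usual conntrack line layout in which the original tuple is printed before the reply tuple.

```shell
# Print "<original dst> -> <reply src>" for each entry where DNAT changed the
# destination. Reads `conntrack -L` text on stdin.
dnat_mappings() {
  awk '{
    n = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^src=/) src[++n] = substr($i, 5)   # start of a new tuple
      if ($i ~ /^dst=/) dst[n] = substr($i, 5)
    }
    # Tuple 1 is the original direction, tuple 2 is the reply direction.
    if (n == 2 && dst[1] != src[2]) print dst[1] " -> " src[2]
  }'
}
# Usage on a node, as root:
#   conntrack -L 2>/dev/null | dnat_mappings | sort | uniq -c
```

For a Service flow the left side is the ClusterIP and the right side is the backend Pod IP. Entries that still map to a Pod IP after that pod is gone are the stale state this article describes.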

flowchart TD
    A[Rolling update: old pod terminates] --> B[kube-proxy removes endpoint from rules]
    B --> C[Kernel conntrack entry for active TCP flow persists]
    C --> D[Return packet from old Pod IP arrives]
    D --> E[Conntrack state bypasses DNAT reversal]
    E --> F[Client receives packet from Pod IP instead of Service IP]
    F --> G[Client sends TCP RST]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Race between endpoint removal and conntrack flush | Intermittent RSTs on existing connections exactly when old pods terminate | Stale conntrack entries pointing to a terminated Pod IP |
| Insufficient graceful termination period | RSTs spike immediately when a pod enters Terminating, before in-flight requests complete | terminationGracePeriodSeconds and application SIGTERM handling |
| Application exits immediately on SIGTERM | App stops serving while conntrack still holds active flow state | Container lifecycle hooks and graceful shutdown logic |
| kube-proxy sync lag | Stale rules persist longer than expected, widening the race window | kubeproxy_sync_proxy_rules_last_timestamp_seconds age |

Quick checks

# Check for stale conntrack entries pointing to a terminated Pod IP
conntrack -L | grep <old-pod-ip>
# Watch conntrack entries for a Service ClusterIP during a rollout
watch "conntrack -L -d <service-cluster-ip>"
# Check the last time kube-proxy successfully synced rules
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Verify how long the pod has to drain connections
kubectl get pod <pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}'
# Check whether the kernel marks unusual TCP packets as INVALID
sysctl net.netfilter.nf_conntrack_tcp_be_liberal

What good and bad output looks like

  • conntrack -L | grep <old-pod-ip> returning entries after the pod is fully terminated confirms stale state.
  • kubeproxy_sync_proxy_rules_last_timestamp_seconds more than 60 seconds old suggests kube-proxy is not completing rule syncs, so endpoint changes are not being programmed.
  • terminationGracePeriodSeconds of 30 or less is often too short for applications with long in-flight requests.
  • nf_conntrack_tcp_be_liberal returning 0 means the kernel is strict about marking out-of-window packets as INVALID, which amplifies the RST behavior.
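The timestamp metric is a raw Unix time, so it is easier to judge as an age. A minimal sketch, assuming the metric is exposed as a single unlabeled sample (if your kube-proxy version adds labels, the match needs adjusting); the function name `sync_age_seconds` is hypothetical:

```shell
# Print the age in seconds of the last kube-proxy rule sync.
# $1: current Unix time; stdin: Prometheus text from the kube-proxy metrics endpoint.
sync_age_seconds() {
  awk -v now="$1" '
    $1 == "kubeproxy_sync_proxy_rules_last_timestamp_seconds" {
      # The sample value may be in scientific notation; awk converts it to a number.
      printf "%d\n", now - $2
    }'
}
# Usage on a node (same metrics address as the quick check above):
#   curl -s http://localhost:10249/metrics | sync_age_seconds "$(date +%s)"
```

Passing the current time as an argument keeps the function easy to run against a saved metrics dump from an earlier incident.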

How to diagnose it

Follow this flow to confirm stale conntrack is causing your resets and to identify the contributing factor.

  1. Correlate failures with deployment events.
    Check application logs, ingress metrics, or client-side connection reset counters. If RSTs or timeouts cluster within seconds of a pod entering Terminating, the correlation confirms the race.

  2. Identify the affected node and service.
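    Map the destination IP and port of the failing connections to a Kubernetes Service, then determine which node performed the DNAT for that traffic. For ClusterIP traffic this is usually the node running the client pod; for NodePort or external traffic it is the node that received the connection. This is where the stale conntrack entry lives.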

  3. Capture conntrack state on the node.
    SSH to the node or use a privileged debug pod. Run conntrack -L | grep <old-pod-ip> for the terminated pod, and conntrack -L -d <service-cluster-ip> for the Service. If entries for the old Pod IP persist after the pod is deleted and no longer appears in kubectl get endpoints, the race is confirmed.

  4. Verify kube-proxy processed the endpoint removal.
    Query kubeproxy_sync_proxy_rules_last_timestamp_seconds and check kube-proxy logs. If the timestamp is recent and logs show the endpoint was removed, the issue is the kernel conntrack race, not a sync failure. If the timestamp is stale, investigate kube-proxy sync lag first.

  5. Evaluate graceful termination behavior.
    Check whether the application stops accepting new connections and waits for in-flight work to finish after receiving SIGTERM. If the process exits immediately on SIGTERM, it is dying before the conntrack entries naturally expire, widening the window for resets.

  6. Check for kernel-level INVALID drops.
    If nf_conntrack_tcp_be_liberal is 0, the kernel may be marking legitimate return packets as INVALID because their sequence numbers fall outside the window conntrack expects. Setting this sysctl to 1 tells the kernel to mark only out-of-window RST segments as INVALID, which is a common mitigation.
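For step 3, a plain grep matches substrings, so searching for 10.244.2.7 also returns entries for 10.244.2.70 and entries where the IP is only the client. The sketch below matches the Pod IP exactly and only as the reply-direction source, which is where a DNATed backend appears. The function name `entries_dnat_to_pod` is hypothetical; on a live node, `conntrack -L --reply-src <old-pod-ip>` should give a similar result, and the function is mainly useful against captured output.

```shell
# Print conntrack entries whose reply-direction source is exactly the given Pod IP.
# $1: Pod IP; stdin: `conntrack -L` text.
entries_dnat_to_pod() {
  awk -v ip="$1" '{
    seen = 0
    for (i = 1; i <= NF; i++)
      # The second src= field on a line belongs to the reply tuple.
      if ($i ~ /^src=/ && ++seen == 2 && substr($i, 5) == ip) { print; break }
  }'
}
# Usage on a node, as root:
#   conntrack -L 2>/dev/null | entries_dnat_to_pod <old-pod-ip>
```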

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Frozen timestamp means endpoint removals are not being programmed | Timestamp more than 60 seconds old |
| kubeproxy_sync_proxy_rules_duration_seconds | Long syncs delay endpoint removal and extend the conntrack race window | p99 sync duration exceeds 5 seconds |
| kubeproxy_sync_proxy_rules_endpoint_changes_total | High endpoint churn forces frequent full syncs and can delay removal | Rate sustained above baseline during rollouts |
| rest_client_requests_total from kube-proxy | API connectivity loss prevents timely rule updates | Error codes 4xx or 5xx from API server |
| Conntrack table utilization | General pressure amplifies drop and reset behavior | nf_conntrack_count above 75 percent of nf_conntrack_max |
| Application connection reset rate | Direct symptom of the failure mode | Spikes correlating with rolling updates |
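If you do not yet collect conntrack utilization, the percentage can be derived from the two kernel counters. A minimal sketch; the function name `conntrack_utilization_pct` is hypothetical, and the result is rounded down to a whole percent:

```shell
# Print conntrack table utilization as an integer percentage.
# $1: current entry count, $2: table maximum.
conntrack_utilization_pct() {
  echo $(( $1 * 100 / $2 ))
}
# Usage on a node:
#   conntrack_utilization_pct \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```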

Fixes

If the cause is a conntrack NAT race

The most effective kernel-level mitigation is to set net.netfilter.nf_conntrack_tcp_be_liberal=1 on every node. This prevents the kernel from marking legitimate return packets as INVALID solely because their TCP sequence numbers fall outside the expected window after an endpoint change. Many managed Kubernetes distributions set this automatically; if you run self-managed nodes, add it to your node bootstrap or a sysctl-tuning DaemonSet.
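On self-managed nodes, one way to persist the setting is a sysctl.d drop-in. This is a sketch rather than a complete bootstrap step: the function name and the file name in the usage line are hypothetical, and it assumes the nf_conntrack module is already loaded when the settings are applied, since the key does not exist before that.

```shell
# Write a sysctl.d drop-in that enables liberal TCP window tracking.
# $1: path of the drop-in file to create or overwrite.
write_be_liberal_conf() {
  printf '%s\n' 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > "$1"
}
# Usage on a node, as root:
#   write_be_liberal_conf /etc/sysctl.d/90-conntrack-liberal.conf
#   sysctl --system                                    # reload sysctl configuration files
#   sysctl net.netfilter.nf_conntrack_tcp_be_liberal   # confirm the value is 1
```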

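As a defensive filter, you can drop INVALID state packets before they leave the node. Return traffic from a pod is forwarded by the node rather than addressed to it, so the rule belongs in the FORWARD chain, not INPUT:

iptables -t filter -I FORWARD -p tcp -m conntrack --ctstate INVALID -j DROP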

Warning: This drops packets and does not preserve the connection. Test in staging first.
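Before adding a rule like this, it is worth checking whether an equivalent one already exists, because some kube-proxy versions install their own INVALID drop rule in a chain such as KUBE-FORWARD. The sketch below inspects saved ruleset text; the function name `has_invalid_drop` is hypothetical, and the pattern assumes the rule is written in the usual `iptables-save` form.

```shell
# Succeed if the ruleset already drops conntrack-INVALID packets in the given chain.
# $1: chain name; stdin: `iptables-save -t filter` text.
has_invalid_drop() {
  grep -Eq -- "^-A $1 .*--ctstate INVALID .*-j DROP"
}
# Usage on a node, as root:
#   iptables-save -t filter | has_invalid_drop KUBE-FORWARD && echo "already present"
```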

If you are running Kubernetes 1.26 or later, terminating endpoints are enabled by default. This feature keeps a terminating pod registered in the EndpointSlice during its grace period, giving kube-proxy a consistent backend target and reducing orphaned conntrack state. Ensure your cluster is not disabling this behavior.

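To clear stale state during an active incident, delete the entries that were DNATed to the old Pod IP. For Service traffic the Pod IP is the reply-direction source, not the original destination, so match on it explicitly:

conntrack -D -p tcp --reply-src <old-pod-ip>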

If the cause is premature pod shutdown

Increase terminationGracePeriodSeconds to give the application enough time to drain active connections before the container runtime sends SIGKILL. The value should exceed the longest in-flight request duration your application handles.

Ensure the application responds to SIGTERM by closing its listener and waiting for open requests to complete, rather than exiting immediately. If your framework or container image does not handle this natively, add a preStop lifecycle hook that signals the application to begin draining before Kubernetes sends SIGTERM.
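A quick way to sanity-check the sizing is to compare the configured grace period with the time the pod actually needs. The sketch below assumes that time spent in a preStop hook counts against the grace period, so it is added to the longest request duration. The function name and the example numbers are hypothetical.

```shell
# Succeed if the configured grace period exceeds the time the pod needs to drain.
# $1: terminationGracePeriodSeconds
# $2: longest in-flight request duration in seconds
# $3: preStop hook duration in seconds (use 0 if there is no hook)
grace_period_sufficient() {
  [ "$1" -gt $(( $2 + $3 )) ]
}
# Usage (45 and 5 are illustrative values, not recommendations):
#   grace=$(kubectl get pod <pod> -o jsonpath='{.spec.terminationGracePeriodSeconds}')
#   grace_period_sufficient "$grace" 45 5 || echo "grace period too short"
```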

If the cause is UDP-specific conntrack lag

For UDP services, including CoreDNS, conntrack cleanup behavior differs from TCP. If you are in IPVS mode, stale UDP session affinity can cause traffic to stick to dead backends. Reduce the IPVS UDP timeout to limit persistence:

ipvsadm --set <tcp_timeout> <tcp_fin_timeout> <udp_timeout>
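Because `ipvsadm --set` takes all three timeouts at once, a small helper can keep the current TCP values and change only the UDP one. This is a sketch under the assumption that `ipvsadm -l --timeout` prints a single line ending in the three values in tcp, tcpfin, udp order; the function name and the 30-second value are hypothetical.

```shell
# Print the three arguments for `ipvsadm --set`, keeping the current TCP and
# TCP-FIN timeouts and replacing only the UDP timeout.
# $1: new UDP timeout in seconds; stdin: output of `ipvsadm -l --timeout`.
ipvs_timeouts_with_udp() {
  awk -v udp="$1" 'NF >= 3 { print $(NF-2), $(NF-1), udp; exit }'
}
# Usage on a node, as root:
#   ipvsadm --set $(ipvsadm -l --timeout | ipvs_timeouts_with_udp 30)
```

Check the printed values before applying them, since a timeout of 0 leaves the corresponding setting unchanged.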

If you are in iptables mode and see UDP drops during rollouts, verify that kube-proxy is flushing UDP conntrack entries on endpoint removal. Manual flushing with conntrack -D can provide temporary relief.

Prevention

  • Set net.netfilter.nf_conntrack_tcp_be_liberal=1 on all nodes and persist the setting across reboots via sysctl configuration.
  • Validate that every service implements graceful shutdown: stop accepting new connections, complete in-flight work, then exit.
  • Size terminationGracePeriodSeconds to match your actual request latency distribution, not just a default value.
  • Monitor kube-proxy sync latency and endpoint change rates. A kube-proxy that cannot keep up with endpoint churn leaves a larger window for stale conntrack races.
  • Run load tests that include rolling updates under realistic connection concurrency to measure baseline reset rates before changes reach production.

How Netdata helps

In Netdata, look for:

  • kubeproxy_sync_proxy_rules_duration_seconds and endpoint change rates climbing before resets.
  • Node-level conntrack utilization and INVALID packet rate rising during rollouts.
  • Application error spikes aligned with pod termination events in the Deployment timeline.