Kubernetes kube-proxy IPVS: stale rules and session affinity issues
DNS queries start timing out from one node after a CoreDNS rolling update. A UDP Service returns timeouts for some clients but not others. New Services are unreachable from a specific node while older Services continue to work.
In IPVS mode, kube-proxy programs the kernel’s IPVS table with virtual servers and real servers. The IPVS connection table lives outside kube-proxy’s direct control and outside nf_conntrack. That separation creates two IPVS-specific failure modes: stale rules that diverge from EndpointSlice state, and UDP session affinity that sticks to dead backends long after a pod terminates.
What this means
kube-proxy in IPVS mode does not proxy traffic in userspace. It creates IPVS virtual servers for each Service ClusterIP and registers real servers for each endpoint. The kernel handles forwarding via hash lookups. This scales better than iptables for large clusters, but it introduces two failure modes that behave differently than in iptables mode.
First, stale rules. If kube-proxy’s sync loop falls behind, its API server watch silently dies, or a sync fails partially, the IPVS virtual servers and real servers on a node may diverge from the current EndpointSlice state. Traffic continues to flow through the kernel, but it flows to backends that no longer exist or misses backends that were just created.
Second, UDP session affinity sticking to dead backends. IPVS maintains its own connection table, separate from nf_conntrack. For UDP traffic, IPVS treats a flow from a specific source IP and source port as a session and remembers which real server received the first packet. When that backend pod is terminated, the IPVS connection entry remains active with a default UDP timeout of 300 seconds. New packets from the same client continue to be forwarded to the dead pod IP until that timeout expires. kube-proxy’s conntrack cleanup flushes nf_conntrack entries, but it does not flush the IPVS connection table. This is particularly devastating for CoreDNS and other UDP-based cluster services.
flowchart TD
A[UDP packet to ClusterIP] --> B{IPVS connection table lookup}
B -->|Existing entry found| C[Forward to old backend IP]
B -->|No entry| D[Round-robin to current backend]
C --> E{Old pod terminated?}
E -->|Yes| F[Packet blackholed or dropped]
E -->|No| G[Normal response]
F --> H[Client retries from same source IP and port]
H --> A
D --> I[Create new IPVS connection entry]
I --> G
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| IPVS UDP timeout keeping dead backends | DNS or UDP timeouts from specific nodes after a pod rollout; only some clients affected | ipvsadm -Lcn showing connections to terminated pod IPs |
| Silent API server watch death | New Services unreachable from one node while existing Services work; no obvious errors | Age of kubeproxy_sync_proxy_rules_last_timestamp_seconds on the node |
| Sync loop backlog | Endpoint changes take minutes to appear in IPVS; rolling updates cause intermittent drops | kubeproxy_sync_proxy_rules_duration_seconds p99 versus the sync period |
| IPVS real server leak | Deleted pod IP still appears as a real server with traffic directed to it | ipvsadm -Ln real server list compared to current EndpointSlices |
| Conntrack exhaustion alongside IPVS | UDP packets dropped despite healthy endpoints; DNS fails cluster-wide | conntrack -S drop counter and nf_conntrack_count versus nf_conntrack_max |
Quick checks
# List IPVS virtual servers and real servers
sudo ipvsadm -Ln
# Show IPVS connection table entries with source-to-backend mappings
sudo ipvsadm -Lcn
# Check age of the last successful sync from kube-proxy metrics
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds
# Check sync duration percentiles
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_duration_seconds
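# Compare current endpoint IPs to IPVS real servers for a Service
# (the cluster DNS Service is named kube-dns in many clusters; adjust the namespace and name to match yours)
kubectl get endpointslices -n kube-system -l kubernetes.io/service-name=coredns -o json | jq -r '.items[].endpoints[].addresses[]'
# Use -u for the UDP virtual server and -t for the TCP one
sudo ipvsadm -Ln -u <cluster-ip>:53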
# Check conntrack entries for a specific ClusterIP
sudo conntrack -L -d <cluster-ip> 2>/dev/null
# Check kube-proxy logs for IPVS or sync errors
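# With a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=500 --prefix | grep -iE "error|ipvs|sync"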
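The raw timestamp from the metrics endpoint is easier to act on once it is converted to an age in seconds. The following is a minimal sketch, not part of kube-proxy: the function name sync_age_seconds is hypothetical, it assumes the metric is exposed in Prometheus text format, and if several series exist (for example one per IP family) it reports the oldest one.

```shell
# Sketch: print how many seconds ago kube-proxy last completed a sync.
# Reads Prometheus text-format metrics on stdin. The optional first argument
# is "now" in epoch seconds, which makes the function testable offline.
sync_age_seconds() {
  awk -v now="${1:-$(date +%s)}" '
    # Match the bare metric name or the name followed by a label set.
    # "# HELP" and "# TYPE" lines are skipped because their first field is "#".
    $1 ~ /^kubeproxy_sync_proxy_rules_last_timestamp_seconds([{]|$)/ {
      ts = $2 + 0            # forces numeric parsing, including 1.7e+09 style
      if (!found || ts < oldest) oldest = ts
      found = 1
    }
    END {
      if (!found) exit 1     # metric missing: signal failure to the caller
      printf "%d\n", now - oldest
    }
  '
}

# Example use on a node (assumes the metrics port from the checks above):
#   curl -s http://localhost:10249/metrics | sync_age_seconds
```

A non-zero exit status means the metric was not found in the input, which is itself worth investigating.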
How to diagnose it
1. Confirm the node is in IPVS mode. Run sudo ipvsadm -Ln. If it returns virtual servers, the node is using IPVS. If the output is empty, kube-proxy may be in iptables mode and this article’s IPVS-specific guidance does not apply. Why: iptables mode does not maintain a separate per-flow connection table for UDP. Sticking to dead backends is an IPVS-specific behavior.
2. Check for stuck UDP session affinity. Run sudo ipvsadm -Lcn | grep <old-pod-ip> to see if IPVS connections still point to a terminated backend. Why: IPVS tracks UDP flows independently. Even after nf_conntrack entries expire, IPVS can retain the source-to-backend mapping for the duration of its UDP timeout.
3. Verify kube-proxy sync health. Query kubeproxy_sync_proxy_rules_last_timestamp_seconds on http://localhost:10249/metrics. If the value is older than 2-3 minutes, the sync loop is stalled or the API watch is dead. Why: A kube-proxy process can pass its healthz check while operating on frozen state. The last successful sync timestamp is the correctness signal; healthz is only a liveness signal.
4. Compare API state to IPVS state. Get the current EndpointSlice addresses for the affected Service and compare them against the real servers shown in sudo ipvsadm -Ln. Why: Discrepancies reveal whether the issue is stale rules from sync lag, or active IPVS connection entries that have outlived their endpoint.
5. Check for conntrack overlap. Run sudo conntrack -L -d <cluster-ip> and cross-reference the entries with sudo ipvsadm -Lcn. Why: Both tables can contain stale entries simultaneously. Cleaning nf_conntrack without addressing the IPVS table leaves the affinity problem unresolved.
6. Determine whether the scope is node-local or cluster-wide. If one node is affected, suspect a local kube-proxy watch failure or sync stall. If all nodes show the same stale entries or DNS timeouts, suspect a cluster-wide endpoint change event, a conntrack saturation wave, or a configuration issue. Why: IPVS state is per-node. Localized symptoms point to node-level kube-proxy failure rather than a cluster control plane outage.
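The comparison between API state and IPVS state can be scripted rather than done by eye. This is a rough sketch under stated assumptions: the helper names ipvs_real_servers and diff_backends are hypothetical, only IPv4 real servers are handled, and the parser assumes the usual ipvsadm -Ln layout in which each real server line starts with "->". Note that an EndpointSlice can list addresses that are not ready, so a "missing" result may be expected rather than a fault.

```shell
# Sketch: extract real server IPs from `ipvsadm -Ln` output read on stdin.
# Assumes IPv4 and the standard "-> IP:port ..." real server line format.
ipvs_real_servers() {
  awk '$1 == "->" && $2 ~ /^[0-9]+\./ {
    sub(/:[0-9]+$/, "", $2)   # drop the port, keep the IP
    print $2
  }' | sort -u
}

# Sketch: compare two files of IPs, one address per line.
#   $1: addresses from the EndpointSlices (API state)
#   $2: addresses from ipvs_real_servers (kernel state)
# Prints "stale <ip>" for IPs only in IPVS and "missing <ip>" for IPs only in the API.
diff_backends() {
  awk '
    NF == 0 { next }                          # ignore blank lines
    FILENAME == ARGV[1] { api[$1] = 1; next } # first file: API state
    { ipvs[$1] = 1 }                          # second file: IPVS state
    END {
      for (ip in ipvs) if (!(ip in api)) print "stale " ip
      for (ip in api) if (!(ip in ipvs)) print "missing " ip
    }
  ' "$1" "$2" | sort
}

# Example use on a node:
#   kubectl get endpointslices -n <namespace> -l kubernetes.io/service-name=<service> -o json \
#     | jq -r '.items[].endpoints[].addresses[]' > api.txt
#   sudo ipvsadm -Ln | ipvs_real_servers > ipvs.txt
#   diff_backends api.txt ipvs.txt
```

Running ipvsadm -Ln without a service address, as in the example, collects real servers for every virtual server on the node; pass a specific service address to limit the comparison to one Service.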
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kubeproxy_sync_proxy_rules_duration_seconds p99 | Measures time to reconcile IPVS rules | p99 exceeding 5-10 seconds indicates sync lag |
| kubeproxy_sync_proxy_rules_last_timestamp_seconds | Shows freshness of the last successful sync | Timestamp older than 2 minutes means stale rules |
| IPVS active connection count (ipvsadm -Ln) | Reveals load distribution and stuck flows | ActiveConn on a weight=0 or missing real server |
| rest_client_requests_total from kube-proxy | Tracks API server connectivity | 5xx, 403, or 429 errors indicate watch problems |
| nf_conntrack_count vs nf_conntrack_max | Shared kernel resource used alongside IPVS | Above 75% increases risk of silent connection drops |
| Conntrack drop counter (conntrack -S) | Confirms packet loss from full table | Any increment means new connections are failing |
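
The conntrack utilization signal in the table can be computed directly on a node. The sketch below is illustrative only: conntrack_pct and conntrack_status are hypothetical helper names, and the 75% threshold is simply the warning level from the table above.

```shell
# Sketch: integer percentage of the conntrack table in use.
#   $1: current entry count, $2: table maximum
conntrack_pct() {
  awk -v c="$1" -v m="$2" 'BEGIN {
    if (m <= 0) exit 1              # avoid dividing by zero on bad input
    printf "%d\n", (c * 100) / m    # truncates toward zero
  }'
}

# Sketch: print WARN at or above 75% utilization, OK otherwise.
conntrack_status() {
  pct=$(conntrack_pct "$1" "$2") || return 1
  if [ "$pct" -ge 75 ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}

# Example use on a node:
#   conntrack_status "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                    "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```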
Fixes
If UDP session affinity is stuck to a dead backend
Reduce the IPVS UDP timeout so stale entries expire faster. The default is 300 seconds.
# Reduce UDP timeout to 60 seconds (emergency; affects all IPVS UDP services on the node)
# Arguments are TCP, TCP_FIN, and UDP timeouts respectively
sudo ipvsadm --set 900 120 60
This change is immediate but non-persistent across reboots. Document the node and revert or persist via your node configuration management after the incident.
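The command above also rewrites the TCP and TCP_FIN timeouts, which matters if a node already runs with non-default values. One way to avoid overwriting them, sketched here, is to read the current values first and change only the UDP field. The helper name ipvs_set_udp_cmd is hypothetical, and the parser assumes that ipvsadm -L --timeout prints a line of the form "Timeout (tcp tcpfin udp): 900 120 300".

```shell
# Sketch: build an `ipvsadm --set` command that keeps the current TCP and
# TCP_FIN timeouts and changes only the UDP timeout.
# Reads `ipvsadm -L --timeout` output on stdin.
#   $1: desired UDP timeout in seconds
ipvs_set_udp_cmd() {
  awk -v udp="$1" '
    /^Timeout \(tcp tcpfin udp\):/ {
      # The last three fields are the TCP, TCP_FIN, and UDP timeouts.
      printf "ipvsadm --set %s %s %s\n", $(NF-2), $(NF-1), udp
      found = 1
    }
    END { if (!found) exit 1 }   # unexpected output format: fail loudly
  '
}

# Example use on a node (prints the command instead of running it):
#   sudo ipvsadm -L --timeout | ipvs_set_udp_cmd 60
```

Printing the command rather than executing it leaves room to review the values before applying them with sudo.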
To clear existing stuck entries immediately, delete the specific virtual server. kube-proxy recreates it on the next sync.
# WARNING: Traffic to this ClusterIP will drop until kube-proxy recreates the virtual server.
# Use only when waiting for timeout expiration is not acceptable.
# -u targets the UDP virtual server; use -t instead for a TCP Service.
sudo ipvsadm -D -u <cluster-ip>:<port>
If you cannot tolerate the brief drop, lower the timeout and wait for the stale entries to expire.
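To decide between waiting and deleting the virtual server, it helps to know how many entries still point at the dead backend and how long the longest one has left. The following sketch makes that estimate from ipvsadm -Lcn output. The function name stuck_udp_entries is hypothetical, and the parser assumes IPv4 addresses and the usual column order (pro, expire as MM:SS, state, source, virtual, destination).

```shell
# Sketch: summarize IPVS UDP connection entries that still target a given IP.
# Reads `ipvsadm -Lcn` output on stdin.
#   $1: IP of the terminated pod
stuck_udp_entries() {
  awk -v ip="$1" '
    $1 == "UDP" {
      dest = $6
      sub(/:[0-9]+$/, "", dest)   # destination column is IP:port
      if (dest != ip) next
      split($2, t, ":")           # expire column is MM:SS
      secs = t[1] * 60 + t[2]
      if (secs > max) max = secs
      count++
    }
    END { printf "%d entries, longest remaining %d s\n", count, max }
  '
}

# Example use on a node:
#   sudo ipvsadm -Lcn | stuck_udp_entries <old-pod-ip>
```

Keep in mind that an entry's expiry is refreshed whenever the client sends another packet on the same flow, so a busy client can hold an entry open longer than the reported value suggests.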
If kube-proxy sync is stalled
Restart kube-proxy on the affected node to force a full re-sync.
# Delete the kube-proxy pod; the DaemonSet will recreate it
kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>
After restart, allow 30 to 60 seconds for the initial sync to complete. The first sync programs all rules from scratch and will take longer than incremental syncs.
If conntrack is exhausting alongside IPVS issues
IPVS mode still uses nf_conntrack for masquerading. Increase the table limit as immediate relief.
# Immediate relief
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
Then investigate connection leaks or churn that are filling the table. Conntrack exhaustion affects all traffic on the node, not just the Service layer.
If IPVS real servers outlive their endpoints
When the sync loop cannot remove a real server, verify that the EndpointSlice has removed the backend. If the API state is correct but ipvsadm -Ln still shows the old real server, restart kube-proxy to force a full state rebuild.
Prevention
- Monitor kubeproxy_sync_proxy_rules_last_timestamp_seconds on every node and alert when it is older than two minutes. A passing healthz check does not guarantee that rules are current.
- Track IPVS UDP timeout defaults in your node baseline. If you run CoreDNS or other UDP services in IPVS mode, lower the UDP timeout proactively rather than waiting for an incident.
- Size nf_conntrack_max for your node workload density. IPVS mode still relies on conntrack for masquerading.
- During rolling updates of UDP-backed Services, watch ipvsadm -Lcn for connection counts to terminating pods. If counts are high, consider a controlled virtual server deletion before the update.
- Ensure API server load balancers support long-lived connections. Dropped watch connections are a leading cause of silent sync death.
- Periodically compare EndpointSlice state against ipvsadm -Ln output as a consistency check, especially after node recoveries or kube-proxy restarts.
How Netdata helps
Netdata correlates the signals that isolate IPVS failures from generic network issues:
- Per-node conntrack utilization and drop rates alongside kube-proxy process health to distinguish table exhaustion from rule staleness.
- kube-proxy metrics endpoint scraping for sync duration and last sync timestamp freshness.
- Kernel-level connection and socket metrics alongside container health to surface when IPVS entries stick to terminating backends.
Related guides
- For conntrack table exhaustion diagnosis, see Kubernetes conntrack exhaustion: dropped connections under load.
- For sync loop issues in iptables mode, see Kubernetes kube-proxy iptables sync stall: causes and recovery.
- For DNS-specific failure patterns inside pods, see Kubernetes DNS resolution failures inside pods.
- For the full signal taxonomy for Kubernetes networking, see Kubernetes monitoring checklist: the signals every production cluster needs.






