Kubernetes conntrack exhaustion: dropped connections under load

Intermittent connection timeouts under load in Kubernetes often trace to a full nf_conntrack table on the node. Existing TCP sessions stay open, but new connections fail silently. DNS resolution becomes unreliable. Application logs show timeouts to healthy dependencies. The root cause is usually not the application, network policy, or CNI, but kernel connection tracking exhaustion.

Every connection that traverses kube-proxy NAT rules creates an entry in the node’s nf_conntrack table. This finite, node-level table is shared by all workloads and invisible to most application monitoring. When it fills, the kernel drops new connection attempts without sending a TCP reset or ICMP error. The application sees a timeout.

What this means

The Linux kernel’s connection tracking subsystem (nf_conntrack) maintains state for every flow that passes through netfilter, and NAT depends on that state. kube-proxy uses DNAT and SNAT to implement Kubernetes Services, which means nearly every pod-to-service connection creates a conntrack entry. These entries persist until the connection closes or a timeout fires.

Because conntrack is a node-level resource, all pods, host processes, and kubelet operations share one table. When the table reaches nf_conntrack_max, the kernel cannot allocate new entries. It silently drops TCP SYN packets and the first packets of new UDP flows. Established connections continue because their entries are already in the table, which makes the failure appear random and workload-dependent rather than systemic.

DNS usually fails first. Each lookup is a new, short-lived UDP flow (pod to CoreDNS, and CoreDNS to upstream resolvers), so it needs a fresh conntrack entry. UDP has no close handshake, so its entries are removed only when a timeout expires. A node under pressure therefore often loses DNS before application TCP traffic fails.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Default table size too low for workload | Timeouts begin as traffic increases; small nodes with many pods hit limits first | nf_conntrack_count against nf_conntrack_max |
| High connection churn / TIME_WAIT accumulation | Microservices making many short HTTP requests fill the table with TIME_WAIT entries | conntrack -L -p tcp state distribution |
| UDP traffic accumulation | DNS, StatsD, or logging traffic creating entries that lack natural TCP cleanup | conntrack -L protocol breakdown |
| Connection leaks | Applications opening connections without closing them; entries never free until timeout | conntrack -L -d <pod-ip> age per destination |
| kube-proxy stale endpoint entries | Removed pods still have conntrack entries because cleanup failed or raced | conntrack -L -d <old-pod-ip> after a rollout |
| Bursty traffic or retry storms | A brief spike in new connections overwhelms the remaining table headroom | conntrack -S drop counter rate |

Quick checks

Run these checks on a node showing symptoms. They are read-only unless noted.

# Check conntrack utilization
awk '{c=$1} END {getline m < "/proc/sys/net/netfilter/nf_conntrack_max"; printf "%.1f%%\n", c*100/m}' /proc/sys/net/netfilter/nf_conntrack_count

# Check for active drops
conntrack -S

# Check kernel log for table-full messages
dmesg | grep -i "nf_conntrack.*table full"

# List conntrack entries by protocol to spot UDP accumulation
conntrack -L | awk '{print $1}' | sort | uniq -c | sort -rn

# Check TCP state distribution for TIME_WAIT bloat
conntrack -L -p tcp | awk '{print $4}' | sort | uniq -c | sort -rn

# Verify kube-proxy is syncing rules and not stuck
curl -s http://localhost:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds

# Check if IPVS conntrack is also filling (IPVS mode only)
ipvsadm -Lcn | wc -l
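
When a single node is affected, it can also help to see which destinations own the most entries. The following is a minimal sketch, not part of the original checks. It assumes the default conntrack -L line format, where the first dst= token on each line is the original-direction destination. top_destinations is a hypothetical helper name.

```shell
# Count conntrack entries per original destination IP.
# Reads `conntrack -L` style lines on stdin. Only the first dst= token
# on each line is counted; the reply tuple appears later on the line.
top_destinations() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^dst=/) {
                sub(/^dst=/, "", $i)
                count[$i]++
                break
            }
        }
    }
    END { for (ip in count) printf "%d %s\n", count[ip], ip }' |
        sort -k1,1nr -k2,2
}
```

A possible invocation is: conntrack -L 2>/dev/null | top_destinations | head -n 20. A Service ClusterIP or pod IP that dominates the list is a reasonable starting point for the leak and noisy-neighbor checks.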

How to diagnose it

  1. Confirm the table is full. Check the ratio of nf_conntrack_count to nf_conntrack_max. If utilization is above 85%, the node is in the danger zone. Above 95%, new connections are likely being dropped.

  2. Confirm drops are occurring. Run conntrack -S and look for a non-zero drop counter. If the counter is increasing, the kernel is actively dropping packets for new connections. Also check dmesg for the string nf_conntrack: table full, dropping packet, which is definitive proof.

  3. Determine whether the problem is isolated or widespread. Check the same metrics on other nodes. Conntrack exhaustion is usually workload-dependent. If only one node is affected, look for a noisy neighbor pod or a connection leak on that node. If many nodes are affected, suspect a cluster-wide traffic pattern or a default nf_conntrack_max that is too low for the workload.

  4. Identify which protocol is filling the table. Use conntrack -L grouped by protocol. If UDP dominates, suspect DNS, metrics, or logging traffic. If TCP dominates, inspect the TCP state distribution. A high proportion of TIME_WAIT indicates short-lived HTTP connections without reuse. A high proportion of ESTABLISHED indicates long-lived or leaked connections.

  5. Correlate with workload changes. Check if the issue started after a deployment rollout, a scale-up event, or a configuration change that increased connection rates. Rolling updates spike conntrack usage when old and new endpoints coexist and kube-proxy has not yet flushed old entries.

  6. Check kube-proxy sync health. A kube-proxy instance with a dead API server watch or a stalled sync loop may fail to clean up conntrack entries for removed endpoints. Verify that kubeproxy_sync_proxy_rules_last_timestamp_seconds is advancing and that the process is not crash-looping.

  7. Check for stale endpoint entries. After a rolling update, run conntrack -L -d <old-pod-ip> for IPs of terminated pods. If entries remain, kube-proxy’s cleanup did not run or lost a race. These stale entries consume table space until the TCP or UDP timeout expires.
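
The utilization thresholds from step 1 can be sketched as a small helper. This is an illustration only: classify_utilization and its level names are hypothetical, and the 70% warning level is taken from the alerting guidance elsewhere in this article.

```shell
# Map conntrack utilization to the thresholds used in this article.
# Usage: classify_utilization <count> <max>
# Levels (at or above): 95% dropping-likely, 85% danger, 70% warning.
classify_utilization() {
    awk -v c="$1" -v m="$2" 'BEGIN {
        if (m <= 0) { print "unknown"; exit }
        pct = c * 100 / m
        if (pct >= 95)      level = "dropping-likely"
        else if (pct >= 85) level = "danger"
        else if (pct >= 70) level = "warning"
        else                level = "ok"
        printf "%.1f%% %s\n", pct, level
    }'
}
```

On a node, the two arguments would come from /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max, for example: classify_utilization "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)".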

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| nf_conntrack_count / nf_conntrack_max | Measures table utilization. This is the primary indicator. | Sustained ratio above 70% |
| conntrack -S drop counter | Confirms active packet loss due to exhaustion. | Any increasing value |
| dmesg nf_conntrack: table full | Definitive kernel-level evidence of the failure. | Any occurrence |
| TCP TIME_WAIT ratio | Reveals connection churn from short-lived HTTP flows. | TIME_WAIT exceeds 50% of TCP entries |
| UDP conntrack entry count | UDP lacks connection close signals, so entries accumulate silently. | UDP count growing steadily without traffic decrease |
| kube-proxy sync timestamp age | Stale rules prevent endpoint cleanup, extending conntrack lifetime. | Last sync older than 2 minutes |
| CoreDNS SERVFAIL rate | DNS fails first because UDP queries are small and frequent. | coredns_dns_responses_total{rcode="SERVFAIL"} increasing |
| Application connection timeout rate | The user-visible symptom of silent SYN drops. | Timeouts correlating with specific nodes |
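
Because the drop counter matters only when it increases, a single reading is less useful than a difference between two readings. The sketch below assumes conntrack -S prints one line per CPU with key=value fields including drop=, which is the usual conntrack-tools format. sum_drops and drops_during are hypothetical helper names.

```shell
# Sum the drop= counters across all CPUs in `conntrack -S` output (stdin).
# early_drop= and other fields are ignored because they do not start with drop=.
sum_drops() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^drop=[0-9]+$/) { split($i, kv, "="); total += kv[2] }
    }
    END { print total + 0 }'
}

# Print how many drops occurred during an interval (default 10 seconds).
# Assumes `conntrack -S` is runnable by the current user (typically root).
drops_during() {
    interval="${1:-10}"
    before=$(conntrack -S | sum_drops)
    sleep "$interval"
    after=$(conntrack -S | sum_drops)
    echo $((after - before))
}
```

Running drops_during 30 on a suspect node and seeing any value above zero matches the "any increasing value" warning sign.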

Fixes

Immediate relief

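Increase the table size. The change takes effect immediately without restarting services. The value below is an example; choose one well above the node’s current nf_conntrack_max. kube-proxy can also set this sysctl at startup from its conntrack settings (--conntrack-max-per-core and --conntrack-min), so keep those aligned with any manual value.

# Raise the conntrack limit (temporary; example value)
sudo sysctl -w net.netfilter.nf_conntrack_max=131072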

# Persist across reboots by adding to sysctl.d
echo "net.netfilter.nf_conntrack_max=131072" | sudo tee /etc/sysctl.d/99-conntrack.conf

Flush stale entries. If you have confirmed that old pod IPs are filling the table after a rollout, you can delete entries for a specific dead IP. This is state-changing but low-risk if the IP is truly terminated.

# Remove entries for a terminated pod IP (state-changing)
conntrack -D -d <old-pod-ip>

If the cause is connection churn

Reduce TCP TIME_WAIT accumulation by enabling connection reuse and pooling in clients. Ensure connections are closed properly.

Tune conntrack timeouts if your workload is dominated by short-lived flows. Lowering nf_conntrack_tcp_timeout_time_wait from the default 120 seconds can help, but this changes kernel behavior globally and should be tested in staging first.

# Reduce TIME_WAIT timeout (test before applying in production)
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
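
Before lowering the timeout, it may be worth confirming that TIME_WAIT entries really dominate the TCP portion of the table. This is a minimal sketch, assuming the default conntrack -L -p tcp output where the fourth field is the TCP state. time_wait_share is a hypothetical helper name.

```shell
# Print TIME_WAIT as a percentage of TCP conntrack entries.
# Reads `conntrack -L -p tcp` lines on stdin; field 4 is the TCP state.
time_wait_share() {
    awk '$1 == "tcp" { total++; if ($4 == "TIME_WAIT") tw++ }
         END {
             if (total == 0) print "0.0"
             else printf "%.1f\n", tw * 100 / total
         }'
}
```

For example: conntrack -L -p tcp 2>/dev/null | time_wait_share. A result above 50 matches the warning sign listed for the TIME_WAIT ratio.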

If the cause is UDP accumulation

Reduce UDP stream timeout for DNS and metrics traffic. The default nf_conntrack_udp_timeout_stream can be too high for high-churn workloads, causing entries to accumulate.

sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=30

If the cause is kube-proxy stale rules

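Restart kube-proxy on the affected node to force a full resync and conntrack cleanup. Existing kernel rules persist during the brief restart, so established traffic keeps flowing, but Service and endpoint changes are not applied until the new pod completes its first sync.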

# Disruptive: restarts kube-proxy pods on the target node
kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<node>
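
To judge whether a restart is warranted, the age of the last sync can be computed from the kube-proxy metrics endpoint. The sketch below assumes the metric is exposed in Prometheus text format as kubeproxy_sync_proxy_rules_last_timestamp_seconds, possibly with labels, and takes the newest value if several series exist. sync_age_seconds is a hypothetical helper name.

```shell
# Print seconds since kube-proxy last synced its rules.
# Reads Prometheus text-format metrics on stdin; $1 is "now" in epoch seconds.
# Prints "unknown" if the metric is not present in the input.
sync_age_seconds() {
    awk -v now="$1" '
        index($1, "kubeproxy_sync_proxy_rules_last_timestamp_seconds") == 1 {
            if (ts == "" || $2 + 0 > ts + 0) ts = $2
        }
        END {
            if (ts == "") print "unknown"
            else printf "%d\n", now - ts
        }'
}
```

For example: curl -s http://localhost:10249/metrics | sync_age_seconds "$(date +%s)". An age above 120 seconds corresponds to the "older than 2 minutes" warning sign.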

If the cause is proxy mode scaling limits

In iptables mode, kube-proxy holds the xtables lock during iptables-restore, which can delay syncs and extend the window during which stale conntrack entries persist. If your cluster runs thousands of Services, evaluate migrating to IPVS or nftables mode. These modes reduce sync duration and lock contention, which indirectly improves conntrack cleanup latency.

Prevention

Monitor conntrack utilization per node. Set alerts at 70% of nf_conntrack_max to provide runway before the cliff edge. Do not wait for drops.

Size nf_conntrack_max for your node density. A table of 1,000,000 entries consumes roughly 300 MB of kernel memory. Size the limit to accommodate peak connection count plus TIME_WAIT and UDP overhead, but ensure you have enough system memory.
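
The memory figure above can be turned into a rough sizing helper. This sketch assumes about 300 bytes per entry, back-derived from the "1,000,000 entries, roughly 300 MB" estimate; the real per-entry cost varies by kernel version and configuration. estimate_conntrack_mb is a hypothetical helper name.

```shell
# Estimate kernel memory (in MB, rounded down) for a conntrack table.
# Usage: estimate_conntrack_mb <entries> [bytes_per_entry]
# bytes_per_entry defaults to 300, an assumption derived from the
# article's 1,000,000 entries ~ 300 MB figure.
estimate_conntrack_mb() {
    entries="$1"
    bytes_per_entry="${2:-300}"
    echo $(( entries * bytes_per_entry / 1000000 ))
}
```

For example, estimate_conntrack_mb 1000000 reproduces the 300 MB figure, and the second argument can be adjusted if measurements on your kernel suggest a different per-entry cost.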

Fix connection leaks at the application level. Conntrack exhaustion is often a symptom of clients that open connections without closing them. Application health checks, connection pool metrics, and file descriptor counts are leading indicators.

Tune timeouts for your traffic pattern. The defaults assume general-purpose servers. Nodes running high-churn microservices or DNS-heavy workloads benefit from shorter TCP TIME_WAIT and UDP stream timeouts.

Limit unnecessary connection creation. Readiness and liveness probes that open a new TCP session on every execution contribute to table pressure. Where the workload tolerates slower failure detection, lengthen the probe interval (periodSeconds).

Review NodePort and ExternalTrafficPolicy usage. Services with externalTrafficPolicy: Local create additional health check connections. NodePort services on busy nodes increase the total connection count because every node must accept the traffic.

How Netdata helps

Netdata monitors these signals per node and correlates them for faster root-cause analysis:

  • Conntrack utilization: nf_conntrack_count against nf_conntrack_max per node, with real-time history.
  • Kernel drops: Packets dropped by the conntrack subsystem, shown alongside TCP retransmission rates.
  • Cross-signal context: Conntrack saturation correlated with CoreDNS SERVFAIL rates, kube-proxy sync latency, and pod-level connection counts.