Kubernetes pod network isolation: when one node loses pod connectivity

One node in your cluster drops off the pod network. Pods still show Running, but they cannot reach Services, other pods, or external endpoints. The rest of the cluster is unaffected. This is single-node pod network isolation, and it is almost always a node-local CNI or data path failure.

Symptoms are subtle compared to a full cluster outage. Applications log connection timeouts. Health checks fail. New pods stick in ContainerCreating. Because the node often stays Ready, operators may blame the application rather than the network layer.

This guide shows how to confirm the failure is node-local, identify the root cause, and restore connectivity safely.

What this means

Kubernetes delegates pod networking to a CNI plugin, usually running as a DaemonSet. The plugin creates the veth pair, attaches it to the host bridge or overlay, and sets routes. When the plugin fails on one node, existing pods usually keep their assigned IPs, but the path between the pod network namespace and the host collapses. New pods cannot start because kubelet cannot create the network sandbox.

The node may stay Ready because kubelet and the container runtime are healthy. However, the NetworkUnavailable condition may flip to True, indicating the CNI has not finished wiring the node. From the cluster perspective, the node is up but its workloads are unreachable.
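
The "Ready but unreachable" state described above can be checked mechanically. The following is a minimal sketch, not a definitive health check: the function names and the classification labels are hypothetical, and the NetworkUnavailable value may come back empty if nothing in your cluster sets that condition.

```shell
# Classify a node from its Ready and NetworkUnavailable condition statuses.
# $1 = Ready status, $2 = NetworkUnavailable status (True/False/Unknown/empty).
# The labels printed here are illustrative, not Kubernetes terminology.
classify_node() {
  ready="$1"
  net_unavailable="$2"
  if [ "$ready" = "True" ] && [ "$net_unavailable" = "True" ]; then
    echo "ready-but-network-unavailable"
  elif [ "$ready" = "True" ]; then
    echo "ready"
  else
    echo "not-ready"
  fi
}

# Read both conditions for one node with kubectl (requires cluster access).
node_network_state() {
  node="$1"
  ready=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  net=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")].status}')
  classify_node "$ready" "$net"
}
```

A result of "ready" does not rule out a data path failure; it only means the node conditions give no hint, so the remaining checks still apply.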

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| CNI DaemonSet pod failure on the node | Existing pods lose cluster connectivity; new pods stuck in ContainerCreating | CNI pod status on the node |
| CNI config version skew with containerd 1.6.0-1.6.3 | FailedCreatePodSandBox events containing incompatible CNI versions | /etc/cni/net.d/ config cniVersion and containerd version |
| Conntrack table exhaustion | Intermittent connection timeouts under load; nf_conntrack: table full in dmesg | /proc/sys/net/netfilter/nf_conntrack_count vs nf_conntrack_max |
| Firewall blocking CNI ports | Cross-node traffic drops; intra-node traffic may still work | Host and cloud firewall rules for CNI ports |
| MTU mismatch between host and pod interfaces | Small ICMP pings succeed; HTTP requests hang after the TCP handshake | ip link output on host and inside a pod |
| Stale conntrack entries after endpoint churn | UDP DNS timeouts after CoreDNS pods are replaced | conntrack -L for entries pointing to old endpoint IPs |

Quick checks

# Check node network conditions
kubectl describe node <node-name> | grep -A 10 "^Conditions:"

# Check the CNI pod on the affected node
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name> -o wide

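# Look for sandbox creation failures tied to the node
# (-A searches all namespaces; -o wide adds the SOURCE column, which carries the node name)
kubectl get events -A -o wide --field-selector involvedObject.kind=Pod,reason=FailedCreatePodSandBox --sort-by='.lastTimestamp' | grep <node-name>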

# Inspect CNI configuration version
grep -h cniVersion /etc/cni/net.d/* | head -n 5

# Check conntrack utilization
echo "scale=2; $(cat /proc/sys/net/netfilter/nf_conntrack_count) * 100 / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" | bc

# Check host interface MTU
ip link show $(ip route show default | head -n1 | awk '{print $5}') | grep mtu

# Check CNI-specific ports (Flannel VXLAN example)
ss -ulnp | grep -E "8472|8285"

# Check containerd version for known CNI compatibility issues
containerd --version

How to diagnose it

  1. Confirm the blast radius is a single node. Start a debug pod on the affected node using nodeName: <node> and attempt to reach a known good pod on another node. If the failure is isolated to one node, the problem is local to its CNI or data path.

  2. Check the NetworkUnavailable condition. Run kubectl describe node. If NetworkUnavailable is True, the CNI control plane has not established the node’s network.

  3. Inspect the CNI pod. Use labels that match your plugin, such as k8s-app=calico-node, k8s-app=cilium, app=flannel, or name=weave-net. If the pod is in CrashLoopBackOff, was OOMKilled, or is missing, the node cannot program pod networking.

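  4. Read kubelet sandbox events. Look for messages such as NetworkPlugin cni failed to set up pod or failed to setup network for sandbox in FailedCreatePodSandBox events; the exact wording depends on the container runtime and its version. Either message confirms that the CNI plugin was invoked and the setup operation returned an error, which points to a config, binary, or runtime compatibility issue.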

  5. Validate CNI configuration on disk. Check /etc/cni/net.d/. Ensure the cniVersion declared in the config is supported by the installed plugin binaries. If you are running containerd 1.6.0 through 1.6.3, the config must declare version 1.0.0 or later; otherwise sandbox creation fails with an incompatible version error. Upgrading containerd to 1.6.4 or later resolves this.

  6. Test the data path from inside a pod. Exec into a Running pod on the node and run ip link to inspect its interface MTU. Then send large pings with ping -M do -s 1472 <target-pod-ip> if the image supports it, or start a TCP transfer with curl. If small packets pass but large payloads hang after the handshake, an MTU mismatch is silently dropping traffic.

  7. Check conntrack state and kernel logs. Search dmesg for nf_conntrack: table full, dropping packet. If conntrack is exhausted, new connections are silently dropped. Also look for stale UDP entries. When CoreDNS pods are replaced, old conntrack entries for the previous pod IPs can persist, causing DNS queries to be sent to dead endpoints.

  8. Verify inter-node firewall paths. If the cluster uses Flannel, confirm UDP port 8472 (VXLAN) or 8285 (UDP backend) is open between all nodes. For AWS VPC CNI, verify the instance has not hit its ENI limit, which prevents IP assignment and causes sandbox creation failures.
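
Step 1 relies on a pod pinned to the affected node. A minimal sketch of generating such a pod follows; the function name, the netcheck- pod name prefix, and the busybox:1.36 image are assumptions chosen for illustration, so substitute whatever debug image your cluster allows.

```shell
# Print a manifest for a throwaway pod pinned to one node via spec.nodeName.
# $1 = node name; $2 = image (optional; busybox:1.36 is only an illustrative default).
debug_pod_manifest() {
  node="$1"
  image="${2:-busybox:1.36}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: netcheck-${node}
spec:
  nodeName: ${node}
  restartPolicy: Never
  containers:
  - name: netcheck
    image: ${image}
    command: ["sleep", "3600"]
EOF
}

# Example (requires cluster access; <...> values are placeholders):
#   debug_pod_manifest worker-3 | kubectl apply -f -
#   kubectl exec netcheck-worker-3 -- wget -T 5 -qO- http://<pod-ip-on-another-node>:<port>/
#   kubectl delete pod netcheck-worker-3
```

If the debug pod itself stays in ContainerCreating, that is already a useful signal: sandbox creation on the node is failing, so move on to the CNI pod and kubelet event checks.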

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Node condition NetworkUnavailable | Directly indicates the CNI plugin has not finished wiring the node | Status True for more than 60 seconds |
| CNI DaemonSet pod restarts per node | A crashing CNI pod isolates the node immediately | Restart count increasing on the node’s CNI pod |
| FailedCreatePodSandBox event rate | Confirms kubelet cannot delegate network setup to the CNI plugin | Events concentrated on a single node |
| Conntrack utilization ratio | A full table silently drops new connections across all pods on the node | Sustained above 80% of nf_conntrack_max |
| Cross-node pod-to-pod latency and loss | Detects partial data path breaks like MTU mismatches or firewall blocks | Loss or latency spikes only from the affected node |
| kubelet run_podsandbox operation errors | Surfaces CNI invocation failures from the runtime side | kubelet_runtime_operations_errors_total increasing for run_podsandbox |
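
The conntrack utilization ratio in the table is a simple quotient of two procfs values. As a rough sketch (the function names are hypothetical, and the 80 default just mirrors the warning threshold above):

```shell
# Integer percentage of the conntrack table in use.
# $1 = current entry count, $2 = table maximum.
conntrack_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Succeeds (exit 0) when usage is at or above the threshold in $3 (default 80).
conntrack_over_threshold() {
  [ "$(conntrack_pct "$1" "$2")" -ge "${3:-80}" ]
}

# On a node, the live values come from procfs:
#   count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
#   max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
#   conntrack_over_threshold "$count" "$max" && echo "conntrack at or above 80%"
```

A single reading says little about "sustained" usage; sampling this on an interval, or scraping it with a monitoring agent, is what turns it into the signal described in the table.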

Fixes

If the cause is CNI plugin failure

Delete the CNI pod on the affected node and let the DaemonSet recreate it. If the pod is OOMKilled, raise its memory limit or reduce pod density. If the node has exceeded its ENI or IP pool limit, common with AWS VPC CNI, free unused IPs or select an instance type that supports more ENIs.
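
One way to target only the CNI pod on the affected node is to combine a label selector with a node field selector. The helper below is a hypothetical dry-run sketch: it prints the command instead of running it, and it assumes the plugin lives in kube-system unless you pass another namespace (some installs use a dedicated namespace).

```shell
# Print (not run) the kubectl command that deletes the CNI pod on one node.
# $1 = node name, $2 = CNI pod label selector, $3 = namespace (default kube-system).
cni_restart_cmd() {
  printf 'kubectl delete pod -n %s -l %s --field-selector spec.nodeName=%s\n' \
    "${3:-kube-system}" "$2" "$1"
}

# Review the output first, then pipe it to sh to execute:
#   cni_restart_cmd worker-3 k8s-app=calico-node
#   cni_restart_cmd worker-3 k8s-app=calico-node | sh
```

Printing first keeps the destructive step explicit, which is useful when the label or namespace is a guess.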

If the cause is CNI configuration drift

Restore the correct CNI config file in /etc/cni/net.d/. Ensure the cniVersion in the config matches the plugin binary capabilities. On nodes running containerd 1.6.0 through 1.6.3, upgrade containerd to 1.6.4 or later, or ensure both the config and the loopback plugin declare version 1.0.0.
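
To compare what is on disk with the containerd range mentioned above, something like the following sketch may help. Both function names are hypothetical, the cniVersion extraction is a plain text match rather than a JSON parser, and distribution-specific version suffixes beyond the ones handled here are not covered.

```shell
# Print the first cniVersion value found in a CNI .conf/.conflist file.
cni_version_of() {
  sed -n 's/.*"cniVersion"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeeds when the given containerd version falls in the 1.6.0-1.6.3 range.
# Accepts an optional leading "v" and a package suffix such as -0ubuntu1.
containerd_affected() {
  case "${1#v}" in
    1.6.[0-3]|1.6.[0-3][-+~]*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a node (the third field of `containerd --version` is the version string):
#   ver=$(containerd --version | awk '{print $3}')
#   for f in /etc/cni/net.d/*; do echo "$f: $(cni_version_of "$f")"; done
#   containerd_affected "$ver" && echo "containerd $ver is in the affected range"
```

An empty result from cni_version_of means the file declares no cniVersion at all, which is worth fixing regardless of the containerd version.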

If the cause is conntrack exhaustion

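Increase the table size immediately. The change applies at once but is lost on reboot unless it is also persisted, for example in a file under /etc/sysctl.d/: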

sysctl -w net.netfilter.nf_conntrack_max=<higher_value>

Then identify the source: high connection churn, TIME_WAIT accumulation, or a connection leak. For UDP-heavy workloads, reduce net.netfilter.nf_conntrack_udp_timeout_stream. See Kubernetes conntrack exhaustion for detailed tuning.
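
To see whether TIME_WAIT accumulation or something else dominates the table, the TCP entries can be grouped by state. This is a sketch that assumes the usual conntrack -L line layout, where the fourth field of a tcp entry is its state; the function name is hypothetical.

```shell
# Summarize TCP conntrack entries by state from `conntrack -L` output on stdin.
# Prints "<count> <state>" lines, largest count first.
conntrack_states() {
  awk '$1 == "tcp" { n[$4]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# On a node (needs root and the conntrack tool):
#   conntrack -L 2>/dev/null | conntrack_states
```

UDP entries carry no state field and are ignored here, so a table dominated by DNS traffic needs a separate look at the udp lines.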

If the cause is a firewall or security group

Open the CNI-specific ports between all cluster nodes. For Flannel, allow UDP 8472 for VXLAN and UDP 8285 for the UDP backend. Verify that host-level iptables or nftables rules and cloud security groups are not dropping encapsulated traffic.
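
For the host-level part of that verification, a quick scan of iptables-save output can surface obvious blocks. Treat this as a heuristic sketch only: the function name is hypothetical, it matches single-port --dport rules, and it will miss port ranges, multiport matches, default-drop chain policies, native nftables rules, and anything enforced by cloud security groups.

```shell
# Read `iptables-save` output on stdin and print rules that drop or reject
# traffic to the Flannel ports (8472 for VXLAN, 8285 for the UDP backend).
find_flannel_blocks() {
  grep -E -e '--dport (8472|8285) .*-j (DROP|REJECT)' || true
}

# On a node (needs root):
#   iptables-save | find_flannel_blocks
```

No output means no explicit single-port block was found, not that the path is open; a packet capture on the node interface is the more conclusive test.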

If the cause is an MTU mismatch

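Set the pod interface MTU in the CNI configuration so that it fits within the MTU of the host’s primary interface or the VPC, minus any encapsulation overhead. For example, VXLAN over IPv4 adds 50 bytes, so if the host interface is 1500 and pod interfaces are also left at 1500, full-size packets are dropped after the TCP handshake completes. Adjust the CNI config and restart the CNI pods to apply the change; pods created before the change typically keep their old interface MTU until they are recreated.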
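
The arithmetic behind MTU sizing and the 1472-byte ping probe is small enough to script. A minimal sketch follows; the function names are hypothetical, and the 50-byte default is the commonly cited VXLAN-over-IPv4 overhead, so pass a different value for other encapsulations.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Pod MTU under an overlay: host MTU ($1) minus encapsulation overhead ($2).
# $2 defaults to 50 (assumed VXLAN over IPv4); other encapsulations differ.
pod_mtu_for_host() {
  echo $(( $1 - ${2:-50} ))
}

# Example probe from inside a pod, sized for a 1450-byte pod MTU:
#   ping -M do -s "$(icmp_payload_for_mtu 1450)" <target-pod-ip>
```

If the probe sized for the pod MTU fails while a smaller payload succeeds, some hop on the path has a lower effective MTU than the pod interface assumes.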

If the cause is stale conntrack entries

Flush stale entries manually with conntrack -D -d <old-pod-ip> if necessary. To prevent recurrence, ensure DNS workloads use graceful termination periods long enough for clients to switch, and consider NodeLocal DNSCache to reduce cross-node UDP conntrack pressure.
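
When clients reach DNS through a Service, the dead pod IP usually appears in conntrack as the DNAT target rather than as the original destination, so matching on -d alone may miss those entries. The sketch below prints both variants for review; the function name is hypothetical, and the --dst-nat filter should be checked against the conntrack-tools version on your nodes before use.

```shell
# Print (not run) conntrack commands that remove UDP entries for a dead pod IP.
# $1 = old pod IP.
stale_udp_flush_cmds() {
  # Entries where clients addressed the pod IP directly.
  printf 'conntrack -D -p udp -d %s\n' "$1"
  # Entries created through a Service, where the pod IP is the DNAT target.
  printf 'conntrack -D -p udp --dst-nat %s\n' "$1"
}

# Review the output, then run it as root on the affected node:
#   stale_udp_flush_cmds 10.244.1.23
#   stale_udp_flush_cmds 10.244.1.23 | sh
```

The address 10.244.1.23 is only a placeholder; conntrack exits non-zero when nothing matches, which is harmless here.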

Prevention

  • Monitor the CNI DaemonSet with per-node alerts for pods that are not in the Running state.
  • Alert on NetworkUnavailable=True on any node.
  • Track conntrack utilization on every node and size nf_conntrack_max relative to peak connection counts; a default such as 65536 is often too low for dense nodes.
  • Pin CNI and containerd versions in your node image pipeline and validate compatibility before rollout.
  • Set explicit MTU values in CNI configs during node bootstrap.
  • Document required inter-node firewall rules for your CNI plugin in your network runbook.
  • Monitor FailedCreatePodSandBox event rate per node to catch CNI regressions before workloads are scheduled.
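
As a starting point for the per-node event rate in the last item, sandbox failure events can be grouped by the node that reported them. This sketch assumes the events record the node name in source.host, as kubelet-emitted events normally do; the function name is hypothetical.

```shell
# Count occurrences of each node name read from stdin (one name per line).
# Prints "<node> <count>" lines, highest count first; blank lines are skipped.
count_by_node() {
  grep -v '^$' | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# With cluster access, feed it the reporting host of each sandbox failure event:
#   kubectl get events -A --field-selector reason=FailedCreatePodSandBox \
#     -o jsonpath='{range .items[*]}{.source.host}{"\n"}{end}' | count_by_node
```

Events expire from the API server after a retention window, so this shows recent failures only; a metrics pipeline is better suited to long-term alerting.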

How Netdata helps

  • Netdata tracks node-level conntrack utilization and alerts when the table approaches its limit, before connections start dropping.
  • The kernel error chart surfaces dmesg messages such as nf_conntrack: table full without requiring manual node logins.
  • Per-node pod health monitoring correlates CNI pod restarts with sandbox creation failures in the same time window.
  • Network latency and packet-drop metrics per interface help isolate MTU or firewall issues to a specific node.

Diagnostic decision flow (Mermaid diagram source):

flowchart TD
    A[Pods on one node lose connectivity] --> B{NetworkUnavailable True?}
    B -->|Yes| C[Check CNI pod status and logs]
    B -->|No| D[Check conntrack and MTU]
    C --> E{CNI pod Running?}
    E -->|No| F[Restart or fix CNI pod resources]
    E -->|Yes| G[Check CNI config version and containerd compatibility]
    D --> H{Conntrack table full?}
    H -->|Yes| I[Increase nf_conntrack_max and reduce churn]
    H -->|No| J[Check MTU and inter-node firewall ports]