# Kubernetes DNS resolution failures inside pods
DNS failures inside pods break service discovery. A single overloaded CoreDNS replica or saturated conntrack table on one node can look like a multi-service outage. Before fixing, determine whether the failure is cluster-wide, node-specific, or workload-specific.
## What this means
Kubernetes injects an `/etc/resolv.conf` into every pod that points to the cluster DNS service, typically CoreDNS. CoreDNS resolves cluster-internal names via the `kubernetes` plugin and forwards external queries to an upstream resolver. A failure at any point produces the same symptom: the name cannot be resolved.
Because nearly every pod uses DNS, a localized CoreDNS failure, node-level conntrack exhaustion, or a misconfigured forwarding loop can look like an outage. DNS traffic traverses kube-proxy rules to reach the cluster DNS Service IP, so stale iptables or IPVS state on a node can break resolution even when CoreDNS itself is healthy.
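For orientation, this is roughly what the injected resolver configuration looks like. The nameserver IP (`10.96.0.10`) and `cluster.local` domain shown in the comments below are common defaults, not guaranteed values; they vary by cluster and DNS policy:

```bash
# Inspect the resolver config Kubernetes injected into a pod.
kubectl exec -it <pod> -- cat /etc/resolv.conf
# Typical output (values vary by cluster):
#   search <namespace>.svc.cluster.local svc.cluster.local cluster.local
#   nameserver 10.96.0.10
#   options ndots:5
```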
## Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| CoreDNS OOMKilled or crashloop | SERVFAIL spikes, all pods affected cluster-wide | CoreDNS pod status and memory limits in the kube-system Deployment |
| `ndots:5` amplification | High query volume to CoreDNS, slow external name resolution, elevated CPU | `/etc/resolv.conf` inside the pod for `ndots:5` |
| Upstream DNS failure | External names fail while `kubernetes.default` succeeds | Corefile forwarding config and node upstream resolver |
| DNS loop | Intermittent timeouts, high CoreDNS CPU, repeating queries | Node `/etc/resolv.conf` and CoreDNS forwarding target |
| Conntrack exhaustion | Random connection timeouts, UDP DNS queries dropped first | `nf_conntrack_count` versus `nf_conntrack_max` on the node |
| kube-proxy stale rules | DNS failures isolated to specific nodes, not the whole cluster | kube-proxy sync timestamp and conntrack entries for old CoreDNS pods |
## Quick checks
Run these in order.
```bash
# Check CoreDNS pod health
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test cluster DNS resolution from an affected pod
kubectl exec -it <pod> -- nslookup kubernetes.default

# Check pod resolver configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Check CoreDNS error responses (port-forward, then query locally)
kubectl port-forward -n kube-system <coredns-pod> 9153:9153 &>/dev/null &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:9153/metrics | grep coredns_dns_responses_total
kill %1

# Check conntrack table utilization on the affected node
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Check for conntrack drops (on the node)
conntrack -S | grep drop

# Check kube-proxy last successful sync timestamp (on the node)
curl -s http://127.0.0.1:10249/metrics | grep kubeproxy_sync_proxy_rules_last_timestamp_seconds

# Check node resolver for forwarding loops (on the node)
cat /etc/resolv.conf
```
What good and bad look like:

- CoreDNS pods should all be `Running` and `Ready`. Any `OOMKilled` event means memory limits are too low.
- `nslookup kubernetes.default` should return the ClusterIP quickly. Failure here points to the CoreDNS path.
- `/etc/resolv.conf` inside the pod shows `ndots:5` by default. This is normal but amplifies external queries.
- `coredns_dns_responses_total` with `rcode="SERVFAIL"` should be near zero. A rising count means CoreDNS cannot satisfy queries.
- Conntrack utilization should stay below 90 percent. At 100 percent, new UDP packets are silently dropped.
- The kube-proxy sync timestamp should be within the last 60 seconds. A stale timestamp means rules are not being updated.
## How to diagnose it
1. **Confirm DNS is the problem.** Run `nslookup kubernetes.default` and `nslookup <external-domain>` from an affected pod. If both fail, the issue is between the pod and CoreDNS or within CoreDNS itself. If only external names fail, suspect upstream forwarding. If internal names fail but external names work, suspect the `kubernetes` plugin or CoreDNS configuration.
2. **Check CoreDNS pod health.** Look for `OOMKilled`, `CrashLoopBackOff`, or `NotReady` states. CoreDNS memory scales with service count and cache size; the default limit is often too low for large clusters. If CoreDNS is restarting, DNS is unavailable during each restart window. If pods are healthy but overloaded, scale the Deployment or add NodeLocal DNSCache.
3. **Inspect CoreDNS metrics.** Query port 9153 for `coredns_dns_responses_total` and `coredns_dns_request_duration_seconds`. A SERVFAIL rate above 1 percent of total responses indicates CoreDNS cannot resolve queries. High latency on internal names suggests the `kubernetes` plugin is slow or the API server watch is stale. High latency on external names points to upstream resolvers.
4. **Check pod resolver configuration.** Look at `/etc/resolv.conf` inside the pod. The default `ndots:5` and three search domains mean any name with fewer than five dots is expanded through all search suffixes before a final absolute query. A query for `api.example.com` generates four lookups, three of which return NXDOMAIN. If your workload makes heavy use of short external names, this amplification can saturate CoreDNS.
5. **Verify the node network path.** DNS traffic from the pod to the cluster DNS Service IP traverses kube-proxy rules and conntrack. On the node, check `nf_conntrack_count` against `nf_conntrack_max`. If utilization is above 90 percent, new UDP packets to CoreDNS are silently dropped. Check `kubeproxy_sync_proxy_rules_last_timestamp_seconds`. If it is more than a few minutes old, kube-proxy is not updating rules and traffic may be sent to a dead CoreDNS pod IP.
6. **Check for DNS loops.** Compare the node's `/etc/resolv.conf` with the CoreDNS Corefile forwarding configuration. If the node points to a local resolver that forwards to CoreDNS, and CoreDNS forwards back to the node, queries loop until they time out. This produces intermittent failures and high CoreDNS CPU.
7. **Test upstream resolution directly.** From a node or a debug pod with host networking, query the upstream resolver that CoreDNS uses. If the upstream is slow or returns SERVFAIL, CoreDNS is functioning correctly but has a bad upstream. Fix the node resolver or update the Corefile to use a reliable upstream. The sketch after this list shows one way to run the loop and upstream checks.
## Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| CoreDNS SERVFAIL rate | Indicates CoreDNS cannot resolve queries | SERVFAIL > 1% of total responses |
| CoreDNS request duration p99 | Latency affects all inter-service calls | p99 > 500ms sustained |
| CoreDNS pod restart count | OOM or crash reduces DNS capacity | Any OOMKilled or CrashLoopBackOff event |
| Pod `ndots` setting | `ndots:5` amplifies query volume and CoreDNS load | External-API workloads using default `ndots:5` |
| Conntrack table utilization | UDP DNS packets are dropped when the table is full | `nf_conntrack_count` > 90% of `nf_conntrack_max` |
| kube-proxy last sync timestamp age | Stale rules break reachability to the cluster DNS Service IP | Timestamp older than 2 minutes |
| CoreDNS upstream response code | Separates internal resolution from external forwarding failures | External queries fail while internal queries succeed |
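If you only have the raw metrics endpoint, the following rough sketch computes the SERVFAIL share directly (it assumes the port-forward from the quick checks is running; the counters are cumulative, so a metrics stack computing rates is better for alerting):

```bash
# Rough lifetime SERVFAIL percentage from CoreDNS metrics.
curl -s http://localhost:9153/metrics \
  | awk '/^coredns_dns_responses_total/ {
           total += $2
           if ($0 ~ /rcode="SERVFAIL"/) servfail += $2
         }
         END { if (total > 0) printf "SERVFAIL: %.2f%%\n", 100 * servfail / total }'
```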
## Fixes
### CoreDNS resource pressure
CoreDNS memory scales with service count and cache size. If CoreDNS pods are OOMKilled, increase the memory limit in the CoreDNS Deployment manifest. Also consider scaling the replica count to match cluster query volume. In very large clusters, deploy NodeLocal DNSCache to absorb query load at the node level and reduce cross-node traffic to CoreDNS.
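A sketch of both knobs, assuming the Deployment is named `coredns` (the usual default, but managed platforms vary) and using example values you should size from observed usage:

```bash
# Raise the CoreDNS memory limit (512Mi/256Mi are examples; size from
# observed usage plus headroom, not universal values).
kubectl -n kube-system set resources deployment coredns \
  --limits=memory=512Mi --requests=memory=256Mi

# Scale replicas to match cluster query volume (count is an example).
kubectl -n kube-system scale deployment coredns --replicas=4
```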
### ndots amplification
For workloads that resolve many external names, set `ndots` to `2` via `dnsConfig` in the pod spec. This reduces the number of search-suffix expansions per query. The tradeoff is that internal names with two or more dots (such as `myservice.mynamespace.svc`) are tried as absolute names first and only then through the search list, so they resolve more slowly unless written fully qualified; in exchange, external name latency drops significantly.
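A minimal example pod showing where the setting lives. The pod name and image are illustrative; in practice you would merge `dnsConfig` into your real workload template:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ndots-demo        # illustrative name
spec:
  containers:
  - name: app
    image: busybox:1.36   # illustrative image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "2"
EOF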
### Upstream DNS failure
Verify the Corefile forwarding configuration and test the upstream resolver directly from the node. If the node uses a local resolver such as `systemd-resolved` that points back to the cluster, break the loop by configuring CoreDNS to forward directly to a known upstream IP. If the upstream itself is unreliable, switch to a different resolver.
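The change usually amounts to one line in the Corefile's `forward` stanza. The upstream IPs below are stand-ins; use resolvers you trust:

```bash
# Open the CoreDNS Corefile for editing. In the forward plugin line,
# replace /etc/resolv.conf with explicit upstream IPs to break a loop.
kubectl -n kube-system edit configmap coredns
#   forward . /etc/resolv.conf      # before: inherits the node resolver
#   forward . 1.1.1.1 8.8.8.8       # after: explicit upstreams (examples)
```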
### Conntrack exhaustion
Increase `nf_conntrack_max` on affected nodes. The default is often too low for nodes running many pods with high connection churn. Apply it immediately with `sysctl -w net.netfilter.nf_conntrack_max=<higher_value>`, then persist it in the node configuration. Monitor conntrack state distribution to identify whether TCP `TIME_WAIT` or UDP entries dominate the table.
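A sketch of the immediate and persistent fixes. The value `262144` and the sysctl.d filename are illustrative; size the ceiling from observed peak counts plus headroom:

```bash
# Raise the conntrack ceiling immediately (example value).
sysctl -w net.netfilter.nf_conntrack_max=262144

# Persist across reboots (conventional path; adjust per distro).
echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/90-conntrack.conf

# Which protocols fill the table?
conntrack -L 2>/dev/null | awk '{print $1}' | sort | uniq -c | sort -rn

# How many TCP entries are parked in TIME_WAIT?
conntrack -L -p tcp --state TIME_WAIT 2>/dev/null | wc -l
```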
### kube-proxy stale rules
If DNS failures are isolated to a single node, check whether kube-proxy's watch connection to the API server is alive. A kube-proxy process that has lost its watch operates on stale rules but may still return 200 on its `healthz` endpoint. Restart the kube-proxy pod on the affected node to force a full resync. If the cluster uses IPVS mode, also check for stale UDP conntrack entries pointing to old CoreDNS pod IPs after a rollout.
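A sketch of the recovery steps. The `k8s-app=kube-proxy` label matches common defaults but varies by distribution:

```bash
# Restart kube-proxy on the affected node to force a full rule resync.
kubectl -n kube-system delete pod -l k8s-app=kube-proxy \
  --field-selector spec.nodeName=<node-name>

# On the node: drop UDP/53 conntrack entries so new DNS queries follow
# current rules instead of stale entries pointing at dead CoreDNS pod IPs.
conntrack -D -p udp --dport 53
```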
## Prevention
- Size CoreDNS for scale. Set memory limits based on service count and monitor CoreDNS pod restarts. Treat CoreDNS restarts as a page-worthy event.
- Deploy NodeLocal DNSCache. This reduces CoreDNS load and protects against node-level conntrack exhaustion by keeping DNS traffic local.
- Monitor conntrack utilization per node. Include `nf_conntrack_count / nf_conntrack_max` in standard node health checks. Alert at 75 percent to leave headroom for bursts; a minimal check script follows this list.
- Audit pod dnsConfig. For namespaces running external-API clients, review `ndots` settings and search domains. Avoid unnecessary amplification.
- Avoid DNS loops. Ensure node-level resolvers do not forward back into the cluster. Document the expected upstream path and validate it after node image changes.
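A minimal sketch of such a node check, with an illustrative threshold and output format:

```bash
# Alert when conntrack utilization crosses 75% (threshold from the
# prevention guidance above; tune to your fleet).
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
pct=$(( 100 * count / max ))
echo "conntrack: ${count}/${max} (${pct}%)"
if [ "$pct" -ge 75 ]; then
  echo "WARNING: conntrack utilization at ${pct}%, headroom low" >&2
fi
```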
## How Netdata helps
- Correlate CoreDNS error rates and latency with pod-level resource saturation and node conntrack utilization in one view.
- Alert on rising SERVFAIL rates before application error rates spike.
- Track kube-proxy sync latency per node to detect stale rules that break ClusterIP reachability.
- Monitor conntrack table utilization across the fleet to catch the node-level UDP drops that often precede DNS outages.
## Related guides
- See Kubernetes API server slow or unresponsive: causes and fixes
- See Kubernetes kubelet not responding: PLEG, runtime, and certificate issues
- See Kubernetes monitoring checklist: the signals every production cluster needs
- See Kubernetes node NotReady: kubelet, runtime, and network diagnosis
- See Kubernetes pod stuck ContainerCreating: volume, network, and image issues
- See Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes
- See Kubernetes pod ImagePullBackOff: registry, auth, and network diagnosis
- See Kubernetes pod OOMKilled: cgroup limits, evictions, and fixes
- See Kubernetes pod stuck Pending: scheduling failures explained