Kubernetes headless service resolution: SRV records and pod discovery
You deployed a StatefulSet with a headless Service so peers can discover each other, but nslookup returns NXDOMAIN or only a single IP when several pods are running. Your application might rely on SRV records for port discovery and the lookup returns nothing. Or a pod rescheduled onto a new node and clients kept trying the old IP for minutes because the TTL behavior surprised you.
These symptoms trace back to how Kubernetes translates headless Service semantics into DNS records. A headless Service does not give you a stable cluster IP. Instead, the control plane exposes pod identities directly through DNS. When a pod is missing its hostname, a port is unnamed, or an EndpointSlice lacks the right metadata, the records simply do not exist.
After reading this guide you will be able to verify why a pod is missing its DNS record, why an SRV query fails, and how to distinguish misconfiguration from DNS layer delays.
flowchart TD
A[Client queries CoreDNS] --> B{Service clusterIP=None?}
B -->|No| C[Return cluster IP A record]
B -->|Yes| D{Query type?}
D -->|A/AAAA| E[Select Ready Endpoints]
D -->|SRV| F{Named port exists?}
F -->|No| G[Return NXDOMAIN]
F -->|Yes| H[Return SRV record with port and target FQDN]
E --> I{EndpointSlice hostname set?}
I -->|No| J[Return pod IPs at Service name]
I -->|Yes| K[Return hostname.subdomain.namespace.svc.cluster.local]
What this means
A headless Service is defined by setting .spec.clusterIP to "None". Kubernetes does not allocate a cluster IP and kube-proxy does not program virtual IP rules for it. Instead, DNS queries against the Service name return A or AAAA records for each backing pod IP directly. If the Service has named ports, CoreDNS also creates SRV records of the form _<port-name>._<protocol>.<svc-name>.<namespace>.svc.cluster.local.
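As a minimal sketch of the definition above, the function below prints a headless Service manifest with one named port. The function name, the example values (db, prod, peer, 5432), and the app=<svc-name> label selector are hypothetical placeholders, not values from a real cluster.

```shell
# Print a headless Service manifest with one named TCP port.
# Arguments: service name, namespace, port name, port number.
headless_service_manifest() {
  svc="$1"; ns="$2"; port_name="$3"; port="$4"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${svc}
  namespace: ${ns}
spec:
  clusterIP: None          # makes the Service headless
  selector:
    app: ${svc}            # assumption: pods carry the label app=<svc-name>
  ports:
  - name: ${port_name}     # a port name is needed for SRV records
    port: ${port}
    protocol: TCP
EOF
}
```

One way to use it is `headless_service_manifest db prod peer 5432 | kubectl apply -f -`, after which the SRV name to query would be _peer._tcp.db.prod.svc.cluster.local.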
For a pod to receive an individual DNS record (<hostname>.<subdomain>.<namespace>.svc.cluster.local), three conditions must all be true: the pod has spec.hostname set; a headless Service exists in the same namespace with the same name as spec.subdomain; and the pod is Ready, unless the Service sets publishNotReadyAddresses: true.
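The first two conditions can be sketched as follows. The helper names, the busybox image, and the app=<svc-name> label are assumptions for illustration; a StatefulSet normally sets hostname and subdomain for you, so this standalone Pod is only a way to see the fields side by side.

```shell
# Build the per-pod FQDN: <hostname>.<subdomain>.<namespace>.svc.<cluster-domain>.
# The cluster domain defaults to cluster.local; pass a 4th argument to override.
pod_fqdn() {
  printf '%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "${4:-cluster.local}"
}

# Print a Pod manifest whose hostname and subdomain line up with a headless
# Service. Arguments: hostname, headless Service name, namespace.
pod_manifest() {
  host="$1"; svc="$2"; ns="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${host}
  namespace: ${ns}
  labels:
    app: ${svc}            # assumption: the Service selects on app=<svc-name>
spec:
  hostname: ${host}
  subdomain: ${svc}        # must equal the headless Service name
  containers:
  - name: main
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
}
```

For example, `pod_fqdn db-0 db prod` prints db-0.db.prod.svc.cluster.local, which is the name to query once the pod is Ready.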
CoreDNS also uses the hostname field on EndpointSlice addresses to generate per-pod A/AAAA records linked to the parent headless Service. Without a hostname on the EndpointSlice address, no per-pod DNS record is created for that pod.
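The default TTL for these records is 5 seconds. The DNS search path in pod /etc/resolv.conf lists search domains in this order: <namespace>.svc.cluster.local, svc.cluster.local, cluster.local. Queries with fewer than ndots dots (default 5) are tried against each search domain first, so a short name can cost several extra round trips that each return NXDOMAIN before the real answer arrives. If any of those queries is dropped, the client resolver waits out its timeout (5 seconds by default in glibc) before retrying, which is why short-name lookups can stall in 5-second increments.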
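To make the search-path expansion concrete, the sketch below lists the names a resolver would try, in order. It models glibc-style behavior as an assumption; other resolvers (musl, for example) can behave differently, and the function name is hypothetical.

```shell
# List the names a resolver would try, in order, for a given query.
# Arguments: name, ndots, then the search domains from /etc/resolv.conf.
search_candidates() {
  name="$1"; ndots="$2"; shift 2
  case "$name" in
    *.) printf '%s\n' "$name"; return 0 ;;   # trailing dot: absolute, no search
  esac
  # Count the dots in the name.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  if [ "$dots" -ge "$ndots" ]; then
    printf '%s.\n' "$name"                   # enough dots: try as-is first
  fi
  for domain in "$@"; do
    printf '%s.%s.\n' "$name" "$domain"
  done
  if [ "$dots" -lt "$ndots" ]; then
    printf '%s.\n' "$name"                   # too few dots: as-is comes last
  fi
}
```

Calling `search_candidates db.prod 5 prod.svc.cluster.local svc.cluster.local cluster.local` prints four candidates, and only the second (db.prod.svc.cluster.local.) names a real Service. A name that already ends in a dot yields a single candidate.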
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Headless Service has no named ports | dig SRV returns NXDOMAIN | kubectl get svc for empty or unnamed ports |
| Pod missing hostname or subdomain | Only Service-level A records exist | Pod spec for hostname and subdomain |
| Pod not Ready | Pod IP missing from DNS despite correct config | Pod Ready condition |
| EndpointSlice address lacks hostname | No individual pod FQDN resolves | kubectl get endpointslices for hostname field |
| Selectorless headless Service without manual Endpoints | DNS returns no records | kubectl get endpoints for empty subsets |
| ndots:5 search path delay | 5-second DNS stalls on short names | /etc/resolv.conf in client pod |
| CoreDNS degraded or OOM killed | All internal DNS fails | CoreDNS pod status and restarts |
Quick checks
Run the kubectl commands from a machine with cluster access and the dig commands from a debug pod inside the cluster.
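# Verify the Service is headless (first line prints None) and list each port's name, protocol, and number
kubectl get svc <svc-name> -n <ns> -o jsonpath='{.spec.clusterIP}{"\n"}{range .spec.ports[*]}{.name}{"\t"}{.protocol}{"\t"}{.port}{"\n"}{end}'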
# Check pod hostname, subdomain, and readiness
kubectl get pods -n <ns> -l <selector> \
-o custom-columns='NAME:.metadata.name,HOSTNAME:.spec.hostname,SUBDOMAIN:.spec.subdomain,READY:.status.conditions[?(@.type=="Ready")].status'
# Inspect EndpointSlice hostnames for this Service only
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc-name> \
  -o json | jq '.items[].endpoints[] | {ip: .addresses[0], hostname: .hostname, ready: .conditions.ready}'
# Query A records for the headless Service
dig +short <svc-name>.<ns>.svc.cluster.local
# Query SRV records for a named port
dig +short SRV _<port-name>._tcp.<svc-name>.<ns>.svc.cluster.local
# Query a specific pod FQDN
dig +short <hostname>.<subdomain>.<ns>.svc.cluster.local
# Check client ndots and search path
kubectl run -it --rm debug --image=busybox:1.36 --restart=Never -- cat /etc/resolv.conf
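# Check CoreDNS latency and error metrics. CoreDNS serves them on port 9153 by default;
# this assumes a Deployment named coredns in kube-system. Run the port-forward in a second terminal.
kubectl -n kube-system port-forward deploy/coredns 9153:9153
curl -s http://localhost:9153/metrics | grep -E 'coredns_dns_request_duration_seconds|coredns_dns_responses_total'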
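When SRV answers do come back, each line of dig +short SRV output has the form "priority weight port target". The helper below is a small sketch (the function name is hypothetical) that turns those lines into target:port pairs, which makes it easier to confirm that every expected pod and port is present.

```shell
# Convert `dig +short SRV` lines ("priority weight port target") into
# "target:port" pairs, dropping the trailing dot from the target.
srv_to_hostport() {
  awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}
```

A typical use would be `dig +short SRV _<port-name>._tcp.<svc-name>.<ns>.svc.cluster.local | srv_to_hostport`. Lines that do not have exactly four fields are ignored.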
How to diagnose it
1. Confirm the Service is headless. Check kubectl get svc <name> -o yaml for clusterIP: None. If the field has an IP, DNS returns the cluster IP instead of pod IPs.
2. Verify named ports for SRV. A headless Service with an empty .spec.ports array produces zero DNS records. Ensure at least one port has a name and that the protocol matches your query.
3. Check pod hostname and subdomain fields. For individual pod DNS, spec.hostname must be set and spec.subdomain must exactly match the headless Service name.
4. Validate pod readiness. A pod that is not Ready is excluded from DNS unless the Service sets publishNotReadyAddresses: true. Check the Ready condition in pod status.
5. Inspect the EndpointSlice for hostname values. Without a hostname on the EndpointSlice address, CoreDNS will not create a per-pod A/AAAA record even if the pod has hostname and subdomain set.
6. Test DNS from inside the cluster. Use dig for the Service A record, SRV record, and pod FQDN. If queries time out, try the fully qualified name with a trailing dot to bypass search domains.
7. Check CoreDNS health. If configuration looks correct but DNS fails, check CoreDNS pod status, memory limits, and logs. CoreDNS versions earlier than 1.7.0 can exit during API server network jitters and leave records stale.
8. Verify manual Endpoints for selectorless Services. If the headless Service has no .spec.selector, create an Endpoints resource manually. Confirm the Endpoints subset lists the target IPs and that port numbers align with the Service spec.
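The hostname, subdomain, and readiness checks above can be partly automated. The sketch below reads the NAME/HOSTNAME/SUBDOMAIN/READY table produced by the custom-columns command in Quick checks and reports pods that would be missing from per-pod DNS. The function name is hypothetical, and it assumes kubectl prints <none> for unset fields.

```shell
# Read the NAME/HOSTNAME/SUBDOMAIN/READY table from stdin and report pods
# that fail the hostname, subdomain, or readiness check.
# Argument: the headless Service name the subdomain must match.
flag_pod_dns_problems() {
  awk -v svc="$1" '
    NR == 1 { next }                                   # skip the header row
    $2 == "<none>" { print $1 ": hostname not set" }
    $3 != svc      { print $1 ": subdomain \"" $3 "\" does not match Service \"" svc "\"" }
    $4 != "True"   { print $1 ": not Ready" }
  '
}
```

Pipe the custom-columns output into `flag_pod_dns_problems <svc-name>`. No output means every listed pod passed all three checks; the EndpointSlice and DNS queries still need to be verified separately.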
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| CoreDNS request latency (coredns_dns_request_duration_seconds) | Elevated latency affects all cluster DNS | p99 > 500 ms sustained |
| CoreDNS SERVFAIL rate (coredns_dns_responses_total{rcode="SERVFAIL"}) | Indicates resolution failures | Any sustained nonzero rate above baseline |
| CoreDNS container restarts | DNS authority instability causes lookup outages | Restart count increasing |
| Pod Ready condition | Unready pods drop out of headless DNS | Ready=False for backing pods |
| EndpointSlice hostname coverage | Missing hostnames mean missing per-pod records | Endpoints with null or missing hostname |
| API server LIST/WATCH latency (apiserver_request_duration_seconds) | CoreDNS learns endpoints through API watches | p99 > 1 s for LIST or WATCH verbs |
| Application connection timeout rate | Stale DNS or missing records surface as timeouts | Timeouts spike after pod rescheduling |
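For a quick look at the SERVFAIL signal without a monitoring stack, the sketch below reads Prometheus text-format metrics scraped from a CoreDNS pod and prints the SERVFAIL share of all responses. The function name is hypothetical. Because coredns_dns_responses_total is a cumulative counter, the result covers the whole process lifetime; it is not a rate, so treat it as a rough indicator only.

```shell
# Read Prometheus text-format metrics on stdin and print the fraction of
# CoreDNS responses with rcode SERVFAIL since the process started.
servfail_fraction() {
  awk '
    /^coredns_dns_responses_total[{]/ {
      total += $NF                               # last field is the sample value
      if ($0 ~ /rcode="SERVFAIL"/) fail += $NF
    }
    END {
      if (total > 0) printf "%.4f\n", fail / total
      else print "no data"
    }
  '
}
```

Feed it the output of a metrics scrape, for example `curl -s <coredns-metrics-url> | servfail_fraction`. Comparing two readings taken a few minutes apart gives a better sense of the current trend than a single value.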
Fixes
If the cause is missing Service ports
Add at least one named port to the headless Service .spec.ports. Without ports, Kubernetes populates no endpoints for the Service (Endpoints subsets or EndpointSlice entries), and CoreDNS generates no records.
If the cause is pod hostname or subdomain mismatch
Ensure the pod template sets spec.hostname and sets spec.subdomain to the headless Service name. If either is missing, CoreDNS cannot build the pod FQDN.
If the cause is pod readiness
Fix the container health checks or workload dependencies preventing the pod from becoming Ready. If your workload requires DNS before it can pass readiness, consider setting publishNotReadyAddresses: true temporarily, but remove it once stable.
If the cause is EndpointSlice gaps
For selectorless headless Services, create the Endpoints resource manually, using the same name as the Service. Ensure the Endpoints subset lists the correct IPs and that its port number matches the Service targetPort (which defaults to .spec.ports[].port), with the same port name when the Service names its ports.
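A manually managed Endpoints object for a selectorless headless Service might look like the sketch below. The function name and the example values (an address such as 10.0.0.10, a hostname such as ext-0) are hypothetical; the Endpoints name has to equal the Service name for the two to be associated.

```shell
# Print an Endpoints manifest for a selectorless headless Service.
# Arguments: service name, namespace, backend IP, hostname, port name, port.
manual_endpoints_manifest() {
  svc="$1"; ns="$2"; ip="$3"; host="$4"; port_name="$5"; port="$6"
  cat <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  name: ${svc}             # must equal the Service name
  namespace: ${ns}
subsets:
- addresses:
  - ip: ${ip}
    hostname: ${host}      # enables a per-endpoint DNS name
  ports:
  - name: ${port_name}     # keep in sync with the Service port name
    port: ${port}
    protocol: TCP
EOF
}
```

For example, `manual_endpoints_manifest db prod 10.0.0.10 ext-0 peer 5432 | kubectl apply -f -` would register one backend, after which `kubectl get endpoints db -n prod` should show a non-empty subset.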
If the cause is DNS search path latency
Use fully qualified domain names in application configuration, ending with a dot to bypass search domains. Alternatively, lower ndots in the pod dnsConfig if the application only resolves FQDNs.
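Lowering ndots is a pod-level setting. The sketch below prints the dnsConfig fragment to merge under the pod template's spec; the function name is hypothetical and the default of 1 is an illustrative choice, not a recommendation for every workload.

```shell
# Print a pod-spec fragment that lowers ndots.
# Argument: the ndots value (defaults to 1). The API expects it as a string.
ndots_dnsconfig() {
  cat <<EOF
dnsConfig:
  options:
  - name: ndots
    value: "${1:-1}"
EOF
}
```

With ndots set to 1, any name containing at least one dot is tried as-is before the search domains, so short names such as <svc-name>.<ns> may stop resolving under resolvers that skip the search list once the threshold is met. Test the application's lookups before rolling this out.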
If the cause is CoreDNS instability
Scale CoreDNS replicas to match cluster query load. Increase the CoreDNS memory limit if it is OOM killed. Because headless Service records default to a 5-second TTL, avoid CoreDNS cache settings that hold answers much longer than that; stale cached answers delay failover after pod rescheduling.
Prevention
- Define named ports on every headless Service that must support SRV discovery.
- Validate pod templates with a CI check that confirms subdomain matches an existing headless Service.
- Monitor CoreDNS memory and replica count as cluster size grows.
- Use FQDNs with trailing dots in application configs to avoid ndots amplification.
- After rolling updates to CoreDNS or the API server, run a smoke test that queries headless Service A and SRV records from a debug pod.
How Netdata helps
Netdata correlates the layers involved in headless DNS incidents:
- CoreDNS latency and error charts surface p99 latency and SERVFAIL trends before applications report failures.
- Pod readiness and container restart alerts catch unready pods that have disappeared from headless DNS.
- API server latency correlation helps determine whether slow DNS updates stem from control plane watch delays.
- System DNS response metrics from client nodes isolate whether delays are inside CoreDNS or in the client resolver stack.
Related guides
- How the Kubernetes control plane works: a mental model for operators
- Kubernetes conntrack exhaustion: dropped connections under load
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes API server etcd latency: detection and cascading failures
- Kubernetes API server watch storm: re-list cascades and connection floods






