Kubernetes headless service resolution: SRV records and pod discovery

You deployed a StatefulSet with a headless Service so peers can discover each other, but nslookup returns NXDOMAIN or only a single IP when several pods are running. Your application might rely on SRV records for port discovery and the lookup returns nothing. Or a pod rescheduled onto a new node and clients kept trying the old IP for minutes because the TTL behavior surprised you.

These symptoms trace back to how Kubernetes translates headless Service semantics into DNS records. A headless Service does not give you a stable cluster IP. Instead, the control plane exposes pod identities directly through DNS. When a pod is missing its hostname, a port is unnamed, or an EndpointSlice lacks the right metadata, the records simply do not exist.

After reading this guide you will be able to verify why a pod is missing its DNS record, why an SRV query fails, and how to distinguish misconfiguration from DNS layer delays.

flowchart TD
    A[Client queries CoreDNS] --> B{Service clusterIP=None?}
    B -->|No| C[Return cluster IP A record]
    B -->|Yes| D{Query type?}
    D -->|A/AAAA| E[Select Ready Endpoints]
    D -->|SRV| F{Named port exists?}
    F -->|No| G[Return NXDOMAIN]
    F -->|Yes| H[Return SRV record with port and target FQDN]
    E --> I{EndpointSlice hostname set?}
    I -->|No| J[Return pod IPs at Service name]
    I -->|Yes| K[Return hostname.subdomain.namespace.svc.cluster.local]

What this means

A headless Service is defined by setting .spec.clusterIP to "None". Kubernetes does not allocate a cluster IP and kube-proxy does not program virtual IP rules for it. Instead, DNS queries against the Service name return A or AAAA records for each backing pod IP directly. If the Service has named ports, CoreDNS also creates SRV records of the form _<port-name>._<protocol>.<svc-name>.<namespace>.svc.cluster.local.
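As a small illustration of the naming scheme above, the helper below assembles the SRV query name from its parts. The function name srv_name and the example values (gossip, peers, demo) are hypothetical, and the cluster domain is assumed to be cluster.local unless a fifth argument overrides it.

```shell
# Build the SRV query name for a named port on a headless Service.
# usage: srv_name <port-name> <protocol> <svc-name> <namespace> [cluster-domain]
srv_name() {
  printf '_%s._%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "$4" "${5:-cluster.local}"
}

# Example (run inside the cluster):
#   dig +short SRV "$(srv_name gossip tcp peers demo)"
```

Each SRV answer carries a priority, weight, port, and target name, so a client learns both the port number and the per-pod target from a single lookup.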

For a pod to receive an individual DNS record (<hostname>.<subdomain>.<namespace>.svc.cluster.local), three conditions must all be true: the pod has spec.hostname set; a headless Service exists in the same namespace with the same name as spec.subdomain; and the pod is Ready, unless the Service sets publishNotReadyAddresses: true.
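A minimal sketch of manifests that satisfy the conditions above. Every name here (peers, peer-0, the demo namespace, the gossip port, the pause image) is an illustrative assumption, and the demo namespace is assumed to exist already.

```shell
# Write a hypothetical headless Service and one Pod that meets the three conditions.
cat > headless-demo.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: peers
  namespace: demo
spec:
  clusterIP: None          # headless: no virtual IP is allocated
  selector:
    app: peer
  ports:
    - name: gossip         # SRV records are built from named ports
      protocol: TCP
      port: 7946
---
apiVersion: v1
kind: Pod
metadata:
  name: peer-0
  namespace: demo
  labels:
    app: peer
spec:
  hostname: peer-0         # first label of the pod FQDN
  subdomain: peers         # must equal the headless Service name
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
EOF
# Apply with: kubectl apply -f headless-demo.yaml
```

With the default cluster domain, the pod in this sketch would be addressed as peer-0.peers.demo.svc.cluster.local once it is Ready. StatefulSet pods get the equivalent fields from the controller, so this explicit form mainly matters for bare pods and Deployments.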

CoreDNS also uses the hostname field on EndpointSlice addresses to generate per-pod A/AAAA records linked to the parent headless Service. Without a hostname on the EndpointSlice address, no per-pod DNS record is created for that pod.

The default TTL for these records is 5 seconds. The DNS search path in a pod's /etc/resolv.conf lists search domains in this order: <namespace>.svc.cluster.local, svc.cluster.local, cluster.local. A name with fewer dots than ndots (5 by default in pods) is tried against each search domain before it is tried as written, so a short name can cost several extra queries that each end in NXDOMAIN. If any of those queries is dropped, the resolver waits out its timeout (5 seconds by default) before retrying, which shows up as lookups stalling in 5-second increments.
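To make the search-path expansion concrete, the sketch below prints the names a glibc-style resolver would try, in order. The function name candidate_names is hypothetical, and the ordering rule is an assumption based on glibc behavior; other resolvers, such as the one in musl-based images, can differ.

```shell
# List the query names a glibc-style resolver would try, in order.
# usage: candidate_names <name> <ndots> <search-domain>...
candidate_names() {
  name=$1; ndots=$2; shift 2
  case $name in
    *.) printf '%s\n' "$name"; return ;;   # trailing dot: absolute name, no search
  esac
  # Count the dots in the name (arithmetic expansion strips wc padding).
  dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
  # Enough dots: the name is tried as written before the search list.
  if [ "$dots" -ge "$ndots" ]; then printf '%s.\n' "$name"; fi
  for domain in "$@"; do printf '%s.%s.\n' "$name" "$domain"; done
  # Too few dots: the name as written is tried only after the search list.
  if [ "$dots" -lt "$ndots" ]; then printf '%s.\n' "$name"; fi
}

# Example:
#   candidate_names peers.demo 5 demo.svc.cluster.local svc.cluster.local cluster.local
```

For a two-label name such as peers.demo with ndots set to 5, the function lists three search-domain attempts before the name as written, which is why a fully qualified name with a trailing dot saves queries.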

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Headless Service has no named ports | dig SRV returns NXDOMAIN | kubectl get svc for empty or unnamed ports |
| Pod missing hostname or subdomain | Only Service-level A records exist | Pod spec for hostname and subdomain |
| Pod not Ready | Pod IP missing from DNS despite correct config | Pod Ready condition |
| EndpointSlice address lacks hostname | No individual pod FQDN resolves | kubectl get endpointslices for hostname field |
| Selectorless headless Service without manual Endpoints | DNS returns no records | kubectl get endpoints for empty subsets |
| ndots:5 search path delay | 5-second DNS stalls on short names | /etc/resolv.conf in client pod |
| CoreDNS degraded or OOM killed | All internal DNS fails | CoreDNS pod status and restarts |

Quick checks

Run the kubectl commands from a workstation with cluster access, and the dig commands from a debug pod inside the cluster.

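# Verify the Service is headless (first line should be None) and list each port's name, protocol, and number
kubectl get svc <svc-name> -n <ns> \
  -o jsonpath='{.spec.clusterIP}{"\n"}{range .spec.ports[*]}{.name}{"\t"}{.protocol}{"\t"}{.port}{"\n"}{end}'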

# Check pod hostname, subdomain, and readiness
kubectl get pods -n <ns> -l <selector> \
  -o custom-columns='NAME:.metadata.name,HOSTNAME:.spec.hostname,SUBDOMAIN:.spec.subdomain,READY:.status.conditions[?(@.type=="Ready")].status'

# Inspect EndpointSlice hostnames for this Service only
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc-name> \
  -o json | jq '.items[].endpoints[]? | {ip: .addresses[0], hostname: .hostname, ready: .conditions.ready}'

# Query A records for the headless Service
dig +short <svc-name>.<ns>.svc.cluster.local

# Query SRV records for a named port
dig +short SRV _<port-name>._tcp.<svc-name>.<ns>.svc.cluster.local

# Query a specific pod FQDN
dig +short <hostname>.<subdomain>.<ns>.svc.cluster.local

# Check client ndots and search path (run in the client's namespace so the search list matches)
kubectl run -it --rm debug -n <ns> --image=busybox:1.36 --restart=Never -- cat /etc/resolv.conf

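# Check CoreDNS latency and error metrics (served by CoreDNS on port 9153, not by the API server)
# Assumes the Deployment is named coredns in kube-system; adjust for your distribution
kubectl -n kube-system port-forward deploy/coredns 9153:9153 >/dev/null &
sleep 2; curl -s http://localhost:9153/metrics | grep -E 'coredns_dns_request_duration_seconds|coredns_dns_responses_total'; kill $!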

How to diagnose it

  1. Confirm the Service is headless. Check kubectl get svc <name> -o yaml for clusterIP: None. If the field has an IP, DNS returns the cluster IP instead of pod IPs.

  2. Verify named ports for SRV. A headless Service with an empty .spec.ports array produces zero DNS records. Ensure at least one port has a name and that the protocol matches your query.

  3. Check pod hostname and subdomain fields. For individual pod DNS, spec.hostname must be set and spec.subdomain must exactly match the headless Service name.

  4. Validate pod readiness. A pod that is not Ready is excluded from DNS unless the Service sets publishNotReadyAddresses: true. Check the Ready condition in pod status.

  5. Inspect the EndpointSlice for hostname values. Without a hostname on the EndpointSlice address, CoreDNS will not create a per-pod A/AAAA record even if the pod has hostname and subdomain set.

  6. Test DNS from inside the cluster. Use dig for the Service A record, SRV record, and pod FQDN. If queries time out, try the fully qualified name with a trailing dot to bypass search domains.

  7. Check CoreDNS health. If configuration looks correct but DNS fails, check CoreDNS pod status, memory limits, and logs. CoreDNS versions earlier than 1.7.0 can exit during network jitter between CoreDNS and the API server and leave records stale.

  8. Verify manual Endpoints for selectorless Services. If the headless Service has no .spec.selector, create an Endpoints resource manually. Confirm the Endpoints subset lists the target IPs and that port numbers align with the Service spec.
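The first two steps lend themselves to a small script. The following is a minimal sketch, assuming kubectl is already configured for the target cluster; the function name diagnose_headless and its messages are hypothetical.

```shell
# Check that a Service is headless and exposes at least one named port.
# usage: diagnose_headless <svc-name> <namespace>
diagnose_headless() {
  svc=$1; ns=$2
  # Step 1: a headless Service reports clusterIP "None".
  cluster_ip=$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')
  if [ "$cluster_ip" != "None" ]; then
    echo "not headless: clusterIP=$cluster_ip"
    return 1
  fi
  # Step 2: SRV records need at least one named port.
  names=$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.ports[*].name}')
  if [ -z "$names" ]; then
    echo "no named ports: SRV lookups will fail"
    return 2
  fi
  echo "headless; named ports: $names"
}

# Example:
#   diagnose_headless <svc-name> <ns>
```

The distinct return codes let a wrapper script tell the two failure modes apart without parsing the message.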

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| CoreDNS request latency (coredns_dns_request_duration_seconds) | Elevated latency affects all cluster DNS | p99 > 500 ms sustained |
| CoreDNS SERVFAIL rate (coredns_dns_responses_total{rcode="SERVFAIL"}) | Indicates resolution failures | Sustained rate above baseline |
| CoreDNS container restarts | DNS authority instability causes lookup outages | Restart count increasing |
| Pod Ready condition | Unready pods drop out of headless DNS | Ready=False for backing pods |
| EndpointSlice hostname coverage | Missing hostnames mean missing per-pod records | Endpoints with null or missing hostname |
| API server LIST latency (apiserver_request_duration_seconds) | CoreDNS learns endpoints through API LIST and WATCH calls | p99 > 1 s for LIST (WATCH requests are long-lived, so their duration is not a latency signal) |
| Application connection timeout rate | Stale DNS or missing records surface as timeouts | Timeouts spike after pod rescheduling |
Application connection timeout rateStale DNS or missing records surface as timeoutsTimeouts spike after pod rescheduling

Fixes

If the cause is missing Service ports

Add at least one named port to the headless Service .spec.ports. SRV records are generated only for named ports, and a Service with no ports at all can end up with no usable endpoint entries, in which case CoreDNS generates no records.

If the cause is pod hostname or subdomain mismatch

Ensure the pod template sets spec.hostname, and sets spec.subdomain to the name of the headless Service. If either field is missing, CoreDNS cannot build the pod FQDN.

If the cause is pod readiness

Fix the container health checks or workload dependencies preventing the pod from becoming Ready. If your workload requires DNS before it can pass readiness, consider setting publishNotReadyAddresses: true temporarily, but remove it once stable.
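Because the setting is meant to be temporary, it helps to toggle it with a patch rather than editing the manifest by hand. A minimal sketch follows; the helper name publish_patch is hypothetical, and it only prints the merge patch so the kubectl call stays explicit.

```shell
# Print the merge patch that turns publishNotReadyAddresses on or off.
# usage: publish_patch true|false
publish_patch() {
  case $1 in
    true|false) printf '{"spec":{"publishNotReadyAddresses":%s}}\n' "$1" ;;
    *) echo "usage: publish_patch true|false" >&2; return 1 ;;
  esac
}

# Enable during bootstrap, then revert once the workload is stable:
#   kubectl patch svc <svc-name> -n <ns> --type merge -p "$(publish_patch true)"
#   kubectl patch svc <svc-name> -n <ns> --type merge -p "$(publish_patch false)"
```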

If the cause is EndpointSlice gaps

For selectorless headless Services, create the Endpoints resource manually and give it the same name as the Service. Ensure the Endpoints subset lists the correct IPs, and that each port in the subset matches the Service's targetPort (which defaults to .spec.ports[].port) and, for named ports, uses the same port name.
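A minimal sketch of a selectorless headless Service with a hand-written Endpoints object. All names, the namespace, and the port are placeholders, and the IPs come from the 192.0.2.0/24 documentation range, so substitute real backend addresses.

```shell
# Write a selectorless headless Service plus a matching Endpoints object.
cat > external-peers.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-peers
  namespace: demo
spec:
  clusterIP: None          # headless, and no selector is defined
  ports:
    - name: gossip
      protocol: TCP
      port: 7946
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-peers     # must equal the Service name
  namespace: demo
subsets:
  - addresses:
      - ip: 192.0.2.10
        hostname: peer-a   # optional; gives this address its own DNS name
      - ip: 192.0.2.11
        hostname: peer-b
    ports:
      - name: gossip       # same name as the Service port
        protocol: TCP
        port: 7946
EOF
# Apply with: kubectl apply -f external-peers.yaml
```

After applying, kubectl get endpoints external-peers -n demo should list both addresses; an empty result usually means the names or namespaces do not line up.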

If the cause is DNS search path latency

Use fully qualified domain names in application configuration, ending with a dot to bypass search domains. Alternatively, lower ndots in the pod dnsConfig if the application only resolves FQDNs.
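For the second option, ndots is set per pod through dnsConfig. The sketch below uses hypothetical names (fqdn-client, the demo namespace, the pause image), and the value of 1 is an assumption suited to workloads that only resolve fully qualified names.

```shell
# Write a Pod manifest that lowers ndots for its resolver.
cat > low-ndots-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fqdn-client
  namespace: demo
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"         # names with at least one dot are tried as written first
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
EOF
# Apply with: kubectl apply -f low-ndots-pod.yaml
```

Resolver behavior for partially qualified names such as <svc-name>.<ns> differs between libc implementations, so test short-name lookups with the actual application image before rolling this out.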

If the cause is CoreDNS instability

Scale CoreDNS replicas to match cluster query load. Increase the CoreDNS memory limit if it is OOM killed. Because headless Service records default to a 5-second TTL, avoid CoreDNS cache settings that hold answers for longer, or clients will keep resolving old pod IPs after rescheduling.

Prevention

  • Define named ports on every headless Service that must support SRV discovery.
  • Validate pod templates with a CI check that confirms subdomain matches an existing headless Service.
  • Monitor CoreDNS memory and replica count as cluster size grows.
  • Use FQDNs with trailing dots in application configs to avoid ndots amplification.
  • After rolling updates to CoreDNS or the API server, run a smoke test that queries headless Service A and SRV records from a debug pod.
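The smoke test in the last item can be as small as the sketch below, run from a debug pod that has dig installed. The function name smoke_test is hypothetical, and the sketch assumes the cluster domain is cluster.local, the port uses TCP, and every ready pod should appear in both answers.

```shell
# Compare A and SRV answer counts for a headless Service against an expected pod count.
# usage: smoke_test <svc-name> <namespace> <port-name> <expected-count>
smoke_test() {
  svc=$1; ns=$2; port=$3; want=$4
  fqdn="$svc.$ns.svc.cluster.local."     # trailing dot skips the search path
  # grep -c . counts non-empty answer lines; "|| true" keeps a zero count from aborting.
  a_count=$(dig +short "$fqdn" | grep -c . || true)
  srv_count=$(dig +short SRV "_$port._tcp.$fqdn" | grep -c . || true)
  echo "A=$a_count SRV=$srv_count want=$want"
  [ "$a_count" -eq "$want" ] && [ "$srv_count" -eq "$want" ]
}

# Example:
#   smoke_test <svc-name> <ns> <port-name> 3 || echo "headless DNS check failed"
```

The function exits nonzero when either count is off, so it can gate a rollout step directly.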

How Netdata helps

Netdata correlates the layers involved in headless DNS incidents:

  • CoreDNS latency and error charts surface p99 latency and SERVFAIL trends before applications report failures.
  • Pod readiness and container restart alerts catch unready pods that have disappeared from headless DNS.
  • API server latency correlation helps determine whether slow DNS updates stem from control plane watch delays.
  • System DNS response metrics from client nodes isolate whether delays are inside CoreDNS or in the client resolver stack.