Kubernetes headless service resolution: SRV records and pod discovery

You deployed a StatefulSet with a headless Service so peers can discover each other, but nslookup returns NXDOMAIN or only a single IP when several pods are running. Your application might rely on SRV records for port discovery and the lookup returns nothing. Or a pod rescheduled onto a new node and clients kept trying the old IP for minutes because the TTL behavior surprised you.

These symptoms trace back to how Kubernetes translates headless Service semantics into DNS records. A headless Service does not give you a stable cluster IP. Instead, the control plane exposes pod identities directly through DNS. When a pod is missing its hostname, a port is unnamed, or an EndpointSlice lacks the right metadata, the records simply do not exist.

After reading this guide you will be able to verify why a pod is missing its DNS record, why an SRV query fails, and how to distinguish misconfiguration from DNS layer delays.

flowchart TD
    A[Client queries CoreDNS] --> B{Service clusterIP=None?}
    B -->|No| C[Return cluster IP A record]
    B -->|Yes| D{Query type?}
    D -->|A/AAAA| E[Select Ready Endpoints]
    D -->|SRV| F{Named port exists?}
    F -->|No| G[Return NXDOMAIN]
    F -->|Yes| H[Return SRV record with port and target FQDN]
    E --> I{EndpointSlice hostname set?}
    I -->|No| J[Return pod IPs at Service name]
    I -->|Yes| K[Return hostname.subdomain.namespace.svc.cluster.local]

What this means

A headless Service is defined by setting .spec.clusterIP to "None". Kubernetes does not allocate a cluster IP and kube-proxy does not program virtual IP rules for it. Instead, DNS queries against the Service name return A or AAAA records for each backing pod IP directly. If the Service has named ports, CoreDNS also creates SRV records of the form _<port-name>._<protocol>.<svc-name>.<namespace>.svc.cluster.local.
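As a small illustration of the naming scheme above, the following shell sketch assembles an SRV query name from its parts. The function name srv_name and the sample values are hypothetical, and the cluster domain is assumed to be the default cluster.local.

```shell
#!/bin/sh
# Build the SRV query name for a named port on a headless Service.
# Assumption: the cluster domain is cluster.local unless CLUSTER_DOMAIN is set.
srv_name() {
  port_name=$1   # must match .spec.ports[].name on the Service
  protocol=$2    # lowercase protocol, for example tcp or udp
  service=$3
  namespace=$4
  printf '_%s._%s.%s.%s.svc.%s.\n' \
    "$port_name" "$protocol" "$service" "$namespace" "${CLUSTER_DOMAIN:-cluster.local}"
}

# Example with placeholder names: prints _peer._tcp.db.prod.svc.cluster.local.
srv_name peer tcp db prod
```

The result can be passed to dig SRV from a pod. The trailing dot marks the name as fully qualified, so the resolver skips the search list.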

For a pod to receive an individual DNS record (<hostname>.<subdomain>.<namespace>.svc.cluster.local), three conditions must all be true: the pod sets spec.hostname; the pod sets spec.subdomain to the name of a headless Service in the same namespace; and the pod is Ready, unless the Service sets publishNotReadyAddresses: true.

CoreDNS also uses the hostname field on EndpointSlice addresses to generate per-pod A/AAAA records linked to the parent headless Service. Without a hostname on the EndpointSlice address, no per-pod DNS record is created for that pod.
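The conditions for a per-pod record can be seen together in a minimal manifest. This is only a sketch: every name in it (db, db-0, peer, port 7000, the pause image) is a placeholder rather than something taken from a real deployment.

```shell
#!/bin/sh
# Write a minimal manifest: a headless Service plus one pod that qualifies
# for a per-pod DNS record. All names and the image are placeholders.
cat > headless-demo.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None        # headless: no virtual IP is allocated
  selector:
    app: db
  ports:
    - name: peer         # a named port is what makes SRV records possible
      port: 7000
      protocol: TCP
---
apiVersion: v1
kind: Pod
metadata:
  name: db-0
  labels:
    app: db
spec:
  hostname: db-0         # becomes the first label of the pod FQDN
  subdomain: db          # must equal the headless Service name
  containers:
    - name: db
      image: registry.k8s.io/pause:3.9
      ports:
        - containerPort: 7000
EOF

# Apply it with: kubectl apply -n <ns> -f headless-demo.yaml
```

Once the pod is Ready, db-0.db.<ns>.svc.cluster.local should resolve to its IP. A StatefulSet sets both fields for you, deriving hostname from the pod name and subdomain from .spec.serviceName.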

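The default TTL for these records is 5 seconds. The DNS search path in pod /etc/resolv.conf lists search domains in this order: <namespace>.svc.cluster.local, svc.cluster.local, cluster.local. Names with fewer than ndots dots (default 5 in Kubernetes pods) are tried against each search domain before the literal name, so a single lookup can generate several queries that return NXDOMAIN before the real answer arrives. If any of those queries is dropped, the resolver waits out its timeout (5 seconds by default) before retrying, which is why stalls tend to appear in 5-second increments.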
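To make the search-path behavior concrete, here is a simplified model of how a glibc-style resolver expands a name. The function name candidates and the sample namespace prod are hypothetical, and other resolvers (musl, for example) may order or skip queries differently.

```shell
#!/bin/sh
# Simplified model of how a glibc-style resolver expands a name.
# Assumption: names with fewer than NDOTS dots try the search list first,
# then the literal name; a trailing dot disables the search list entirely.
NDOTS=${NDOTS:-5}
SEARCH=${SEARCH:-"prod.svc.cluster.local svc.cluster.local cluster.local"}

candidates() {
  name=$1
  case $name in
    *.) printf '%s\n' "$name"; return ;;   # absolute name: one query only
  esac
  # Count the dots by deleting every other character.
  dots=$(($(printf '%s' "$name" | tr -cd '.' | wc -c)))
  if [ "$dots" -ge "$NDOTS" ]; then
    printf '%s.\n' "$name"                 # enough dots: literal name first
  fi
  for domain in $SEARCH; do
    printf '%s.%s.\n' "$name" "$domain"
  done
  if [ "$dots" -lt "$NDOTS" ]; then
    printf '%s.\n' "$name"                 # literal name tried last
  fi
}

# A Service name with only four dots still goes through the search list.
candidates db.prod.svc.cluster.local
```

With the defaults above, the example prints four candidates and the literal name comes last. Adding a trailing dot to the argument reduces the list to a single query.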

Common causes

Cause	What it looks like	First thing to check
Headless Service has no named ports	`dig SRV` returns NXDOMAIN	`kubectl get svc` for empty or unnamed ports
Pod missing `hostname` or `subdomain`	Only Service-level A records exist	Pod spec for `hostname` and `subdomain`
Pod not Ready	Pod IP missing from DNS despite correct config	Pod Ready condition
EndpointSlice address lacks `hostname`	No individual pod FQDN resolves	`kubectl get endpointslices` for `hostname` field
Selectorless headless Service without manual Endpoints	DNS returns no records	`kubectl get endpoints` for empty subsets
`ndots:5` search path delay	5-second DNS stalls on short names	`/etc/resolv.conf` in client pod
CoreDNS degraded or OOM killed	All internal DNS fails	CoreDNS pod status and restarts

Quick checks

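Run the kubectl commands from a workstation with cluster access, and the dig commands from a debug pod inside the cluster.

# Verify the Service is headless (first line should be None) and list its port names
kubectl get svc <svc-name> -n <ns> \
  -o jsonpath='{.spec.clusterIP}{"\n"}{range .spec.ports[*]}{.name}{"\t"}{.port}{"/"}{.protocol}{"\n"}{end}'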

# Check pod hostname, subdomain, and readiness
kubectl get pods -n <ns> -l <selector> \
  -o custom-columns='NAME:.metadata.name,HOSTNAME:.spec.hostname,SUBDOMAIN:.spec.subdomain,READY:.status.conditions[?(@.type=="Ready")].status'

# Inspect EndpointSlice hostnames for this Service
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc-name> \
  -o json | jq '.items[].endpoints[]? | {ip: .addresses[0], hostname: .hostname, ready: .conditions.ready}'

# Query A records for the headless Service
dig +short <svc-name>.<ns>.svc.cluster.local

# Query SRV records for a named port
dig +short SRV _<port-name>._tcp.<svc-name>.<ns>.svc.cluster.local

# Query a specific pod FQDN
dig +short <hostname>.<subdomain>.<ns>.svc.cluster.local

# Check client ndots and search path
kubectl run -it --rm debug --image=busybox:1.36 --restart=Never -- cat /etc/resolv.conf

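# Check CoreDNS latency and error metrics
# (assumes CoreDNS runs in kube-system and serves metrics on its default port 9153)
kubectl get --raw "/api/v1/namespaces/kube-system/pods/<coredns-pod>:9153/proxy/metrics" \
  | grep -E 'coredns_dns_request_duration_seconds|coredns_dns_responses_total'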

How to diagnose it

1. Confirm the Service is headless. Check kubectl get svc <name> -o yaml for clusterIP: None. If the field has an IP, DNS returns the cluster IP instead of pod IPs.
2. Verify named ports for SRV. SRV records are created only for named ports, so ensure at least one port has a name and that the port name and protocol match your query. A headless Service with an empty .spec.ports array can end up with no DNS records at all.
3. Check pod hostname and subdomain fields. For individual pod DNS, spec.hostname must be set and spec.subdomain must exactly match the headless Service name.
4. Validate pod readiness. A pod that is not Ready is excluded from DNS unless the Service sets publishNotReadyAddresses: true. Check the Ready condition in pod status.
5. Inspect the EndpointSlice for hostname values. Without a hostname on the EndpointSlice address, CoreDNS will not create a per-pod A/AAAA record even if the pod has hostname and subdomain set.
6. Test DNS from inside the cluster. Use dig for the Service A record, SRV record, and pod FQDN. If queries time out, try the fully qualified name with a trailing dot to bypass search domains.
7. Check CoreDNS health. If configuration looks correct but DNS fails, check CoreDNS pod status, memory limits, and logs. CoreDNS versions earlier than 1.7.0 can exit during API server network jitter and leave records stale.
8. Verify manual Endpoints for selectorless Services. If the headless Service has no .spec.selector, Kubernetes does not populate endpoints for it, so create an Endpoints resource with the same name as the Service. Confirm the Endpoints subset lists the target IPs and that its port names match the Service spec.
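When testing from inside the cluster, the response status helps separate misconfiguration from DNS-layer trouble. The sketch below summarizes a full dig response (not +short) as a status and an answer count. The function name dig_summary is hypothetical, and the parsing assumes dig's usual header layout.

```shell
#!/bin/sh
# Summarize a full (non +short) dig response read from stdin as
# "<status> <answer-count>", for example "NXDOMAIN 0".
dig_summary() {
  awk '
    /->>HEADER<<-/ {
      # Header line looks like: ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1"
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /^;; flags:/ {
      # Flags line looks like: ";; flags: qr aa rd; QUERY: 1, ANSWER: 2, ..."
      for (i = 1; i <= NF; i++)
        if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
    }
    END {
      if (status == "") print "NO_RESPONSE 0"   # nothing parsed, e.g. a timeout
      else print status, answers + 0
    }
  '
}

# Usage from a debug pod (placeholder names):
#   dig SRV _peer._tcp.db.prod.svc.cluster.local. | dig_summary
```

As a rough reading: NXDOMAIN or NOERROR with zero answers usually points at Service, port, or pod configuration, while SERVFAIL or NO_RESPONSE points at CoreDNS or the network path.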

Metrics and signals to monitor

Signal	Why it matters	Warning sign
CoreDNS request latency (`coredns_dns_request_duration_seconds`)	Elevated latency affects all cluster DNS	p99 > 500 ms sustained
CoreDNS SERVFAIL rate (`coredns_dns_responses_total{rcode="SERVFAIL"}`)	Indicates resolution failures	Any sustained nonzero rate above baseline
CoreDNS container restarts	DNS authority instability causes lookup outages	Restart count increasing
Pod Ready condition	Unready pods drop out of headless DNS	Ready=False for backing pods
EndpointSlice hostname coverage	Missing hostnames mean missing per-pod records	Endpoints with null or missing `hostname`
API server LIST/WATCH latency (`apiserver_request_duration_seconds`)	CoreDNS learns endpoints through API watches	p99 > 1 s for LIST or WATCH verbs
Application connection timeout rate	Stale DNS or missing records surface as timeouts	Timeouts spike after pod rescheduling

Fixes

If the cause is missing Service ports

Add at least one named port to the headless Service .spec.ports. SRV records exist only for named ports, and a Service with no ports at all can be left without populated endpoints, in which case CoreDNS generates no records.

If the cause is pod hostname or subdomain mismatch

Ensure the pod template sets spec.hostname, and sets spec.subdomain to the exact name of the headless Service. If either field is missing or the subdomain does not match, CoreDNS cannot build the pod FQDN.

If the cause is pod readiness

Fix the container health checks or workload dependencies preventing the pod from becoming Ready. If your workload requires DNS before it can pass readiness, consider setting publishNotReadyAddresses: true temporarily, but remove it once stable.

If the cause is EndpointSlice gaps

For selectorless headless Services, create the Endpoints resource manually and give it the same name as the Service. Ensure the Endpoints subset lists the correct IPs, that each port name in the subset matches the corresponding Service port name, and that the port number in the subset is the port the backend actually listens on.
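A selectorless headless Service and its hand-written Endpoints object might look like the sketch below. The names, IPs, and port are placeholders (192.0.2.0/24 is a documentation-only address range), so substitute your own backends.

```shell
#!/bin/sh
# Selectorless headless Service with a hand-written Endpoints object.
# Names, IPs, and ports are placeholders.
cat > external-db.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  clusterIP: None
  ports:
    - name: peer
      port: 7000
      protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db      # must equal the Service name
subsets:
  - addresses:
      - ip: 192.0.2.10
      - ip: 192.0.2.11
    ports:
      - name: peer       # must equal the Service port name
        port: 7000
        protocol: TCP
EOF

# Apply it with: kubectl apply -n <ns> -f external-db.yaml
```

After applying, kubectl get endpoints external-db -n <ns> should list both addresses, and the Service name should resolve to them.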

If the cause is DNS search path latency

Use fully qualified domain names in application configuration, ending with a dot to bypass search domains. Alternatively, lower ndots in the pod dnsConfig if the application only resolves FQDNs.
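One way to lower ndots on an existing workload is a patch to the pod template's dnsConfig. The helper name ndots_patch is hypothetical and the value 2 is only an example. Choose a value that still lets the short names your application uses resolve through the search list.

```shell
#!/bin/sh
# Print a merge patch that sets ndots in a workload's pod template.
# Note: dnsConfig option values are strings, hence the quoted number.
ndots_patch() {
  printf '{"spec":{"template":{"spec":{"dnsConfig":{"options":[{"name":"ndots","value":"%s"}]}}}}}\n' "$1"
}

# Usage (placeholders in angle brackets):
#   kubectl patch deployment <name> -n <ns> -p "$(ndots_patch 2)"
ndots_patch 2
```

Changing the pod template triggers a rollout, so the new resolv.conf only appears in the replacement pods.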

If the cause is CoreDNS instability

Scale CoreDNS replicas to match cluster query load. Increase the CoreDNS memory limit if it is OOM killed. Because headless Service records default to a 5-second TTL, avoid CoreDNS cache settings that serve records for longer than that, since they delay failover after pod rescheduling.

Prevention

- Define named ports on every headless Service that must support SRV discovery.
- Validate pod templates with a CI check that confirms subdomain matches an existing headless Service.
- Monitor CoreDNS memory and replica count as cluster size grows.
- Use FQDNs with trailing dots in application configs to avoid ndots amplification.
- After rolling updates to CoreDNS or the API server, run a smoke test that queries headless Service A and SRV records from a debug pod.

How Netdata helps

Netdata correlates the layers involved in headless DNS incidents:

- CoreDNS latency and error charts surface p99 latency and SERVFAIL trends before applications report failures.
- Pod readiness and container restart alerts catch unready pods that have disappeared from headless DNS.
- API server latency correlation helps determine whether slow DNS updates stem from control plane watch delays.
- System DNS response metrics from client nodes isolate whether delays are inside CoreDNS or in the client resolver stack.

