NGINX DNS resolution failures on dynamic upstreams: 502s and resolver_timeout
What this means
Intermittent 502 Bad Gateway responses that only hit locations using a variable in proxy_pass, such as proxy_pass http://$backend;, point to dynamic DNS resolution failure. Static upstream locations are unaffected. dig from the host may succeed instantly while nginx logs show 502s with latency spikes clustering at exactly 30 seconds, the default resolver_timeout.
nginx resolves upstream hostnames through two paths.
A static directive such as proxy_pass http://backend.example.com; resolves once at configuration load via the OS resolver and caches the IP in worker memory. If the IP changes later, nginx continues routing to the stale address until you reload.
A variable directive such as proxy_pass http://$backend; triggers dynamic DNS resolution through nginx’s internal async resolver. The resolver does not block the worker’s event loop, but the request remains parked until the answer arrives or resolver_timeout expires. The default is 30 seconds. If the DNS server is slow, drops packets, or returns no record, the request hangs and then fails with 502. Because queries often succeed, failures look intermittent.
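As a rough illustration of the two paths, the fragment below places a static and a variable proxy_pass side by side. The resolver address 10.0.0.2, the location paths, and the reuse of backend.example.com for the variable case are placeholders chosen for this sketch, not values from a real deployment.

```nginx
# Fragment of the http context; all names and addresses are placeholders.
http {
    # Consulted only for hostnames that must be resolved at request time.
    resolver 10.0.0.2;

    server {
        listen 80;

        # Path 1: literal hostname, resolved once when the configuration loads.
        location /static-backend/ {
            proxy_pass http://backend.example.com;
        }

        # Path 2: hostname held in a variable, resolved per request
        # through the resolver configured above.
        location /dynamic-backend/ {
            set $backend backend.example.com;
            proxy_pass http://$backend;
        }
    }
}
```

Only the second location depends on the resolver directive at request time, which is why DNS trouble can show up on one path and not the other.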
dig from the host may succeed while nginx fails. The OS resolver can use /etc/resolv.conf, local caches, or multiple nameservers with fallback. nginx uses only the IP addresses explicitly configured in the resolver directive. If that specific path is unreachable, congested, or filtered by firewall rules that affect the nginx worker but not your shell session, dig succeeds and nginx does not.
flowchart LR
Client -->|request| Worker[nginx worker]
Worker -->|variable proxy_pass| Resolver[async resolver]
Resolver -->|DNS query| DNS[DNS server]
DNS -->|response| Resolver
Resolver -->|IP| Worker
Worker -->|proxy| Upstream[upstream app]
DNS -.->|timeout / fail| Resolver
Resolver -.->|resolver_timeout| Worker
Worker -->|502| Client
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| DNS server unreachable or slow from nginx | Random 502s clustered around 30s latency spikes; resolver errors in logs | dig @<resolver_ip> <hostname> from the nginx host |
| Missing resolver directive | All variable-proxy requests fail; “no resolver defined to resolve” errors in logs | nginx -T \| grep resolver |
| DNS cache expired with no valid= override | 502 bursts after DNS TTL expires; healthy between bursts | Resolver config for valid= parameter vs DNS TTL |
| Upstream hostname deleted or typoed | Consistent 502s for one specific hostname | dig the exact hostname from the host |
| Variable proxy_pass bypassing upstream block features | Flaky behavior even when DNS works, because keepalive pools and health checks are unavailable | nginx -T review to see if an explicit upstream block is defined but unused |
Quick checks
# Test DNS resolution using the same resolver IP as nginx
dig @<resolver_ip> <upstream_hostname>
# Inspect resolver and proxy_pass configuration (requires sufficient privileges)
nginx -T 2>/dev/null | grep -E '(resolver|proxy_pass)'
# Check error log for resolver errors
tail -n 1000 /var/log/nginx/error.log | grep -iE '(resolver|could not be resolved|host not found|no live upstreams)'
# Find URI paths that produce 502s (default combined log format)
awk '$9 == 502 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head
# Check connection counts via stub_status (assumes it is exposed at /nginx_status)
curl -s http://127.0.0.1/nginx_status
# Confirm outbound DNS socket state (run with appropriate privileges)
ss -tunap | grep ':53'
How to diagnose it
- Scope the 502s to variable proxy_pass locations. Parse access logs for 502 responses and compare the URIs to your configuration. If 502s only appear where proxy_pass contains a variable, you are looking at a dynamic resolution problem.
- Check for the 30-second latency signature. If your access log includes $request_time, filter for 502s and inspect the duration. Values clustering near 30.000 indicate resolver_timeout is firing. Shorter values may indicate immediate DNS refusal.
- Inspect the resolver configuration. Run nginx -T and locate the resolver directive. Note the IP addresses and any valid= parameter. If there is no resolver directive, nginx cannot resolve variable hostnames at runtime.
- Test resolution manually from the host. Use dig pointed at the exact resolver IP configured in nginx. If nginx runs in a container, run the test inside the same network namespace. If dig succeeds but nginx still fails, suspect packet loss, firewall rules affecting the nginx process, or an expired cache window.
- Read the error log for resolver messages. Entries containing “could not be resolved”, “host not found”, or “no resolver defined” confirm a DNS-level failure. “upstream timed out” points to an upstream-connect problem instead.
- Correlate with DNS server health. If you operate the DNS infrastructure, check its query logs and latency metrics for the same time windows.
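The latency check above only works if $request_time is actually logged, which the predefined combined format does not do. One possible way to add it, assuming a hypothetical format name timed and the default access log path, is to append timing fields to the combined layout:

```nginx
# http context. "timed" is a made-up format name for this sketch.
# The first two lines reproduce the predefined "combined" layout;
# the third appends total request time and upstream response time.
log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 'rt=$request_time urt=$upstream_response_time';

access_log /var/log/nginx/access.log timed;
```

Because the new fields are appended at the end, the positions of the status and URI fields used by awk-style parsing of the combined format stay the same.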
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| 502 rate on variable-proxy locations | Direct symptom of resolver or upstream connect failure | Sustained nonzero rate on paths using dynamic upstreams |
| $request_time latency spikes | DNS timeouts manifest as request latencies near resolver_timeout | P95 latency jumping to ~30s, or your configured timeout |
| Error log resolver entries | The only place nginx explicitly logs DNS failure | Any “host not found” or resolver timeout message |
| Active connections in Writing state | Requests waiting for resolution consume connection slots | Writing count elevated while request rate stays flat |
| DNS query latency from host | Measures the resolver path independently of nginx | Query time > 1s or packet loss to the configured resolver |
Fixes
Fix DNS reachability. If the configured resolver is unreachable, update the resolver directive to point to a working DNS server. Run nginx -t to validate syntax, then reload. If the resolver is a container or cluster DNS service, verify it is routable from the nginx network namespace.
Add or tune valid=. If failures correlate with TTL expiration, set resolver <ip> valid=30s; to cache responses and reduce query volume. This overrides the DNS TTL. The tradeoff is that a long cache delays failover when an upstream IP changes, while a short cache increases query volume and exposure to DNS blips.
Lower resolver_timeout to fail faster. The default 30 seconds is too long for most production proxies. Set resolver_timeout 5s; so DNS failures return 502 quickly instead of keeping the client hanging. Do not count on proxy_next_upstream here: a failed lookup ends the request before any upstream server is selected, so any retry has to come from the client or from a load balancer in front of nginx.
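Put together, the two resolver settings might look like the sketch below. The resolver address 10.0.0.2, the hostname, and the specific 30s and 5s values are illustrative assumptions to adapt, not recommendations for every environment.

```nginx
# Placeholders throughout; adjust addresses and durations to your setup.
server {
    listen 80;

    # Cache each answer for 30s, overriding the TTL in the DNS response.
    resolver 10.0.0.2 valid=30s;

    # Give up on a lookup after 5s instead of the default 30s.
    resolver_timeout 5s;

    location / {
        set $backend backend.example.com;
        proxy_pass http://$backend;
    }
}
```

Both directives are also accepted at the http and location levels, so they can be scoped to only the locations that use dynamic upstreams.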
Hardcode an emergency upstream IP. For critical paths, maintain a commented static upstream block with known good IPs. If DNS is completely broken, switch the location to the static upstream and reload. This sacrifices dynamic resolution for immediate availability.
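One way to keep such a fallback ready is sketched below. The upstream name backend_static, the 192.0.2.x addresses, the port, and the /api/ location are all hypothetical stand-ins for your own known-good values.

```nginx
# http context: emergency fallback, left commented out until needed.
# upstream backend_static {
#     server 192.0.2.10:8080;
#     server 192.0.2.11:8080;
# }

# server context:
location /api/ {
    # Normal operation: dynamic resolution.
    set $backend backend.example.com;
    proxy_pass http://$backend;

    # During a DNS outage: comment out the two lines above, uncomment the
    # upstream block and the line below, run nginx -t, then reload.
    # proxy_pass http://backend_static;
}
```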
Replace variable proxy_pass with static upstream blocks where possible. If the backend IP is stable, use a static proxy_pass to an explicit upstream block. This resolves DNS once at startup and restores upstream keepalive pooling and passive health checks. You must reload nginx to pick up IP changes.
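A minimal sketch of that layout follows, assuming a hypothetical upstream named backend_pool on port 8080. The max_fails, fail_timeout, and keepalive values are illustrative only.

```nginx
# http context; names, port, and tuning values are placeholders.
upstream backend_pool {
    # Hostname is resolved once when the configuration loads.
    # max_fails / fail_timeout provide passive health checking.
    server backend.example.com:8080 max_fails=3 fail_timeout=10s;

    # Idle connections to keep open per worker for reuse.
    keepalive 16;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend_pool;

        # Commonly set so that upstream keepalive connections are reused.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

The cost is the one already noted: a changed backend IP is only picked up after a reload.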
Prevention
- Cache DNS responses with valid= set to a safe duration for your environment.
- Monitor error logs for resolver keywords proactively.
- Keep a runbook for switching dynamic locations to static upstream IPs during DNS outages.
- Test resolver reachability from the nginx network namespace after any network or firewall changes.
How Netdata helps
- Correlate 502 spikes with P95 request latency to spot the resolver_timeout signature.
- Monitor error log rates for resolver keywords.
- Track active connection states to see if Writing connections accumulate during DNS outages.
- Alert on upstream response time anomalies for variable-proxy locations.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 413 Request Entity Too Large: client_max_body_size explained
- nginx 499 status code: why clients close connections before the response
- nginx 500 Internal Server Error: how to diagnose it
- nginx 502 Bad Gateway: causes and how to fix it
- nginx 503 Service Temporarily Unavailable: causes and fixes
- nginx 504 Gateway Time-out: causes and fixes
- NGINX active connections climbing: reading, writing, waiting explained
- nginx: bind() to 0.0.0.0:80 failed (98: Address already in use)
- NGINX backend cascade failure: when slow upstreams take down everything
- nginx: a client request body is buffered to a temporary file - what it means
- NGINX proxy cache hit rate is low: measuring and improving it







