NGINX DNS resolution failures on dynamic upstreams: 502s and resolver_timeout

What this means

Intermittent 502 Bad Gateway responses that only hit locations using a variable in proxy_pass, such as proxy_pass http://$backend;, point to dynamic DNS resolution failure. Static upstream locations are unaffected. dig from the host may succeed instantly while nginx logs show 502s with latency spikes clustering at exactly 30 seconds, the default resolver_timeout.

nginx resolves upstream hostnames through two paths.

A static directive such as proxy_pass http://backend.example.com; resolves once at configuration load via the OS resolver and caches the IP in worker memory. If the IP changes later, nginx continues routing to the stale address until you reload.

A variable directive such as proxy_pass http://$backend; triggers dynamic DNS resolution through nginx’s internal async resolver. The resolver does not block the worker’s event loop, but the request remains parked until the answer arrives or resolver_timeout expires. The default is 30 seconds. If the DNS server is slow, drops packets, or returns no record, the request hangs and then fails with 502. Because queries often succeed, failures look intermittent.

dig from the host may succeed while nginx fails. The OS resolver can use /etc/resolv.conf, local caches, or multiple nameservers with fallback. nginx uses only the IP addresses explicitly configured in the resolver directive. If that specific path is unreachable, congested, or filtered by firewall rules that affect the nginx worker but not your shell session, dig succeeds and nginx does not.
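
To see the two paths side by side, it can help to list the resolver addresses nginx is configured with next to the nameservers the OS uses. The following is a minimal sketch rather than a definitive tool. `resolver_ips` is a hypothetical helper name, and it assumes each `resolver` directive sits on a single line of `nginx -T` output and lists plain addresses without a `:port` suffix.

```shell
# resolver_ips: hypothetical helper that reads nginx configuration text on
# stdin and prints every address listed in a `resolver` directive.
# Assumes one directive per line; parameters such as valid= or ipv6= are skipped.
resolver_ips() {
  awk '$1 == "resolver" {
         for (i = 2; i <= NF; i++) {
           tok = $i
           sub(/;$/, "", tok)        # drop the trailing semicolon
           if (tok == "" || tok ~ /=/) continue
           print tok
         }
       }' | sort -u
}

# Usage (requires nginx and sufficient privileges):
#   nginx -T 2>/dev/null | resolver_ips        # addresses nginx queries
#   grep '^nameserver' /etc/resolv.conf        # addresses dig uses by default
```

If the two lists differ, a plain dig without @<resolver_ip> is testing a different DNS path than the one nginx depends on.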

flowchart LR
    Client -->|request| Worker[nginx worker]
    Worker -->|variable proxy_pass| Resolver[async resolver]
    Resolver -->|DNS query| DNS[DNS server]
    DNS -->|response| Resolver
    Resolver -->|IP| Worker
    Worker -->|proxy| Upstream[upstream app]
    DNS -.->|timeout / fail| Resolver
    Resolver -.->|resolver_timeout| Worker
    Worker -->|502| Client

Common causes

Cause	What it looks like	First thing to check
DNS server unreachable or slow from nginx	Random 502s clustered around 30s latency spikes; resolver errors in logs	`dig @<resolver_ip> <hostname>` from the nginx host
Missing `resolver` directive	All variable-proxy requests fail with 502; “no resolver defined to resolve <hostname>” in the error log	`nginx -T | grep resolver`
DNS cache expired with no `valid=` override	502 bursts after DNS TTL expires; healthy between bursts	Resolver config for `valid=` parameter vs DNS TTL
Upstream hostname deleted or typoed	Consistent 502s for one specific hostname	`dig` the exact hostname from the host
Variable proxy_pass bypassing upstream block features	Flaky behavior even when DNS works, because keepalive pools and health checks are unavailable	`nginx -T` review to see if an explicit upstream block is defined but unused

Quick checks

# Test DNS resolution using the same resolver IP as nginx
dig @<resolver_ip> <upstream_hostname>

# Inspect resolver and proxy_pass configuration (requires sufficient privileges)
nginx -T 2>/dev/null | grep -E '(resolver|proxy_pass)'

# Check error log for resolver errors ("could not be resolved" covers both NXDOMAIN and timeouts)
tail -n 1000 /var/log/nginx/error.log | grep -iE '(resolver|could not be resolved|host not found|no live upstreams)'

# Find URI paths that produce 502s (default combined log format)
awk '$9 == 502 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

# Check connection states via stub_status (assumes it is enabled at /nginx_status)
curl -s http://127.0.0.1/nginx_status

# Confirm outbound DNS socket state (run with appropriate privileges)
# Match port 53 exactly so ephemeral ports such as 53124 are not caught
ss -tunap | grep -E ':53([[:space:]]|$)'

How to diagnose it

Scope the 502s to variable proxy_pass locations. Parse access logs for 502 responses and compare the URIs to your configuration. If 502s only appear where proxy_pass contains a variable, you are looking at a dynamic resolution problem.
Check for the 30-second latency signature. If your access log includes $request_time, filter for 502s and inspect the duration. Values clustering near 30.000 indicate resolver_timeout is firing. Shorter values may indicate immediate DNS refusal.
Inspect the resolver configuration. Run nginx -T and locate the resolver directive. Note the IP addresses and any valid= parameter. If there is no resolver directive, nginx cannot resolve variable hostnames at runtime.
Test resolution manually from the host. Use dig pointed at the exact resolver IP configured in nginx. If nginx runs in a container, run the test inside the same network namespace. If dig succeeds but nginx still fails, suspect packet loss, firewall rules affecting the nginx process, or an expired cache window.
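Read the error log for resolver messages. “could not be resolved” (followed by “Host not found” or “Operation timed out”) and “no resolver defined” point to a DNS-level failure. “upstream timed out” points to a connect or read failure after resolution succeeded. Together these confirm whether the failure is DNS-level or upstream-connect-level.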
Correlate with DNS server health. If you operate the DNS infrastructure, check its query logs and latency metrics for the same time windows.
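
The first two steps can be sketched as a pair of small filters. This is an illustration under stated assumptions, not a drop-in tool: `variable_proxy_pass` and `slow_502_summary` are hypothetical helper names, and the second one assumes a combined log format with `$request_time` appended as the last field, which is not the nginx default.

```shell
# variable_proxy_pass: print proxy_pass directives whose target contains a
# variable, reading nginx configuration text (e.g. `nginx -T` output) on stdin.
variable_proxy_pass() {
  grep -E 'proxy_pass[[:space:]]+[^;]*\$' || true
}

# slow_502_summary: count 502 responses and how many of them took at least
# THRESHOLD seconds (default 29, just under the 30s resolver_timeout default).
# ASSUMPTION: combined log format with $request_time as the LAST field;
# adjust $9 and $NF if your log_format differs.
slow_502_summary() {
  threshold="${1:-29}"
  awk -v t="$threshold" '
    $9 == 502 { total++; if ($NF + 0 >= t) slow++ }
    END { printf "502s=%d slow=%d\n", total, slow }'
}

# Usage:
#   nginx -T 2>/dev/null | variable_proxy_pass
#   slow_502_summary 29 < /var/log/nginx/access.log
```

A high share of slow 502s suggests resolver_timeout is firing. Mostly fast 502s suggest immediate refusals or an upstream connect problem instead.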

Metrics and signals to monitor

Signal	Why it matters	Warning sign
502 rate on variable-proxy locations	Direct symptom of resolver or upstream connect failure	Sustained nonzero rate on paths using dynamic upstreams
`$request_time` latency spikes	DNS timeouts manifest as request latencies near `resolver_timeout`	P95 latency jumping to ~30s, or your configured timeout
Error log resolver entries	The only place nginx explicitly logs DNS failure	Any “host not found” or resolver timeout message
Active connections in Writing state	Requests waiting for resolution consume connection slots	Writing count elevated while request rate stays flat
DNS query latency from host	Measures the resolver path independently of nginx	Query time > 1s or packet loss to the configured resolver
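
Two of the signals above can be sampled by hand before any dashboards exist. The sketch below assumes the standard `dig` output line `;; Query time: N msec` and the standard stub_status text format. The helper names, the 10.0.0.2 resolver address, and the /nginx_status path are placeholders.

```shell
# dig_query_ms: extract the "Query time" value (milliseconds) from dig output.
dig_query_ms() {
  awk '/Query time:/ { print $4 }'
}

# stub_writing: extract the Writing count from stub_status output
# (the line looks like "Reading: 6 Writing: 179 Waiting: 106").
stub_writing() {
  awk '$1 == "Reading:" { print $4 }'
}

# Usage (10.0.0.2 and /nginx_status are placeholders):
#   dig +tries=1 +time=2 @10.0.0.2 backend.example.com | dig_query_ms
#   curl -s http://127.0.0.1/nginx_status | stub_writing
```

Sampling both in a loop during an incident shows whether Writing climbs while resolver query time degrades.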

Fixes

Fix DNS reachability. If the configured resolver is unreachable, update the resolver directive to point to a working DNS server. Run nginx -t to validate syntax, then reload. If the resolver is a container or cluster DNS service, verify it is routable from the nginx network namespace.

Add or tune valid=. If failures correlate with TTL expiration, set resolver <ip> valid=30s; to cache responses and reduce query volume. This overrides the DNS TTL. The tradeoff is that a long cache delays failover when an upstream IP changes, while a short cache increases query volume and exposure to DNS blips.

Lower resolver_timeout to fail faster. The default 30 seconds is too long for most production proxies. Set resolver_timeout 5s; so DNS failures return 502 quickly instead of keeping the client hanging. Do not rely on proxy_next_upstream here: a failed lookup ends the request with 502 before any upstream address is tried, so there is nothing to retry against.
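
The resolver settings above can be combined into one fragment. The sketch below prints such a fragment from a shell function so the values are easy to vary. `render_dynamic_proxy` is a hypothetical helper, and the /api/ location, the 10.0.0.2 resolver address, and the snippet path in the usage comment are illustrative assumptions. The output is meant to be included inside an existing server block.

```shell
# render_dynamic_proxy: print an nginx fragment for a dynamically resolved
# upstream. Arguments: resolver IP, valid= duration, resolver_timeout, hostname.
# The /api/ location is a placeholder; adapt it to your own routing.
render_dynamic_proxy() {
  resolver_ip="$1"; valid="$2"; timeout="$3"; backend="$4"
  cat <<EOF
resolver ${resolver_ip} valid=${valid};
resolver_timeout ${timeout};

location /api/ {
    set \$backend "${backend}";
    proxy_pass http://\$backend;
}
EOF
}

# Usage (the snippet path is hypothetical; validate before reloading):
#   render_dynamic_proxy 10.0.0.2 30s 5s backend.example.com \
#     > /etc/nginx/snippets/dynamic-proxy.conf
#   nginx -t && nginx -s reload
```

Keeping valid= and resolver_timeout next to the location that depends on them makes the tradeoff visible in review.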

Hardcode an emergency upstream IP. For critical paths, maintain a commented static upstream block with known good IPs. If DNS is completely broken, switch the location to the static upstream and reload. This sacrifices dynamic resolution for immediate availability.

Replace variable proxy_pass with static upstream blocks where possible. If the backend IP is stable, use a static proxy_pass to an explicit upstream block. This resolves DNS once at startup and restores upstream keepalive pooling and passive health checks. You must reload nginx to pick up IP changes.
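
For the static fallback, one option is to snapshot known-good addresses while DNS is healthy and keep the resulting upstream block with the runbook. The sketch below is one way to do that, not the only one. `render_static_upstream`, the `backend_static` name, port 8080, and the 10.0.0.2 resolver address are all hypothetical, and only IPv4 addresses are handled.

```shell
# render_static_upstream: read addresses on stdin (one per line) and print an
# nginx upstream block. Arguments: upstream name, backend port.
render_static_upstream() {
  name="$1"; port="$2"
  printf 'upstream %s {\n' "$name"
  while IFS= read -r ip; do
    case "$ip" in
      # Skip blank lines and anything with characters other than digits and
      # dots, such as CNAME targets or dig comment lines.
      ''|*[!0-9.]*) continue ;;
    esac
    printf '    server %s:%s;\n' "$ip" "$port"
  done
  printf '}\n'
}

# Usage (run while DNS is healthy; names and port are placeholders):
#   dig +short @10.0.0.2 backend.example.com A \
#     | render_static_upstream backend_static 8080
```

During an outage the affected location would switch to proxy_pass http://backend_static; followed by nginx -t and a reload.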

Prevention

Cache DNS responses with valid= set to a safe duration for your environment.
Monitor error logs for resolver keywords proactively.
Keep a runbook for switching dynamic locations to static upstream IPs during DNS outages.
Test resolver reachability from the nginx network namespace after any network or firewall changes.
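
The reachability test in the last item can be scripted so it runs after every network or firewall change. This is a hedged sketch: `check_resolvers` is a hypothetical helper, the addresses are placeholders, and the `DIG` variable exists only so the lookup command can be swapped out. Run it from the same network namespace as nginx.

```shell
# check_resolvers: query each resolver for one hostname and report ok/FAIL.
# Arguments: hostname, then one or more resolver IPs.
# Returns 0 only if every resolver returned at least one record.
check_resolvers() {
  host="$1"; shift
  rc=0
  for ip in "$@"; do
    # dig prints ";; ..." diagnostics on stdout when a server is unreachable,
    # so drop those lines before deciding whether an answer came back.
    answer=$("${DIG:-dig}" +short +time=2 +tries=1 "@$ip" "$host" 2>/dev/null \
               | grep -v '^;' || true)
    if [ -n "$answer" ]; then
      echo "ok   $ip $host"
    else
      echo "FAIL $ip $host"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (placeholder addresses):
#   check_resolvers backend.example.com 10.0.0.2 10.0.0.3 || echo "resolver problem"
```

The 2-second limit is deliberately shorter than any reasonable resolver_timeout, so a slow resolver is flagged before it causes client-visible 502s.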

How Netdata helps

Correlate 502 spikes with P95 request latency to spot the resolver_timeout signature.
Monitor error log rates for resolver keywords.
Track active connection states to see if Writing connections accumulate during DNS outages.
Alert on upstream response time anomalies for variable-proxy locations.

The Netdata solution

Web server monitoring with Netdata

Netdata monitors NGINX with per-second request, connection, and latency metrics plus ML anomaly detection. Correlate connection and file-descriptor exhaustion, upstream cascade failures, buffer spill, and TLS CPU with the host signals behind them.

