NGINX DNS resolution failures on dynamic upstreams: 502s and resolver_timeout

What this means

Intermittent 502 Bad Gateway responses that only hit locations using a variable in proxy_pass, such as proxy_pass http://$backend;, point to dynamic DNS resolution failure. Static upstream locations are unaffected. dig from the host may succeed instantly while nginx logs show 502s with latency spikes clustering at exactly 30 seconds, the default resolver_timeout.

nginx resolves upstream hostnames through two paths.

A static directive such as proxy_pass http://backend.example.com; resolves once at configuration load via the OS resolver and caches the IP in worker memory. If the IP changes later, nginx continues routing to the stale address until you reload.

A variable directive such as proxy_pass http://$backend; triggers dynamic DNS resolution through nginx’s internal async resolver. The resolver does not block the worker’s event loop, but the request remains parked until the answer arrives or resolver_timeout expires. The default is 30 seconds. If the DNS server is slow, drops packets, or returns no record, the request hangs and then fails with 502. Because queries often succeed, failures look intermittent.
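The two paths can be put side by side in a small configuration sketch. The resolver address 10.0.0.2, the hostname, and the location prefixes are placeholders chosen for illustration, not values from any particular deployment.

```nginx
# Sketch only: 10.0.0.2, backend.example.com, and the paths are placeholders.
http {
    # Consulted only for names resolved at request time (path 2 below).
    resolver 10.0.0.2;

    server {
        listen 80;

        # Path 1: literal hostname, resolved once when the configuration loads.
        # nginx keeps using that address until the next reload.
        location /static-upstream/ {
            proxy_pass http://backend.example.com;
        }

        # Path 2: variable in proxy_pass, resolved per request through the
        # resolver above and bounded by resolver_timeout (default 30s).
        location /dynamic-upstream/ {
            set $backend backend.example.com;
            proxy_pass http://$backend;
        }
    }
}
```

Only the second location depends on the resolver directive, which is why failures on that DNS path affect it and leave the first location untouched.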

dig from the host may succeed while nginx fails. The OS resolver can use /etc/resolv.conf, local caches, or multiple nameservers with fallback. nginx uses only the IP addresses explicitly configured in the resolver directive. If that specific path is unreachable, congested, or filtered by firewall rules that affect the nginx worker but not your shell session, dig succeeds and nginx does not.

flowchart LR
    Client -->|request| Worker[nginx worker]
    Worker -->|variable proxy_pass| Resolver[async resolver]
    Resolver -->|DNS query| DNS[DNS server]
    DNS -->|response| Resolver
    Resolver -->|IP| Worker
    Worker -->|proxy| Upstream[upstream app]
    DNS -.->|timeout / fail| Resolver
    Resolver -.->|resolver_timeout| Worker
    Worker -->|502| Client

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| DNS server unreachable or slow from nginx | Random 502s clustered around 30s latency spikes; resolver errors in logs | dig @<resolver_ip> <hostname> from the nginx host |
| Missing resolver directive | All variable-proxy requests fail; “no resolver defined” errors in logs | Search the output of nginx -T for resolver |
| DNS cache expired with no valid= override | 502 bursts after DNS TTL expires; healthy between bursts | Resolver config for valid= parameter vs DNS TTL |
| Upstream hostname deleted or typoed | Consistent 502s for one specific hostname | dig the exact hostname from the host |
| Variable proxy_pass bypassing upstream block features | Flaky behavior even when DNS works, because keepalive pools and health checks are unavailable | nginx -T review to see if an explicit upstream block is defined but unused |

Quick checks

# Test DNS resolution using the same resolver IP as nginx
dig @<resolver_ip> <upstream_hostname>

# Inspect resolver and proxy_pass configuration (requires sufficient privileges)
nginx -T 2>/dev/null | grep -E '(resolver|proxy_pass)'

# Check error log for resolver errors
tail -n 1000 /var/log/nginx/error.log | grep -iE '(resolver|could not be resolved|host not found)'

# Find URI paths that produce 502s (default combined log format)
awk '$9 == 502 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

# Verify workers are healthy (assumes stub_status is exposed at /nginx_status)
curl -s http://127.0.0.1/nginx_status

# Confirm outbound DNS socket state (run with appropriate privileges)
ss -tunap | grep ':53'
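The nginx_status check only works if a stub_status endpoint exists at that path. A minimal sketch of such an endpoint, assuming nginx was built with ngx_http_stub_status_module and that local-only access is acceptable; the path and listen address are assumptions.

```nginx
# Sketch only: the /nginx_status path and loopback listener are assumptions.
server {
    listen 127.0.0.1:80;

    location = /nginx_status {
        stub_status;      # reports Active, Reading, Writing, and Waiting counts
        allow 127.0.0.1;  # local access only
        deny all;
    }
}
```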

How to diagnose it

  1. Scope the 502s to variable proxy_pass locations. Parse access logs for 502 responses and compare the URIs to your configuration. If 502s only appear where proxy_pass contains a variable, you are looking at a dynamic resolution problem.
  2. Check for the 30-second latency signature. If your access log includes $request_time, filter for 502s and inspect the duration. Values clustering near 30.000 indicate resolver_timeout is firing. Shorter values may indicate immediate DNS refusal.
  3. Inspect the resolver configuration. Run nginx -T and locate the resolver directive. Note the IP addresses and any valid= parameter. If there is no resolver directive, nginx cannot resolve variable hostnames at runtime.
  4. Test resolution manually from the host. Use dig pointed at the exact resolver IP configured in nginx. If nginx runs in a container, run the test inside the same network namespace. If dig succeeds but nginx still fails, suspect packet loss, firewall rules affecting the nginx process, or an expired cache window.
  5. Read the error log for resolver messages. Entries such as “could not be resolved”, “host not found”, or “no resolver defined” point to a DNS-level failure, while “upstream timed out” or “connect() failed” point to an upstream-connect-level failure after resolution succeeded.
  6. Correlate with DNS server health. If you operate the DNS infrastructure, check its query logs and latency metrics for the same time windows.
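Step 2 depends on $request_time being present in the access log, and the default combined format does not include it. One possible sketch appends timing fields to the combined layout, so the status and URI stay in the same whitespace-separated positions. The format name timing and the rt=, uct=, urt=, and ua= labels are assumptions made for this example.

```nginx
# Sketch only: "timing" and the field labels are hypothetical names.
# The first two lines reproduce the built-in combined format unchanged.
log_format timing '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" "$http_user_agent" '
                  'rt=$request_time uct=$upstream_connect_time '
                  'urt=$upstream_response_time ua=$upstream_addr';

access_log /var/log/nginx/access.log timing;
```

With this in place, a 502 whose rt= value sits near 30.000 matches the resolver_timeout signature, while a much smaller value suggests the failure came back quickly.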

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| 502 rate on variable-proxy locations | Direct symptom of resolver or upstream connect failure | Sustained nonzero rate on paths using dynamic upstreams |
| $request_time latency spikes | DNS timeouts manifest as request latencies near resolver_timeout | P95 latency jumping to ~30s, or your configured timeout |
| Error log resolver entries | The only place nginx explicitly logs DNS failure | Any “host not found” or resolver timeout message |
| Active connections in Writing state | Requests waiting for resolution consume connection slots | Writing count elevated while request rate stays flat |
| DNS query latency from host | Measures the resolver path independently of nginx | Query time > 1s or packet loss to the configured resolver |

Fixes

Fix DNS reachability. If the configured resolver is unreachable, update the resolver directive to point to a working DNS server. Run nginx -t to validate syntax, then reload. If the resolver is a container or cluster DNS service, verify it is routable from the nginx network namespace.

Add or tune valid=. If failures correlate with TTL expiration, set resolver <ip> valid=30s; to cache responses and reduce query volume. This overrides the DNS TTL. The tradeoff is that a long cache delays failover when an upstream IP changes, while a short cache increases query volume and exposure to DNS blips.

Lower resolver_timeout to fail faster. The default 30 seconds is too long for most production proxies. Set resolver_timeout 5s; so DNS failures return 502 quickly instead of holding the client for the full 30 seconds. Do not rely on proxy_next_upstream here: it governs retries after a connection or response failure from a resolved peer, and a failed name lookup ends the request before any peer is selected.
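The valid= and resolver_timeout settings can sit together next to the dynamic location they protect. A minimal sketch, assuming a resolver at 10.0.0.2 and a backend named backend.example.com; both are placeholders, and the 30s and 5s values are the examples used in the text rather than recommendations for every environment.

```nginx
# Sketch only: the resolver IP, hostname, and durations are example values.
server {
    listen 80;

    # Cache answers for 30s regardless of the record's TTL.
    resolver 10.0.0.2 valid=30s;

    # Give up on an unanswered lookup after 5s instead of the default 30s.
    resolver_timeout 5s;

    location / {
        set $backend backend.example.com;
        proxy_pass http://$backend;
    }
}
```

Validate with nginx -t before reloading, as with any resolver change.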

Hardcode an emergency upstream IP. For critical paths, maintain a commented static upstream block with known good IPs. If DNS is completely broken, switch the location to the static upstream and reload. This sacrifices dynamic resolution for immediate availability.
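One way to keep that fallback ready is sketched below. The upstream name backend_emergency, the 192.0.2.x addresses (a documentation range), and the /api/ prefix are all hypothetical; substitute known-good backend IPs.

```nginx
# Sketch only: names, addresses, and paths are hypothetical.
upstream backend_emergency {
    server 192.0.2.10;
    server 192.0.2.11;
}

server {
    listen 80;
    resolver 10.0.0.2;   # used by the dynamic proxy_pass below

    location /api/ {
        # Normal operation: dynamic resolution.
        set $backend backend.example.com;
        proxy_pass http://$backend;

        # DNS outage: comment out the two lines above,
        # uncomment the line below, then reload.
        # proxy_pass http://backend_emergency;
    }
}
```

Leaving the upstream block defined but unused keeps the switch to a two-line change during an incident.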

Replace variable proxy_pass with static upstream blocks where possible. If the backend IP is stable, use a static proxy_pass to an explicit upstream block. This resolves DNS once at startup and restores upstream keepalive pooling and passive health checks. You must reload nginx to pick up IP changes.
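A static upstream block with keepalive and passive failure detection might look like the following sketch. The upstream name backend_static, the hostname, and the numeric values are assumptions for illustration.

```nginx
# Sketch only: the upstream name, hostname, and tuning values are assumptions.
upstream backend_static {
    # The hostname is resolved when the configuration loads; reload to refresh it.
    # max_fails and fail_timeout drive passive health checking.
    server backend.example.com max_fails=3 fail_timeout=10s;

    # Idle connections kept open to the upstream, per worker.
    keepalive 16;
}

server {
    listen 80;

    location /api/ {
        # Both settings are needed for upstream keepalive to take effect.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_pass http://backend_static;
    }
}
```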

Prevention

  • Cache DNS responses with valid= set to a safe duration for your environment.
  • Monitor error logs for resolver keywords proactively.
  • Keep a runbook for switching dynamic locations to static upstream IPs during DNS outages.
  • Test resolver reachability from the nginx network namespace after any network or firewall changes.

How Netdata helps

  • Correlate 502 spikes with P95 request latency to spot the resolver_timeout signature.
  • Monitor error log rates for resolver keywords.
  • Track active connection states to see if Writing connections accumulate during DNS outages.
  • Alert on upstream response time anomalies for variable-proxy locations.