nginx connect() failed (111: Connection refused) while connecting to upstream

HTTP 502 Bad Gateway together with the error connect() failed (111: Connection refused) while connecting to upstream means nginx reached the upstream IP, but the TCP connection to the target port was actively refused. Either the backend is not running, it is not listening on the address and port nginx expects, or a firewall is rejecting the connection (for example with a TCP reset) rather than silently dropping it.

This is distinct from upstream timed out (110: Connection timed out). A timeout means the TCP SYN never received a response, usually because a firewall silently dropped the packet or the host is unreachable. Errno 111 means the network path is open but no process is accepting connections. The error log includes the upstream address, such as upstream: "fastcgi://127.0.0.1:9000". Read that line first to isolate the exact backend.
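To make the 111 versus 110 distinction concrete, here is a minimal sketch that tallies connect-phase errors by errno and upstream. The function name summarize_connect_errors is hypothetical, and it assumes the stock nginx error-log layout in which the upstream: "..." field follows the error text.

```shell
# Summarize connect-phase upstream errors by errno and peer.
# Reads nginx error-log lines on stdin and prints "<count> <errno> <upstream>".
summarize_connect_errors() {
    # Keep only 110/111 errors raised while connecting, then pull out
    # the errno and the quoted upstream address.
    sed -n 's/.*(\(11[01]\): [^)]*) while connecting to upstream.*upstream: "\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c |
        awk '{ print $1, $2, $3 }' |
        sort -k1,1nr -k2,2n
}
```

Typical use would be summarize_connect_errors < /var/log/nginx/error.log. A peer that shows only 111 points at a closed port; a peer that shows 110 points at dropped packets or an unreachable host.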

If you have multiple upstream servers, stock nginx retries the request on the next peer because connect failures map to the error condition in proxy_next_upstream, which is in the default retry set. With a single upstream, or if all servers are already marked unavailable, the client gets an immediate 502.

What this means

When nginx proxies a request, its connect() call fails with ECONNREFUSED (errno 111) if the target IP is reachable but nothing is listening on that port. The client receives HTTP 502.

The error log contains the upstream address, for example upstream: "http://10.0.1.10:8080". Use this to identify the exact peer that failed.

Stock nginx uses passive health checks with defaults of max_fails=1 and fail_timeout=10s. A single 111 counts as one failure. Once a server accumulates max_fails failures within a fail_timeout window, nginx marks it unavailable for the next fail_timeout period and stops sending traffic to it until that period expires. With a single upstream server these parameters are ignored, so nginx keeps attempting connections regardless of recent failures. Connect failures map to the error condition, which is in the default proxy_next_upstream set, so nginx retries on the next peer when multiple servers are configured.

flowchart TD
    A[Error 111 in nginx log] --> B{Backend process running?}
    B -->|No| C[Restart backend and check for OOM or crash]
    B -->|Yes| D{Listening on expected interface and port?}
    D -->|No| E[Fix bind address or upstream port in config]
    D -->|Yes| F{Reachable from nginx host?}
    F -->|No| G[Check firewall rules or container network]
    F -->|Yes| H{nginx in a container with a loopback upstream?}
    H -->|Yes| I[Replace with service name or reachable IP]
    H -->|No| J{Upstream is localhost and backend binds IPv4 only?}
    J -->|Yes| K[Use explicit 127.0.0.1 instead of localhost]
    J -->|No| L[Review max_fails and fail_timeout state]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Backend process stopped or crashed | 502s start immediately after a deploy, restart, or OOM kill; error log points to the expected backend address | Process status on the backend host: ps aux or container runtime status |
| Backend listening on wrong interface or port | Backend process is running, but a direct connection from the nginx host still fails with a refusal | ss -tlnp on the backend host to verify bound IP and port |
| Container localhost mismatch | nginx runs inside a container and uses 127.0.0.1 or localhost as upstream, reaching its own loopback instead of the host or another container | Container network mode and DNS or service names |
| Firewall or host ACL rejecting the port | Backend is healthy and listening, but connections from the nginx host are refused (a rule that silently drops packets would produce a 110 timeout instead) | iptables -L or firewall-cmd --list-all, then test with nc or bash /dev/tcp from the nginx host |
| IPv6/IPv4 localhost mismatch | Upstream configured as localhost; backend binds only to IPv4 (127.0.0.1) while nginx may try [::1] | Use explicit 127.0.0.1 in the upstream definition instead of localhost |
| Port or upstream definition mismatch | nginx points to port 9000 but the backend listens on 3000, or a typo exists in the upstream block | nginx -T output for the upstream server directive |

Quick checks

Run these read-only checks first.

# Check recent upstream connection errors
tail -1000 /var/log/nginx/error.log | grep -E "connect\(\) failed.*111"

# Validate nginx configuration syntax
nginx -t

# Verify backend process is running (example: php-fpm)
ps aux | grep php-fpm | grep -v grep

# Verify listening sockets on the backend host
ss -tlnp

# Test direct TCP connectivity from the nginx host
timeout 2 bash -c "echo > /dev/tcp/<backend-host>/<port>" 2>/dev/null && echo "UP" || echo "DOWN"

# Identify which upstream server failed from error logs
grep "connect() failed (111" /var/log/nginx/error.log | grep -oP 'upstream: "\K[^"]+' | sort | uniq -c | sort -rn
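The direct TCP test above prints DOWN for both a refusal and a timeout, which are different problems. The following sketch separates them. The function names probe_tcp and classify_probe are hypothetical; the sketch assumes bash and the coreutils timeout command are available on the nginx host, and it relies on bash's English error text.

```shell
# Map the exit status and stderr text of a /dev/tcp probe to a verdict.
classify_probe() {
    rc=$1
    message=$2
    case "$rc" in
        0)   echo "open" ;;
        124) echo "timeout" ;;   # exit status timeout(1) uses when it kills the command
        *)   case "$message" in
                 *"Connection refused"*) echo "refused" ;;
                 *)                      echo "error" ;;
             esac ;;
    esac
}

# Probe host:port and print one of: open, refused, timeout, error.
probe_tcp() {
    host=$1
    port=$2
    # LC_ALL=C keeps bash's error message in English so it can be matched.
    message=$(LC_ALL=C timeout 2 bash -c 'echo > "/dev/tcp/$0/$1"' "$host" "$port" 2>&1)
    classify_probe "$?" "$message"
}
```

Run it as probe_tcp <backend-host> <port> from the nginx host. A result of refused matches errno 111; timeout matches errno 110.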

How to diagnose it

  1. Read the error log line. Extract the upstream address from the upstream: field. Confirm the timestamp correlates with the 502 responses.

  2. Verify backend process health. On the backend host, check that the application process is running. In container environments, check container or pod status. Look for recent OOM kills or crash loops.

  3. Verify the listening socket. Run ss -tlnp on the backend host. Look for the expected port in the Local Address:Port column. If the backend is bound to 127.0.0.1:8080 but nginx connects from another host or container, that explains the refusal. If the port does not appear at all, the backend process failed to start or bound to a different port.

  4. Test direct connectivity from the nginx host. Use nc -zv <host> <port> or bash -c 'echo > /dev/tcp/<host>/<port>' from the nginx host or container. If this fails with connection refused, the network path is clear but the port is closed.

  5. Check container networking. If nginx and the backend are in separate Docker containers or Kubernetes pods, ensure they share a network or use the correct service DNS name. Never use localhost or 127.0.0.1 to reach another container. In Docker Compose, use the service name; in Kubernetes, use the Service DNS name (for a headless Service, that name resolves directly to the pod IPs).

  6. Check firewall rules. If the backend is listening and the port is correct, verify that host firewalls or cloud security groups allow traffic from the nginx host to the backend port.

  7. Check for IPv6/IPv4 mismatch. If the upstream is defined as localhost, change it to 127.0.0.1 to rule out [::1] vs 127.0.0.1 binding mismatches.

  8. Review passive health check state. If you have multiple upstream servers, check whether the failing server was temporarily marked down. Look for no live upstreams or repeated 111 errors to the same peer. With a single upstream, nginx continues sending traffic to a dead backend.

  9. Correlate with access log timing. If your access log format includes $upstream_connect_time, a value of - or a sudden spike indicates connection establishment failures. Compare $upstream_response_time and $request_time to confirm the delay is at connect time, not during response transfer.
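For step 9, a rough way to count requests whose upstream connect never completed is sketched below. It assumes a hypothetical log_format that writes the field as uct="$upstream_connect_time"; the field name uct and the function name count_failed_connects are illustrative and not part of stock nginx.

```shell
# Count requests where at least one upstream connect did not complete.
# Assumes each access-log line contains: uct="$upstream_connect_time"
count_failed_connects() {
    awk '
        match($0, /uct="[^"]*"/) {
            total++
            # Strip the leading uct=" (5 chars) and the closing quote.
            value = substr($0, RSTART + 5, RLENGTH - 6)
            # "-" marks a peer whose connect did not complete; when several
            # peers were tried for one request, values are comma-separated.
            n = split(value, parts, /, */)
            for (i = 1; i <= n; i++) {
                if (parts[i] == "-") { failed++; break }
            }
        }
        END { printf "%d of %d requests had a failed upstream connect\n", failed, total }
    '
}
```

A line with uct="-, 0.002" is counted as failed even if the request ultimately succeeded, because the first peer was tried and did not connect before nginx moved on to the next one.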

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| HTTP 502 rate | Direct user impact from refused upstream connections | Sustained nonzero rate, or a spike after deploys or restarts |
| $upstream_connect_time | Isolates connection establishment from response time; 111 errors appear as failures or spikes | Values of -, or P95 > 100 ms in a local datacenter |
| $upstream_response_time | Reveals if remaining healthy backends are overloaded after peers drop | P95 trending up after upstream failures |
| Error log connect() failed (111) | Exact symptom count and backend identification | Any sustained rate > 0 |
| Active connections in Writing state | Connections pile up waiting for backends, accelerating exhaustion | Writing sustained above 60% of active connections |
| accepts - handled gap | Detects connection drops if upstream failures cascade to saturation | Gap increasing for > 60 seconds |
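The last two signals can be read from the stub_status page. The sketch below parses that page's plain-text output and reports the accepts minus handled gap and the share of active connections in the Writing state. The function name stub_status_summary is hypothetical, and the status URL in the usage note is an assumption about your configuration.

```shell
# Summarize stub_status output read from stdin.
stub_status_summary() {
    awk '
        /^Active connections:/ { active = $3 }
        # The counters line holds three numbers: accepts, handled, requests.
        /^ *[0-9]+ +[0-9]+ +[0-9]+ *$/ { accepts = $1; handled = $2 }
        /^Reading:/ { writing = $4 }
        END {
            pct = (active > 0) ? 100 * writing / active : 0
            printf "gap=%d writing_pct=%.0f\n", accepts - handled, pct
        }
    '
}
```

For example, curl -s http://127.0.0.1/nginx_status | stub_status_summary, assuming a stub_status location is exposed at that path. Sample it periodically; the table's warning signs concern sustained values, not a single reading.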

Fixes

Backend process down

Restart the application process. Check backend logs for OOM kills, segfaults, or CrashLoopBackOff events immediately before the 111 errors started. If the backend is managed by systemd, check systemctl status <service> and journalctl -u <service> -n 50. If the backend uses systemd socket activation, verify the socket unit is listening with systemctl status <service>.socket and that the service unit started.
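To check for OOM kills quickly, the kernel log can be filtered for killed processes. This is a small sketch; the function name oom_victims is hypothetical, and it assumes the kernel message contains the phrase "Killed process <pid> (<name>)", which is what current Linux kernels log for both system-wide and cgroup OOM kills.

```shell
# List processes terminated by the kernel OOM killer, with a count per name.
# Reads kernel log lines on stdin and prints "<name> <count>".
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c |
        sed 's/^ *\([0-9][0-9]*\) \(.*\)$/\2 \1/'
}
```

Feed it from journalctl -k or dmesg, for example journalctl -k --since "1 hour ago" | oom_victims. If the backend's process name appears, compare the kill timestamps in the kernel log with the first 111 errors in the nginx log.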

Wrong bind address or port mismatch

Reconfigure the backend to bind to 0.0.0.0:<port> or the specific interface nginx uses. If the backend intentionally binds to 127.0.0.1 for security, nginx must run on the same host and in the same network namespace. A Unix domain socket is an alternative for same-host setups: ensure the socket is writable by the nginx worker process user and that the path matches the upstream block exactly. Update the nginx upstream definition if the backend port changed.
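Whether a port is bound to loopback only can be read from ss output. The sketch below classifies one port; the function name bind_scope is hypothetical, and it assumes the usual ss -tln column order in which the fourth column is Local Address:Port.

```shell
# Report how a TCP port is bound, given `ss -tln` output on stdin.
# Prints one of: non-loopback, loopback only, not listening.
bind_scope() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Column 4 is Local Address:Port; the port follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] != port) next
            addr = substr($4, 1, length($4) - length(parts[n]) - 1)
            if (addr ~ /^127\./ || addr == "[::1]") loopback = 1
            else other = 1
        }
        END {
            if (other)         print "non-loopback"
            else if (loopback) print "loopback only"
            else               print "not listening"
        }
    '
}
```

On the backend host, ss -tln | bind_scope 8080 printing loopback only means a connection from another host or container to that port will be refused even though the process is healthy.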

Container networking mismatch

Replace localhost and 127.0.0.1 in upstream definitions with the Docker Compose service name, Kubernetes service DNS name, or the container bridge IP. Verify both containers share the same network or can route to each other.
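To find the definitions that need replacing, the full configuration dump can be scanned for loopback targets. This is a sketch only: the function name find_loopback_upstreams is hypothetical, and the directive list covers server, proxy_pass, fastcgi_pass, uwsgi_pass, and grpc_pass but not every pass-style directive.

```shell
# Print lines of an nginx config dump whose upstream target is a loopback address.
# Reads the dump on stdin; output is prefixed with the line number in the dump.
find_loopback_upstreams() {
    grep -nE '(^|[[:space:];{])(server|proxy_pass|fastcgi_pass|uwsgi_pass|grpc_pass)[[:space:]]+([a-z]+://)?(localhost|127\.[0-9.]+|\[::1\])([:/;[:space:]]|$)'
}
```

Use it as nginx -T 2>/dev/null | find_loopback_upstreams. The line numbers refer to the combined dump, not to the individual source files, and a match is only a problem when the backend does not share nginx's network namespace.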

Firewall or security group blocking

Add an allow rule for the nginx host to reach the backend port. Verify with nc or /dev/tcp from the nginx host after applying the change.
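Before adding a rule, it can help to confirm that an existing rule is rejecting the port, since a rule that drops packets would surface as a 110 timeout instead of 111. The sketch below filters iptables -S output for REJECT rules on a single destination port. The function name reject_rules_for_port is hypothetical; the sketch assumes iptables-style rules and does not cover multiport (--dports) matches, nftables-native rule sets, or cloud security groups.

```shell
# Print iptables rules that REJECT traffic to the given destination port.
# Reads `iptables -S` output on stdin.
reject_rules_for_port() {
    grep -E -- "--dport $1( |\$)" | grep -E -- '-j REJECT( |$)'
}
```

On the backend host, iptables -S | reject_rules_for_port 8080 printing a rule means the firewall is a candidate cause. No output means any refusal is coming from somewhere else, most likely a closed port.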

IPv6/IPv4 mismatch

Change upstream definitions from localhost to the explicit IPv4 or IPv6 address that matches the backend bind configuration.

Passive health check flapping

If transient 111 errors occur during rolling updates or brief restarts, consider increasing max_fails from the default of 1 to 3, or raising fail_timeout. Tradeoff: this delays detection of genuine failures. Only tune this if you observe false positives correlated with normal deployment behavior.
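One way to judge whether refusals line up with deployments is to bucket them per minute and compare the buckets with your deploy timestamps. The sketch below does that; the function name refusals_per_minute is hypothetical, and it assumes the stock error-log timestamp format YYYY/MM/DD HH:MM:SS at the start of each line.

```shell
# Count 111 errors per minute and upstream.
# Reads nginx error-log lines on stdin and prints "<date> <HH:MM> <upstream> <count>".
refusals_per_minute() {
    sed -n 's/^\([0-9/]* [0-9]*:[0-9]*\):[0-9]* .*(111: [^)]*) while connecting to upstream.*upstream: "\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c |
        awk '{ print $2, $3, $4, $1 }'
}
```

Short bursts confined to deploy windows suggest the checks are reacting to normal restarts. Errors spread across unrelated minutes point to a genuine backend problem that tuning max_fails would only hide.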

Prevention

Expose a lightweight health check endpoint on every backend so external monitors detect absence before user traffic is affected. Avoid localhost in upstream definitions; use explicit IPs or service names. Manage firewall rules as code and audit them after infrastructure changes. Ensure container orchestration uses shared networking or proper service discovery rather than loopback assumptions. Size upstream pools so that the loss of one backend does not overload the remainder.

How Netdata helps

  • Correlate HTTP 502 spikes with error log entries for connect() failed (111) to confirm the pattern.
  • Plot $upstream_connect_time to isolate connection refusal from backend slowness.
  • Track active connection states (Reading, Writing, Waiting) to detect upstream failure pile-up.
  • Alert on the accepts - handled gap to catch connection exhaustion before it cascades.
  • Monitor worker process count and file descriptor utilization to rule out nginx-side saturation that mimics upstream problems.