nginx recv() failed (104: Connection reset by peer) while reading from upstream

You tail the nginx error log and see this:

[error] ... recv() failed (104: Connection reset by peer) while reading response header from upstream

The client gets a 502 Bad Gateway. The error is not a configuration syntax problem, and the service was working five minutes ago. This message means the upstream server sent a TCP RST while nginx was mid-read on an upstream connection. The reset originates from the backend or from a network middlebox between nginx and the backend. It does not come from nginx itself, and it does not come from the client.

This article covers the upstream-side variant. If your log says “while reading client request body,” the reset came from the client and requires a different diagnosis. Here we focus on why backends drop nginx mid-request, how to distinguish the six common root causes, and what to fix.

What this means

errno 104 is ECONNRESET on Linux. In the upstream context, it means the peer sent a TCP RST. nginx logs this while actively reading from the upstream connection, usually during a proxy_pass transaction. The request typically dies with a 502 response to the client.

This error is distinct from “upstream prematurely closed connection,” which nginx logs when the backend closes the connection cleanly (a TCP FIN, so the read returns end-of-file) before it has sent a complete response. The recv() 104 error means the connection was torn down with an RST instead. That happens when the backend process dies or aborts the socket mid-response, or when nginx sends a request on a pooled keepalive connection that the backend or a middlebox has already dropped.

flowchart TD
    A[nginx recv 104 from upstream] --> B{Backend crash or OOM?}
    B -->|Yes| C[Check dmesg and app logs]
    B -->|No| D{Idle timeout race?}
    D -->|Yes| E[Align keepalive and LB timeouts]
    D -->|No| F{Large responses only?}
    F -->|Yes| G[Check path MTU]
    F -->|No| H[Verify upstream health and proxy_next_upstream]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Backend crash or restart mid-request | 502 spikes correlate with deploys, restarts, or application panics | Backend process uptime and application error logs |
| Backend OOM kill or worker termination | 502s appear suddenly under memory pressure; backend PID changes | dmesg or kernel logs for OOM killer activity |
| Keepalive idle timeout race | Intermittent 502s on low-traffic backends; connect time of 0.000 in access log | Backend keepalive_timeout versus nginx upstream keepalive timeout |
| Middlebox idle timeout (LB, firewall, NAT) | Errors appear after a fixed idle interval, e.g., 60 or 350 seconds | Load balancer or proxy idle timeout settings between nginx and backend |
| Upstream keepalive reuse of closed connection | 502s after a burst followed by an idle period; backend closed the connection without nginx knowing | Backend connection logs and $upstream_connect_time distribution |
| MTU / PMTU black hole | Consistent failures for large responses only; small requests and health checks succeed | Network path MTU between nginx and backend |

Quick checks

# Count of recv() 104 errors in the current error log (whole file, not time-bounded)
grep -Fc "recv() failed (104" /var/log/nginx/error.log

# Backend uptime by PID
ps -o pid,etime,command -p <BACKEND_PID>

# Recent OOM kills
dmesg -T | grep -i "killed process" | tail -10

# 502 responses (assumes the combined log format, where field 9 is the status).
# Upstream timing is shown only if your log_format includes it.
awk '$9 == 502' /var/log/nginx/access.log | tail -20

# Upstream connect time: 0.000 means keepalive reuse.
# Only works if your log_format includes upstream_connect_time=<value>
grep -oP 'upstream_connect_time=\K[0-9.]+' /var/log/nginx/access.log | tail -100 | sort -n | uniq -c

# Direct TCP check from nginx host
timeout 2 bash -c "echo > /dev/tcp/<BACKEND_IP>/<BACKEND_PORT>" && echo "up" || echo "down"

# Active connection states (requires stub_status on localhost)
curl -s http://127.0.0.1/nginx_status

# Upstream keepalive configuration
nginx -T 2>/dev/null | grep -A5 -B5 "keepalive"
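
The first command above counts every match in the file. To bound the count to a time window, a small helper can compare the timestamp at the start of each error-log line against a cutoff. This is a minimal sketch: count_104_since is a hypothetical helper name, and it assumes the default error-log timestamp layout (YYYY/MM/DD HH:MM:SS), which sorts correctly as plain text.

```shell
# Count recv() 104 errors logged at or after a cutoff timestamp.
# Usage: count_104_since "YYYY/MM/DD HH:MM:SS" /path/to/error.log
# (count_104_since is a hypothetical helper, not an nginx tool.)
count_104_since() {
    since=$1
    logfile=$2
    awk -v since="$since" '
        # Fields 1 and 2 are the date and time. String comparison is enough
        # because the format is zero-padded and most-significant-first.
        ($1 " " $2) >= since && index($0, "recv() failed (104") { n++ }
        END { print n + 0 }
    ' "$logfile"
}

# Example (GNU date syntax; adjust for BSD/macOS):
# count_104_since "$(date -d '1 hour ago' '+%Y/%m/%d %H:%M:%S')" /var/log/nginx/error.log
```

Both the error log and date use local time by default, so the cutoff and the log lines should be directly comparable on the same host.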

How to diagnose it

  1. Confirm the reset is upstream-side. The phrases “while reading from upstream” or “while reading response header from upstream” confirm the RST came from the backend direction, not the client.
  2. Correlate timestamps. Match nginx error log entries with backend application logs, systemd journal, or container restart events. A backend crash usually produces an exact timestamp match.
  3. Check for OOM kills. Run dmesg -T | grep -i oom to see if the backend worker was killed by the kernel at the same time the 502s appeared.
  4. Inspect access log timing. You must log $upstream_connect_time, $upstream_header_time, and $upstream_response_time. If $upstream_connect_time is 0.000, the connection was reused from the keepalive pool. A 0.000 connect time followed by a 502 strongly suggests the backend closed the idle connection before nginx used it, or the backend died while the connection was pooled.
  5. Identify the specific backend. Log $upstream_addr to see which server in the upstream group sent the RST. If one backend accounts for all errors, it is the culprit.
  6. Test for idle timeout races. If errors occur only after periods of low traffic, compare the backend’s keepalive timeout with nginx’s keepalive_timeout in the upstream block (available since nginx 1.15.3, default 60 seconds). The backend’s idle timeout should be the longer of the two, so that nginx closes idle connections first.
  7. Check network middleboxes. If there is a cloud load balancer, firewall, or NAT gateway between nginx and the backend, verify its idle connection timeout. Many cloud LBs default to 60 seconds. If nginx’s keepalive idle time exceeds this, the LB may send an RST to nginx when it attempts reuse.
  8. Rule out MTU issues. If large responses fail consistently but small health checks pass, test path MTU between nginx and the backend.
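
Steps 4 and 5 can be combined into one pass over the access log that groups 502s by backend and by whether the connection was reused. The sketch below rests on assumptions: summarize_502s is a hypothetical helper, and the log format is assumed to be the stock combined format followed by upstream_addr= and upstream_connect_time= key=value fields. When nginx tries several servers for one request, these variables hold comma-separated lists; the helper reads only the first value.

```shell
# Assumed log_format (hypothetical name "upstream_timing"): the stock
# "combined" fields followed by key=value upstream fields, for example
#   ... "$http_user_agent" upstream_addr=$upstream_addr
#       upstream_connect_time=$upstream_connect_time
#
# Print "<count> <upstream_addr> <reused|new|unknown>" for 502 responses,
# most frequent first.
summarize_502s() {
    awk '
    $9 == 502 {
        addr = "-"; ct = "-"
        for (i = 10; i <= NF; i++) {
            if (index($i, "upstream_addr=") == 1)
                addr = substr($i, length("upstream_addr=") + 1)
            else if (index($i, "upstream_connect_time=") == 1)
                ct = substr($i, length("upstream_connect_time=") + 1)
        }
        # Drop the trailing comma left by multi-server (retried) requests.
        sub(/,$/, "", addr); sub(/,$/, "", ct)
        if (ct == "0.000")   kind = "reused"    # taken from the keepalive pool
        else if (ct == "-")  kind = "unknown"   # no connect time recorded
        else                 kind = "new"       # fresh TCP connection
        count[addr " " kind]++
    }
    END { for (k in count) print count[k], k }
    ' "$1" | sort -k1,1nr -k2
}

# Example: summarize_502s /var/log/nginx/access.log
```

If one address dominates the output, suspect that backend. If most rows are "reused", suspect an idle timeout race rather than a crash.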

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| HTTP 502 rate | recv() 104 errors usually surface as 502 Bad Gateway | Sustained 502 rate >1% of requests |
| Upstream connect time | 0.000 indicates keepalive reuse; spikes mean new TCP setup | Sudden absence of 0.000 values correlating with 502s |
| Upstream response time | A backend about to crash may slow down before resetting | P95 upstream response time >80% of proxy_read_timeout |
| Backend process restarts | Crashes directly cause mid-request RSTs | Restart count increasing outside of deploy windows |
| Kernel OOM kills | OOM killer terminates backend workers without graceful close | OOM events in dmesg matching 502 spikes |
| Active connections in Writing state | High Writing indicates slow or hung upstreams | Writing >90% of active connections with low throughput |
| Error log rate for 104 | Direct indicator of RST frequency | Sustained recv() 104 errors in error log |
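
The error-log rate signal can be approximated without a metrics pipeline by bucketing matching lines per minute. This is a sketch under the same assumption as the default error-log timestamp layout (date in field 1, HH:MM:SS in field 2); errors_104_per_minute is a hypothetical helper name.

```shell
# Print "<count> <date> <HH:MM>" for recv() 104 errors, one line per minute,
# in chronological order.
errors_104_per_minute() {
    grep -F "recv() failed (104" "$1" |
        awk '{ print $1, substr($2, 1, 5) }' |   # keep the date and HH:MM only
        sort | uniq -c |
        awk '{ print $1, $2, $3 }'               # normalize uniq -c padding
}

# Example: show the ten worst minutes
# errors_104_per_minute /var/log/nginx/error.log | sort -k1,1nr | head -10
```

Comparing the busiest minutes against dmesg -T output or backend restart times is a quick way to test the crash and OOM hypotheses.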

Fixes

Backend crash or OOM kill

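Fix the underlying crash or memory leak, or raise memory limits if the workload legitimately needs more. If the backend uses a worker model, ensure worker restarts are graceful so that in-flight requests finish before the process exits. As a mitigation, proxy_next_upstream error timeout (which is also the default value) lets nginx retry a failed request on another server in the upstream group, provided the group has more than one server and nothing has been sent to the client yet. By default nginx does not retry requests with non-idempotent methods (POST, LOCK, PATCH) once they have been sent upstream; it does so only if you add the non_idempotent parameter. Do not add it blindly, because a retried POST may duplicate side effects. PUT and DELETE are treated as idempotent and are retried, so make sure their handlers really are safe to repeat.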

Keepalive idle timeout race

Align timeouts so the backend never closes a connection before nginx expects it. If the backend’s keepalive timeout is 5 seconds and nginx holds connections for 60 seconds, nginx will attempt to reuse dead connections. Either increase the backend timeout or decrease nginx’s keepalive_timeout in the upstream block. Reducing keepalive_requests is another option; it forces more frequent connection rotation, which avoids stale pooled connections at the cost of extra TCP handshakes.

Middlebox idle timeout

Increase the idle timeout on the load balancer or firewall to be larger than nginx’s upstream keepalive timeout. If you cannot change the middlebox, reduce nginx’s keepalive_timeout below the middlebox limit. As a last resort, disable upstream keepalive for that backend to avoid reuse races entirely.

Upstream keepalive pool churn

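If $upstream_connect_time is rarely 0.000, the keepalive pool is ineffective. Verify the keepalive directive is set in the upstream block. Upstream keepalive also requires proxy_http_version 1.1 and proxy_set_header Connection "" in the nginx location block: unless configured otherwise, nginx has traditionally proxied over HTTP/1.0 and sent Connection: close, which makes the backend close the connection after every response. The backend itself must support HTTP/1.1 persistent connections.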
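
A quick way to check whether the three settings named above appear in the running configuration is to scan the output of nginx -T. The sketch below is deliberately coarse: check_upstream_keepalive is a hypothetical helper, it only reports whether each directive appears anywhere in the dump, and it does not verify that the directives apply to the same upstream and location.

```shell
# Read an nginx config dump on stdin and report whether the three settings
# needed for upstream keepalive appear anywhere in it.
check_upstream_keepalive() {
    conf=$(cat)
    report() {  # report <label> <extended-regex>
        if printf '%s\n' "$conf" | grep -Eq -e "$2"; then
            echo "ok: $1"
        else
            echo "missing: $1"
        fi
    }
    # "keepalive <n>;" only; keepalive_timeout and keepalive_requests do not match.
    report 'keepalive <n>' \
        '^[[:space:]]*keepalive[[:space:]]+[0-9]+[[:space:]]*;'
    report 'proxy_http_version 1.1' \
        '^[[:space:]]*proxy_http_version[[:space:]]+1\.1[[:space:]]*;'
    # Matches the double-quoted empty value only.
    report 'proxy_set_header Connection ""' \
        '^[[:space:]]*proxy_set_header[[:space:]]+Connection[[:space:]]+""[[:space:]]*;'
}

# Example: nginx -T 2>/dev/null | check_upstream_keepalive
```

Any "missing" line is a prompt to read the relevant upstream and location blocks by hand, not proof of a misconfiguration.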

MTU or network path issues

If large responses trigger the error while small requests succeed, verify consistent path MTU between nginx and backend. Ensure ICMP fragmentation needed messages are not dropped by intermediate firewalls.
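
One way to test the path is to send pings that are not allowed to fragment, sized to fill each candidate MTU. The sketch below assumes IPv4 and the Linux iputils ping (where -M do sets the Don't Fragment bit); icmp_payload_for_mtu and probe_path_mtu are hypothetical helper names, and the candidate MTU values in the example are arbitrary.

```shell
# ICMP payload size that exactly fills a given IPv4 MTU:
# MTU minus the 20-byte IPv4 header minus the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send non-fragmentable pings sized for each candidate MTU.
# Usage: probe_path_mtu <host> <mtu> [<mtu> ...]
probe_path_mtu() {
    host=$1
    shift
    for mtu in "$@"; do
        if ping -M do -c 2 -W 2 -s "$(icmp_payload_for_mtu "$mtu")" "$host" >/dev/null 2>&1; then
            echo "MTU $mtu: ok"
        else
            echo "MTU $mtu: failed"
        fi
    done
}

# Example: probe_path_mtu <BACKEND_IP> 1500 1450 1400
```

If the largest size fails while smaller ones succeed, the path MTU is below the interface MTU. If every size fails, ICMP echo may simply be blocked, and the test is inconclusive.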

Prevention

  • Log upstream timing variables. Include $upstream_connect_time, $upstream_response_time, and $upstream_addr in your access log format. Without these, you cannot distinguish a keepalive race from a backend crash.
  • Align idle timeouts. Ensure backend keepalive timeouts exceed nginx’s upstream keepalive_timeout; if the two values are equal, the race remains. nginx should be the first to close an idle connection, not the backend.
  • Set worker shutdown timeout. Configure worker_shutdown_timeout to prevent old workers from lingering indefinitely during reloads and holding open stale upstream connections.
  • Monitor backend health independently. A backend that is alive enough to accept TCP but unhealthy enough to crash mid-response needs application-level monitoring, not just a port check.
  • Configure retries carefully. Use proxy_next_upstream error timeout for read-only endpoints. Avoid retrying mutating requests unless the operation itself is idempotent.

How Netdata helps

  • Correlate 502 spikes with backend CPU, memory, and disk metrics to spot OOM or resource exhaustion before checking dmesg.
  • Watch nginx Writing state growth. A climbing Writing count with flat throughput signals slow upstreams that may reset.
  • Compare requests per second against 5xx rate: flat RPS with a 5xx spike points to backend crashes; dropping RPS with climbing connections points to a connection leak.
  • Set alerts on backend process restarts. A process that cycles outside of deploy windows often produces recv() 104 errors.
  • Ingest access logs via the web log collector to chart upstream response time percentiles and upstream address distribution.