nginx recv() failed (104: Connection reset by peer) while reading from upstream
You tail the nginx error log and see this:
[error] ... recv() failed (104: Connection reset by peer) while reading response header from upstream
The client gets a 502 Bad Gateway. The error is not a configuration syntax problem, and the service was working five minutes ago. The message means nginx received a TCP RST while it was reading from an upstream connection. The reset originates from the backend or from a network middlebox between nginx and the backend. It does not come from nginx itself, and it does not come from the client.
This article covers the upstream-side variant. If your log says “while reading client request body,” the reset came from the client and requires a different diagnosis. Here we focus on why backends drop nginx mid-request, how to distinguish the six common root causes, and what to fix.
What this means
errno 104 is ECONNRESET. In the upstream context, the peer sent a TCP RST. nginx logs this while actively reading from the upstream connection, usually during a proxy_pass transaction. The request typically dies with a 502 response to the client.
This error is distinct from “upstream prematurely closed connection,” which nginx logs when the upstream closes the connection in an orderly way (nginx reads end-of-stream rather than an error) before a complete response has arrived. The recv() 104 error fires when the read itself fails because the peer answered with a TCP RST. Both can happen on a freshly opened connection or on one reused from the keepalive pool.
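To see which of the two messages dominates in a given log, a small helper can count them separately. This is a minimal sketch: the function name is hypothetical, and it assumes the standard error log wording shown above.

```shell
# Hypothetical helper: count upstream-side 104 resets and
# "prematurely closed" messages in one nginx error log file.
upstream_close_breakdown() {
  local resets closes
  # Keep only 104 errors whose phase text names the upstream,
  # e.g. "while reading response header from upstream" or "while reading upstream".
  resets=$(grep -F 'recv() failed (104' "$1" | grep -c 'while reading [a-z ]*upstream')
  closes=$(grep -Fc 'upstream prematurely closed connection' "$1")
  printf 'recv() 104 resets: %s\n' "$resets"
  printf 'prematurely closed: %s\n' "$closes"
}
```

Call it as upstream_close_breakdown /var/log/nginx/error.log, adjusting the path to your installation.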
flowchart TD
A[nginx recv 104 from upstream] --> B{Backend crash or OOM?}
B -->|Yes| C[Check dmesg and app logs]
B -->|No| D{Idle timeout race?}
D -->|Yes| E[Align keepalive and LB timeouts]
D -->|No| F{Large responses only?}
F -->|Yes| G[Check path MTU]
F -->|No| H[Verify upstream health and proxy_next_upstream]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Backend crash or restart mid-request | 502 spikes correlate with deploys, restarts, or application panics | Backend process uptime and application error logs |
| Backend OOM kill or worker termination | 502s appear suddenly under memory pressure; backend PID changes | dmesg or kernel logs for OOM killer activity |
| Keepalive idle timeout race | Intermittent 502s on low-traffic backends; connect time of 0.000 in access log | Backend keepalive_timeout versus nginx upstream keepalive timeout |
| Middlebox idle timeout (LB, firewall, NAT) | Errors appear after a fixed idle interval, e.g., 60 or 350 seconds | Load balancer or proxy idle timeout settings between nginx and backend |
| Upstream keepalive reuse of closed connection | 502s after a burst followed by an idle period; backend closed the connection without nginx knowing | Backend connection logs and upstream_connect_time distribution |
| MTU / PMTU black hole | Consistent failures for large responses only; small requests and health checks succeed | Network path MTU between nginx and backend |
Quick checks
# Count of upstream reset errors in the current error log file
grep -Fc "recv() failed (104" /var/log/nginx/error.log
# Backend uptime by PID
ps -o pid,etime,command -p <BACKEND_PID>
# Recent OOM kills
dmesg -T | grep -i "killed process" | tail -10
# 502 responses with upstream timing
awk '$9 == 502 {print $0}' /var/log/nginx/access.log | tail -20
# Upstream connect time: 0.000 means keepalive reuse.
# Only works if your log_format includes upstream_connect_time=<value>
grep -oP 'upstream_connect_time=\K[0-9.]+' /var/log/nginx/access.log | tail -100 | sort -n | uniq -c
# Direct TCP check from nginx host
timeout 2 bash -c "echo > /dev/tcp/<BACKEND_IP>/<BACKEND_PORT>" && echo "up" || echo "down"
# Active connection states (requires stub_status on localhost)
curl -s http://127.0.0.1/nginx_status
# Upstream keepalive configuration
nginx -T 2>/dev/null | grep -A5 -B5 "keepalive"
How to diagnose it
- Confirm the reset is upstream-side. The phrases “while reading from upstream” or “while reading response header from upstream” confirm the RST came from the backend direction, not the client.
- Correlate timestamps. Match nginx error log entries with backend application logs, systemd journal, or container restart events. A backend crash usually produces an exact timestamp match.
- Check for OOM kills. Run dmesg -T | grep -i oom to see if the backend worker was killed by the kernel at the same time the 502s appeared.
- Inspect access log timing. You must log $upstream_connect_time, $upstream_header_time, and $upstream_response_time. If $upstream_connect_time is 0.000, the connection was reused from the keepalive pool. A 0.000 connect time followed by a 502 strongly suggests the backend closed the idle connection before nginx used it, or the backend died while the connection was pooled.
- Identify the specific backend. Log $upstream_addr to see which server in the upstream group sent the RST. If one backend accounts for all errors, it is the culprit.
- Test for idle timeout races. If errors occur only after periods of low traffic, compare the backend’s keepalive timeout with nginx’s keepalive_timeout in the upstream block. The backend must not close connections faster than nginx expects.
- Check network middleboxes. If there is a cloud load balancer, firewall, or NAT gateway between nginx and the backend, verify its idle connection timeout. Many cloud LBs default to 60 seconds. If nginx’s keepalive idle time exceeds this, the LB may send an RST to nginx when it attempts reuse.
- Rule out MTU issues. If large responses fail consistently but small health checks pass, test path MTU between nginx and the backend.
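The timestamp correlation and per-backend steps can be sketched with two shell helpers. They are illustrative only: the function names are hypothetical, and they assume the standard error log layout, where the first two fields are the date and time and the backend appears in an upstream: "scheme://host:port/..." field.

```shell
# Hypothetical helpers for grouping upstream reset errors.

# Resets per minute: fields 1-2 are "YYYY/MM/DD HH:MM:SS".
resets_per_minute() {
  grep -F 'recv() failed (104' "$1" \
    | awk '{ print $1, substr($2, 1, 5) }' \
    | sort | uniq -c
}

# Resets per backend, taken from the upstream: "scheme://host:port/..." field.
resets_per_upstream() {
  grep -F 'recv() failed (104' "$1" \
    | sed -n 's|.*upstream: "[a-z]*://\([^/"]*\).*|\1|p' \
    | sort | uniq -c | sort -rn
}
```

Running resets_per_minute /var/log/nginx/error.log gives buckets to line up against deploy or restart times, and resets_per_upstream on the same file shows whether one backend dominates. Backends reached over a unix socket would need a different pattern.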
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| HTTP 502 rate | recv() 104 errors usually surface as 502 Bad Gateway | Sustained 502 rate >1% of requests |
| Upstream connect time | 0.000 indicates keepalive reuse; spikes mean new TCP setup | Sudden absence of 0.000 values correlating with 502s |
| Upstream response time | A backend about to crash may slow down before resetting | P95 upstream response time >80% of proxy_read_timeout |
| Backend process restarts | Crashes directly cause mid-request RSTs | Restart count increasing outside of deploy windows |
| Kernel OOM kills | OOM killer terminates backend workers without graceful close | OOM events in dmesg matching 502 spikes |
| Active connections in Writing state | High Writing indicates slow or hung upstreams | Writing >90% of active connections with low throughput |
| Error log rate for 104 | Direct indicator of RST frequency | Sustained recv() 104 errors in error log |
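As a rough way to check the 502 rate signal without a metrics pipeline, the share of 502 responses can be computed straight from the access log. This sketch assumes the status code is the ninth whitespace-separated field, as in the default combined log format; the function name is hypothetical.

```shell
# Hypothetical helper: percentage of requests in an access log that returned 502.
# Assumes the status code is field 9 (default "combined" layout).
rate_502() {
  awk '{ total++; if ($9 == 502) bad++ }
       END { if (total) printf "%.2f\n", 100 * bad / total; else print "0.00" }' "$1"
}
```

For example, rate_502 /var/log/nginx/access.log prints a single number to compare against the 1% warning threshold in the table. It covers the whole file, so run it on a recent slice if you need a time window.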
Fixes
Backend crash or OOM kill
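Fix the application's stability problem or memory leak, or raise its memory limit if the workload legitimately needs more. If the backend uses a worker model, ensure worker restarts are graceful. As a mitigation, proxy_next_upstream error timeout (which is also the default) lets nginx retry the request on another backend, provided nothing has been sent to the client yet. By default nginx does not retry requests with non-idempotent methods (POST, LOCK, PATCH) once they have been sent to an upstream; the non_idempotent parameter enables that, at the risk of duplicated side effects, so do not turn it on blindly. Methods such as PUT and DELETE are retried by default, so make sure those handlers really are idempotent.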
Keepalive idle timeout race
Align timeouts so the backend never closes a connection before nginx expects it. If the backend’s keepalive timeout is 5 seconds and nginx holds connections for 60 seconds, nginx will attempt to reuse dead connections. Either increase the backend timeout or decrease nginx’s keepalive_timeout in the upstream block. Reducing keepalive_requests is another option; it forces more frequent connection rotation, which avoids stale pooled connections at the cost of extra TCP handshakes.
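One way to make the alignment explicit is to derive nginx's upstream idle timeout from the shortest idle timeout on the path and render the upstream block from it. The following is a sketch under assumptions: the function names, the upstream name app_backend, the pool size of 16, and the default 2-second margin are all placeholders, and keepalive_timeout in the upstream context requires nginx 1.15.3 or later.

```shell
# Hypothetical helper: choose an nginx upstream idle timeout that is shorter than
# both the backend and the middlebox idle timeouts.
# Arguments: backend timeout, middlebox timeout, optional margin (whole seconds).
safe_upstream_idle() {
  local lowest idle margin
  margin=${3:-2}
  lowest=$1
  if [ "$2" -lt "$lowest" ]; then lowest=$2; fi
  idle=$(( lowest - margin ))
  # Never go below one second.
  if [ "$idle" -lt 1 ]; then idle=1; fi
  echo "$idle"
}

# Render an upstream block using that value (name, server, idle seconds).
render_upstream() {
  cat <<EOF
upstream $1 {
    server $2;
    keepalive 16;
    keepalive_timeout ${3}s;
}
EOF
}
```

For a backend that closes idle connections after 75 seconds behind a load balancer with a 60-second limit, render_upstream app_backend 10.0.0.5:8080 "$(safe_upstream_idle 75 60 5)" emits a block with keepalive_timeout 55s, so nginx closes idle connections first. Validate any generated configuration with nginx -t before reloading.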
Middlebox idle timeout
Increase the idle timeout on the load balancer or firewall to be larger than nginx’s upstream keepalive timeout. If you cannot change the middlebox, reduce nginx’s keepalive_timeout below the middlebox limit. As a last resort, disable upstream keepalive for that backend to avoid reuse races entirely.
Upstream keepalive pool churn
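If $upstream_connect_time is rarely 0.000, the keepalive pool is ineffective. Verify the keepalive directive is set in the upstream block. Upstream keepalive also needs HTTP/1.1 toward the backend with the Connection header cleared: nginx has traditionally proxied over HTTP/1.0 and sent Connection: close, so set proxy_http_version 1.1 and proxy_set_header Connection "" in the nginx location block. The backend itself must support persistent connections and must not answer with Connection: close.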
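To put a number on "rarely 0.000", the reuse share can be estimated from the access log. This sketch assumes each proxied request is logged with an upstream_connect_time=<value> pair, as in the quick checks above; the function name is hypothetical. Requests with no upstream (logged as -) are skipped, and when several upstreams were tried only the first value is counted.

```shell
# Hypothetical helper: percentage of proxied requests whose upstream connect time
# was 0.000, i.e. that reused a pooled connection.
keepalive_reuse_ratio() {
  grep -o 'upstream_connect_time=[0-9.][0-9.]*' "$1" \
    | awk -F= '{ total++; if ($2 == "0.000") reused++ }
               END { if (total) printf "%.1f\n", 100 * reused / total; else print "0.0" }'
}
```

A value close to zero on a busy backend suggests the pool is not being used at all, which points at the keepalive, proxy_http_version, or Connection header settings rather than at timeouts.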
MTU or network path issues
If large responses trigger the error while small requests succeed, verify consistent path MTU between nginx and backend. Ensure ICMP fragmentation needed messages are not dropped by intermediate firewalls.
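A common way to test the path is to send pings with the Don't Fragment bit set and a payload sized to fill the expected MTU. The sketch below only builds the command, so you can review it before running it. It assumes IPv4 (20-byte IP header plus 8-byte ICMP header) and the Linux iputils ping, whose -M do option prohibits fragmentation; the function names are hypothetical.

```shell
# ICMP payload size that exactly fills a given MTU over IPv4:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Print (do not run) a probe command for a backend IP and an expected MTU.
pmtu_probe_cmd() {
  echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$2") $1"
}
```

For example, pmtu_probe_cmd <BACKEND_IP> 1500 prints a command with -s 1472. If that probe fails while a smaller payload succeeds, the usable path MTU is probably below 1500. If ICMP echo is filtered on the path, the probe is inconclusive.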
Prevention
- Log upstream timing variables. Include $upstream_connect_time, $upstream_response_time, and $upstream_addr in your access log format. Without these, you cannot distinguish a keepalive race from a backend crash.
- Align idle timeouts. Ensure backend keepalive timeouts exceed or match nginx’s upstream keepalive settings. Nginx should be the first to close an idle connection, not the backend.
- Set worker shutdown timeout. Configure worker_shutdown_timeout to prevent old workers from lingering indefinitely during reloads and holding open stale upstream connections.
- Monitor backend health independently. A backend that is alive enough to accept TCP but unhealthy enough to crash mid-response needs application-level monitoring, not just a port check.
- Configure retries carefully. Use proxy_next_upstream error timeout for read-only endpoints. Avoid retrying mutating requests unless the application is idempotent.
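For the logging item, one possible access log format keeps the default combined fields first, so the status code stays in field 9, and appends the upstream variables as key=value pairs. The helper below just prints the snippet for review. The format name upstream_timing, the function name, and the log path are assumptions, not something nginx ships with.

```shell
# Hypothetical helper: print a log_format snippet for the http {} block.
# The quoted heredoc keeps the nginx $variables literal.
print_upstream_log_format() {
  cat <<'EOF'
log_format upstream_timing '$remote_addr - $remote_user [$time_local] "$request" '
                           '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                           'upstream_addr=$upstream_addr '
                           'upstream_connect_time=$upstream_connect_time '
                           'upstream_header_time=$upstream_header_time '
                           'upstream_response_time=$upstream_response_time';
access_log /var/log/nginx/access.log upstream_timing;
EOF
}
```

Paste the output into your configuration and check it with nginx -t. The key=value naming is chosen so that a grep for upstream_connect_time= works on the resulting log.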
How Netdata helps
- Correlate 502 spikes with backend CPU, memory, and disk metrics to spot OOM or resource exhaustion before checking dmesg.
- Watch nginx Writing state growth. A climbing Writing count with flat throughput signals slow upstreams that may reset.
- Compare requests per second against 5xx rate: flat RPS with a 5xx spike points to backend crashes; dropping RPS with climbing connections points to a connection leak.
- Set alerts on backend process restarts. A process that cycles outside of deploy windows often produces recv() 104 errors.
- Ingest access logs via the web log collector to chart upstream response time percentiles and upstream address distribution.
Related guides
- How NGINX actually works in production: a mental model for operators
- nginx 413 Request Entity Too Large: client_max_body_size explained
- nginx 499 status code: why clients close connections before the response
- nginx 500 Internal Server Error: how to diagnose it
- nginx 502 Bad Gateway: causes and how to fix it
- nginx 503 Service Temporarily Unavailable: causes and fixes
- nginx 504 Gateway Time-out: causes and fixes
- NGINX active connections climbing: reading, writing, waiting explained
- nginx: bind() to 0.0.0.0:80 failed (98: Address already in use)
- NGINX backend cascade failure: when slow upstreams take down everything
- nginx: a client request body is buffered to a temporary file - what it means
- NGINX proxy cache hit rate is low: measuring and improving it







