NGINX SSL/TLS handshake CPU saturation: detection and tuning
NGINX latency climbs while requests per second flatline. Worker processes are pinned near 100% CPU, yet active connections are nowhere near the worker_connections limit. Access logs show fast upstream response times, but $request_time is an order of magnitude larger. The bottleneck is not the network, disk, or backends: it is the TLS handshake.
When workers burn CPU on cryptography, the single-threaded event loop has no time left for request processing. Every new TCP connection that requires a full SSL handshake adds asymmetric crypto workload. If clients do not resume sessions and the connection rate is high, throughput collapses even though the machine has plenty of idle connection slots. This guide shows how to detect, diagnose, and tune for SSL handshake CPU saturation.
What this means
TLS handshakes are typically the most CPU-intensive per-connection operation nginx performs. A worker can handle tens of thousands of plain HTTP requests per second, but only hundreds to low thousands of RSA handshakes per second before it saturates a core. Because each worker runs a single-threaded event loop, crypto work blocks all other connections on that worker.
The characteristic signature is CPU saturation decoupled from connection count: thousands of unused connection slots while every worker is stuck in SSL_do_handshake. The root cause is almost always a flood of new connections combined with poor session resumption. Short-lived clients, upstream load balancer pool churn, or a cold session cache after a restart can all trigger the same pattern: connection rate climbs, request throughput stalls, and latency rises across the board. When workers are too slow to call accept(), the kernel listen queue fills and connections drop before nginx sees them.
flowchart TD
A[High new connection rate] --> B{Session resumption?}
B -->|Cache miss or new session| C[Full TLS handshake]
B -->|Cache hit| D[Reused session]
C --> E[Worker CPU saturates]
E --> F[Event loop starved]
F --> G[Requests queue and time out]
E --> H[Accept queue backs up]
H --> I[Kernel drops SYNs]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Session cache too small or cold | Low cache hit rate; CPU spikes after restart or traffic surge | ssl_session_cache size versus connection rate |
| Session timeout too short | Resumption works briefly then falls off | ssl_session_timeout value (default is 5 minutes) |
| Session tickets disabled | Low hit rate despite adequate cache size | Whether ssl_session_tickets is enabled |
| TLS 1.2-only with RSA key exchange | Very high CPU per new connection | $ssl_protocol distribution in access logs |
| Short-lived connections without keepalive | High new-connection rate, low active connection count | stub_status reading/writing/waiting breakdown |
| SSL handshake flood | Sudden connection spike; TcpExtListenOverflows rising | Connection rate versus baseline |
Quick checks
Run these read-only checks to confirm the pattern.
Check connection states and throughput from stub_status:
# Active connections, accepts, handled, requests
curl -s http://127.0.0.1/nginx_status
Check per-worker CPU. A worker near 100% is fully saturated:
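# CPU per worker. Note: ps reports %cpu averaged over the process lifetime;
# use top or pidstat for an instantaneous reading.
for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
  cpu=$(ps -o %cpu= -p "$pid")
  echo "Worker $pid: ${cpu}%"
done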
Measure SSL session resumption (requires $ssl_session_reused in log_format):
# Rough SSL session resumption hit rate
tail -n 10000 /var/log/nginx/access.log | \
awk '{for(i=1;i<=NF;i++) if($i=="r" || $i==".") count[$i]++}
END {reused=count["r"]+0; new_sess=count["."]+0; total=reused+new_sess;
if (total==0) print "No $ssl_session_reused values found";
else print "Hit rate:", reused/total*100 "%"}'
Check for dropped connections. A growing gap means nginx cannot keep up:
# Dropped connections (accepts - handled)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "gap=" $1-$2; exit}'
Check kernel-level drops that do not appear in nginx logs:
# Kernel listen queue overflows (-z also prints the counter while it is still zero)
nstat -az 2>/dev/null | awk '/TcpExtListenOverflows/ {print $2}'
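The overflow counter only shows drops after they happen. As a complementary sketch, the accept queue depth can be read from `ss -ltn`, where for listening sockets Recv-Q is the current accept queue length and Send-Q is the backlog limit. The helper name `accept_queue_usage` is illustrative, not a standard tool.

```shell
# Show accept queue depth versus backlog limit for listening TCP sockets.
# Reads `ss -ltn` output on stdin: for LISTEN sockets, Recv-Q is the current
# accept queue length and Send-Q is the configured backlog.
accept_queue_usage() {
  awk '$1 == "LISTEN" && $3 > 0 {
    printf "%s queued=%d backlog=%d (%.0f%%)\n", $4, $2, $3, $2 * 100 / $3
  }'
}

# Usage: ss -ltn | accept_queue_usage
```

A queue that sits close to its backlog on the TLS port suggests workers are not calling accept() fast enough.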
Verify which TLS versions clients are actually negotiating:
# TLS version distribution (requires $ssl_protocol in log_format)
tail -n 10000 /var/log/nginx/access.log | \
awk '{for(i=1;i<=NF;i++) if($i ~ /^TLSv/) count[$i]++}
END {for(v in count) print v": "count[v]}'
Check for handshake errors:
# Recent SSL handshake errors
tail -n 1000 /var/log/nginx/error.log | grep -c 'SSL_do_handshake() failed'
How to diagnose it
- Confirm CPU is SSL-bound, not upstream-bound. Compare $request_time and $upstream_response_time. If upstream is fast but total time is high while CPU is pinned, the delay is in nginx. Corroborate with a low SSL session hit rate.
- Check per-worker CPU distribution. If all workers are near 100%, the load is systemic. If one worker is hot while others are idle, suspect uneven connection distribution from reuseport or a single noisy client.
- Measure session resumption. Parse $ssl_session_reused from access logs. A sustained hit rate below 70% means a large share of connections pay the full handshake cost.
- Compare connection acceptance rate to request completion rate. If accepts grows rapidly but requests does not, workers spend time on connections that never generate requests because handshakes crowd out the event loop.
- Check the kernel listen queue. If TcpExtListenOverflows is increasing, workers are too busy with crypto to pull new connections from the backlog.
- Inspect error logs for SSL_do_handshake() failed or certificate errors that force retries and add overhead.
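The first step, comparing total request time with upstream time, can be roughed out from the access log. This is a minimal sketch that assumes a hypothetical log_format appending `rt=$request_time urt=$upstream_response_time` to each line; the `rt=` and `urt=` prefixes and the function name are illustrative, so adjust them to your own format.

```shell
# Average time spent outside the upstream (request_time - upstream_response_time).
# Assumes each log line carries rt=<seconds> and urt=<seconds> tokens; this
# key=value layout is a hypothetical log_format, adjust the prefixes to yours.
nginx_overhead() {
  awk '
    {
      rt = ""; urt = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^rt=/)  rt  = substr($i, 4)
        if ($i ~ /^urt=/) urt = substr($i, 5)
      }
      # Skip lines without both values (urt is "-" when no upstream was used).
      if (rt == "" || urt == "" || urt == "-") next
      sum += rt - urt; n++
    }
    END {
      if (n == 0) { print "no samples"; exit }
      printf "avg_overhead=%.3fs over %d requests\n", sum / n, n
    }'
}

# Usage: tail -n 10000 /var/log/nginx/access.log | nginx_overhead
```

A large average gap while upstream times stay flat points at time spent inside nginx or on the client side, which is consistent with handshake pressure but does not prove it on its own.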
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Worker CPU utilization per process | Workers are single-threaded; 100% on one core blocks the event loop | Sustained >90% per worker |
| SSL session cache hit rate | Full handshakes are expensive; resumption is cheap | <70% sustained |
| Connection rate vs. request rate | Handshakes inflate connection count without yielding requests | Connection rate high, RPS flat or falling |
| $request_time minus $upstream_response_time | Isolates client-facing and nginx overhead from backend latency | Gap widening while upstream is stable |
| Dropped connections (accepts - handled) | Workers too saturated to accept new connections | Gap increasing over time |
| TcpExtListenOverflows | Kernel drops connections before nginx sees them | Counter increasing |
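The stub_status counters are cumulative, so the connection rate versus request rate signal needs two samples. The sketch below turns two saved snapshots into per-second rates; the function name and the temporary file paths in the usage comment are illustrative assumptions.

```shell
# Print per-second accept, handled, and request rates from two stub_status
# snapshots taken a known number of seconds apart.
# Arguments: <first snapshot file> <second snapshot file> <interval seconds>
stub_status_rates() {
  awk -v dt="$3" '
    # The counters line is the only one that starts with a number.
    /^[[:space:]]*[0-9]/ {
      if (FNR == NR) { a0 = $1; h0 = $2; r0 = $3 }
      else           { a1 = $1; h1 = $2; r1 = $3 }
    }
    END {
      printf "accepts/s=%.1f handled/s=%.1f requests/s=%.1f dropped=%d\n",
             (a1 - a0) / dt, (h1 - h0) / dt, (r1 - r0) / dt,
             (a1 - h1) - (a0 - h0)
    }' "$1" "$2"
}

# Usage (the status URL is whatever your stub_status location exposes):
#   curl -s http://127.0.0.1/nginx_status > /tmp/s0; sleep 10
#   curl -s http://127.0.0.1/nginx_status > /tmp/s1
#   stub_status_rates /tmp/s0 /tmp/s1 10
```

An accepts/s figure that climbs while requests/s stays flat matches the warning sign in the table; a nonzero dropped value means the accepts minus handled gap grew during the interval.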
Fixes
Increase shared session cache size and timeout
The default ssl_session_timeout is 5 minutes, often too short for production traffic. Size the cache with the rough formula: cache size in megabytes ≈ new connections per second × session timeout in seconds / 4000, since one megabyte holds about 4000 sessions. Always use ssl_session_cache shared:SSL:<size> so all workers access the same pool. A builtin cache is private to each worker, so resumption fails when a client reconnects to a different process.
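As a worked example of the sizing formula, the sketch below computes the cache size for an assumed load of 500 new connections per second and a one hour timeout. Both numbers and the function name are illustrative; substitute your measured peak rate.

```shell
# Estimate shared session cache size in megabytes, rounding up.
# Arguments: <new TLS connections per second> <ssl_session_timeout in seconds>
# Uses the rule of thumb of roughly 4000 sessions per megabyte.
session_cache_mb() {
  rate=$1
  timeout=$2
  sessions=$(( rate * timeout ))
  echo $(( (sessions + 3999) / 4000 ))
}

# Example: 500 new connections/s with a 1 hour timeout (assumed figures)
mb=$(session_cache_mb 500 3600)
echo "ssl_session_cache shared:SSL:${mb}m;"
echo "ssl_session_timeout 3600s;"
```

With these inputs the estimate is 450 megabytes, which shows how quickly a long timeout multiplies memory needs at high connection rates.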
Enable session tickets
Session tickets move session state to the client, reducing server memory pressure. If your security posture allows it, enabling tickets improves resumption for clients that do not use cache-based reuse. In a multi-node deployment, synchronize ticket keys across all instances or resumption breaks on failover.
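For multi-node setups, one way to keep ticket keys in sync is to generate a key file once and distribute it to every instance, then point the ssl_session_ticket_key directive at it. The sketch below only creates the file; the function name and the path in the comment are hypothetical, and how the file is distributed is left to your tooling.

```shell
# Write 80 random bytes to a key file readable only by its owner.
# nginx accepts 48 or 80 byte files for ssl_session_ticket_key.
make_ticket_key() {
  ( umask 077 && head -c 80 /dev/urandom > "$1" )
}

# Usage (hypothetical path; copy the same file to every node):
#   make_ticket_key /etc/nginx/tls/ticket.key
# then reference it in the server configuration:
#   ssl_session_ticket_key /etc/nginx/tls/ticket.key;
```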
Enable TLS 1.3 and HTTP/2
TLS 1.3 reduces full handshakes to 1-RTT and supports PSK-based resumption that skips certificate authentication, usually the most expensive server-side step. HTTP/2 multiplexes requests over a single connection, directly reducing the total number of handshakes required. Enable both unless legacy clients prevent it.
Maximize connection reuse
Increase keepalive_timeout and keepalive_requests to hold client connections open longer, assuming you have connection headroom. Monitor the Waiting connection count in stub_status; if Waiting connections crowd out new ones, you have tuned too aggressively. For upstream connections, verify that the keepalive directive is configured in upstream blocks so backend TLS handshakes are also reused.
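To watch the Waiting count while adjusting keepalive settings, a small helper can report idle connections as a share of active ones. This is a sketch that parses standard stub_status output; the function name is illustrative.

```shell
# Report idle keepalive connections as a share of active connections.
# Reads stub_status output on stdin.
waiting_share() {
  awk '
    /^Active connections:/ { active = $3 }
    /^Reading:/            { waiting = $6 }
    END {
      if (active > 0)
        printf "waiting=%d active=%d share=%.0f%%\n", waiting, active, waiting * 100 / active
    }'
}

# Usage: curl -s http://127.0.0.1/nginx_status | waiting_share
```

Compare the active figure against your total connection slots; a high waiting share is harmless only while there is headroom left.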
Rate-limit new connections
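If the connection spike is caused by a flash crowd or attack, use limit_conn on $binary_remote_addr to cap concurrent connections per source. Because nginx counts a connection toward limit_conn only after the request header has been read, the handshake has already been paid for by then; the limit curbs how many connections one source can hold open, while dropping abusive sources before the handshake has to happen at the firewall or an upstream layer.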
Offload TLS termination
If nginx remains CPU-bound after tuning session reuse, move TLS termination to a dedicated edge load balancer or layer-4 terminator. This is the last resort when the local CPU budget cannot support the required connection rate.
Prevention
- Size the shared session cache for peak unique connections, not average load.
- Monitor SSL session hit rate as a leading indicator of CPU pressure.
- Log $ssl_protocol, $ssl_cipher, and $ssl_session_reused to track resumption efficiency and protocol distribution.
- Keep worker_connections well above peak active connections so keepalive can be maintained without exhausting slots.
- Plan CPU capacity for peak handshake rate, not just request throughput.
How Netdata helps
- Netdata correlates per-worker CPU utilization with nginx connection and request rates to spot when handshake load is the bottleneck rather than application latency.
- Alerts on nginx.connections_accepted minus nginx.connections_handled detect silent connection drops before users report timeouts.
- Kernel-level charts for TcpExtListenOverflows reveal when the accept queue drops connections due to slow worker acceptance.
- Access-log integration tracks $ssl_session_reused hit rate and TLS version distribution when the log format includes them.
- Per-process CPU breakdown shows whether one worker is overloaded or all workers are saturated, distinguishing between reuseport imbalance and systemic SSL load.