NGINX SSL/TLS handshake CPU saturation: detection and tuning

NGINX latency climbs while requests per second flatline. Worker processes are pinned near 100% CPU, yet active connections are nowhere near the worker_connections limit. Access logs show fast upstream response times, but $request_time is an order of magnitude larger. The bottleneck is not the network, disk, or backends: it is the TLS handshake.

When workers burn CPU on cryptography, the single-threaded event loop has no time left for request processing. Every new TCP connection that requires a full SSL handshake adds asymmetric crypto workload. If clients do not resume sessions and the connection rate is high, throughput collapses even though the machine has plenty of idle connection slots. This guide shows how to detect, diagnose, and tune for SSL handshake CPU saturation.

What this means

TLS handshakes are the most CPU-intensive operation nginx performs. A worker can handle tens of thousands of plain HTTP requests per second, but only hundreds to low thousands of RSA handshakes per second before it saturates a core. Because each worker runs a single-threaded event loop, crypto work blocks all other connections on that worker.

The characteristic signature is CPU saturation decoupled from connection count: thousands of unused connection slots while every worker is stuck in SSL_do_handshake. The root cause is almost always a flood of new connections combined with poor session resumption. Short-lived clients, upstream load balancer pool churn, or a cold session cache after a restart can all trigger the same pattern: connection rate climbs, request throughput stalls, and latency rises across the board. When workers are too slow to call accept(), the kernel listen queue fills and connections drop before nginx sees them.

flowchart TD
    A[High new connection rate] --> B{Session resumption?}
    B -->|Cache miss or new session| C[Full TLS handshake]
    B -->|Cache hit| D[Reused session]
    C --> E[Worker CPU saturates]
    E --> F[Event loop starved]
    F --> G[Requests queue and time out]
    E --> H[Accept queue backs up]
    H --> I[Kernel drops SYNs]

Common causes

Cause	What it looks like	First thing to check
Session cache too small or cold	Low cache hit rate; CPU spikes after restart or traffic surge	`ssl_session_cache` size versus connection rate
Session timeout too short	Resumption works briefly then falls off	`ssl_session_timeout` value (default is 5 minutes)
Session tickets disabled	Low hit rate despite adequate cache size	Whether `ssl_session_tickets` is enabled
TLS 1.2-only with RSA key exchange	Very high CPU per new connection	`$ssl_protocol` distribution in access logs
Short-lived connections without keepalive	High new-connection rate, low active connection count	`stub_status` reading/writing/waiting breakdown
SSL handshake flood	Sudden connection spike; `TcpExtListenOverflows` rising	Connection rate versus baseline

Quick checks

Run these read-only checks to confirm the pattern.

Check connection states and throughput from stub_status:

# Active connections, accepts, handled, requests
# (adjust the URL to wherever stub_status is exposed in your config)
curl -s http://127.0.0.1/nginx_status

Check per-worker CPU. A worker near 100% is fully saturated:

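# CPU per worker.
# Note: ps reports %CPU averaged over the process lifetime, so confirm
# a current spike with an interval sampler such as top or pidstat.
for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
  cpu=$(ps -o %cpu= -p "$pid")
  echo "Worker $pid: ${cpu}%"
done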

Measure SSL session resumption (requires $ssl_session_reused in log_format):

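# Rough SSL session resumption hit rate
# ($ssl_session_reused logs "r" for a reused session and "." for a new one)
tail -n 10000 /var/log/nginx/access.log | \
  awk '{for(i=1;i<=NF;i++) if($i=="r" || $i==".") count[$i]++}
       END {reused=count["r"]+0; new_sess=count["."]+0; total=reused+new_sess;
            if (total == 0) print "No $ssl_session_reused values found";
            else printf "Hit rate: %.1f%%\n", reused/total*100}'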
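
As a cross-check on the log-based figure, resumption can also be probed actively. The sketch below assumes openssl is installed; example.com is a placeholder hostname and count_resumptions is a hypothetical helper name. It tallies the "New," and "Reused," summary lines that openssl s_client prints for each connection when run with -reconnect.

```shell
# Count full vs. resumed handshakes in `openssl s_client -reconnect` output on stdin.
count_resumptions() {
  awk '/^New, /    {new++}
       /^Reused, / {reused++}
       END {printf "new=%d reused=%d\n", new, reused}'
}

# Usage against a live server (hostname is a placeholder):
#   openssl s_client -connect example.com:443 -tls1_2 -reconnect </dev/null 2>/dev/null \
#     | count_resumptions
```

A working cache typically shows one new session followed by reused ones. The -tls1_2 flag is a choice made for this probe, because TLS 1.3 tickets are delivered after the handshake and -reconnect may not pick them up. Behind a load balancer, a run of "New" results can also mean the probe is landing on different nodes.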

Check for dropped connections. A growing gap means nginx cannot keep up:

# Dropped connections (accepts - handled)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "gap=" $1-$2; exit}'

Check kernel-level drops that do not appear in nginx logs:

# Kernel listen queue overflows (-a: absolute values, -z: include zero counters)
nstat -az 2>/dev/null | awk '/TcpExtListenOverflows/ {print $2}'

Verify which TLS versions clients are actually negotiating:

# TLS version distribution (requires $ssl_protocol in log_format)
tail -n 10000 /var/log/nginx/access.log | \
  awk '{for(i=1;i<=NF;i++) if($i ~ /^TLSv/) count[$i]++}
       END {for(v in count) print v": "count[v]}'

Check for handshake errors:

# Recent SSL handshake errors
tail -n 1000 /var/log/nginx/error.log | grep -c 'SSL_do_handshake() failed'

How to diagnose it

Confirm CPU is SSL-bound, not upstream-bound. Compare $request_time and $upstream_response_time. If upstream is fast but total time is high while CPU is pinned, the delay is in nginx. Corroborate with a low SSL session hit rate.
Check per-worker CPU distribution. If all workers are near 100%, the load is systemic. If one worker is hot while others are idle, suspect uneven connection distribution from reuseport or a single noisy client.
Measure session resumption. Parse $ssl_session_reused from access logs. A sustained hit rate below 70% means most connections pay full handshake cost.
Compare connection acceptance rate to request completion rate. If accepts grows rapidly but requests does not, workers spend time on connections that never generate requests because handshakes crowd out the event loop.
Check the kernel listen queue. If TcpExtListenOverflows is increasing, workers are too busy with crypto to pull new connections from the backlog.
Inspect error logs for SSL_do_handshake() failed or certificate errors that force retries and add overhead.
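
The comparison of acceptance rate to request rate in the steps above can be sketched with two stub_status samples. This is a minimal sketch: stub_counters and stub_rates are hypothetical helper names, the status URL is the same assumed location used in the quick checks, and the 10-second interval is arbitrary.

```shell
# Extract the "accepts handled requests" counters from stub_status output on stdin.
stub_counters() {
  awk '/^[[:space:]]*[0-9]+[[:space:]]+[0-9]+[[:space:]]+[0-9]+/ {print $1, $2, $3; exit}'
}

# Print per-second accept and request rates, plus connections dropped, between two samples.
# Args: "<accepts handled requests>" before, the same after, interval in seconds (> 0).
stub_rates() {
  echo "$1 $2 $3" | awk '{
    printf "accepts/s=%.1f requests/s=%.1f dropped=%d\n",
      ($4 - $1) / $7, ($6 - $3) / $7, ($4 - $5) - ($1 - $2)
  }'
}

# Usage on a live host:
#   a=$(curl -s http://127.0.0.1/nginx_status | stub_counters)
#   sleep 10
#   b=$(curl -s http://127.0.0.1/nginx_status | stub_counters)
#   stub_rates "$a" "$b" 10
```

If accepts/s climbs between runs while requests/s stays flat, the pattern matches handshake saturation rather than backend slowness.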

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Worker CPU utilization per process	Workers are single-threaded; 100% on one core blocks the event loop	Sustained >90% per worker
SSL session cache hit rate	Full handshakes are expensive; resumption is cheap	<70% sustained
Connection rate vs. request rate	Handshakes inflate connection count without yielding requests	Connection rate high, RPS flat or falling
`$request_time` minus `$upstream_response_time`	Isolates client-facing and nginx overhead from backend latency	Gap widening while upstream is stable
Dropped connections (accepts - handled)	Workers too saturated to accept new connections	Gap increasing over time
`TcpExtListenOverflows`	Kernel drops connections before nginx sees them	Counter increasing

Fixes

Increase shared session cache size and timeout

The default ssl_session_timeout is 5 minutes, often too short for production traffic. Size the cache with the rough formula: new connections per second x session_timeout in seconds / 4000 = size in megabytes, since one megabyte holds about 4000 sessions. Always use ssl_session_cache shared:SSL:<size> so all workers share one pool. With the builtin cache type instead of shared, each worker maintains an isolated cache and resumption fails when a client reconnects to a different process.
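
To make the sizing rule concrete, here is a small worked calculation. It assumes the rate is measured in new TLS connections per second and the timeout in seconds; the figures (200 connections per second, a one-hour timeout) and the helper name session_cache_mb are illustrative only.

```shell
# Estimate shared session cache size in MB:
# new_connections_per_sec * timeout_sec / 4000, rounded up.
session_cache_mb() {
  rate=$1 timeout=$2
  echo $(( (rate * timeout + 3999) / 4000 ))
}

# Illustrative inputs: 200 new connections/s, 3600 s timeout.
mb=$(session_cache_mb 200 3600)

# Print candidate directives for review before placing them in the config.
echo "ssl_session_cache shared:SSL:${mb}m;"
echo "ssl_session_timeout 1h;"
```

With these inputs the estimate is 180 MB. Treat it as a starting point and adjust against the measured hit rate, since the formula assumes every new connection creates a distinct session.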

Enable session tickets

Session tickets move session state to the client, reducing server memory pressure. If your security posture allows it, enabling tickets improves resumption for clients that do not use cache-based reuse. In a multi-node deployment, synchronize ticket keys across all instances or resumption breaks on failover.
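
For the multi-node case, one common approach is a shared key file referenced by the ssl_session_ticket_key directive. The sketch below only generates such a file; the scratch path and the helper name make_ticket_key are assumptions for illustration, and distribution to the other nodes is left to whatever secret-management process you already use.

```shell
# Create an 80-byte session ticket key file readable only by its owner.
make_ticket_key() {
  ( umask 077 && openssl rand 80 > "$1" )
}

# Example: write a key to a scratch location, then confirm its size in bytes.
keyfile="${TMPDIR:-/tmp}/ticket.key.example"
make_ticket_key "$keyfile"
wc -c < "$keyfile"
```

Every node would then point at its copy of the same file with ssl_session_ticket_key. Rotating the key periodically limits exposure if it leaks; the rotation schedule is a policy decision this sketch does not cover.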

Enable TLS 1.3 and HTTP/2

TLS 1.3 completes a full handshake in one round trip, removes static RSA key exchange, and supports PSK-based resumption that skips certificate authentication. HTTP/2 multiplexes requests over a single connection, directly reducing the total number of handshakes required. Enable both unless legacy clients prevent it.

Maximize connection reuse

Increase keepalive_timeout and keepalive_requests to hold client connections open longer, assuming you have connection headroom. Monitor the Waiting connection count in stub_status; if Waiting connections crowd out new ones, you have tuned too aggressively. For upstream connections, verify that the keepalive directive is configured in upstream blocks so backend connections, and any TLS handshakes they require, are reused. Upstream keepalive for HTTP also needs proxy_http_version 1.1 and a cleared Connection header.
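
One rough way to see whether keepalive tuning is working is the average number of requests served per handled connection, derived from two stub_status samples. This is a sketch under stated assumptions: reqs_per_conn is a hypothetical helper, and each argument is an "accepts handled requests" triple copied from the stub_status counter line.

```shell
# Average requests served per handled connection between two samples.
# Args: "<accepts handled requests>" before, the same after.
reqs_per_conn() {
  echo "$1 $2" | awk '{
    conns = $5 - $2; reqs = $6 - $3
    if (conns <= 0) { print "n/a"; exit }
    printf "%.2f\n", reqs / conns
  }'
}

# Illustrative samples: 1000 new connections carried 5000 requests.
reqs_per_conn "100 100 200" "1100 1100 5200"
```

A value close to 1 suggests nearly every request pays for a new connection and, on TLS listeners, potentially a new handshake. The value should rise as keepalive and HTTP/2 take effect.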

Rate-limit new connections

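If the connection spike is caused by a flash crowd or attack, use limit_conn on $binary_remote_addr to cap concurrent connections per source. Be aware that in the http context limit_conn counts a connection only once its request header has been fully read, which happens after the TLS handshake, so it limits what a source can do once connected rather than the handshake cost itself. To keep handshake load from monopolizing the event loop, apply a limit ahead of TLS termination as well, for example at a firewall or layer-4 load balancer.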

Offload TLS termination

If nginx remains CPU-bound after tuning session reuse, move TLS termination to a dedicated edge load balancer or layer-4 terminator. This is the last resort when the local CPU budget cannot support the required connection rate.

Prevention

Size the shared session cache for peak unique connections, not average load.
Monitor SSL session hit rate as a leading indicator of CPU pressure.
Log $ssl_protocol, $ssl_cipher, and $ssl_session_reused to track resumption efficiency and protocol distribution.
Keep worker_connections well above peak active connections so keepalive can be maintained without exhausting slots.
Plan CPU capacity for peak handshake rate, not just request throughput.

How Netdata helps

Netdata correlates per-worker CPU utilization with nginx connection and request rates to spot when handshake load is the bottleneck rather than application latency.
Alerts on nginx.connections_accepted minus nginx.connections_handled detect silent connection drops before users report timeouts.
Kernel-level charts for TcpExtListenOverflows reveal when the accept queue drops connections due to slow worker acceptance.
Access-log integration tracks $ssl_session_reused hit rate and TLS version distribution when the log format includes them.
Per-process CPU breakdown shows whether one worker is overloaded or all workers are saturated, distinguishing between reuseport imbalance and systemic SSL load.

