NGINX SSL/TLS handshake CPU saturation: detection and tuning

NGINX latency climbs while requests per second flatline. Worker processes are pinned near 100% CPU, yet active connections are nowhere near the worker_connections limit. Access logs show fast upstream response times, but $request_time is an order of magnitude larger. The bottleneck is not the network, disk, or backends: it is the TLS handshake.

When workers burn CPU on cryptography, the single-threaded event loop has no time left for request processing. Every new TCP connection that requires a full SSL handshake adds asymmetric crypto workload. If clients do not resume sessions and the connection rate is high, throughput collapses even though the machine has plenty of idle connection slots. This guide shows how to detect, diagnose, and tune for SSL handshake CPU saturation.

What this means

TLS handshakes are typically the most CPU-intensive operation nginx performs. A worker can handle tens of thousands of plain HTTP requests per second, but only hundreds to low thousands of RSA handshakes per second before it saturates a core. Because each worker runs a single-threaded event loop, time spent on crypto delays every other connection on that worker.

The characteristic signature is CPU saturation decoupled from connection count: thousands of unused connection slots while every worker is stuck in SSL_do_handshake. The root cause is almost always a flood of new connections combined with poor session resumption. Short-lived clients, upstream load balancer pool churn, or a cold session cache after a restart can all trigger the same pattern: connection rate climbs, request throughput stalls, and latency rises across the board. When workers are too slow to call accept(), the kernel listen queue fills and connections drop before nginx sees them.

flowchart TD
    A[High new connection rate] --> B{Session resumption?}
    B -->|Cache miss or new session| C[Full TLS handshake]
    B -->|Cache hit| D[Reused session]
    C --> E[Worker CPU saturates]
    E --> F[Event loop starved]
    F --> G[Requests queue and time out]
    E --> H[Accept queue backs up]
    H --> I[Kernel drops SYNs]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Session cache too small or cold | Low cache hit rate; CPU spikes after restart or traffic surge | ssl_session_cache size versus connection rate |
| Session timeout too short | Resumption works briefly then falls off | ssl_session_timeout value (default is 5 minutes) |
| Session tickets disabled | Low hit rate despite adequate cache size | Whether ssl_session_tickets is enabled |
| TLS 1.2-only with RSA key exchange | Very high CPU per new connection | $ssl_protocol distribution in access logs |
| Short-lived connections without keepalive | High new-connection rate, low active connection count | stub_status reading/writing/waiting breakdown |
| SSL handshake flood | Sudden connection spike; TcpExtListenOverflows rising | Connection rate versus baseline |

Quick checks

Run these read-only checks to confirm the pattern.

Check connection states and throughput from stub_status:

# Active connections, accepts, handled, requests
curl -s http://127.0.0.1/nginx_status

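Check per-worker CPU. A worker near 100% is fully saturated. Note that ps reports CPU averaged over the life of the process, so confirm a recent spike with top or pidstat:

# CPU per worker (lifetime average as reported by ps)
for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
  cpu=$(ps -o %cpu= -p "$pid")
  echo "Worker $pid: ${cpu}%"
done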
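
Because ps reports %cpu as an average over the whole life of a process, a short sampling window can give a more current picture of worker load. The following is a minimal sketch for Linux, assuming /proc is available; proc_ticks and cpu_pct are hypothetical helper names, and the 5-second window is an arbitrary choice.

```shell
# Total CPU ticks (user + system) consumed so far by a process.
# Fields 14 and 15 of /proc/<pid>/stat are utime and stime; this simple
# field split is safe for nginx because its process name has no spaces.
proc_ticks() {
  awk '{print $14 + $15}' "/proc/$1/stat"
}

# Convert a tick delta over an interval into percent of one core.
# args: ticks_before ticks_after interval_seconds ticks_per_second
cpu_pct() {
  awk -v a="$1" -v b="$2" -v dt="$3" -v hz="$4" \
    'BEGIN {printf "%.1f\n", (b - a) / (dt * hz) * 100}'
}

# Usage (samples each worker in turn, 5 seconds per worker):
#   hz=$(getconf CLK_TCK)
#   for pid in $(pgrep -P "$(cat /var/run/nginx.pid)"); do
#     t1=$(proc_ticks "$pid"); sleep 5; t2=$(proc_ticks "$pid")
#     echo "Worker $pid: $(cpu_pct "$t1" "$t2" 5 "$hz")%"
#   done
```

A value close to 100 for a worker over the window suggests that worker spent nearly the whole interval on CPU.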

Measure SSL session resumption (requires $ssl_session_reused in log_format):

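# Rough SSL session resumption hit rate.
# $ssl_session_reused logs "r" for a resumed session and "." for a new one.
tail -n 10000 /var/log/nginx/access.log | \
  awk '{for(i=1;i<=NF;i++) if($i=="r" || $i==".") count[$i]++}
       END {reused=count["r"]+0; total=reused+count["."];
            if (total) printf "Hit rate: %.1f%%\n", reused/total*100;
            else print "No $ssl_session_reused values found"}'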

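Check for dropped connections. A growing gap between accepts and handled means nginx is dropping connections it has already accepted, which the nginx documentation attributes to resource limits such as worker_connections being reached:

# Dropped connections (accepts - handled)
curl -s http://127.0.0.1/nginx_status | awk '/^[[:space:]]*[0-9]/ {print "gap=" ($1-$2); exit}'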

Check kernel-level drops that do not appear in nginx logs:

# Kernel listen queue overflows (-z prints the counter even when it is zero)
nstat -az 2>/dev/null | awk '/TcpExtListenOverflows/ {print $2}'

Verify which TLS versions clients are actually negotiating:

# TLS version distribution (requires $ssl_protocol in log_format)
tail -n 10000 /var/log/nginx/access.log | \
  awk '{for(i=1;i<=NF;i++) if($i ~ /^TLSv/) count[$i]++}
       END {for(v in count) print v": "count[v]}'

Check for handshake errors:

# Recent SSL handshake errors
tail -n 1000 /var/log/nginx/error.log | grep -c 'SSL_do_handshake() failed'

How to diagnose it

  1. Confirm CPU is SSL-bound, not upstream-bound. Compare $request_time and $upstream_response_time. If upstream is fast but total time is high while CPU is pinned, the delay is in nginx. Corroborate with a low SSL session hit rate.
  2. Check per-worker CPU distribution. If all workers are near 100%, the load is systemic. If one worker is hot while others are idle, suspect uneven connection distribution from reuseport or a single noisy client.
  3. Measure session resumption. Parse $ssl_session_reused from access logs. A sustained hit rate below 70% means most connections pay full handshake cost.
  4. Compare connection acceptance rate to request completion rate. If accepts grows rapidly but requests does not, workers spend time on connections that never generate requests because handshakes crowd out the event loop.
  5. Check the kernel listen queue. If TcpExtListenOverflows is increasing, workers are too busy with crypto to pull new connections from the backlog.
  6. Inspect error logs for SSL_do_handshake() failed or certificate errors that force retries and add overhead.
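
Step 4 compares how fast connections arrive with how fast requests complete. One rough way to put numbers on that is to take two stub_status samples and difference the counters. The sketch below assumes the standard stub_status text layout; stub_counters and conn_vs_req are hypothetical helper names, and the status URL is the same assumed endpoint used in the quick checks.

```shell
# Extract "accepts handled requests" from stub_status text on stdin.
# The counters are on the only line made of exactly three numbers.
stub_counters() {
  awk 'NF == 3 && $1 ~ /^[0-9]+$/ {print $1, $2, $3; exit}'
}

# Given two counter triples and the seconds between them, print new
# connections per second, requests per second, and requests per connection.
# args: accepts1 handled1 requests1 accepts2 handled2 requests2 interval
conn_vs_req() {
  awk -v a1="$1" -v r1="$3" -v a2="$4" -v r2="$6" -v dt="$7" 'BEGIN {
    da = a2 - a1; dr = r2 - r1
    printf "conn/s=%.1f req/s=%.1f", da / dt, dr / dt
    if (da > 0) printf " req/conn=%.2f", dr / da
    printf "\n"
  }'
}

# Usage:
#   s1=$(curl -s http://127.0.0.1/nginx_status | stub_counters); sleep 10
#   s2=$(curl -s http://127.0.0.1/nginx_status | stub_counters)
#   conn_vs_req $s1 $s2 10
```

A req/conn value close to or below 1 suggests that most connections carry at most one request, so nearly every request pays for a handshake.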

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Worker CPU utilization per process | Workers are single-threaded; 100% on one core blocks the event loop | Sustained >90% per worker |
| SSL session cache hit rate | Full handshakes are expensive; resumption is cheap | <70% sustained |
| Connection rate vs. request rate | Handshakes inflate connection count without yielding requests | Connection rate high, RPS flat or falling |
| $request_time minus $upstream_response_time | Isolates client-facing and nginx overhead from backend latency | Gap widening while upstream is stable |
| Dropped connections (accepts - handled) | Workers too saturated to accept new connections | Gap increasing over time |
| TcpExtListenOverflows | Kernel drops connections before nginx sees them | Counter increasing |
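
The $request_time minus $upstream_response_time signal can be estimated straight from the access log. This is a minimal sketch that assumes a hypothetical log_format containing the key=value pairs rt=$request_time and urt=$upstream_response_time; the helper name avg_nginx_overhead and those two keys are illustrative, not part of any standard format.

```shell
# Average (request_time - upstream_response_time) over log lines on stdin.
# Assumes fields of the form rt=<seconds> and urt=<seconds> in each line.
avg_nginx_overhead() {
  awk '{
    rt = ""; urt = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^rt=/)  rt  = substr($i, 4)
      if ($i ~ /^urt=/) urt = substr($i, 5)
    }
    # Skip lines missing either value. urt is "-" when no upstream was
    # used, and lines with several upstream attempts are skipped too.
    if (rt == "" || urt !~ /^[0-9.]+$/) next
    sum += rt - urt; n++
  }
  END {
    if (n) printf "avg_overhead=%.3fs over %d requests\n", sum / n, n
    else   print "no usable lines"
  }'
}

# Usage:
#   tail -n 10000 /var/log/nginx/access.log | avg_nginx_overhead
```

Running it periodically and watching the trend is usually more informative than any single value.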

Fixes

Increase shared session cache size and timeout

The default ssl_session_timeout is 5 minutes, often too short for production traffic. One megabyte of shared cache holds about 4000 sessions, so a rough size in megabytes is new_connections_per_second x session_timeout_seconds / 4000. Always use ssl_session_cache shared:SSL:<size> so all workers access the same pool. With the builtin cache type instead of shared, each worker maintains an isolated cache and resumption fails when a client reconnects to a different process.
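
As a worked example of the sizing rule of thumb, the sketch below rounds the estimate up to a whole megabyte. session_cache_mb is a hypothetical helper name, and the figure of roughly 4000 sessions per megabyte is the approximation used in the formula above.

```shell
# Estimate ssl_session_cache size in MB, rounded up.
# args: new_connections_per_second session_timeout_seconds
session_cache_mb() {
  echo $(( ($1 * $2 + 3999) / 4000 ))
}

# Example: 500 new connections per second with a 1-hour timeout
#   session_cache_mb 500 3600    # prints 450
# To compare against the running configuration:
#   nginx -T 2>/dev/null | grep -E 'ssl_session_(cache|timeout)'
```

Treat the result as a starting point and validate it against the measured resumption hit rate.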

Enable session tickets

Session tickets move session state to the client, reducing server memory pressure. If your security posture allows it, enabling tickets (ssl_session_tickets on, which is the nginx default) improves resumption for clients that do not use cache-based reuse. In a multi-node deployment, share the same ticket keys across all instances with ssl_session_ticket_key, or resumption breaks when a client lands on a different node.
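
Resumption can also be checked from the client side. openssl s_client with -reconnect performs an initial handshake and then reconnects using the same session, printing a line that starts with New or Reused for each connection. The sketch below only counts those lines; count_resumptions is a hypothetical helper name, example.com is a placeholder host, and -tls1_2 is used on the assumption that TLS 1.2 gives the most predictable -reconnect behavior across OpenSSL versions.

```shell
# Count full vs. resumed handshakes in openssl s_client output on stdin.
count_resumptions() {
  awk '/^New, /    {n++}
       /^Reused, / {r++}
       END {printf "new=%d reused=%d\n", n, r}'
}

# Usage:
#   openssl s_client -connect example.com:443 -tls1_2 -reconnect \
#     </dev/null 2>/dev/null | count_resumptions
```

If every connection is reported as new, resumption is not working for that client path, whether through the session cache or tickets.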

Enable TLS 1.3 and HTTP/2

TLS 1.3 completes a full handshake in one round trip (1-RTT) and supports PSK-based resumption, which skips certificate authentication and its private-key signature. HTTP/2 multiplexes requests over a single connection, directly reducing the number of handshakes required. Enable both unless legacy clients prevent it.

Maximize connection reuse

Increase keepalive_timeout and keepalive_requests to hold client connections open longer, assuming you have connection headroom. Monitor the Waiting connection count in stub_status; if Waiting connections crowd out new ones, you have tuned too aggressively. For upstream connections, verify that the keepalive directive is configured in upstream blocks, together with proxy_http_version 1.1 and an empty Connection header (proxy_set_header Connection "") where your nginx version requires them, so backend TLS handshakes are also reused.
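
To keep an eye on how much of the connection pool idle keepalive connections occupy, the Waiting count can be expressed as a share of Active connections. A minimal sketch, assuming the standard stub_status text layout; waiting_share is a hypothetical helper name.

```shell
# Print idle keepalive (Waiting) connections as a share of Active ones,
# reading stub_status text from stdin.
waiting_share() {
  awk '/^Active connections:/ {active = $3}
       /Waiting:/ {for (i = 1; i < NF; i++) if ($i == "Waiting:") waiting = $(i + 1)}
       END {
         if (active > 0)
           printf "active=%d waiting=%d share=%.1f%%\n", active, waiting, waiting / active * 100
         else
           print "no stub_status data"
       }'
}

# Usage:
#   curl -s http://127.0.0.1/nginx_status | waiting_share
```

Compare the active figure against worker_processes x worker_connections to judge how much headroom remains before raising keepalive_timeout further.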

Rate-limit new connections

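If the connection spike is caused by a flash crowd or attack, use limit_conn on $binary_remote_addr to cap concurrent connections per source. Note that the HTTP limit_conn module counts a connection only once its request header has been read, which happens after the TLS handshake, so it limits the follow-on work per source rather than the handshake itself. To shed handshake load before it reaches the event loop, also cap new connections per source at the firewall or the load balancer in front of nginx.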
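
Before choosing a per-source limit, it helps to see which client addresses account for most full handshakes. The sketch below is a rough approach that assumes $remote_addr is the first field of each log line and that $ssl_session_reused is logged as a standalone field; both are assumptions about your log_format, and top_new_session_clients is a hypothetical helper name.

```shell
# List client addresses by number of new (non-resumed) TLS sessions,
# reading access log lines from stdin. Optional arg: rows to show.
top_new_session_clients() {
  awk '{for (i = 2; i <= NF; i++) if ($i == ".") {c[$1]++; break}}
       END {for (ip in c) print c[ip], ip}' | sort -rn | head -n "${1:-10}"
}

# Usage:
#   tail -n 100000 /var/log/nginx/access.log | top_new_session_clients 20
```

If nginx sits behind a load balancer, the first field is likely the balancer's address, so the result is only meaningful when the real client address is what gets logged.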

Offload TLS termination

If nginx remains CPU-bound after tuning session reuse, move TLS termination to a dedicated edge load balancer or a separate TLS-terminating proxy tier. This is the last resort when the local CPU budget cannot support the required connection rate.

Prevention

  • Size the shared session cache for peak unique connections, not average load.
  • Monitor SSL session hit rate as a leading indicator of CPU pressure.
  • Log $ssl_protocol, $ssl_cipher, and $ssl_session_reused to track resumption efficiency and protocol distribution.
  • Keep worker_connections well above peak active connections so keepalive can be maintained without exhausting slots.
  • Plan CPU capacity for peak handshake rate, not just request throughput.

How Netdata helps

  • Netdata correlates per-worker CPU utilization with nginx connection and request rates to spot when handshake load is the bottleneck rather than application latency.
  • Alerts on nginx.connections_accepted minus nginx.connections_handled detect silent connection drops before users report timeouts.
  • Kernel-level charts for TcpExtListenOverflows reveal when the accept queue drops connections due to slow worker acceptance.
  • Access-log integration tracks $ssl_session_reused hit rate and TLS version distribution when the log format includes them.
  • Per-process CPU breakdown shows whether one worker is overloaded or all workers are saturated, distinguishing between reuseport imbalance and systemic SSL load.