NGINX SSL session cache: improving TLS resumption and cutting CPU

TLS handshakes are the most CPU-intensive routine operation in nginx. A worker terminating RSA-2048 TLS can handle only a few hundred full handshakes per second, compared with tens of thousands of plain HTTP requests. In production, every reconnecting client that repeats a full handshake wastes CPU and adds latency. The ssl_session_cache directive exists to eliminate that waste by allowing session resumption across connections. Yet many configurations omit it or under-size it, and operators often misunderstand how TLS 1.3 changes resumption behavior. This article explains the mechanism, sizing, and the signals that tell you whether your cache is working.

What it is and why it matters

Without session resumption, every new TLS connection performs a full handshake. That means asymmetric cryptography, certificate chain verification, and key exchange. For high-churn workloads, such as APIs with short-lived connections, mobile clients, or CDN origin pulls, this cost accumulates fast. The result is SSL termination overload: workers peg CPU, the event loop slows, and latency rises for all requests.

Session caching mitigates this by storing negotiated session parameters so that compatible clients can resume with an abbreviated handshake on their next visit. nginx supports two mechanisms: a shared-memory session ID cache (shared:NAME:SIZE) and session tickets (ssl_session_tickets). The shared cache is visible to all workers, while tickets are self-contained encrypted blobs issued to clients. Both reduce CPU, but they behave differently under TLS 1.2 and TLS 1.3, and in nginx 1.23.2 and later the shared zone also holds the ticket encryption keys, which are rotated automatically.

How it works

flowchart TD
    Client([Client]) -->|New connection| Check{Session ID or ticket presented?}
    Check -->|No| Full[Full TLS handshake]
    Check -->|Yes| Lookup{Cache lookup or ticket validation}
    Lookup -->|Valid| Resume[Resumed handshake]
    Lookup -->|Invalid| Full
    Full -->|Update shared cache| Cache[(ssl_session_cache shared:SSL)]
    Resume -->|Low CPU| Worker[nginx worker]
    Full -->|High CPU| Worker

The directive accepts off, none, builtin[:size], and shared:name:size. The production pattern is shared:SSL:SIZE alone. The builtin variant is a per-worker OpenSSL cache; it fragments memory across processes and prevents workers from resuming sessions negotiated by their peers, so avoid it in multi-worker deployments.

A shared cache of 1 MB holds approximately 4000 sessions, at roughly 262 bytes per entry. The default ssl_session_timeout is five minutes. To size the zone, use:

required_MB = new_tls_sessions_per_second * ssl_session_timeout_seconds / 4000

For example, 1000 new sessions per second with a ten-minute timeout needs roughly 150 MB. Undersizing causes eviction, which manifests as a dropping hit rate and rising CPU.
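The sizing rule above can be sketched as a small calculator. This is a minimal sketch, assuming the approximate figure of 4000 sessions per megabyte quoted earlier; the function name required_cache_mb and the optional headroom parameter are illustrative and not part of nginx.

```python
import math

# Approximate capacity quoted in the text: about 4000 sessions per megabyte.
SESSIONS_PER_MB = 4000


def required_cache_mb(new_sessions_per_second, timeout_seconds, headroom=0.0):
    """Estimate the shared zone size in whole megabytes, rounded up.

    headroom is a fraction (0.5 means 50% extra) and is an assumption of
    this sketch, not an nginx setting.
    """
    if new_sessions_per_second < 0 or timeout_seconds < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    # Sessions that can be live at once: arrival rate times lifetime.
    live_sessions = new_sessions_per_second * timeout_seconds
    raw_mb = live_sessions / SESSIONS_PER_MB
    return math.ceil(raw_mb * (1.0 + headroom))


if __name__ == "__main__":
    # 1000 new sessions per second with a ten-minute (600 s) timeout.
    print(required_cache_mb(1000, 600))  # 150
```

The result maps directly onto the SIZE part of shared:SSL:SIZE, for example 150m for the case above. Treat it as an estimate, since the per-session footprint varies with what each session stores.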

To measure effectiveness, add $ssl_session_reused to your log format. It returns "r" for resumed and "." for full handshakes. Calculate hit rate by dividing resumed sessions by total TLS connections over a stable window. Because a cold cache produces a temporary floor, measure at least one full ssl_session_timeout after startup or reload to get a baseline. A healthy production deployment should see a hit rate above 90%. Below 80% is worth investigating; below 50% means the cache is either cold, too small, or unreachable.
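The hit-rate calculation can be sketched as follows. The log layout is an assumption of this example: it expects each access log line to contain ssl_reused=$ssl_session_reused conn_reqs=$connection_requests, where the field labels ssl_reused and conn_reqs are hypothetical names chosen here. Because access logs record requests rather than connections, the sketch counts only the first request on each connection so that keep-alive traffic does not inflate the totals.

```python
import re

# Assumed (hypothetical) log_format fragment present on every line:
#   ssl_reused=$ssl_session_reused conn_reqs=$connection_requests
FIELDS = re.compile(r"ssl_reused=(\S+)\s+conn_reqs=(\d+)")


def session_hit_rate(lines):
    """Return resumed / total TLS connections, or None if none were seen."""
    resumed = 0
    full = 0
    for line in lines:
        match = FIELDS.search(line)
        if match is None:
            continue  # line does not follow the assumed format
        reused = match.group(1)
        conn_reqs = int(match.group(2))
        if conn_reqs != 1:
            continue  # count each connection once, on its first request
        if reused == "r":
            resumed += 1
        elif reused == ".":
            full += 1
        # Any other value (for example "-" on plain HTTP) is ignored.
    total = resumed + full
    return resumed / total if total else None


if __name__ == "__main__":
    sample = [
        "GET / 200 ssl_reused=r conn_reqs=1",
        "GET /a 200 ssl_reused=r conn_reqs=2",
        "GET / 200 ssl_reused=. conn_reqs=1",
    ]
    print(session_hit_rate(sample))  # 0.5
```

Feed it lines from a window that starts at least one ssl_session_timeout after startup, so the cold-cache period does not drag the figure down.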

Session tickets enable stateless resumption. The server encrypts session state into a ticket sent to the client, which presents it on reconnect. With ssl_session_tickets on and a shared cache configured, nginx 1.23.2 and later keeps automatically rotated ticket keys in the shared zone so all workers can issue and validate tickets consistently. If you disable tickets, TLS 1.2 clients must rely on the shared ID cache, which increases memory pressure. TLS 1.3 always delivers resumption state in a ticket message, but with ssl_session_tickets off OpenSSL issues stateful tickets that only reference an entry in the server-side cache, so TLS 1.3 resumption then depends on the shared cache being configured and large enough.

Where it shows up in production

Cold starts. After an nginx restart, the shared cache is empty. Every reconnecting client performs a full handshake until the cache warms. This produces a predictable CPU spike that decays over the first few thousand connections. If you see sustained high CPU after the cache should be warm, the zone is likely undersized or sessions are expiring too quickly.

Reload vs. restart. The shared cache survives nginx -s reload because the master process retains the zone. It is lost only on full process restart. If CPU spikes after a restart but not after a reload, the cache is behaving as expected.

TLS 1.3-only edge termination. If your nginx handles only TLS 1.3, sizing the cache for thousands of session IDs is largely wasted. The zone still needs enough space for ticket keys and metadata, but the dominant CPU win comes from ticket-based resumption, not from session ID lookup.

Vhost consolidation. Multiple server blocks can reference the same shared cache name (shared:SSL:10m). Subsequent declarations retrieve the existing zone rather than creating a new one, so every vhost that names the zone draws on a single pool of capacity. The declared sizes must agree: nginx rejects a configuration in which the same zone name appears with different sizes, reporting that the size conflicts with the already declared size. Audit your configuration with nginx -T | grep ssl_session_cache to see which vhosts share a zone and whether that one size is enough for their combined load.
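To review how shared zones are declared across server blocks, a small helper can scan configuration text (for example, saved nginx -T output) and group the declared sizes by zone name. This is a sketch under stated assumptions: it expects each ssl_session_cache directive to start on its own line, it compares sizes as text (so 10m and 10240k are reported as different), and the function names are illustrative.

```python
import re
from collections import defaultdict

# Matches an uncommented ssl_session_cache directive that starts a line.
DIRECTIVE = re.compile(r"^\s*ssl_session_cache\s+([^;#]*);", re.MULTILINE)
# Matches the shared:name:size argument inside the directive.
SHARED = re.compile(r"shared:([^:\s]+):(\S+)")


def shared_zone_sizes(config_text):
    """Map each shared zone name to the sorted list of sizes declared for it."""
    zones = defaultdict(set)
    for directive in DIRECTIVE.finditer(config_text):
        for name, size in SHARED.findall(directive.group(1)):
            zones[name].add(size.lower())
    return {name: sorted(sizes) for name, sizes in zones.items()}


def mismatched_zones(config_text):
    """Return only the zones declared with more than one distinct size."""
    return {
        name: sizes
        for name, sizes in shared_zone_sizes(config_text).items()
        if len(sizes) > 1
    }


if __name__ == "__main__":
    sample = """
    server {
        ssl_session_cache shared:SSL:10m;
    }
    server {
        ssl_session_cache shared:SSL:50m;
    }
    """
    print(mismatched_zones(sample))  # {'SSL': ['10m', '50m']}
```

Running it against configuration text before deployment highlights zone names whose declarations disagree, which is where sizing assumptions for individual vhosts tend to go wrong.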

Tradeoffs and when to use it

Shared cache sizing. A larger cache retains more sessions and improves hit rate, but consumes resident memory that is no longer available for connection buffers or the operating system page cache. Size for your peak new-session rate multiplied by your desired reuse window, then add headroom.

Ticket key synchronization. Without a shared cache, each worker generates independent ticket keys. A client that receives a ticket from one worker may fail resumption when reconnecting to another. This manifests as an unexpectedly low hit rate on multi-worker deployments even when tickets are enabled. The shared zone is required to synchronize keys across workers.

Session tickets vs. session IDs. Disabling session tickets forces TLS 1.2 clients to rely on the shared ID cache. This increases memory pressure and reduces resumption rates for mobile clients that rotate network interfaces. In TLS 1.3, disabling tickets does not remove ticket messages from the protocol; OpenSSL switches to stateful tickets backed by the server-side cache, so those sessions consume shared zone capacity as well. TLS 1.3 early data (0-RTT) is possible but introduces replay risk and is not enabled by default. If you disable tickets, ensure the shared cache is large enough to absorb the entire resumption workload.

Timeout selection. The default ssl_session_timeout of five minutes is conservative. Production deployments often use 10m, 1h, or 1d depending on client behavior. Longer timeouts reduce CPU but increase the memory footprint and widen the window for session replay. Adjust to match your security model and connection churn rate.

Expired session retention. nginx does not actively purge expired sessions from the shared cache. They remain until evicted by LRU churn. For deployments that rely solely on ID-based resumption, stale session data persists beyond the timeout. This wastes zone capacity but does not affect performance beyond reducing available space for valid sessions.

Avoid mixing cache types. Combining a per-worker cache with a shared zone is less efficient than using the shared zone alone.

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| SSL session cache hit rate ($ssl_session_reused) | Directly measures how many connections avoid a full handshake | Sustained below 80%; sudden drop after a config change |
| Worker CPU utilization per process | High CPU with high connection rate indicates handshake saturation | Per-worker CPU above 80% sustained with elevated new connection rate |
| New connection rate vs. request rate | Handshake storms show high connection rate with stagnant request throughput | Connection rate spikes but requests per second do not |
| accepts - handled gap | Dropped connections under load can follow CPU saturation if workers cannot keep up | Gap increasing for more than 60 seconds |
| TLS version distribution ($ssl_protocol) | TLS 1.3-only deployments benefit less from ID cache sizing; signal guides capacity decisions | Sudden shift to TLS 1.2 may increase cache pressure unexpectedly |
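The accepts - handled gap can be derived from the counters that the nginx stub_status module reports. The sketch below assumes stub_status is enabled and that its plain-text output has already been fetched; the function names are illustrative. Because the counters are cumulative since start, the useful signal is how much the gap grows between two samples.

```python
import re

# stub_status prints a header line followed by three cumulative counters.
COUNTERS = re.compile(
    r"server accepts handled requests\s+(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stub_status(text):
    """Extract the cumulative counters and the accepts - handled gap."""
    match = COUNTERS.search(text)
    if match is None:
        raise ValueError("text does not look like stub_status output")
    accepts, handled, requests = (int(group) for group in match.groups())
    return {
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "dropped": accepts - handled,
    }


def dropped_since(previous, current):
    """Connections dropped between two parsed samples."""
    return current["dropped"] - previous["dropped"]


if __name__ == "__main__":
    sample = (
        "Active connections: 291 \n"
        "server accepts handled requests\n"
        " 16630948 16630948 31070465 \n"
        "Reading: 6 Writing: 179 Waiting: 106 \n"
    )
    print(parse_stub_status(sample)["dropped"])  # 0
```

Sampling at a fixed interval and watching dropped_since stay above zero across consecutive samples corresponds to the warning sign in the table.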

How Netdata helps

  • Correlate per-worker CPU utilization with SSL session cache hit rate to distinguish handshake saturation from application-level CPU load.
  • Track the accepts - handled gap alongside connection rate to detect admission loss that precedes visible errors.
  • Monitor TLS version distribution shifts that change the effective value of your session cache sizing.
  • Alert on dropping SSL session cache hit rate before CPU saturation becomes visible.
  • Compare hit rate against time since process start to separate cold-start behavior from chronic undersizing.