NGINX proxy cache hit rate is low: measuring and improving it

A low NGINX proxy cache hit rate shows up as climbing upstream CPU or higher-than-expected origin traffic, with access logs showing MISS and BYPASS where you expect HIT. Every request the cache fails to serve reaches the backend, adding latency and load. The symptom is typically either a gradual slide, for example from 85% to 40% over a day, or a collapse to zero after a deployment or restart. The root cause is often configuration drift: a new header, a changed query parameter, or a keys_zone sized for last year’s traffic. To diagnose it, measure which cache status dominates, then trace that status back to the directive or upstream behavior that produces it.

What this means

NGINX proxy caching stores upstream responses on disk and tracks them in an in-memory index inside a shared memory zone configured via proxy_cache_path. The cache loader rebuilds this index after restart by scanning disk files, and the cache manager evicts entries when disk or zone limits are reached. When a request arrives, NGINX looks up the cache key in that index. If the entry exists and is still valid, NGINX serves it directly. The variable $upstream_cache_status records the outcome of that lookup for every request: HIT, MISS, BYPASS, EXPIRED, STALE, UPDATING, or REVALIDATED.

A healthy cache shows mostly HIT. EXPIRED and UPDATING are normal housekeeping. STALE is good only if you use stale serving to survive upstream outages. BYPASS means a configured condition (proxy_cache_bypass) made NGINX skip the cache lookup and go to the upstream. MISS means the item was not in the cache: it was never stored, was evicted, or could not be cached. A sustained MISS rate above your baseline means the cache is not protecting your upstream.
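The status is only visible in access logs if the log format writes it. The following is a minimal sketch of a format that emits a cache_status= field, which is the field the grep in Quick checks looks for. The format name cache_log and the extra timing fields are assumptions for illustration; adapt them to your existing log_format.

```nginx
# http context. The format name "cache_log" is a hypothetical example.
log_format cache_log '$remote_addr [$time_local] "$request" $status '
                     'cache_status=$upstream_cache_status '
                     'rt=$request_time urt=$upstream_response_time';

# Apply the format to the access log (path assumed; use your own).
access_log /var/log/nginx/access.log cache_log;
```

Requests that never go through the proxy cache leave the variable empty, which NGINX logs as a dash, so they do not distort a count of HIT, MISS, and BYPASS.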

flowchart TD
    A[Cache hit rate low] --> B{Recent restart?}
    B -->|Yes| C[Cold cache: loader warming]
    B -->|No| D{Many BYPASS?}
    D -->|Yes| E[Check bypass conditions]
    D -->|No| F{Same URI many MISS?}
    F -->|Yes| G[Check cache key and Vary]
    F -->|No| H{Zone or disk full?}
    H -->|Yes| I[Check keys_zone and max_size]
    H -->|No| J[Check upstream cache headers]

Common causes

Cause	What it looks like	First thing to check
Cold cache after restart	Hit rate near zero immediately after restart; recovers over minutes	NGINX uptime and cache loader progress
Cache key too specific	Same URI logged as MISS repeatedly; high MISS rate for identical assets	Cache key configuration for injected cookies, query strings, or headers
Vary header fragmentation	MISS for resources that should be identical; many cache files per URI	Upstream response Vary headers
Cache zone or disk exhaustion	Previously cached content starts MISSing; error log mentions allocation failures	Disk usage under proxy_cache_path and keys_zone size
Upstream non-cacheable headers	All responses MISS despite correct NGINX config; BYPASS is low	Upstream Cache-Control or Expires headers
Bypass misconfiguration	High BYPASS rate for traffic that should be cached	Configuration rules that skip cache lookup or prevent storage

Quick checks

# Cache status distribution in recent traffic.
# Adjust the pattern to match your log_format.
tail -n 10000 /var/log/nginx/access.log | grep -oP 'cache_status=\K[A-Z]+' | sort | uniq -c | sort -rn

# Error log for cache zone or vary header issues.
tail -1000 /var/log/nginx/error.log | grep -iE 'could not allocate|vary|cache'

# Disk usage of the cache directory.
# Replace the path with your proxy_cache_path value.
du -sh /var/cache/nginx/ 2>/dev/null

# Configured cache path and zone sizes.
nginx -T 2>/dev/null | grep -E 'proxy_cache_path|keys_zone'

# Cache bypass rules in config.
nginx -T 2>/dev/null | grep -iE 'bypass|no_cache'

# Upstream cache headers for a test URI.
curl -s -o /dev/null -D - http://localhost/test | grep -iE 'cache-control|vary'

# NGINX uptime to correlate with cold cache.
ps -o etime= -p "$(cat /var/run/nginx.pid)"

# Approximate cached file count on disk.
# This walks the full cache tree and can be I/O intensive on large caches.
find /var/cache/nginx/ -type f 2>/dev/null | wc -l

How to diagnose it

Establish the baseline and current hit rate. Parse $upstream_cache_status from access logs over a representative window: at least 10,000 requests or 15 minutes. Calculate the ratio of HIT to total cacheable requests. A sudden drop greater than 10% from baseline is abnormal.
Check for cold cache. Look at NGINX uptime. If the process restarted recently, the cache loader may still be rebuilding the in-memory index from disk files. During this window, valid cache files can exist on disk without an index entry, so requests for them may be logged as MISS and the hit rate stays depressed until the loader finishes.
Look at the status breakdown. If BYPASS dominates, audit your configuration for proxy_cache_bypass conditions that force the cache lookup to be skipped. If EXPIRED dominates, your TTLs may be too short or upstream Cache-Control is forcing frequent revalidation. If MISS dominates, proceed to key and header analysis, and also check proxy_no_cache conditions: they prevent storage, so they surface as repeated MISS rather than BYPASS.
Inspect the cache key. A cache key that is too specific is a common cause of low hit rate. If the key includes per-user variables such as cookies or request headers, identical resources generate separate cache entries. Keep the cache key as stable as possible. Avoid including variables that change per request, such as tracking parameters or session identifiers, unless every variant genuinely needs a separate cached copy.
Check upstream Vary headers. A Vary response header reduces hit rate because NGINX stores a separate variant for each distinct value of the request headers it names. Use curl or access logs to see what Vary values the upstream returns. Broad values like Cookie or User-Agent fragment the cache into near-zero hit rates. Work with the upstream application to emit narrower Vary values, or accept the lower hit rate and size the cache accordingly.
Verify zone and disk capacity. A 10m keys_zone holds approximately 80,000 keys. If the zone fills, NGINX evicts entries via LRU. Look for “could not allocate node” errors in the error log, which indicate zone exhaustion. Check disk usage against proxy_cache_path max_size. If the disk is full or inactive timeouts are aggressive, valid entries may be evicted prematurely. Remember that shared memory zone sizes cannot be changed via reload; they require a full restart.
Confirm upstream cacheability. By default, responses that carry Cache-Control: private, no-cache, or no-store, or that include a Set-Cookie header, are not cached. Verify that upstream responses carry appropriate headers and do not mark everything uncacheable. If the upstream is outside your control, define explicit validity times for specific status codes using proxy_cache_valid. Upstream caching headers take priority over proxy_cache_valid, so this only helps when the response is otherwise cacheable, unless you deliberately disable processing of those headers with proxy_ignore_headers.
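To trace a single URI through the steps above, it can help to expose the cache status in a response header temporarily. This is a sketch under assumptions: the zone name app_cache and the upstream name backend are placeholders, and X-Cache-Status is a common convention rather than a header NGINX sets on its own.

```nginx
# server context. "app_cache" and "backend" are placeholder names.
location / {
    proxy_cache app_cache;
    proxy_pass  http://backend;

    # Expose the lookup outcome on every response, including errors.
    # Remove after debugging if clients should not see it.
    add_header X-Cache-Status $upstream_cache_status always;
}
```

Requesting the same URI twice with the curl command from Quick checks would then typically show MISS followed by HIT for a cacheable response. A MISS that repeats points toward the cache key, Vary, or upstream headers rather than capacity.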

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Cache status distribution (HIT/MISS/BYPASS/STALE)	Shows whether the cache is effective or traffic is reaching upstream	HIT drops more than 10% from baseline
Cache disk usage vs max_size	Reveals if eviction is driven by disk pressure rather than TTL	Usage at 90%+ of max_size
keys_zone allocation errors	Indicates the in-memory index cannot track more entries	“could not allocate node” in error log
Upstream request rate	Inverse indicator: more upstream load when cache fails	Upstream RPS spikes while NGINX RPS is flat
STALE rate	High STALE can mask an upstream outage	STALE above baseline without planned upstream maintenance

Fixes

Cold cache and thundering herd

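If the cache is empty after a restart, enable cache locking (proxy_cache_lock) so that only one request per key goes to the upstream while the others wait for the cache to fill. Enable stale serving with the updating parameter (proxy_cache_use_stale updating) so that expired entries are served while a single request refreshes them. This limits refresh storms once entries exist, but it cannot help for keys that have never been cached.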
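A minimal sketch of these settings for a cached location follows. The timeout value is illustrative, and the background update line is an optional addition that the text above does not require.

```nginx
# http, server, or location context of a cached location.
proxy_cache_lock         on;   # one request per key populates the cache
proxy_cache_lock_timeout 5s;   # after this, waiting requests go upstream
                               # and their responses are not cached
proxy_cache_use_stale    updating;  # serve the expired entry while it is refreshed

# Optional (NGINX 1.11.10+): refresh expired entries in a background
# subrequest so the client that triggers the refresh is not delayed.
proxy_cache_background_update on;
```

The lock timeout is a trade-off: a short value protects client latency when the upstream is slow, while a long value gives the upstream more protection during repopulation.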

Cache key tuning

Remove high-cardinality variables from the cache key. If your application requires session awareness, handle it at the application layer rather than in the cache key. Adding per-request cookies or unique query parameters without a strong reason is the most common cause of a near-zero hit rate.
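As an illustration, the sketch below drops the query string from the key for one path prefix. It assumes a hypothetical /static/ prefix where query strings never change the response body, plus placeholder names app_cache and backend. The default key is $scheme$proxy_host$request_uri, which includes the query string.

```nginx
# server context. "app_cache" and "backend" are placeholder names.
location /static/ {
    proxy_cache app_cache;

    # Assumption: query strings under /static/ do not affect the body.
    # $uri excludes the query string, so ?utm_source=... variants
    # share one cache entry.
    proxy_cache_key "$scheme$proxy_host$uri";

    proxy_pass http://backend;
}
```

This is not safe where assets use cache-busting query strings such as a version parameter, because every version would then map to the same entry.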

Vary header fragmentation

If the upstream returns wide Vary headers, NGINX stores separate copies for each variant. This reduces hit rate. Reduce unnecessary Vary values at the upstream, or accept the lower hit rate and size the cache and zone accordingly.
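Where the upstream cannot be changed, one further option, not covered by the text above, is to tell NGINX to disregard Vary for a narrow set of locations. The sketch assumes a hypothetical /assets/ prefix whose responses are byte-identical for all clients; the zone and upstream names are placeholders.

```nginx
# server context. "app_cache" and "backend" are placeholder names.
location /assets/ {
    proxy_cache app_cache;
    proxy_pass  http://backend;

    # Assumption: every variant under /assets/ is identical.
    # NGINX stops keying variants on Vary; the header itself is
    # still passed through to the client.
    proxy_ignore_headers Vary;
}
```

Treat this with care: if the upstream varies on Accept-Encoding and returns compressed bodies, ignoring Vary can hand a compressed response to a client that did not ask for one.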

Zone and disk exhaustion

Increase the keys_zone size if the in-memory index is exhausted. A 10m keys_zone holds approximately 80,000 keys. If you cache more objects, raise this proportionally. Increase max_size if disk eviction is aggressive, and raise the inactive timeout if entries are being removed before they are requested again. Because shared memory zone sizes cannot be changed via reload, plan a restart to apply new zone sizes.
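A sizing sketch using the rule of thumb above follows. The path, zone name, and every size are illustrative assumptions, not recommendations. Note that inactive defaults to 10 minutes when it is not set.

```nginx
# http context. Path, zone name, and sizes are illustrative.
# 10m of keys_zone holds about 80,000 keys, so 50m holds about 400,000.
proxy_cache_path /var/cache/nginx
                 levels=1:2
                 keys_zone=app_cache:50m
                 max_size=20g
                 inactive=24h;
```

With these numbers the zone would fill before the disk if the average cached object is smaller than roughly 50 KB (20 GB divided by 400,000 entries), so compare your average object size against that ratio when choosing both limits.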

Upstream cacheability

Ensure the upstream application sets cache-friendly headers. If the upstream marks responses as private or no-store, NGINX will not cache them. When you cannot change the upstream, set validity times with proxy_cache_valid and, only where you are sure the content is safe to share between users, override the upstream headers with proxy_ignore_headers.
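A hedged sketch of overriding an uncooperative upstream for one location follows. The /api/catalog/ prefix, the validity times, and the names app_cache and backend are hypothetical. Set-Cookie is deliberately left out of the ignored headers so that responses that set cookies stay uncached.

```nginx
# server context. All names and times are placeholders.
location /api/catalog/ {
    proxy_cache app_cache;
    proxy_pass  http://backend;

    # Validity per status code, used when no upstream header decides it.
    proxy_cache_valid 200 301 10m;
    proxy_cache_valid 404     1m;

    # Upstream Cache-Control and Expires normally take priority over
    # proxy_cache_valid. Ignore them only for content known to be
    # identical for all users.
    proxy_ignore_headers Cache-Control Expires;
}
```

Scoping the override to a single location keeps the default, header-respecting behavior everywhere else.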

Masked upstream failures via STALE

If you rely on STALE serving during outages, monitor it closely. Serving stale content indefinitely can hide upstream failures for hours or days. A high STALE rate is a signal to investigate upstream health, not a reason to relax.

Prevention

Size your upstreams to survive cold starts. After restart, hit rate is near zero until the loader finishes. Upstreams must handle full traffic load without cache protection during that window.
Monitor keys_zone capacity. Zone exhaustion produces allocation errors and forces LRU eviction. Size the zone for at least 2x your expected peak unique entry count.
Enable cache locking to prevent multiple concurrent requests from hitting the upstream for the same uncached key.
Keep Vary headers narrow. Avoid varying on per-user headers unless necessary.
Track cache hit rate as a baseline metric. A gradual decline over days is easier to fix than a sudden collapse.
Review bypass rules during every configuration change. A misplaced conditional can turn a location that should cache into a permanent BYPASS.
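The last point can be made concrete. proxy_cache_bypass skips the lookup whenever any of its parameters is non-empty and not "0", so passing a raw cookie or header variable bypasses the cache for every request that carries it. The sketch below derives an explicit flag with map instead. The X-Refresh-Cache header, the variable name, and the zone and upstream names are hypothetical.

```nginx
# http context: turn one exact header value into a 0/1 flag.
map $http_x_refresh_cache $skip_cache {
    default 0;
    "1"     1;
}

server {
    listen 80;

    location / {
        proxy_cache        app_cache;       # placeholder zone name
        proxy_pass         http://backend;  # placeholder upstream
        proxy_cache_bypass $skip_cache;

        # Risky alternative: "proxy_cache_bypass $cookie_session;"
        # bypasses the cache for every request with that cookie.
    }
}
```

Any client can send such a header, so in practice you would likely restrict who may trigger the bypass, for example by source address.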

How Netdata helps

Correlates cache hit rate with upstream response time and error rate, so you can see whether a drop in HIT is causing backend overload.
Tracks cache status distribution over time, making cold-cache events and gradual erosion visible.
Alerts on sudden drops in HIT ratio or spikes in MISS rate.
Surfaces upstream load increases that coincide with cache failures, helping distinguish cache issues from genuine traffic growth.

