NGINX proxy cache hit rate is low: measuring and improving it
A low NGINX proxy cache hit rate shows up as climbing upstream CPU or more origin traffic than expected. Access logs show MISS and BYPASS where you expect HIT. When caching fails, every request reaches the backend, adding latency and load. The symptom is usually either a gradual slide, for example from 85% to 40% over a day, or a collapse to zero after a deployment or restart. The root cause is often configuration drift: a new header, a changed query parameter, or a keys_zone sized for last year’s traffic. Diagnose by measuring which cache status dominates, then trace that status back to the directive or upstream behavior that produces it.
What this means
NGINX proxy caching stores upstream responses on disk and tracks them in an in-memory index inside a shared memory zone configured via proxy_cache_path. The cache loader rebuilds this index after restart by scanning disk files, and the cache manager evicts entries when disk or zone limits are reached. When a request arrives, NGINX looks up the cache key in that index. If the entry exists and is still valid, NGINX serves it directly. The variable $upstream_cache_status records the outcome of that lookup for every request: HIT, MISS, BYPASS, EXPIRED, STALE, UPDATING, or REVALIDATED.
A healthy cache shows mostly HIT. EXPIRED and UPDATING are normal housekeeping. STALE is good only if you use stale serving to survive upstream outages. BYPASS means the cache lookup was deliberately skipped because a bypass condition matched. MISS means the item was not served from the cache: it was never stored, was evicted, or could not be cached. A sustained MISS rate above your baseline means the cache is not protecting your upstream.
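The quick checks in this guide grep for a `cache_status=` field, so `$upstream_cache_status` has to be written to the access log first. A minimal sketch of one way to do that, assuming a hypothetical format name `cache` and the default log path; the other fields are only an illustration and should follow your existing log_format:

```nginx
# Hypothetical log format name "cache"; place inside the http block.
log_format cache '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent '
                 'cache_status=$upstream_cache_status';

access_log /var/log/nginx/access.log cache;
```

Requests that never touch the cache log `-` for this variable, so they are easy to exclude from hit rate calculations.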
flowchart TD
A[Cache hit rate low] --> B{Recent restart?}
B -->|Yes| C[Cold cache: loader warming]
B -->|No| D{Many BYPASS?}
D -->|Yes| E[Check bypass conditions]
D -->|No| F{Same URI many MISS?}
F -->|Yes| G[Check cache key and Vary]
F -->|No| H{Zone or disk full?}
H -->|Yes| I[Check keys_zone and max_size]
H -->|No| J[Check upstream cache headers]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Cold cache after restart | Hit rate near zero immediately after restart; recovers over minutes | NGINX uptime and cache loader progress |
| Cache key too specific | Same URI logged as MISS repeatedly; high MISS rate for identical assets | Cache key configuration for injected cookies, query strings, or headers |
| Vary header fragmentation | MISS for resources that should be identical; many cache files per URI | Upstream response Vary headers |
| Cache zone or disk exhaustion | Previously cached content starts MISSing; error log mentions allocation failures | Disk usage under proxy_cache_path and keys_zone size |
| Upstream non-cacheable headers | All responses MISS despite correct NGINX config; BYPASS is low | Upstream Cache-Control or Expires headers |
| Bypass misconfiguration | High BYPASS rate for traffic that should be cached | Configuration rules that skip cache lookup or prevent storage |
Quick checks
# Cache status distribution in recent traffic.
# Adjust the pattern to match your log_format.
tail -n 10000 /var/log/nginx/access.log | grep -oP 'cache_status=\K[A-Z]+' | sort | uniq -c | sort -rn
# Error log for cache zone or vary header issues.
tail -n 1000 /var/log/nginx/error.log | grep -iE 'could not allocate|vary|cache'
# Disk usage of the cache directory.
# Replace the path with your proxy_cache_path value.
du -sh /var/cache/nginx/ 2>/dev/null
# Configured cache path and zone sizes.
nginx -T 2>/dev/null | grep -E 'proxy_cache_path|keys_zone'
# Cache bypass rules in config.
nginx -T 2>/dev/null | grep -iE 'bypass|no_cache'
# Upstream cache headers for a test URI.
curl -s -o /dev/null -D - http://localhost/test | grep -iE 'cache-control|expires|set-cookie|vary'
# NGINX uptime to correlate with cold cache.
ps -o etime= -p "$(cat /var/run/nginx.pid)"
# Approximate cached file count on disk.
# This walks the full cache tree and can be I/O intensive on large caches.
find /var/cache/nginx/ -type f 2>/dev/null | wc -l
How to diagnose it
Establish the baseline and current hit rate. Parse $upstream_cache_status from access logs over a representative window: at least 10,000 requests or 15 minutes. Calculate the ratio of HIT to total cacheable requests. A sudden drop greater than 10% from baseline is abnormal.
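As a rough sketch of this calculation, the function below counts each status and prints a HIT ratio. It assumes the same `cache_status=<STATUS>` log field used in the quick checks, and it treats BYPASS as non-cacheable, which is an assumption you may want to change. The function name `cache_hit_ratio` is illustrative.

```shell
# Print per-status counts and the HIT ratio for access-log lines on stdin.
# Assumes each cached request logs a field like: cache_status=HIT
cache_hit_ratio() {
  awk '
    match($0, /cache_status=[A-Z]+/) {
      # "cache_status=" is 13 characters; keep only the status name.
      status = substr($0, RSTART + 13, RLENGTH - 13)
      count[status]++
      # Assumption: BYPASS requests are not counted as cacheable.
      if (status != "BYPASS") cacheable++
    }
    END {
      for (s in count) printf "%s %d\n", s, count[s]
      if (cacheable > 0)
        printf "hit_ratio %.1f%%\n", 100 * count["HIT"] / cacheable
    }
  '
}
```

For example, `tail -n 10000 /var/log/nginx/access.log | cache_hit_ratio` covers the most recent 10,000 log lines. Run it on a known-good window first to record the baseline.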
Check for cold cache. Look at NGINX uptime. If the process restarted recently, the cache loader may still be rebuilding the in-memory index from disk files. During this window, valid cache files exist on disk but NGINX does not know about them yet, so requests for them are logged as MISS and the hit rate starts near zero until the loader finishes.
Look at the status breakdown. If BYPASS dominates, audit your configuration for conditions that force cache lookup to be skipped. If EXPIRED dominates, your TTLs may be too short or upstream Cache-Control is forcing frequent revalidation. If MISS dominates, proceed to key and header analysis.
Inspect the cache key. A cache key that is too specific is a common cause of low hit rate. If the key includes per-user variables such as cookies or request headers, identical resources generate separate cache entries. Keep the cache key as stable as possible. Avoid including variables that change per request, such as tracking parameters or session identifiers, unless every variant genuinely needs a separate cached copy.
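One way to spot an over-specific key is to list the paths that MISS most often once the query string is ignored. The sketch below assumes combined-style log lines where the request is the first double-quoted field, plus the `cache_status=` field from the quick checks; `miss_by_path` is a hypothetical helper name.

```shell
# Count MISS responses per path (query string stripped) from log lines on stdin.
# Assumes the request line is the first double-quoted field of each log line.
miss_by_path() {
  awk -F'"' '
    /cache_status=MISS/ {
      split($2, req, " ")     # req[1]=method, req[2]=URI, req[3]=protocol
      path = req[2]
      sub(/\?.*/, "", path)   # drop the query string
      miss[path]++
    }
    END { for (p in miss) printf "%d %s\n", miss[p], p }
  ' | sort -rn
}
```

A static asset that appears near the top of this list with many MISS entries is a hint that query parameters, cookies, or headers in the key are splitting it into separate entries.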
Check upstream Vary headers. The Vary header reduces hit rate because NGINX stores a separate version of the response for each distinct value of the listed request headers. Use curl or access logs to see what Vary values the upstream returns. Broad values like Cookie or User-Agent can fragment the cache until the hit rate approaches zero. Work with the upstream application to emit narrower Vary values, or accept the lower hit rate and size the cache accordingly.
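To see fragmentation on disk, you can count how many cache files share the same key. This sketch relies on NGINX writing a `KEY: ` line near the top of each cache file, which is an implementation detail rather than a documented interface, and it assumes GNU grep. The helper name `keys_with_variants` is hypothetical, and like the `find` check it reads the whole cache tree, so it can be I/O intensive.

```shell
# List the cache keys that have the most files on disk.
# $1: cache directory (the path given to proxy_cache_path)
keys_with_variants() {
  # -a: treat binary cache files as text, -r: recurse,
  # -h: omit file names, -m1: stop after the first match in each file.
  grep -arh -m1 '^KEY: ' "$1" 2>/dev/null | sort | uniq -c | sort -rn | head -n 20
}
```

Several files for one key usually means the upstream sent a Vary header and NGINX stored one variant per header value.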
Verify zone and disk capacity. A 10m keys_zone holds approximately 80,000 keys. If the zone fills, NGINX evicts entries via LRU. Look for “could not allocate node” errors in the error log, which indicate zone exhaustion. Check disk usage against proxy_cache_path max_size. If the disk is full or inactive timeouts are aggressive, valid entries may be evicted prematurely. Remember that shared memory zone sizes cannot be changed via reload; they require a full restart.
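The capacity arithmetic can be sketched as a small helper. It uses the rule of thumb above (about 8,000 keys per megabyte of keys_zone) and takes the zone size and entry count as arguments; `zone_fill_percent` is an illustrative name, and the result is only an estimate.

```shell
# Estimate how full a keys_zone is, as a whole-number percentage.
# $1: keys_zone size in megabytes, $2: number of cached entries
zone_fill_percent() {
  # Rule of thumb: roughly 8,000 keys fit in 1 MB of zone memory.
  echo $(( $2 * 100 / ($1 * 8000) ))
}
```

For a 10m zone you might run `zone_fill_percent 10 "$(find /var/cache/nginx/ -type f | wc -l)"`, using the file count on disk as a stand-in for the number of indexed entries.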
Confirm upstream cacheability. Responses that carry Cache-Control: private, no-cache, or no-store, or that set a cookie with Set-Cookie, are not cached by default. Verify that upstream responses carry appropriate headers and do not mark everything uncacheable. If the upstream is outside your control, define explicit cache validity timeouts in your configuration for specific status codes using proxy_cache_valid, but note that this only helps when the response is otherwise cacheable.
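A quick way to apply this check is to pipe response headers through a filter that flags the common blockers. The sketch below covers Cache-Control, Set-Cookie, and Vary only; it does not evaluate Expires dates or X-Accel-Expires, and the name `cache_blockers` is hypothetical.

```shell
# Read HTTP response headers on stdin and report headers that stop
# NGINX from storing the response, plus any Vary values.
cache_blockers() {
  tr -d '\r' | awk '
    { h = tolower($0) }
    h ~ /^cache-control:.*(private|no-store|no-cache)/ { print "blocked by " h }
    h ~ /^set-cookie:/                                 { print "blocked by set-cookie" }
    h ~ /^vary:[ \t]*\*/                               { print "blocked by vary: *" }
    h ~ /^vary:/ && h !~ /\*/ {
      v = h
      sub(/^vary:[ \t]*/, "", v)
      print "varies on " v
    }
  '
}
```

For example, `curl -s -o /dev/null -D - http://localhost/test | cache_blockers` checks the test URI from the quick checks. No output means none of these particular blockers were found.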
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Cache status distribution (HIT/MISS/BYPASS/STALE) | Shows whether the cache is effective or traffic is reaching upstream | HIT drops more than 10% from baseline |
| Cache disk usage vs max_size | Reveals if eviction is driven by disk pressure rather than TTL | Usage at 90%+ of max_size |
| keys_zone allocation errors | Indicates the in-memory index cannot track more entries | “could not allocate node” in error log |
| Upstream request rate | Inverse indicator: more upstream load when cache fails | Upstream RPS spikes while NGINX RPS is flat |
| STALE rate | High STALE can mask an upstream outage | STALE above baseline without planned upstream maintenance |
Fixes
Cold cache and thundering herd
If the cache is empty after a restart, enable cache locking so that only one request fills the cache for a given key while others wait. Enable stale serving with the updating parameter so that MISS storms do not overwhelm the upstream during repopulation.
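A minimal configuration sketch of these two settings, assuming a hypothetical zone named `my_cache` (it must match the keys_zone in your proxy_cache_path) and a hypothetical upstream named `backend`:

```nginx
location / {
    proxy_cache my_cache;          # hypothetical zone name
    proxy_pass  http://backend;    # hypothetical upstream

    # Let one request per key populate the cache; the others wait for it.
    proxy_cache_lock on;
    proxy_cache_lock_timeout 5s;

    # Serve the expired copy while a single request refreshes it.
    proxy_cache_use_stale updating;
}
```

Cache locking is what helps while the cache is truly empty, because there is no stale copy to serve yet. The `updating` parameter takes over once entries exist and begin to expire.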
Cache key tuning
Remove high-cardinality variables from the cache key. If your application requires session awareness, handle it at the application layer rather than in the cache key. Adding per-request cookies or unique query parameters without a strong reason is the most common cause of a near-zero hit rate.
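As an illustration, compare a key that includes a session cookie with the documented default key. The cookie name `sessionid` is a hypothetical example:

```nginx
# Fragmenting key (avoid): one cache entry per session cookie value.
# proxy_cache_key "$scheme$proxy_host$request_uri$cookie_sessionid";

# Stable key: this is the NGINX default, written out explicitly.
proxy_cache_key "$scheme$proxy_host$request_uri";
```

Because `$request_uri` includes the query string, tracking parameters still create separate entries with this key. Removing them requires normalizing the arguments before they reach the key.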
Vary header fragmentation
If the upstream returns wide Vary headers, NGINX stores separate copies for each variant. This reduces hit rate. Reduce unnecessary Vary values at the upstream, or accept the lower hit rate and size the cache and zone accordingly.
Zone and disk exhaustion
Increase the keys_zone size if the in-memory index is exhausted. A 10m keys_zone holds approximately 80,000 keys, so raise it proportionally if you cache more objects. If disk pressure is evicting entries, increase max_size; if entries are removed before they are requested again, raise the inactive timeout. Because shared memory zone sizes cannot be changed via reload, plan a restart to apply new zone sizes.
Upstream cacheability
Ensure the upstream application sets cache-friendly headers. If the upstream marks responses as private or no-store, NGINX will not cache them. When you cannot change the upstream, review your configuration to confirm that caching rules match the actual response characteristics.
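When the upstream sends no caching headers at all, explicit validity times give NGINX something to work with. A hedged sketch, again using the hypothetical names `my_cache` and `backend`, with illustrative durations:

```nginx
location / {
    proxy_cache my_cache;          # hypothetical zone name
    proxy_pass  http://backend;    # hypothetical upstream

    # Used only when the response carries no caching headers of its own;
    # Cache-Control and Expires from the upstream take priority.
    proxy_cache_valid 200 301 10m;
    proxy_cache_valid 404     1m;

    # Expose the cache status so it can be checked with curl.
    add_header X-Cache-Status $upstream_cache_status;
}
```

With the extra response header in place, repeated `curl -I` requests for the same URI should move from MISS to HIT if the response is cacheable.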
Masked upstream failures via STALE
If you rely on STALE serving during outages, monitor it closely. Serving stale content indefinitely can hide upstream failures for hours or days. A high STALE rate is a signal to investigate upstream health, not a reason to relax.
Prevention
- Size your upstreams to survive cold starts. After restart, hit rate is near zero until the loader finishes. Upstreams must handle full traffic load without cache protection during that window.
- Monitor keys_zone capacity. Zone exhaustion produces allocation errors and forces LRU eviction. Size the zone for at least 2x your expected peak unique entry count.
- Enable cache locking to prevent multiple concurrent requests from hitting the upstream for the same uncached key.
- Keep Vary headers narrow. Avoid varying on per-user headers unless necessary.
- Track cache hit rate as a baseline metric. A gradual decline over days is easier to fix than a sudden collapse.
- Review bypass rules during every configuration change. A misplaced conditional can turn a location that should cache into a permanent BYPASS.
How Netdata helps
- Correlates cache hit rate with upstream response time and error rate, so you can see whether a drop in HIT is causing backend overload.
- Tracks cache status distribution over time, making cold-cache events and gradual erosion visible.
- Alerts on sudden drops in HIT ratio or spikes in MISS rate.
- Surfaces upstream load increases that coincide with cache failures, helping distinguish cache issues from genuine traffic growth.