$ guides / mysql / mysql-buffer-pool-hit-ratio-collapse ▌

Operations Guides

MySQL InnoDB buffer pool hit ratio collapse: the cliff edge

Your OLTP queries were running in single-digit milliseconds. Now every query is taking seconds, the disk subsystem is saturated, Threads_running is climbing toward max_connections, and the buffer pool hit ratio, which sat at 99.9% for months, just fell through 95% and keeps dropping.

This is the InnoDB buffer pool cliff edge. When the working set exceeds the buffer pool, pages are evicted before they can be reused. Every miss becomes a physical disk read. The degradation is non-linear: 99.9% to 99% is a slow bleed, 99% to 95% is rapid, and below 95% disk saturation, uniform latency inflation, and connection exhaustion turn a capacity problem into an availability incident.

Because every query is affected simultaneously, operators often misdiagnose this as a lock storm, network partition, or runaway write workload. The real signature is uniform latency inflation across all query types, paired with exponential growth in Innodb_buffer_pool_reads and disk read I/O.

What this means

The buffer pool is a fixed-size LRU cache of 16 KB data and index pages, split into a young (hot) sublist and an old (cold) sublist. InnoDB uses midpoint insertion: pages read by queries start in the old sublist and are promoted to the young sublist only on repeated access within a short window. Full table scans populate the old sublist, protecting hot pages from being evicted by one-off large queries. This works while the working set fits in memory.

The hit ratio is computed as 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests). Both counters are cumulative since server startup. After months of uptime, the cumulative ratio can hide a recent collapse because the denominator is enormous. Compute deltas over a fixed interval, such as 60 seconds.

When the working set exceeds the pool, InnoDB thrashes: pages are read from disk, touched once, evicted to make room for newer pages, and re-read on the next access. Disk I/O utilization approaches 100%. Query latency rises from milliseconds to seconds. Because queries take longer, connections that would have been reused are now occupied. Threads_connected climbs toward max_connections, and eventually new connections are refused.

flowchart TD
    A[Working set exceeds buffer pool] --> B[LRU eviction churn]
    B --> C[Hit ratio collapses]
    C --> D[Disk read I/O saturates]
    D --> E[All query latency rises]
    E --> F[Connections held longer]
    F --> G[Threads_connected climbs]
    G --> H[New connections refused]

A critical nuance: Innodb_buffer_pool_pages_free near zero is normal. InnoDB maintains a minimal free list and recycles pages from the LRU tail. Zero free pages indicate pressure only when Innodb_buffer_pool_wait_free is simultaneously nonzero. That counter measures how many times a query had to wait because no clean page was available, which means dirty pages had to be flushed before the read could proceed.

Common causes

Cause	What it looks like	First thing to check
Runaway table scan	Hit ratio drops suddenly after a new query or report job; one large `SELECT` visible in processlist	`SHOW FULL PROCESSLIST` for queries against large tables; delta in `Handler_read_rnd_next`
Organic working set growth	Hit ratio trends downward over weeks; no single query to blame; data size increased steadily	`Innodb_buffer_pool_pages_data` versus total pages; total data directory growth
Cold buffer pool after restart	Hit ratio is terrible immediately after restart; improves gradually over time	`SHOW GLOBAL STATUS LIKE 'Uptime'`; `Innodb_buffer_pool_reads` trending down
Buffer pool undersized from install	Hit ratio never reached 99% even during low traffic; `Innodb_buffer_pool_wait_free` > 0 under normal load	Compare `innodb_buffer_pool_size` to data size and available RAM

Quick checks

-- Compute hit ratio from deltas (run twice, 60 seconds apart)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- Check for acute memory pressure
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';

-- Check pool fullness and dirty page ratio
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';

-- Scanning queries and long-running SELECTs
SHOW FULL PROCESSLIST;
SHOW GLOBAL STATUS LIKE 'Handler_read%';

# Check disk read saturation at the OS level
iostat -x 1 5

-- Confirm the server is not still warming up after restart
SHOW GLOBAL STATUS LIKE 'Uptime';

How to diagnose it

Confirm scope. Verify latency increased across all query types, not just one table or schema. If only one table is affected, suspect a metadata lock cascade or InnoDB row lock contention. Buffer pool collapse is global.
Compute the interval-based hit ratio. Take two snapshots of Innodb_buffer_pool_reads and Innodb_buffer_pool_read_requests 60 seconds apart. Calculate 1 - (delta_reads / delta_requests). If delta_requests is zero, the workload is idle during the window. For many OLTP workloads, below 99% is concerning; below 95% usually means you are over the cliff. Ignore the cumulative lifetime ratio if the server has been up for months.
Check for page wait pressure. Query Innodb_buffer_pool_wait_free. A sustained nonzero rate means queries are blocked while InnoDB flushes dirty pages to make room. This is more reliable than hit ratio alone for acute memory pressure.
Inspect active queries. Use SHOW FULL PROCESSLIST to look for long-running SELECT statements against large tables. Check for missing WHERE clauses, large OFFSET values, or unindexed columns. Correlate with a rising Handler_read_rnd_next delta to confirm scanning.
Correlate with disk metrics. If disk read throughput is at the hardware limit while write throughput and write I/O queue depth remain normal, the working set has exceeded memory. The disk is spending most of its time serving random page reads.
Rule out cold start. Check Uptime. If the server has been running for less than 1800 seconds, the buffer pool may still be warming. A low hit ratio shortly after restart is expected unless innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup are enabled and have completed.
Assess capacity headroom. Compare Innodb_buffer_pool_pages_data to Innodb_buffer_pool_pages_total. If data pages occupy nearly the entire pool and wait_free is rising, the pool is too small for the working set.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Buffer pool hit ratio (interval)	Fraction of reads that avoid disk	< 99% sustained for OLTP; < 95% is critical
`Innodb_buffer_pool_wait_free`	Queries blocked waiting for clean pages	Sustained nonzero rate
`Innodb_buffer_pool_reads` rate	Volume of physical disk reads	Exponential increase correlating with hit ratio drop
`Threads_running`	Active query concurrency	Sustained spike without throughput increase indicates queuing
Disk read I/O utilization	Hardware saturation	Read bandwidth near limit while write bandwidth normal
`Innodb_buffer_pool_pages_data` / total	Data fill of the pool	Near total combined with `wait_free` > 0 = overflow

Fixes

Kill the scanning query. If SHOW FULL PROCESSLIST reveals a specific query causing a large scan, terminate it with KILL <thread_id>. This stops the eviction pressure immediately; hot pages re-warm naturally over the next minutes. Tradeoff: the query fails and must be rescheduled, ideally off-peak.

Warning: KILL is destructive. The client receives an error. A DML query will roll back its current transaction; a read-only query simply aborts. Confirm the thread is not a replication applier or critical maintenance job before killing.

Increase innodb_buffer_pool_size. In MySQL 5.7.5+, this is dynamic. Set it at runtime:

SET GLOBAL innodb_buffer_pool_size = <bytes>;

The change applies in chunks; monitor progress with:

SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';

Size the pool to cover the working set, targeting 60-80% of available RAM on a dedicated database host. Leave headroom for the OS, per-connection buffers, and temporary tables. Tradeoff: less memory for the OS filesystem cache and per-connection buffers. Monitor for OOM risk, especially in containerized deployments where cgroup limits are stricter than physical RAM.

Restart with a larger pool (pre-5.7.5 or when dynamic resize is insufficient). If you are on an older version or need more headroom than a dynamic resize allows, schedule a restart with a larger innodb_buffer_pool_size in the configuration. Tradeoff: downtime and a cold cache. Enable innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup to shorten warmup.

Offload analytic scans. If the collapse was triggered by periodic reporting, ETL, or analytical queries, move that workload to a read replica or a separate data warehouse. Tradeoff: additional infrastructure cost and replication lag to monitor.

Prevention

Track the interval-based hit ratio as a trend, not a snapshot. A drop from 99.98% to 99.5% is an early warning that the working set is approaching the pool limit. Alert on Innodb_buffer_pool_wait_free with a sustained threshold; it is a more reliable acute signal than hit ratio alone because it directly measures blocking behavior.

Size the buffer pool to 60-80% of available RAM and maintain headroom. Track Innodb_buffer_pool_pages_data growth versus total pages. If your working set is growing by 5 GB per month and you have 10 GB of headroom, you have roughly two months before the cliff.

Enable innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup so restarts do not mask capacity problems with cold-cache blindness. After any restart, ignore hit ratio alerts for the first 30 minutes.

How Netdata helps

Computes the buffer pool hit ratio from delta intervals rather than cumulative lifetime values, so a recent collapse is not hidden by months of good history.
Surfaces Innodb_buffer_pool_wait_free on the same dashboard as the hit ratio, making it easy to distinguish “pool full but healthy” from “pool full and blocking queries.”
Correlates disk I/O metrics with MySQL metrics in real time, so you can see read saturation and hit ratio collapse on one chart.
Alerts on the composite pattern of hit ratio drop combined with rising Threads_running and disk read saturation, reducing false positives from cold starts or legitimate batch jobs.
Tracks Innodb_buffer_pool_pages_data growth over weeks, giving a runway estimate before the working set exceeds RAM.

The Netdata solution

MySQL monitoring with Netdata

Netdata monitors MySQL and MariaDB with per-second metrics and ML anomaly detection. Track connection usage, query throughput, slow queries, redo-log pressure, and replication lag alongside the host and storage signals that explain them.

See MySQL monitoring → Start monitoring free

MySQL InnoDB buffer pool hit ratio collapse: the cliff edge

MySQL InnoDB buffer pool hit ratio collapse: the cliff edge

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Prevention

How Netdata helps

Related guides

MySQL monitoring with Netdata