MySQL InnoDB redo log checkpoint stall: when all writes freeze
InnoDB redo log checkpoint stalls freeze all writes synchronously when checkpoint age reaches the end of the circular redo log. INSERTs, UPDATEs, and DELETEs that complete in milliseconds hang for seconds. Threads_running climbs while the Questions rate collapses. Read queries may still return initially, but soon every connection waits for a write lock or commit acknowledgment. Then throughput recovers just as abruptly. This is not a disk failure, deadlock, or runaway query. It is MySQL’s write path running out of runway.
What this means
InnoDB buffers changes in the buffer pool and records them sequentially in the redo log, a fixed-size circular write-ahead log. A checkpoint flushes dirty pages to data files, allowing the redo log space before that point to be reused. Checkpoint age is the distance between the current log sequence number (LSN) and the last checkpoint LSN.
In MySQL 8.0.30 and later, compute checkpoint age from SHOW GLOBAL STATUS:
Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn
In earlier versions, parse SHOW ENGINE INNODB STATUS and subtract Last checkpoint at from Log sequence number in the LOG section. Total capacity is innodb_redo_log_capacity in 8.0.30+, or innodb_log_file_size multiplied by innodb_log_files_in_group in older versions.
As write rate exceeds flush rate, checkpoint age grows. Below 75% of capacity, adaptive flushing keeps pace. Between 75% and roughly 85%, flushing turns aggressive and consumes I/O bandwidth. Above 90%, InnoDB triggers a sharp checkpoint. All transactions generating redo block synchronously until the corresponding dirty pages are flushed. Throughput collapses, not degrades. The stall persists until the checkpoint advances, which may take seconds to minutes depending on dirty page volume and storage speed.
flowchart TD
A[Heavy write workload] --> B[Dirty pages accumulate]
B --> C[Checkpoint age grows]
C --> D{Checkpoint age > 75% capacity}
D -->|No| E[Adaptive flushing]
D -->|Yes| F[Aggressive flushing]
F --> G{Checkpoint age > 90% capacity}
G -->|No| H[Writes continue with heavy I/O]
G -->|Yes| I[Sharp synchronous checkpoint]
I --> J[All writes stall]
J --> K[Queries pile up]
K --> L[Threads_running spikes]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Redo log capacity too small for sustained write rate | Checkpoint age trends upward daily and peaks above 75% during normal load | SHOW GLOBAL VARIABLES LIKE 'innodb_redo_log_capacity' (or innodb_log_file_size and innodb_log_files_in_group) versus peak write rate |
| Storage I/O cannot sustain page cleaner flush rate | Dirty page ratio stays flat or rises during write spikes; disk latency is high but queue depth is low | SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty' ratio and OS disk metrics |
| Burst write pattern (bulk load, large UPDATE) | Checkpoint age spikes sharply during batch jobs or schema migrations | SHOW PROCESSLIST for large write queries; SHOW ENGINE INNODB STATUS for LSN growth |
| Insufficient page cleaner threads | Single-threaded flushing on a multi-core host; flush rate does not scale with dirty pages | SHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners' versus buffer pool instances |
Quick checks
# Checkpoint age (MySQL 8.0.30+)
mysql -e "SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_current_lsn') - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_checkpoint_lsn') AS checkpoint_age;"
# Redo log capacity (MySQL 8.0.30+)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_redo_log_capacity';"
# Checkpoint age and capacity (all versions via SHOW ENGINE INNODB STATUS)
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -E "Log sequence number|Last checkpoint at"
# Dirty page ratio
mysql -e "SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty') / (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_total') * 100 AS dirty_pct;"
# Throughput collapse confirmation (two samples, 10s apart)
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
sleep 10
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
# Pending I/O (check if disk is saturated or flushing is starved)
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_pending_fsyncs', 'Innodb_data_pending_writes', 'Innodb_os_log_pending_fsyncs');"
How to diagnose it
Confirm the stall is write-specific. During a checkpoint stall, read queries initially continue while write transactions block at commit. Check
SHOW PROCESSLIST. If write queries hang while simple SELECTs still return, the durability path is blocked rather than a metadata lock or row lock cascade.Compute checkpoint age as a ratio of total capacity. A ratio above 90% confirms the stall. Between 75% and 90%, the instance is in the danger zone.
Check
Innodb_log_waits. If this counter is nonzero, the log buffer may be undersized, but this is a separate mechanism from checkpoint stall. Do not mistake log buffer pressure for redo log capacity exhaustion.Verify that dirty pages are not being flushed fast enough. If the dirty page ratio is high and disk I/O pending counters are low, the storage subsystem may be under-provisioned for the configured
innodb_io_capacity_max.Find the trigger. A single large transaction, an unbatched bulk load, or a schema migration can push checkpoint age over the edge during an otherwise stable workload. Identify the largest writer in
information_schema.INNODB_TRXordered bytrx_rows_modified.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Checkpoint age / redo capacity | Direct measure of runway until sharp checkpoint | Sustained above 75%; any excursion above 90% |
| Questions or commit rate | Confirms user-facing impact | Sudden drop while connections remain stable |
| Threads_running | Shows query pile-up during stall | Spikes while Questions rate drops |
| Innodb_buffer_pool_pages_dirty ratio | Indicates flush pressure | Rising toward innodb_max_dirty_pages_pct |
| Innodb_log_waits | Log buffer pressure, distinct from checkpoint stall | Any nonzero sustained rate |
| Innodb_data_pending_fsyncs / pending_writes | Storage saturation signal | Sustained nonzero values |
Fixes
Immediate relief
If storage has headroom, raise innodb_io_capacity_max dynamically to let page cleaners flush faster. This shortens the stall but does not fix the underlying capacity mismatch.
SET GLOBAL innodb_io_capacity_max = <value_matching_storage_iops>;
Warning: This increases disk I/O pressure. Verify storage latency and throughput headroom before raising.
If a specific large write transaction is the trigger and it is safe to interrupt, kill it. Stopping redo generation lets the checkpoint advance.
KILL <trx_mysql_thread_id>;
Warning: Killing a transaction triggers rollback, which is expensive and generates additional redo. Use only when the transaction is known to be non-critical and the rollback cost is acceptable.
Short-term fix
Increase redo log capacity. In MySQL 8.0.30 and later, this is dynamic and takes effect without restart.
SET GLOBAL innodb_redo_log_capacity = <new_size>;
Larger redo logs extend crash recovery time. Size for peak workload, not infinity.
Pre-8.0.30, changing innodb_log_file_size or innodb_log_files_in_group requires a clean shutdown, configuration change, and restart. Plan this during a maintenance window.
Storage and configuration tuning
Ensure innodb_io_capacity and innodb_io_capacity_max reflect your storage hardware rather than defaults. If dirty pages accumulate steadily during peak hours, increase innodb_page_cleaners up to the number of buffer pool instances (capped by available CPU cores).
Prevention
Size redo log capacity so that peak checkpoint age stays below 50% of total capacity. This leaves headroom for bursts. Track checkpoint age as a ratio, not an absolute LSN value. Monitor the trend over days, not just spot values. Schedule large bulk loads or schema migrations during low-traffic windows, or break them into smaller transactions. Revisit innodb_io_capacity_max after any storage upgrade or downgrade.
How Netdata helps
- Correlates
Innodb_redo_log_current_lsnandInnodb_redo_log_checkpoint_lsnto chart checkpoint age in real time on MySQL 8.0.30+. - Surfaces commit rate and
Questionsrate on the same dashboard to reveal throughput collapse. - Alerts on checkpoint age ratio crossing 75% and 90% thresholds, with delay guards to suppress noise during restart.
- Distinguishes checkpoint stalls from metadata lock cascades by overlaying
Threads_runningwith row-lock wait and metadata-lock wait indicators. - Tracks
Innodb_buffer_pool_pages_dirtyratio alongside checkpoint age to reveal whether the root cause is capacity undersizing or flush undersizing.
Related guides
- How MySQL actually works in production: a mental model for operators
- MySQL Aborted_connects and Aborted_clients climbing: diagnosis
- MySQL connection exhaustion: detection, diagnosis, and prevention
- MySQL innodb_deadlock_detect=OFF: when deadlock detection becomes the bottleneck
- MySQL ERROR 1213: Deadlock found when trying to get lock; try restarting transaction
- MySQL gap locks and next-key locks: surprising deadlocks under REPEATABLE READ
- MySQL ERROR 1205: Lock wait timeout exceeded; try restarting transaction
- MySQL long-running transactions: detecting and killing the silent blocker
- MySQL: got a packet bigger than ‘max_allowed_packet’ bytes - causes and fixes
- MySQL metadata lock cascade: how one ALTER TABLE freezes a whole table
- MySQL monitoring checklist: the signals every production instance needs
- MySQL monitoring maturity model: from survival to expert







