MySQL InnoDB redo log checkpoint stall: when all writes freeze

InnoDB redo log checkpoint stalls freeze all writes synchronously when checkpoint age reaches the end of the circular redo log. INSERTs, UPDATEs, and DELETEs that complete in milliseconds hang for seconds. Threads_running climbs while the Questions rate collapses. Read queries may still return initially, but soon every connection waits for a write lock or commit acknowledgment. Then throughput recovers just as abruptly. This is not a disk failure, deadlock, or runaway query. It is MySQL’s write path running out of runway.

What this means

InnoDB buffers changes in the buffer pool and records them sequentially in the redo log, a fixed-size circular write-ahead log. A checkpoint flushes dirty pages to data files, allowing the redo log space before that point to be reused. Checkpoint age is the distance between the current log sequence number (LSN) and the last checkpoint LSN.

In MySQL 8.0.30 and later, compute checkpoint age from SHOW GLOBAL STATUS:

Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn

In earlier versions, parse SHOW ENGINE INNODB STATUS and subtract Last checkpoint at from Log sequence number in the LOG section. Total capacity is innodb_redo_log_capacity in 8.0.30+, or innodb_log_file_size multiplied by innodb_log_files_in_group in older versions.

As write rate exceeds flush rate, checkpoint age grows. Below 75% of capacity, adaptive flushing keeps pace. Between 75% and roughly 85%, flushing turns aggressive and consumes I/O bandwidth. Above 90%, InnoDB triggers a sharp checkpoint. All transactions generating redo block synchronously until the corresponding dirty pages are flushed. Throughput collapses, not degrades. The stall persists until the checkpoint advances, which may take seconds to minutes depending on dirty page volume and storage speed.

flowchart TD
    A[Heavy write workload] --> B[Dirty pages accumulate]
    B --> C[Checkpoint age grows]
    C --> D{Checkpoint age > 75% capacity}
    D -->|No| E[Adaptive flushing]
    D -->|Yes| F[Aggressive flushing]
    F --> G{Checkpoint age > 90% capacity}
    G -->|No| H[Writes continue with heavy I/O]
    G -->|Yes| I[Sharp synchronous checkpoint]
    I --> J[All writes stall]
    J --> K[Queries pile up]
    K --> L[Threads_running spikes]

Common causes

CauseWhat it looks likeFirst thing to check
Redo log capacity too small for sustained write rateCheckpoint age trends upward daily and peaks above 75% during normal loadSHOW GLOBAL VARIABLES LIKE 'innodb_redo_log_capacity' (or innodb_log_file_size and innodb_log_files_in_group) versus peak write rate
Storage I/O cannot sustain page cleaner flush rateDirty page ratio stays flat or rises during write spikes; disk latency is high but queue depth is lowSHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty' ratio and OS disk metrics
Burst write pattern (bulk load, large UPDATE)Checkpoint age spikes sharply during batch jobs or schema migrationsSHOW PROCESSLIST for large write queries; SHOW ENGINE INNODB STATUS for LSN growth
Insufficient page cleaner threadsSingle-threaded flushing on a multi-core host; flush rate does not scale with dirty pagesSHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners' versus buffer pool instances

Quick checks

# Checkpoint age (MySQL 8.0.30+)
mysql -e "SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_current_lsn') - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_redo_log_checkpoint_lsn') AS checkpoint_age;"
# Redo log capacity (MySQL 8.0.30+)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_redo_log_capacity';"
# Checkpoint age and capacity (all versions via SHOW ENGINE INNODB STATUS)
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -E "Log sequence number|Last checkpoint at"
# Dirty page ratio
mysql -e "SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty') / (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_total') * 100 AS dirty_pct;"
# Throughput collapse confirmation (two samples, 10s apart)
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
sleep 10
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
# Pending I/O (check if disk is saturated or flushing is starved)
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_pending_fsyncs', 'Innodb_data_pending_writes', 'Innodb_os_log_pending_fsyncs');"

How to diagnose it

  1. Confirm the stall is write-specific. During a checkpoint stall, read queries initially continue while write transactions block at commit. Check SHOW PROCESSLIST. If write queries hang while simple SELECTs still return, the durability path is blocked rather than a metadata lock or row lock cascade.

  2. Compute checkpoint age as a ratio of total capacity. A ratio above 90% confirms the stall. Between 75% and 90%, the instance is in the danger zone.

  3. Check Innodb_log_waits. If this counter is nonzero, the log buffer may be undersized, but this is a separate mechanism from checkpoint stall. Do not mistake log buffer pressure for redo log capacity exhaustion.

  4. Verify that dirty pages are not being flushed fast enough. If the dirty page ratio is high and disk I/O pending counters are low, the storage subsystem may be under-provisioned for the configured innodb_io_capacity_max.

  5. Find the trigger. A single large transaction, an unbatched bulk load, or a schema migration can push checkpoint age over the edge during an otherwise stable workload. Identify the largest writer in information_schema.INNODB_TRX ordered by trx_rows_modified.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Checkpoint age / redo capacityDirect measure of runway until sharp checkpointSustained above 75%; any excursion above 90%
Questions or commit rateConfirms user-facing impactSudden drop while connections remain stable
Threads_runningShows query pile-up during stallSpikes while Questions rate drops
Innodb_buffer_pool_pages_dirty ratioIndicates flush pressureRising toward innodb_max_dirty_pages_pct
Innodb_log_waitsLog buffer pressure, distinct from checkpoint stallAny nonzero sustained rate
Innodb_data_pending_fsyncs / pending_writesStorage saturation signalSustained nonzero values

Fixes

Immediate relief

If storage has headroom, raise innodb_io_capacity_max dynamically to let page cleaners flush faster. This shortens the stall but does not fix the underlying capacity mismatch.

SET GLOBAL innodb_io_capacity_max = <value_matching_storage_iops>;

Warning: This increases disk I/O pressure. Verify storage latency and throughput headroom before raising.

If a specific large write transaction is the trigger and it is safe to interrupt, kill it. Stopping redo generation lets the checkpoint advance.

KILL <trx_mysql_thread_id>;

Warning: Killing a transaction triggers rollback, which is expensive and generates additional redo. Use only when the transaction is known to be non-critical and the rollback cost is acceptable.

Short-term fix

Increase redo log capacity. In MySQL 8.0.30 and later, this is dynamic and takes effect without restart.

SET GLOBAL innodb_redo_log_capacity = <new_size>;

Larger redo logs extend crash recovery time. Size for peak workload, not infinity.

Pre-8.0.30, changing innodb_log_file_size or innodb_log_files_in_group requires a clean shutdown, configuration change, and restart. Plan this during a maintenance window.

Storage and configuration tuning

Ensure innodb_io_capacity and innodb_io_capacity_max reflect your storage hardware rather than defaults. If dirty pages accumulate steadily during peak hours, increase innodb_page_cleaners up to the number of buffer pool instances (capped by available CPU cores).

Prevention

Size redo log capacity so that peak checkpoint age stays below 50% of total capacity. This leaves headroom for bursts. Track checkpoint age as a ratio, not an absolute LSN value. Monitor the trend over days, not just spot values. Schedule large bulk loads or schema migrations during low-traffic windows, or break them into smaller transactions. Revisit innodb_io_capacity_max after any storage upgrade or downgrade.

How Netdata helps

  • Correlates Innodb_redo_log_current_lsn and Innodb_redo_log_checkpoint_lsn to chart checkpoint age in real time on MySQL 8.0.30+.
  • Surfaces commit rate and Questions rate on the same dashboard to reveal throughput collapse.
  • Alerts on checkpoint age ratio crossing 75% and 90% thresholds, with delay guards to suppress noise during restart.
  • Distinguishes checkpoint stalls from metadata lock cascades by overlaying Threads_running with row-lock wait and metadata-lock wait indicators.
  • Tracks Innodb_buffer_pool_pages_dirty ratio alongside checkpoint age to reveal whether the root cause is capacity undersizing or flush undersizing.