MySQL slow commits with idle CPU: redo and binlog fsync pressure

Slow commits or transaction timeouts with idle CPU usually mean the bottleneck is in the durability path, not the query engine.

With innodb_flush_log_at_trx_commit=1 and sync_binlog=1, every COMMIT triggers an InnoDB redo log fsync() and a binary log fsync() before returning to the client. These are sequential, latency-bound operations. When storage cannot drain fsync requests as fast as the server generates them, commits queue up while the CPU waits.

This is not a checkpoint stall. A checkpoint stall happens when the redo log fills and InnoDB must flush dirty pages synchronously to reclaim log capacity. Fsync pressure happens when the log flusher thread cannot complete fsync() calls fast enough to keep up with the commit rate, even though redo log capacity is abundant. Check Innodb_os_log_pending_fsyncs to tell the difference.

What this means

When a client issues COMMIT, InnoDB writes redo entries to the log buffer. A dedicated log flusher thread issues fsync() to make them durable. If binary logging is enabled and sync_binlog >= 1, the server also fsyncs the binary log in the ordered commit pipeline. User threads wait until their log sequence number is flushed.

If storage latency is elevated, the flusher thread falls behind. Pending fsyncs accumulate in OS and device queues. Committing transactions block on the durability barrier. Because the wait is I/O-bound, CPU stays low while query latency spikes.

The critical signals are Innodb_os_log_pending_fsyncs and Innodb_os_log_pending_writes. In a healthy system these are zero. Sustained values greater than zero mean storage cannot keep up with the commit rate.

flowchart TD
    A[Client COMMIT] --> B[Redo log buffer write]
    B --> C[Redo log fsync]
    C --> D[Binlog fsync]
    D --> E[Commit returns]
    C -.->|Slow storage| F[Pending redo fsyncs]
    D -.->|Slow storage| G[Pending binlog fsyncs]
    F --> H[Innodb_os_log_pending_fsyncs > 0]
    G --> I[Commit latency spikes]
    H --> I

Common causes

CauseWhat it looks likeFirst thing to check
Storage latency spikeInnodb_os_log_pending_fsyncs > 0 sustained; OS disk await elevated; no query plan regressionsiostat -x on the redo and binlog devices
Durability settings too strict for storage tierEvery commit triggers two fsyncs; commit rate exceeds storage IOPS ceilinginnodb_flush_log_at_trx_commit and sync_binlog
Redo log and binlog sharing the same deviceBoth sequential fsync streams compete for the same disk queue; latency doubles under loadFilesystem paths for redo log and binlog
Shared storage contentionPeriodic latency spikes that correlate with neighbor activity; no single query culpritHypervisor or cloud volume metrics
Network-attached storage latencyFsync latency measured in tens to hundreds of milliseconds; affects all durable writesNFS or SAN latency metrics

Quick checks

-- Pending redo fsyncs and writes
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_pending%';
-- Aggregate pending fsyncs across all tablespaces
SHOW GLOBAL STATUS LIKE 'Innodb_data_pending_fsyncs';
-- Durability configuration
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit', 'sync_binlog', 'innodb_flush_method');
-- InnoDB FILE I/O for pending log flushes
SHOW ENGINE INNODB STATUS\G
-- Look for: Pending flushes (fsync) log: N
-- Checkpoint age to rule out capacity stall
SHOW ENGINE INNODB STATUS\G
-- Compare Log sequence number to Last checkpoint at
# OS-level disk latency on the transaction log device
iostat -x 1

How to diagnose it

  1. Confirm the pattern. Check SHOW GLOBAL STATUS for Threads_running and Threads_connected. If Threads_running is well below the CPU core count, the buffer pool hit ratio is healthy, and queries are still slow, suspect the durability path rather than query complexity or CPU saturation.
  2. Check pending fsyncs. Run SHOW GLOBAL STATUS LIKE 'Innodb_os_log_pending%';. Sustained nonzero values for Innodb_os_log_pending_fsyncs or Innodb_os_log_pending_writes indicate the redo log flusher thread is behind and fsyncs are stacking up in the storage queue.
  3. Read InnoDB status. Run SHOW ENGINE INNODB STATUS; and look at the FILE I/O section. Pending flushes (fsync) log: N with N > 0 confirms the storage layer is not draining fsync requests fast enough.
  4. Rule out checkpoint stall. Calculate checkpoint age from the LOG section (Log sequence number minus Last checkpoint at) and compare it to total redo log capacity. In MySQL 8.0.30+, check Innodb_redo_log_capacity_resized. If checkpoint age is well below 75% of capacity but fsyncs are pending, the issue is fsync speed, not log capacity.
  5. Quantify the fsync load. Check innodb_flush_log_at_trx_commit and sync_binlog. If both are 1, every commit generates two fsyncs. Multiply your commit rate by two and compare that to the sustained IOPS and latency profile of your storage. If the product exceeds what the device can deliver at low latency, you have a mathematical mismatch.
  6. Inspect disk latency. Run iostat -x 1 on the devices hosting the redo logs and binary logs. Elevated await with low throughput means the device is struggling with synchronous operations, not bandwidth. Watch aqu-sz (average queue size); values above the device queue-depth limit indicate saturation.
  7. Check for large transactions. Large transactions increase the amount of data flushed per fsync and hold resources in the commit pipeline longer. Review Binlog_cache_disk_use and information_schema.INNODB_TRX for abnormally large writers or long-running transactions.
  8. Correlate with application metrics. If end-to-end commit latency tracks storage fsync latency one-for-one, the diagnosis is confirmed.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Innodb_os_log_pending_fsyncsDirect gauge of redo log flusher thread backlogSustained value > 0
Innodb_os_log_pending_writesPending log writes waiting for I/OSustained value > 0
Innodb_data_pending_fsyncsAggregate fsync backlog across InnoDB data filesSustained value > 0
Checkpoint age / redo log capacityDistinguishes fsync pressure from capacity exhaustionAge < 75% but commits are still slow = fsync speed issue
Innodb_os_log_fsyncs rateBaseline for fsync throughput capacityRate collapses while pending fsyncs rise = storage is stalling
Questions or commit rateApplication-visible throughputSudden drop with steady Threads_connected while CPU is idle

Fixes

Reduce fsync frequency

If the storage tier cannot sustain the current fsync rate, relax durability knobs. Each change trades safety for speed.

  • Set innodb_flush_log_at_trx_commit=2. Redo log entries are written to the OS buffer on every commit, but fsync() happens only once per second. This eliminates the per-commit redo fsync. The tradeoff is up to one second of data loss on an OS crash or power failure. It is safe against a mysqld process crash.
  • Set sync_binlog=0. The binary log fsync is deferred to the operating system. The tradeoff is risk of binlog corruption or data loss on crash. Alternatively, set sync_binlog to a value greater than 1 to fsync the binlog once every N commits instead of every commit.
  • Do not change both settings to zero unless you explicitly accept the risk of losing committed transactions.

Isolate redo and binlog I/O

If the redo log and binary log reside on the same physical device, their sequential fsync streams compete for the same disk queue. Move them to separate volumes backed by distinct physical devices or controllers so that a spike in one does not inflate latency for the other. Update innodb_log_group_home_dir and log_bin accordingly; the change requires a restart.

Upgrade or tune storage

If the storage subsystem is the bottleneck, increase its synchronous IOPS capacity. Options include moving to local NVMe, increasing provisioned IOPS on cloud volumes, resolving RAID rebuilds, or moving off oversubscribed shared storage. If the storage has a write-back cache, ensure it is battery-backed or non-volatile before relying on it for durability.

Prevention

  • Monitor Innodb_os_log_pending_fsyncs continuously. Any sustained nonzero value means storage headroom is gone and commit latency will follow.
  • Size storage for peak commit rate multiplied by the fsync multiplier imposed by your durability settings. With innodb_flush_log_at_trx_commit=1 and sync_binlog=1, every commit generates two fsyncs.
  • Keep transaction logs and binlogs on separate volumes when infrastructure allows.
  • Avoid colocating high-fsync workloads such as backups, schema migrations, or batch imports on the same storage controller as the transaction logs.
  • Track transaction latency in performance_schema so that fsync pressure is visible before application timeouts trigger.

How Netdata helps

  • Netdata tracks Innodb_os_log_pending_fsyncs and Innodb_os_log_pending_writes without manual polling of SHOW GLOBAL STATUS.
  • It plots MySQL commit latency against OS disk await and utilization on the same timeline, confirming whether storage is the bottleneck.
  • Checkpoint age and pending fsyncs are visualized together, distinguishing a capacity stall from a write-speed stall in one view.
  • Alerts on sustained nonzero pending fsyncs provide early warning before commit latency degrades into application timeouts.