MySQL replication lag spiral: Seconds_Behind_Source growing without bound

You check SHOW REPLICA STATUS and see Seconds_Behind_Source is not just elevated; it is climbing. Five minutes ago it was 60. Now it is 120. The replica is not catching up; it is falling further behind.

This is a replication lag death spiral. The SQL apply thread (or threads) cannot replay events as fast as the source generates them. Relay logs accumulate. Lag compounds because longer apply queues mean larger effective transactions, longer lock holds, and greater exposure to blocking queries on the replica.

Left unchecked, the spiral ends one of two ways: the relay log partition fills and the I/O thread stalls, or the source’s binary log expires before the replica consumes the events and replication breaks irrecoverably.

This pattern differs from a transient spike. A single large transaction causes a sudden lag jump that eventually recovers. A spiral shows monotonic growth with no plateau. The source is usually healthy; the problem is isolated to replica apply throughput.

What this means

The source writes committed transactions to its binary log. The replica’s I/O thread copies events into local relay logs. The SQL thread (or coordinator and worker threads in multi-threaded mode) reads the relay log and replays events against the replica.

A death spiral begins when apply rate is persistently lower than fetch rate. The SQL thread falls behind while the I/O thread continues pulling events. Relay log files grow. Because Seconds_Behind_Source compares the timestamp of the last applied event to the current time on the replica, it grows linearly with the backlog. On single-threaded replicas, one core does all apply work while the source may use many. Even with multi-threaded apply, replica_preserve_commit_order=ON can serialize workers when transactions have dependencies, or when foreign keys and auto-increment inserts force commit-order serialization.

As lag grows, the window of unapplied transactions widens. If the source purges binlogs based on binlog_expire_logs_seconds or expire_logs_days, and the replica has not yet fetched or applied events inside the purge window, replication stops with an irrecoverable error. If the replica’s disk fills first, the I/O thread blocks on write, halting event fetch but leaving the SQL thread to drain what remains. In both cases, the replica is effectively offline for reads that require consistency.

flowchart TD
    A[Source write rate exceeds replica apply rate] --> B[Relay log accumulates]
    B --> C[Seconds_Behind_Source grows]
    C --> D[Apply transactions grow larger]
    D --> E[Apply throughput drops]
    E --> C
    B --> F[Relay log fills disk or source binlog expires]
    F --> G[Replication breaks or I/O thread stalls]

Common causes

CauseWhat it looks likeFirst thing to check
Single-threaded applyreplica_parallel_workers is 0 or 1; replica CPU is mostly idle while lag grows steadily under moderate source loadSHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';
Large transaction replaySeconds_Behind_Source jumps sharply and then plateaus or creeps; one large event blocks the SQL threadSHOW REPLICA STATUS\G for Relay_Log_Space and recent position changes
Replica resource saturationReplica CPU or disk I/O maxed; apply bottlenecked on hardwareSHOW GLOBAL STATUS LIKE 'Threads_running'; and OS I/O metrics
Long-running queries blocking applyreplica_preserve_commit_order=ON and a long SELECT or backup holds locks that the applier needsSHOW PROCESSLIST and performance_schema.data_lock_waits
Missing primary keys with row-based replicationReplica CPU high during apply; apply thread performs full table scans to locate rowsSHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next'; on replica

Quick checks

SHOW REPLICA STATUS\G

Look for Seconds_Behind_Source, Replica_IO_Running, Replica_SQL_Running, and Relay_Log_Space.

SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';

Values 0 or 1 mean the replica is applying events serially.

SHOW GLOBAL STATUS LIKE 'Threads_running';

Sustained values above 4 times the CPU core count indicate saturation.

SHOW PROCESSLIST;

Identify long-running SELECTs or backups that may block the applier.

SELECT * FROM performance_schema.replication_applier_status_by_worker LIMIT 5;

A failed worker stops all apply threads on multi-threaded replicas.

SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_current_waits';

Nonzero sustained values mean transactions are waiting on locks.

If you stream replica metrics through Netdata, correlate the mysql.replication_lag chart with disk.await on the relay log partition and per-core CPU utilization to distinguish apply bottlenecks from I/O saturation.

How to diagnose it

  1. Confirm the spiral. Sample Seconds_Behind_Source and Relay_Log_Space every 60 seconds for 5 minutes. If both increase monotonically and both replication threads report Yes, you have a spiral, not a transient spike.
  2. Determine apply parallelism. Run SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';. If the value is 0 or 1 on a multi-core host, single-threaded apply is the likely bottleneck.
  3. Identify large transactions. If lag jumped suddenly and holds steady, the SQL thread is likely replaying one large transaction. Check the source for recent bulk INSERT, UPDATE, DELETE, or DDL.
  4. Check replica resource saturation. Compare replica CPU utilization and disk I/O latency to the source. If the replica is smaller or contended, it cannot match source write throughput.
  5. Detect blocking queries. On the replica, run SHOW PROCESSLIST. Look for long-running queries that hold locks needed by the applier, especially with replica_preserve_commit_order=ON.
  6. Measure time to failure. Compare the lag growth rate to the source’s binlog expiry window (binlog_expire_logs_seconds or expire_logs_days). Project when unapplied events will fall outside the purge window. Check relay log partition free space.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Seconds_Behind_SourceTimestamp-based lag estimateIncreasing monotonically over 5+ minutes while threads are Yes
Relay_Log_SpaceTotal bytes of relay logs waiting to be appliedGrowing continuously
Replica_IO_Running / Replica_SQL_RunningBasic replication healthAny value other than Yes
replica_parallel_workersApply parallelism ceilingSet to 0 or 1 on multi-core replicas
Threads_running on replicaActive concurrent work and contentionSustained above 4 times CPU cores
GTID set differenceExact transaction-level lagGTID_SUBTRACT(source_set, replica_set) is non-empty and growing
Checkpoint age vs redo log capacityWrite path saturation on replicaGreater than 75 percent of capacity during peak apply

Fixes

Single-threaded apply

Warning: STOP REPLICA interrupts replication and makes the replica stale for reads. Run only during a maintenance window or when stale reads are acceptable.

Stop the replica threads, increase replica_parallel_workers, and start them again. On MySQL 8.0.27+ the default is 4; on older versions set it explicitly to match core count. Tradeoff: with replica_preserve_commit_order=ON (default), commit order is enforced and some workloads still serialize. Turning it off allows higher concurrency but can produce temporary consistency gaps.

Relay