MySQL replication lag spiral: Seconds_Behind_Source growing without bound
You check SHOW REPLICA STATUS and see Seconds_Behind_Source is not just elevated; it is climbing. Five minutes ago it was 60. Now it is 120. The replica is not catching up; it is falling further behind.
This is a replication lag death spiral. The SQL apply thread (or threads) cannot replay events as fast as the source generates them. Relay logs accumulate. Lag compounds because longer apply queues mean larger effective transactions, longer lock holds, and greater exposure to blocking queries on the replica.
Left unchecked, the spiral ends one of two ways: the relay log partition fills and the I/O thread stalls, or the source’s binary log expires before the replica consumes the events and replication breaks irrecoverably.
This pattern differs from a transient spike. A single large transaction causes a sudden lag jump that eventually recovers. A spiral shows monotonic growth with no plateau. The source is usually healthy; the problem is isolated to replica apply throughput.
What this means
The source writes committed transactions to its binary log. The replica’s I/O thread copies events into local relay logs. The SQL thread (or coordinator and worker threads in multi-threaded mode) reads the relay log and replays events against the replica.
A death spiral begins when apply rate is persistently lower than fetch rate. The SQL thread falls behind while the I/O thread continues pulling events. Relay log files grow. Because Seconds_Behind_Source compares the timestamp of the last applied event to the current time on the replica, it grows linearly with the backlog. On single-threaded replicas, one core does all apply work while the source may use many. Even with multi-threaded apply, replica_preserve_commit_order=ON can serialize workers when transactions have dependencies, or when foreign keys and auto-increment inserts force commit-order serialization.
As lag grows, the window of unapplied transactions widens. If the source purges binlogs based on binlog_expire_logs_seconds or expire_logs_days, and the replica has not yet fetched or applied events inside the purge window, replication stops with an irrecoverable error. If the replica’s disk fills first, the I/O thread blocks on write, halting event fetch but leaving the SQL thread to drain what remains. In both cases, the replica is effectively offline for reads that require consistency.
flowchart TD
A[Source write rate exceeds replica apply rate] --> B[Relay log accumulates]
B --> C[Seconds_Behind_Source grows]
C --> D[Apply transactions grow larger]
D --> E[Apply throughput drops]
E --> C
B --> F[Relay log fills disk or source binlog expires]
F --> G[Replication breaks or I/O thread stalls]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Single-threaded apply | replica_parallel_workers is 0 or 1; replica CPU is mostly idle while lag grows steadily under moderate source load | SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers'; |
| Large transaction replay | Seconds_Behind_Source jumps sharply and then plateaus or creeps; one large event blocks the SQL thread | SHOW REPLICA STATUS\G for Relay_Log_Space and recent position changes |
| Replica resource saturation | Replica CPU or disk I/O maxed; apply bottlenecked on hardware | SHOW GLOBAL STATUS LIKE 'Threads_running'; and OS I/O metrics |
| Long-running queries blocking apply | replica_preserve_commit_order=ON and a long SELECT or backup holds locks that the applier needs | SHOW PROCESSLIST and performance_schema.data_lock_waits |
| Missing primary keys with row-based replication | Replica CPU high during apply; apply thread performs full table scans to locate rows | SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next'; on replica |
Quick checks
SHOW REPLICA STATUS\G
Look for Seconds_Behind_Source, Replica_IO_Running, Replica_SQL_Running, and Relay_Log_Space.
SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';
Values 0 or 1 mean the replica is applying events serially.
SHOW GLOBAL STATUS LIKE 'Threads_running';
Sustained values above 4 times the CPU core count indicate saturation.
SHOW PROCESSLIST;
Identify long-running SELECTs or backups that may block the applier.
SELECT * FROM performance_schema.replication_applier_status_by_worker LIMIT 5;
A failed worker stops all apply threads on multi-threaded replicas.
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_current_waits';
Nonzero sustained values mean transactions are waiting on locks.
If you stream replica metrics through Netdata, correlate the mysql.replication_lag chart with disk.await on the relay log partition and per-core CPU utilization to distinguish apply bottlenecks from I/O saturation.
How to diagnose it
- Confirm the spiral. Sample
Seconds_Behind_SourceandRelay_Log_Spaceevery 60 seconds for 5 minutes. If both increase monotonically and both replication threads reportYes, you have a spiral, not a transient spike. - Determine apply parallelism. Run
SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';. If the value is 0 or 1 on a multi-core host, single-threaded apply is the likely bottleneck. - Identify large transactions. If lag jumped suddenly and holds steady, the SQL thread is likely replaying one large transaction. Check the source for recent bulk INSERT, UPDATE, DELETE, or DDL.
- Check replica resource saturation. Compare replica CPU utilization and disk I/O latency to the source. If the replica is smaller or contended, it cannot match source write throughput.
- Detect blocking queries. On the replica, run
SHOW PROCESSLIST. Look for long-running queries that hold locks needed by the applier, especially withreplica_preserve_commit_order=ON. - Measure time to failure. Compare the lag growth rate to the source’s binlog expiry window (
binlog_expire_logs_secondsorexpire_logs_days). Project when unapplied events will fall outside the purge window. Check relay log partition free space.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Seconds_Behind_Source | Timestamp-based lag estimate | Increasing monotonically over 5+ minutes while threads are Yes |
Relay_Log_Space | Total bytes of relay logs waiting to be applied | Growing continuously |
Replica_IO_Running / Replica_SQL_Running | Basic replication health | Any value other than Yes |
replica_parallel_workers | Apply parallelism ceiling | Set to 0 or 1 on multi-core replicas |
Threads_running on replica | Active concurrent work and contention | Sustained above 4 times CPU cores |
| GTID set difference | Exact transaction-level lag | GTID_SUBTRACT(source_set, replica_set) is non-empty and growing |
| Checkpoint age vs redo log capacity | Write path saturation on replica | Greater than 75 percent of capacity during peak apply |
Fixes
Single-threaded apply
Warning: STOP REPLICA interrupts replication and makes the replica stale for reads. Run only during a maintenance window or when stale reads are acceptable.
Stop the replica threads, increase replica_parallel_workers, and start them again. On MySQL 8.0.27+ the default is 4; on older versions set it explicitly to match core count. Tradeoff: with replica_preserve_commit_order=ON (default), commit order is enforced and some workloads still serialize. Turning it off allows higher concurrency but can produce temporary consistency gaps.







