MySQL replication lag spiral: Seconds_Behind_Source growing without bound

You check SHOW REPLICA STATUS and see Seconds_Behind_Source is not just elevated; it is climbing. Five minutes ago it was 60. Now it is 120. The replica is not catching up; it is falling further behind.

This is a replication lag death spiral. The SQL apply thread (or threads) cannot replay events as fast as the source generates them. Relay logs accumulate. Lag compounds because longer apply queues mean larger effective transactions, longer lock holds, and greater exposure to blocking queries on the replica.

Left unchecked, the spiral ends one of two ways: the relay log partition fills and the I/O thread stalls, or the source’s binary log expires before the replica consumes the events and replication breaks irrecoverably.

This pattern differs from a transient spike. A single large transaction causes a sudden lag jump that eventually recovers. A spiral shows monotonic growth with no plateau. The source is usually healthy; the problem is isolated to replica apply throughput.

What this means

The source writes committed transactions to its binary log. The replica’s I/O thread copies events into local relay logs. The SQL thread (or coordinator and worker threads in multi-threaded mode) reads the relay log and replays events against the replica.

A death spiral begins when apply rate is persistently lower than fetch rate. The SQL thread falls behind while the I/O thread continues pulling events. Relay log files grow. Because Seconds_Behind_Source compares the timestamp of the last applied event to the current time on the replica, it grows linearly with the backlog. On single-threaded replicas, one core does all apply work while the source may use many. Even with multi-threaded apply, replica_preserve_commit_order=ON can serialize workers when transactions have dependencies, or when foreign keys and auto-increment inserts force commit-order serialization.

As lag grows, the window of unapplied transactions widens. If the source purges binlogs based on binlog_expire_logs_seconds or expire_logs_days, and the replica has not yet fetched or applied events inside the purge window, replication stops with an irrecoverable error. If the replica’s disk fills first, the I/O thread blocks on write, halting event fetch but leaving the SQL thread to drain what remains. In both cases, the replica is effectively offline for reads that require consistency.

flowchart TD
    A[Source write rate exceeds replica apply rate] --> B[Relay log accumulates]
    B --> C[Seconds_Behind_Source grows]
    C --> D[Apply transactions grow larger]
    D --> E[Apply throughput drops]
    E --> C
    B --> F[Relay log fills disk or source binlog expires]
    F --> G[Replication breaks or I/O thread stalls]

Common causes

Cause	What it looks like	First thing to check
Single-threaded apply	`replica_parallel_workers` is 0 or 1; replica CPU is mostly idle while lag grows steadily under moderate source load	`SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';`
Large transaction replay	`Seconds_Behind_Source` jumps sharply and then plateaus or creeps; one large event blocks the SQL thread	`SHOW REPLICA STATUS\G` for `Relay_Log_Space` and recent position changes
Replica resource saturation	Replica CPU or disk I/O maxed; apply bottlenecked on hardware	`SHOW GLOBAL STATUS LIKE 'Threads_running';` and OS I/O metrics
Long-running queries blocking apply	`replica_preserve_commit_order=ON` and a long SELECT or backup holds locks that the applier needs	`SHOW PROCESSLIST` and `performance_schema.data_lock_waits`
Missing primary keys with row-based replication	Replica CPU high during apply; apply thread performs full table scans to locate rows	`SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';` on replica

Quick checks

SHOW REPLICA STATUS\G

Look for Seconds_Behind_Source, Replica_IO_Running, Replica_SQL_Running, and Relay_Log_Space.

SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';

Values 0 or 1 mean the replica is applying events serially.

SHOW GLOBAL STATUS LIKE 'Threads_running';

Sustained values above 4 times the CPU core count indicate saturation.

SHOW PROCESSLIST;

Identify long-running SELECTs or backups that may block the applier.

SELECT * FROM performance_schema.replication_applier_status_by_worker LIMIT 5;

A failed worker stops all apply threads on multi-threaded replicas.

SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_current_waits';

Nonzero sustained values mean transactions are waiting on locks.

If you stream replica metrics through Netdata, correlate the mysql.replication_lag chart with disk.await on the relay log partition and per-core CPU utilization to distinguish apply bottlenecks from I/O saturation.

How to diagnose it

Confirm the spiral. Sample Seconds_Behind_Source and Relay_Log_Space every 60 seconds for 5 minutes. If both increase monotonically and both replication threads report Yes, you have a spiral, not a transient spike.
Determine apply parallelism. Run SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';. If the value is 0 or 1 on a multi-core host, single-threaded apply is the likely bottleneck.
Identify large transactions. If lag jumped suddenly and holds steady, the SQL thread is likely replaying one large transaction. Check the source for recent bulk INSERT, UPDATE, DELETE, or DDL.
Check replica resource saturation. Compare replica CPU utilization and disk I/O latency to the source. If the replica is smaller or contended, it cannot match source write throughput.
Detect blocking queries. On the replica, run SHOW PROCESSLIST. Look for long-running queries that hold locks needed by the applier, especially with replica_preserve_commit_order=ON.
Measure time to failure. Compare the lag growth rate to the source’s binlog expiry window (binlog_expire_logs_seconds or expire_logs_days). Project when unapplied events will fall outside the purge window. Check relay log partition free space.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Seconds_Behind_Source`	Timestamp-based lag estimate	Increasing monotonically over 5+ minutes while threads are `Yes`
`Relay_Log_Space`	Total bytes of relay logs waiting to be applied	Growing continuously
`Replica_IO_Running` / `Replica_SQL_Running`	Basic replication health	Any value other than `Yes`
`replica_parallel_workers`	Apply parallelism ceiling	Set to 0 or 1 on multi-core replicas
`Threads_running` on replica	Active concurrent work and contention	Sustained above 4 times CPU cores
GTID set difference	Exact transaction-level lag	`GTID_SUBTRACT(source_set, replica_set)` is non-empty and growing
Checkpoint age vs redo log capacity	Write path saturation on replica	Greater than 75 percent of capacity during peak apply

Fixes

Single-threaded apply

Warning: STOP REPLICA interrupts replication and makes the replica stale for reads. Run only during a maintenance window or when stale reads are acceptable.

Stop the replica threads, increase replica_parallel_workers, and start them again. On MySQL 8.0.27+ the default is 4; on older versions set it explicitly to match core count. Tradeoff: with replica_preserve_commit_order=ON (default), commit order is enforced and some workloads still serialize. Turning it off allows higher concurrency but can produce temporary consistency gaps.

Relay

The Netdata solution

MySQL monitoring with Netdata

Netdata monitors MySQL and MariaDB with per-second metrics and ML anomaly detection. Track connection usage, query throughput, slow queries, redo-log pressure, and replication lag alongside the host and storage signals that explain them.

See MySQL monitoring → Start monitoring free