MySQL relay log filling the replica disk: Relay_Log_Space and recovery
When SHOW REPLICA STATUS shows Seconds_Behind_Source climbing and Relay_Log_Space growing, the SQL thread is not applying events as fast as the IO thread fetches them. Normally MySQL purges each relay log after the SQL thread finishes it, so total disk usage stays bounded. When apply cannot keep up, files accumulate. If the partition fills, the IO thread stops. No new events arrive, lag becomes unbounded, and if the source purges binary logs before recovery, the replica requires a full resync.
What this means
Relay logs buffer the source’s binary log on the replica. The IO thread writes fetched events to relay log files; the SQL thread reads and replays them. Relay_Log_Space is the total size of all relay log files. It grows when the IO thread writes faster than the SQL thread consumes. Large transactions accelerate this: the source writes a transaction to its binary log only at commit, so the whole transaction reaches the replica in one burst that the IO thread can fetch far faster than the SQL thread can apply. When the disk partition holding relay logs fills, the IO thread halts.
flowchart TD
A[Source writes events] --> B[IO thread fetches]
B --> C[Relay_Log_Space grows]
C --> D[SQL thread applies]
D --> E[Auto-purge applied logs]
C --> F{Disk full?}
F -->|No| C
F -->|Yes| G[IO thread stops]
G --> H[Lag compounds]
H --> I[Source binlog expiry]
I --> J[Replica requires rebuild]
E --> K[Disk space recovered]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Replication lag: apply slower than fetch | Steady growth in Relay_Log_Space and Seconds_Behind_Source | replica_parallel_workers and replica CPU/disk saturation |
| Large source transaction | Sudden jump in Relay_Log_Space and lag | Source processlist for long-running DDL or bulk INSERT |
| relay_log_purge = OFF | Logs accumulate even when Seconds_Behind_Source is low | SHOW GLOBAL VARIABLES LIKE 'relay_log_purge' |
| No relay_log_space_limit | Growth is unchecked until the filesystem is full | SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit' |
| Replica resource bottleneck | High Threads_running and disk I/O waits | OS disk metrics and Threads_running vs CPU cores |
Quick checks
# Relay log size and replication thread states
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Relay_Log_Space|Replica_(IO|SQL)_Running|Seconds_Behind_Source|Last_.*_Error"
# Relay log configuration
mysql -e "SHOW GLOBAL VARIABLES LIKE 'relay_log%';"
# Disk usage of the relay log filesystem (adjust path to your datadir)
df -h /var/lib/mysql/
# Largest relay log files (path depends on datadir and naming convention)
du -sh /var/lib/mysql/relay-log.* 2>/dev/null | sort -rh | head -5
SHOW GLOBAL VARIABLES LIKE 'relay_log_purge';
SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';
-- Source write rate to gauge inbound pressure (run on the source)
SHOW GLOBAL STATUS LIKE 'Com_%';
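On MySQL 5.7 and on 8.0 releases before 8.0.22, use SHOW SLAVE STATUS instead of SHOW REPLICA STATUS. The equivalent fields there are Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master.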
How to diagnose it
- Confirm continuous growth. Sample Relay_Log_Space twice over 30-60 seconds. If it increases while Seconds_Behind_Source also increases, the replica is falling behind and accumulating logs.
- Check if the IO thread is still fetching. If Replica_IO_Running is No, the disk may already be full or there may be a network or authentication error. Read Last_IO_Error.
- Verify auto-purge. If relay_log_purge is OFF, logs are retained after apply. This is sometimes intentional for debugging, but it causes unbounded growth.
- Identify a large event. A step-change in Relay_Log_Space rather than steady growth suggests a large transaction. On the source, check for bulk operations or DDL that generate large binlog events.
- Assess replica apply capacity. Check Threads_running on the replica. If it is consistently above the CPU core count, the replica is saturated. Check replica_parallel_workers; values of 0 or 1 on a busy source mean single-threaded apply is likely the bottleneck.
- Calculate disk runway. Compare the growth rate of Relay_Log_Space to free space on the relay log partition. If relay_log_space_limit is set, compare current usage to the limit.
- Check binlog expiry risk. On the source, compare the replica’s lag to binlog_expire_logs_seconds (or expire_logs_days on 5.7). If lag exceeds the expiry window, the source may purge events the replica still needs.
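The sampling and disk-runway steps above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not a monitoring tool: the function names (relay_log_space, growth_rate, runway_seconds) are hypothetical, the mysql client is assumed to connect without prompting for credentials, the commented df invocation assumes GNU coreutils, and all arithmetic is integer-only.

```shell
# Read the current Relay_Log_Space (bytes) from the replica.
# Assumption: credentials come from ~/.my.cnf or a similar option file.
relay_log_space() {
  mysql -e "SHOW REPLICA STATUS\G" | awk '/Relay_Log_Space:/ {print $2}'
}

# growth_rate FIRST_BYTES SECOND_BYTES INTERVAL_SECONDS
# Prints the average growth in bytes per second (integer division).
growth_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# runway_seconds FREE_BYTES RATE_BYTES_PER_SECOND
# Prints the seconds until FREE_BYTES is consumed,
# or "inf" when the relay logs are not growing.
runway_seconds() {
  if [ "$2" -le 0 ]; then
    echo inf
  else
    echo $(( $1 / $2 ))
  fi
}

# Example (commented out so that sourcing this file has no side effects):
# first=$(relay_log_space); sleep 60; second=$(relay_log_space)
# free=$(df -B1 --output=avail /var/lib/mysql/ | tail -n 1 | tr -d ' ')
# runway_seconds "$free" "$(growth_rate "$first" "$second" 60)"
```

A single 60-second sample is noisy; repeating the measurement a few times gives a more trustworthy rate before acting on the runway figure.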
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Relay_Log_Space | Total bytes of all relay log files | Growing continuously while lag grows |
| Seconds_Behind_Source | Replica freshness | Increasing monotonically while the IO thread is active |
| Replica_IO_Running / Replica_SQL_Running | Whether replication is active | Not Yes |
| relay_log_space_limit | Configured ceiling for relay logs | Not set, or current usage approaching the limit |
| relay_log_purge | Automatic cleanup after apply | OFF |
| Relay log disk free space | Prevents IO thread stall | < 20% of partition capacity |
| Source write rate (Com_insert, Com_update, Com_delete) | Inbound event pressure | Sustained rate the replica cannot match |
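The free-space warning in the table can be checked with a few lines of shell. This is an illustrative sketch: free_pct and low_space are hypothetical names, the 20% default mirrors the table above, and the commented df call assumes GNU coreutils and a relay log directory under /var/lib/mysql/.

```shell
# free_pct AVAIL_BYTES TOTAL_BYTES
# Prints free space as an integer percentage of the partition.
free_pct() {
  echo $(( $1 * 100 / $2 ))
}

# low_space AVAIL_BYTES TOTAL_BYTES [THRESHOLD_PCT]
# Succeeds (exit 0) when free space is below the threshold (default 20).
low_space() {
  [ "$(free_pct "$1" "$2")" -lt "${3:-20}" ]
}

# Example (commented out; adjust the path to your relay log partition):
# set -- $(df -B1 --output=avail,size /var/lib/mysql/ | tail -n 1)
# low_space "$1" "$2" && echo "WARNING: relay log partition below 20% free"
```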
Fixes
Immediate containment: stop the IO thread
If the disk is nearly full, stop fetching events to buy time before the IO thread halts with an out-of-space error:
STOP REPLICA IO_THREAD;
The SQL thread continues applying existing relay logs. Lag will increase and source binlogs may expire, so treat this as a temporary measure. Restart with START REPLICA IO_THREAD only after you have added capacity or reduced the apply bottleneck.
Root cause: enable parallel apply
If replica_parallel_workers is 0 or 1 and the source write rate is high, the SQL thread is the bottleneck. Increase parallel workers if your workload and MySQL version support it. Multi-threaded apply increases CPU and memory usage on the replica, but it is the standard fix for apply-bound replication.
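As a rough sketch of what raising the worker count involves, the helper below prints the statements rather than running them, so they can be reviewed first. The function name parallel_apply_sql and the value 4 are illustrative assumptions, not recommendations. The variable is named replica_parallel_workers from MySQL 8.0.26; earlier releases use slave_parallel_workers with STOP SLAVE and START SLAVE. A value set with SET GLOBAL does not survive a server restart unless it is also written to the configuration.

```shell
# parallel_apply_sql WORKERS
# Prints the statements that change the applier worker count.
# The new value takes effect when the SQL (applier) thread restarts,
# which is why the change is wrapped in STOP/START.
parallel_apply_sql() {
  printf 'STOP REPLICA SQL_THREAD;\n'
  printf 'SET GLOBAL replica_parallel_workers = %d;\n' "$1"
  printf 'START REPLICA SQL_THREAD;\n'
}

# Review the output first, then pipe it to the client:
# parallel_apply_sql 4
# parallel_apply_sql 4 | mysql
```

Because this briefly stops the SQL thread, it is safer to run when the replica is not in the middle of applying a large transaction.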
Root cause: re-enable relay log purge
If relay_log_purge was disabled, re-enable it so applied files are removed automatically:
SET GLOBAL relay_log_purge = ON;
If you disabled it to retain logs for debugging, copy the files you need before re-enabling.
Root cause: large transaction
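If a single large transaction caused the spike, wait for the SQL thread to finish applying it. Stopping the SQL thread does not free any space, because relay logs are purged only after they have been applied. It also discards progress: on transactional tables the partially applied transaction is rolled back and replayed from the start on restart, and on non-transactional tables an interrupted replay can leave data in an inconsistent state. If you must stop, use STOP REPLICA SQL_THREAD, but be prepared to handle the transaction boundary on restart.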
Disk full recovery
If the disk is already full and the IO thread has stopped, free space without touching relay logs. Remove old application or OS logs, temporary files, or MySQL error/slow query logs. If the replica has binary logging enabled and is not a source to other replicas, purge its oldest binary logs.
Warning: Do not delete relay logs manually. The SQL thread expects the relay log index to match the filesystem; removing files causes replication errors. If MySQL cannot start because the disk is completely full, move the oldest relay log files to another filesystem only as a last resort. If the SQL thread has not yet processed them, you must rebuild the replica.
After freeing space, start MySQL and check Replica_IO_Running. If it is not Yes, run START REPLICA IO_THREAD. Check Last_IO_Error; if the source has purged the binary logs corresponding to the replica’s Relay_Source_Log_File position (Relay_Master_Log_File in SHOW SLAVE STATUS), you need a full resync.
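One way to check whether the source still holds the file the replica needs is to look for it in the output of SHOW BINARY LOGS on the source. The sketch below assumes file-based positioning (with GTID auto-positioning the file name is not what the source matches on), and both the function name binlog_available and the host and file names in the example are hypothetical.

```shell
# binlog_available FILE
# Reads SHOW BINARY LOGS rows on stdin (tab-separated, log name first)
# and exits 0 only when FILE is still listed.
binlog_available() {
  awk -v f="$1" '$1 == f { found = 1 } END { exit !found }'
}

# Example (commented out; source-host and binlog.000123 are placeholders):
# mysql -h source-host -N -B -e "SHOW BINARY LOGS" \
#   | binlog_available "binlog.000123" \
#   || echo "binlog.000123 is gone from the source: plan a resync"
```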
Prevention
- Set relay_log_space_limit to a value lower than the partition size. When the limit is reached, the IO thread pauses instead of filling the filesystem.
- Keep relay_log_purge enabled so applied logs are removed automatically.
- Monitor Relay_Log_Space as a trend. Steady growth is a leading indicator of lag.
- Size the relay log partition independently from the data directory with enough headroom for the maximum expected lag window.
- Verify that replicas use multi-threaded apply if the source is write-heavy. Single-threaded replicas on busy sources are likely to accumulate relay logs during peak traffic.
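The first two items can be turned into a configuration fragment. The sketch below derives a limit as a percentage of the partition size and prints the matching my.cnf lines; the function names and the 70% figure are illustrative assumptions. As far as I know, relay_log_space_limit is read at server startup rather than changed at runtime, so the fragment would go into the option file and take effect after a restart; check the documentation for your version.

```shell
# space_limit_for PARTITION_BYTES PERCENT
# Prints PERCENT of the partition size in bytes (integer arithmetic).
space_limit_for() {
  echo $(( $1 / 100 * $2 ))
}

# relay_log_cnf PARTITION_BYTES PERCENT
# Prints a my.cnf fragment that caps relay log space and keeps purge on.
relay_log_cnf() {
  printf '[mysqld]\n'
  printf 'relay_log_space_limit = %s\n' "$(space_limit_for "$1" "$2")"
  printf 'relay_log_purge = ON\n'
}

# Example (commented out): 70% of a 100 GB relay log partition
# relay_log_cnf 100000000000 70
```

Leaving headroom below the partition size matters because other files may share the filesystem and the limit is a ceiling for relay logs only.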
How Netdata helps
- Correlates Relay_Log_Space, Seconds_Behind_Source, and disk utilization on the replica to surface the compounding pattern in one view.
- Alerts on replication thread state changes (for example, the IO thread stopping) without manual polling of SHOW REPLICA STATUS.
- Per-second source write metrics (Com_insert, Com_update, Com_delete) can be compared against replica throughput to identify apply bottlenecks before disk fills.
- Disk space alerts on the relay log partition provide early warning while there is still time to stop the IO thread gracefully.
Related guides
- How MySQL actually works in production: a mental model for operators
- MySQL Aborted_connects and Aborted_clients climbing: diagnosis
- [MySQL adaptive hash index latch contention: high







