MySQL relay log filling the replica disk: Relay_Log_Space and recovery

When SHOW REPLICA STATUS shows Seconds_Behind_Source climbing and Relay_Log_Space growing, the SQL thread is not applying events as fast as the IO thread fetches them. Normally MySQL purges each relay log after the SQL thread finishes it, so total disk usage stays bounded. When apply cannot keep up, files accumulate. If the partition fills, the IO thread stops. No new events arrive, lag becomes unbounded, and if the source purges binary logs before recovery, the replica requires a full resync.

What this means

Relay logs buffer the source’s binary log on the replica. The IO thread writes fetched events to relay log files; the SQL thread reads and replays them. Relay_Log_Space is the total size of all relay log files. It grows when the IO thread writes faster than the SQL thread consumes. Large transactions accelerate this: the source writes a transaction to its binary log only at commit, so the whole transaction reaches the relay log as a single burst that can take the SQL thread far longer to replay than it took the IO thread to fetch. When the disk partition holding relay logs fills, the IO thread halts.

flowchart TD
    A[Source writes events] --> B[IO thread fetches]
    B --> C[Relay_Log_Space grows]
    C --> D[SQL thread applies]
    D --> E[Auto-purge applied logs]
    C --> F{Disk full?}
    F -->|No| C
    F -->|Yes| G[IO thread stops]
    G --> H[Lag compounds]
    H --> I[Source binlog expiry]
    I --> J[Replica requires rebuild]
    E --> K[Disk space recovered]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Replication lag: apply slower than fetch | Steady growth in Relay_Log_Space and Seconds_Behind_Source | replica_parallel_workers and replica CPU/disk saturation |
| Large source transaction | Sudden jump in Relay_Log_Space and lag | Source processlist for long-running DDL or bulk INSERT |
| relay_log_purge = OFF | Logs accumulate even when Seconds_Behind_Source is low | SHOW GLOBAL VARIABLES LIKE 'relay_log_purge' |
| No relay_log_space_limit | Growth is unchecked until the filesystem is full | SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit' |
| Replica resource bottleneck | High Threads_running and disk I/O waits | OS disk metrics and Threads_running vs CPU cores |

Quick checks

# Relay log size and replication thread states
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Relay_Log_Space|Replica_(IO|SQL)_Running|Seconds_Behind_Source|Last_.*_Error"
# Relay log configuration
mysql -e "SHOW GLOBAL VARIABLES LIKE 'relay_log%';"
# Disk usage of the relay log filesystem (adjust path to your datadir)
df -h /var/lib/mysql/
# Largest relay log files (path depends on datadir and naming convention)
du -sh /var/lib/mysql/relay-log.* 2>/dev/null | sort -rh | head -5
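-- The same settings, checked from a mysql client session on the replica
SHOW GLOBAL VARIABLES LIKE 'relay_log_purge';
SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';
-- Source write rate to gauge inbound pressure (run on the source; counters are cumulative, so sample twice and take the difference)
SHOW GLOBAL STATUS WHERE Variable_name IN ('Com_insert', 'Com_update', 'Com_delete');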

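SHOW REPLICA STATUS and the Replica_*/Source field names require MySQL 8.0.22 or later. On MySQL 5.7 and earlier 8.0 releases, use SHOW SLAVE STATUS and the older field names (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master). Likewise, replica_parallel_workers is named slave_parallel_workers before MySQL 8.0.26.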

How to diagnose it

  1. Confirm continuous growth. Sample Relay_Log_Space twice over 30-60 seconds. If it increases while Seconds_Behind_Source also increases, the replica is falling behind and accumulating logs.
  2. Check if the IO thread is still fetching. If Replica_IO_Running is No, the disk may already be full or there may be a network or authentication error. Read Last_IO_Error.
  3. Verify auto-purge. If relay_log_purge is OFF, logs are retained after apply. This is sometimes intentional for debugging, but it causes unbounded growth.
  4. Identify a large event. A step-change in Relay_Log_Space rather than steady growth suggests a large transaction. On the source, check for bulk operations or DDL that generate large binlog events.
  5. Assess replica apply capacity. Check Threads_running on the replica. If it is consistently above the CPU core count, the replica is saturated. Check replica_parallel_workers; values of 0 or 1 on a busy source mean single-threaded apply is likely the bottleneck.
  6. Calculate disk runway. Compare the growth rate of Relay_Log_Space to free space on the relay log partition. If relay_log_space_limit is set, compare current usage to the limit.
  7. Check binlog expiry risk. On the source, compare the replica’s lag to binlog_expire_logs_seconds (or expire_logs_days on 5.7). If lag exceeds the expiry window, the source may purge events the replica still needs.
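The arithmetic behind steps 1, 6, and 7 can be sketched as a few shell helpers. This is a minimal sketch, not a monitoring tool: the function names are hypothetical, all inputs are plain integers you collect yourself, and the comments show one possible way to gather them (the `df --output` form assumes GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helpers; inputs are integers (bytes or seconds).
# One way to collect them on the replica:
#   mysql -e "SHOW REPLICA STATUS\G" | awk '/Relay_Log_Space/ {print $2}'
#   df --output=avail -B1 /var/lib/mysql | tail -1    # GNU df only

# growth_rate BYTES_T1 BYTES_T2 INTERVAL_SECONDS -> bytes per second
growth_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# runway_seconds FREE_BYTES RATE -> seconds until the partition fills,
# or "inf" when relay logs are flat or shrinking
runway_seconds() {
    if [ "$2" -le 0 ]; then
        echo "inf"
    else
        echo $(( $1 / $2 ))
    fi
}

# expiry_risk LAG_SECONDS EXPIRY_SECONDS -> "at-risk" when lag has reached
# the source's binlog retention window, otherwise "ok"
expiry_risk() {
    if [ "$1" -ge "$2" ]; then
        echo "at-risk"
    else
        echo "ok"
    fi
}
```

For example, two Relay_Log_Space samples of 1000000 and 4000000 bytes taken 30 seconds apart give `growth_rate 1000000 4000000 30`, and feeding that rate plus the free bytes into `runway_seconds` estimates how long you have before the IO thread stalls. The estimate assumes the growth rate stays constant, which it rarely does during bursts.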

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Relay_Log_Space | Total bytes of all relay log files | Growing continuously while lag grows |
| Seconds_Behind_Source | Replica freshness | Increasing monotonically while the IO thread is active |
| Replica_IO_Running / Replica_SQL_Running | Whether replication is active | Not Yes |
| relay_log_space_limit | Configured ceiling for relay logs | Not set, or current usage approaching the limit |
| relay_log_purge | Automatic cleanup after apply | OFF |
| Relay log disk free space | Prevents IO thread stall | < 20% of partition capacity |
| Source write rate (Com_insert, Com_update, Com_delete) | Inbound event pressure | Sustained rate the replica cannot match |
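Two of the warning signs above are simple enough to check from a script. The following is a rough sketch under stated assumptions: the function names are hypothetical, `thread_warnings` expects the vertical (`\G`) output of SHOW REPLICA STATUS on standard input with the 8.0.22+ field names, and the 20% threshold is just the value from the table.

```shell
#!/bin/sh
# thread_warnings: reads SHOW REPLICA STATUS\G output on stdin and prints
# one line for each replication thread whose state is anything but "Yes".
thread_warnings() {
    awk -F': ' '
        { sub(/^[ \t]+/, "", $1) }
        ($1 == "Replica_IO_Running" || $1 == "Replica_SQL_Running") && $2 != "Yes" {
            print "WARN " $1 "=" $2
        }'
}

# free_pct FREE_BYTES TOTAL_BYTES -> integer percentage of free space
free_pct() {
    echo $(( $1 * 100 / $2 ))
}

# disk_low FREE_BYTES TOTAL_BYTES -> succeeds when free space is below 20%
disk_low() {
    [ "$(free_pct "$1" "$2")" -lt 20 ]
}
```

A possible invocation is `mysql -e "SHOW REPLICA STATUS\G" | thread_warnings`, which prints nothing when both threads report Yes.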

Fixes

Immediate containment: stop the IO thread

If the disk is nearly full, stop fetching events to buy time before the IO thread halts with an out-of-space error:

STOP REPLICA IO_THREAD;

The SQL thread continues applying existing relay logs. Lag will increase and source binlogs may expire, so treat this as a temporary measure. Restart with START REPLICA IO_THREAD only after you have added capacity or reduced the apply bottleneck.
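If you want this containment step to be scriptable, one option is a small guard that only emits the statement once disk usage crosses a threshold. This is a sketch, not a recommendation: `containment_sql` is a hypothetical name and the 90% threshold in the usage line is an arbitrary example value.

```shell
#!/bin/sh
# containment_sql USED_PCT THRESHOLD_PCT
# Prints the STOP statement when usage has reached the threshold,
# and prints nothing otherwise.
containment_sql() {
    if [ "$1" -ge "$2" ]; then
        echo "STOP REPLICA IO_THREAD;"
    fi
}

# Possible usage (POSIX df; column 5 of the second line is the used percentage):
#   used=$(df -P /var/lib/mysql | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
#   containment_sql "$used" 90 | mysql
```

Printing the SQL rather than executing it keeps the decision reviewable: you can inspect the output first and pipe it into `mysql` only when you are satisfied.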

Root cause: enable parallel apply

If replica_parallel_workers is 0 or 1 and the source write rate is high, the SQL thread is the bottleneck. Increase parallel workers if your workload and MySQL version support it. Multi-threaded apply increases CPU and memory usage on the replica, but it is the standard fix for apply-bound replication.
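As an illustration, the change could be applied with statements like the ones this helper prints. It assumes MySQL 8.0.26 or later (older versions use slave_parallel_workers and STOP/START SLAVE), the function name is hypothetical, and the worker count is a value you choose for your workload. The applier is restarted around the change because a new worker count only applies when the SQL thread next starts.

```shell
#!/bin/sh
# parallel_apply_sql WORKERS
# Prints the statements that change the worker count and restart the applier.
# Rejects anything that is not a non-negative integer.
parallel_apply_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: parallel_apply_sql WORKERS" >&2; return 1 ;;
    esac
    cat <<EOF
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = $1;
START REPLICA SQL_THREAD;
EOF
}

# Possible usage, with 4 as an example value:
#   parallel_apply_sql 4 | mysql
```

SET GLOBAL does not survive a server restart, so the same value would also need to go into the configuration file (or be set with SET PERSIST) to stick.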

Root cause: re-enable relay log purge

If relay_log_purge was disabled, re-enable it so applied files are removed automatically:

SET GLOBAL relay_log_purge = ON;

If you disabled it to retain logs for debugging, copy the files you need before re-enabling.

Root cause: large transaction

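If a single large transaction caused the spike, wait for the SQL thread to finish applying it. Avoid stopping the SQL thread unless disk space is critical: STOP REPLICA SQL_THREAD rolls back an in-flight transaction on transactional tables such as InnoDB, so the apply work restarts from the beginning, and a transaction that has already modified nontransactional tables (for example MyISAM) can leave those tables in an inconsistent state. If you must stop, expect the transaction to be replayed from its start when you restart the SQL thread.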

Disk full recovery

If the disk is already full and the IO thread has stopped, free space without touching relay logs. Remove old application or OS logs, temporary files, or MySQL error/slow query logs. If the replica has binary logging enabled and is not a source to other replicas, purge its oldest binary logs.

Warning: Do not delete relay logs manually. The SQL thread expects the relay log index to match the filesystem; removing files causes replication errors. If MySQL cannot start because the disk is completely full, move the oldest relay log files to another filesystem only as a last resort. If the SQL thread has not yet processed them, replication cannot resume from those files: the replica must refetch the events from the source, or be rebuilt if the source has already purged them.

After freeing space, start MySQL and check Replica_IO_Running. If it is not Yes, run START REPLICA IO_THREAD. Check Last_IO_Error; if the source has purged the binary logs corresponding to the replica’s Relay_Source_Log_File position (Relay_Master_Log_File before MySQL 8.0.22), you need a full resync.

Prevention

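  • Set relay_log_space_limit to a value lower than the partition size. When the limit is reached, the IO thread pauses instead of filling the filesystem. The variable is not dynamic, so set it in the configuration file and restart the replica for it to take effect.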
  • Keep relay_log_purge enabled so applied logs are removed automatically.
  • Monitor Relay_Log_Space as a trend. Steady growth is a leading indicator of lag.
  • Size the relay log partition independently from the data directory with enough headroom for the maximum expected lag window.
  • Verify that replicas use multi-threaded apply if the source is write-heavy. Single-threaded replicas on busy sources are likely to accumulate relay logs during peak traffic.
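The first two items can be turned into a configuration fragment. The helper below is only a sketch: `relay_cnf` is a hypothetical name, and the percentage of the partition to allow for relay logs is an assumption you should pick yourself. The MySQL documentation also advises against setting relay_log_space_limit below twice max_relay_log_size, which this sketch does not check.

```shell
#!/bin/sh
# relay_cnf PARTITION_BYTES PERCENT
# Prints a [mysqld] fragment that caps relay logs at PERCENT of the partition
# and keeps automatic purging enabled.
relay_cnf() {
    limit=$(( $1 * $2 / 100 ))
    cat <<EOF
[mysqld]
relay_log_purge = ON
relay_log_space_limit = $limit
EOF
}

# Possible usage for a 100 GB partition, allowing relay logs half of it:
#   relay_cnf 100000000000 50
```

Because relay_log_space_limit is read at startup, the fragment only takes effect after the replica is restarted.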

How Netdata helps

  • Correlates Relay_Log_Space, Seconds_Behind_Source, and disk utilization on the replica to surface the compounding pattern in one view.
  • Alerts on replication thread state changes (for example, the IO thread stopping) without manual polling of SHOW REPLICA STATUS.
  • Exposes per-second source write metrics (Com_insert, Com_update, Com_delete) that can be compared against replica throughput to identify apply bottlenecks before the disk fills.
  • Disk space alerts on the relay log partition provide early warning while there is still time to stop the IO thread gracefully.