MySQL relay log filling the replica disk: Relay_Log_Space and recovery

When SHOW REPLICA STATUS shows Seconds_Behind_Source climbing and Relay_Log_Space growing, the SQL thread is not applying events as fast as the IO thread fetches them. Normally MySQL purges each relay log after the SQL thread finishes it, so total disk usage stays bounded. When apply cannot keep up, files accumulate. If the partition fills, the IO thread stops. No new events arrive, lag becomes unbounded, and if the source purges binary logs before recovery, the replica requires a full resync.

What this means

Relay logs buffer the source’s binary log on the replica. The IO thread writes fetched events to relay log files; the SQL thread reads and replays them. Relay_Log_Space is the total size of all relay log files. It grows when the IO thread writes faster than the SQL thread consumes. Large transactions accelerate this: the source writes a transaction to its binary log only at commit, so the whole transaction arrives in one burst, and everything behind it queues in relay logs while the SQL thread replays it. When the partition holding relay logs fills, the IO thread can no longer write and stalls.

flowchart TD
    A[Source writes events] --> B[IO thread fetches]
    B --> C[Relay_Log_Space grows]
    C --> D[SQL thread applies]
    D --> E[Auto-purge applied logs]
    C --> F{Disk full?}
    F -->|No| C
    F -->|Yes| G[IO thread stops]
    G --> H[Lag compounds]
    H --> I[Source binlog expiry]
    I --> J[Replica requires rebuild]
    E --> K[Disk space recovered]

Common causes

Cause	What it looks like	First thing to check
Replication lag: apply slower than fetch	Steady growth in `Relay_Log_Space` and `Seconds_Behind_Source`	`replica_parallel_workers` and replica CPU/disk saturation
Large source transaction	Sudden jump in `Relay_Log_Space` and lag	Source processlist for long-running DDL or bulk INSERT
`relay_log_purge = OFF`	Logs accumulate even when `Seconds_Behind_Source` is low	`SHOW GLOBAL VARIABLES LIKE 'relay_log_purge'`
No `relay_log_space_limit`	Growth is unchecked until the filesystem is full	`SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit'`
Replica resource bottleneck	High `Threads_running` and disk I/O waits	OS disk metrics and `Threads_running` vs CPU cores

Quick checks

# Relay log size and replication thread states
mysql -e "SHOW REPLICA STATUS\G" | grep -E "Relay_Log_Space|Replica_(IO|SQL)_Running|Seconds_Behind_Source|Last_.*_Error"

# Relay log configuration
mysql -e "SHOW GLOBAL VARIABLES LIKE 'relay_log%';"

# Disk usage of the relay log filesystem (adjust path to your datadir)
df -h /var/lib/mysql/

# Largest relay log files (path depends on datadir and naming convention)
du -sh /var/lib/mysql/relay-log.* 2>/dev/null | sort -rh | head -5

-- Purge and space-limit settings (run on the replica)
SHOW GLOBAL VARIABLES LIKE 'relay_log_purge';
SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';

-- Source write rate to gauge inbound pressure (run on the source)
SHOW GLOBAL STATUS LIKE 'Com_%';

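On MySQL versions before 8.0.22, including 5.7, use SHOW SLAVE STATUS instead of SHOW REPLICA STATUS. The output fields use the older names there: Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master, so adjust the grep pattern accordingly.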

How to diagnose it

Confirm continuous growth. Sample Relay_Log_Space twice over 30-60 seconds. If it increases while Seconds_Behind_Source also increases, the replica is falling behind and accumulating logs.
Check if the IO thread is still fetching. If Replica_IO_Running is No, the disk may already be full or there may be a network or authentication error. Read Last_IO_Error.
Verify auto-purge. If relay_log_purge is OFF, logs are retained after apply. This is sometimes intentional for debugging, but it causes unbounded growth.
Identify a large event. A step-change in Relay_Log_Space rather than steady growth suggests a large transaction. On the source, check for bulk operations or DDL that generate large binlog events.
Assess replica apply capacity. Check Threads_running on the replica. If it is consistently above the CPU core count, the replica is saturated. Check replica_parallel_workers; values of 0 or 1 on a busy source mean single-threaded apply is likely the bottleneck.
Calculate disk runway. Compare the growth rate of Relay_Log_Space to free space on the relay log partition. If relay_log_space_limit is set, compare current usage to the limit.
Check binlog expiry risk. On the source, compare the replica’s lag to binlog_expire_logs_seconds (or expire_logs_days on 5.7). If lag exceeds the expiry window, the source may purge events the replica still needs.
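The disk runway step can be sketched as a small shell helper. This is a minimal illustration, not a MySQL feature: the function name relay_runway_seconds and its argument order are hypothetical, and it assumes you have already collected two Relay_Log_Space samples and the free bytes on the relay log partition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate seconds until the relay log partition fills.
# Arguments (assumptions of this sketch):
#   $1  first Relay_Log_Space sample, in bytes
#   $2  second Relay_Log_Space sample, in bytes
#   $3  seconds between the two samples
#   $4  free bytes on the relay log partition
# Prints whole seconds of runway, or "stable" if relay logs are not growing.
relay_runway_seconds() {
  local first=$1 second=$2 interval=$3 free=$4
  local growth=$(( second - first ))
  if (( growth <= 0 )); then
    echo "stable"
    return 0
  fi
  # runway = free / (growth / interval), kept in integer arithmetic
  echo $(( free * interval / growth ))
}

# Example of gathering the inputs (assumes GNU df and a local mysql client):
#   s1=$(mysql -e "SHOW REPLICA STATUS\G" | awk '$1 == "Relay_Log_Space:" {print $2}')
#   sleep 60
#   s2=$(mysql -e "SHOW REPLICA STATUS\G" | awk '$1 == "Relay_Log_Space:" {print $2}')
#   free=$(df -B1 --output=avail /var/lib/mysql | tail -n 1 | tr -d ' ')
#   relay_runway_seconds "$s1" "$s2" 60 "$free"
```

A short runway with the IO thread still running is the cue for the containment step under Fixes. If relay_log_space_limit is set, pass the bytes remaining under the limit as the fourth argument instead of the free disk space.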

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Relay_Log_Space`	Total bytes of all relay log files	Growing continuously while lag grows
`Seconds_Behind_Source`	Replica freshness	Increasing monotonically while the IO thread is active
`Replica_IO_Running` / `Replica_SQL_Running`	Whether replication is active	Not `Yes`
`relay_log_space_limit`	Configured ceiling for relay logs	`0` (no limit), or current usage approaching the limit
`relay_log_purge`	Automatic cleanup after apply	`OFF`
Relay log disk free space	Prevents IO thread stall	< 20% of partition capacity
Source write rate (`Com_insert`, `Com_update`, `Com_delete`)	Inbound event pressure	Sustained rate the replica cannot match
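Two of the warning signs in the table can be checked mechanically. The sketch below is an illustration under stated assumptions: check_replica_threads and low_disk_space are hypothetical helper names, the first reads SHOW REPLICA STATUS\G text on standard input, and the 20% threshold is the one from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a warning for each replication thread whose
# state is anything other than "Yes" (for example "No" or "Connecting").
# Reads the vertical (\G) output of SHOW REPLICA STATUS on stdin.
check_replica_threads() {
  awk '
    $1 == "Replica_IO_Running:"  && $2 != "Yes" { print "WARN IO thread: "  $2 }
    $1 == "Replica_SQL_Running:" && $2 != "Yes" { print "WARN SQL thread: " $2 }
  '
}

# Hypothetical helper: succeed (exit 0) when available space is below 20%
# of the partition capacity.
#   $1  total bytes on the partition
#   $2  available bytes on the partition
low_disk_space() {
  local total=$1 avail=$2
  (( avail * 100 < total * 20 ))
}

# Example usage (assumes a local mysql client):
#   mysql -e "SHOW REPLICA STATUS\G" | check_replica_threads
#   low_disk_space 500000000000 60000000000 && echo "WARN relay log disk below 20% free"
```

The exact field match on Replica_SQL_Running: keeps the check from also matching Replica_SQL_Running_State, which carries free-form text.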

Fixes

Immediate containment: stop the IO thread

If the disk is nearly full, stop fetching events to buy time before the IO thread halts with an out-of-space error:

STOP REPLICA IO_THREAD;

The SQL thread continues applying existing relay logs. Lag will increase and source binlogs may expire, so treat this as a temporary measure. Restart with START REPLICA IO_THREAD only after you have added capacity or reduced the apply bottleneck.

Root cause: enable parallel apply

If replica_parallel_workers is 0 or 1 and the source write rate is high, the SQL thread is the bottleneck. Increase parallel workers if your workload and MySQL version support it. Multi-threaded apply increases CPU and memory usage on the replica, but it is the standard fix for apply-bound replication.
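As a rough sketch of raising the worker count at runtime, assuming MySQL 8.0.26 or later (where the variable is named replica_parallel_workers; older versions use slave_parallel_workers and STOP SLAVE / START SLAVE): the wrapper name set_parallel_workers is hypothetical, and the right worker count depends on your workload.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the mysql client; the worker count is an
# illustrative argument, not a recommendation.
set_parallel_workers() {
  local workers=$1
  # Reject anything that is not a plain non-negative integer.
  case "$workers" in
    ''|*[!0-9]*)
      echo "usage: set_parallel_workers <count>" >&2
      return 1
      ;;
  esac
  # A changed worker count only takes effect when the SQL thread restarts,
  # so the thread is stopped and started around the SET.
  mysql -e "STOP REPLICA SQL_THREAD;
            SET GLOBAL replica_parallel_workers = ${workers};
            START REPLICA SQL_THREAD;"
}

# Example usage:
#   set_parallel_workers 4
```

SET GLOBAL does not survive a server restart, so also record the value in the server configuration once you have confirmed it helps. Restarting the SQL thread during a large in-flight transaction means that transaction is re-applied, so pick a quiet moment if you can.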

Root cause: re-enable relay log purge

If relay_log_purge was disabled, re-enable it so applied files are removed automatically:

SET GLOBAL relay_log_purge = ON;

If you disabled it to retain logs for debugging, copy the files you need before re-enabling.

Root cause: large transaction

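If a single large transaction caused the spike, wait for the SQL thread to finish applying it. Avoid stopping the SQL thread: doing so does not free disk space, because relay logs are purged only after they are applied. On transactional tables such as InnoDB, the partially applied transaction is rolled back and replayed from its beginning on restart; on non-transactional tables, an interrupted replay can leave tables in an inconsistent state. If you must stop, use STOP REPLICA SQL_THREAD and be prepared to handle the transaction boundary on restart.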

Disk full recovery

If the disk is already full and the IO thread has stopped, free space without touching relay logs. Remove old application or OS logs, temporary files, or MySQL error/slow query logs. If the replica has binary logging enabled and is not a source to other replicas, purge its oldest binary logs.

Warning: Do not delete relay logs manually. The SQL thread expects the relay log index to match the filesystem; removing files causes replication errors. If MySQL cannot start because the disk is completely full, move the oldest relay log files to another filesystem only as a last resort. If the SQL thread has not yet processed them, the replica must re-fetch those events from the source, which works only while the source still holds the corresponding binary logs; otherwise you must rebuild the replica.

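After freeing space, start MySQL and check Replica_IO_Running. If it is not Yes, run START REPLICA IO_THREAD, then check Last_IO_Error. If the source has already purged the binary log file the replica needs to resume from (Source_Log_File, or Relay_Source_Log_File if relay logs must be re-fetched; these are Master_Log_File and Relay_Master_Log_File before MySQL 8.0.22), the IO thread fails with an error and you need a full resync.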

Prevention

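Set relay_log_space_limit to a value lower than the partition size. When the limit is reached, the IO thread pauses instead of filling the filesystem. The variable is not dynamic, so set it in the server configuration and restart; the default of 0 means no limit. Leave headroom below the partition size, because the IO thread can temporarily exceed the limit when the SQL thread needs more events to finish a transaction and free a relay log.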
Keep relay_log_purge enabled so applied logs are removed automatically.
Monitor Relay_Log_Space as a trend. Steady growth is a leading indicator of lag.
Size the relay log partition independently from the data directory with enough headroom for the maximum expected lag window.
Verify that replicas use multi-threaded apply if the source is write-heavy. Single-threaded replicas on busy sources are likely to accumulate relay logs during peak traffic.
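To make the first prevention item concrete, the sketch below derives a relay_log_space_limit value from the partition size and prints it as a configuration line. The helper name suggest_relay_log_space_limit is hypothetical, and the 50% default is an arbitrary illustration rather than a recommended ratio; choose a share that fits how the partition is used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a relay_log_space_limit line sized as a
# percentage of the relay log partition.
#   $1  partition size in bytes
#   $2  percentage of the partition to allow (defaults to 50, an
#       illustrative value only)
suggest_relay_log_space_limit() {
  local partition_bytes=$1 percent=${2:-50}
  echo "relay_log_space_limit = $(( partition_bytes * percent / 100 ))"
}

# Example usage (assumes GNU df; adjust the path to your relay log location):
#   size=$(df -B1 --output=size /var/lib/mysql | tail -n 1 | tr -d ' ')
#   suggest_relay_log_space_limit "$size" 40
```

Place the printed line under the [mysqld] section of the server configuration file; it is read at server startup.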

How Netdata helps

Correlates Relay_Log_Space, Seconds_Behind_Source, and disk utilization on the replica to surface the compounding pattern in one view.
Alerts on replication thread state changes (for example, the IO thread stopping) without manual polling of SHOW REPLICA STATUS.
Per-second source write metrics (Com_insert, Com_update, Com_delete) can be compared against replica throughput to identify apply bottlenecks before disk fills.
Disk space alerts on the relay log partition provide early warning while there is still time to stop the IO thread gracefully.
