$ guides / mysql / mysql-replica-io-sql-thread-stopped ▌

Operations Guides

MySQL Replica_IO_Running / Replica_SQL_Running not Yes: replication stopped

SHOW REPLICA STATUS\G (MySQL 8.0+) or SHOW SLAVE STATUS\G (MySQL 5.7) shows Replica_IO_Running or Replica_SQL_Running is not Yes. The IO thread may hang in Connecting, or the SQL thread may stop with an error. Replication no longer makes progress. Lag grows without bound, and if the source purges binary logs before recovery, you face a full rebuild.

Unlike many MySQL problems, a stopped replication thread does not self-heal. The SQL thread halts on the first apply error and waits for operator intervention. The IO thread retries automatically, but only up to the configured retry limit. Seconds_Behind_Source can show 0 even when the IO thread is disconnected and the SQL thread has consumed the relay log, so lag alone is not a reliable health signal.

This guide covers diagnosing which thread failed, reading the error fields, and choosing between in-place recovery and rebuilding.

What this means

MySQL replication uses two threads. The IO thread fetches binary log events from the source and writes them to the local relay log. The SQL thread reads the relay log and applies events to the replica’s data. Healthy replication requires both threads in the Yes state simultaneously.

If Replica_IO_Running is No, the replica cannot fetch new events. Common causes include network partitions, authentication failures, or the source purging a binary log file the replica still needs. The IO thread auto-retries every Connect_Retry seconds (default 60) until it succeeds or exhausts SOURCE_RETRY_COUNT.

If Replica_SQL_Running is No, events pile up in the relay log but are not applied. This is almost always an apply error: duplicate key, schema mismatch, or a transaction the replica cannot execute. The SQL thread does not auto-recover after an error. It stays stopped until you intervene.

With multi-threaded applier (replica_parallel_workers > 1), a single worker failing stops the coordinator and all workers. The coordinator drains its queue first, so there is a brief delay before everything halts. Check performance_schema.replication_applier_status_by_worker for per-worker detail.

flowchart TD
    A[SHOW REPLICA STATUS] --> B{IO thread Yes?}
    B -->|No| C[Check Last_IO_Error network auth binlog]
    B -->|Yes| D{SQL thread Yes?}
    D -->|No| E[Check Last_SQL_Error schema data drift]
    D -->|Yes| F[Check lag and relay log growth]
    C --> G[Fix source connectivity or rebuild]
    E --> H[Fix schema or resync table]

Common causes

Cause	What it looks like	First thing to check
Network, auth, or source unreachable	`Replica_IO_Running`: `No` or `Connecting`; `Last_IO_Error` shows connection failure or authentication failure	Source port reachability, `performance_schema.host_cache`, credential consistency, and whether the source has `skip_networking` enabled
Source purged binlog before replica fetched it	`Replica_IO_Running`: `No`; `Last_IO_Error` references fatal error 1236 or a missing log file	`SHOW BINARY LOGS` on the source; compare to the replica’s `Relay_Source_Log_File` and `Read_Source_Log_Pos` or GTID sets
Duplicate `server_id`	IO thread stops immediately after starting with a fatal error	`SELECT @@server_id` on both source and replica
Schema drift or apply error	`Replica_SQL_Running`: `No`; `Last_SQL_Error` shows duplicate key, missing table, or column mismatch	The failing query in `Last_SQL_Error`; compare schema objects between source and replica
Relay log space limit exceeded	IO thread stalls; `Relay_Log_Space` near limit	`relay_log_space_limit` and available disk on the replica
Multi-threaded applier worker failure	`Replica_SQL_Running`: `No`; coordinator stopped	`performance_schema.replication_applier_status_by_worker` for the specific worker error

Quick checks

Run these safe, read-only checks before making changes.

-- Check thread states and last errors
SHOW REPLICA STATUS\G

(On MySQL 5.7, use SHOW SLAVE STATUS\G.)

-- Inspect per-worker errors with multi-threaded applier
SELECT * FROM performance_schema.replication_applier_status_by_worker\G

-- Verify the source still holds the binary log the replica needs
SHOW BINARY LOGS;

-- Check for duplicate server_id
SELECT @@server_id;

-- Compare GTID sets if GTID is enabled
-- On source:
SELECT @@global.gtid_executed;
-- On replica compare Retrieved_Gtid_Set and Executed_Gtid_Set
SHOW REPLICA STATUS\G

-- Check relay log disk consumption
SHOW REPLICA STATUS\G
-- Field: Relay_Log_Space

How to diagnose it

Determine which thread stopped. Look at Replica_IO_Running and Replica_SQL_Running. If both are Yes but lag is growing, see the MySQL slow after restart: buffer pool warm-up and the cold cache guide. This article focuses on threads that are not Yes.
If the IO thread is No or Connecting for more than five minutes, read Last_IO_Error. This field contains the exact reason the IO thread stopped. Common messages include connection refused, access denied, or unknown binary log file.
Verify source accessibility. From the replica host, confirm TCP connectivity to the source’s MySQL port. Check whether the source has skip_networking enabled, whether the replication user still exists, and whether a firewall or security group has changed. Also verify that the source and replica do not share the same server_id.
Verify the source still has the required binary logs. Run SHOW BINARY LOGS on the source and compare the oldest file to the replica’s Relay_Source_Log_File and Read_Source_Log_Pos. If the source has purged past the replica’s position, the replica cannot reconnect without a rebuild.
If the SQL thread is No, read Last_SQL_Error carefully. The error message includes the exact query or event that failed. Note the error code and the database. Duplicate key errors usually mean split-brain writes or a previously skipped transaction. Missing table or column errors mean schema drift.
If using multi-threaded applier, check worker errors. SHOW REPLICA STATUS shows the coordinator error, but performance_schema.replication_applier_status_by_worker reveals which specific worker failed and on what transaction.
Check Relay_Log_Space on the replica. If the SQL thread is stopped but the IO thread is running, relay logs will accumulate. If the disk fills, the IO thread will also stop. If Relay_Log_Space is growing toward the partition limit, you can stop the IO thread to prevent disk exhaustion while you fix the SQL thread. Warning: STOP REPLICA IO_THREAD halts all event fetching. Use only if disk exhaustion is imminent.
If GTID is enabled, compare GTID sets. On the source, SELECT @@global.gtid_executed. On the replica, compare Retrieved_Gtid_Set and Executed_Gtid_Set. If the source has transactions the replica does not, that is normal lag. If the replica has transactions the source does not (GTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed) is non-empty), you have errant transactions that must be resolved before failover.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`Replica_IO_Running` / `Replica_SQL_Running`	Binary replication health	Any value other than `Yes` sustained longer than 60 seconds
`Seconds_Behind_Source`	Data freshness and whether lag is growing	`NULL` (replication broken) or monotonically increasing
`Relay_Log_Space`	Disk consumption on the replica	Growing continuously while the SQL thread is stopped
`Last_IO_Error` / `Last_SQL_Error`	Specific failure reason	Any non-empty error string
GTID set difference	Exact transaction-level divergence	`GTID_SUBTRACT(source_gtid_set, replica_gtid_set)` returns missing GTIDs while threads are stopped

Fixes

IO thread stopped

Network or authentication failure. Fix the network path, credentials, or source configuration. Verify the source is not blocking connections via bind_address or firewall rules. Once resolved, run START REPLICA IO_THREAD. Before starting, verify the source still holds the binary logs the replica needs if the IO thread was down for an extended time.

Source binlog purged. If the source has already removed the binary log file the replica needs, the replica cannot catch up. Your options are to rebuild the replica from a fresh backup or, if GTID is enabled and another replica holds the missing transactions, to provision from that replica. Do not attempt to skip binary log events manually unless you fully understand the data impact.

Duplicate server_id. Change the replica’s server_id to a unique value, then restart the replica threads.

SQL thread stopped

Schema drift. If the replica is missing a column, table, or index that exists on the source, apply the missing DDL to the replica, then run START REPLICA SQL_THREAD. If the drift is widespread, rebuild instead of fixing piecemeal.

Duplicate key or data divergence. This usually means a write reached the replica directly (split-brain) or a skipped transaction created inconsistency. Query the failing row on the source using the key from the error message, reconcile the data, and restart the SQL thread. If divergence is extensive, rebuild the replica. Avoid using sql_slave_skip_counter or manual GTID skips as a routine fix; each skip creates an errant transaction that complicates failover.

Multi-threaded applier worker failure. Query performance_schema.replication_applier_status_by_worker to find the worker that errored. Fix the underlying schema or data issue for that worker’s assigned database or transaction, then restart the SQL thread.

Tradeoffs. In-place fixes are faster but risk compounding divergence. Rebuilding from backup is slower and adds load to the source or backup infrastructure, but guarantees a consistent replica.

Prevention

Page if Replica_SQL_Running is not Yes for 60 seconds, if Replica_IO_Running is No for 60 seconds, or if Replica_IO_Running hangs in Connecting for 300 seconds.
Set source binary log expiry longer than maximum expected lag plus a safety margin.
Monitor Relay_Log_Space growth and disk free space on the replica.
Use GTID to simplify position management and enable precise consistency checks.
Set read_only or super_read_only on replicas to prevent split-brain writes.
Verify unique server_id across every instance in the topology during provisioning.

How Netdata helps

Collects Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source, Relay_Log_Space, and error fields from SHOW REPLICA STATUS.
Correlates replication thread state with source write rate (Com_insert, Com_update, Com_delete) to distinguish network outages from apply bottlenecks.
Alerts on thread state transitions with sustained durations: SQL thread stopped for 60 seconds, IO thread stopped for 60 seconds, or IO thread stuck in Connecting for 300 seconds.
Tracks relay log growth rate alongside disk utilization to warn before the replica runs out of space.
Surfaces multi-threaded applier worker states and errors where instrumentation is available.

The Netdata solution

MySQL monitoring with Netdata

Netdata monitors MySQL and MariaDB with per-second metrics and ML anomaly detection. Track connection usage, query throughput, slow queries, redo-log pressure, and replication lag alongside the host and storage signals that explain them.

See MySQL monitoring → Start monitoring free

MySQL Replica_IO_Running / Replica_SQL_Running not Yes: replication stopped

MySQL Replica_IO_Running / Replica_SQL_Running not Yes: replication stopped

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

IO thread stopped

SQL thread stopped

Prevention

How Netdata helps

Related guides

MySQL monitoring with Netdata