MySQL Replica_IO_Running / Replica_SQL_Running not Yes: replication stopped
SHOW REPLICA STATUS\G (MySQL 8.0+) or SHOW SLAVE STATUS\G (MySQL 5.7) shows Replica_IO_Running or Replica_SQL_Running is not Yes. The IO thread may hang in Connecting, or the SQL thread may stop with an error. Replication no longer makes progress. Lag grows without bound, and if the source purges binary logs before recovery, you face a full rebuild.
Unlike many MySQL problems, a stopped replication thread does not self-heal. The SQL thread halts on the first apply error and waits for operator intervention. The IO thread retries automatically, but only up to the configured retry limit. Seconds_Behind_Source can show 0 even when the IO thread is disconnected and the SQL thread has consumed the relay log, so lag alone is not a reliable health signal.
This guide covers diagnosing which thread failed, reading the error fields, and choosing between in-place recovery and rebuilding.
What this means
MySQL replication uses two threads. The IO thread fetches binary log events from the source and writes them to the local relay log. The SQL thread reads the relay log and applies events to the replica’s data. Healthy replication requires both threads in the Yes state simultaneously.
If Replica_IO_Running is No, the replica cannot fetch new events. Common causes include network partitions, authentication failures, or the source purging a binary log file the replica still needs. The IO thread auto-retries every Connect_Retry seconds (default 60) until it succeeds or exhausts SOURCE_RETRY_COUNT.
If Replica_SQL_Running is No, events pile up in the relay log but are not applied. This is almost always an apply error: duplicate key, schema mismatch, or a transaction the replica cannot execute. The SQL thread does not auto-recover after an error. It stays stopped until you intervene.
With multi-threaded applier (replica_parallel_workers > 1), a single worker failing stops the coordinator and all workers. The coordinator drains its queue first, so there is a brief delay before everything halts. Check performance_schema.replication_applier_status_by_worker for per-worker detail.
flowchart TD
A[SHOW REPLICA STATUS] --> B{IO thread Yes?}
B -->|No| C[Check Last_IO_Error network auth binlog]
B -->|Yes| D{SQL thread Yes?}
D -->|No| E[Check Last_SQL_Error schema data drift]
D -->|Yes| F[Check lag and relay log growth]
C --> G[Fix source connectivity or rebuild]
E --> H[Fix schema or resync table]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Network, auth, or source unreachable | Replica_IO_Running: No or Connecting; Last_IO_Error shows connection failure or authentication failure | Source port reachability, performance_schema.host_cache, credential consistency, and whether the source has skip_networking enabled |
| Source purged binlog before replica fetched it | Replica_IO_Running: No; Last_IO_Error references fatal error 1236 or a missing log file | SHOW BINARY LOGS on the source; compare to the replica’s Relay_Source_Log_File and Read_Source_Log_Pos or GTID sets |
Duplicate server_id | IO thread stops immediately after starting with a fatal error | SELECT @@server_id on both source and replica |
| Schema drift or apply error | Replica_SQL_Running: No; Last_SQL_Error shows duplicate key, missing table, or column mismatch | The failing query in Last_SQL_Error; compare schema objects between source and replica |
| Relay log space limit exceeded | IO thread stalls; Relay_Log_Space near limit | relay_log_space_limit and available disk on the replica |
| Multi-threaded applier worker failure | Replica_SQL_Running: No; coordinator stopped | performance_schema.replication_applier_status_by_worker for the specific worker error |
Quick checks
Run these safe, read-only checks before making changes.
-- Check thread states and last errors
SHOW REPLICA STATUS\G
(On MySQL 5.7, use SHOW SLAVE STATUS\G.)
-- Inspect per-worker errors with multi-threaded applier
SELECT * FROM performance_schema.replication_applier_status_by_worker\G
-- Verify the source still holds the binary log the replica needs
SHOW BINARY LOGS;
-- Check for duplicate server_id
SELECT @@server_id;
-- Compare GTID sets if GTID is enabled
-- On source:
SELECT @@global.gtid_executed;
-- On replica compare Retrieved_Gtid_Set and Executed_Gtid_Set
SHOW REPLICA STATUS\G
-- Check relay log disk consumption
SHOW REPLICA STATUS\G
-- Field: Relay_Log_Space
How to diagnose it
Determine which thread stopped. Look at
Replica_IO_RunningandReplica_SQL_Running. If both areYesbut lag is growing, see the MySQL slow after restart: buffer pool warm-up and the cold cache guide. This article focuses on threads that are notYes.If the IO thread is
NoorConnectingfor more than five minutes, readLast_IO_Error. This field contains the exact reason the IO thread stopped. Common messages include connection refused, access denied, or unknown binary log file.Verify source accessibility. From the replica host, confirm TCP connectivity to the source’s MySQL port. Check whether the source has
skip_networkingenabled, whether the replication user still exists, and whether a firewall or security group has changed. Also verify that the source and replica do not share the sameserver_id.Verify the source still has the required binary logs. Run
SHOW BINARY LOGSon the source and compare the oldest file to the replica’sRelay_Source_Log_FileandRead_Source_Log_Pos. If the source has purged past the replica’s position, the replica cannot reconnect without a rebuild.If the SQL thread is
No, readLast_SQL_Errorcarefully. The error message includes the exact query or event that failed. Note the error code and the database. Duplicate key errors usually mean split-brain writes or a previously skipped transaction. Missing table or column errors mean schema drift.If using multi-threaded applier, check worker errors.
SHOW REPLICA STATUSshows the coordinator error, butperformance_schema.replication_applier_status_by_workerreveals which specific worker failed and on what transaction.Check
Relay_Log_Spaceon the replica. If the SQL thread is stopped but the IO thread is running, relay logs will accumulate. If the disk fills, the IO thread will also stop. IfRelay_Log_Spaceis growing toward the partition limit, you can stop the IO thread to prevent disk exhaustion while you fix the SQL thread. Warning:STOP REPLICA IO_THREADhalts all event fetching. Use only if disk exhaustion is imminent.If GTID is enabled, compare GTID sets. On the source,
SELECT @@global.gtid_executed. On the replica, compareRetrieved_Gtid_SetandExecuted_Gtid_Set. If the source has transactions the replica does not, that is normal lag. If the replica has transactions the source does not (GTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed)is non-empty), you have errant transactions that must be resolved before failover.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Replica_IO_Running / Replica_SQL_Running | Binary replication health | Any value other than Yes sustained longer than 60 seconds |
Seconds_Behind_Source | Data freshness and whether lag is growing | NULL (replication broken) or monotonically increasing |
Relay_Log_Space | Disk consumption on the replica | Growing continuously while the SQL thread is stopped |
Last_IO_Error / Last_SQL_Error | Specific failure reason | Any non-empty error string |
| GTID set difference | Exact transaction-level divergence | GTID_SUBTRACT(source_gtid_set, replica_gtid_set) returns missing GTIDs while threads are stopped |
Fixes
IO thread stopped
Network or authentication failure. Fix the network path, credentials, or source configuration. Verify the source is not blocking connections via bind_address or firewall rules. Once resolved, run START REPLICA IO_THREAD. Before starting, verify the source still holds the binary logs the replica needs if the IO thread was down for an extended time.
Source binlog purged. If the source has already removed the binary log file the replica needs, the replica cannot catch up. Your options are to rebuild the replica from a fresh backup or, if GTID is enabled and another replica holds the missing transactions, to provision from that replica. Do not attempt to skip binary log events manually unless you fully understand the data impact.
Duplicate server_id. Change the replica’s server_id to a unique value, then restart the replica threads.
SQL thread stopped
Schema drift. If the replica is missing a column, table, or index that exists on the source, apply the missing DDL to the replica, then run START REPLICA SQL_THREAD. If the drift is widespread, rebuild instead of fixing piecemeal.
Duplicate key or data divergence. This usually means a write reached the replica directly (split-brain) or a skipped transaction created inconsistency. Query the failing row on the source using the key from the error message, reconcile the data, and restart the SQL thread. If divergence is extensive, rebuild the replica. Avoid using sql_slave_skip_counter or manual GTID skips as a routine fix; each skip creates an errant transaction that complicates failover.
Multi-threaded applier worker failure. Query performance_schema.replication_applier_status_by_worker to find the worker that errored. Fix the underlying schema or data issue for that worker’s assigned database or transaction, then restart the SQL thread.
Tradeoffs. In-place fixes are faster but risk compounding divergence. Rebuilding from backup is slower and adds load to the source or backup infrastructure, but guarantees a consistent replica.
Prevention
- Page if
Replica_SQL_Runningis notYesfor 60 seconds, ifReplica_IO_RunningisNofor 60 seconds, or ifReplica_IO_Runninghangs inConnectingfor 300 seconds. - Set source binary log expiry longer than maximum expected lag plus a safety margin.
- Monitor
Relay_Log_Spacegrowth and disk free space on the replica. - Use GTID to simplify position management and enable precise consistency checks.
- Set
read_onlyorsuper_read_onlyon replicas to prevent split-brain writes. - Verify unique
server_idacross every instance in the topology during provisioning.
How Netdata helps
- Collects
Replica_IO_Running,Replica_SQL_Running,Seconds_Behind_Source,Relay_Log_Space, and error fields fromSHOW REPLICA STATUS. - Correlates replication thread state with source write rate (
Com_insert,Com_update,Com_delete) to distinguish network outages from apply bottlenecks. - Alerts on thread state transitions with sustained durations: SQL thread stopped for 60 seconds, IO thread stopped for 60 seconds, or IO thread stuck in
Connectingfor 300 seconds. - Tracks relay log growth rate alongside disk utilization to warn before the replica runs out of space.
- Surfaces multi-threaded applier worker states and errors where instrumentation is available.
Related guides
- How MySQL actually works in production: a mental model for operators
- MySQL Aborted_connects and Aborted_clients climbing: diagnosis
- MySQL adaptive hash index latch contention: high CPU, low throughput
- MySQL InnoDB buffer pool hit ratio collapse: the cliff edge
- MySQL slow after restart: buffer pool warm-up and the cold cache
- MySQL innodb_buffer_pool_size tuning: 60-80% of RAM and when that breaks
- MySQL Innodb_buffer_pool_wait_free > 0: buffer pool memory pressure
- MySQL InnoDB checkpoint age: the redo log capacity signal nobody watches
- MySQL connection exhaustion: detection, diagnosis, and prevention
- MySQL innodb_deadlock_detect=OFF: when deadlock detection becomes the bottleneck
- MySQL ERROR 1213: Deadlock found when trying to get lock; try restarting transaction
- MySQL FLUSH TABLES WITH READ LOCK stall: backups that freeze the server







