MySQL Replica_IO_Running / Replica_SQL_Running not Yes: replication stopped

SHOW REPLICA STATUS\G (MySQL 8.0+) or SHOW SLAVE STATUS\G (MySQL 5.7) shows Replica_IO_Running or Replica_SQL_Running is not Yes. The IO thread may hang in Connecting, or the SQL thread may stop with an error. Replication no longer makes progress. Lag grows without bound, and if the source purges binary logs before recovery, you face a full rebuild.

Unlike many MySQL problems, a stopped replication thread does not self-heal. The SQL thread halts on the first apply error and waits for operator intervention. The IO thread retries automatically, but only up to the configured retry limit. Seconds_Behind_Source can show 0 even when the IO thread is disconnected and the SQL thread has consumed the relay log, so lag alone is not a reliable health signal.

This guide covers diagnosing which thread failed, reading the error fields, and choosing between in-place recovery and rebuilding.

What this means

MySQL replication uses two threads. The IO thread fetches binary log events from the source and writes them to the local relay log. The SQL thread reads the relay log and applies events to the replica’s data. Healthy replication requires both threads in the Yes state simultaneously.

If Replica_IO_Running is No, the replica cannot fetch new events. Common causes include network partitions, authentication failures, or the source purging a binary log file the replica still needs. The IO thread auto-retries every Connect_Retry seconds (default 60) until it succeeds or exhausts SOURCE_RETRY_COUNT.

If Replica_SQL_Running is No, events pile up in the relay log but are not applied. This is almost always an apply error: duplicate key, schema mismatch, or a transaction the replica cannot execute. The SQL thread does not auto-recover after an error. It stays stopped until you intervene.

With multi-threaded applier (replica_parallel_workers > 1), a single worker failing stops the coordinator and all workers. The coordinator drains its queue first, so there is a brief delay before everything halts. Check performance_schema.replication_applier_status_by_worker for per-worker detail.

flowchart TD
    A[SHOW REPLICA STATUS] --> B{IO thread Yes?}
    B -->|No| C[Check Last_IO_Error network auth binlog]
    B -->|Yes| D{SQL thread Yes?}
    D -->|No| E[Check Last_SQL_Error schema data drift]
    D -->|Yes| F[Check lag and relay log growth]
    C --> G[Fix source connectivity or rebuild]
    E --> H[Fix schema or resync table]

Common causes

CauseWhat it looks likeFirst thing to check
Network, auth, or source unreachableReplica_IO_Running: No or Connecting; Last_IO_Error shows connection failure or authentication failureSource port reachability, performance_schema.host_cache, credential consistency, and whether the source has skip_networking enabled
Source purged binlog before replica fetched itReplica_IO_Running: No; Last_IO_Error references fatal error 1236 or a missing log fileSHOW BINARY LOGS on the source; compare to the replica’s Relay_Source_Log_File and Read_Source_Log_Pos or GTID sets
Duplicate server_idIO thread stops immediately after starting with a fatal errorSELECT @@server_id on both source and replica
Schema drift or apply errorReplica_SQL_Running: No; Last_SQL_Error shows duplicate key, missing table, or column mismatchThe failing query in Last_SQL_Error; compare schema objects between source and replica
Relay log space limit exceededIO thread stalls; Relay_Log_Space near limitrelay_log_space_limit and available disk on the replica
Multi-threaded applier worker failureReplica_SQL_Running: No; coordinator stoppedperformance_schema.replication_applier_status_by_worker for the specific worker error

Quick checks

Run these safe, read-only checks before making changes.

-- Check thread states and last errors
SHOW REPLICA STATUS\G

(On MySQL 5.7, use SHOW SLAVE STATUS\G.)

-- Inspect per-worker errors with multi-threaded applier
SELECT * FROM performance_schema.replication_applier_status_by_worker\G
-- Verify the source still holds the binary log the replica needs
SHOW BINARY LOGS;
-- Check for duplicate server_id
SELECT @@server_id;
-- Compare GTID sets if GTID is enabled
-- On source:
SELECT @@global.gtid_executed;
-- On replica compare Retrieved_Gtid_Set and Executed_Gtid_Set
SHOW REPLICA STATUS\G
-- Check relay log disk consumption
SHOW REPLICA STATUS\G
-- Field: Relay_Log_Space

How to diagnose it

  1. Determine which thread stopped. Look at Replica_IO_Running and Replica_SQL_Running. If both are Yes but lag is growing, see the MySQL slow after restart: buffer pool warm-up and the cold cache guide. This article focuses on threads that are not Yes.

  2. If the IO thread is No or Connecting for more than five minutes, read Last_IO_Error. This field contains the exact reason the IO thread stopped. Common messages include connection refused, access denied, or unknown binary log file.

  3. Verify source accessibility. From the replica host, confirm TCP connectivity to the source’s MySQL port. Check whether the source has skip_networking enabled, whether the replication user still exists, and whether a firewall or security group has changed. Also verify that the source and replica do not share the same server_id.

  4. Verify the source still has the required binary logs. Run SHOW BINARY LOGS on the source and compare the oldest file to the replica’s Relay_Source_Log_File and Read_Source_Log_Pos. If the source has purged past the replica’s position, the replica cannot reconnect without a rebuild.

  5. If the SQL thread is No, read Last_SQL_Error carefully. The error message includes the exact query or event that failed. Note the error code and the database. Duplicate key errors usually mean split-brain writes or a previously skipped transaction. Missing table or column errors mean schema drift.

  6. If using multi-threaded applier, check worker errors. SHOW REPLICA STATUS shows the coordinator error, but performance_schema.replication_applier_status_by_worker reveals which specific worker failed and on what transaction.

  7. Check Relay_Log_Space on the replica. If the SQL thread is stopped but the IO thread is running, relay logs will accumulate. If the disk fills, the IO thread will also stop. If Relay_Log_Space is growing toward the partition limit, you can stop the IO thread to prevent disk exhaustion while you fix the SQL thread. Warning: STOP REPLICA IO_THREAD halts all event fetching. Use only if disk exhaustion is imminent.

  8. If GTID is enabled, compare GTID sets. On the source, SELECT @@global.gtid_executed. On the replica, compare Retrieved_Gtid_Set and Executed_Gtid_Set. If the source has transactions the replica does not, that is normal lag. If the replica has transactions the source does not (GTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed) is non-empty), you have errant transactions that must be resolved before failover.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Replica_IO_Running / Replica_SQL_RunningBinary replication healthAny value other than Yes sustained longer than 60 seconds
Seconds_Behind_SourceData freshness and whether lag is growingNULL (replication broken) or monotonically increasing
Relay_Log_SpaceDisk consumption on the replicaGrowing continuously while the SQL thread is stopped
Last_IO_Error / Last_SQL_ErrorSpecific failure reasonAny non-empty error string
GTID set differenceExact transaction-level divergenceGTID_SUBTRACT(source_gtid_set, replica_gtid_set) returns missing GTIDs while threads are stopped

Fixes

IO thread stopped

Network or authentication failure. Fix the network path, credentials, or source configuration. Verify the source is not blocking connections via bind_address or firewall rules. Once resolved, run START REPLICA IO_THREAD. Before starting, verify the source still holds the binary logs the replica needs if the IO thread was down for an extended time.

Source binlog purged. If the source has already removed the binary log file the replica needs, the replica cannot catch up. Your options are to rebuild the replica from a fresh backup or, if GTID is enabled and another replica holds the missing transactions, to provision from that replica. Do not attempt to skip binary log events manually unless you fully understand the data impact.

Duplicate server_id. Change the replica’s server_id to a unique value, then restart the replica threads.

SQL thread stopped

Schema drift. If the replica is missing a column, table, or index that exists on the source, apply the missing DDL to the replica, then run START REPLICA SQL_THREAD. If the drift is widespread, rebuild instead of fixing piecemeal.

Duplicate key or data divergence. This usually means a write reached the replica directly (split-brain) or a skipped transaction created inconsistency. Query the failing row on the source using the key from the error message, reconcile the data, and restart the SQL thread. If divergence is extensive, rebuild the replica. Avoid using sql_slave_skip_counter or manual GTID skips as a routine fix; each skip creates an errant transaction that complicates failover.

Multi-threaded applier worker failure. Query performance_schema.replication_applier_status_by_worker to find the worker that errored. Fix the underlying schema or data issue for that worker’s assigned database or transaction, then restart the SQL thread.

Tradeoffs. In-place fixes are faster but risk compounding divergence. Rebuilding from backup is slower and adds load to the source or backup infrastructure, but guarantees a consistent replica.

Prevention

  • Page if Replica_SQL_Running is not Yes for 60 seconds, if Replica_IO_Running is No for 60 seconds, or if Replica_IO_Running hangs in Connecting for 300 seconds.
  • Set source binary log expiry longer than maximum expected lag plus a safety margin.
  • Monitor Relay_Log_Space growth and disk free space on the replica.
  • Use GTID to simplify position management and enable precise consistency checks.
  • Set read_only or super_read_only on replicas to prevent split-brain writes.
  • Verify unique server_id across every instance in the topology during provisioning.

How Netdata helps

  • Collects Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source, Relay_Log_Space, and error fields from SHOW REPLICA STATUS.
  • Correlates replication thread state with source write rate (Com_insert, Com_update, Com_delete) to distinguish network outages from apply bottlenecks.
  • Alerts on thread state transitions with sustained durations: SQL thread stopped for 60 seconds, IO thread stopped for 60 seconds, or IO thread stuck in Connecting for 300 seconds.
  • Tracks relay log growth rate alongside disk utilization to warn before the replica runs out of space.
  • Surfaces multi-threaded applier worker states and errors where instrumentation is available.