MySQL semi-synchronous replication stall: commits hanging on ACK

Commits suddenly take 10 seconds or more and then time out. On the MySQL primary, Threads_running climbs while the commit rate flatlines. A moment later, commits resume, but Rpl_semi_sync_source_no_tx is ticking upward. The primary is waiting for a semi-synchronous replica to acknowledge receipt of binlog events, and that ACK is not arriving fast enough or at all.

Confirm the stall, find the missing ACK, and decide whether to fix the replica path or temporarily fall back to asynchronous replication.

What this means

In semi-synchronous replication, the source blocks each commit until at least rpl_semi_sync_source_wait_for_replica_count replicas acknowledge that they have received the transaction's events and flushed them to their relay log. The replica does not need to execute the transaction; its I/O thread only needs to persist the events. If no ACK arrives within rpl_semi_sync_source_timeout milliseconds, the source reverts to asynchronous replication and stays there until a semi-sync replica catches up.

When a replica is slow, disconnected, or unable to flush its relay log, the source waits. With rpl_semi_sync_source_wait_no_replica set to ON (the default), the source keeps waiting for the full timeout even if the number of connected semi-sync replicas drops below rpl_semi_sync_source_wait_for_replica_count during the wait. Only when the variable is OFF does the source revert to asynchronous replication as soon as the replica count falls below the required number. The result is a burst of commit latency that cascades into connection pile-up and application timeouts.

flowchart TD
    A[Client commits on source] --> B{Semi-sync enabled?}
    B -->|Yes| C[Wait for replica ACK]
    C --> D{ACK received within timeout?}
    D -->|Yes| E[Commit returns]
    D -->|No| F[Silently fall back to async]
    F --> E
    B -->|No| G[Commit async]
    G --> E
    H[Slow replica or network loss] --> I[Relay log flush delayed]
    I --> C

The default rpl_semi_sync_source_timeout is 10000 ms (10 seconds). For an OLTP workload, a 10-second commit stall is effectively an outage. This is the trade-off semi-sync makes between durability and availability.
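The flowchart above can also be read as a small decision function. The following is a minimal Python sketch of that logic for a single waiting commit, not MySQL's implementation; `commit_outcome`, its parameters, and the returned labels are hypothetical names chosen for illustration.

```python
from typing import Optional, Tuple


def commit_outcome(
    semi_sync_enabled: bool,
    ack_delay_ms: Optional[float],
    timeout_ms: float = 10_000,
) -> Tuple[float, str]:
    """Return (extra commit wait in ms, replication mode) for one commit.

    ack_delay_ms is how long the replica takes to acknowledge the event,
    or None if no ACK ever arrives (replica down or partitioned).
    """
    if not semi_sync_enabled:
        return 0.0, "async"
    if ack_delay_ms is not None and ack_delay_ms <= timeout_ms:
        return float(ack_delay_ms), "semi-sync"
    # No ACK in time: the commit still returns, but only after the full timeout.
    return float(timeout_ms), "async-fallback"


print(commit_outcome(True, 2))      # healthy replica
print(commit_outcome(True, None))   # replica unreachable
print(commit_outcome(False, None))  # semi-sync disabled
```

With the default timeout, the unreachable-replica case is the one that costs the waiting commit a full 10 seconds before it returns.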

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Replica network partition or latency spike | Rpl_semi_sync_source_clients drops; replica I/O thread shows “Connecting” | SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_clients' and network latency between nodes |
| Replica relay-log device saturation | Replica I/O thread is running but ACKs are slow; replica disk write queue is high | iostat -x on the replica relay log partition |
| Replica crashed or I/O thread stopped | No ACK possible; replica is unreachable or replication broken | SHOW REPLICA STATUS on the replica for Replica_IO_Running and Last_IO_Error |
| Timeout configured too high for SLO | Commits hang for the full duration of rpl_semi_sync_source_timeout before fallback | SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_source_timeout' |
| Plugin mismatch after upgrade | Semi-sync variables missing; Rpl_semi_sync_source_status stays OFF despite plugin listed in mysql.plugin | SHOW PLUGINS and SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%' |

Quick checks

Run these read-only checks on the source first, then on the replica.

-- Check semi-sync source status and client count
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_%';
-- Check how many replicas are required vs connected
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_source_wait_for_replica_count';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_clients';
-- Check whether commits are timing out
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_no_tx';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_yes_tx';

On the replica:

-- Check replica I/O thread health
SHOW REPLICA STATUS\G
-- Check if the semi-sync replica plugin is loaded
SELECT PLUGIN_NAME, PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS WHERE PLUGIN_NAME LIKE '%semi_sync%';

From a shell on the replica host:

# Check relay log device write latency on the replica
# Look for high await or queue depth on the relay log filesystem
iostat -x 1 5

Back on the source:

-- Verify timeout and wait point configuration
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_source_timeout';
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_source_wait_point';
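If you capture the output of these checks with the mysql client in batch mode, which prints tab-separated columns, a few lines of Python can turn it into a dictionary for scripting. This is a rough sketch: `parse_status` is a hypothetical helper, and the sample text is made up for illustration rather than captured from a real server.

```python
from typing import Dict


def parse_status(text: str) -> Dict[str, str]:
    """Parse tab-separated `name<TAB>value` lines into a dict."""
    status: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2 or parts[0] == "Variable_name":
            continue  # skip blank lines and the header row
        status[parts[0]] = parts[1]
    return status


# Illustrative sample only; real values will differ.
sample = (
    "Variable_name\tValue\n"
    "Rpl_semi_sync_source_clients\t0\n"
    "Rpl_semi_sync_source_no_tx\t42\n"
    "Rpl_semi_sync_source_status\tOFF\n"
    "Rpl_semi_sync_source_yes_tx\t100000\n"
)
status = parse_status(sample)
print(status["Rpl_semi_sync_source_clients"], status["Rpl_semi_sync_source_status"])
```

Values stay as strings here because some status variables are ON/OFF flags rather than counters; convert the counters with `int()` where needed.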

How to diagnose it

  1. Confirm semi-sync is the bottleneck. On the source, check Rpl_semi_sync_source_status. If it is ON, the plugin is enabled and the source is operating in semi-sync mode; OFF means the plugin is disabled or the source has already fallen back to async. If Rpl_semi_sync_source_clients is less than rpl_semi_sync_source_wait_for_replica_count, the source will wait. Compare Rpl_semi_sync_source_yes_tx and Rpl_semi_sync_source_no_tx over a 30-second window. If no_tx is increasing while yes_tx is flat, timeouts are occurring.

  2. Identify which replica is missing. Rpl_semi_sync_source_clients tells you how many semi-sync-capable replicas are connected. If you expect one and see zero, the replica I/O thread has disconnected. Check SHOW PROCESSLIST on the source for replication connections, or check SHOW REPLICA STATUS on each replica for Replica_IO_Running.

  3. Check the replica I/O thread state. The replica ACKs after writing to the relay log and flushing to disk. If Replica_IO_Running is “Connecting”, the replica is trying to reconnect. If it is “No”, check Last_IO_Error. A stopped I/O thread means no ACKs.

  4. Measure network health. Use ping, mtr, or TCP latency checks between the source and replica. Semi-sync ACKs are small packets, but a network partition or severe packet loss will delay or drop them. Check if firewall rules or security groups recently changed.

  5. Inspect replica disk I/O. The replica must fsync the relay log before ACKing. If the relay log partition is on a saturated disk, the fsync delays the ACK. On the replica, run iostat -x and look for high await or %util on the relay log device.

  6. Review configuration for availability trade-offs. Check rpl_semi_sync_source_wait_point. AFTER_SYNC (the default since MySQL 5.7) waits for the ACK before the storage engine commits, so other sessions cannot see the data until the ACK arrives. AFTER_COMMIT commits to the storage engine first, then waits for the ACK, meaning other sessions may see the data before it is replicated. Neither changes the stall duration, but AFTER_SYNC is lossless on failover.

  7. Verify plugin compatibility. In MySQL 8.0.26 and later, semi-sync plugins and variables were renamed from rpl_semi_sync_master_* and rpl_semi_sync_slave_* to rpl_semi_sync_source_* and rpl_semi_sync_replica_*. If my.cnf references the old names on a new server, or if old plugins remain installed after an upgrade, semi-sync may fail to initialize: the plugin might appear in mysql.plugin but not load correctly, leaving the status variables missing. Check SHOW PLUGINS for rpl_semi_sync_source or rpl_semi_sync_master, and ensure the source and replica plugin names match the server version.
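The counter comparison in step 1 can be sketched as a small function over two samples taken about 30 seconds apart. This is an illustrative sketch only: `diagnose_window`, the short keys `yes_tx` and `no_tx`, and the finding strings are hypothetical, and the numbers are invented.

```python
from typing import Dict, List


def diagnose_window(
    before: Dict[str, int],
    after: Dict[str, int],
    clients: int,
    required_replicas: int,
) -> List[str]:
    """Compare two samples of the semi-sync counters and list findings."""
    findings: List[str] = []
    no_tx_delta = after["no_tx"] - before["no_tx"]
    yes_tx_delta = after["yes_tx"] - before["yes_tx"]

    if clients < required_replicas:
        findings.append("fewer semi-sync replicas connected than required")
    if no_tx_delta > 0 and yes_tx_delta == 0:
        findings.append("commits are timing out (no_tx rising, yes_tx flat)")
    elif no_tx_delta > 0:
        findings.append("intermittent fallbacks (no_tx and yes_tx both rising)")
    return findings


# Invented samples: Rpl_semi_sync_source_yes_tx / _no_tx at t and t+30s.
before = {"yes_tx": 100_000, "no_tx": 40}
after = {"yes_tx": 100_000, "no_tx": 55}
print(diagnose_window(before, after, clients=0, required_replicas=1))
```

An empty list means the window shows no semi-sync symptoms, in which case the stall is more likely caused by something else on the source.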

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Rpl_semi_sync_source_clients | Shows how many replicas are connected and capable of sending ACKs | Drops below rpl_semi_sync_source_wait_for_replica_count |
| Rpl_semi_sync_source_no_tx | Counts commits that timed out without receiving an ACK | Sustained increase means the primary is falling back to async |
| Rpl_semi_sync_source_yes_tx | Counts commits that succeeded in semi-sync mode | Flat or dropping while write volume is constant |
| Rpl_semi_sync_source_status | Whether the source is currently operating in semi-sync mode | Dropping to OFF during a write burst |
| Questions rate | Query throughput on the source | Sustained drop while Threads_running rises |
| Threads_running | Active concurrency | Rising above CPU core count while commit rate drops |
| Replica_IO_Running | Whether the replica I/O thread is fetching events | Not “Yes” on a replica expected to be semi-sync |
| Seconds_Behind_Source | SQL apply lag on the replica | Sudden spikes indicate replica health issues, though this lags behind the ACK stall |

Fixes

Reduce the timeout for faster fallback

If availability is more important than durability for every single commit, lower rpl_semi_sync_source_timeout so the source falls back to async faster.

-- Reduce timeout to 5 seconds (default is 10000 ms)
SET GLOBAL rpl_semi_sync_source_timeout = 5000;

Trade-off: You lose the durability guarantee sooner. A failover in the gap between async fallback and replica catch-up can lose data.
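To get a feel for how the timeout drives connection pile-up, here is a back-of-the-envelope sketch. It assumes that commits keep arriving at a steady rate, that each arrives on its own session, and that the stall lasts roughly as long as the timeout; `blocked_sessions`, the 200 commits per second rate, and the 1500-session cap are all made-up illustration values.

```python
def blocked_sessions(
    commit_rate_per_s: float, timeout_ms: float, max_sessions: int
) -> int:
    """Estimate sessions stuck in commit during one stall window."""
    arrivals = commit_rate_per_s * (timeout_ms / 1000.0)
    # The pile-up cannot exceed the number of sessions the server allows.
    return min(int(arrivals), max_sessions)


for timeout_ms in (10_000, 5_000, 1_000):
    print(timeout_ms, blocked_sessions(200, timeout_ms, max_sessions=1500))
```

Under these assumptions the default timeout exhausts the session cap, while a 1-second timeout leaves most of it free. Real workloads with connection pools and retries will behave differently, so treat the numbers as an order-of-magnitude guide.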

Allow immediate async fallback when no replicas are connected

If you prefer async replication to a full commit stall when all replicas disconnect, set rpl_semi_sync_source_wait_no_replica to OFF.

SET GLOBAL rpl_semi_sync_source_wait_no_replica = OFF;

Trade-off: If all replicas are offline, commits proceed without any replication guarantee. This can lead to significant data loss if the primary fails before replicas return.

Temporarily disable semi-sync on the source

If the replica path is broken and you need immediate relief, disable semi-sync on the source. Commits then replicate asynchronously without waiting for an ACK.

SET GLOBAL rpl_semi_sync_source_enabled = OFF;

Trade-off: You lose all semi-sync durability guarantees until re-enabled. Only use this when the alternative is complete write unavailability.

Fix the replica I/O path

If the replica is disk-bound on relay log writes, move the relay logs to a dedicated, low-latency device or increase the replica’s I/O capacity. If the replica I/O thread has stopped due to an error, resolve the error (for example, a binlog expired on the source before the replica fetched it) and restart replication.

Correct plugin mismatches

If the semi-sync plugin failed to load after an upgrade because old and new plugin names are mixed, uninstall the deprecated plugins and install the current ones. Only run these when the plugin is not actively needed or during a maintenance window.

-- On source
UNINSTALL PLUGIN rpl_semi_sync_master;
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';

-- On replica
UNINSTALL PLUGIN rpl_semi_sync_slave;
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';

Restart is not required for plugin install or uninstall, but verify that the variables appear after installation.
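One way to script the post-install check is to compare the installed plugin names against the old and new naming schemes. The sketch below assumes you have already collected the plugin names from the server into a list; `plugin_name_problems` and its messages are hypothetical.

```python
from typing import Iterable, List

OLD_NAMES = {"rpl_semi_sync_master", "rpl_semi_sync_slave"}
NEW_NAMES = {"rpl_semi_sync_source", "rpl_semi_sync_replica"}


def plugin_name_problems(installed: Iterable[str]) -> List[str]:
    """Flag a missing semi-sync plugin or a mix of old and new names."""
    names = set(installed)
    old = sorted(names & OLD_NAMES)
    new = sorted(names & NEW_NAMES)
    problems: List[str] = []
    if old and new:
        problems.append(f"old and new plugin names are mixed: {old + new}")
    if not old and not new:
        problems.append("no semi-sync plugin is installed")
    return problems


# Example input: a server where only half of the rename was applied.
print(plugin_name_problems(["InnoDB", "rpl_semi_sync_master", "rpl_semi_sync_replica"]))
print(plugin_name_problems(["InnoDB", "rpl_semi_sync_source"]))
```

An empty result only says the names are consistent; it does not confirm that the plugin is active or that the status variables exist.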

Prevention

  • Monitor Rpl_semi_sync_source_clients. A drop below your required count is the earliest warning of a stall.
  • Set rpl_semi_sync_source_wait_for_replica_count below your total replica count. If you have three replicas, require one. A single replica failure then does not stall commits.
  • Set a timeout that matches your application’s connection or lock wait limits. The 10-second default is too long for many workloads.
  • Watch the ratio of Rpl_semi_sync_source_no_tx to total commits. A rising rate of async fallbacks indicates intermittent replica path problems before they become full stalls.
  • After any MySQL upgrade, confirm SHOW PLUGINS lists the correct semi-sync plugins and that the status variables exist.
  • Alert on replica fsync latency and cross-AZ network latency. Semi-sync is only as fast as the slowest ACK path.
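The fallback ratio mentioned above can be computed from two counter samples. This is a minimal sketch that approximates total commits as the sum of the yes_tx and no_tx deltas, which is an assumption rather than something MySQL reports directly; `fallback_ratio` and the sample numbers are hypothetical.

```python
from typing import Dict, Optional


def fallback_ratio(before: Dict[str, int], after: Dict[str, int]) -> Optional[float]:
    """Share of commits in the window that were not acknowledged in time."""
    no_tx = after["no_tx"] - before["no_tx"]
    yes_tx = after["yes_tx"] - before["yes_tx"]
    total = no_tx + yes_tx
    if total <= 0:
        return None  # no commits in the window, or counters were reset
    return no_tx / total


# Invented samples of Rpl_semi_sync_source_yes_tx / _no_tx.
before = {"yes_tx": 1_000, "no_tx": 10}
after = {"yes_tx": 1_990, "no_tx": 20}
ratio = fallback_ratio(before, after)
print(f"{ratio:.2%}")
```

A ratio that creeps upward across successive windows is the kind of early signal worth alerting on before it turns into a full stall.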

How Netdata helps

  • Correlate Rpl_semi_sync_source_no_tx with Threads_running and the MySQL Questions rate to confirm that commit stalls are causing connection pile-up.
  • Alert when Rpl_semi_sync_source_clients drops below rpl_semi_sync_source_wait_for_replica_count before commits start timing out.
  • Cross-reference the primary’s semi-sync counters with the replica’s replication lag, disk write latency, and network latency to isolate whether the missing ACK is a network, disk, or replication thread problem.
  • Track Rpl_semi_sync_source_yes_tx versus no_tx over time to spot gradual degradation in the durability path, such as a replica flapping due to intermittent packet loss.