ClickHouse replication queue stuck: num_tries, last_exception, and dead entries
A replica can show low absolute_delay in system.replicas while system.replication_queue contains entries with num_tries in the hundreds and the same last_exception repeating for hours. These dead entries do not self-resolve. They block merges, fetches, or mutations, causing silent divergence and stale reads. This guide shows how to read the queue correctly, identify the failure mode from the entry type, and clear the blockage without making the replica diverge further.
What this means
In ReplicatedMergeTree, each replica maintains a local system.replication_queue of operations from the shared replication log.
- GET_PART: Fetch a data part from another replica over the inter-server port.
- MERGE_PARTS: Execute a local merge of existing parts.
- MUTATE_PART: Apply an ALTER UPDATE/DELETE mutation to a part.
A background thread picks up each entry. On failure, ClickHouse retries with backoff and increments num_tries. Once num_tries exceeds 10, and certainly above 100, the entry is genuinely stuck and will not resolve on its own. last_exception usually reveals whether the problem is network, disk, source replica health, or local resource starvation.
Relying on absolute_delay alone is dangerous. A replica can process new inserts quickly while an old GET_PART or MUTATE_PART entry stays dead in the queue. The delay stays low, but the replica is missing data or unable to merge parts. This eventually surfaces as elevated part counts, slower queries, or insert failures on the affected partition.
flowchart TD
A[Queue size high or num_tries rising] --> B{Entry type?}
B -->|GET_PART| C[Check source replica and port 9009]
B -->|MERGE_PARTS| D[Check merge pool and disk space]
B -->|MUTATE_PART| E[Check system.mutations]
C --> F{Exception pattern}
D --> F
E --> F
F -->|Transient| G[Monitor for resolution]
F -->|Permanent| H[Apply fix: restart replica,
kill mutation, or free disk]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Source replica unreachable or part missing on source | GET_PART entries with connection errors or “part not found” in last_exception | Connectivity to source_replica on port 9009; whether the part exists on the source |
| Background fetch pool saturation | GET_PART tasks pending, BackgroundFetchesPoolTask near BackgroundFetchesPoolSize | system.metrics for fetch pool utilization |
| Local disk full or unreserved space exhausted | Fetches or merges fail with space errors; MERGE_PARTS may also stall | system.disks unreserved_space |
Stuck mutation blocking MUTATE_PART | MUTATE_PART entries with high num_tries; system.mutations shows stale parts_to_do | system.mutations where is_done = 0 |
| Local merge pool starvation | MERGE_PARTS entries stuck; background merge pool at capacity | system.merges and merge pool utilization |
| ZooKeeper session flapping | Queue grows across many tables simultaneously; is_session_expired toggles | system.zookeeper_connection and system.replicas |
Quick checks
Run these read-only checks to characterize the queue state.
-- Queue summary per replica
SELECT
database,
table,
queue_size,
inserts_in_queue,
merges_in_queue,
log_max_index - log_pointer AS entries_behind,
absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;
-- Stuck entries with exceptions
SELECT
database,
table,
type,
source_replica,
create_time,
last_attempt_time,
num_tries,
last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
-- Queue composition by type
SELECT
database,
table,
type,
count(*) AS pending
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY pending DESC;
-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;
-- Disk space headroom
SELECT
name,
path,
formatReadableSize(free_space) AS free,
formatReadableSize(unreserved_space) AS unreserved,
formatReadableSize(keep_free_space_bytes) AS keep_free
FROM system.disks;
-- Replication failure events
SELECT event, value
FROM system.events
WHERE event IN (
'ReplicatedPartFailedFetches',
'ReplicatedPartChecksFailed',
'ReplicatedDataLoss'
);
-- Active mutations that may block the queue
SELECT
database,
table,
mutation_id,
parts_to_do,
latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
# Test connectivity to source replica on inter-server port
nc -zv <source-replica-host> 9009
-- Local part counts to spot merge starvation
SELECT
database,
table,
count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;
How to diagnose it
Identify stuck entries. Query
system.replication_queuewithnum_tries > 0, sorted bynum_triesdescending. Notetype,last_exception, andsource_replica. Ifnum_triesis above 10 and climbing, treat the entry as dead.Classify by type.
GET_PARTmeans a fetch problem (network or source replica).MERGE_PARTSmeans local merge resource issues.MUTATE_PARTmeans a mutation dependency.For
GET_PART: Verify the source replica is alive and reachable on port 9009. Check whether the requested part still exists on the source viasystem.parts. If the source merged it away, ClickHouse may fall back to fetching the merged result. If that also fails, inspect the source replica’s logs for detach or corruption events.For
MERGE_PARTS: Check whether the background merge pool is saturated (BackgroundMergesAndMutationsPoolTasknear pool size). Checksystem.mergesfor long-running operations. Verify disk space: ifunreserved_spaceis too low to hold the merged output, ClickHouse will not schedule the merge.For
MUTATE_PART: Inspectsystem.mutations. Ifparts_to_dois flat for more than 30 minutes orlatest_fail_reasonis non-empty, the mutation is stalled. Mutations serialize per table, so one stuck mutation blocks all subsequentMUTATE_PARTentries and can starve merges that depend on mutated parts.Correlate delay with queue depth. Compare
absolute_delayagainstqueue_size. If delay is low butqueue_sizeis high, the replica is processing recent log entries while ignoring old dead ones. This is the pattern most likely to be missed by delay-only monitoring.Check ZooKeeper health. A flapping session (
is_expired = 1) can cause the replica to rebuild its queue repeatedly, leaving phantom or duplicate entries. Usesystem.zookeeper_connectionto confirm session stability.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
system.replication_queue entries with num_tries > 0 | Directly reveals retrying or dead operations | num_tries > 5 sustained, or > 10/100 (dead entry) |
system.replicas.queue_size and entries_behind | Measures backlog depth and log divergence | queue_size growing for > 15 minutes, or entries_behind increasing |
system.replicas.absolute_delay | Wall-clock data freshness | > 120 s sustained; do not use alone |
| Background fetches pool utilization | Saturated pool cannot process GET_PART tasks | BackgroundFetchesPoolTask / BackgroundFetchesPoolSize > 0.9 for > 10 min |
ReplicatedPartFailedFetches | Cumulative failed fetches from peers | Sustained increase over any 5-minute window |
system.disks.unreserved_space | Fetches and merges need writable space | Approaching zero, or below the size of the largest active part |
| Inter-server throughput and connection count | Replication traffic health | Zero throughput despite pending fetches, or link saturation |
system.mutations.parts_to_do | Mutations block MUTATE_PART entries | Flat for > 30 min, or latest_fail_reason non-empty |
system.zookeeper_connection.is_expired | Session loss halts queue progress | Any non-zero value sustained > 30 s |
Fixes
Fetch failures from a missing or corrupted part
If last_exception indicates the source replica does not have the part, verify its presence on the source via system.parts. If the part was lost on the source and no other replica has it, data loss has occurred. If the source merged it away, ClickHouse should automatically fall back to fetching the merged result. If the queue entry remains stuck, SYSTEM RESTART REPLICA db.table forces the replica to rebuild its queue from the shared log. This briefly interrupts reads on that table.
Network or port 9009 connectivity issues
Repair firewall rules or routing so replicas can reach each other on port 9009 bidirectionally. If a replica is permanently decommissioned, remove it from the cluster configuration to stop fetches targeting it. Expect a temporary replication lag spike while the remaining replicas absorb catch-up traffic.
Background fetch pool saturation
If CPU and disk I/O have headroom, increase background_fetches_pool_size. If the saturation stems from a large post-downtime catch-up, consider throttling non-essential queries or inserts to reserve I/O bandwidth for replication. A larger pool increases concurrent I/O, which can starve local merges if the disk subsystem is already near saturation.
Disk space blocking fetches or merges
Free space immediately by detaching old partitions (ALTER TABLE ... DETACH PARTITION) or adding storage. Detaching makes the partition unavailable for queries until it is reattached or dropped. Merges need enough temporary space to write the full merged result before deleting source parts. If unreserved_space is near zero, merges stop and replication fetches may also fail, creating a death spiral.
Stuck mutation blocking the queue
Identify the mutation in system.mutations. If it is non-critical, kill it:
KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id';
This aborts work already performed; the mutation may need to be reissued later. After killing, verify that MUTATE_PART queue entries resume processing.
Permanently dead queue entry
When a single entry has num_tries above 100 and blocks subsequent work, SYSTEM RESTART REPLICA db.table is usually the fastest safe fix. It rebuilds the local queue state without dropping data. Do this during a maintenance window if possible, as it briefly makes the replica inconsistent for reads.
Prevention
- Monitor queue health, not just delay. Alert on
num_triesandlast_exceptiondirectly. A lowabsolute_delaydoes not prove the queue is healthy. - Set
num_triesthresholds. Alert when any entry exceeds 5 tries, and page when it exceeds 10. - Reserve inter-server bandwidth. Keep port 9009 open and uncongested. Do not share replication links with heavy cross-traffic.
- Keep disk headroom. Maintain disk usage below 80-85% and ensure
unreserved_spacecan accommodate the largest potential merge. - Limit mutations on replicated tables. Avoid heavy
ALTER UPDATE/DELETEduring peak hours. Mutations serialize per table and stall replication queues. - Run periodic cross-replica checks. Compare row counts per partition across replicas to catch silent divergence that queue metrics miss.
How Netdata helps
- Netdata exposes ClickHouse replication metrics including
ReplicatedPartFailedFetches, queue depth, and background pool utilization alongside system-level CPU, disk, and network charts. - Correlate rising
num_trieswith inter-server network throughput and disk I/O latency to distinguish a slow source replica from local resource saturation. - Alert on queue entry staleness by tracking replication queue metrics over time, catching dead entries before
absolute_delayreflects the problem. - Visualize background pool saturation for both fetches and merges to identify capacity constraints that lead to stuck entries.
- Monitor
system.replicasandsystem.replication_queuedimensions without running manual SQL during an incident.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool







