ClickHouse replication queue stuck: num_tries, last_exception, and dead entries

A replica can show low absolute_delay in system.replicas while system.replication_queue contains entries with num_tries in the hundreds and the same last_exception repeating for hours. These dead entries do not self-resolve. They block merges, fetches, or mutations, causing silent divergence and stale reads. This guide shows how to read the queue correctly, identify the failure mode from the entry type, and clear the blockage without making the replica diverge further.

What this means

In ReplicatedMergeTree, each replica maintains a local system.replication_queue of operations from the shared replication log.

  • GET_PART: Fetch a data part from another replica over the inter-server port.
  • MERGE_PARTS: Execute a local merge of existing parts.
  • MUTATE_PART: Apply an ALTER UPDATE/DELETE mutation to a part.

A background thread picks up each entry. On failure, ClickHouse retries with backoff and increments num_tries. Once num_tries exceeds 10, and certainly above 100, the entry is genuinely stuck and will not resolve on its own. last_exception usually reveals whether the problem is network, disk, source replica health, or local resource starvation.

Relying on absolute_delay alone is dangerous. A replica can process new inserts quickly while an old GET_PART or MUTATE_PART entry stays dead in the queue. The delay stays low, but the replica is missing data or unable to merge parts. This eventually surfaces as elevated part counts, slower queries, or insert failures on the affected partition.

flowchart TD
    A[Queue size high or num_tries rising] --> B{Entry type?}
    B -->|GET_PART| C[Check source replica and port 9009]
    B -->|MERGE_PARTS| D[Check merge pool and disk space]
    B -->|MUTATE_PART| E[Check system.mutations]
    C --> F{Exception pattern}
    D --> F
    E --> F
    F -->|Transient| G[Monitor for resolution]
    F -->|Permanent| H[Apply fix: restart replica,
kill mutation, or free disk]

Common causes

CauseWhat it looks likeFirst thing to check
Source replica unreachable or part missing on sourceGET_PART entries with connection errors or “part not found” in last_exceptionConnectivity to source_replica on port 9009; whether the part exists on the source
Background fetch pool saturationGET_PART tasks pending, BackgroundFetchesPoolTask near BackgroundFetchesPoolSizesystem.metrics for fetch pool utilization
Local disk full or unreserved space exhaustedFetches or merges fail with space errors; MERGE_PARTS may also stallsystem.disks unreserved_space
Stuck mutation blocking MUTATE_PARTMUTATE_PART entries with high num_tries; system.mutations shows stale parts_to_dosystem.mutations where is_done = 0
Local merge pool starvationMERGE_PARTS entries stuck; background merge pool at capacitysystem.merges and merge pool utilization
ZooKeeper session flappingQueue grows across many tables simultaneously; is_session_expired togglessystem.zookeeper_connection and system.replicas

Quick checks

Run these read-only checks to characterize the queue state.

-- Queue summary per replica
SELECT
    database,
    table,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_max_index - log_pointer AS entries_behind,
    absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;
-- Stuck entries with exceptions
SELECT
    database,
    table,
    type,
    source_replica,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
-- Queue composition by type
SELECT
    database,
    table,
    type,
    count(*) AS pending
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY pending DESC;
-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;
-- Disk space headroom
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(unreserved_space) AS unreserved,
    formatReadableSize(keep_free_space_bytes) AS keep_free
FROM system.disks;
-- Replication failure events
SELECT event, value
FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss'
);
-- Active mutations that may block the queue
SELECT
    database,
    table,
    mutation_id,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
# Test connectivity to source replica on inter-server port
nc -zv <source-replica-host> 9009
-- Local part counts to spot merge starvation
SELECT
    database,
    table,
    count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

How to diagnose it

  1. Identify stuck entries. Query system.replication_queue with num_tries > 0, sorted by num_tries descending. Note type, last_exception, and source_replica. If num_tries is above 10 and climbing, treat the entry as dead.

  2. Classify by type. GET_PART means a fetch problem (network or source replica). MERGE_PARTS means local merge resource issues. MUTATE_PART means a mutation dependency.

  3. For GET_PART: Verify the source replica is alive and reachable on port 9009. Check whether the requested part still exists on the source via system.parts. If the source merged it away, ClickHouse may fall back to fetching the merged result. If that also fails, inspect the source replica’s logs for detach or corruption events.

  4. For MERGE_PARTS: Check whether the background merge pool is saturated (BackgroundMergesAndMutationsPoolTask near pool size). Check system.merges for long-running operations. Verify disk space: if unreserved_space is too low to hold the merged output, ClickHouse will not schedule the merge.

  5. For MUTATE_PART: Inspect system.mutations. If parts_to_do is flat for more than 30 minutes or latest_fail_reason is non-empty, the mutation is stalled. Mutations serialize per table, so one stuck mutation blocks all subsequent MUTATE_PART entries and can starve merges that depend on mutated parts.

  6. Correlate delay with queue depth. Compare absolute_delay against queue_size. If delay is low but queue_size is high, the replica is processing recent log entries while ignoring old dead ones. This is the pattern most likely to be missed by delay-only monitoring.

  7. Check ZooKeeper health. A flapping session (is_expired = 1) can cause the replica to rebuild its queue repeatedly, leaving phantom or duplicate entries. Use system.zookeeper_connection to confirm session stability.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.replication_queue entries with num_tries > 0Directly reveals retrying or dead operationsnum_tries > 5 sustained, or > 10/100 (dead entry)
system.replicas.queue_size and entries_behindMeasures backlog depth and log divergencequeue_size growing for > 15 minutes, or entries_behind increasing
system.replicas.absolute_delayWall-clock data freshness> 120 s sustained; do not use alone
Background fetches pool utilizationSaturated pool cannot process GET_PART tasksBackgroundFetchesPoolTask / BackgroundFetchesPoolSize > 0.9 for > 10 min
ReplicatedPartFailedFetchesCumulative failed fetches from peersSustained increase over any 5-minute window
system.disks.unreserved_spaceFetches and merges need writable spaceApproaching zero, or below the size of the largest active part
Inter-server throughput and connection countReplication traffic healthZero throughput despite pending fetches, or link saturation
system.mutations.parts_to_doMutations block MUTATE_PART entriesFlat for > 30 min, or latest_fail_reason non-empty
system.zookeeper_connection.is_expiredSession loss halts queue progressAny non-zero value sustained > 30 s

Fixes

Fetch failures from a missing or corrupted part

If last_exception indicates the source replica does not have the part, verify its presence on the source via system.parts. If the part was lost on the source and no other replica has it, data loss has occurred. If the source merged it away, ClickHouse should automatically fall back to fetching the merged result. If the queue entry remains stuck, SYSTEM RESTART REPLICA db.table forces the replica to rebuild its queue from the shared log. This briefly interrupts reads on that table.

Network or port 9009 connectivity issues

Repair firewall rules or routing so replicas can reach each other on port 9009 bidirectionally. If a replica is permanently decommissioned, remove it from the cluster configuration to stop fetches targeting it. Expect a temporary replication lag spike while the remaining replicas absorb catch-up traffic.

Background fetch pool saturation

If CPU and disk I/O have headroom, increase background_fetches_pool_size. If the saturation stems from a large post-downtime catch-up, consider throttling non-essential queries or inserts to reserve I/O bandwidth for replication. A larger pool increases concurrent I/O, which can starve local merges if the disk subsystem is already near saturation.

Disk space blocking fetches or merges

Free space immediately by detaching old partitions (ALTER TABLE ... DETACH PARTITION) or adding storage. Detaching makes the partition unavailable for queries until it is reattached or dropped. Merges need enough temporary space to write the full merged result before deleting source parts. If unreserved_space is near zero, merges stop and replication fetches may also fail, creating a death spiral.

Stuck mutation blocking the queue

Identify the mutation in system.mutations. If it is non-critical, kill it:

KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id';

This aborts work already performed; the mutation may need to be reissued later. After killing, verify that MUTATE_PART queue entries resume processing.

Permanently dead queue entry

When a single entry has num_tries above 100 and blocks subsequent work, SYSTEM RESTART REPLICA db.table is usually the fastest safe fix. It rebuilds the local queue state without dropping data. Do this during a maintenance window if possible, as it briefly makes the replica inconsistent for reads.

Prevention

  • Monitor queue health, not just delay. Alert on num_tries and last_exception directly. A low absolute_delay does not prove the queue is healthy.
  • Set num_tries thresholds. Alert when any entry exceeds 5 tries, and page when it exceeds 10.
  • Reserve inter-server bandwidth. Keep port 9009 open and uncongested. Do not share replication links with heavy cross-traffic.
  • Keep disk headroom. Maintain disk usage below 80-85% and ensure unreserved_space can accommodate the largest potential merge.
  • Limit mutations on replicated tables. Avoid heavy ALTER UPDATE/DELETE during peak hours. Mutations serialize per table and stall replication queues.
  • Run periodic cross-replica checks. Compare row counts per partition across replicas to catch silent divergence that queue metrics miss.

How Netdata helps

  • Netdata exposes ClickHouse replication metrics including ReplicatedPartFailedFetches, queue depth, and background pool utilization alongside system-level CPU, disk, and network charts.
  • Correlate rising num_tries with inter-server network throughput and disk I/O latency to distinguish a slow source replica from local resource saturation.
  • Alert on queue entry staleness by tracking replication queue metrics over time, catching dead entries before absolute_delay reflects the problem.
  • Visualize background pool saturation for both fetches and merges to identify capacity constraints that lead to stuck entries.
  • Monitor system.replicas and system.replication_queue dimensions without running manual SQL during an incident.