$ guides / clickhouse / clickhouse-replication-queue-stuck ▌

Operations Guides

ClickHouse replication queue stuck: num_tries, last_exception, and dead entries

A replica can show low absolute_delay in system.replicas while system.replication_queue contains entries with num_tries in the hundreds and the same last_exception repeating for hours. These dead entries do not self-resolve. They block merges, fetches, or mutations, causing silent divergence and stale reads. This guide shows how to read the queue correctly, identify the failure mode from the entry type, and clear the blockage without making the replica diverge further.

What this means

In ReplicatedMergeTree, each replica maintains a local system.replication_queue of operations from the shared replication log.

GET_PART: Fetch a data part from another replica over the inter-server port.
MERGE_PARTS: Execute a local merge of existing parts.
MUTATE_PART: Apply an ALTER UPDATE/DELETE mutation to a part.

A background thread picks up each entry. On failure, ClickHouse retries with backoff and increments num_tries. Once num_tries exceeds 10, and certainly above 100, the entry is genuinely stuck and will not resolve on its own. last_exception usually reveals whether the problem is network, disk, source replica health, or local resource starvation.

Relying on absolute_delay alone is dangerous. A replica can process new inserts quickly while an old GET_PART or MUTATE_PART entry stays dead in the queue. The delay stays low, but the replica is missing data or unable to merge parts. This eventually surfaces as elevated part counts, slower queries, or insert failures on the affected partition.

flowchart TD
    A[Queue size high or num_tries rising] --> B{Entry type?}
    B -->|GET_PART| C[Check source replica and port 9009]
    B -->|MERGE_PARTS| D[Check merge pool and disk space]
    B -->|MUTATE_PART| E[Check system.mutations]
    C --> F{Exception pattern}
    D --> F
    E --> F
    F -->|Transient| G[Monitor for resolution]
    F -->|Permanent| H[Apply fix: restart replica,
kill mutation, or free disk]

Common causes

Cause	What it looks like	First thing to check
Source replica unreachable or part missing on source	`GET_PART` entries with connection errors or “part not found” in `last_exception`	Connectivity to `source_replica` on port 9009; whether the part exists on the source
Background fetch pool saturation	`GET_PART` tasks pending, `BackgroundFetchesPoolTask` near `BackgroundFetchesPoolSize`	`system.metrics` for fetch pool utilization
Local disk full or unreserved space exhausted	Fetches or merges fail with space errors; `MERGE_PARTS` may also stall	`system.disks` `unreserved_space`
Stuck mutation blocking `MUTATE_PART`	`MUTATE_PART` entries with high `num_tries`; `system.mutations` shows stale `parts_to_do`	`system.mutations` where `is_done = 0`
Local merge pool starvation	`MERGE_PARTS` entries stuck; background merge pool at capacity	`system.merges` and merge pool utilization
ZooKeeper session flapping	Queue grows across many tables simultaneously; `is_session_expired` toggles	`system.zookeeper_connection` and `system.replicas`

Quick checks

Run these read-only checks to characterize the queue state.

-- Queue summary per replica
SELECT
    database,
    table,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_max_index - log_pointer AS entries_behind,
    absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;

-- Stuck entries with exceptions
SELECT
    database,
    table,
    type,
    source_replica,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

-- Queue composition by type
SELECT
    database,
    table,
    type,
    count(*) AS pending
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY pending DESC;

-- Background pool utilization
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%'
ORDER BY metric;

-- Disk space headroom
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(unreserved_space) AS unreserved,
    formatReadableSize(keep_free_space_bytes) AS keep_free
FROM system.disks;

-- Replication failure events
SELECT event, value
FROM system.events
WHERE event IN (
    'ReplicatedPartFailedFetches',
    'ReplicatedPartChecksFailed',
    'ReplicatedDataLoss'
);

-- Active mutations that may block the queue
SELECT
    database,
    table,
    mutation_id,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

# Test connectivity to source replica on inter-server port
nc -zv <source-replica-host> 9009

-- Local part counts to spot merge starvation
SELECT
    database,
    table,
    count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

How to diagnose it

Identify stuck entries. Query system.replication_queue with num_tries > 0, sorted by num_tries descending. Note type, last_exception, and source_replica. If num_tries is above 10 and climbing, treat the entry as dead.
Classify by type. GET_PART means a fetch problem (network or source replica). MERGE_PARTS means local merge resource issues. MUTATE_PART means a mutation dependency.
For GET_PART: Verify the source replica is alive and reachable on port 9009. Check whether the requested part still exists on the source via system.parts. If the source merged it away, ClickHouse may fall back to fetching the merged result. If that also fails, inspect the source replica’s logs for detach or corruption events.
For MERGE_PARTS: Check whether the background merge pool is saturated (BackgroundMergesAndMutationsPoolTask near pool size). Check system.merges for long-running operations. Verify disk space: if unreserved_space is too low to hold the merged output, ClickHouse will not schedule the merge.
For MUTATE_PART: Inspect system.mutations. If parts_to_do is flat for more than 30 minutes or latest_fail_reason is non-empty, the mutation is stalled. Mutations serialize per table, so one stuck mutation blocks all subsequent MUTATE_PART entries and can starve merges that depend on mutated parts.
Correlate delay with queue depth. Compare absolute_delay against queue_size. If delay is low but queue_size is high, the replica is processing recent log entries while ignoring old dead ones. This is the pattern most likely to be missed by delay-only monitoring.
Check ZooKeeper health. A flapping session (is_expired = 1) can cause the replica to rebuild its queue repeatedly, leaving phantom or duplicate entries. Use system.zookeeper_connection to confirm session stability.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`system.replication_queue` entries with `num_tries > 0`	Directly reveals retrying or dead operations	`num_tries` > 5 sustained, or > 10/100 (dead entry)
`system.replicas.queue_size` and `entries_behind`	Measures backlog depth and log divergence	`queue_size` growing for > 15 minutes, or `entries_behind` increasing
`system.replicas.absolute_delay`	Wall-clock data freshness	> 120 s sustained; do not use alone
Background fetches pool utilization	Saturated pool cannot process `GET_PART` tasks	`BackgroundFetchesPoolTask` / `BackgroundFetchesPoolSize` > 0.9 for > 10 min
`ReplicatedPartFailedFetches`	Cumulative failed fetches from peers	Sustained increase over any 5-minute window
`system.disks.unreserved_space`	Fetches and merges need writable space	Approaching zero, or below the size of the largest active part
Inter-server throughput and connection count	Replication traffic health	Zero throughput despite pending fetches, or link saturation
`system.mutations.parts_to_do`	Mutations block `MUTATE_PART` entries	Flat for > 30 min, or `latest_fail_reason` non-empty
`system.zookeeper_connection.is_expired`	Session loss halts queue progress	Any non-zero value sustained > 30 s

Fixes

Fetch failures from a missing or corrupted part

If last_exception indicates the source replica does not have the part, verify its presence on the source via system.parts. If the part was lost on the source and no other replica has it, data loss has occurred. If the source merged it away, ClickHouse should automatically fall back to fetching the merged result. If the queue entry remains stuck, SYSTEM RESTART REPLICA db.table forces the replica to rebuild its queue from the shared log. This briefly interrupts reads on that table.

Network or port 9009 connectivity issues

Repair firewall rules or routing so replicas can reach each other on port 9009 bidirectionally. If a replica is permanently decommissioned, remove it from the cluster configuration to stop fetches targeting it. Expect a temporary replication lag spike while the remaining replicas absorb catch-up traffic.

Background fetch pool saturation

If CPU and disk I/O have headroom, increase background_fetches_pool_size. If the saturation stems from a large post-downtime catch-up, consider throttling non-essential queries or inserts to reserve I/O bandwidth for replication. A larger pool increases concurrent I/O, which can starve local merges if the disk subsystem is already near saturation.

Disk space blocking fetches or merges

Free space immediately by detaching old partitions (ALTER TABLE ... DETACH PARTITION) or adding storage. Detaching makes the partition unavailable for queries until it is reattached or dropped. Merges need enough temporary space to write the full merged result before deleting source parts. If unreserved_space is near zero, merges stop and replication fetches may also fail, creating a death spiral.

Stuck mutation blocking the queue

Identify the mutation in system.mutations. If it is non-critical, kill it:

KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id';

This aborts work already performed; the mutation may need to be reissued later. After killing, verify that MUTATE_PART queue entries resume processing.

Permanently dead queue entry

When a single entry has num_tries above 100 and blocks subsequent work, SYSTEM RESTART REPLICA db.table is usually the fastest safe fix. It rebuilds the local queue state without dropping data. Do this during a maintenance window if possible, as it briefly makes the replica inconsistent for reads.

Prevention

Monitor queue health, not just delay. Alert on num_tries and last_exception directly. A low absolute_delay does not prove the queue is healthy.
Set num_tries thresholds. Alert when any entry exceeds 5 tries, and page when it exceeds 10.
Reserve inter-server bandwidth. Keep port 9009 open and uncongested. Do not share replication links with heavy cross-traffic.
Keep disk headroom. Maintain disk usage below 80-85% and ensure unreserved_space can accommodate the largest potential merge.
Limit mutations on replicated tables. Avoid heavy ALTER UPDATE/DELETE during peak hours. Mutations serialize per table and stall replication queues.
Run periodic cross-replica checks. Compare row counts per partition across replicas to catch silent divergence that queue metrics miss.

How Netdata helps

Netdata exposes ClickHouse replication metrics including ReplicatedPartFailedFetches, queue depth, and background pool utilization alongside system-level CPU, disk, and network charts.
Correlate rising num_tries with inter-server network throughput and disk I/O latency to distinguish a slow source replica from local resource saturation.
Alert on queue entry staleness by tracking replication queue metrics over time, catching dead entries before absolute_delay reflects the problem.
Visualize background pool saturation for both fetches and merges to identify capacity constraints that lead to stuck entries.
Monitor system.replicas and system.replication_queue dimensions without running manual SQL during an incident.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse replication queue stuck: num_tries, last_exception, and dead entries

ClickHouse replication queue stuck: num_tries, last_exception, and dead entries

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Fetch failures from a missing or corrupted part

Network or port 9009 connectivity issues

Background fetch pool saturation

Disk space blocking fetches or merges

Stuck mutation blocking the queue

Permanently dead queue entry

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata