ClickHouse mutation stuck: parts_to_do not decreasing and how to recover

Check system.mutations and find a mutation active for hours. parts_to_do has not moved in thirty minutes and parts are accumulating. Queries return, but insert latency climbs. On replicated tables, each replica processes mutations independently, so a stall on one node creates silent divergence while the rest of the cluster appears healthy.

Mutations rewrite data parts to apply ALTER UPDATE, ALTER DELETE, or projection changes. They run sequentially per table and share the background merge and mutation pool with regular merges. A stalled mutation blocks subsequent mutations for that table and consumes threads needed for merges. The result is merge starvation, insert delays, and eventually Too many parts rejections.

What this means

A stalled mutation creates a linear bottleneck per table because mutations run sequentially. On replicated tables, the stalled replica diverges until the mutation completes or is killed.

While stalled, the mutation occupies the background pool. Merges for that table slow or stop. Inserts create fresh parts that are not merged away, so active part count rises and parts_to_do stays flat.

flowchart TD
    A[Mutation starts on table] --> B{parts_to_do decreasing?}
    B -->|Yes| C[Normal progress]
    B -->|No for >30 min| D[Stalled mutation]
    D --> E[Blocks merge pool]
    E --> F[Parts accumulate]
    F --> G[Insert delays or rejections]
    D --> H[Check latest_fail_reason]
    H -->|Non-empty| I[Failing retry loop]
    H -->|Empty| J[Pool saturated or resource blocked]

Common causes

CauseWhat it looks likeFirst thing to check
Failing mutation on a specific partlatest_fail_reason is non-empty; mutation retries the same parts without progressSELECT latest_fail_reason, parts_to_do_names FROM system.mutations WHERE is_done = 0
Background pool saturationsystem.merges shows many long-running tasks; no free slots for the mutation to proceedSELECT count(*) FROM system.merges and background pool metrics from system.metrics
Replication queue blockageMUTATE_PART entries in system.replication_queue have high num_tries and exceptionsSELECT * FROM system.replication_queue WHERE type = 'MUTATE_PART'
Resource exhaustionsystem.merges is empty despite load, or disk free space is critically lowsystem.disks and MemoryTracking in system.metrics
Legitimate slow progress on large tableparts_to_do decreases very slowly over hours; no error presentSample parts_to_do at 5-minute intervals

Quick checks

-- Active mutations with failure context
SELECT database, table, mutation_id, command, create_time, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Snapshot background work to see pool contention
SELECT database, table, elapsed, progress, num_parts, is_mutation, result_part_name
FROM system.merges
ORDER BY elapsed DESC;
-- Replication queue for stuck mutation tasks
SELECT database, table, type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE type = 'MUTATE_PART'
ORDER BY num_tries DESC;
-- Disk headroom for merge and mutation temp output
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;
-- Server memory pressure
SELECT metric, formatReadableSize(value) AS memory
FROM system.metrics
WHERE metric = 'MemoryTracking';
-- Active part count for the affected table
SELECT count() AS active_parts
FROM system.parts
WHERE database = 'db' AND table = 'tbl' AND active = 1;

How to diagnose it

  1. Confirm the stall. Query system.mutations where is_done = 0. Note create_time and parts_to_do. Wait five minutes and re-query. If parts_to_do is unchanged, the mutation is stalled.

  2. Check for repeated failures. If latest_fail_reason is non-empty, the mutation is failing on a specific part and retrying without progress. Note the parts listed in parts_to_do_names.

  3. Inspect the background pool. Query system.merges. If many entries show is_mutation = 1 or very long elapsed times, the pool is saturated. If the pool is empty but parts_to_do is flat, the mutation may be blocked on resources or locks.

  4. Evaluate replication state on replicated tables. Query system.replication_queue for MUTATE_PART entries. High num_tries with last_exception indicates the replica cannot apply the mutation locally, often because a required source part is missing or a fetch is failing.

  5. Check resource availability. Query system.disks to ensure free space is sufficient for temporary merge output. Check MemoryTracking to rule out memory pressure killing or starving background tasks.

  6. Distinguish slow from stuck. Very large tables can take hours or days to mutate. If parts_to_do drops even slightly across your sampling window, the mutation is slow but healthy. Do not kill it.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
system.mutations.parts_to_doMeasures remaining work per mutationFlat for more than 30 minutes
system.mutations.latest_fail_reasonSurfaces part-level failures that block progressNon-empty on any is_done = 0 mutation
Active merge/mutation countShows whether the background pool is saturatedSustained high count with flat parts_to_do
Active parts per partitionStalled mutations block merges, so parts accumulateSteady growth while a mutation is active
Replication queue num_triesReplicated mutations may fail silently on one replicaMUTATE_PART entries with num_tries > 5
Disk free spaceMerges and mutations need temp space to write new partsFree space below 20% of total

Fixes

Failing mutation with latest_fail_reason

When a mutation repeatedly fails, it will not self-heal. Identify the failing parts from parts_to_do_names. On a replicated table, check whether those parts are healthy on another replica. If a part is corrupt and the table is replicated, you may detach the bad part and re-fetch it from a healthy peer. Once the part is repaired or removed, the mutation should resume. If you cannot repair the part quickly, kill the mutation to unblock the queue:

KILL MUTATION WHERE database = 'db' AND table = 'tbl' AND mutation_id = 'id';

Killing leaves the table in a partially mutated state. Issue a replacement mutation only after confirming the root cause is resolved.

Background pool saturation

If system.merges shows long-running tasks consuming the pool, you must decide whether to wait or kill. Killing a non-critical mutation frees a pool slot and allows merges to resume. In-progress work on the current part is lost; already-completed parts remain mutated. If the pool is saturated by merges on a large table, killing the mutation may be the only way to prevent insert delays. After killing, monitor system.merges to confirm merges restart.

Replication queue blockage

For replicated tables, a mutation may stall because the local replica cannot fetch a required part or because a MUTATE_PART entry is stuck. Do not manually delete entries from system.replication_queue. Instead, investigate the source replica. If the replica is permanently diverged, use SYSTEM RESTART REPLICA or re-initialize the replica. These actions force the replica to reconcile its state and re-fetch missing parts. Expect temporary network load during recovery.

Resource exhaustion

If disk space is below the threshold needed for temporary merge output, background writes will halt. Free space immediately by detaching old partitions or adding capacity. If memory is saturated, large mutations may be aborted by the memory tracker. Address the memory pressure first by killing runaway queries or reducing concurrency. Resource fixes are the only durable solution; killing the mutation without fixing the resource only delays the next stall.

Legitimate slow mutation

For tables with billions of rows, mutations are inherently slow. If parts_to_do is decreasing, leave the mutation alone. Monitor part count to ensure merges are not starved. If you need faster completion, the only safe option is to reduce concurrent load on the node so the background pool can dedicate more cycles to the mutation.

Prevention

  • Monitor system.mutations proactively. A mutation backlog is a leading indicator of merge starvation.
  • Avoid issuing many sequential ALTER UPDATE or ALTER DELETE statements on the same table. Batch changes or use lightweight deletes where your version supports them.
  • Maintain disk free space well above 20%. Merges and mutations need headroom to write output.
  • On replicated tables, monitor mutation progress per replica. A stall on one replica creates silent divergence.
  • Set alerting on parts_to_do flatlining for more than 30 minutes and on non-empty latest_fail_reason.

How Netdata helps

  • Correlate flat parts_to_do with rising active part count to expose merge starvation caused by stuck mutations.
  • Alert on background pool saturation and merge stall signals that accompany mutation blockage.
  • Track per-table part growth trends to separate mutation-induced accumulation from insert overload.
  • Surface disk and memory pressure metrics alongside mutation status to reveal resource root causes.
  • Monitor replication lag per replica when mutations run on replicated tables, so a stalled replica does not go unnoticed.