ClickHouse merge death spiral: when parts accumulate faster than merges consolidate

Insert latency climbs. Application logs show ClickHouse throttling writes. Eventually inserts fail with Too many parts. Disk usage rises even though ingestion volume is flat. The cluster is up but refusing writes.

This is the merge death spiral: a self-reinforcing loop where parts accumulate faster than background merges consolidate them. Every INSERT creates immutable on-disk parts. Background merge threads combine smaller parts into larger ones to keep query performance healthy and part counts low. When insert pressure exceeds merge throughput, the backlog grows, merge overhead increases, and the system chokes on its own structure.

What this means

The loop begins when new part creation exceeds merge completion. Each part adds files, index entries, and merge candidates. As the part count per partition climbs, merges take longer and need more resources.

If merges cannot catch up, ClickHouse first delays inserts (DelayedInserts), then rejects them (RejectedInserts). Disk usage accelerates because merges cannot reclaim space: a merge must write its entire output before deleting source parts, and stalled merges leave old parts alive indefinitely. If the disk fills enough to prevent even a single merge from completing, the loop becomes unbreakable without manual intervention.

flowchart TD
    A[High insert rate] --> B[New parts created]
    B --> C[Part count grows]
    C --> D[Merge cost rises]
    D --> E[Merges fall behind]
    E --> F[Disk cannot reclaim space]
    F --> C
    E --> G[DelayedInserts]
    G --> H[RejectedInserts]
    H --> I[Writes halt]
    C --> J[Query slowdown]

Common causes

CauseWhat it looks likeFirst thing to check
Many small insertsHigh query rate, low rows per insert; each INSERT creates a partsystem.events for InsertQuery vs InsertedRows; system.query_log for batch size
Mutations blocking the merge poolFew or no merges in system.merges while system.mutations shows long-running entriesSELECT * FROM system.mutations WHERE is_done = 0
Disk I/O saturationMerge threads exist but progress in system.merges is flat or barely movingsystem.merges sampled over 60 seconds; OS disk metrics
Disk space too low for merge outputMerges stop despite available threads; disk near 85-90%system.disks for free_space and unreserved_space
Replication bottleneck (ReplicatedMergeTree)Merge pool busy but replication queue shows pending MERGE_PARTS entries; replica lag growingsystem.replication_queue for stuck merge entries
High-cardinality partitioningPart count explodes across many partition keys simultaneously, each with independent merge queuessystem.parts grouped by partition_id

Quick checks

Run these safe, read-only probes to confirm whether you are in a spiral and what is blocking merges.

-- Worst partitions by active part count
SELECT
    database,
    table,
    partition_id,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
-- Are merges running and making progress?
SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    is_mutation,
    formatReadableSize(total_size_bytes_compressed) AS total_size,
    formatReadableSize(memory_usage) AS mem_used
FROM system.merges
ORDER BY elapsed DESC;
-- Insert delay and rejection counters
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');
-- Pending mutations that may be consuming merge threads
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    parts_to_do,
    is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Disk space and reservation headroom
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100 * (1 - free_space / total_space), 1) AS used_pct,
    formatReadableSize(unreserved_space) AS unreserved
FROM system.disks;
-- Background pool saturation
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%';
-- Replication queue stuck entries (only if using ReplicatedMergeTree)
SELECT
    database,
    table,
    type,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

How to diagnose it

  1. Confirm part count explosion. Group system.parts by partition_id and sample active parts over time. A sustained upward slope is the defining symptom.
  2. Check merge progress. Query system.merges. If the result set is empty while inserts are active and parts are high, merges are not running. If merges exist but progress is flat across a 60-second window, they are stuck.
  3. Check for mutations. Query system.mutations for is_done = 0. Mutations rewrite entire parts and occupy merge pool threads. A large mutation on a busy table can starve merges.
  4. Measure insert pressure. Compare InsertQuery event count against InsertedRows. A high query count with low row count indicates micro-batching, the most common trigger.
  5. Inspect disk headroom. Use system.disks. If unreserved_space is near zero or used percentage exceeds 85%, merges may halt because they cannot allocate temporary output space.
  6. Correlate delay and rejection events. Rising DelayedInserts means the system is throttling. Any increase in RejectedInserts means data loss is already happening.
  7. Evaluate replication if clustered. In system.replication_queue, look for MERGE_PARTS entries with high num_tries or last_exception. Replicated merges must propagate; a stuck replica can serialize the whole queue.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Active part count per partitionDirectly impacts query performance and insert viabilitySustained growth past 500 parts per partition
Merge activity (system.merges)Background merge subsystem health and advancementEmpty result set during active inserts, or progress stuck
DelayedInserts / RejectedInsertsClickHouse throttling or refusing writesAny increase in RejectedInserts; steady growth in DelayedInserts
Background pool utilizationWhether merge threads are saturated or blockedActive tasks consistently near pool size for more than 10 minutes
Disk free space (system.disks)Merges require temporary output space; without it they haltunreserved_space approaching zero or used percentage above 85%
Mutation queue depthMutations consume the same pool as merges and can block consolidationAny is_done = 0 mutation with parts_to_do flat for more than 30 minutes
Insert latencyLeading indicator that precedes delay/rejection eventsP99 insert latency more than 5x baseline sustained for more than 15 minutes

Fixes

Stop the bleed: throttle or pause inserts

Reduce part creation rate. Reduce insert frequency and increase batch size. Aim for fewer, larger inserts. If the source is external, buffer upstream temporarily.

Kill mutations that monopolize the pool

If system.mutations shows long-running mutations blocking merges, cancel them with KILL MUTATION. This frees pool threads immediately, but the mutation work is lost and must be reissued later. Only kill mutations you can afford to restart.

Reclaim disk space fast

When disk is critically full, merges cannot complete. Identify the largest tables:

SELECT database, table, formatReadableSize(sum(bytes_on_disk))
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;

Detach old or unneeded partitions to free space immediately. Warning: detached data is unavailable until reattached. Verify backup or recovery requirements before detaching.

Increase merge concurrency only if resources exist

If CPU and I/O headroom exist, increase merge parallelism by adjusting background_merges_mutations_concurrency_ratio. This consumes more resources; do not raise it on an already saturated system.

Address replication bottlenecks

In ReplicatedMergeTree setups, a single slow replica or network partition can stall the replication queue. Check system.replication_queue for stuck MERGE_PARTS or GET_PART entries. Fix the replica or network issue; do not restart nodes blindly, as reconnection storms worsen coordination load.

Prevention

  • Batch inserts aggressively. Target roughly one INSERT per 1-2 seconds carrying tens of thousands to hundreds of thousands of rows. Many small inserts are the most common preventable cause.
  • Consider async inserts if clients cannot batch. Asynchronous inserts buffer multiple small inserts server-side, reducing part creation pressure.
  • Monitor part count trend, not just absolute value. A partition at 200 parts may be healthy, but if the count has been rising for 30 minutes, a crisis is forming.
  • Keep disk usage below 80-85%. ClickHouse needs headroom to write merged output before deleting source parts. The cliff from low disk to dead system is sudden.
  • Watch mutation usage. Avoid treating ClickHouse like an OLTP store with frequent ALTER UPDATE/DELETE. Each mutation rewrites entire parts and competes with merges.
  • Right-size partitioning. Parts are not merged across partitions. A high-cardinality partition key multiplies merge queues and accelerates part accumulation. Coarsen partition granularity where possible.

How Netdata helps

  • Correlate system.parts active count, system.merges throughput, and DelayedInserts/RejectedInserts on one dashboard.
  • Track disk I/O and space alongside part count to distinguish merge backlog from disk saturation.
  • Surface background pool utilization and mutation queue depth without manual system table queries during an incident.
  • Alert on part-count growth rate to catch an inverted merge-to-insert ratio early.