$ guides / clickhouse / clickhouse-merge-death-spiral ▌

Operations Guides

ClickHouse merge death spiral: when parts accumulate faster than merges consolidate

Insert latency climbs. Application logs show ClickHouse throttling writes. Eventually inserts fail with Too many parts. Disk usage rises even though ingestion volume is flat. The cluster is up but refusing writes.

This is the merge death spiral: a self-reinforcing loop where parts accumulate faster than background merges consolidate them. Every INSERT creates immutable on-disk parts. Background merge threads combine smaller parts into larger ones to keep query performance healthy and part counts low. When insert pressure exceeds merge throughput, the backlog grows, merge overhead increases, and the system chokes on its own structure.

What this means

The loop begins when new part creation exceeds merge completion. Each part adds files, index entries, and merge candidates. As the part count per partition climbs, merges take longer and need more resources.

If merges cannot catch up, ClickHouse first delays inserts (DelayedInserts), then rejects them (RejectedInserts). Disk usage accelerates because merges cannot reclaim space: a merge must write its entire output before deleting source parts, and stalled merges leave old parts alive indefinitely. If the disk fills enough to prevent even a single merge from completing, the loop becomes unbreakable without manual intervention.

flowchart TD
    A[High insert rate] --> B[New parts created]
    B --> C[Part count grows]
    C --> D[Merge cost rises]
    D --> E[Merges fall behind]
    E --> F[Disk cannot reclaim space]
    F --> C
    E --> G[DelayedInserts]
    G --> H[RejectedInserts]
    H --> I[Writes halt]
    C --> J[Query slowdown]

Common causes

Cause	What it looks like	First thing to check
Many small inserts	High query rate, low rows per insert; each INSERT creates a part	`system.events` for `InsertQuery` vs `InsertedRows`; `system.query_log` for batch size
Mutations blocking the merge pool	Few or no merges in `system.merges` while `system.mutations` shows long-running entries	`SELECT * FROM system.mutations WHERE is_done = 0`
Disk I/O saturation	Merge threads exist but `progress` in `system.merges` is flat or barely moving	`system.merges` sampled over 60 seconds; OS disk metrics
Disk space too low for merge output	Merges stop despite available threads; disk near 85-90%	`system.disks` for `free_space` and `unreserved_space`
Replication bottleneck (ReplicatedMergeTree)	Merge pool busy but replication queue shows pending `MERGE_PARTS` entries; replica lag growing	`system.replication_queue` for stuck merge entries
High-cardinality partitioning	Part count explodes across many partition keys simultaneously, each with independent merge queues	`system.parts` grouped by `partition_id`

Quick checks

Run these safe, read-only probes to confirm whether you are in a spiral and what is blocking merges.

-- Worst partitions by active part count
SELECT
    database,
    table,
    partition_id,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;

-- Are merges running and making progress?
SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    is_mutation,
    formatReadableSize(total_size_bytes_compressed) AS total_size,
    formatReadableSize(memory_usage) AS mem_used
FROM system.merges
ORDER BY elapsed DESC;

-- Insert delay and rejection counters
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');

-- Pending mutations that may be consuming merge threads
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    parts_to_do,
    is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

-- Disk space and reservation headroom
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100 * (1 - free_space / total_space), 1) AS used_pct,
    formatReadableSize(unreserved_space) AS unreserved
FROM system.disks;

-- Background pool saturation
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%';

-- Replication queue stuck entries (only if using ReplicatedMergeTree)
SELECT
    database,
    table,
    type,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

How to diagnose it

Confirm part count explosion. Group system.parts by partition_id and sample active parts over time. A sustained upward slope is the defining symptom.
Check merge progress. Query system.merges. If the result set is empty while inserts are active and parts are high, merges are not running. If merges exist but progress is flat across a 60-second window, they are stuck.
Check for mutations. Query system.mutations for is_done = 0. Mutations rewrite entire parts and occupy merge pool threads. A large mutation on a busy table can starve merges.
Measure insert pressure. Compare InsertQuery event count against InsertedRows. A high query count with low row count indicates micro-batching, the most common trigger.
Inspect disk headroom. Use system.disks. If unreserved_space is near zero or used percentage exceeds 85%, merges may halt because they cannot allocate temporary output space.
Correlate delay and rejection events. Rising DelayedInserts means the system is throttling. Any increase in RejectedInserts means data loss is already happening.
Evaluate replication if clustered. In system.replication_queue, look for MERGE_PARTS entries with high num_tries or last_exception. Replicated merges must propagate; a stuck replica can serialize the whole queue.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Active part count per partition	Directly impacts query performance and insert viability	Sustained growth past 500 parts per partition
Merge activity (`system.merges`)	Background merge subsystem health and advancement	Empty result set during active inserts, or `progress` stuck
DelayedInserts / RejectedInserts	ClickHouse throttling or refusing writes	Any increase in `RejectedInserts`; steady growth in `DelayedInserts`
Background pool utilization	Whether merge threads are saturated or blocked	Active tasks consistently near pool size for more than 10 minutes
Disk free space (`system.disks`)	Merges require temporary output space; without it they halt	`unreserved_space` approaching zero or used percentage above 85%
Mutation queue depth	Mutations consume the same pool as merges and can block consolidation	Any `is_done = 0` mutation with `parts_to_do` flat for more than 30 minutes
Insert latency	Leading indicator that precedes delay/rejection events	P99 insert latency more than 5x baseline sustained for more than 15 minutes

Fixes

Stop the bleed: throttle or pause inserts

Reduce part creation rate. Reduce insert frequency and increase batch size. Aim for fewer, larger inserts. If the source is external, buffer upstream temporarily.

Kill mutations that monopolize the pool

If system.mutations shows long-running mutations blocking merges, cancel them with KILL MUTATION. This frees pool threads immediately, but the mutation work is lost and must be reissued later. Only kill mutations you can afford to restart.

Reclaim disk space fast

When disk is critically full, merges cannot complete. Identify the largest tables:

SELECT database, table, formatReadableSize(sum(bytes_on_disk))
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;

Detach old or unneeded partitions to free space immediately. Warning: detached data is unavailable until reattached. Verify backup or recovery requirements before detaching.

Increase merge concurrency only if resources exist

If CPU and I/O headroom exist, increase merge parallelism by adjusting background_merges_mutations_concurrency_ratio. This consumes more resources; do not raise it on an already saturated system.

Address replication bottlenecks

In ReplicatedMergeTree setups, a single slow replica or network partition can stall the replication queue. Check system.replication_queue for stuck MERGE_PARTS or GET_PART entries. Fix the replica or network issue; do not restart nodes blindly, as reconnection storms worsen coordination load.

Prevention

Batch inserts aggressively. Target roughly one INSERT per 1-2 seconds carrying tens of thousands to hundreds of thousands of rows. Many small inserts are the most common preventable cause.
Consider async inserts if clients cannot batch. Asynchronous inserts buffer multiple small inserts server-side, reducing part creation pressure.
Monitor part count trend, not just absolute value. A partition at 200 parts may be healthy, but if the count has been rising for 30 minutes, a crisis is forming.
Keep disk usage below 80-85%. ClickHouse needs headroom to write merged output before deleting source parts. The cliff from low disk to dead system is sudden.
Watch mutation usage. Avoid treating ClickHouse like an OLTP store with frequent ALTER UPDATE/DELETE. Each mutation rewrites entire parts and competes with merges.
Right-size partitioning. Parts are not merged across partitions. A high-cardinality partition key multiplies merge queues and accelerates part accumulation. Coarsen partition granularity where possible.

How Netdata helps

Correlate system.parts active count, system.merges throughput, and DelayedInserts/RejectedInserts on one dashboard.
Track disk I/O and space alongside part count to distinguish merge backlog from disk saturation.
Surface background pool utilization and mutation queue depth without manual system table queries during an incident.
Alert on part-count growth rate to catch an inverted merge-to-insert ratio early.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse merge death spiral: when parts accumulate faster than merges consolidate

ClickHouse merge death spiral: when parts accumulate faster than merges consolidate

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Stop the bleed: throttle or pause inserts

Kill mutations that monopolize the pool

Reclaim disk space fast

Increase merge concurrency only if resources exist

Address replication bottlenecks

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata