ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
Insert latency climbs. Application logs show ClickHouse throttling writes. Eventually inserts fail with Too many parts. Disk usage rises even though ingestion volume is flat. The cluster is up but refusing writes.
This is the merge death spiral: a self-reinforcing loop where parts accumulate faster than background merges consolidate them. Every INSERT creates immutable on-disk parts. Background merge threads combine smaller parts into larger ones to keep query performance healthy and part counts low. When insert pressure exceeds merge throughput, the backlog grows, merge overhead increases, and the system chokes on its own structure.
What this means
The loop begins when new part creation exceeds merge completion. Each part adds files, index entries, and merge candidates. As the part count per partition climbs, merges take longer and need more resources.
If merges cannot catch up, ClickHouse first delays inserts (DelayedInserts), then rejects them (RejectedInserts). Disk usage accelerates because merges cannot reclaim space: a merge must write its entire output before deleting source parts, and stalled merges leave old parts alive indefinitely. If the disk fills enough to prevent even a single merge from completing, the loop becomes unbreakable without manual intervention.
flowchart TD
A[High insert rate] --> B[New parts created]
B --> C[Part count grows]
C --> D[Merge cost rises]
D --> E[Merges fall behind]
E --> F[Disk cannot reclaim space]
F --> C
E --> G[DelayedInserts]
G --> H[RejectedInserts]
H --> I[Writes halt]
C --> J[Query slowdown]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Many small inserts | High query rate, low rows per insert; each INSERT creates a part | system.events for InsertQuery vs InsertedRows; system.query_log for batch size |
| Mutations blocking the merge pool | Few or no merges in system.merges while system.mutations shows long-running entries | SELECT * FROM system.mutations WHERE is_done = 0 |
| Disk I/O saturation | Merge threads exist but progress in system.merges is flat or barely moving | system.merges sampled over 60 seconds; OS disk metrics |
| Disk space too low for merge output | Merges stop despite available threads; disk near 85-90% | system.disks for free_space and unreserved_space |
| Replication bottleneck (ReplicatedMergeTree) | Merge pool busy but replication queue shows pending MERGE_PARTS entries; replica lag growing | system.replication_queue for stuck merge entries |
| High-cardinality partitioning | Part count explodes across many partition keys simultaneously, each with independent merge queues | system.parts grouped by partition_id |
Quick checks
Run these safe, read-only probes to confirm whether you are in a spiral and what is blocking merges.
-- Worst partitions by active part count
SELECT
database,
table,
partition_id,
count() AS active_parts,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
-- Are merges running and making progress?
SELECT
database,
table,
elapsed,
progress,
num_parts,
is_mutation,
formatReadableSize(total_size_bytes_compressed) AS total_size,
formatReadableSize(memory_usage) AS mem_used
FROM system.merges
ORDER BY elapsed DESC;
-- Insert delay and rejection counters
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');
-- Pending mutations that may be consuming merge threads
SELECT
database,
table,
mutation_id,
command,
create_time,
parts_to_do,
is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Disk space and reservation headroom
SELECT
name,
path,
formatReadableSize(free_space) AS free,
formatReadableSize(total_space) AS total,
round(100 * (1 - free_space / total_space), 1) AS used_pct,
formatReadableSize(unreserved_space) AS unreserved
FROM system.disks;
-- Background pool saturation
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%';
-- Replication queue stuck entries (only if using ReplicatedMergeTree)
SELECT
database,
table,
type,
create_time,
last_attempt_time,
num_tries,
last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
How to diagnose it
- Confirm part count explosion. Group
system.partsbypartition_idand sample active parts over time. A sustained upward slope is the defining symptom. - Check merge progress. Query
system.merges. If the result set is empty while inserts are active and parts are high, merges are not running. If merges exist butprogressis flat across a 60-second window, they are stuck. - Check for mutations. Query
system.mutationsforis_done = 0. Mutations rewrite entire parts and occupy merge pool threads. A large mutation on a busy table can starve merges. - Measure insert pressure. Compare
InsertQueryevent count againstInsertedRows. A high query count with low row count indicates micro-batching, the most common trigger. - Inspect disk headroom. Use
system.disks. Ifunreserved_spaceis near zero or used percentage exceeds 85%, merges may halt because they cannot allocate temporary output space. - Correlate delay and rejection events. Rising
DelayedInsertsmeans the system is throttling. Any increase inRejectedInsertsmeans data loss is already happening. - Evaluate replication if clustered. In
system.replication_queue, look forMERGE_PARTSentries with highnum_triesorlast_exception. Replicated merges must propagate; a stuck replica can serialize the whole queue.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Active part count per partition | Directly impacts query performance and insert viability | Sustained growth past 500 parts per partition |
Merge activity (system.merges) | Background merge subsystem health and advancement | Empty result set during active inserts, or progress stuck |
| DelayedInserts / RejectedInserts | ClickHouse throttling or refusing writes | Any increase in RejectedInserts; steady growth in DelayedInserts |
| Background pool utilization | Whether merge threads are saturated or blocked | Active tasks consistently near pool size for more than 10 minutes |
Disk free space (system.disks) | Merges require temporary output space; without it they halt | unreserved_space approaching zero or used percentage above 85% |
| Mutation queue depth | Mutations consume the same pool as merges and can block consolidation | Any is_done = 0 mutation with parts_to_do flat for more than 30 minutes |
| Insert latency | Leading indicator that precedes delay/rejection events | P99 insert latency more than 5x baseline sustained for more than 15 minutes |
Fixes
Stop the bleed: throttle or pause inserts
Reduce part creation rate. Reduce insert frequency and increase batch size. Aim for fewer, larger inserts. If the source is external, buffer upstream temporarily.
Kill mutations that monopolize the pool
If system.mutations shows long-running mutations blocking merges, cancel them with KILL MUTATION. This frees pool threads immediately, but the mutation work is lost and must be reissued later. Only kill mutations you can afford to restart.
Reclaim disk space fast
When disk is critically full, merges cannot complete. Identify the largest tables:
SELECT database, table, formatReadableSize(sum(bytes_on_disk))
FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;
Detach old or unneeded partitions to free space immediately. Warning: detached data is unavailable until reattached. Verify backup or recovery requirements before detaching.
Increase merge concurrency only if resources exist
If CPU and I/O headroom exist, increase merge parallelism by adjusting background_merges_mutations_concurrency_ratio. This consumes more resources; do not raise it on an already saturated system.
Address replication bottlenecks
In ReplicatedMergeTree setups, a single slow replica or network partition can stall the replication queue. Check system.replication_queue for stuck MERGE_PARTS or GET_PART entries. Fix the replica or network issue; do not restart nodes blindly, as reconnection storms worsen coordination load.
Prevention
- Batch inserts aggressively. Target roughly one INSERT per 1-2 seconds carrying tens of thousands to hundreds of thousands of rows. Many small inserts are the most common preventable cause.
- Consider async inserts if clients cannot batch. Asynchronous inserts buffer multiple small inserts server-side, reducing part creation pressure.
- Monitor part count trend, not just absolute value. A partition at 200 parts may be healthy, but if the count has been rising for 30 minutes, a crisis is forming.
- Keep disk usage below 80-85%. ClickHouse needs headroom to write merged output before deleting source parts. The cliff from low disk to dead system is sudden.
- Watch mutation usage. Avoid treating ClickHouse like an OLTP store with frequent
ALTER UPDATE/DELETE. Each mutation rewrites entire parts and competes with merges. - Right-size partitioning. Parts are not merged across partitions. A high-cardinality partition key multiplies merge queues and accelerates part accumulation. Coarsen partition granularity where possible.
How Netdata helps
- Correlate
system.partsactive count,system.mergesthroughput, andDelayedInserts/RejectedInsertson one dashboard. - Track disk I/O and space alongside part count to distinguish merge backlog from disk saturation.
- Surface background pool utilization and mutation queue depth without manual
systemtable queries during an incident. - Alert on part-count growth rate to catch an inverted merge-to-insert ratio early.







