ClickHouse merge duration climbing: the leading indicator of part explosion
system.merges shows elapsed times in hours. Your dashboards show P99 merge duration climbing over the past 48 hours. Rising merge duration is the earliest signal that your cluster is heading toward a part-count crisis, typically 1 to 3 days before inserts throttle or fail entirely.
Active part count is a lagging indicator. Merge duration rises while the merge subsystem still has work in progress. By the time part counts cross parts_to_delay_insert or parts_to_throw_insert, the backlog is severe. The operational skill is distinguishing legitimate long merges on large tables from systemic slowdowns that cascade into delayed or rejected inserts, then query degradation.
What this means
ClickHouse writes every INSERT as one or more immutable parts. The MergeTree engine continuously merges smaller parts into larger ones to reduce file count and index overhead.
Merge duration climbs when the background pool saturates, disk I/O is contested, or individual merges grow large enough to dominate threads. A single merge running for hours on a multi-terabyte table can be normal. The danger sign is a sustained increase in P99 merge duration across tables that previously completed similar-sized merges in minutes. Once merge throughput falls below the part creation rate, the system enters a debt spiral. ClickHouse first slows inserts at parts_to_delay_insert, then rejects them at parts_to_throw_insert. At that point the cluster is in a merge death spiral.
flowchart TD
A[Inserts outpace merges] --> B[Merge duration climbs]
B --> C[Active parts per partition rise]
C --> D[Insert throttling begins]
D --> E[Insert rejection: too many parts]
E --> F[Write halt and query degradation]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Insert micro-batching | Many small inserts create parts faster than merges consolidate them. P99 merge duration climbs while active parts grow steadily. | system.events for InsertQuery vs. InsertedRows ratio. Target 1,000+ rows per insert. |
| Mutation backlog | ALTER UPDATE/DELETE operations monopolize background pool threads. system.merges shows is_mutation = 1 consuming slots. | SELECT * FROM system.mutations WHERE is_done = 0; |
| Disk I/O saturation | Merge threads exist but throughput is low. Queries compete with merges for bandwidth. | iostat -xz 1 5 and system.merges.progress stalling. |
| Background pool exhaustion | Merge and mutation tasks fill the pool. New merges queue while parts accumulate. | system.metrics for BackgroundMergesAndMutationsPoolTask near pool size. |
| Disk space pressure | ClickHouse halts merges when insufficient space exists to write the merged result before deleting sources. | system.disks for unreserved_space approaching zero. |
| TTL merge competition | Tables with TTL generate additional background merges that compete with regular merges for the same pool. | system.merges during spikes to see if affected tables have TTL configured. |
Quick checks
Run these read-only probes to assess current state.
# Currently running merges and elapsed time
clickhouse-client -q "
SELECT database, table, elapsed, progress, num_parts, is_mutation,
formatReadableSize(total_size_bytes_compressed) AS total_size
FROM system.merges
ORDER BY elapsed DESC
LIMIT 10;
"
# P99 merge duration over last 24 hours (requires system.part_log)
clickhouse-client -q "
SELECT quantile(0.99)(duration_ms) / 1000 AS p99_merge_sec
FROM system.part_log
WHERE event_type = 'MergeParts'
AND event_time > now() - INTERVAL 24 HOUR;
"
# Active parts per partition (where explosions happen)
clickhouse-client -q "
SELECT database, table, partition_id, count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY active_parts DESC
LIMIT 20;
"
# Mutations blocking the pool
clickhouse-client -q "
SELECT database, table, mutation_id, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
"
# Background pool saturation
clickhouse-client -q "
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%Pool%';
"
# Disk space available for merge temp writes
clickhouse-client -q "
SELECT name, formatReadableSize(free_space) AS free,
formatReadableSize(unreserved_space) AS unreserved
FROM system.disks;
"
# Insert batching health
clickhouse-client -q "
SELECT event, value
FROM system.events
WHERE event IN ('InsertQuery', 'InsertedRows');
"
How to diagnose it
- Baseline P99 merge duration. Query
system.part_logfor the past week. A climb from single-digit minutes toward hours is a red flag. - Check live merge velocity. Sample
system.mergestwice, 60 seconds apart. Ifprogressis flat for a merge with highelapsed, the merge is stuck or I/O bound. - Map merges to tables and partitions. Correlate the tables with the longest
elapsedagainstsystem.partsactive counts. - Check for mutation interference. If
system.mergesshowsis_mutation = 1entries consuming threads, querysystem.mutationsforis_done = 0. A mutation withparts_to_donot decreasing is starving merges. - Measure background pool saturation. Compare
BackgroundMergesAndMutationsPoolTasktoBackgroundMergesAndMutationsPoolSizeinsystem.metrics. If the ratio exceeds 0.9 for more than 10 minutes, the pool is a bottleneck. - Verify disk headroom. Merges require enough space to write the full merged result before source parts are unlinked. If
system.disks.unreserved_spaceapproaches zero, merges halt. - Validate insert patterns. If
InsertedRows / InsertQueryinsystem.eventsyields fewer than 1,000 rows per insert, the application is micro-batching. - Distinguish normal from abnormal. A merge with
num_parts = 20andtotal_size_bytes_compressedin the hundreds of gigabytes legitimately takes hours. A merge withnum_parts = 3and small footprint taking over an hour indicates a slowdown.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
P99 merge duration (system.part_log) | Leading indicator of merge backlog | Trending from minutes toward hours over 24-48 hours |
Active parts per partition (system.parts) | Distance to insert throttling and rejection | Sustained growth above 500 per partition |
Background merge pool utilization (system.metrics) | Whether merge threads are saturated | BackgroundMergesAndMutationsPoolTask / pool size > 0.9 for > 10 min |
Mutation queue depth (system.mutations) | Hidden consumers of merge capacity | Any is_done = 0 with parts_to_do flat for > 30 min |
Disk unreserved space (system.disks) | Merges halt without temp space | Approaching zero |
Insert delay/rejection counters (system.events) | Late-stage symptoms of merge debt | DelayedInserts or RejectedInserts increasing |
Merge progress velocity (system.merges) | Real-time merge health | progress flat across a 60-second sample |
Fixes
Reduce part creation rate. Throttle insertors or increase batch sizes to 1,000+ rows per INSERT. This is the safest first step and buys immediate runway.
Kill blocking mutations. Warning: KILL MUTATION aborts the running mutation. If non-critical ALTER UPDATE/DELETE mutations are consuming the pool, use KILL MUTATION to free threads.
Increase merge concurrency. If CPU and disk I/O have headroom, raise background_merges_mutations_concurrency_ratio. Changes do not affect merges already in flight.
Reclaim disk space. If unreserved_space is low, drop old partitions with ALTER TABLE ... DROP PARTITION or add capacity. Warning: DROP PARTITION is irreversible. DETACH PARTITION alone moves data to the detached/ folder and does not free space.
Separate merge and mutation pools. On newer versions, merges and mutations can use separate pools. If mutations are a chronic issue, configure isolation so they cannot starve regular merges.
Review partition granularity. Over-partitioning (for example, by hour instead of day) multiplies part counts. Changing partition keys requires rewriting data but prevents recurrence.
Prevention
- Monitor P99 merge duration weekly using
system.part_log. Trend it before it becomes an incident. - Keep active parts per partition below 50% of
parts_to_delay_insertduring normal operations. - Enforce insert batching at the application layer. Single-row inserts are an anti-pattern in ClickHouse.
- Review
system.mutationsbefore it becomes a queue. Avoid treating ClickHouse like an OLTP store. - Maintain disk headroom so merges always have temp space.
- Alert on background pool utilization trends, not just instantaneous spikes.
How Netdata helps
- Correlate climbing merge duration with part-count growth, disk I/O latency, and insert rate on unified timelines.
- Alert on derived metrics like part-count growth rate and merge pool utilization without manual polling of
system.metrics. - Visualize P99 merge duration trends automatically.
- Surface host-level disk latency and CPU alongside ClickHouse internal signals.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse monitoring checklist: the signals every production cluster needs
- ClickHouse monitoring maturity model: from survival to expert
- ClickHouse DB::Exception: Too many parts - causes and fixes
- How ClickHouse actually works in production: a mental model for operators







