ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
Your inserts are slowing down. system.merges shows long-running background tasks with is_mutation = 1, while the active part count climbs toward the parts_to_delay_insert threshold. Write latency rises and DelayedInserts increases even though memory, disk, and CPU are not exhausted. The culprit is usually an application treating ClickHouse as an OLTP store, issuing ALTER TABLE UPDATE or ALTER TABLE DELETE as routine operations.
In ClickHouse, these statements do not perform row-level updates. They create mutations: asynchronous background jobs that rewrite entire data parts. Mutations queue sequentially per table, consume the same background pool that performs merges, and can silently starve the merge process until parts accumulate and inserts begin to reject.
What this means
When you issue ALTER TABLE UPDATE or ALTER TABLE DELETE, ClickHouse inserts a job into system.mutations. That job rewrites every active part in the target table. Multiple mutations stack up and execute one at a time per table, creating an O(mutations x parts) workload. Each mutation occupies a slot in the background merge and mutation pool, which defaults to a concurrency ratio of 2. While mutations run, regular merges are blocked or delayed.
The result is silent merge starvation. Parts that would normally be consolidated continue to accumulate. Query performance degrades as file counts rise. Eventually insert throttling begins, followed by hard rejections when a partition crosses parts_to_throw_insert.
Lightweight DELETE FROM uses a different mechanism based on mask files. It is less disruptive than classic ALTER TABLE DELETE mutations, though deleted rows are physically removed only during subsequent merges. For update-heavy workloads, avoid mutations entirely and use ReplacingMergeTree or CollapsingMergeTree, which handle changes through insert semantics and background deduplication rather than part rewrites.
flowchart TD
A[Frequent ALTER UPDATE or DELETE] --> B[Mutation queued per table]
B --> C[Rewrites every active part]
C --> D[Background merge pool consumed]
D --> E[Regular merges blocked]
E --> F[Active part count rises]
F --> G[Inserts delayed then rejected]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
Application using ALTER TABLE UPDATE for incremental changes | system.mutations shows multiple pending mutations on the same table with is_done = 0 | SELECT ... FROM system.mutations WHERE is_done = 0 |
Batch purging with ALTER TABLE DELETE | parts_to_do decreasing very slowly or stalled | system.merges for is_mutation = 1 count |
ORM or BI tool emitting ALTER statements | Frequent Alter entries in system.query_log from application users | Query log filtered by query_kind = 'Alter' |
| One-off corrections queuing behind a large table mutation | Small mutation blocked for hours behind a rewrite of a billion-row table | create_time and parts_to_do per mutation |
Lack of ReplacingMergeTree or CollapsingMergeTree for mutable data | Plain MergeTree engine with frequent “fix” operations | SHOW CREATE TABLE for engine type |
Quick checks
-- Active mutation queue and remaining work
SELECT database, table, mutation_id, command, create_time, is_done, parts_to_do
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
-- Distinguish mutations from regular merges in the background pool
SELECT count(*) AS active_merges,
countIf(is_mutation = 1) AS mutations,
countIf(is_mutation = 0) AS regular_merges
FROM system.merges;
-- Check if inserts are already being throttled or rejected
SELECT event, value
FROM system.events
WHERE event IN ('DelayedInserts', 'RejectedInserts');
-- Worst-case part count across all partitions
SELECT value
FROM system.asynchronous_metrics
WHERE metric = 'MaxPartCountForPartition';
-- Per-table active part counts to spot accumulation
SELECT database, table, count(*) AS active_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;
# Recent mutation-related log lines (path varies by installation)
grep -i mutation /var/log/clickhouse-server/*.log | tail -20
How to diagnose it
- Confirm mutation backlog. Query
system.mutations WHERE is_done = 0. If multiple rows exist for the same table, they are queuing sequentially and each must process every part. - Check pool allocation. Query
system.merges. Ifis_mutation = 1dominates the active task list while part counts are high, mutations are monopolizing the background pool. - Correlate with part growth. Compare
MaxPartCountForPartitionover time. A sustained rise while mutations run indicates merges are starved. - Check insert pipeline pressure. Verify
DelayedInsertsandRejectedInsertscounters. Any sustained increase confirms the write path is backing up. - Identify the source. Search
system.query_logforquery_kind = 'Alter'to find which user or service account is issuing mutations. - Assess replication impact. For
ReplicatedMergeTree, comparesystem.mutationsacross replicas. Ifparts_to_dolags on one replica, that node is the bottleneck for cluster-wide completion.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
system.mutations queue depth (is_done = 0) | Direct measure of pending mutation work | More than one pending mutation per table sustained |
parts_to_do flat or decreasing slowly | Indicates stalled or extremely slow mutation | No progress for longer than 30 minutes |
system.merges with is_mutation = 1 | Shows pool capacity consumed by mutations | All active background tasks are mutations |
MaxPartCountForPartition | Predicts “too many parts” before inserts fail | Greater than 500 parts per partition |
DelayedInserts / RejectedInserts | Confirms merge or mutation backlog affecting writes | Any sustained increase |
| Active background pool tasks | Merge and mutation tasks share this pool | system.merges task count near the concurrency ratio limit with growing parts |
Fixes
Kill unnecessary mutations
If a mutation was issued by mistake or is no longer needed, remove it from the queue.
-- Identify the mutation ID
SELECT database, table, mutation_id
FROM system.mutations
WHERE is_done = 0;
-- Kill the mutation
KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id';
Warning: Killing a mutation is disruptive and may leave parts in an intermediate state. Test cancellation behavior on your ClickHouse version before using this during an incident.
Switch to ReplacingMergeTree or CollapsingMergeTree
For workloads that require updates or deletes, redesign the table to avoid mutations:
- ReplacingMergeTree: Insert new rows with updated values and a version column. Background merges retain the highest version per
ORDER BYkey. UseFINALor deduplicate at query time. - CollapsingMergeTree: Insert cancel rows (sign = -1) and replacement rows (sign = +1). Background merges collapse matching pairs.
Tradeoff: These engines shift effort from write-time mutation to read-time filtering. Queries must handle duplicate rows that have not yet been merged.
Use lightweight DELETE
If the use case is row deletion, prefer DELETE FROM table WHERE ... over ALTER TABLE DELETE. This applies a mask file instead of rewriting entire parts. It is significantly less I/O-intensive, though cleanup still requires subsequent merges.
Collapse mutations into fewer statements
If multiple column updates are needed, batch them into a single ALTER TABLE UPDATE with a compound condition. One mutation rewriting parts once is far better than five sequential mutations each rewriting the same parts.
Increase background pool headroom
If mutations are legitimate and the server has CPU and disk headroom, you can increase merge and mutation concurrency:
SELECT name, value
FROM system.settings
WHERE name = 'background_merges_mutations_concurrency_ratio';
Only increase this if disk I/O and CPU are not saturated. More concurrency amplifies resource usage without solving the root cause of excessive mutations.
Address stuck or failed mutations
If latest_fail_reason is non-empty in system.mutations, the mutation is failing repeatedly. A failed mutation can block all subsequent mutations on that table. Check disk space, part integrity, and permissions. Resolve the underlying failure or kill the mutation to unblock the queue.
Prevention
- Never use
ALTER TABLE UPDATE/DELETEas a routine application pattern. Treat it as a rare administrative tool. - Design tables for immutable inserts. Use
ReplacingMergeTree,CollapsingMergeTree, or versioned insert logic instead of in-place updates. - Batch corrections into single statements. Avoid issuing many small mutations.
- Monitor
system.mutationsproactively. Alert on anyis_done = 0mutation that lasts longer than expected for your table size. - Audit query logs for ALTER patterns. Application service accounts should not hold privileges to issue
ALTER TABLE UPDATE/DELETE. - Watch background pool saturation. If background task utilization is consistently near the concurrency limit, investigate whether mutations are the cause before part counts explode.
How Netdata helps
- Correlates mutation backlog with merge starvation: Netdata collects
system.mutationsqueue depth alongsidesystem.mergesactivity, surfacing whenis_mutationtasks dominate the background pool. - Tracks part count growth: Charts
MaxPartCountForPartitionand per-table active parts to expose accumulation caused by blocked merges. - Surfaces insert pipeline pressure: Alerts on
DelayedInsertsandRejectedInsertsto catch downstream write-path impact. - Links resource usage: Correlates mutation-driven I/O with disk latency and memory pressure, distinguishing mutation overload from query load.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool
- ClickHouse monitoring checklist: the signals every production cluster needs







