$ guides / clickhouse / clickhouse-memory-pressure-death-spiral ▌

Operations Guides

ClickHouse memory pressure death spiral: runaway queries, retries, and OOM

MEMORY_LIMIT_EXCEEDED errors climb in the query log. Queries that normally finish in seconds now take minutes or are killed outright. The ClickHouse process is near its memory limit, but killing the heaviest query only frees capacity for a moment before another query is killed. If the application retries immediately, pressure never drops. With spill-to-disk enabled, the bottleneck shifts to disk I/O, starving background merges and slowing the whole system.

This is the memory pressure death spiral: a composite failure pattern where memory saturation triggers query termination or spill-to-disk, rising latency provokes retries, and repeated attempts keep memory pinned near the limit. The spiral ends with an OS-level OOM kill if internal tracker and cgroup limits are misaligned, or with a merge crisis if temporary disk files consume all I/O bandwidth.

What this means

ClickHouse enforces memory limits hierarchically: per-query (max_memory_usage), per-user (max_memory_usage_for_user), and server-wide (max_server_memory_usage). When any tracker hits its limit, ClickHouse kills the offending query with exception code 241. The server-wide limit defaults to 90% of physical RAM. Per-query limits live in user profiles; if unset, a query allocates without an individual bound.

Once memory pressure begins, three effects compound:

Query kills do not free headroom. Surviving queries, cache restoration, or background merges immediately reclaim the freed memory.
Retries amplify load. An application that reissues a killed query immediately sends fresh allocations into the same constrained pool.
Spill-to-disk shifts the bottleneck. If spill-to-disk is enabled, queries write temporary files instead of failing. This turns memory pressure into disk I/O pressure that competes with merges and increases latency.

The result is a feedback loop: pressure causes slowdowns and kills, which trigger retries, which restore pressure. Breaking the loop requires identifying whether the root cause is a single runaway query, unbounded concurrency, or an application retry storm.

flowchart TD
    A[Workload surge or runaway query] --> B[Memory allocations rise]
    B --> C{Near max_server_memory_usage?}
    C -->|Yes| D[Queries killed or spilled to disk]
    D --> E[Query latency increases]
    E --> F[Application retries]
    F --> B
    D --> G[Disk I/O saturation]
    G --> H[Merge throughput drops]
    H --> I[Part count rises]
    I --> J[Further latency increase]
    J --> E

Common causes

Cause	What it looks like	First thing to check
Runaway query	A single query shows `peak_memory_usage` orders of magnitude above the norm in `system.processes`	`SELECT query_id, memory_usage, peak_memory_usage, query FROM system.processes ORDER BY memory_usage DESC LIMIT 5`
Retry amplification	The same query pattern appears repeatedly in `system.query_log` with `exception_code = 241`	`SELECT exception_code, count(), any(exception) FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND event_time > now() - INTERVAL 10 MINUTE GROUP BY exception_code`
Unbounded concurrent analytics	Total memory across all running queries approaches 70% of the server limit with no single dominant query	`SELECT formatReadableSize(sum(memory_usage)) FROM system.processes`
Cache over-allocation	`MemoryTracking` is high despite low query concurrency; mark cache or uncompressed cache consume most RAM	`SELECT metric, value FROM system.asynchronous_metrics WHERE metric IN ('MarkCacheBytes', 'UncompressedCacheBytes')`
Spill-to-disk without I/O headroom	`/var/lib/clickhouse/tmp/` grows and disk I/O latency spikes while queries continue rather than fail	`ls -lah /var/lib/clickhouse/tmp/` and `iostat -xz 1 5`
Background merge memory spikes	Memory pressure coincides with long-running merges in `system.merges`	`SELECT database, table, elapsed, formatReadableSize(memory_usage) FROM system.merges ORDER BY memory_usage DESC`

Quick checks

# Check ClickHouse tracked memory vs configured server limit
clickhouse-client -q "SELECT formatReadableSize(value) AS memory_tracking FROM system.metrics WHERE metric = 'MemoryTracking'"

# Check OS RSS and peak to spot divergence from tracked memory
cat /proc/$(pgrep clickhouse-server)/status | grep -E '^(VmRSS|VmPeak|VmSize)'

# Top memory-consuming queries right now
clickhouse-client -q "SELECT query_id, formatReadableSize(memory_usage) AS mem, formatReadableSize(peak_memory_usage) AS peak, substring(query, 1, 120) AS q FROM system.processes WHERE memory_usage > 0 ORDER BY memory_usage DESC LIMIT 10"

# Recent MEMORY_LIMIT_EXCEEDED errors
clickhouse-client -q "SELECT query_id, exception_code, substring(query, 1, 120) AS q FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND exception_code = 241 AND event_time > now() - INTERVAL 10 MINUTE LIMIT 10"

# Check for spill-to-disk activity in tmp
ls -lah /var/lib/clickhouse/tmp/

# Check concurrent query count and total query memory
clickhouse-client -q "SELECT count() AS queries, formatReadableSize(sum(memory_usage)) AS total_query_mem FROM system.processes"

# Check mark cache size
clickhouse-client -q "SELECT metric, formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'MarkCacheBytes'"

# Check swap usage (swap thrashing makes recovery nearly impossible)
free -h | grep -i swap

How to diagnose it

Confirm server-wide pressure. Compare MemoryTracking from system.metrics to max_server_memory_usage. If the ratio is sustained above 80%, the server is in the danger zone. Also check MemoryResident in system.asynchronous_metrics because RSS can exceed tracked memory due to allocator overhead and untracked allocations.
Identify the memory consumer category. Run the system.processes query sorted by memory_usage. If one query dominates, it is a runaway query. If many moderate queries sum to most of the limit, it is a concurrency problem.
Check for retry amplification. Query system.query_log for exception code 241 over the last 10 minutes. If the same query fingerprint appears multiple times with short intervals, the application is retrying immediately.
Check spill-to-disk activity. Inspect /var/lib/clickhouse/tmp/. Growing files there mean queries are spilling. Correlate with disk I/O metrics (iostat or your infrastructure monitoring). High I/O wait during memory pressure confirms the bottleneck has shifted to disk.
Correlate with merge health. Check system.merges. If merges are running but disk I/O is saturated, merge throughput drops and parts begin to accumulate. This is the secondary death spiral.
Check OS and cgroup limits. In containerized deployments, verify that the cgroup memory limit is not close to ClickHouse’s internal limit. If the container limit is lower than or equal to max_server_memory_usage, the OS OOM killer can fire before ClickHouse’s circuit breaker engages.

Metrics and signals to monitor

Signal	Why it matters	Warning sign
MemoryTracking / max_server_memory_usage	Measures proximity to ClickHouse’s internal query-kill threshold	Sustained ratio > 80%
Peak per-query memory	Identifies runaway queries before they consume the server	Single query > 50% of server limit
FailedQuery rate with exception 241	Direct evidence of the kill-and-retry loop	Any sustained increase over baseline
Mark cache hit rate	Memory pressure evicts caches, causing more disk I/O	< 80% during pressure events (from `system.events` `MarkCacheHits` and `MarkCacheMisses`)
Disk I/O await	Spill-to-disk and cache misses shift saturation to I/O	> 20 ms on SSD sustained
Active merge count and throughput	I/O competition from spills starves merges, causing part accumulation	Merge throughput flat or falling while parts grow
OSMemoryAvailable	The OS OOM killer operates independently of ClickHouse trackers	< 5% of total RAM
Concurrent query count	Heavy concurrency compounds moderate per-query memory into server pressure	Approaching `max_concurrent_queries` or 2x typical peak

Fixes

Kill the runaway query and bound per-query memory

Warning: Killing a query is disruptive. The client receives an error and may retry immediately, so coordinate with the application owner before killing production traffic.

Identify the top consumer in system.processes and kill it:

KILL QUERY WHERE query_id = '<query_id>';

Set max_memory_usage in the user profile to prevent recurrence. A sensible starting point is 10-20% of max_server_memory_usage for ad-hoc users, lower for service accounts. The tradeoff is that legitimate large queries fail instead of completing slowly.

Stop retry amplification

Implement exponential back-off and circuit-breaking at the application layer. A query that failed with code 241 should wait seconds, not milliseconds, before retrying, and should give up after a small number of attempts. The tradeoff is temporary query unavailability, but this prevents the retry loop from pinning the server at 100% memory.

Reduce or remove spill-to-disk pressure

If max_bytes_before_external_group_by or max_bytes_before_external_sort are configured and disk I/O is saturated, you have two choices. Raise the thresholds so fewer queries spill, accepting that more will hit code 241. Or leave them enabled but ensure disk I/O capacity is reserved for merges so spills do not compete on the same device. The tradeoff is between query failure and I/O saturation.

Temporarily lower concurrency

If the root cause is a surge of legitimate concurrent heavy queries, reduce max_concurrent_queries temporarily. This queues or rejects new queries until memory drops. The tradeoff is that some client requests fail or wait.

Reclaim memory from caches

If MarkCacheBytes or UncompressedCacheBytes are consuming most RAM, reduce mark_cache_size or disable the uncompressed cache. This frees memory for query working sets at the cost of more disk seeks and decompression on repeated queries.

Align container and ClickHouse limits

In Kubernetes or containerized deployments, ensure the cgroup memory limit leaves headroom above max_server_memory_usage. This prevents the OS OOM killer from terminating the process before ClickHouse can kill the offending query. If the container limit is too tight, lower the configured server-wide limit to leave that headroom.

Prevention

Set per-query memory limits for every non-admin user profile. An unbounded query can trigger server-wide pressure.
Monitor both MemoryTracking and OS RSS. Alert on divergence because tracked memory underestimates actual footprint.
Instrument application retry logic with back-off and jitter. Immediate retries on 241 are an anti-pattern.
Size caches conservatively. Leave RAM for query working memory and background merges.
Validate spill-to-disk thresholds against I/O capacity before enabling. Spilling is only a safety valve if the disk subsystem can absorb the extra load without starving merges.
Review queries that use GLOBAL IN or large JOINs. These can cause unbounded memory growth during aggregation.

How Netdata helps

Netdata correlates ClickHouse MemoryTracking with OS RSS and cgroup memory limits in one chart, surfacing tracker divergence immediately.

Netdata alerts on spikes in exception code 241 and sustained memory saturation without requiring manual system.query_log polling.

Netdata disk I/O latency charts, shown alongside query latency, make it easy to see when spill-to-disk shifts the bottleneck from memory to I/O.

Netdata collects per-query memory from system.processes continuously, so runaway queries are visible before they dominate the server.

Netdata merge pool and active part count charts detect secondary merge starvation caused by I/O competition.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse memory pressure death spiral: runaway queries, retries, and OOM

ClickHouse memory pressure death spiral: runaway queries, retries, and OOM

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Kill the runaway query and bound per-query memory

Stop retry amplification

Reduce or remove spill-to-disk pressure

Temporarily lower concurrency

Reclaim memory from caches

Align container and ClickHouse limits

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata