ClickHouse memory pressure death spiral: runaway queries, retries, and OOM

MEMORY_LIMIT_EXCEEDED errors climb in the query log. Queries that normally finish in seconds now take minutes or are killed outright. The ClickHouse process is near its memory limit, but killing the heaviest query only frees capacity for a moment before another query is killed. If the application retries immediately, pressure never drops. With spill-to-disk enabled, the bottleneck shifts to disk I/O, starving background merges and slowing the whole system.

This is the memory pressure death spiral: a composite failure pattern where memory saturation triggers query termination or spill-to-disk, rising latency provokes retries, and repeated attempts keep memory pinned near the limit. The spiral ends with an OS-level OOM kill if internal tracker and cgroup limits are misaligned, or with a merge crisis if temporary disk files consume all I/O bandwidth.

What this means

ClickHouse enforces memory limits hierarchically: per-query (max_memory_usage), per-user (max_memory_usage_for_user), and server-wide (max_server_memory_usage). When any tracker hits its limit, ClickHouse kills the offending query with exception code 241. The server-wide limit defaults to 90% of physical RAM. Per-query limits live in user profiles; if unset, a query allocates without an individual bound.

Once memory pressure begins, three effects compound:

  1. Query kills do not free headroom. Surviving queries, cache restoration, or background merges immediately reclaim the freed memory.
  2. Retries amplify load. An application that reissues a killed query immediately sends fresh allocations into the same constrained pool.
  3. Spill-to-disk shifts the bottleneck. If spill-to-disk is enabled, queries write temporary files instead of failing. This turns memory pressure into disk I/O pressure that competes with merges and increases latency.

The result is a feedback loop: pressure causes slowdowns and kills, which trigger retries, which restore pressure. Breaking the loop requires identifying whether the root cause is a single runaway query, unbounded concurrency, or an application retry storm.

flowchart TD
    A[Workload surge or runaway query] --> B[Memory allocations rise]
    B --> C{Near max_server_memory_usage?}
    C -->|Yes| D[Queries killed or spilled to disk]
    D --> E[Query latency increases]
    E --> F[Application retries]
    F --> B
    D --> G[Disk I/O saturation]
    G --> H[Merge throughput drops]
    H --> I[Part count rises]
    I --> J[Further latency increase]
    J --> E

Common causes

CauseWhat it looks likeFirst thing to check
Runaway queryA single query shows peak_memory_usage orders of magnitude above the norm in system.processesSELECT query_id, memory_usage, peak_memory_usage, query FROM system.processes ORDER BY memory_usage DESC LIMIT 5
Retry amplificationThe same query pattern appears repeatedly in system.query_log with exception_code = 241SELECT exception_code, count(), any(exception) FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND event_time > now() - INTERVAL 10 MINUTE GROUP BY exception_code
Unbounded concurrent analyticsTotal memory across all running queries approaches 70% of the server limit with no single dominant querySELECT formatReadableSize(sum(memory_usage)) FROM system.processes
Cache over-allocationMemoryTracking is high despite low query concurrency; mark cache or uncompressed cache consume most RAMSELECT metric, value FROM system.asynchronous_metrics WHERE metric IN ('MarkCacheBytes', 'UncompressedCacheBytes')
Spill-to-disk without I/O headroom/var/lib/clickhouse/tmp/ grows and disk I/O latency spikes while queries continue rather than faills -lah /var/lib/clickhouse/tmp/ and iostat -xz 1 5
Background merge memory spikesMemory pressure coincides with long-running merges in system.mergesSELECT database, table, elapsed, formatReadableSize(memory_usage) FROM system.merges ORDER BY memory_usage DESC

Quick checks

# Check ClickHouse tracked memory vs configured server limit
clickhouse-client -q "SELECT formatReadableSize(value) AS memory_tracking FROM system.metrics WHERE metric = 'MemoryTracking'"

# Check OS RSS and peak to spot divergence from tracked memory
cat /proc/$(pgrep clickhouse-server)/status | grep -E '^(VmRSS|VmPeak|VmSize)'

# Top memory-consuming queries right now
clickhouse-client -q "SELECT query_id, formatReadableSize(memory_usage) AS mem, formatReadableSize(peak_memory_usage) AS peak, substring(query, 1, 120) AS q FROM system.processes WHERE memory_usage > 0 ORDER BY memory_usage DESC LIMIT 10"

# Recent MEMORY_LIMIT_EXCEEDED errors
clickhouse-client -q "SELECT query_id, exception_code, substring(query, 1, 120) AS q FROM system.query_log WHERE type = 'ExceptionWhileProcessing' AND exception_code = 241 AND event_time > now() - INTERVAL 10 MINUTE LIMIT 10"

# Check for spill-to-disk activity in tmp
ls -lah /var/lib/clickhouse/tmp/

# Check concurrent query count and total query memory
clickhouse-client -q "SELECT count() AS queries, formatReadableSize(sum(memory_usage)) AS total_query_mem FROM system.processes"

# Check mark cache size
clickhouse-client -q "SELECT metric, formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric = 'MarkCacheBytes'"

# Check swap usage (swap thrashing makes recovery nearly impossible)
free -h | grep -i swap

How to diagnose it

  1. Confirm server-wide pressure. Compare MemoryTracking from system.metrics to max_server_memory_usage. If the ratio is sustained above 80%, the server is in the danger zone. Also check MemoryResident in system.asynchronous_metrics because RSS can exceed tracked memory due to allocator overhead and untracked allocations.
  2. Identify the memory consumer category. Run the system.processes query sorted by memory_usage. If one query dominates, it is a runaway query. If many moderate queries sum to most of the limit, it is a concurrency problem.
  3. Check for retry amplification. Query system.query_log for exception code 241 over the last 10 minutes. If the same query fingerprint appears multiple times with short intervals, the application is retrying immediately.
  4. Check spill-to-disk activity. Inspect /var/lib/clickhouse/tmp/. Growing files there mean queries are spilling. Correlate with disk I/O metrics (iostat or your infrastructure monitoring). High I/O wait during memory pressure confirms the bottleneck has shifted to disk.
  5. Correlate with merge health. Check system.merges. If merges are running but disk I/O is saturated, merge throughput drops and parts begin to accumulate. This is the secondary death spiral.
  6. Check OS and cgroup limits. In containerized deployments, verify that the cgroup memory limit is not close to ClickHouse’s internal limit. If the container limit is lower than or equal to max_server_memory_usage, the OS OOM killer can fire before ClickHouse’s circuit breaker engages.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
MemoryTracking / max_server_memory_usageMeasures proximity to ClickHouse’s internal query-kill thresholdSustained ratio > 80%
Peak per-query memoryIdentifies runaway queries before they consume the serverSingle query > 50% of server limit
FailedQuery rate with exception 241Direct evidence of the kill-and-retry loopAny sustained increase over baseline
Mark cache hit rateMemory pressure evicts caches, causing more disk I/O< 80% during pressure events (from system.events MarkCacheHits and MarkCacheMisses)
Disk I/O awaitSpill-to-disk and cache misses shift saturation to I/O> 20 ms on SSD sustained
Active merge count and throughputI/O competition from spills starves merges, causing part accumulationMerge throughput flat or falling while parts grow
OSMemoryAvailableThe OS OOM killer operates independently of ClickHouse trackers< 5% of total RAM
Concurrent query countHeavy concurrency compounds moderate per-query memory into server pressureApproaching max_concurrent_queries or 2x typical peak

Fixes

Kill the runaway query and bound per-query memory

Warning: Killing a query is disruptive. The client receives an error and may retry immediately, so coordinate with the application owner before killing production traffic.

Identify the top consumer in system.processes and kill it:

KILL QUERY WHERE query_id = '<query_id>';

Set max_memory_usage in the user profile to prevent recurrence. A sensible starting point is 10-20% of max_server_memory_usage for ad-hoc users, lower for service accounts. The tradeoff is that legitimate large queries fail instead of completing slowly.

Stop retry amplification

Implement exponential back-off and circuit-breaking at the application layer. A query that failed with code 241 should wait seconds, not milliseconds, before retrying, and should give up after a small number of attempts. The tradeoff is temporary query unavailability, but this prevents the retry loop from pinning the server at 100% memory.

Reduce or remove spill-to-disk pressure

If max_bytes_before_external_group_by or max_bytes_before_external_sort are configured and disk I/O is saturated, you have two choices. Raise the thresholds so fewer queries spill, accepting that more will hit code 241. Or leave them enabled but ensure disk I/O capacity is reserved for merges so spills do not compete on the same device. The tradeoff is between query failure and I/O saturation.

Temporarily lower concurrency

If the root cause is a surge of legitimate concurrent heavy queries, reduce max_concurrent_queries temporarily. This queues or rejects new queries until memory drops. The tradeoff is that some client requests fail or wait.

Reclaim memory from caches

If MarkCacheBytes or UncompressedCacheBytes are consuming most RAM, reduce mark_cache_size or disable the uncompressed cache. This frees memory for query working sets at the cost of more disk seeks and decompression on repeated queries.

Align container and ClickHouse limits

In Kubernetes or containerized deployments, ensure the cgroup memory limit leaves headroom above max_server_memory_usage. This prevents the OS OOM killer from terminating the process before ClickHouse can kill the offending query. If the container limit is too tight, lower the configured server-wide limit to leave that headroom.

Prevention

  • Set per-query memory limits for every non-admin user profile. An unbounded query can trigger server-wide pressure.
  • Monitor both MemoryTracking and OS RSS. Alert on divergence because tracked memory underestimates actual footprint.
  • Instrument application retry logic with back-off and jitter. Immediate retries on 241 are an anti-pattern.
  • Size caches conservatively. Leave RAM for query working memory and background merges.
  • Validate spill-to-disk thresholds against I/O capacity before enabling. Spilling is only a safety valve if the disk subsystem can absorb the extra load without starving merges.
  • Review queries that use GLOBAL IN or large JOINs. These can cause unbounded memory growth during aggregation.

How Netdata helps

Netdata correlates ClickHouse MemoryTracking with OS RSS and cgroup memory limits in one chart, surfacing tracker divergence immediately.

Netdata alerts on spikes in exception code 241 and sustained memory saturation without requiring manual system.query_log polling.

Netdata disk I/O latency charts, shown alongside query latency, make it easy to see when spill-to-disk shifts the bottleneck from memory to I/O.

Netdata collects per-query memory from system.processes continuously, so runaway queries are visible before they dominate the server.

Netdata merge pool and active part count charts detect secondary merge starvation caused by I/O competition.