ClickHouse killed by the OOM killer: RSS, max_server_memory_usage, and cgroup limits

A pod restarts, and kubectl describe pod shows Reason: OOMKilled with exit code 137. Inside ClickHouse, MemoryTracking sat well below max_server_memory_usage, and system.text_log shows no warning. The process is gone, merges are dead, and replication queues are backing up.

The Linux OOM killer targets RSS, not ClickHouse’s internal MemoryTracking. Untracked allocations, jemalloc arena fragmentation, and cgroup accounting quirks create a persistent gap between what ClickHouse thinks it is using and what the kernel sees. In containerized environments, the cgroup OOM killer can kill the container before ClickHouse ever triggers its own server-wide limit.

This guide closes that gap.

What this means

The kernel OOM killer scores processes mainly by resident memory: the anon-rss, file-rss, and shmem-rss figures printed in the kill log, plus swap and page tables. ClickHouse tracks allocations in MemoryTracking via lightweight hierarchical atomic counters. Memory that RSS counts but MemoryTracking misses includes:

  • Jemalloc dirty pages: ClickHouse uses jemalloc, which holds freed pages in arenas to reduce syscalls. RSS reflects the committed arena size, not just actively used memory.
  • mmap regions and metadata: Memory-mapped files, allocator metadata, and external library allocations are not fully captured in MemoryTracking.
  • Cgroup accounting inflation: Under cgroups v2, memory.current includes page cache and reclaimable kernel slab pages. ClickHouse attempts to subtract inactive file cache, but the corrections were incomplete until 24.7 and are still evolving for reclaimable slab pages in 25.x.

When max_server_memory_usage is set close to the physical or cgroup limit, this untracked memory can push RSS over the edge. Kubernetes enforces the pod memory limit through cgroup memory.max; if the memory charged to the cgroup exceeds that limit and cannot be reclaimed, the kernel kills the container regardless of ClickHouse’s internal state.

flowchart TD
    A[Query and merge allocations] --> B[ClickHouse MemoryTracking]
    C[Jemalloc dirty arenas] --> D[Process RSS]
    E[mmap and external libraries] --> D
    B --> D
    F[max_server_memory_usage] --> B
    G[Cgroup memory.max] --> H[Kernel OOM killer]
    D --> H
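
To make the gap concrete, the sketch below projects RSS at the moment MemoryTracking reaches the server limit, using an observed MemoryResident / MemoryTracking ratio, and compares the result with the cgroup limit. The function names and the sample numbers (a 16 GiB pod, a 90% server limit, a ratio of 1.3) are illustrative assumptions, not ClickHouse defaults or APIs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate RSS when MemoryTracking reaches the server limit.
# $1 = max_server_memory_usage in bytes, $2 = observed resident/tracked ratio.
projected_rss_bytes() {
    awk -v limit="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", limit * ratio }'
}

# Exit status 0 when the projected RSS would exceed the cgroup limit.
# $1 = server limit, $2 = ratio, $3 = cgroup memory.max in bytes.
would_exceed_cgroup() {
    local projected
    projected=$(projected_rss_bytes "$1" "$2")
    [ "$projected" -gt "$3" ]
}

# Sample values only: 16 GiB pod, server limit at 90% of it, ratio 1.3.
pod_limit=$((16 * 1024 * 1024 * 1024))
server_limit=$((pod_limit * 90 / 100))
if would_exceed_cgroup "$server_limit" 1.3 "$pod_limit"; then
    echo "projected RSS exceeds the cgroup limit"
fi
```

With these sample numbers the projection lands above the pod limit, which is the situation described above: the kernel acts before ClickHouse's own limit does.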

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Jemalloc fragmentation and dirty pages | MemoryResident exceeds MemoryTracking by 20-40% persistently; memory does not drop after large queries finish | Compare system.asynchronous_metrics MemoryResident against system.metrics MemoryTracking |
| Cgroup v2 page cache or slab inflation (pre-24.7 or unpatched) | Sudden OOM kills under Kubernetes with no spike in MemoryTracking; cgroup memory.current is high while process RSS looks lower | Check ClickHouse version and whether cgroup memory observer subtracts page cache correctly |
| max_server_memory_usage sized at or above the cgroup limit | OOM kills happen exactly at the Kubernetes memory limit, but ClickHouse logs show no MEMORY_LIMIT_EXCEEDED | Inspect cgroup memory.max and compare with max_server_memory_usage from system.server_settings |
| Concurrent query allocation race | Memory allocated faster than atomic counters can enforce the limit; multiple queries breach the limit simultaneously | Check system.query_log for queries with exception_code = 241 (MEMORY_LIMIT_EXCEEDED) around the time of the OOM kill |
| Merge or mutation memory spikes | Large background merges or lightweight deletes drive RSS up quickly; merges_mutations_memory_usage_soft_limit is unset (default 0) | Query system.merges for memory_usage during the incident window |

Quick checks

# Check kernel OOM killer logs for the ClickHouse process
dmesg -T | grep -i 'killed process.*clickhouse'

# On Kubernetes, confirm OOMKilled reason and exit code 137
kubectl describe pod <pod-name> | grep -E 'Reason|Exit Code'

-- Compare tracked memory versus resident memory inside ClickHouse
SELECT
    (SELECT value FROM system.metrics WHERE metric = 'MemoryTracking') AS tracked,
    (SELECT value FROM system.asynchronous_metrics WHERE metric = 'MemoryResident') AS resident,
    round(resident / tracked, 2) AS ratio;

# OS-level RSS and virtual size for the clickhouse-server process
PID=$(pidof clickhouse-server) && ps -o pid,rss,vsz,comm -p $PID

# Cgroup memory limit for the current process
cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo "unlimited"

-- Current server memory limit and ratio settings
SELECT name, value, changed FROM system.server_settings
WHERE name IN ('max_server_memory_usage', 'max_server_memory_usage_to_ram_ratio');

# Search the previous boot's kernel log if the host restarted
# (-k alone implies the current boot; -b -1 needs a persistent journal)
journalctl -k -b -1 -g 'Killed process'
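
The cgroup limit files do not always contain a plain byte count: cgroups v2 writes the literal string max when no limit is set, and cgroups v1 writes a very large number. A minimal sketch of normalizing that value follows. The helper names are hypothetical, and treating anything at or above 2^60 bytes as unlimited is an assumption chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: normalize a raw cgroup limit value to bytes or "unlimited".
# cgroups v2 writes the literal string "max"; cgroups v1 writes a huge number.
# Treating anything >= 2^60 as unlimited is an assumption of this sketch.
normalize_cgroup_limit() {
    local raw="$1"
    if [ -z "$raw" ] || [ "$raw" = "max" ]; then
        echo "unlimited"
    elif [ "$raw" -ge $((1 << 60)) ]; then
        echo "unlimited"
    else
        echo "$raw"
    fi
}

# Read whichever limit file exists (v2 first, then v1), using the same paths
# as the one-liner above.
read_cgroup_limit() {
    local raw=""
    if [ -r /sys/fs/cgroup/memory.max ]; then
        raw=$(cat /sys/fs/cgroup/memory.max)
    elif [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
        raw=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
    fi
    normalize_cgroup_limit "$raw"
}

read_cgroup_limit
```

Run it inside the container so the paths resolve to the pod's own cgroup rather than the host's.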

How to diagnose it

  1. Confirm the kill was OOM-related. On bare metal, dmesg -T | grep -i 'killed process.*clickhouse' shows Out of memory: Killed process NNN (clickhouse-serv) total-vm:XkB, anon-rss:YkB. On Kubernetes, kubectl describe pod shows Reason: OOMKilled and exit code 137.
  2. Compare MemoryTracking and MemoryResident. Query system.metrics and system.asynchronous_metrics. A ratio above 1.2 indicates significant untracked or retained memory. Ratios above 1.4 are dangerous when running near limits.
  3. Check the cgroup ceiling. Read /sys/fs/cgroup/memory.max (or memory.limit_in_bytes for cgroups v1). If this value is lower than max_server_memory_usage, the cgroup will kill the process before ClickHouse throttles itself.
  4. Review the startup limit calculation. In ClickHouse logs, look for Setting max_server_memory_usage was set to N GiB (M GiB available * 0.90 ...). If it auto-computed to 90% of host RAM but the cgroup limit is lower, the startup calculation used host RAM, not the constrained cgroup allowance.
  5. Look for concurrent heavy queries. Query system.query_log for queries with large peak_memory_usage that executed just before the restart. Multiple queries near their individual limits can sum to an RSS spike that bypasses atomic counter enforcement.
  6. Check for mutations or lightweight deletes. system.mutations with is_done = 0 and system.merges with high memory_usage can explain background RSS spikes that MemoryTracking underestimates.
  7. Verify version-specific cgroup behavior. If running a version older than 24.7, cgroups v2 page cache inclusion can inflate perceived usage. If running older than 21.12, ClickHouse cannot detect cgroup limits at all.
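
The ratio check in step 2 can be sketched as a small classifier. The function name and the labels it prints are hypothetical; the 1.2 and 1.4 thresholds come from the step itself. The commented usage lines assume clickhouse-client is on the PATH and can reach the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the MemoryResident / MemoryTracking ratio
# using the 1.2 and 1.4 thresholds from step 2.
# $1 = MemoryResident in bytes, $2 = MemoryTracking in bytes.
classify_memory_ratio() {
    awk -v resident="$1" -v tracked="$2" 'BEGIN {
        if (tracked <= 0) { print "unknown"; exit }
        ratio = resident / tracked
        if (ratio > 1.4)      print "dangerous"
        else if (ratio > 1.2) print "significant"
        else                  print "ok"
    }'
}

# Possible usage against a live server:
#   resident=$(clickhouse-client --query "SELECT value FROM system.asynchronous_metrics WHERE metric = 'MemoryResident'")
#   tracked=$(clickhouse-client --query "SELECT value FROM system.metrics WHERE metric = 'MemoryTracking'")
#   classify_memory_ratio "$resident" "$tracked"
```

A single sample can mislead right after a large query finishes, so it is worth running the check several times before drawing conclusions.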

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| MemoryResident / MemoryTracking ratio | Reveals untracked or retained memory headroom | Ratio > 1.3 sustained |
| Cgroup memory limit vs max_server_memory_usage | Ensures ClickHouse limits itself before the cgroup kills it | max_server_memory_usage >= cgroup memory.max |
| OSMemoryAvailable (async metric) | Tracks host-level free memory outside ClickHouse’s view | Dropping toward 5% of total |
| Queries failing with exception code 241 (MEMORY_LIMIT_EXCEEDED) | Indicates ClickHouse is hitting its own limit before the kernel does | Sustained rate > 0 |
| Merge/mutation memory usage | Background tasks can spike RSS independently of query memory | Individual merge memory > 50% of server limit |
| Pod restart count / exit code 137 | Direct signal of cgroup OOM kills in Kubernetes | Any unexplained restart |

Fixes

Size max_server_memory_usage below the cgroup limit

Set max_server_memory_usage explicitly to a value well below the cgroup or pod memory limit. Do not rely solely on the default auto-calculation, which uses host RAM and a 0.9 ratio. max_server_memory_usage can be changed at runtime, but max_server_memory_usage_to_ram_ratio cannot. In Kubernetes, a safe starting point is 80% of the pod memory limit.
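
As a rough sketch of the sizing rule, the helper below turns a pod memory limit into an explicit max_server_memory_usage override. The function name is hypothetical, the 80% default mirrors the starting point suggested above, and the <clickhouse> root element assumes a reasonably recent release (older configurations may use a different root element).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a config override that pins max_server_memory_usage
# to a percentage of the pod memory limit (80 by default, per the text above).
# $1 = pod memory limit in bytes, $2 = optional percentage.
render_memory_override() {
    local pod_limit="$1" pct="${2:-80}"
    local server_limit=$((pod_limit * pct / 100))
    cat <<EOF
<clickhouse>
    <max_server_memory_usage>${server_limit}</max_server_memory_usage>
</clickhouse>
EOF
}

# Example: 16 GiB pod limit. Redirect the output into a drop-in file such as
# /etc/clickhouse-server/config.d/memory.xml (the file name is an assumption).
render_memory_override $((16 * 1024 * 1024 * 1024))
```

Regenerate the override whenever the pod limit changes, since the value is an absolute byte count and does not follow the limit automatically.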

Enable memory worker correction with caution

The memory worker thread can correct MemoryTracking against jemalloc and cgroup data. Settings include memory_worker_use_cgroup (default 1), memory_worker_correct_memory_tracker (default 0), memory_worker_decay_adjustment_period_ms (default 5000), and memory_worker_purge_dirty_pages_threshold_ratio (default 0.2). Enabling memory_worker_correct_memory_tracker can reduce drift, but it may cause abrupt resets that oscillate with AsynchronousMetrics and produce spurious MEMORY_LIMIT_EXCEEDED exceptions in some versions.

Cap merge and mutation memory separately

Set merges_mutations_memory_usage_soft_limit or merges_mutations_memory_usage_to_ram_ratio (default 0.5) to prevent background merges from consuming unbounded RSS. This is especially important when running lightweight deletes or large mutations that rewrite many parts.

Reduce concurrent query pressure

If rapid concurrent allocations are racing past the atomic counters, lower per-query limits and concurrency. Set explicit max_memory_usage per user profile and consider reducing max_concurrent_queries during incidents.
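
One way to pick a per-query limit is to divide the server-wide limit by the number of heavy queries expected to overlap. This even split is an illustrative rule of thumb, not a ClickHouse recommendation, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical rule of thumb: split the server-wide limit evenly across the
# number of heavy queries expected to run at once.
# $1 = max_server_memory_usage in bytes, $2 = expected concurrent heavy queries.
per_query_budget_bytes() {
    if [ "$2" -le 0 ]; then
        echo "concurrency must be positive" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Example: a 12 GiB server limit shared by 4 heavy queries.
per_query_budget_bytes $((12 * 1024 * 1024 * 1024)) 4
```

The resulting byte count is a candidate value for max_memory_usage in the relevant user profile; workloads with one dominant query type may need an uneven split.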

Upgrade for cgroup detection accuracy

ClickHouse versions before 21.12 do not detect cgroup memory limits at all. Versions before 24.7 do not fully exclude cgroups v2 page cache from cgroup memory observations. If you run ClickHouse in containers, upgrade to at least 24.7, and monitor release notes for ongoing slab reclaimable fixes.

Prevention

  • Monitor RSS, not just MemoryTracking. Alert on MemoryResident or cgroup memory.current as a percentage of the cgroup limit. ClickHouse internals alone will not warn you of an impending kernel OOM kill.
  • Set explicit memory limits. Define max_server_memory_usage in config rather than relying on auto-detection. Recalculate it whenever moving to a different instance size or Kubernetes memory limit.
  • Leave headroom for untracked allocations. Maintain at least 20% of available RAM as a gap between max_server_memory_usage and the physical or cgroup ceiling.
  • Watch the merge pool. Large merges and mutations are common RSS spikes. Monitor system.merges memory usage and set merges_mutations_memory_usage_to_ram_ratio on busy write nodes.
  • Tune the OOM score if needed. On Linux, the oom_score setting (default 0) influences OOM killer preference. It is not changeable at runtime, so set it in configuration if you need ClickHouse to survive longer than co-located processes.
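
The first bullet's alert can be sketched as a percentage check against the cgroup limit. The helper names and the 85% default threshold are assumptions for illustration; the commented usage applies to a cgroups v2 host where memory.max holds a finite number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: memory usage as a whole-number percentage of the limit.
# $1 = usage in bytes (RSS or cgroup memory.current), $2 = limit in bytes.
usage_pct_of_limit() {
    echo $(( $1 * 100 / $2 ))
}

# Exit status 0 when usage is at or above the threshold.
# $3 = optional threshold in percent (85 is a sample default).
over_threshold() {
    local pct
    pct=$(usage_pct_of_limit "$1" "$2")
    [ "$pct" -ge "${3:-85}" ]
}

# Possible usage inside a container with a finite cgroups v2 limit:
#   over_threshold "$(cat /sys/fs/cgroup/memory.current)" "$(cat /sys/fs/cgroup/memory.max)" \
#       && echo "memory above 85% of the cgroup limit"
```

Because memory.current includes page cache, a check based on it will tend to fire earlier than one based on process RSS.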

How Netdata helps

  • Chart MemoryResident against MemoryTracking to expose the divergence gap.
  • Alert on RSS as a percentage of the cgroup limit independently of ClickHouse internal metrics, catching container OOMs before the pod dies.
  • Track system.metrics MemoryTracking alongside OS-level available memory to distinguish ClickHouse pressure from host-level pressure.
  • Monitor Kubernetes pod status and exit codes to surface OOMKilled events that ClickHouse never logs.
  • Correlate query latency spikes with memory saturation to identify runaway queries before RSS peaks.