ClickHouse killed by the OOM killer: RSS, max_server_memory_usage, and cgroup limits
A pod restarts, and kubectl describe pod shows Reason: OOMKilled with exit code 137. Inside ClickHouse, MemoryTracking sat well below max_server_memory_usage, and system.text_log shows no warning. The process is gone, merges are dead, and replication queues are backing up.
The Linux OOM killer targets RSS, not ClickHouse’s internal MemoryTracking. Untracked allocations, jemalloc arena fragmentation, and cgroup accounting quirks create a persistent gap between what ClickHouse thinks it is using and what the kernel sees. In containerized environments, the cgroup OOM killer can evict the pod before ClickHouse ever triggers its own server-wide limit.
This guide closes that gap.
What this means
The kernel OOM killer targets processes by resident set size, specifically anon-rss plus file-rss. ClickHouse tracks allocations in MemoryTracking via lightweight hierarchical atomic counters. Memory that RSS counts but MemoryTracking misses includes:
- Jemalloc dirty pages: ClickHouse uses jemalloc, which holds freed pages in arenas to reduce syscalls. RSS reflects the committed arena size, not just actively used memory.
- mmap regions and metadata: Memory-mapped files, allocator metadata, and external library allocations are not fully captured in MemoryTracking.
- Cgroup accounting inflation: Under cgroups v2, memory.current historically included page cache and kernel slab reclaimable pages. ClickHouse attempts to subtract inactive file cache, but corrections were incomplete until 24.7 and are still evolving for slab reclaimable pages in 25.x.
When max_server_memory_usage is set close to the physical or cgroup limit, this untracked memory can push RSS over the edge. Kubernetes enforces the pod memory limit through cgroup memory.max; if the container’s charged memory reaches that limit and the kernel cannot reclaim enough, the container is killed regardless of ClickHouse’s internal state.
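To make the arithmetic concrete, the sketch below projects RSS for the case where MemoryTracking reaches max_server_memory_usage while the observed MemoryResident / MemoryTracking ratio holds. The function names are hypothetical, and a constant ratio is a simplifying assumption rather than something ClickHouse guarantees.

```shell
# Project RSS (bytes) if tracked memory reaches the server limit and the
# observed resident/tracked ratio stays constant (an assumption).
# Args: server limit in bytes, ratio (for example 1.2).
projected_rss() {
    awk -v limit="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", limit * ratio }'
}

# Exit status 0 if the projection stays below the cgroup limit, 1 otherwise.
# Args: server limit in bytes, ratio, cgroup limit in bytes.
fits_under_cgroup() {
    awk -v limit="$1" -v ratio="$2" -v cg="$3" \
        'BEGIN { exit !(limit * ratio < cg) }'
}
```

With a 64 GiB cgroup limit, a server limit of 0.9 x 64 GiB = 57.6 GiB, and a ratio of 1.2, the projection is about 69 GiB, which does not fit under the cgroup ceiling.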
```mermaid
flowchart TD
    A[Query and merge allocations] --> B[ClickHouse MemoryTracking]
    C[Jemalloc dirty arenas] --> D[Process RSS]
    E[mmap and external libraries] --> D
    B --> D
    F[max_server_memory_usage] --> B
    G[Cgroup memory.max] --> H[Kernel OOM killer]
    D --> H
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Jemalloc fragmentation and dirty pages | MemoryResident exceeds MemoryTracking by 20-40% persistently; memory does not drop after large queries finish | Compare system.asynchronous_metrics MemoryResident against system.metrics MemoryTracking |
| Cgroup v2 page cache or slab inflation (pre-24.7 or unpatched) | Sudden OOM kills under Kubernetes with no spike in MemoryTracking; cgroup memory.current is high while process RSS looks lower | Check ClickHouse version and whether cgroup memory observer subtracts page cache correctly |
| max_server_memory_usage sized at or above the cgroup limit | OOM kills happen exactly at the Kubernetes memory limit, but ClickHouse logs show no MEMORY_LIMIT_EXCEEDED | Inspect cgroup memory.max and compare with max_server_memory_usage from system.server_settings |
| Concurrent query allocation race | Memory allocated faster than atomic counters can enforce the limit; multiple queries breach the limit simultaneously | Check system.query_log for exception_code = 241 (MEMORY_LIMIT_EXCEEDED) around the time of the OOM kill |
| Merge or mutation memory spikes | Large background merges or lightweight deletes drive RSS up quickly; merges_mutations_memory_usage_soft_limit is unset (default 0) | Query system.merges for memory_usage while merges are running; for a past incident window, check system.part_log if it is enabled |
Quick checks
```shell
# Check kernel OOM killer logs for the ClickHouse process
dmesg -T | grep -i 'killed process.*clickhouse'
# On Kubernetes, confirm OOMKilled reason and exit code 137
kubectl describe pod <pod-name> | grep -E 'Reason|Exit Code'
```
```sql
-- Compare tracked memory versus resident memory inside ClickHouse
SELECT
    (SELECT value FROM system.metrics WHERE metric = 'MemoryTracking') AS tracked,
    (SELECT value FROM system.asynchronous_metrics WHERE metric = 'MemoryResident') AS resident,
    round(resident / tracked, 2) AS ratio;
```
```shell
# OS-level RSS and virtual size for the clickhouse-server process
PID=$(pidof clickhouse-server) && ps -o pid,rss,vsz,comm -p $PID
# Cgroup memory limit for the current process (v2 first, then v1)
cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo "unlimited"
```
```sql
-- Current server memory limit and ratio settings
SELECT name, value, changed FROM system.server_settings
WHERE name IN ('max_server_memory_usage', 'max_server_memory_usage_to_ram_ratio');
```
```shell
# Search the previous boot's kernel log if the host restarted
# (-k alone covers only the current boot; needs persistent journald storage)
journalctl -k -b -1 -g 'Killed process'
```
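When cgroup memory.current looks higher than process RSS, it can help to see how the cgroup charge splits into anonymous memory, page cache, and reclaimable slab. The following is a minimal sketch that assumes cgroups v2 and the standard memory.stat keys (anon, file, inactive_file, slab_reclaimable); the function name cgroup_breakdown is hypothetical.

```shell
# Summarize a cgroup v2 memory.stat file in MiB.
# Arg: path to memory.stat (defaults to the usual in-container location).
cgroup_breakdown() {
    stat_file="${1:-/sys/fs/cgroup/memory.stat}"
    awk '
        $1 == "anon"             { anon = $2 }
        $1 == "file"             { file = $2 }
        $1 == "inactive_file"    { inactive = $2 }
        $1 == "slab_reclaimable" { slab = $2 }
        END {
            mib = 1024 * 1024
            printf "anon_mib=%d file_mib=%d inactive_file_mib=%d slab_reclaimable_mib=%d\n",
                   anon / mib, file / mib, inactive / mib, slab / mib
        }
    ' "$stat_file"
}
```

A large file or slab_reclaimable share next to a modest anon value suggests the cgroup charge is dominated by cache rather than by ClickHouse allocations.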
How to diagnose it
- Confirm the kill was OOM-related. On bare metal, dmesg -T | grep -i 'killed process.*clickhouse' shows Out of memory: Killed process NNN (clickhouse-serv) total-vm:XkB, anon-rss:YkB. On Kubernetes, kubectl describe pod shows Reason: OOMKilled and exit code 137.
- Compare MemoryTracking and MemoryResident. Query system.metrics and system.asynchronous_metrics. A ratio above 1.2 indicates significant untracked or retained memory. Ratios above 1.4 are dangerous when running near limits.
- Check the cgroup ceiling. Read /sys/fs/cgroup/memory.max (or memory.limit_in_bytes for cgroups v1). If this value is lower than max_server_memory_usage, the cgroup will kill the process before ClickHouse throttles itself.
- Review the startup limit calculation. In ClickHouse logs, look for Setting max_server_memory_usage was set to N GiB (M GiB available * 0.90 ...). If it auto-computed to 90% of host RAM but the cgroup limit is lower, the startup calculation used host RAM, not the constrained cgroup allowance.
- Look for concurrent heavy queries. Query system.query_log for queries with large peak_memory_usage that executed just before the restart. Multiple queries near their individual limits can sum to an RSS spike that bypasses atomic counter enforcement.
- Check for mutations or lightweight deletes. system.mutations with is_done = 0 and system.merges with high memory_usage can explain background RSS spikes that MemoryTracking underestimates.
- Verify version-specific cgroup behavior. If running a version older than 24.7, cgroups v2 page cache inclusion can inflate perceived usage. If running older than 21.12, ClickHouse cannot detect cgroup limits at all.
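For the first step, the anon-rss figure in the kernel message is the number to compare against max_server_memory_usage. A small sketch for pulling it out of the log line follows; it assumes the message contains an anon-rss:NkB field as in the sample above (the exact format varies between kernel versions), and the function name oom_anon_rss_kb is hypothetical.

```shell
# Read kernel OOM lines on stdin and print the anon-rss value in kB
# for every line that carries an "anon-rss:<digits>kB" field.
# Usage: dmesg -T | grep -i 'killed process.*clickhouse' | oom_anon_rss_kb
oom_anon_rss_kb() {
    sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p'
}
```

If the extracted value is far above MemoryTracking at the time of the kill, the gap is untracked or retained memory rather than query allocations.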
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| MemoryResident / MemoryTracking ratio | Reveals untracked or retained memory headroom | Ratio > 1.3 sustained |
| Cgroup memory limit vs max_server_memory_usage | Ensures ClickHouse limits itself before the cgroup kills it | max_server_memory_usage >= cgroup memory.max |
| OSMemoryAvailable (async metric) | Tracks host-level free memory outside ClickHouse’s view | Dropping toward 5% of total |
| Failed queries with exception code 241 (MEMORY_LIMIT_EXCEEDED) | Indicates ClickHouse is hitting its own limit before the kernel does | Sustained rate > 0 |
| Merge/mutation memory usage | Background tasks can spike RSS independently of query memory | Individual merge memory > 50% of server limit |
| Pod restart count / exit code 137 | Direct signal of cgroup OOM kills in Kubernetes | Any unexplained restart |
Fixes
Size max_server_memory_usage below the cgroup limit
Set max_server_memory_usage explicitly to a value well below the cgroup or pod memory limit. Do not rely solely on the default auto-calculation, which uses host RAM and a 0.9 ratio. max_server_memory_usage can be changed at runtime, but max_server_memory_usage_to_ram_ratio cannot. In Kubernetes, a safe starting point is 80% of the pod memory limit.
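As an illustration of the 80% starting point, the sketch below turns a memory limit in bytes into a server config fragment. The function name render_memory_limit_config is hypothetical, and the clickhouse root element assumes a reasonably recent release (older releases used a different root element name).

```shell
# Print a config fragment that pins max_server_memory_usage to a
# percentage of a given memory limit.
# Args: limit in bytes, percent of the limit to allow (default 80).
render_memory_limit_config() {
    limit_bytes="$1"
    percent="${2:-80}"
    # Divide first to keep the intermediate value small.
    server_limit=$(( limit_bytes / 100 * percent ))
    cat <<EOF
<clickhouse>
    <max_server_memory_usage>${server_limit}</max_server_memory_usage>
</clickhouse>
EOF
}
```

For an 8 GiB pod limit, render_memory_limit_config 8589934592 yields a limit of roughly 6.4 GiB. Where the output is written depends on your deployment; a drop-in file under the server's config.d directory is one common arrangement. Note that memory.max reads as the literal string max when no limit is set, so check for that before passing the value in.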
Enable memory worker correction with caution
The memory worker thread can correct MemoryTracking against jemalloc and cgroup data. Settings include memory_worker_use_cgroup (default 1), memory_worker_correct_memory_tracker (default 0), memory_worker_decay_adjustment_period_ms (default 5000), and memory_worker_purge_dirty_pages_threshold_ratio (default 0.2). Enabling memory_worker_correct_memory_tracker can reduce drift, but it may cause abrupt resets that oscillate with AsynchronousMetrics and produce spurious MEMORY_LIMIT_EXCEEDED exceptions in some versions.
Cap merge and mutation memory separately
Set merges_mutations_memory_usage_soft_limit or merges_mutations_memory_usage_to_ram_ratio (default 0.5) to prevent background merges from consuming unbounded RSS. This is especially important when running lightweight deletes or large mutations that rewrite many parts.
Reduce concurrent query pressure
If rapid concurrent allocations are racing past the atomic counters, lower per-query limits and concurrency. Set explicit max_memory_usage per user profile and consider reducing max_concurrent_queries during incidents.
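One rough way to pick a per-query max_memory_usage is to divide the server limit, minus a reserve for merges and other background work, by the number of heavy queries you expect to run at once. This is an illustrative heuristic, not a ClickHouse rule; the function name per_query_budget and the 20% default reserve are assumptions.

```shell
# Suggest a per-query memory budget in bytes.
# Args: server limit in bytes, expected concurrent heavy queries,
#       percent of the server limit reserved for background work (default 20).
per_query_budget() {
    server_limit="$1"
    concurrent="$2"
    reserve="${3:-20}"
    echo $(( server_limit / 100 * (100 - reserve) / concurrent ))
}
```

For a 10 GB server limit and four concurrent heavy queries, this gives 2 GB per query, which could then be set as max_memory_usage in the relevant user profile.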
Upgrade for cgroup detection accuracy
ClickHouse versions before 21.12 do not detect cgroup memory limits at all. Versions before 24.7 do not fully exclude cgroups v2 page cache from cgroup memory observations. If you run ClickHouse in containers, upgrade to at least 24.7, and monitor release notes for ongoing slab reclaimable fixes.
Prevention
- Monitor RSS, not just MemoryTracking. Alert on MemoryResident or cgroup memory.current as a percentage of the cgroup limit. ClickHouse internals alone will not warn you of an impending kernel OOM kill.
- Set explicit memory limits. Define max_server_memory_usage in config rather than relying on auto-detection. Recalculate it whenever moving to a different instance size or Kubernetes memory limit.
- Leave headroom for untracked allocations. Maintain at least 20% of available RAM as a gap between max_server_memory_usage and the physical or cgroup ceiling.
- Watch the merge pool. Large merges and mutations are common sources of RSS spikes. Monitor system.merges memory usage and set merges_mutations_memory_usage_to_ram_ratio on busy write nodes.
- Tune the OOM score if needed. On Linux, the oom_score setting (default 0) influences OOM killer preference. It is not changeable at runtime, so set it in configuration if you need ClickHouse to survive longer than co-located processes.
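The first prevention item can be approximated with a very small check that an external monitor or cron job runs. The sketch below assumes cgroups v2, where memory.max contains the literal string max when no limit is set; the function name cgroup_usage_percent and the 90% threshold in the usage comment are illustrative choices.

```shell
# Print cgroup memory usage as an integer percentage of the limit,
# or "unlimited" when the limit is the cgroup v2 literal "max".
# Args: value of memory.current, value of memory.max.
cgroup_usage_percent() {
    current="$1"
    limit="$2"
    if [ "$limit" = "max" ]; then
        echo "unlimited"
        return 0
    fi
    echo $(( current * 100 / limit ))
}

# Usage (inside the container):
#   pct=$(cgroup_usage_percent "$(cat /sys/fs/cgroup/memory.current)" \
#                              "$(cat /sys/fs/cgroup/memory.max)")
#   [ "$pct" != "unlimited" ] && [ "$pct" -ge 90 ] && echo "memory at ${pct}% of cgroup limit"
```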
How Netdata helps
- Chart MemoryResident against MemoryTracking to expose the divergence gap.
- Alert on RSS as a percentage of the cgroup limit independently of ClickHouse internal metrics, catching container OOMs before the pod dies.
- Track system.metrics MemoryTracking alongside OS-level available memory to distinguish ClickHouse pressure from host-level pressure.
- Monitor Kubernetes pod status and exit codes to surface OOMKilled events that ClickHouse never logs.
- Correlate query latency spikes with memory saturation to identify runaway queries before RSS peaks.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool
- ClickHouse monitoring checklist: the signals every production cluster needs
- ClickHouse monitoring maturity model: from survival to expert
- ClickHouse projections and hidden parts: the part count you can’t see







