ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
You finish a large batch query, open monitoring, and see ClickHouse MemoryTracking drop by 40 GB while MemoryResident barely moves. Or you watch RSS climb for hours after a restart while MemoryTracking tracks the rise steadily. Neither pattern indicates a leak. The gap between ClickHouse’s internal ledger and OS resident set size is normal: the server accounts for memory synchronously while jemalloc retains pages for reuse.
This article explains the mechanism behind the divergence, when the gap is expected, and which metric to trust for which operational decision.
What it is and why it matters
MemoryTracking lives in system.metrics. It is the sum of all allocations passing through ClickHouse’s hierarchical MemoryTracker: server-level totals, per-user aggregates, and per-query working sets. Every GROUP BY hash table, JOIN buffer, mark cache entry, merge scratch buffer, and decompression block that ClickHouse explicitly manages is added and subtracted synchronously on alloc and free.
MemoryResident lives in system.asynchronous_metrics. It is the resident set size (RSS) of the clickhouse-server process, sampled periodically in the background. This is what the Linux OOM killer watches and what container runtimes enforce. Because it includes allocator-retained pages, memory-mapped files, and library overhead that ClickHouse never tracks, it is usually larger than the internally accounted figure.
The two metrics serve different operational purposes. MemoryTracking tells you which query or subsystem is consuming space and whether a single operation is about to hit a per-query or server-wide limit. MemoryResident tells you whether the process is about to be killed by the kernel. Treating MemoryTracking as the single source of truth routinely underestimates true footprint. Treating a post-query RSS plateau as a leak wastes hours profiling a healthy allocator.
How it works
ClickHouse maintains a tree of MemoryTracker objects. When a query executes, its allocator calls track(alloc) and track(free) against the query’s node, which rolls up to the user and server totals. This gives precise attribution: system.processes.memory_usage shows the current allocation, and system.processes.peak_memory_usage shows the high-water mark.
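The roll-up can be pictured with a toy model. The sketch below is not ClickHouse code: ToyTracker, its fields, and the node names are hypothetical, and it only assumes that each node adds and subtracts bytes synchronously and forwards the same change to its parent.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTracker:
    """Toy stand-in for one node in a memory-tracker tree (hypothetical)."""
    name: str
    parent: Optional["ToyTracker"] = None
    current: int = 0  # bytes currently accounted at this level
    peak: int = 0     # high-water mark seen at this level

    def alloc(self, size: int) -> None:
        # Record the allocation here, then roll it up to every ancestor.
        node: Optional[ToyTracker] = self
        while node is not None:
            node.current += size
            node.peak = max(node.peak, node.current)
            node = node.parent

    def free(self, size: int) -> None:
        # A free is deducted at every level right away; peaks are kept.
        node: Optional[ToyTracker] = self
        while node is not None:
            node.current -= size
            node = node.parent


server = ToyTracker("server")
user = ToyTracker("user:etl", parent=server)
query = ToyTracker("query:abc", parent=user)

query.alloc(30 * 2**30)  # a 30 GiB hash table
query.free(30 * 2**30)   # the query finishes and releases it
print(server.current, server.peak)
```

After the free, every level reports zero current usage while the peak still records the 30 GiB high-water mark, which mirrors the difference between memory_usage and peak_memory_usage.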
However, ClickHouse does not return freed pages to the operating system immediately. The server uses jemalloc, which holds deallocated memory in thread-local caches and dirty-page bins for reuse. When a query frees 30 GB of hash tables, those pages are deducted from MemoryTracking immediately because the tracker records the logical free. The pages remain in the process resident set, available for the next allocation. This is the primary driver of the gap after large queries finish.
jemalloc organizes memory into arenas and size classes. Freed large allocations often sit in dirty or muzzy states until the allocator triggers a background purge or the process faces memory pressure. You may observe RSS drop suddenly during a system-level memory shortage even though ClickHouse workload and MemoryTracking remain unchanged.
Untracked allocations widen the divergence further. Memory-mapped files, certain third-party library buffers, and jemalloc’s own metadata sit outside the tracked hierarchy. The approximate relationship is:
MemoryResident ≈ MemoryTracking + jemalloc retained/dirty pages + untracked allocations
This is why RSS can exceed MemoryTracking by a wide margin on a warm node, and why the delta can spike right after a query ends even though no new memory was allocated.
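As a worked example of that relationship, the sketch below splits a resident figure into its three terms. The function name and the allocator_retained argument are assumptions made for illustration: you would have to supply your own estimate of allocator-held pages, and the remainder is only as good as that estimate.

```python
GIB = 2**30


def decompose_gap(resident: int, tracked: int, allocator_retained: int) -> dict:
    """Split RSS into the terms of the approximate relationship (all bytes).

    allocator_retained is a caller-supplied estimate (an assumption here).
    """
    gap = resident - tracked
    return {
        "gap": gap,
        "allocator_retained": allocator_retained,
        # Whatever the allocator estimate does not explain is attributed
        # to untracked allocations.
        "untracked_estimate": gap - allocator_retained,
    }


parts = decompose_gap(
    resident=100 * GIB,
    tracked=60 * GIB,
    allocator_retained=30 * GIB,
)
print({name: value // GIB for name, value in parts.items()})
```

With these illustrative numbers the gap is 40 GiB, of which 30 GiB is attributed to the allocator and 10 GiB is left as the untracked estimate.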
The server-level MemoryTracker enforces max_server_memory_usage. When the sum of tracked allocations approaches this limit, new queries may receive MEMORY_LIMIT_EXCEEDED errors even if the OS reports available RAM. ClickHouse prefers to fail queries rather than invite the OOM killer. The gap between tracked and resident memory determines how much headroom actually exists before the kernel intervenes.
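The two kinds of headroom can be compared directly. This is a minimal sketch with hypothetical names: it assumes you have already read MemoryTracking and MemoryResident, know the effective server limit, and know the physical or cgroup cap. The numbers are illustrative.

```python
GIB = 2**30


def memory_headroom(tracked: int, resident: int,
                    server_limit: int, os_limit: int) -> dict:
    """Compare distance to the ClickHouse limit with distance to the OS limit.

    tracked      -- MemoryTracking, bytes
    resident     -- MemoryResident, bytes
    server_limit -- effective max_server_memory_usage, bytes
    os_limit     -- physical RAM or cgroup cap, bytes
    """
    to_server = server_limit - tracked
    to_os = os_limit - resident
    return {
        "to_server_limit": to_server,
        "to_os_limit": to_os,
        # The smaller margin is reached first if both grow at the same rate.
        "binds_first": "server" if to_server <= to_os else "os",
    }


ram = 128 * GIB
report = memory_headroom(
    tracked=70 * GIB,
    resident=120 * GIB,
    server_limit=ram * 9 // 10,  # a limit set to 90% of RAM
    os_limit=ram,
)
print(report["binds_first"])
```

Here the tracked figure is far from the server limit, yet only 8 GiB of resident headroom remains, so the kernel limit is the one to alert on.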
Caches are another intentional, tracked consumer. The mark cache and uncompressed cache are accounted inside MemoryTracking. On a warm analytical node, a large portion of MemoryTracking is simply these caches doing their job. You can see their current sizes through system.metrics (MarkCacheBytes) and system.asynchronous_metrics (UncompressedCacheBytes).
flowchart TD
    RSS["MemoryResident (RSS)"]
    subgraph Tracked ["Accounted in MemoryTracking"]
        MT["Query buffers, caches, merges"]
    end
    subgraph Gap ["The gap"]
        JEM["jemalloc retained/dirty pages"]
        UNTR["Untracked mmap and library overhead"]
    end
    Tracked --> RSS
    Gap --> RSS
Where it shows up in production
Post-query plateau. A batch job with a heavy aggregation allocates tens of gigabytes of hash tables. MemoryTracking rises with the query and falls when it ends. RSS rises too, but it stays elevated for minutes afterward. If you only watch RSS, you think the memory leaked. If you only watch MemoryTracking, you think the node has plenty of headroom. The truth is that jemalloc retained the pages for reuse and will likely surrender them only under sustained pressure or when the allocator decides to purge.
Steady-state cache warmup. After startup, ClickHouse populates the mark cache and uncompressed cache lazily as queries touch parts. MemoryTracking climbs for hours as caches warm. RSS climbs with it. This is expected behavior. The plateau you reach is the working set, not a leak waiting to happen.
Containerized deployments. The mismatch becomes dangerous when the cgroup memory limit is compared against the wrong metric. If your orchestrator kills the pod based on RSS but your alert threshold is set on MemoryTracking, you will miss the approaching OOM. If you tune max_server_memory_usage based on RSS without understanding how much of it is cache, you may starve legitimate cache space and degrade query performance.
Gap growth with flat MemoryTracking. If the delta between resident and tracked memory grows steadily while MemoryTracking stays flat and cache sizes are stable, suspect untracked allocations or allocator fragmentation. Check for large anonymous mappings or investigate whether third-party libraries are allocating outside the tracked hierarchy.
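One way to turn that rule into a check is to compare the first and last samples of a window. The function name and both thresholds below are hypothetical, and the sketch leaves out the cache-size check, which you would still do separately.

```python
GIB = 2**30


def gap_growing_with_flat_tracking(samples, min_gap_growth, max_tracked_drift):
    """Return True when the gap grew while tracked memory stayed roughly flat.

    samples           -- time-ordered (resident_bytes, tracked_bytes) pairs
    min_gap_growth    -- gap increase, in bytes, that counts as growth
    max_tracked_drift -- tracked change, in bytes, still considered flat
    """
    if len(samples) < 2:
        return False
    (first_res, first_trk), (last_res, last_trk) = samples[0], samples[-1]
    gap_growth = (last_res - last_trk) - (first_res - first_trk)
    tracked_drift = abs(last_trk - first_trk)
    return gap_growth >= min_gap_growth and tracked_drift <= max_tracked_drift


window = [(80 * GIB, 60 * GIB), (85 * GIB, 60 * GIB), (92 * GIB, 61 * GIB)]
suspicious = gap_growing_with_flat_tracking(
    window, min_gap_growth=10 * GIB, max_tracked_drift=2 * GIB
)
print(suspicious)
```

Comparing only the endpoints keeps the sketch short. A real alert would likely smooth over more samples so that a single post-query plateau does not trigger it.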
If you need to test whether RSS is held by tracked caches, you can drop them. This degrades query performance until caches refill, so run only during low traffic or in staging:
-- Degrades performance until caches rebuild. Use with caution.
SYSTEM DROP MARK CACHE;
SYSTEM DROP UNCOMPRESSED CACHE;
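If MemoryTracking falls significantly, that memory was tracked cache. RSS may lag behind the drop because jemalloc keeps the freed pages for reuse. If MemoryTracking barely moves, the resident memory you are chasing is allocator-retained or untracked.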
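The before-and-after comparison can be written down as a small decision helper. This is a sketch under stated assumptions: the function name, the dictionary keys, and the significance threshold are all hypothetical, and the readings are taken by hand before and after dropping the caches.

```python
GIB = 2**30


def interpret_cache_drop(before: dict, after: dict, significant: int) -> str:
    """Classify a cache-drop experiment.

    before/after -- dicts with 'tracked' and 'resident' byte counts
    significant  -- drop, in bytes, that counts as meaningful
    """
    tracked_drop = before["tracked"] - after["tracked"]
    resident_drop = before["resident"] - after["resident"]
    if tracked_drop >= significant:
        if resident_drop >= significant:
            return "tracked cache, pages returned to the OS"
        return "tracked cache, pages still held by the allocator"
    return "not cache: allocator-retained or untracked"


verdict = interpret_cache_drop(
    before={"tracked": 60 * GIB, "resident": 90 * GIB},
    after={"tracked": 40 * GIB, "resident": 89 * GIB},
    significant=5 * GIB,
)
print(verdict)
```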
Common misuses and misreadings
Using MemoryTracking alone for OOM avoidance. Because RSS includes untracked and retained pages, it can exceed MemoryTracking by a significant margin. The OOM killer acts on RSS, not on the internal ledger.
Treating RSS persistence after query completion as a memory leak. jemalloc holds pages for reuse. The memory is available to the process for subsequent allocations. A true leak shows as both MemoryTracking and RSS growing without bound under stable load.
Assuming MemoryTracking and MemoryResident should match. They are designed to diverge. The operational question is whether the divergence is stable or growing without bound.
Panicking over negative MemoryTracking spikes. Rapid deallocations can briefly drive the signed Int64 counter negative before it stabilizes on the next synchronous update. This is cosmetic. It does not indicate corruption and it self-corrects.
Signals to watch in production
| Signal | Why it matters | Warning sign |
|---|---|---|
| MemoryTracking | Internal ledger of all ClickHouse-accounted allocations | Sustained growth without corresponding query or cache load increase |
| MemoryResident | RSS visible to the OOM killer and cgroup enforcer | Approaching physical RAM limit or container memory cap |
| Resident minus Tracking gap | jemalloc fragmentation or untracked allocations | Persistent growth of the gap without corresponding cache growth |
| MarkCacheBytes | Tracked index cache; large on warm nodes is normal | Unexpected shrinkage may indicate memory pressure eviction |
| UncompressedCacheBytes | Tracked decompressed block cache if enabled | Zero when enabled may mean pressure; large size is expected |
| Peak query memory | Per-query attribution in system.processes | Single query using more than 50% of max_server_memory_usage |
| max_server_memory_usage headroom | Server limit, defaulting to 90% of physical RAM | Tracked memory staying above 80% of this limit during peak |
The system.processes table exposes both memory_usage and peak_memory_usage for running queries. A query whose peak approaches the per-query limit may still be well under the server limit, but a query that uses more than half of max_server_memory_usage can starve concurrent workloads. Watch for single-query dominance in the per-process breakdown.
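The single-query dominance check is simple arithmetic once the rows are in hand. The sketch below assumes you have already fetched (query_id, peak_memory_usage) pairs from system.processes and know the server limit in bytes. The helper name and the sample rows are hypothetical.

```python
GIB = 2**30


def dominant_queries(rows, server_limit: int, share: float = 0.5):
    """Return (query_id, fraction_of_limit) for queries above the share.

    rows         -- iterable of (query_id, peak_bytes) pairs
    server_limit -- effective max_server_memory_usage, bytes
    share        -- fraction of the limit that counts as dominance
    """
    flagged = []
    for query_id, peak in rows:
        fraction = peak / server_limit
        if fraction > share:
            flagged.append((query_id, round(fraction, 2)))
    return flagged


rows = [("q-heavy", 60 * GIB), ("q-light", 10 * GIB)]
print(dominant_queries(rows, server_limit=100 * GIB))
```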
Pull the core counters side by side with:
-- Check tracked memory
SELECT value, formatReadableSize(value) AS readable
FROM system.metrics
WHERE metric = 'MemoryTracking';
-- Check resident memory
SELECT metric, value, formatReadableSize(value) AS readable
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'MemoryVirtual');
-- Find recent heavy queries by tracked memory.
-- system.query_log reports per-query memory in the memory_usage column.
SELECT query_id, formatReadableSize(memory_usage) AS peak
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC
LIMIT 10;
Or check the OS view directly:
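# OS-level process memory. The kernel truncates process names to 15 characters,
# so match the full command line with -f. -n picks the newest match; confirm the PID first.
grep -E 'VmRSS|VmSize|VmPeak' "/proc/$(pgrep -n -f clickhouse-server)/status"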
How Netdata helps
Netdata charts MemoryTracking, MemoryResident, and the gap on the same timeline. Correlate RSS steps with per-query peaks from system.processes to distinguish allocator retention from runaway queries. Track cache sizes alongside total memory to distinguish legitimate cache growth from unexpected allocation. Alert on OS RSS approaching physical or cgroup limits independently of internal tracked counters.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool
- ClickHouse monitoring checklist: the signals every production cluster needs
- ClickHouse monitoring maturity model: from survival to expert







