Troubleshooting

Diagnosing Linux cgroups v2 Memory Throttling & OOM-Killed Containers

A deep dive into memory.max, memory.high, and PSI to understand and prevent container out-of-memory events


Your critical service is lagging. Users are complaining about timeouts. You check your orchestration platform and see the dreaded OOMKilled status on a container. You dive into the node’s logs (dmesg) and confirm it: the kernel’s Out-of-Memory (OOM) killer has claimed another victim. The immediate fix is easy—restart the container, maybe give it more memory—but the real question remains unanswered: why did it happen? Was it a sudden memory leak, a traffic spike, or something more subtle?

With modern Linux distributions running on cgroups v2, the answer is often more nuanced than a simple memory limit breach. Cgroups v2 introduced a sophisticated, tiered system for memory management that can cause performance degradation and throttling long before the OOM killer is ever invoked. Understanding this system is the key to moving from reactive firefighting to proactive container memory optimisation.

From cgroups v1 to v2: A New Hierarchy for Memory Management

Control groups (cgroups) are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes. Container platforms like Docker and Kubernetes rely on them to enforce resource limits.

While cgroups v1 had a messy, controller-specific hierarchy, cgroups v2 introduced a single, unified hierarchy that simplifies management. For memory, this change was profound. Instead of a single, blunt memory.limit_in_bytes file, v2 introduced a set of controls that create a Memory Quality of Service (QoS) framework. These controls act as three distinct gates that a container’s memory usage must pass through.
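
Before going further, it is worth confirming that the host is actually on the unified v2 hierarchy. A minimal check, assuming a typical modern distribution where /sys/fs/cgroup is the cgroup mount point:

    # "cgroup2fs" means the unified cgroups v2 hierarchy is in use;
    # "tmpfs" here usually indicates the legacy v1 layout.
    stat -fc %T /sys/fs/cgroup/

    # Alternatively, list the cgroup mounts explicitly.
    mount | grep cgroup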

The Three Gates of cgroups v2 Memory Control

Imagine your container’s memory allocation as water flowing into a reservoir. Cgroups v2 sets up three watermarks that trigger different kernel behaviors.

Gate 1: memory.low - The Best-Effort Protection

This is a soft limit. It’s a “best-effort” protection boundary. When a cgroup’s memory usage is below this line, the kernel considers it protected and will try to reclaim memory from unprotected cgroups first.

  • Behavior: No throttling or killing. The kernel will avoid reclaiming memory from this cgroup unless absolutely necessary.
  • Use Case: Orchestrators use this protection to shield higher-priority workloads. In Kubernetes, when the kubelet's cgroup v2 memory QoS support is enabled, the protection applied to a pod is derived from its memory request, so Guaranteed pods are among the last to be affected during node-wide memory pressure.
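
Outside of an orchestrator you can read and set this gate by hand. The sketch below assumes a hypothetical cgroup path (system.slice/my-app.service); substitute the directory of the container you are inspecting.

    CG=/sys/fs/cgroup/system.slice/my-app.service   # placeholder path

    # 0 means no protection is currently applied.
    cat "$CG/memory.low"

    # Grant best-effort protection roughly matching the expected working set (256 MiB here).
    echo $((256 * 1024 * 1024)) | sudo tee "$CG/memory.low"

In practice systemd (via MemoryLow=) or your orchestrator manages this file for you; writing it by hand is mostly useful for experiments.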

Gate 2: memory.high - The Throttling Gate

This is the most critical control for performance troubleshooting. When a cgroup’s memory usage exceeds memory.high, the kernel starts to aggressively throttle the processes within that cgroup.

  • Behavior: The kernel will try to reclaim memory pages from the cgroup to push its usage back below the memory.high mark. This means that when a process inside the container tries to allocate more memory, its execution is paused while the kernel works to free up space. From the application’s perspective, this manifests as high latency, slowness, or even stalls.
  • Why it Matters: Your application can be performing poorly long before it is ever OOMKilled. If your service is slow but not crashing, it’s very likely hitting the memory.high throttle. This is the “slowness before death” phase that is often invisible without the right monitoring.
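
One quick way to catch this phase is to watch the high counter in memory.events while the workload runs; if it keeps climbing, the kernel is forcing the cgroup into reclaim. The path below is again a placeholder.

    CG=/sys/fs/cgroup/system.slice/my-app.service   # placeholder path

    # How close usage is to the throttling gate.
    cat "$CG/memory.current" "$CG/memory.high"

    # A steadily increasing "high" counter means the application is being throttled.
    watch -n 1 grep high "$CG/memory.events"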

Gate 3: memory.max - The Hard Limit and the OOM Killer

This is the final, hard cap. It’s the point of no return.

  • Behavior: If an allocation would push the cgroup’s total usage over memory.max and the kernel cannot reclaim enough memory to stay under it, the OOM killer is invoked for that cgroup alone. It picks a process within the cgroup to kill to free up memory, which is what directly produces the OOMKilled status.
  • Use Case: This sets the absolute boundary for your container’s memory usage, preventing a single leaky container from crashing the entire host node.
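
To see the hard cap in action, or to confirm an OOM kill after the fact, you can set memory.max directly and check the kernel log; the path and size below are illustrative.

    CG=/sys/fs/cgroup/system.slice/my-app.service   # placeholder path

    # Impose a 512 MiB hard cap (writing the string "max" removes it).
    echo $((512 * 1024 * 1024)) | sudo tee "$CG/memory.max"

    # After an OOMKilled event, the kernel log names the cgroup and the victim.
    sudo dmesg -T | grep -iE "out of memory|killed process" | tail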

Your Troubleshooting Toolkit for Memory Issues

To diagnose these issues, you need to inspect the cgroup files directly on the host node.

Step 1: Find Your Container’s Cgroup

First, you need the path to your container’s cgroup directory. Grab the container ID from a command like docker ps, then resolve its cgroup path: the container’s main process lists its cgroup membership in /proc/<pid>/cgroup, relative to the /sys/fs/cgroup/ mount point, and a tool like systemd-cgls lets you browse the whole hierarchy interactively. The exact path depends on your container runtime and cgroup driver.
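
A reliable route, sketched below, goes through the container’s main PID; the container name is a placeholder and the resulting path will vary with your runtime and cgroup driver.

    # Main PID of the container (name/ID is a placeholder).
    PID=$(docker inspect --format '{{.State.Pid}}' my-container)

    # The process's cgroup membership, relative to /sys/fs/cgroup.
    cat /proc/$PID/cgroup
    # e.g. 0::/system.slice/docker-<id>.scope

    # Combine the two to reach the memory control files.
    cd /sys/fs/cgroup$(cut -d: -f3 /proc/$PID/cgroup)
    ls memory.*

    # Or browse the whole hierarchy interactively.
    systemd-cgls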

Step 2: Read the Signs in the memory.* Files

Once you’re in the cgroup directory, you can read the diagnostic files.

  • memory.current: Shows the current total memory usage. Compare this value to the limits set in memory.high and memory.max to see where you stand.
  • memory.stat: Provides detailed memory statistics. Key fields include:
    • anon: Anonymous memory (not backed by a file, like heap allocations).
    • file: Page cache (memory used to cache file data from disk). High file usage isn’t always bad, but it contributes to the total memory count.
    • slab: Kernel slab allocations on behalf of the cgroup.
  • memory.events: This is a crucial file for diagnostics. It contains counters for key events.
    • high: The number of times the cgroup’s usage exceeded the memory.high limit and the kernel forced it into reclaim. If this number is climbing, your application is being throttled.
    • max: The number of times an allocation ran into the memory.max limit and the kernel had to intervene. This does not always mean an OOM kill occurred; reclaim may have succeeded.
    • oom: The number of times the OOM killer was invoked for the cgroup because reclaim could not free enough memory.
    • oom_kill: The number of processes OOM-killed within the cgroup.
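
Putting those files together, a quick health check of a cgroup might look like the sketch below (the path is a placeholder).

    CG=/sys/fs/cgroup/system.slice/my-app.service   # placeholder path

    # Where usage sits relative to the two upper gates.
    for f in memory.current memory.high memory.max; do
        printf '%-16s %s\n' "$f" "$(cat $CG/$f)"
    done

    # Break usage down into anonymous memory, page cache and kernel slab.
    grep -E '^(anon|file|slab) ' "$CG/memory.stat"

    # Event counters: a rising "high" means throttling; "oom_kill" counts dead processes.
    cat "$CG/memory.events"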

Step 3: Use PSI (Pressure Stall Information) for Proactive Insight

PSI is a modern kernel feature that provides a much clearer view of resource contention. Instead of just showing event counts, it shows the percentage of wall-clock time that processes were stalled waiting for a resource.

For memory, check the memory.pressure file inside your cgroup directory. Inside this file, you will find metrics for some and full pressure, which provide averages over 10, 60, and 300 seconds, along with a running total.

  • some: This value represents the percentage of time where at least one task was stalled waiting for memory. A non-zero some value is a direct, quantitative measure of memory throttling.
  • full: This value represents the percentage of time where all tasks were stalled simultaneously. A non-zero full value indicates severe memory pressure and is a strong predictor of an imminent OOM kill.
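
Reading the file is straightforward; the output format shown below is the standard PSI layout, where the avg fields are percentages of time stalled and total is cumulative stall time in microseconds. The cgroup path is a placeholder, and /proc/pressure/memory gives the same view for the node as a whole.

    CG=/sys/fs/cgroup/system.slice/my-app.service   # placeholder path

    cat "$CG/memory.pressure"
    # Typical output:
    #   some avg10=1.23 avg60=0.87 avg300=0.30 total=123456
    #   full avg10=0.15 avg60=0.05 avg300=0.01 total=9876

    # Node-wide memory pressure, for comparison.
    cat /proc/pressure/memory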

The Kubernetes Context: QoS and Pod Eviction

Kubernetes automates the management of these cgroup controls based on the Quality of Service (QoS) class of your pods, which is determined by the requests and limits you set.

  • Guaranteed: requests == limits. memory.max is set to the limit, and with the kubelet’s cgroup v2 memory QoS support the request also becomes the cgroup’s protection value, so these pods are highly protected.
  • Burstable: requests < limits. memory.max is set to the limit, protection (when enabled) is derived from the request, and memory.high can be set below the limit so throttling kicks in before the hard cap. These pods can use more memory than they requested but become candidates for throttling or killing if they exceed their request while the node is under pressure.
  • BestEffort: No requests or limits. These pods receive no memory protection, have the lowest priority, and are the first to be killed during node-wide pressure.
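
You can confirm which class a pod ended up in straight from the API; the pod name and namespace below are placeholders.

    # Kubernetes records the computed QoS class in the pod status.
    kubectl get pod my-pod -n my-namespace -o jsonpath='{.status.qosClass}'

    # Or list it for every pod in the cluster.
    kubectl get pods -A -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass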

The kubelet monitors memory signals of its own on top of these cgroup controls (and newer releases are beginning to surface PSI data as well). When the node runs short of memory, it can proactively evict a pod to preserve node stability, even before the kernel OOM killer is invoked, ranking candidates by QoS class and usage: usually a BestEffort pod, or a Burstable pod exceeding its request, goes first.
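
Kubelet evictions look different from kernel OOM kills: the pod is marked Failed with reason Evicted rather than showing an OOMKilled container state. A quick way to spot them:

    # Pods the kubelet has evicted show up as Failed.
    kubectl get pods -A --field-selector=status.phase=Failed

    # The matching events explain which resource was under pressure.
    kubectl get events -A --field-selector=reason=Evicted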

By moving beyond simple OOM kill logs and embracing the rich diagnostic data provided by cgroups v2 and PSI, you can gain a deep understanding of your application’s memory behavior. Monitoring for memory.high events and non-zero PSI some values allows you to detect performance issues and potential memory leaks long before they result in a critical outage.

To make this advanced troubleshooting truly effective, you need a monitoring solution that can collect and visualize these metrics in real-time. Netdata provides per-second visibility into every cgroup v2 control file and PSI metric, automatically, turning complex kernel data into actionable dashboards. Get started with Netdata for free to stop guessing and start diagnosing your container memory issues with precision.