Device memory pressure: control-plane memory pools and leaks
A network device alerts at 85% control-plane memory utilization. Or worse: it does not alert, and you discover the problem when BGP sessions start dropping from hold-time expiry, SNMP stops responding, or the device reboots itself. Control-plane memory exhaustion degrades every process on the route processor: routing protocols, CLI, SNMP, management interfaces, logging.
“High memory utilization” on a network device is ambiguous. Some platforms cache aggressively and report 97% used under normal operation. Some pools are expected to sit near zero free. A genuine memory leak may take weeks to manifest, making it easy to dismiss the trend until the device crashes at 3 a.m.
What this means
Control-plane memory on a network device is partitioned into pools. On Cisco IOS-XE, the primary pools are Processor memory (main route-processor DRAM) and I/O memory (packet buffers and DMA). Some platforms add specialized pools: the LSMPI (Low-Speed Message Packet Interface) pool on ASR 1000, or QFP external memory (exmem) on the ESP. Junos reports kernel memory and per-process allocations separately. Arista EOS runs agents as Linux processes, so memory reporting resembles a standard Linux system with per-agent RSS values.
Memory pressure becomes a production problem when one of these conditions is true:
- Free memory is approaching zero, risking an OOM crash or process termination.
- A specific pool is exhausted even if aggregate memory looks healthy. Monitor each pool separately, not just total.
- A memory leak is accumulating: the rate of increase exceeds 1% per minute sustained, or the trend is monotonically upward over hours or days without a corresponding workload change.
The failure cascade is predictable. As free memory shrinks, the device fails allocation requests for new control-plane work: BGP UPDATE processing, SNMP responses, route installation into the FIB. Processes compete for shrinking headroom. SNMP becomes unresponsive, BGP hold-timers expire, and eventually the device crashes or reboots to reclaim memory.
flowchart TD
A["Memory utilization rising"] --> B{"Rate > 1%/min?"}
B -->|Yes| C["Active leak or event-driven spike"]
B -->|No| D{"Platform false alarm?"}
D -->|Yes| E["Expected: tune monitoring"]
D -->|No| F["Sustained pressure"]
C --> G{"Tracks BGP RIB growth?"}
G -->|Yes| H["Route-table driven"]
G -->|No| I["Process leak: check per-PID"]
F --> H
F --> I
H --> J["Filter routes or upgrade hardware"]
I --> K["Identify PID, apply fix or workaround"]
E --> L["Exclude pool from alerting"]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Memory leak (software bug) | Monotonic upward trend in used memory, no workload change. Device may self-reboot to reclaim. | Per-process memory breakdown: which PID is growing in the Holding column? |
| BGP RIB / FIB growth | Memory rises alongside prefix count increase. Triggered by upstream route leak or organic table growth. | Per-peer prefix count and total RIB size. |
| Insufficient hardware | Memory consistently above 80% since deployment. No leak trend, just undersized. | Compare total DRAM to recommended sizing for feature set. |
| Platform false alarm | High utilization reported but device is healthy. Common on Arista EOS (cache accounting) and ASR 1000 LSMPI pool. | Check platform docs for expected memory behavior. |
| Control-plane traffic saturation | Memory rises with CPU spike. Excessive punted traffic (rogue UPnP, multicast flood) fills input queues. | CoPP drop counters and control-plane queue stats. |
| Known CVE or software defect | Memory exhaustion traceable to specific protocol. IKEv2, RPD, or other protocol-specific leaks. | Check vendor advisories against device software version. |
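The triage flowchart above can be sketched as a small decision function. This is a minimal illustration, not a vendor tool: the `Observation` fields and the returned strings are hypothetical names, and the tie-break on the "sustained pressure" branch (choosing by BGP RIB correlation) is an assumption, since the flowchart leads to both outcomes from that node.

```python
from dataclasses import dataclass

LEAK_RATE_PCT_PER_MIN = 1.0  # rate threshold taken from the text above

ROUTE_TABLE = "route-table driven: filter routes or upgrade hardware"
PROCESS_LEAK = "process leak: identify PID, apply fix or workaround"
FALSE_ALARM = "expected: tune monitoring, exclude pool from alerting"


@dataclass
class Observation:
    rate_pct_per_min: float      # sustained rate of increase in used memory
    platform_false_alarm: bool   # e.g. EOS cache accounting, ASR 1000 LSMPI pool
    tracks_bgp_rib_growth: bool  # memory rises alongside prefix count


def triage(obs: Observation) -> str:
    """Walk the flowchart: rate first, then false alarm, then RIB correlation."""
    if obs.rate_pct_per_min <= LEAK_RATE_PCT_PER_MIN and obs.platform_false_alarm:
        return FALSE_ALARM
    # Both the fast-rise branch and the sustained-pressure branch end in the
    # same two outcomes; RIB correlation picks between them (assumption for
    # the sustained-pressure case).
    return ROUTE_TABLE if obs.tracks_bgp_rib_growth else PROCESS_LEAK
```

For example, `triage(Observation(0.01, True, False))` returns the false-alarm outcome, while a fast rise that does not track prefix count returns the process-leak outcome.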
Quick checks
These commands are read-only and safe to run on production devices.
# Cisco: per-pool memory summary (Processor, I/O, etc.)
ssh <device> 'show memory statistics'
# Cisco IOS-XE: per-slot control-processor memory status
ssh <device> 'show platform software status control-processor brief'
# Cisco: per-process memory sorted by Holding column
ssh <device> 'show processes memory sorted'
# Cisco IOS-XE: per-slot Linux-style top, sorted by memory (press Shift+M)
ssh <device> 'show platform software process <slot> monitor'
# Juniper: system memory summary
ssh <device> 'show system memory'
# Juniper: per-process memory alarm states
ssh <device> 'show system monitor memory status process'
# Arista: top processes by memory with RSS percentages
ssh <device> 'show processes top memory'
# SNMP: Cisco memory pool used (all pools)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.48.1.1.1.5
# SNMP: Cisco memory pool free (all pools); column .6 is ciscoMemoryPoolFree, .7 is ciscoMemoryPoolLargestFree
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.48.1.1.1.6
# SNMP: hrStorageUsed column of the generic storage table (HOST-RESOURCES-MIB)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.25.2.3.1.6
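The two Cisco pool walks can be combined into a per-pool utilization figure. The sketch below assumes the walk output has the usual net-snmp shape `<oid>.<index> = <type>: <integer>` and that the last OID component is the pool index; the function names `parse_walk` and `pool_utilization` are illustrative, not part of any library.

```python
import re

# Matches net-snmp output lines such as:
#   .1.3.6.1.4.1.9.9.48.1.1.1.5.1 = Gauge32: 123456789
LINE_RE = re.compile(r"^(\S+)\.(\d+)\s*=\s*\w+:\s*(\d+)\s*$")


def parse_walk(text):
    """Return {pool_index: value} from snmpwalk text; unparseable lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[int(match.group(2))] = int(match.group(3))
    return values


def pool_utilization(used, free):
    """Return {pool_index: percent used} for pools present in both walks."""
    result = {}
    for index, used_bytes in used.items():
        if index not in free:
            continue
        total = used_bytes + free[index]
        if total > 0:
            result[index] = 100.0 * used_bytes / total
    return result
```

Feed the text of the used walk and the free walk to `parse_walk`, then pass both dictionaries to `pool_utilization`. Pool names can be joined in the same way from a walk of the pool-name column if you need labels rather than indexes.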
How to diagnose it
1. Establish whether the pressure is real. Check platform-specific behavior. On Arista EOS, 97%+ reported memory is normal because available RAM is used for caching. EOS 4.22.0F+ exposes `hrStorageTable[100]` via SNMP, which reports memory in use excluding reclaimable buffers. Pre-4.22.0F formulas count cache as used, inflating utilization. On ASR 1000, the `lsmpi_io` pool normally shows near-zero free memory; this is expected. Disable monitoring of this pool to avoid spurious alerts.
2. Identify which pool is under pressure. Walk all memory pools, not just the aggregate. On Cisco, CISCO-MEMORY-POOL-MIB exposes `ciscoMemoryPoolUsed` and `ciscoMemoryPoolFree` per pool at `.1.3.6.1.4.1.9.9.48.1.1.1.5` and `.1.3.6.1.4.1.9.9.48.1.1.1.6`. On IOS-XE 17.x, additional platform memory pools exist that are not covered by the older CISCO-MEMORY-POOL-MIB walks. Use `show platform software status control-processor brief` to see per-slot (RP, ESP, SIP) memory status with Healthy, Warning, and Critical labels.
3. Check the rate of change. A single snapshot of 85% utilization is not actionable without trend data. Compare current utilization to 1 hour, 24 hours, and 7 days ago. A sustained rate of increase above 1% per minute indicates an active leak. A monotonic upward trend over days without a workload change is also a leak, just slower.
4. Identify the offending process. On Cisco, run `show processes memory sorted` and examine the “Holding” column. This is the key indicator for leaks: it accumulates even if the “Freed” column is nonzero. The PID with the highest or fastest-growing Holding value is the suspect. For deeper analysis, `show memory allocation-process totals` identifies the program counter (PC) responsible for large allocations. On Juniper, `show system monitor memory status process` surfaces per-process alarm states (minor, major, critical heap thresholds).
5. Correlate with routing state. If memory pressure tracks BGP prefix count growth, the cause is route-table expansion, not a software leak. Check `show ip bgp summary` and `show ip route summary` for prefix-count trends. Compare against the expected Internet table size: approximately 940k IPv4 and 190k IPv6 prefixes in 2026.
6. Check for known software defects. Cross-reference the device software version against vendor advisories. Recent examples include CVE-2025-20239 (Cisco IKEv2 memory leak on ASA and FTD appliances, causing partial memory exhaustion that prevents new VPN tunnel establishment without crashing the device) and CVE-2025-52986 (Juniper RPD memory leak triggered by `show` commands when RIB-sharding is configured).
7. Review control-plane policing. Excessive punted traffic can saturate control-plane input queues and memory. Check CoPP drop counters for unexpected traffic patterns such as rogue UPnP or misconfigured multicast.
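The rate-of-change and per-process checks described above reduce to a few lines of arithmetic. The following is a hedged sketch under stated assumptions: samples are `(unix_seconds, used_percent)` pairs you have already collected, the Holding values come from two snapshots you parsed yourself, and all function names are hypothetical. The code cannot know whether workload changed, so a "possible slow leak" result still needs that judgment.

```python
LEAK_RATE_PCT_PER_MIN = 1.0  # sustained-rate threshold from the text


def rate_pct_per_min(samples):
    """samples: list of (unix_seconds, used_pct), oldest first."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    minutes = (t1 - t0) / 60.0
    if minutes <= 0:
        raise ValueError("need two samples spanning a positive interval")
    return (u1 - u0) / minutes


def is_monotonic_rise(samples):
    """True if utilization never decreases and ends higher than it started."""
    values = [used for _, used in samples]
    never_drops = all(b >= a for a, b in zip(values, values[1:]))
    return never_drops and values[-1] > values[0]


def classify_trend(samples):
    if rate_pct_per_min(samples) > LEAK_RATE_PCT_PER_MIN:
        return "active leak or event-driven spike"
    if is_monotonic_rise(samples):
        return "possible slow leak"
    return "no leak trend"


def fastest_growing(before, after):
    """before/after: {pid: holding_bytes}. Returns (pid, growth) or None."""
    growth = {pid: after[pid] - before[pid] for pid in after if pid in before}
    if not growth:
        return None
    pid = max(growth, key=growth.get)
    return pid, growth[pid]
```

Run `classify_trend` over the 1-hour, 24-hour, and 7-day windows separately; a window that is flat at one scale can still be monotonic at another.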
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| `ciscoMemoryPoolUsed` per pool | Per-pool utilization catches exhaustion invisible in aggregate | Any pool above 80% sustained |
| `ciscoMemoryPoolFree` per pool | Free memory approaching zero is the OOM cliff | Free below 5% on a critical device |
| `hrStorageUsed` / `hrStorageSize` | Generic, cross-platform memory gauge | Platform-dependent; calibrate per vendor |
| Rate of memory increase | Distinguishes leak from steady-state high utilization | Above 1% per minute sustained |
| Control-plane CPU | Memory pressure often correlates with CPU pressure | Above 70% sustained alongside memory pressure |
| BGP RIB size | Route-table growth is a primary memory consumer | Sudden prefix-count change above 20% |
| Lowest-ever free memory | Indicates sustained pressure even if current value recovered | Value significantly below current free |
| CoPP drop counters | Control-plane traffic saturation causes memory pressure | Rising drops correlate with memory increase |
Fixes
Memory leak from a software bug
If a specific process is leaking (Holding column growing monotonically), check the vendor bug tracker for a known defect. If a fixed version exists, upgrade during a maintenance window. Tradeoff: upgrades carry regression risk and require downtime.
If no fix is available yet:
- On Cisco 9800 WLC, the device may self-reboot to reclaim leaked memory, but the leak resumes immediately post-boot if the triggering feature is still active. The workaround requires disabling the offending feature or upgrading. The syslog signature is `%PLATFORM-4-ELEMENT_WARNING: ... Used Memory value 91% exceeds warning level 88%`.
- On Juniper MX Series with SPC3 line cards, kernel memory exhaustion under highly scaled routing tables with Inline Active Flow Monitoring was resolved in Junos 25.2R2.
BGP RIB or FIB growth
If memory tracks prefix-count growth, the options are:
- Add prefix-list filters to reject unwanted or overly specific routes from peers. Tradeoff: reduces routing visibility.
- Configure maximum-prefix limits on BGP sessions to cap route reception. Tradeoff: may trigger session reset if peer exceeds the limit.
- Upgrade hardware TCAM and DRAM. Tradeoff: cost and downtime. The Internet table grows approximately 5-10% per year; plan headroom accordingly.
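To turn the 5-10% annual growth figure into a planning number, compound growth gives the time until a table of a given size reaches a capacity limit. This is a back-of-the-envelope sketch: the function name is hypothetical, and the capacity you pass in is whatever your hardware documentation states, not a value taken from this article.

```python
import math


def years_until_full(current_prefixes, capacity_prefixes, annual_growth):
    """Years until current * (1 + g)^t reaches capacity, assuming constant growth g.

    annual_growth is a fraction, e.g. 0.05 for 5% per year.
    Returns 0.0 if the table already meets or exceeds capacity.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    if current_prefixes >= capacity_prefixes:
        return 0.0
    return math.log(capacity_prefixes / current_prefixes) / math.log(1.0 + annual_growth)
```

Evaluating both ends of the range, for example `years_until_full(940_000, capacity, 0.05)` and the same call with `0.10`, brackets the expected runway; a route leak can consume that headroom in minutes, so maximum-prefix limits still matter.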
Platform-specific false alarms
Disable or recalibrate monitoring for pools with known benign high utilization:
- Arista EOS pre-4.22.0F: adjust thresholds upward or upgrade to use the corrected `hrStorageTable[100]` formula.
- ASR 1000 LSMPI pool: exclude from alerting entirely.
- ASR 1000 QFP exmem: monitor via `show platform hardware qfp active infrastructure exmem statistics` for DRAM and IRAM usage separately from host DRAM. IRAM depletion generates `%QFPOOR-4-LOWRSRC_PERCENT`.
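Excluding a known-benign pool while still alerting on the others might look like the sketch below. It assumes you already have per-pool used and free byte counts keyed by pool name; the `lsmpi_io` key, the function name, and the 5% free floor (taken from the warning sign used elsewhere in this article) are illustrative choices to adapt per platform.

```python
EXCLUDED_POOLS = {"lsmpi_io"}  # pools expected to sit near zero free
FREE_FLOOR_PCT = 5.0           # alert when free memory drops below this


def pools_to_alert(pools, excluded=EXCLUDED_POOLS, free_floor_pct=FREE_FLOOR_PCT):
    """pools: {name: (used_bytes, free_bytes)}. Returns sorted names needing an alert."""
    alerts = []
    for name, (used, free) in pools.items():
        if name in excluded:
            continue  # known false-alarm pool, skip entirely
        total = used + free
        if total > 0 and 100.0 * free / total < free_floor_pct:
            alerts.append(name)
    return sorted(alerts)
```

Keeping the exclusion list explicit and per platform makes the decision auditable, instead of silently raising a global threshold that would also mask a real Processor pool problem.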
Insufficient hardware
If memory has been consistently high since deployment with no leak trend, the device is undersized for its workload. Reduce the feature set (fewer VRFs, fewer BGP peers, smaller route tables) or upgrade the hardware.
Prevention
- Monitor every pool separately. A single exhausted pool can crash a device while aggregate utilization looks fine.
- Track rate of change, not just absolute values. Alert on rate above 1%/min as a leak indicator. A static 85% is different from a rising 85%.
- Calibrate thresholds per platform. Arista and ASR 1000 have known false-alarm patterns. Document expected behavior per platform type in your monitoring configuration.
- Track BGP RIB size against hardware TCAM limits. FIB exhaustion is a cliff: once TCAM is full, new routes are not installed and forwarding breaks.
- Apply vendor security advisories promptly. Protocol-specific memory leaks (IKEv2, RPD) have published CVEs. Track software versions against advisory databases.
- Watch the lowest-ever free memory field. In `show memory statistics`, the “Lowest” column records the low-water mark of free memory. A value far below current free shows that the device has operated near its limit, even if it has since recovered.
How Netdata helps
- Collects SNMP-based memory pool metrics (Cisco `ciscoMemoryPoolUsed` / `ciscoMemoryPoolFree` and generic `hrStorageUsed` / `hrStorageSize`) per pool, not just aggregate, so you can see which pool is under pressure.
- Rate-of-change alerts on memory utilization can catch leaks early. Configure alerts on the derivative of memory used to detect a leak trend before absolute thresholds are crossed.
- Correlate memory pressure with control-plane CPU, BGP session state, and SNMP timeout rate in a single timeline. If BGP sessions drop and SNMP becomes unresponsive at the same moment memory crosses a threshold, the memory pressure is the root cause.
- Per-device baselines account for platform-specific behavior: a device that normally runs at 90% memory will not generate noise, while a device that normally runs at 60% and suddenly hits 85% will.
- Memory pool labels preserve pool identity, so you can alert on “Processor pool free below 5%” without false positives from the LSMPI pool or cache-inflated totals.







