Device control-plane CPU saturation: when SNMP polling causes the spike

The router’s control-plane CPU is pinned at 95%. SNMP polls are timing out. BGP sessions are approaching hold-time expiry. The instinct is to blame the device or suspect an attack, but the monitoring system itself is frequently the source of the load.

The control-plane CPU handles everything that is not hardware-forwarded packet switching: the SNMP agent, BGP, OSPF, STP, the CLI, syslog, AAA, and management interfaces. When SNMP polling saturates this CPU, every control-plane function degrades at once. The symptoms look like a device problem. The cause is often on the collector side.

A common reflex is to add more monitoring to understand the spike, which compounds the load. Another is to reboot the router, which triggers a cold-start convergence storm while leaving the root cause untouched because the poller resumes its aggressive schedule immediately after.

What this means

SNMP processing on most network devices runs on the control-plane CPU. Two factors drive CPU consumption: the packet rate arriving at UDP 161 (each packet triggers an interrupt and a context switch into the SNMP process), and the per-query processing cost (walking a large table requires CPU to build each response). A burst of lightweight GETs saturates via packet volume. A single bulk walk saturates via per-query cost. Both produce the same outcome.

When the control-plane CPU saturates, the SNMP agent is typically starved first. Polls time out. The collector retries, sending more packets into an already saturated CPU. Other control-plane processes degrade in parallel: BGP hold timers may expire, routing protocol hellos can be missed, and the management CLI becomes sluggish or unresponsive.

This is a self-reinforcing failure. The scheduler oversubscribes devices or holds workers on slow MIB walks, retries compound, device CPU spikes, more devices time out, and the poller queue grows without bound. Devices that are perfectly healthy appear down because the monitoring system cannot complete a poll cycle.

flowchart TD
    A[Aggressive poll schedule or overlapping NMS tools] --> B[High SNMP packet rate to device control plane]
    B --> C[SNMP agent processes packets at interrupt level]
    C --> D[Control-plane CPU saturates above 90 percent]
    D --> E[SNMP responses slow or time out entirely]
    D --> F[BGP hold-time expiry risk]
    D --> G[Routing protocol and CLI degradation]
    E --> H[Collector retries, adding more load]
    H --> B
    F --> I[BGP sessions reset, route instability]

The critical insight: retries do not reduce load. They increase it. Breaking the loop requires reducing the polling rate, not increasing collector capacity to keep up.
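
To make the amplification concrete, here is a minimal sketch of the arithmetic. It assumes a worst case in which every timed-out request is re-sent the full number of times and each retry also reaches the device; `offered_packet_rate` and its parameters are illustrative names, not part of any collector.

```python
def offered_packet_rate(base_pps: float, timeout_fraction: float, retries: int) -> float:
    """Packets per second arriving at the device once retries are counted.

    base_pps:         request rate the schedule intends to send
    timeout_fraction: share of requests that time out (0.0 to 1.0)
    retries:          retries configured per request
    Assumption: every timed-out request uses all of its retries.
    """
    return base_pps * (1 + timeout_fraction * retries)


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.3, 0.5):
        rate = offered_packet_rate(1000, fraction, retries=2)
        print(f"timeout fraction {fraction:.0%}: about {rate:.0f} pps offered")
```

Under these assumptions, a schedule that intends 1,000 requests per second delivers about 1,600 when 30% of polls time out with 2 retries configured. The extra load arrives exactly when the device can least absorb it.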

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Poller concurrency too high | Many devices show simultaneous SNMP latency spikes and timeouts | Poll cycle duration vs configured interval |
| Aggressive large MIB walks | CPU spikes at regular intervals matching a walk schedule | Which OIDs have the highest request counts |
| Multiple overlapping NMS tools | Unexplained CPU load, no single poller accounts for the volume | SNMP packet sources via capture or device ACL logs |
| Long-tail slow device | One device stalls worker threads, cascading delays to others | Per-device SNMP latency histogram |
| Expensive routing-table OIDs | CPU spikes when ipRouteTable or ipNetToMediaTable is walked | Whether CEF is enabled on the device |
| Undersized collector capacity | Poll cycle drifts beyond configured interval for all devices | Collector CPU and worker thread utilization |

Walking the entire FDB table on a data center switch with 50,000 MAC entries can take seconds and spike device CPU. The same applies to large routing tables, CDP neighbor caches, and ARP tables on devices that lack hardware-accelerated MIB retrieval.
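
A rough sketch of the request volume behind a walk like that. It assumes all three columns of dot1dTpFdbTable (address, port, status) are walked, that a GETNEXT-based walk fetches one variable per request, and it uses a max-repetitions of 10 for the GETBULK case purely as an illustrative value. The function name is hypothetical, and the final end-of-table request is ignored.

```python
import math


def walk_requests(rows: int, columns: int, max_repetitions: int = 1) -> int:
    """Approximate number of requests needed to walk a table.

    max_repetitions=1 models a GETNEXT walk (one variable per request).
    Larger values model GETBULK, which returns several variables per request.
    """
    variables = rows * columns
    return math.ceil(variables / max_repetitions)


if __name__ == "__main__":
    rows, columns = 50_000, 3
    print("GETNEXT requests:", walk_requests(rows, columns))
    print("GETBULK requests:", walk_requests(rows, columns, max_repetitions=10))
```

GETBULK cuts the packet count by roughly the repetition factor, but the agent still has to build every row, so the per-query processing cost remains.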

Quick checks

These are read-only and safe to run during an active incident.

# Check Cisco control-plane CPU via SNMP (cpmCPUTotal5secRev)
# snmpwalk enumerates all CPU entities on modular devices
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.6

# Check Juniper control-plane CPU via SNMP (jnxOperatingCPU)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.8

# Check CPU via CLI (Cisco)
ssh <device> 'show processes cpu sorted | include five sec'

# Measure SNMP response latency and timeout behavior
time snmpget -v2c -c <community> -t 2 -r 2 <device> .1.3.6.1.2.1.1.3.0

# Verify device is reachable via ICMP (is the network path fine?)
ping -c 5 -i 0.2 <device>

# Check BGP session health (is CPU causing hold-time expiry?)
ssh <router> 'show ip bgp summary'

# Check collector poll cycle duration vs configured interval
curl -s http://localhost:<port>/metrics | grep -E 'poll|collect'

# Check per-core collector CPU (is receive-side scaling, RSS, concentrating packet work on one core?)
mpstat -P ALL 1 5

The key diagnostic question is whether ICMP succeeds while SNMP times out. If ping works but SNMP does not, the device is up and the SNMP agent is starved. That points to control-plane CPU saturation rather than a network partition.
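
The same decision can be written down as a small triage function, which is handy if you want a script to label devices during an incident. This is a sketch: the function name and the label strings are made up for illustration, and the two booleans are assumed to come from the ping and snmpget checks above.

```python
def triage(icmp_ok: bool, snmp_ok: bool) -> str:
    """Classify a device from the results of one ICMP check and one SNMP check."""
    if icmp_ok and snmp_ok:
        return "healthy"
    if icmp_ok and not snmp_ok:
        # Device answers ping but the SNMP agent does not respond:
        # suspect control-plane CPU saturation rather than a network partition.
        return "snmp-agent-starved"
    if not icmp_ok and not snmp_ok:
        # Nothing answers: network path problem or the device is down.
        return "unreachable"
    # SNMP answers but ping does not: ICMP is probably filtered or policed.
    return "icmp-filtered"


if __name__ == "__main__":
    print(triage(icmp_ok=True, snmp_ok=False))
```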

How to diagnose it

  1. Confirm the CPU spike is control-plane, not data-plane. Poll cpmCPUTotal5secRev (Cisco) or jnxOperatingCPU (Juniper). On modular devices with multiple CPUs, check each entity separately. Some operations, such as NetFlow export, consume data-plane CPU, not control-plane CPU.

  2. Correlate CPU spikes with polling windows. If CPU spikes at regular intervals matching your poll schedule, the polling is the cause. If spikes are irregular or event-driven (BGP reconvergence, STP recalculation), investigate those events first.

  3. Identify which OIDs generate the most requests. On Cisco devices, show snmp stats oid lists every polled OID with its request count and timestamps. OIDs with request counts orders of magnitude higher than others are the primary load source. Common offenders include ifTable walks on devices with thousands of interfaces, ipRouteTable and ipNetToMediaTable on routers with large routing or ARP tables, and cdpCacheTable on switches with many neighbors.

  4. Determine whether multiple tools are polling the same device. A recurring root cause is two or more monitoring systems independently polling the same target. Each adds its own request load. Check SNMP packet sources via a SPAN or mirror port capture or device ACL logs. Consolidate to a single polling source where possible.

  5. Check whether the issue is collector-side or device-side. If many devices are affected simultaneously, the problem is most likely on the collector. If only one device is affected, the problem is most likely on that device or in how it is polled. If the issue resolves when you pause polling, polling load is confirmed as the cause.

  6. Check CEF status on Cisco routers. When Cisco Express Forwarding is disabled, SNMP queries against routing tables hit the RIB. The RIB must be sorted into lexicographic order before each response can be built, which is CPU-intensive on devices with large routing tables. Enabling CEF eliminates this step because the FIB is stored in native lexicographic order.

  7. Verify BGP sessions are not dropping. Check show ip bgp summary for sessions in non-Established states. Hold-time expiry from CPU saturation drops BGP sessions, routes are withdrawn, and traffic shifts mid-incident.
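
Step 2 can be sketched as a simple alignment check. The code assumes you already have the timestamps of CPU spikes (in seconds) and know the poll interval; the function name, the window length, and the idea of treating "within a few seconds of a poll start" as a match are all illustrative choices.

```python
def fraction_in_poll_windows(spike_times, poll_interval_s, window_s, offset_s=0):
    """Share of CPU spikes that fall within window_s seconds after a poll start.

    spike_times:     timestamps of observed CPU spikes, in seconds
    poll_interval_s: configured polling interval
    window_s:        how long after a poll start still counts as "during the poll"
    offset_s:        timestamp of any one known poll start
    """
    if not spike_times:
        return 0.0
    hits = sum(
        1 for t in spike_times
        if (t - offset_s) % poll_interval_s <= window_s
    )
    return hits / len(spike_times)


if __name__ == "__main__":
    spikes = [0, 61, 122, 200]
    print(fraction_in_poll_windows(spikes, poll_interval_s=60, window_s=5))
```

A value close to 1.0 suggests the spikes track the poll schedule. A low value points toward event-driven causes such as reconvergence.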

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Device control-plane CPU (5-sec average) | Earliest indicator of saturation; 5-min average hides bursts | Sustained above 70%, or spike above 90% during poll windows |
| SNMP timeout rate | Rising timeouts indicate the agent cannot keep up | Rate above 5% sustained, or above 30% (data unreliable) |
| SNMP poll response latency | Latency increase precedes timeouts | p99 above 1 second on a normally fast device |
| Collector poll cycle duration | Cycle exceeding configured interval means data is going stale | Duration above 1.5 times the configured interval |
| BGP session state | Hold-time expiry from CPU starvation causes route churn | Unexpected transitions out of Established |
| CoPP drop counters | Control-plane policer drops are the leading indicator before CPU saturates | Any nonzero SNMP-class drops |
| Per-device SNMP latency histogram | Identifies the long-tail slow device dragging the schedule | One device with p99 far above its peers |

Control-plane CPU above 90% sustained for more than 1 minute warrants a page. Above 70% sustained for 5 minutes is a ticket. The 5-second average is more actionable than the 5-minute average for detecting polling-induced bursts, because a spike from a single MIB walk may last only seconds.
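
Those two thresholds can be expressed as a small evaluation over recent 5-second samples. This is a sketch under stated assumptions: samples are percentages ordered oldest to newest, "sustained" means every sample in the trailing window exceeds the threshold, and the function names are hypothetical.

```python
def sustained_above(samples, threshold, duration_s, sample_period_s=5):
    """True if every sample in the trailing duration_s seconds exceeds threshold."""
    needed = duration_s // sample_period_s
    if needed <= 0 or len(samples) < needed:
        return False
    return all(s > threshold for s in samples[-needed:])


def severity(samples, sample_period_s=5):
    """Map recent control-plane CPU samples to 'page', 'ticket', or 'ok'."""
    if sustained_above(samples, 90, 60, sample_period_s):
        return "page"
    if sustained_above(samples, 70, 300, sample_period_s):
        return "ticket"
    return "ok"


if __name__ == "__main__":
    print(severity([95] * 12))   # one minute above 90%
    print(severity([75] * 60))   # five minutes above 70%
```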

Fixes

Reduce poller concurrency and increase per-poll timeouts

If the scheduler has too many parallel workers hammering devices, reduce concurrency. Increase the per-poll timeout so that slow responses do not immediately trigger retries. The tradeoff is longer poll cycles, but stale data is better than a device that drops BGP sessions because the monitoring system overwhelmed its CPU.
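
A back-of-the-envelope sketch of that tradeoff. It assumes fixed per-attempt timeouts with no backoff, evenly distributed work across workers, and a uniform per-device poll time; both function names are illustrative.

```python
import math


def worst_case_hold_s(timeout_s: float, retries: int) -> float:
    """Longest time one unresponsive device can hold a worker.

    One initial attempt plus `retries` further attempts, each waiting timeout_s.
    """
    return timeout_s * (retries + 1)


def estimated_cycle_s(n_devices: int, avg_poll_s: float, workers: int) -> float:
    """Rough poll cycle duration when devices are spread evenly across workers."""
    return math.ceil(n_devices / workers) * avg_poll_s


if __name__ == "__main__":
    print("worker held for up to", worst_case_hold_s(2, 2), "seconds")
    for workers in (100, 50, 25):
        print(workers, "workers:", estimated_cycle_s(1000, 3, workers), "second cycle")
```

Halving the worker count doubles the estimated cycle time, so check the result against the configured interval before lowering concurrency.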

Target or eliminate expensive MIB walks

Do not walk ifTable, cdpCacheTable, dot1dTpFdbTable, or large routing tables on every poll cycle. Split static data (interface descriptions, software versions, neighbor tables) from dynamic data (interface counters, CPU, memory). Poll static data infrequently. Poll dynamic data at the interval you need, using targeted GETs rather than full walks.
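
One way to picture the split is a schedule that maps each data group to its own interval. The group names and intervals below are assumptions chosen for illustration, not recommended values, and `groups_due` is a hypothetical helper.

```python
# Poll intervals in seconds per data group (illustrative values).
POLL_GROUPS = {
    "counters": 60,                  # dynamic: interface counters
    "cpu_memory": 60,                # dynamic: CPU and memory
    "interface_descriptions": 3600,  # static: changes rarely
    "neighbor_tables": 86400,        # static: expensive table walks
}


def groups_due(elapsed_s: int, groups=POLL_GROUPS):
    """Return the groups whose interval divides the elapsed time, sorted by name."""
    return sorted(name for name, interval in groups.items() if elapsed_s % interval == 0)


if __name__ == "__main__":
    for tick in (60, 3600, 86400):
        print(tick, groups_due(tick))
```

Most ticks poll only the cheap dynamic groups. The expensive walks run once per hour or once per day.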

Consolidate overlapping NMS tools

If multiple monitoring platforms independently poll the same devices, the combined load is the sum of all pollers. Identify all SNMP sources via packet capture or ACL logging. Restrict SNMP access to known management stations:

! WARNING: This immediately cuts off SNMP access from any host not in the ACL.
! Cisco IOS configuration mode:
access-list 10 permit 192.168.100.10
access-list 10 permit 192.168.100.11
snmp-server community MONITORING-RO RO 10
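
Before applying an ACL like this, it helps to know who is actually polling. The sketch below assumes you have already extracted (source IP, destination port) pairs from a capture or from ACL logs; the input shape and the function name are hypothetical, and no capture parsing is shown.

```python
from collections import Counter


def snmp_sources(packets):
    """Count SNMP requests per source address.

    packets: iterable of (src_ip, dst_port) tuples taken from a capture.
    Returns (src_ip, count) pairs, busiest source first.
    """
    counts = Counter(src for src, dst_port in packets if dst_port == 161)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        ("192.168.100.10", 161),
        ("192.168.100.11", 161),
        ("192.168.100.10", 161),
        ("10.20.30.40", 22),
    ]
    for src, count in snmp_sources(sample):
        print(src, count)
```

Any source in the output that is not a known management station is a candidate for consolidation, and it will lose access once the ACL is applied.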

Enable CEF on Cisco routers

If CEF is disabled, enabling it eliminates the lexicographic sort overhead for routing-table SNMP queries. This is standard operational practice on Cisco IOS and IOS XE.

Apply CoPP to rate-limit SNMP traffic

Control Plane Policing can rate-limit SNMP packets to protect the control-plane CPU. However, configuring output policing on some platforms causes the router to silently discard excess SNMP packets rather than sending ICMP unreachable messages. This increases response times at the NMS without alerting the NMS that throttling is occurring. Your dashboards will show SNMP timeouts, not “device rate-limited you.”
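
To illustrate why throttling looks like timeouts from the NMS side, here is a conceptual token-bucket model of a policer. It is not how any specific platform implements CoPP; the class name, the packets-per-second rate, and the burst size are assumptions for illustration only.

```python
class Policer:
    """Token-bucket policer: admits packets while tokens remain, drops silently otherwise."""

    def __init__(self, rate_pps: float, burst: int):
        self.rate_pps = rate_pps      # tokens added per second
        self.burst = burst            # bucket capacity
        self.tokens = float(burst)    # start with a full bucket
        self.last_seen = 0.0          # timestamp of the previous packet
        self.dropped = 0              # the only place a drop is recorded

    def allow(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` (seconds) is admitted."""
        elapsed = now - self.last_seen
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last_seen = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1  # no reply is sent; the sender simply sees a timeout
        return False


if __name__ == "__main__":
    policer = Policer(rate_pps=10, burst=5)
    burst_results = [policer.allow(0.0) for _ in range(8)]
    print(burst_results, "dropped:", policer.dropped)
```

In this model the drop counter on the device is the only record that throttling happened, which is why the drop counters need to be monitored alongside the NMS timeout rate.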

Avoid debug commands during saturation

The debug snmp packets command and similar debug commands generate additional processing load that can worsen a CPU saturation event on a marginal device. Use packet captures from a SPAN or mirror port instead. Never enable SNMP debug on a device already at high CPU.

Prevention

  • Split polling by data type. Poll static configuration data hourly or daily. Poll dynamic counters at the interval your alerting requires. Sub-minute polling of device-wide statistics is a common misconfiguration.
  • Monitor the poller’s own health. Track poll cycle duration against the configured interval. If the cycle duration climbs beyond 70% of the interval, the poller is close to falling behind. When it does fall behind, every downstream signal becomes unreliable.
  • Track per-device SNMP latency. A single device with p99 latency far above its peers is a future cascade trigger. Identify and remediate it before it stalls the schedule.
  • Prefer streaming telemetry where available. gNMI, Junos Telemetry Interface, and similar push-based streaming protocols avoid the polling overhead entirely. Coverage is uneven across device generations, but where available, streaming eliminates the SNMP packet-rate problem at the source.
  • Monitor control-plane CPU as a first-class metric. Track it correlated with SNMP timeout rate and poll latency. The correlation is the diagnostic signal.
  • Apply CoPP proactively. Do not wait for an incident to configure control-plane policing. Set SNMP rate limits with headroom above your normal polling rate, and monitor the drop counters so you know when you are approaching the limit.
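
The poller-health and per-device latency checks from the list above can be sketched as two small functions. The 70% and 1.5x thresholds come from this article; the status labels, the "5 times the median" outlier rule, and the function names are assumptions for illustration.

```python
from statistics import median


def poll_cycle_status(cycle_s: float, interval_s: float) -> str:
    """Compare poll cycle duration with the configured interval."""
    ratio = cycle_s / interval_s
    if ratio > 1.5:
        return "stale"    # data is older than the interval implies
    if ratio > 0.7:
        return "warning"  # little headroom left before the poller falls behind
    return "ok"


def latency_outliers(p99_by_device: dict, factor: float = 5.0) -> list:
    """Devices whose p99 SNMP latency exceeds factor times the median of all devices."""
    if not p99_by_device:
        return []
    typical = median(p99_by_device.values())
    return sorted(dev for dev, p99 in p99_by_device.items() if p99 > factor * typical)


if __name__ == "__main__":
    print(poll_cycle_status(45, 60))
    print(latency_outliers({"r1": 0.05, "r2": 0.06, "r3": 0.04, "r4": 1.2}))
```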

How Netdata helps

Netdata correlates device control-plane CPU with SNMP collector performance on the same timeline, which matters for this specific problem:

  • Device CPU (5-sec average) and SNMP response latency overlaid, making the cause-and-effect relationship visible without manual cross-referencing.
  • Per-device SNMP latency histograms that surface the long-tail slow device before it cascades into scheduler fall-behind and false device-down alerts.
  • Poll cycle duration tracked against the configured interval, catching drift before data goes stale.
  • BGP session state transitions shown alongside the CPU spike that caused them, confirming hold-time expiry without manual log correlation.
  • Collector health metrics (poll cycle duration, worker utilization, per-core CPU) on the collector itself, catching scheduler oversubscription before it impacts devices.