SNMP poller falling behind: the polling-storm cascade and how to catch it

A single slow device is all it takes. Your SNMP poller queue drifts past its scheduled interval. Within minutes, 30 devices show as DOWN in your NMS dashboard. Every one of them responds to ping. The network is fine; your poller is the problem.

Scheduler fall-behind is the most common false “device down” trigger in network monitoring. When a poller cannot complete its collection cycle within the configured interval, every subsequent cycle inherits the debt. Devices that are reachable and healthy appear DOWN because their next poll slot arrives late relative to the alerting threshold. The cascade is self-reinforcing: missed polls generate retries, retries consume worker threads, fewer workers means slower polls for all other devices, more devices time out, and queue depth grows unboundedly.
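The inherited debt can be illustrated with a little arithmetic. The sketch below is a simplified model, not any poller's actual scheduler: it assumes every cycle takes the same fixed duration and that a late cycle starts as soon as the previous one finishes. The function name and the numbers are illustrative.

```python
def poll_lateness(cycle_seconds, interval_seconds, cycles):
    """Return how late each cycle starts, in seconds, for a fixed cycle duration."""
    lateness = []
    debt = 0.0
    for _ in range(cycles):
        lateness.append(debt)
        # A cycle that overruns the interval pushes the next start later;
        # a cycle that finishes early lets the scheduler pay the debt back.
        debt = max(0.0, debt + cycle_seconds - interval_seconds)
    return lateness


# A 75 s cycle on a 60 s interval loses 15 s per cycle.
print(poll_lateness(75, 60, 5))  # [0.0, 15.0, 30.0, 45.0, 60.0]
```

In this model the fifth cycle starts a full interval late even though no device has failed.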

The monitored devices are the second victim. The SNMP agent on a network device runs on control-plane CPU. When a poller hits a device with concurrent walks of large MIB tables, control-plane CPU saturates. BGP hold-time expiry, OSPF adjacency drops, and CLI unresponsiveness can follow. The monitoring system is now actively harming the network it is supposed to watch.

What this means

The defining characteristic: many devices appear DOWN simultaneously, but ICMP reachability to those same devices succeeds. SNMP is down; the device is up. This is a collector-side problem, not a network outage.

flowchart TD
    A[Slow device or large MIB walk] --> B[Worker threads held past timeout]
    B --> C[Retries consume additional workers]
    C --> D[Fewer workers for other devices]
    D --> E[More devices time out]
    E --> F[Queue depth grows]
    F --> G[Poll cycle exceeds configured interval]
    G --> H[Devices appear DOWN]
    H --> I[Retries on DOWN devices add load]
    I --> C
    E --> J[Device control-plane CPU spikes]
    J --> K[BGP hold-time expiry, session drops]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler concurrency too high | Many parallel polls, collector CPU near saturation, device control-plane CPU spikes during poll windows | Collector CPU and per-device SNMP latency |
| Aggressive polling of large MIB tables | Specific devices consistently slow (ifTable, cdpCacheTable, large FDB walks), one device stalls a worker for 30+ seconds | Per-device SNMP latency histogram |
| Long-tail slow device holding workers | One or two devices with latency 10x the median, poll cycle dominated by the slowest device | Per-device SNMP response latency |
| Insufficient collector capacity | Poll cycle duration trending upward toward the configured interval even without individual slow devices | Poll cycle duration vs. configured interval |
| DNS resolution failures inflating poll time | Poller appears stalled with no obvious SNMP error, per-device timing shows DNS overhead | DNS resolution time on the collector host |
| Device SNMP agent overload | Specific device responds slowly to all OIDs, ICMP RTT normal but SNMP latency high | Control-plane CPU on the affected device |

Quick checks

These are safe, read-only commands. Run them on the collector and against suspect devices.

# Check collector CPU saturation (watch for a single core at 100%)
mpstat -P ALL 1 5

# Check poll cycle duration and queue depth (vendor-specific stats endpoint)
curl -s http://localhost:<port>/metrics | grep -E 'poll|collect|queue'

# Verify a "down" device is actually reachable via ICMP
ping -c 5 -i 0.2 <device>

# Time a direct SNMP GET to a suspect device (bypasses the scheduler)
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0

# Check SNMP timeout and retry behavior explicitly
time snmpget -v2c -c <community> -t 2 -r 2 <device> .1.3.6.1.2.1.1.3.0

# Check device control-plane CPU (Cisco)
# cpmCPUTotal5secRev (.6) gives the most current view; .7 is cpmCPUTotal1minRev, .8 is cpmCPUTotal5minRev.
# The column is indexed per CPU entity, so walk it rather than GET the bare column OID.
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.6

# Check device control-plane CPU (Juniper)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.8

# Check per-thread CPU of the collector process
top -H -p $(pgrep -d, <collector_process>)

How to diagnose it

  1. Confirm the cascade is collector-side, not network-side. Ping the devices that appear DOWN. If ICMP succeeds where SNMP fails, the network path is healthy. SNMP down with ICMP up means the problem is on the collector or the device SNMP agent, not the network.

  2. Check poll cycle duration against the configured interval. If the cycle takes longer than the interval, the scheduler is drifting. At 1.5x the interval, data is delayed. At 2x, data is effectively stale. This is the single most important diagnostic signal, and it is consistently the most under-monitored meta-signal in network monitoring.

  3. Identify the slowest device. Pull a per-device SNMP latency histogram. The cascade is often triggered by one device holding a worker thread for 30+ seconds during a large MIB walk. Cycle time cannot be shorter than the sum of per-device response times divided by worker count, and it cannot be shorter than the slowest single device, so one outlier can dominate.

  4. Check collector CPU. Run mpstat -P ALL 1 5. If one core is at 100% with others idle, the problem may be RSS (receive-side scaling) misconfiguration funneling all packet processing to one core, or a single hot thread in the collector process (check with top -H). If all cores are saturated, the collector is under-provisioned for the workload.

  5. Check device control-plane CPU. Poll the CPU OID on the slowest device. Sustained control-plane CPU above 70% means the device SNMP agent is starving other processes. Above 90% sustained is where BGP hold-time expiry and session drops begin.

  6. Look at the timeout rate. A timeout rate above 5% sustained is the early-warning threshold. Above 30%, polling data is unreliable for alerting or capacity decisions. Distinguish per-device timeouts (device-side) from widespread timeouts (collector-side).

  7. Verify the issue resolves when polling is paused. If temporarily disabling polling to the suspect devices causes their control-plane CPU to drop and other devices to recover, the diagnosis is confirmed: the monitoring system is the cause. WARNING: this test loses visibility into the target devices for the duration. Use a short maintenance window.
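The arithmetic behind steps 2 and 3 can be sketched as follows. This is a minimal illustration, assuming each device is polled once per cycle by a single worker; the function names and the "drifting" label are hypothetical, while the 1.5x and 2x multiples come from step 2.

```python
def min_cycle_seconds(poll_seconds, workers):
    """Lower bound on cycle time: total work spread over the workers,
    but never shorter than the single slowest device."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not poll_seconds:
        return 0.0
    return max(sum(poll_seconds) / workers, max(poll_seconds))


def drift_state(cycle_seconds, interval_seconds):
    """Classify a cycle against its configured interval."""
    ratio = cycle_seconds / interval_seconds
    if ratio >= 2.0:
        return "stale"
    if ratio >= 1.5:
        return "delayed"
    if ratio > 1.0:
        return "drifting"
    return "on-schedule"


# 99 devices answering in 0.5 s plus one 45 s outlier, 10 workers, 30 s interval.
polls = [0.5] * 99 + [45.0]
bound = min_cycle_seconds(polls, 10)
print(bound)                   # 45.0
print(drift_state(bound, 30))  # delayed
```

Without the outlier the same fleet would need roughly 5 seconds per cycle; the single slow device sets the floor at 45 seconds.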

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Poll cycle duration vs. configured interval | Directly indicates whether the scheduler can keep up | Cycle exceeding 70% of interval (headroom lost); exceeding interval (drift); exceeding 2x interval (stale data) |
| SNMP timeout rate | Measures how often polls fail within the configured window | Above 5% sustained is early warning; above 30% means data is unreliable |
| Per-device SNMP poll response latency | Identifies the slow device triggering the cascade | p99 latency above 1s on a normally fast device; latency spiking during bulk walks |
| Device control-plane CPU | Indicates device-side SNMP agent starvation | Sustained above 70% during poll windows; above 90% risks BGP hold-time expiry |
| Collector CPU (per-core) | Reveals collector-side bottleneck including RSS misconfiguration | Single core at 100% with others idle (RSS issue); all cores above 90% sustained |
| Poller worker queue depth | Measures backlog of unprocessed poll jobs | Queue growing without bound; workers busy above 50% sustained |
| ICMP reachability to “down” devices | Distinguishes false down from real outage | ICMP succeeds while SNMP fails: collector-side or agent-side problem |
| Data freshness (time since last successful poll) | Measures how stale the data is for each device | Time since last successful poll exceeding 2x the poll interval (soft-fail) |
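Two of these warning signs reduce to simple comparisons. The sketch below encodes the 5% and 30% timeout-rate thresholds and the 2x freshness rule from the table; the function names and state labels are hypothetical, and the "sustained" qualifier is not modeled, so the caller is assumed to pass counts aggregated over a suitable window.

```python
def timeout_rate_state(timeouts, polls):
    """Classify the share of polls that timed out in one observation window."""
    if polls == 0:
        return "no-data"
    rate = timeouts / polls
    if rate > 0.30:
        return "unreliable"
    if rate > 0.05:
        return "warning"
    return "ok"


def is_soft_fail(seconds_since_success, interval_seconds):
    """True when the last good poll is older than twice the poll interval."""
    return seconds_since_success > 2 * interval_seconds


print(timeout_rate_state(10, 100))  # warning
print(is_soft_fail(130, 60))        # True
```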

Fixes

Reduce poller concurrency

If the collector CPU is saturated and the poll cycle exceeds the interval, reduce the number of concurrent worker threads. This trades latency for stability: fewer parallel polls mean each device gets a longer time slice, which reduces timeouts at the cost of slower overall cycle completion. A slower cycle that completes is better than a faster cycle that drifts and generates false alerts.

Increase per-poll timeout

If specific devices are timing out at the default timeout (often 1s for SNMPv2c), increase the per-poll timeout. SNMPv3 with authPriv adds per-packet auth/encryption overhead and engine-discovery round-trips. Timeouts at the default 1s are common for SNMPv3 and often reflect auth overhead, not agent slowness. Increasing timeout prevents retries from consuming additional workers. The tradeoff: a longer timeout means a stalled worker holds the slot longer. Pair this with reduced concurrency so the total cycle time stays bounded.
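The timeout tradeoff is easier to judge once the worst-case hold time is written out. The sketch below assumes a fixed timeout per attempt and one initial attempt plus the configured retries, which matches the Net-SNMP -t and -r options used in the quick checks; a poller that backs off between retries will hold the worker longer. The function name is illustrative.

```python
def worst_case_hold_seconds(timeout_seconds, retries):
    """Time one request can occupy a worker when the agent never answers."""
    attempts = retries + 1  # the initial request plus each retry
    return timeout_seconds * attempts


print(worst_case_hold_seconds(1, 5))  # 6: a 1 s timeout with 5 retries
print(worst_case_hold_seconds(2, 2))  # 6: the -t 2 -r 2 quick check
print(worst_case_hold_seconds(5, 5))  # 30: a 5 s timeout with 5 retries
```

Raising the timeout without lowering the retry count multiplies the hold time, which is why the two settings are best tuned together.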

Exclude large MIB table walks from default polling

Large MIB walks are the most common cascade trigger. Walking the entire ifTable, cdpCacheTable, or a large dot1dTpFdbTable on a switch with 50,000 MAC entries can stall a worker for tens of seconds and spike device CPU. Move these walks to targeted polling at a slower cadence, or restrict them to scheduled discovery intervals rather than every poll cycle. The default polling cycle should collect lightweight OIDs (sysUpTime, ifOperStatus, counters). Discovery-heavy walks belong on a separate schedule.

Isolate or replace the slow device

If one device consistently dominates the per-device latency histogram, investigate the root cause. It may have an overloaded SNMP agent, a control-plane CPU already saturated by other duties (BGP reconvergence, STP recalculation), or an ACL that is rate-limiting SNMP. In some cases, the device may need to be polled at a slower cadence or via a different transport. Some SD-WAN overlays do not expose SNMP at all; there, the absence of a response is by design, not a failure.

Address DNS resolution failures

DNS resolution failures can silently inflate polling times by adding per-device resolution overhead. If per-device timing shows DNS as the bottleneck, ensure the collector uses a local caching resolver, or switch to IP-based polling for devices that do not require DNS.
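One way to see whether name resolution is the bottleneck is to time the lookup separately from the SNMP request. The sketch below is a minimal example using the standard library's socket.getaddrinfo; the function name and the injectable resolver parameter are assumptions made so the timing logic can be exercised without a live resolver.

```python
import socket
import time


def time_resolution(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, error) for one lookup; error is None on success."""
    start = time.perf_counter()
    error = None
    try:
        resolver(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.perf_counter() - start, error


if __name__ == "__main__":
    # Replace "localhost" with the device names the poller resolves.
    elapsed, err = time_resolution("localhost")
    print(f"localhost resolved in {elapsed * 1000:.1f} ms, error={err}")
```

Lookups that take hundreds of milliseconds, or that fail only after the resolver's own timeout, add that delay to every poll of the device.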

Prevention

Monitor the poller itself. The poller’s own health is the most under-monitored signal in network monitoring. When the poller falls behind, all downstream signals become unreliable. Track poll cycle duration, timeout rate, worker queue depth, and collector CPU as first-class metrics with their own alerting thresholds.

Set capacity headroom thresholds. Comfortable headroom is poll cycle duration below 70% of the configured interval, per-device SNMP latency p99 below 1s, and control-plane CPU below 50% during poll windows. Workers should be below 50% utilization with queue depth below 25% of maximum. When any of these thresholds are crossed, add capacity or reduce polling scope before the cascade begins. The degradation curve is gradual then cliff: latency rises slowly as the device SNMP agent saturates, then the scheduler cannot complete a cycle within the interval, then data goes stale.
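These headroom thresholds can be checked mechanically. The sketch below is one possible encoding, not a feature of any particular NMS: the PollerSample fields and the function name are hypothetical, and only the threshold values come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class PollerSample:
    """One observation of poller health. Field names are illustrative."""
    cycle_seconds: float
    interval_seconds: float
    latency_p99_seconds: float
    device_cpu_percent: float
    worker_utilization: float  # fraction of workers busy, 0.0 to 1.0
    queue_depth: int
    queue_max: int


def headroom_violations(s):
    """Return the headroom thresholds that this sample has crossed."""
    checks = [
        ("cycle at or above 70% of interval",
         s.cycle_seconds >= 0.70 * s.interval_seconds),
        ("p99 SNMP latency at or above 1s", s.latency_p99_seconds >= 1.0),
        ("device control-plane CPU at or above 50%", s.device_cpu_percent >= 50.0),
        ("workers at or above 50% utilization", s.worker_utilization >= 0.50),
        ("queue depth at or above 25% of maximum",
         s.queue_depth >= 0.25 * s.queue_max),
    ]
    return [name for name, crossed in checks if crossed]


sample = PollerSample(50, 60, 0.4, 35, 0.6, 10, 100)
for problem in headroom_violations(sample):
    print(problem)
```

For this sample the check reports the cycle duration and the worker utilization, the two signals that have left the comfortable range.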

Separate poll tiers. Not all devices need the same polling cadence. Put critical-path devices on a fast cycle, edge devices on a slower cycle, and discovery-heavy walks (FDB, CDP/LLDP) on a separate, infrequent schedule. This prevents one slow walk from blocking the entire poll cycle for all devices.
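A tiered schedule might be expressed like this. The tier names, intervals, and the tick-based scheduling are illustrative assumptions, not recommended values or a real poller's configuration format.

```python
# Hypothetical tiers: interval in seconds and what each tier collects.
TIERS = {
    "critical": {"interval_seconds": 60, "collects": "sysUpTime, ifOperStatus, counters"},
    "edge": {"interval_seconds": 300, "collects": "sysUpTime, ifOperStatus, counters"},
    "discovery": {"interval_seconds": 3600, "collects": "FDB and CDP/LLDP walks"},
}


def due_tiers(elapsed_seconds, tiers=TIERS):
    """Return the tiers whose interval divides the elapsed time evenly."""
    return [
        name
        for name, cfg in tiers.items()
        if elapsed_seconds % cfg["interval_seconds"] == 0
    ]


print(due_tiers(60))    # ['critical']
print(due_tiers(300))   # ['critical', 'edge']
print(due_tiers(3600))  # ['critical', 'edge', 'discovery']
```

In practice the discovery tier would also run on its own worker pool, so that a long walk cannot hold a worker the fast tier needs.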

Watch for vendor-specific polling halts. Some NMS platforms stop polling a host entirely after repeated SNMP failures. Zabbix logs “temporarily disabling SNMP agent checks on host” and requires a server restart or interface re-detection to resume polling. Aggressive retry settings combined with this behavior can permanently disable monitoring for devices that are merely slow, not down. Zabbix 7.0 increased the asynchronous SNMP poller retry count to reduce silent polling failures on intermittently reachable devices.

Correlate SNMP failure with ICMP before paging. A simple rule: SNMP down plus ICMP up equals an agent or collector problem, not a device outage. Demote these to investigation tickets unless corroborating signals (syslog silence, coldStart trap, BGP session drop) confirm a real outage.
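The correlation rule can be written as a small decision function. This is a sketch under stated assumptions: the action labels are hypothetical, and the handling of the two cases the rule does not cover (both probes failing, and SNMP succeeding while ICMP fails) is a guess at sensible defaults rather than something the rule itself specifies.

```python
def alert_action(icmp_up, snmp_up, corroborated=False):
    """Decide what to do with a device based on ICMP and SNMP probe results.

    corroborated: True when another signal (syslog silence, coldStart trap,
    BGP session drop) supports a real outage.
    """
    if snmp_up:
        # Assumption: SNMP answering means the device is alive even if
        # ICMP is filtered somewhere on the path.
        return "ok"
    if not icmp_up:
        # Assumption: both probes failing is treated as a real outage.
        return "page"
    # SNMP down, ICMP up: agent or collector problem unless corroborated.
    return "page" if corroborated else "ticket"


print(alert_action(icmp_up=True, snmp_up=False))                     # ticket
print(alert_action(icmp_up=True, snmp_up=False, corroborated=True))  # page
print(alert_action(icmp_up=False, snmp_up=False))                    # page
```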

How Netdata helps

  • Collector CPU and per-core utilization surface the collector-side bottleneck. Netdata’s per-core CPU breakdown catches RSS misconfiguration (one core pinned at 100%, others idle) that aggregate averages hide.
  • SNMP latency and timeout metrics provide the per-device histogram needed to identify the slow device triggering the cascade.
  • ICMP reachability can be correlated with SNMP state. When Netdata shows ICMP succeeding while SNMP times out, the diagnosis shifts from “device down” to “collector or agent problem.”
  • Control-plane CPU metrics (via SNMP polls of cpmCPUTotal5sec on Cisco or jnxOperatingCPU on Juniper) reveal when the monitoring system itself is causing device-side CPU saturation.
  • Data freshness tracking shows the gap between the configured poll interval and the actual time since last successful poll, making schedule drift visible before it triggers false alerts.