SNMP poller falling behind: the polling-storm cascade and how to catch it
A single slow device is all it takes. Your SNMP poller queue drifts past its scheduled interval. Within minutes, 30 devices show as DOWN in your NMS dashboard. Every one of them responds to ping. The network is fine; your poller is the problem.
Scheduler fall-behind is the most common false “device down” trigger in network monitoring. When a poller cannot complete its collection cycle within the configured interval, every subsequent cycle inherits the debt. Devices that are reachable and healthy appear DOWN because their next poll slot arrives late relative to the alerting threshold. The cascade is self-reinforcing: missed polls generate retries, retries consume worker threads, fewer workers mean slower polls for all other devices, more devices time out, and queue depth grows without bound.
The monitored devices are the second victim. The SNMP agent on a network device runs on control-plane CPU. When a poller hits a device with concurrent walks of large MIB tables, control-plane CPU saturates. BGP hold-time expiry, OSPF adjacency drops, and CLI unresponsiveness can follow. The monitoring system is now actively harming the network it is supposed to watch.
What this means
The defining characteristic: many devices appear DOWN simultaneously, but ICMP reachability to those same devices succeeds. SNMP is down; the device is up. This is a collector-side problem, not a network outage.
flowchart TD
A[Slow device or large MIB walk] --> B[Worker threads held past timeout]
B --> C[Retries consume additional workers]
C --> D[Fewer workers for other devices]
D --> E[More devices time out]
E --> F[Queue depth grows]
F --> G[Poll cycle exceeds configured interval]
G --> H[Devices appear DOWN]
H --> I[Retries on DOWN devices add load]
I --> C
E --> J[Device control-plane CPU spikes]
J --> K[BGP hold-time expiry, session drops]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler concurrency too high | Many parallel polls, collector CPU near saturation, device control-plane CPU spikes during poll windows | Collector CPU and per-device SNMP latency |
| Aggressive polling of large MIB tables | Specific devices consistently slow (ifTable, cdpCacheTable, large FDB walks), one device stalls a worker for 30+ seconds | Per-device SNMP latency histogram |
| Long-tail slow device holding workers | One or two devices with latency 10x the median, poll cycle dominated by the slowest device | Per-device SNMP response latency |
| Insufficient collector capacity | Poll cycle duration trending upward toward the configured interval even without individual slow devices | Poll cycle duration vs. configured interval |
| DNS resolution failures inflating poll time | Poller appears stalled with no obvious SNMP error, per-device timing shows DNS overhead | DNS resolution time on the collector host |
| Device SNMP agent overload | Specific device responds slowly to all OIDs, ICMP RTT normal but SNMP latency high | Control-plane CPU on the affected device |
Quick checks
These are safe, read-only commands. Run them on the collector and against suspect devices.
# Check collector CPU saturation (watch for a single core at 100%)
mpstat -P ALL 1 5
# Check poll cycle duration and queue depth (vendor-specific stats endpoint)
curl -s http://localhost:<port>/metrics | grep -E 'poll|collect|queue'
# Verify a "down" device is actually reachable via ICMP
ping -c 5 -i 0.2 <device>
# Time a direct SNMP GET to a suspect device (bypasses the scheduler)
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0
# Check SNMP timeout and retry behavior explicitly
time snmpget -v2c -c <community> -t 2 -r 2 <device> .1.3.6.1.2.1.1.3.0
# Check device control-plane CPU (Cisco)
# .1.3.6.1.4.1.9.9.109.1.1.1.1.7 is cpmCPUTotal1minRev (1-minute average).
# For near-real-time diagnosis, walk cpmCPUTotal5secRev at .1.3.6.1.4.1.9.9.109.1.1.1.1.6 instead.
# These are table columns, so walk them; a bare snmpget without an instance index returns no value.
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.7
# Check device control-plane CPU (Juniper)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.8
# Check per-thread CPU of the collector process
top -H -p $(pgrep -d, <collector_process>)
How to diagnose it
Confirm the cascade is collector-side, not network-side. Ping the devices that appear DOWN. If ICMP succeeds where SNMP fails, the network path is healthy. SNMP down with ICMP up means the problem is on the collector or the device SNMP agent, not the network.
Check poll cycle duration against the configured interval. If the cycle takes longer than the interval, the scheduler is drifting. At 1.5x the interval, data is delayed. At 2x, data is effectively stale. This is the single most important diagnostic signal, and it is consistently the most under-monitored meta-signal in network monitoring.
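The thresholds in this step can be encoded as a small classifier. The following is a minimal Python sketch, not taken from any particular NMS: the function name `drift_state` and the state labels are illustrative assumptions, while the 0.7x, 1x, 1.5x, and 2x cut-offs are the ones used in this guide.

```python
def drift_state(cycle_s: float, interval_s: float) -> str:
    """Classify scheduler drift from one measured poll cycle.

    cycle_s is how long the last full cycle took; interval_s is the
    configured polling interval. Labels are hypothetical names.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    ratio = cycle_s / interval_s
    if ratio >= 2.0:
        return "stale"          # data is effectively stale
    if ratio >= 1.5:
        return "delayed"        # data is delayed
    if ratio > 1.0:
        return "drifting"       # cycle no longer fits the interval
    if ratio > 0.7:
        return "headroom_lost"  # still on time, but little margin left
    return "ok"


if __name__ == "__main__":
    for cycle in (150, 240, 330, 450, 600):
        print(cycle, drift_state(cycle, 300))
```

Feeding this from the poller's own cycle-duration metric, rather than from device data, keeps the check independent of the devices being polled.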
Identify the slowest device. Pull a per-device SNMP latency histogram. The cascade is often triggered by one device holding a worker thread for 30+ seconds during a large MIB walk. Cycle time is bounded below by the sum of per-device response times divided by the worker count, and also by the slowest single poll, so one outlier can dominate the whole cycle.
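One way to see why a single outlier dominates: no schedule can finish faster than the total work divided by the worker count, nor faster than the slowest single poll. The sketch below computes that lower bound. The function name and the sample latencies are assumptions chosen for illustration, not measurements.

```python
from typing import Sequence


def cycle_lower_bound(latencies_s: Sequence[float], workers: int) -> float:
    """Best-case cycle duration for the given per-device poll latencies.

    The cycle cannot beat total work / workers, and it cannot beat the
    slowest individual poll, so the bound is the larger of the two.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not latencies_s:
        return 0.0
    return max(sum(latencies_s) / workers, max(latencies_s))


if __name__ == "__main__":
    healthy = [0.2] * 100                 # 100 devices answering in 200 ms
    with_outlier = [0.2] * 99 + [35.0]    # one device stuck in a 35 s walk
    print(cycle_lower_bound(healthy, workers=10))
    print(cycle_lower_bound(with_outlier, workers=10))
```

With these illustrative numbers the healthy fleet needs about 2 seconds, while the single 35-second walk sets the floor for the entire cycle regardless of how many workers are added.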
Check collector CPU. Run `mpstat -P ALL 1 5`. If one core is at 100% with others idle, the problem may be RSS misconfiguration funneling all packet processing to one core. If all cores are saturated, the collector is under-provisioned for the workload.
Check device control-plane CPU. Poll the CPU OID on the slowest device. Sustained control-plane CPU above 70% means the device SNMP agent is starving other processes. Above 90% sustained is where BGP hold-time expiry and session drops begin.
Look at the timeout rate. A timeout rate above 5% sustained is the early-warning threshold. Above 30%, polling data is unreliable for alerting or capacity decisions. Distinguish per-device timeouts (device-side) from widespread timeouts (collector-side).
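The two questions in this step (how bad is the rate, and is it concentrated or widespread) can be answered from per-device counters. The sketch below is a hedged illustration: the 5% and 30% thresholds come from this guide, but the 25% "widespread" cut-off, the function name, and the labels are assumptions to tune for your fleet. It also evaluates a single window, so "sustained" still has to be judged across consecutive windows.

```python
from typing import Dict, Tuple

EARLY_WARNING = 0.05     # from this guide: above 5% sustained
UNRELIABLE = 0.30        # from this guide: above 30%
WIDESPREAD_SHARE = 0.25  # assumption: more than 25% of devices affected


def summarize_timeouts(polls: Dict[str, Tuple[int, int]]) -> Tuple[str, str]:
    """polls maps device name -> (timeouts, attempts) for one window.

    Returns (severity, scope) where scope hints at which side to look at.
    """
    attempts = sum(a for _, a in polls.values())
    timeouts = sum(t for t, _ in polls.values())
    rate = timeouts / attempts if attempts else 0.0

    if rate > UNRELIABLE:
        severity = "unreliable"
    elif rate > EARLY_WARNING:
        severity = "early_warning"
    else:
        severity = "ok"

    affected = sum(1 for t, _ in polls.values() if t > 0)
    share = affected / len(polls) if polls else 0.0
    if affected == 0:
        scope = "none"
    elif share > WIDESPREAD_SHARE:
        scope = "collector_side"   # many devices timing out at once
    else:
        scope = "device_side"      # timeouts concentrated on a few devices
    return severity, scope


if __name__ == "__main__":
    window = {f"sw{i}": (0, 10) for i in range(9)}
    window["core1"] = (8, 10)
    print(summarize_timeouts(window))
```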
Verify the issue resolves when polling is paused. If temporarily disabling polling to the suspect devices causes their control-plane CPU to drop and other devices to recover, the diagnosis is confirmed: the monitoring system is the cause. WARNING: this test loses visibility into the target devices for the duration. Use a short maintenance window.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Poll cycle duration vs. configured interval | Directly indicates whether the scheduler can keep up | Cycle exceeding 70% of interval (headroom lost); exceeding interval (drift); exceeding 2x interval (stale data) |
| SNMP timeout rate | Measures how often polls fail within the configured window | Above 5% sustained is early warning; above 30% means data is unreliable |
| Per-device SNMP poll response latency | Identifies the slow device triggering the cascade | p99 latency above 1s on a normally fast device; latency spiking during bulk walks |
| Device control-plane CPU | Indicates device-side SNMP agent starvation | Sustained above 70% during poll windows; above 90% risks BGP hold-time expiry |
| Collector CPU (per-core) | Reveals collector-side bottleneck including RSS misconfiguration | Single core at 100% with others idle (RSS issue); all cores above 90% sustained |
| Poller worker queue depth | Measures backlog of unprocessed poll jobs | Queue growing without bound; workers busy above 50% sustained |
| ICMP reachability to “down” devices | Distinguishes false down from real outage | ICMP succeeds while SNMP fails: collector-side or agent-side problem |
| Data freshness (time since last successful poll) | Measures how stale the data is for each device | Time since last successful poll exceeding 2x the poll interval is soft-fail |
Fixes
Reduce poller concurrency
If the collector CPU is saturated and the poll cycle exceeds the interval, reduce the number of concurrent worker threads. This trades latency for stability: fewer parallel polls mean each device gets a longer time slice, reducing timeouts at the cost of slower overall cycle completion. A slower cycle that completes is better than a faster cycle that drifts and generates false alerts. If the slower cycle no longer fits the configured interval, lengthen the interval to match.
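Most pollers expose concurrency as a configuration setting, so no code is normally needed. For readers who script their own collection, the idea can be sketched with a bounded thread pool. Everything here is an illustrative assumption: `run_cycle` and `fake_poll` are hypothetical names, and `fake_poll` merely sleeps instead of speaking SNMP.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_cycle(devices: List[str], poll: Callable[[str], float],
              max_workers: int) -> Tuple[float, int]:
    """Poll every device with at most max_workers polls in flight.

    Returns (cycle duration in seconds, peak number of concurrent polls).
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def tracked(device: str) -> float:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return poll(device)
        finally:
            with lock:
                in_flight -= 1

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(tracked, devices))  # consume results so errors surface
    return time.monotonic() - start, peak


def fake_poll(device: str) -> float:
    """Stand-in for a real SNMP poll: sleeps briefly, returns the latency."""
    time.sleep(0.01)
    return 0.01


if __name__ == "__main__":
    duration, peak = run_cycle([f"d{i}" for i in range(8)], fake_poll, 2)
    print(f"cycle took {duration:.3f}s with at most {peak} concurrent polls")
```

Recording the peak alongside the cycle duration makes the tradeoff measurable when the worker count is changed.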
Increase per-poll timeout
If specific devices are timing out at the default timeout (often 1s for SNMPv2c), increase the per-poll timeout. SNMPv3 with authPriv adds per-packet auth/encryption overhead and engine-discovery round-trips. Timeouts at the default 1s are common for SNMPv3 and often reflect auth overhead, not agent slowness. Increasing timeout prevents retries from consuming additional workers. The tradeoff: a longer timeout means a stalled worker holds the slot longer. Pair this with reduced concurrency so the total cycle time stays bounded.
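The tradeoff can be put in numbers. The sketch below assumes a fixed per-attempt timeout and that the retry count means extra attempts after the first, which is how the `-t` and `-r` options in the quick checks are normally interpreted; if your poller backs off between retries, the real figure is higher. Function names are hypothetical.

```python
def worst_case_hold_s(timeout_s: float, retries: int) -> float:
    """Seconds one worker is held by a device that never answers.

    Assumes a fixed timeout per attempt and `retries` additional
    attempts after the first one.
    """
    if timeout_s < 0 or retries < 0:
        raise ValueError("timeout_s and retries must be non-negative")
    return timeout_s * (retries + 1)


def unresponsive_cost_s(devices: int, timeout_s: float, retries: int,
                        workers: int) -> float:
    """Cycle time added by unresponsive devices, spread over all workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return devices * worst_case_hold_s(timeout_s, retries) / workers


if __name__ == "__main__":
    # -t 2 -r 2 as in the quick check: three attempts of 2 s each
    print(worst_case_hold_s(2, 2))
    # 30 silent devices shared across 10 workers
    print(unresponsive_cost_s(30, 2, 2, 10))
```

Raising the timeout while lowering the retry count can keep the worst-case hold time constant, which is one way to give slow agents more time without lengthening stalls.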
Exclude large MIB table walks from default polling
Large MIB walks are the most common cascade trigger. Walking the entire ifTable, cdpCacheTable, or a large dot1dTpFdbTable on a switch with 50,000 MAC entries can stall a worker for seconds and spike device CPU. Move these walks to targeted polling at a slower cadence, or restrict them to scheduled discovery intervals rather than every poll cycle. The default polling cycle should collect lightweight OIDs (sysUpTime, ifOperStatus, counters). Discovery-heavy walks belong on a separate schedule.
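A rough estimate shows why these walks are expensive. The sketch below counts request/response round trips for a table walk. The three-column count matches dot1dTpFdbTable, but the max-repetitions value of 10 and the 5 ms per request are assumptions for illustration, and the calculation ignores the final request that detects the end of the table.

```python
import math


def walk_round_trips(rows: int, columns: int, max_repetitions: int) -> int:
    """Approximate requests needed to walk a table.

    A walk returns rows * columns values; each bulk request carries at
    most max_repetitions of them. Use max_repetitions=1 for GETNEXT.
    """
    if max_repetitions < 1:
        raise ValueError("max_repetitions must be at least 1")
    return math.ceil(rows * columns / max_repetitions)


def walk_duration_s(rows: int, columns: int, max_repetitions: int,
                    per_request_s: float) -> float:
    """Approximate walk time given an assumed per-request latency."""
    return walk_round_trips(rows, columns, max_repetitions) * per_request_s


if __name__ == "__main__":
    # 50,000 MAC entries, 3 columns, assumed 10 repetitions and 5 ms each
    print(walk_round_trips(50_000, 3, 10))
    print(walk_duration_s(50_000, 3, 10, 0.005))
```

Under these assumptions a single FDB walk costs on the order of 15,000 requests and over a minute of worker time, all of it served by the device's control-plane CPU.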
Isolate or replace the slow device
If one device consistently dominates the per-device latency histogram, investigate the root cause. It may have an overloaded SNMP agent, a control-plane CPU already saturated by other duties (BGP reconvergence, STP recalculation), or an ACL that is rate-limiting SNMP. In some cases, the device may need to be polled at a slower cadence or via a different transport. Some SD-WAN overlays do not expose SNMP at all; absence of response is design, not failure.
Address DNS resolution failures
DNS resolution failures can silently inflate polling times by adding per-device resolution overhead. If per-device timing shows DNS as the bottleneck, ensure the collector uses a local caching resolver, or switch to IP-based polling for devices that do not require DNS.
Prevention
Monitor the poller itself. The poller’s own health is the most under-monitored signal in network monitoring. When the poller falls behind, all downstream signals become unreliable. Track poll cycle duration, timeout rate, worker queue depth, and collector CPU as first-class metrics with their own alerting thresholds.
Set capacity headroom thresholds. Comfortable headroom is poll cycle duration below 70% of the configured interval, per-device SNMP latency p99 below 1s, and control-plane CPU below 50% during poll windows. Workers should be below 50% utilization with queue depth below 25% of maximum. When any of these thresholds are crossed, add capacity or reduce polling scope before the cascade begins. The degradation curve is gradual then cliff: latency rises slowly as the device SNMP agent saturates, then the scheduler cannot complete a cycle within the interval, then data goes stale.
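These headroom limits lend themselves to a single check that runs against the poller's own metrics. The following is a minimal sketch: the `PollerSample` fields and breach labels are hypothetical names, and only the numeric limits are taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PollerSample:
    cycle_s: float             # measured poll cycle duration
    interval_s: float          # configured polling interval
    p99_latency_s: float       # per-device SNMP latency, 99th percentile
    device_cpu_pct: float      # control-plane CPU during the poll window
    worker_utilization: float  # fraction of workers busy, 0.0 to 1.0
    queue_depth: int
    queue_max: int


def headroom_breaches(s: PollerSample) -> List[str]:
    """Return the names of every headroom limit that is no longer met."""
    breaches = []
    if s.cycle_s >= 0.70 * s.interval_s:
        breaches.append("cycle_duration")
    if s.p99_latency_s >= 1.0:
        breaches.append("p99_latency")
    if s.device_cpu_pct >= 50.0:
        breaches.append("device_cpu")
    if s.worker_utilization >= 0.50:
        breaches.append("worker_utilization")
    if s.queue_depth >= 0.25 * s.queue_max:
        breaches.append("queue_depth")
    return breaches


if __name__ == "__main__":
    sample = PollerSample(250, 300, 0.4, 35.0, 0.3, 300, 1000)
    print(headroom_breaches(sample))
```

An empty list means comfortable headroom; any entry is a prompt to add capacity or reduce polling scope.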
Separate poll tiers. Not all devices need the same polling cadence. Put critical-path devices on a fast cycle, edge devices on a slower cycle, and discovery-heavy walks (FDB, CDP/LLDP) on a separate, infrequent schedule. This prevents one slow walk from blocking the entire poll cycle for all devices.
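Tiering can be as simple as a table of intervals. The sketch below shows which tiers come due at a given scheduler tick. The tier names and the 60 s, 300 s, and 3600 s intervals are assumptions for illustration, not recommendations.

```python
from typing import Dict, List

# Hypothetical tiers: name -> polling interval in seconds
TIERS: Dict[str, int] = {
    "critical": 60,     # critical-path devices, lightweight OIDs
    "edge": 300,        # edge devices
    "discovery": 3600,  # FDB and CDP/LLDP walks
}


def due_tiers(elapsed_s: int, tiers: Dict[str, int] = TIERS) -> List[str]:
    """Return the tiers whose interval divides the elapsed time evenly."""
    return [name for name, interval in tiers.items()
            if elapsed_s % interval == 0]


if __name__ == "__main__":
    for tick in (60, 300, 3600):
        print(tick, due_tiers(tick))
```

For the isolation described above to hold, each tier would also need its own worker pool or queue, so that a slow discovery walk cannot occupy workers the critical tier depends on.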
Watch for vendor-specific polling halts. Some NMS platforms stop polling a host entirely after repeated SNMP failures. Zabbix logs “temporarily disabling SNMP agent checks on host” and, depending on version and configuration, may need a server restart or interface re-detection before polling resumes. Aggressive retry settings combined with this behavior can leave monitoring disabled for devices that are merely slow, not down. Zabbix 7.0 increased the asynchronous SNMP poller retry count to reduce silent polling failures on intermittently reachable devices.
Correlate SNMP failure with ICMP before paging. A simple rule: SNMP down plus ICMP up equals an agent or collector problem, not a device outage. Demote these to investigation tickets unless corroborating signals (syslog silence, coldStart trap, BGP session drop) confirm a real outage.
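The rule above is small enough to write down directly. This is a hedged sketch with a hypothetical function name and action labels; treating "both probes fail" as a page is an assumption consistent with this guide, not a universal policy.

```python
def alert_action(icmp_up: bool, snmp_up: bool,
                 corroborated: bool = False) -> str:
    """Decide how to route an SNMP failure.

    corroborated means another signal (syslog silence, coldStart trap,
    BGP session drop) supports a real outage.
    """
    if snmp_up:
        return "none"    # polling works; nothing to route
    if not icmp_up:
        return "page"    # both probes fail: treat as a real outage
    # SNMP down, ICMP up: agent or collector problem unless corroborated
    return "page" if corroborated else "ticket"


if __name__ == "__main__":
    print(alert_action(icmp_up=True, snmp_up=False))
    print(alert_action(icmp_up=True, snmp_up=False, corroborated=True))
    print(alert_action(icmp_up=False, snmp_up=False))
```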
How Netdata helps
- Collector CPU and per-core utilization surface the collector-side bottleneck. Netdata’s per-core CPU breakdown catches RSS misconfiguration (one core pinned at 100%, others idle) that aggregate averages hide.
- SNMP latency and timeout metrics provide the per-device histogram needed to identify the slow device triggering the cascade.
- ICMP reachability correlates alongside SNMP state. When Netdata shows ICMP succeeding while SNMP times out, the diagnosis shifts from “device down” to “collector or agent problem.”
- Control-plane CPU metrics (via SNMP polls of `cpmCPUTotal5sec` on Cisco or `jnxOperatingCPU` on Juniper) reveal when the monitoring system itself is causing device-side CPU saturation.
- Data freshness tracking shows the gap between the configured poll interval and the actual time since last successful poll, making schedule drift visible before it triggers false alerts.
Related guides
- Flow export-to-ingest latency: why your NetFlow data is minutes behind
- Network monitoring checklist: the signals every production network needs
- NetFlow v9/IPFIX template desync: flows decoded wrong or dropped after a reboot
- Silent UDP flow data loss: why your NetFlow collector is dropping records
- sFlow sampling rate: why your traffic totals are off by 1000x







