SNMP poll response latency: diagnosing a slow poller
SNMP poll response latency is the round-trip time from your collector’s GET or GETBULK request to the device’s response. When it climbs, rate calculations lose accuracy, worker threads hold their slots longer than expected, and the poller falls behind schedule. Healthy devices start appearing stale or unreachable.
The most common misdiagnosis is “the network is slow.” On a LAN, an SNMP GET to sysUpTime should return in single-digit milliseconds. When the same device takes 2 to 5 seconds to respond, ICMP to the same target will usually confirm the path is fine. The bottleneck is almost always the device’s SNMP agent, the collector’s scheduler design, or a specific OID family that triggers expensive computation on the device CPU.
The second misdiagnosis is treating each slow poll in isolation. SNMP latency is contagious in a poller architecture. A slow device holds a worker thread, reducing capacity for other devices. Those devices miss their poll windows, trigger retries, and consume more workers. The schedule drifts, data goes stale, and false “device down” alerts cascade.
What this means
SNMP poll response latency includes three components: network RTT to the device, agent processing time on the device, and any local queueing on either side. A single GET to sysUpTime is the cheapest possible request. A GETBULK walk of a large table like ifTable on a device with hundreds of interfaces, or the FDB on a switch with 50,000 MAC entries, is orders of magnitude more expensive.
Suggested thresholds for investigation:
- Latency greater than 1 second sustained for more than 5 minutes on a critical device.
- A p99 exceeding 5 times the rolling 1-hour baseline indicates degradation.
- A latency variance ratio (p99 divided by p50) above 5 indicates an unstable path or unstable agent.
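The percentile thresholds above can be checked with a few lines of code. The following is a minimal sketch, assuming latency samples in seconds and a nearest-rank percentile; the function names and sample values are illustrative, and the "sustained for 5 minutes" condition is not modeled.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_flags(samples, baseline_p99):
    """Apply the suggested p99 and variance-ratio thresholds to one device."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return {
        "p50": p50,
        "p99": p99,
        "degraded": p99 > 5 * baseline_p99,     # p99 over 5x rolling baseline
        "unstable": p50 > 0 and p99 / p50 > 5,  # variance ratio above 5
    }

# Hypothetical samples: nine fast polls and one 450 ms outlier
samples = [0.004, 0.005, 0.005, 0.006, 0.006, 0.007, 0.007, 0.008, 0.009, 0.450]
flags = latency_flags(samples, baseline_p99=0.010)
print(flags)
```

With only ten samples the p99 is simply the slowest poll, so a real baseline needs a longer window than this example uses.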
SNMPv3 adds per-packet CPU cost on both sides for authentication and privacy (AES/DES) processing, and the initial USM discovery adds at least one extra round trip. Baseline SNMPv3 targets separately from SNMPv2c targets on the same device. Default timeouts of 1 second that work fine for SNMPv2c commonly produce false failures on SNMPv3.
The first poll after a device reboot is typically slow because the agent’s MIB cache is cold. Do not alert on first-poll slowness without corroborating signals like coldStart traps or sysUpTime resets.
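One way to corroborate a reboot is to compare consecutive sysUpTime readings. The sketch below assumes the collector keeps the previous value per device; sysUpTime is a 32-bit TimeTicks value in hundredths of a second, so it also wraps after roughly 497 days. The function names and the one-hour wrap margin are hypothetical.

```python
WRAP_TICKS = 2**32  # sysUpTime is a 32-bit TimeTicks counter (1/100 s units)

def agent_restarted(prev_ticks, curr_ticks, wrap_margin_ticks=360_000):
    """Return True when sysUpTime went backwards for a reason other than wrap.

    wrap_margin_ticks (one hour here) is an assumed tolerance: a previous
    value this close to 2**32 is treated as a counter wrap, not a restart.
    """
    if curr_ticks >= prev_ticks:
        return False
    near_wrap = prev_ticks >= WRAP_TICKS - wrap_margin_ticks
    return not near_wrap

def should_alert_on_latency(slow, prev_ticks, curr_ticks):
    """Skip the alert for a slow poll that coincides with an agent restart."""
    return slow and not agent_restarted(prev_ticks, curr_ticks)

# A device up for a day (8,640,000 ticks) now reports 12 seconds of uptime
print(agent_restarted(8_640_000, 1_200))
print(should_alert_on_latency(True, 8_640_000, 1_200))
```

A coldStart trap, where the collector receives traps, is a second signal that can be combined with this check.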
The cascade is where latency becomes an operational incident:
flowchart TD
A["Slow device or expensive OID walk"] --> B["Worker thread held 10-30s"]
B --> C["Retries consume more workers"]
C --> D["Fewer workers for other devices"]
D --> E["Other devices miss poll windows"]
E --> F["Schedule drifts past interval"]
F --> G["Data goes stale across many devices"]
G --> H["False device-down alerts"]
C --> I["Device control-plane CPU spikes"]
I --> A

A single slow SNMP bulk walk can stall a worker thread for 30 seconds or more. On a 60-second poll cycle, one slow device can drift the entire schedule. Some collectors parallelize across workers, but the cycle duration applies to the slowest worker. Poll cycle duration versus configured interval is the most under-monitored meta-signal in NPM.
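The effect of one slow device on cycle duration can be illustrated with a small simulation. This sketch assumes a simple model in which each free worker takes the next device in order; real schedulers differ, and the device counts and durations are made up.

```python
import heapq

def cycle_duration(poll_seconds, workers):
    """Simulate one cycle: each free worker takes the next device in order.

    Returns the time at which the last worker finishes (the cycle duration).
    """
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for seconds in poll_seconds:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + seconds)
    return max(finish_times)

healthy = [0.5] * 200            # 200 devices answering in 0.5 s each
with_slow = [90.0] + healthy     # one device held for 30 s across 3 attempts

print(cycle_duration(healthy, workers=4))
print(cycle_duration(with_slow, workers=4))
```

In this model the healthy estate finishes in 25 seconds, while the single slow device pushes the cycle to 90 seconds, past a 60-second interval, even though the other three workers finish early.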
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Scheduler over-subscription | Poll cycle duration exceeds configured interval; many devices go stale simultaneously | Compare poll cycle time to configured interval |
| Device control-plane CPU saturation | Latency spikes during polling windows on one device; ICMP unaffected | Check device CPU during the poll window |
| Expensive OID walks | Latency spikes on specific MIB tables, not on sysUpTime GETs | Walk the suspect OID in isolation and time it |
| SNMPv3 auth/priv overhead | v3 targets consistently slower than v2c on same device | Compare timed GETs for both versions |
| Management network congestion | ICMP RTT to device also elevated; multiple devices affected | ping -c 10 -i 0.2 <host> during poll window |
| Collector CPU or RSS (receive-side scaling) bottleneck | One core pinned at 100%; aggregate CPU looks fine | mpstat -P ALL 1 during poll cycle |
| Device firmware vulnerability | Unexpected device reloads during SNMP polling on Cisco IOS/IOS XE | Check firmware version against CVE-2025-20352 |
| Hardware bus saturation | Delayed responses without visible CPU spike on control plane | Check vendor documentation for your hardware model |
Quick checks
All commands are read-only. The table walk is itself an expensive request, so run it on its own rather than alongside other walks.
# Measure baseline SNMP GET latency (sysUpTime is the cheapest OID)
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0
# Measure latency variance across 10 sequential polls
for i in {1..10}; do
/usr/bin/time -f "%e" snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0 2>&1 | tail -1
done
# Compare SNMPv2c vs SNMPv3 latency to isolate auth/priv overhead
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0
time snmpget -v3 -u <user> -A <auth> -X <priv> -l authPriv -t 5 <device> .1.3.6.1.2.1.1.3.0
# Check management network path latency independently
# Note: with iputils ping on Linux, intervals shorter than 0.2 s have historically required root; -i 0.2 itself does not
ping -c 10 -i 0.2 <host>
# Time a bulk walk of a suspect expensive MIB table (dot1dTpFdbAddress column of dot1dTpFdbTable)
time snmpbulkwalk -v2c -c <community> -t 30 <device> .1.3.6.1.2.1.17.4.3.1.1
# Check device control-plane CPU (Cisco cpmCPUTotal5secRev, column .6 of cpmCPUTotalTable; .7 is cpmCPUTotal1minRev)
# Walk the column: each CPU entity has its own instance index
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.6
# Check device control-plane CPU (Juniper jnxOperatingCPU)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.8
# Check collector per-core CPU for RSS funneling
mpstat -P ALL 1 5
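If you prefer to collect the timings programmatically, the shell loop can be mirrored in a few lines. This is a minimal sketch, assuming it is acceptable to include process start-up time in the measurement, as `time snmpget` also does; `time_command` is a hypothetical helper and the stand-in command exists only so the example runs without a device.

```python
import subprocess
import sys
import time

def time_command(argv, runs=10, timeout=30):
    """Run argv repeatedly and return the wall-clock seconds for each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout, check=False)
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in command; against a real device the argv would be something like
# ["snmpget", "-v2c", "-c", community, "-t", "5", device, ".1.3.6.1.2.1.1.3.0"]
timings = time_command([sys.executable, "-c", "pass"], runs=3)
print([round(t, 3) for t in timings])
```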
How to diagnose it
1. Isolate the scope. Determine whether latency is rising on one device, a group of devices, or the entire estate. One device means a device-side problem. Many devices simultaneously means a collector-side problem or management-network congestion.
2. Rule out the network path. Run `ping -c 10 -i 0.2 <host>` during the poll window. If ICMP RTT is at baseline, the network path is not the bottleneck. Some Cisco devices rate-limit ICMP via CoPP, so slightly elevated ICMP RTT may be intentional de-prioritization rather than congestion.
3. Time the cheapest OID. Run `time snmpget` against `sysUpTime` (`.1.3.6.1.2.1.1.3.0`). If this returns quickly but other OIDs are slow, the problem is specific to expensive MIB tables, not the agent in general.
4. Identify expensive OIDs. Walk suspect tables individually with explicit timing. Known expensive OIDs on Cisco Catalyst switches include `cefcFRUPowerStatusEntry`, `ciscoFlashFileEntry`, `cefcFanTrayStatusEntry`, and large `ifHCInOctets` walks. On Cisco IOS/IOS XE, the device logs `%SNMP-3-RESPONSE_DELAYED` when a response exceeds the configurable threshold. The log entry includes the OID and the millisecond cost, which tells you exactly which OID to investigate.
5. Check device CPU during the poll window. On Cisco, use `show process cpu sorted` or `show proc cpu | i SNMP Engine` to see whether the SNMP process is consuming disproportionate CPU. Brief spikes during a bulk walk are normal. Sustained elevation over minutes indicates the poller is overwhelming the device. Via SNMP, poll `cpmCPUTotal5secRev` at `.1.3.6.1.4.1.9.9.109.1.1.1.1.6` (Cisco) or `jnxOperatingCPU` at `.1.3.6.1.4.1.2636.3.1.13.1.8` (Juniper).
6. Check collector capacity. Run `mpstat -P ALL 1` during a poll cycle. If one core is at 100% while others are idle, the problem is RSS misconfiguration funneling all packet processing to a single CPU. See NIC RSS misconfiguration for diagnosis and fix.
7. Check poll cycle duration. If your collector exposes poll cycle metrics, compare actual cycle time to the configured interval. A cycle at 95% of the interval with no headroom is a capacity problem that will become a schedule-drift problem under any additional load. For deeper analysis of collector resource saturation, see Collector CPU and TSDB write-queue saturation.
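The decision order in these steps can be written down as a small triage function. This is a rough sketch rather than a complete diagnostic: the thresholds (ICMP RTT above twice its baseline, a sysUpTime GET at or under 100 ms counting as fast) and the returned labels are assumptions for illustration.

```python
def triage(devices_affected, icmp_rtt_ms, icmp_baseline_ms,
           sysuptime_get_ms, one_core_pinned, fast_get_ms=100):
    """Map the observations from the diagnostic steps to a likely next check."""
    path_congested = icmp_rtt_ms > 2 * icmp_baseline_ms
    if devices_affected > 1:
        # Many devices at once points at the collector or the shared path
        if one_core_pinned:
            return "collector: single-core packet processing (check RSS)"
        if path_congested:
            return "management network congestion"
        return "collector: compare poll cycle duration to interval"
    if path_congested:
        return "network path to this device"
    if sysuptime_get_ms <= fast_get_ms:
        # Cheap GET is fast, so the agent is alive; suspect specific tables
        return "expensive OID: time suspect tables individually"
    return "device agent: check control-plane CPU during the poll window"

print(triage(devices_affected=1, icmp_rtt_ms=2.0, icmp_baseline_ms=2.0,
             sysuptime_get_ms=4, one_core_pinned=False))
```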
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| SNMP poll response latency per device | Identifies which device is slow before the cascade starts | p99 > 1s or > 5x rolling 1-hour baseline |
| SNMP timeout and retry rate | Counts polls that exceeded the timeout window | > 5% sustained across multiple devices |
| Poll cycle duration vs configured interval | Detects schedule drift before false-down alerts | Cycle duration exceeds configured interval |
| Device control-plane CPU | SNMP agent competes with routing and management processes | > 70% sustained during poll windows |
| ICMP RTT to device | Isolates network path from agent processing | p99 > 2x rolling baseline without routing change |
| Collector per-core CPU | Identifies RSS funneling or collector-side bottleneck | One core at 100% with others idle |
| Worker queue depth | Detects soft saturation before the cliff | Queue depth > 25% of max or growing |
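The warning signs in the table translate directly into rules. The sketch below assumes the readings are already collected into a dictionary; the key names are hypothetical, and the "sustained" qualifiers and the per-core CPU check are left out for brevity.

```python
def warning_signs(m):
    """Return the names of table rows whose warning sign currently fires."""
    rules = {
        "poll latency": m["p99_s"] > 1 or m["p99_s"] > 5 * m["baseline_p99_s"],
        "timeout rate": m["timeout_rate"] > 0.05,
        "cycle overrun": m["cycle_s"] > m["interval_s"],
        "device CPU": m["device_cpu_pct"] > 70,
        "ICMP RTT": m["icmp_p99_ms"] > 2 * m["icmp_baseline_ms"],
        "queue depth": m["queue_depth"] > 0.25 * m["queue_max"],
    }
    return [name for name, fired in rules.items() if fired]

# Hypothetical readings for one device and its collector
reading = {
    "p99_s": 0.4, "baseline_p99_s": 0.05,
    "timeout_rate": 0.02,
    "cycle_s": 48, "interval_s": 60,
    "device_cpu_pct": 82,
    "icmp_p99_ms": 3.0, "icmp_baseline_ms": 2.0,
    "queue_depth": 10, "queue_max": 100,
}
print(warning_signs(reading))
```

In this example the latency and device CPU rules fire while ICMP RTT stays near baseline, which points at the agent rather than the path.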
Fixes
Reduce scheduler concurrency
Lower the number of parallel workers per poll cycle. This reduces simultaneous load on both the collector and the devices being polled. The trade-off is that total cycle time may increase, but individual device responses become more reliable. Target worker utilization below 50% of capacity with queue depth under 25% of maximum.
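Before lowering concurrency, it helps to estimate how far it can drop before utilization crosses the 50% target. The following is a back-of-the-envelope sketch, assuming utilization is total poll time per cycle divided by available worker time; the estate size and mean poll time are invented.

```python
import math

def worker_utilization(devices, mean_poll_s, workers, interval_s):
    """Fraction of available worker time consumed by one poll cycle."""
    return devices * mean_poll_s / (workers * interval_s)

def min_workers(devices, mean_poll_s, interval_s, target=0.5):
    """Smallest worker count that keeps utilization at or below the target."""
    return math.ceil(devices * mean_poll_s / (target * interval_s))

# Hypothetical estate: 640 devices, 0.75 s mean poll time, 60 s interval
print(worker_utilization(640, 0.75, workers=20, interval_s=60))
print(min_workers(640, 0.75, interval_s=60))
```

Here 20 workers run at 40% utilization, and 16 is the floor: going lower would push utilization above the target.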
Increase per-poll timeout
Set the per-poll timeout higher than the slowest expected OID walk on your estate. Default timeouts of 1 to 2 seconds are adequate for simple GETs but routinely exceeded by bulk walks of large tables. Increasing timeout prevents premature retries that waste worker threads on devices that are simply processing a large request. The trade-off is longer detection time for genuinely dead devices. Use the -t flag to set timeout and -r to set retries:
# Example: 5-second timeout, 2 retries
snmpget -v2c -c <community> -t 5 -r 2 <device> .1.3.6.1.2.1.1.3.0
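The detection-time trade-off is easy to quantify. This sketch assumes each attempt waits the full timeout and that the request is sent once plus `retries` more times; collectors that back off between attempts will behave differently.

```python
def worst_case_hold_s(timeout_s, retries):
    """Seconds a worker can wait on one request that is never answered."""
    return timeout_s * (retries + 1)

def share_of_interval(timeout_s, retries, interval_s):
    """Worst-case hold time as a fraction of the poll interval."""
    return worst_case_hold_s(timeout_s, retries) / interval_s

# The example above: 5-second timeout, 2 retries, on a 60-second interval
print(worst_case_hold_s(5, 2))
print(share_of_interval(5, 2, 60))
```

With these settings a dead device holds a worker for up to 15 seconds, a quarter of a 60-second interval, for every unanswered request.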
Exclude or schedule expensive OIDs separately
Walk large MIB tables (dot1dTpFdbTable on switches with large MAC tables, cdpCacheTable, ifTable on high-interface-count devices) on a separate, less frequent schedule. For Cisco-specific expensive OIDs identified via %SNMP-3-RESPONSE_DELAYED logs, consider blocking them entirely via SNMP views on the device if the data is not operationally needed.
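A separate schedule can be as simple as two polling tiers with their own interval and timeout. This is an illustrative sketch only: the tier names, intervals, and timeouts are assumptions, and most collectors express the same idea in their own configuration format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PollTier:
    name: str
    interval_s: int
    timeout_s: int
    oids: tuple

# Hypothetical tiers: cheap availability OIDs fast, large tables slow
TIERS = (
    PollTier("fast", 60, 2, ("sysUpTime", "ifOperStatus")),
    PollTier("slow", 900, 30, ("dot1dTpFdbTable", "cdpCacheTable")),
)

def due_tiers(now_s, tiers=TIERS):
    """Return the names of tiers whose interval boundary falls on this tick."""
    return [t.name for t in tiers if now_s % t.interval_s == 0]

print(due_tiers(60))
print(due_tiers(900))
```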
Use GETBULK instead of GETNEXT
SNMPv2c and later support GETBULK, which retrieves multiple variable bindings in a single round-trip. SNMPv1 requires iterative GETNEXT, with one round-trip per OID, inflating latency linearly with table size. Migrate devices to SNMPv2c or v3 where hardware supports it. If using SNMPv3, baseline latency separately from v2c due to auth/priv overhead.
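The difference in round trips can be estimated with simple arithmetic. The sketch below assumes the agent fills every GETBULK response with the full max-repetitions count (large responses may be truncated by message size) and counts the final request that detects the end of the subtree.

```python
def getnext_requests(objects):
    """One request per object, plus one that steps past the end of the subtree."""
    return objects + 1

def getbulk_requests(objects, max_repetitions=10):
    """Requests needed when each response carries max_repetitions objects."""
    return objects // max_repetitions + 1

# A 50,000-entry FDB, walking a single column
print(getnext_requests(50_000))
print(getbulk_requests(50_000))
print(getbulk_requests(50_000, max_repetitions=50))
```

Multiplying each count by the per-request latency gives a rough walk time, which shows why table size dominates once GETNEXT is involved.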
Fix RSS distribution on the collector
If one CPU core is saturated while others are idle, RSS is funneling all packet processing to a single core. Check IRQ distribution with cat /proc/interrupts | grep -i eth (substitute your interface name) and reconfigure RSS to spread receive interrupts across available cores.
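The IRQ distribution check can be automated by summing the per-CPU counters for the interface. This is a minimal sketch that assumes the usual `/proc/interrupts` layout (a header row of CPU names, then one row per IRQ); the sample text and interface name are made up, and because the counters are cumulative since boot, comparing two snapshots gives a more current picture.

```python
def irq_counts_per_cpu(interrupts_text, interface):
    """Sum interrupt counts per CPU for rows that mention the interface."""
    lines = interrupts_text.strip().splitlines()
    n_cpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * n_cpus
    for line in lines[1:]:
        if interface not in line:
            continue
        fields = line.split()
        for cpu, count in enumerate(fields[1:1 + n_cpus]):
            totals[cpu] += int(count)
    return totals

def busiest_cpu_share(totals):
    """Fraction of the interface's interrupts handled by the busiest CPU."""
    total = sum(totals)
    return max(totals) / total if total else 0.0

SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 24:     900000          0          0          0   PCI-MSI 524288-edge      eth0-rx-0
 25:      60000      20000      10000      10000   PCI-MSI 524289-edge      eth0-rx-1
 26:        500        500        500        500   PCI-MSI 524290-edge      nvme0q1
"""

totals = irq_counts_per_cpu(SAMPLE, "eth0")
print(totals, busiest_cpu_share(totals))
```

On a live collector the input would come from reading `/proc/interrupts`; a share close to 1.0 on a multi-core host matches the funneling pattern described above.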
Patch vulnerable devices
If Cisco IOS/IOS XE devices reload unexpectedly during SNMP polling, check firmware against CVE-2025-20352. An authenticated remote attacker with low-privilege SNMP credentials can trigger a device reload. Restrict SNMP access via ACLs as a partial mitigation until patching is complete.
Prevention
- Monitor poll cycle duration as a first-class metric. If your collector does not expose it, instrument it externally. The cycle duration versus the configured interval is the single best leading indicator of schedule drift. Target cycle duration below 70% of the configured interval.
- Baseline per-device SNMP latency. Without a per-device baseline, you cannot distinguish a device that was always slow from one that degraded. Track p50 and p99 per device and alert on p99 exceeding 5 times the rolling baseline.
- Audit OID coverage. Walk your standard polling profile against representative devices and time each OID family. Remove OIDs that are not operationally needed, especially expensive vendor-specific tables.
- Separate fast and slow polls. Poll `sysUpTime` and `ifOperStatus` on the default fast cycle. Move bulk table walks to a slower cycle with a longer timeout. This prevents one slow walk from blocking availability checks.
- Baseline device control-plane CPU during poll windows. If CPU routinely exceeds 50% during polling, the poller is too aggressive for that device class. Reduce concurrency or OID scope for affected devices.
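Where the collector does not expose poll cycle duration, it can be instrumented externally by timing the cycle from the outside. The following is one possible sketch using a context manager; `timed_cycle` and the `report` callback are hypothetical names, and the 70% target comes from the first bullet above.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_cycle(interval_s, report, clock=time.monotonic):
    """Time one poll cycle and report its share of the configured interval."""
    start = clock()
    try:
        yield
    finally:
        duration = clock() - start
        report({
            "duration_s": duration,
            "interval_used": duration / interval_s,
            "over_target": duration / interval_s > 0.70,
        })

results = []
with timed_cycle(interval_s=60, report=results.append):
    pass  # poll every device in the cycle here

print(results[0])
```

The `report` callback can forward the dictionary to whatever metrics pipeline is already in place, so the cycle duration is graphed and alerted on like any other signal.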
How Netdata helps
Netdata correlates SNMP poll latency with the signals that explain it:
- Per-device SNMP latency tracking with p50 and p99 baselines identifies the long-tail slow device before schedule drift.
- Device control-plane CPU monitoring via SNMP correlates CPU saturation with latency spikes during poll windows.
- Collector per-core CPU metrics expose RSS funneling where aggregate CPU looks fine but one core is saturated.
- ICMP RTT monitoring alongside SNMP latency isolates network path issues from agent processing issues.
- Configurable alerting on latency p99 versus rolling baseline catches degradation early without paging on the first-poll-after-reboot cold cache miss.
Related guides
- Collector CPU and TSDB write-queue saturation: the capacity signals
- NIC RSS misconfiguration: one CPU core silently dropping your telemetry
- Asymmetric routing: why your path and latency measurements lie
- BGP session Established but stale: detecting silent route loss
- Cold-start topology: why your map is incomplete after a collector restart
- ARP cache staleness: when IP-to-MAC mapping goes bad







