SNMP poll response latency: diagnosing a slow poller

SNMP poll response latency is the round-trip time from your collector’s GET or GETBULK request to the device’s response. When it climbs, rate calculations lose accuracy, worker threads hold their slots longer than expected, and the poller falls behind schedule. Healthy devices start appearing stale or unreachable.

The most common misdiagnosis is “the network is slow.” On a LAN, an SNMP GET to sysUpTime should return in single-digit milliseconds. When the same device takes 2 to 5 seconds to respond, ICMP to the same target will usually confirm the path is fine. The bottleneck is almost always the device’s SNMP agent, the collector’s scheduler design, or a specific OID family that triggers expensive computation on the device CPU.

The second misdiagnosis is treating each slow poll in isolation. SNMP latency is contagious in a poller architecture. A slow device holds a worker thread, reducing capacity for other devices. Those devices miss their poll windows, trigger retries, and consume more workers. The schedule drifts, data goes stale, and false “device down” alerts cascade.

What this means

SNMP poll response latency includes three components: network RTT to the device, agent processing time on the device, and any local queueing on either side. A single GET to sysUpTime is the cheapest possible request. A GETBULK walk of a large table like ifTable on a device with hundreds of interfaces, or the FDB on a switch with 50,000 MAC entries, is orders of magnitude more expensive.

Suggested thresholds for investigation:

  • Latency greater than 1 second sustained for more than 5 minutes on a critical device.
  • A p99 exceeding 5 times the rolling 1-hour baseline indicates degradation.
  • A latency variance ratio (p99 divided by p50) above 5 indicates an unstable path or unstable agent.
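
As a rough illustration of the p99/p50 ratio, the sketch below computes nearest-rank percentiles from latency samples supplied one per line, in seconds. The function name `latency_stats` is hypothetical, and the samples are assumed to come from something like repeated timed snmpget runs. With only a handful of samples, p99 is simply the maximum.

```shell
# latency_stats: read one latency sample (seconds) per line on stdin and
# print p50, p99, and the p99/p50 ratio using the nearest-rank method.
latency_stats() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      i50 = int((NR * 50 + 99) / 100)   # ceil(NR * 0.50)
      i99 = int((NR * 99 + 99) / 100)   # ceil(NR * 0.99)
      ratio = (v[i50] > 0) ? v[i99] / v[i50] : 0
      printf "p50=%.3f p99=%.3f ratio=%.1f\n", v[i50], v[i99], ratio
    }'
}

# Example: nine fast polls and one slow one
# printf "%s\n" 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.08 | latency_stats
# prints: p50=0.010 p99=0.080 ratio=8.0
```

A ratio of 8.0 in this example is above the suggested threshold of 5, so the device would qualify for investigation even though its median latency looks healthy.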

SNMPv3 adds per-packet CPU cost on both sides for authentication and privacy (AES/DES) processing, and the initial USM discovery adds at least one extra round trip. Baseline SNMPv3 targets separately from SNMPv2c targets on the same device. Default timeouts of 1 second that work fine for SNMPv2c commonly produce false failures on SNMPv3.

The first poll after a device reboot is typically slow because the agent’s MIB cache is cold. Do not alert on first-poll slowness without corroborating signals like coldStart traps or sysUpTime resets.
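
One way to corroborate a reboot is to compare successive sysUpTime readings. The sketch below is a heuristic under stated assumptions: sysUpTime is reported in hundredths of a second and wraps at 2^32, and the function name `uptime_reset` and its output labels are illustrative rather than part of any collector.

```shell
# uptime_reset PREV_TICKS CURR_TICKS ELAPSED_SECONDS
# Classify the change between two sysUpTime readings (TimeTicks, 1/100 s).
uptime_reset() {
  awk -v prev="$1" -v curr="$2" -v elapsed="$3" 'BEGIN {
    wrap = 4294967296                      # 2^32: TimeTicks is a 32-bit counter
    expected = prev + elapsed * 100        # where uptime should be with no reset
    if (curr + 0 >= prev + 0)  print "no-reset"
    else if (expected >= wrap) print "counter-wrap"
    else                       print "reboot"
  }'
}

# Example: uptime dropped from about one day to 30 seconds between polls
# uptime_reset 8640000 3000 60
# prints: reboot
```

A "reboot" result alongside slow first-poll latency suggests a cold cache rather than a degraded agent. A "counter-wrap" result means the drop is expected after roughly 497 days of uptime.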

The cascade is where latency becomes an operational incident:

flowchart TD
    A["Slow device or expensive OID walk"] --> B["Worker thread held 10-30s"]
    B --> C["Retries consume more workers"]
    C --> D["Fewer workers for other devices"]
    D --> E["Other devices miss poll windows"]
    E --> F["Schedule drifts past interval"]
    F --> G["Data goes stale across many devices"]
    G --> H["False device-down alerts"]
    C --> I["Device control-plane CPU spikes"]
    I --> A

A single slow SNMP bulk walk can stall a worker thread for 30 seconds or more. On a 60-second poll cycle, one slow device can drift the entire schedule. Some collectors parallelize across workers, but the cycle still takes as long as its slowest worker. Poll cycle duration versus configured interval is the most under-monitored meta-signal in network performance monitoring (NPM).
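
The link between per-device latency and cycle duration can be sketched with back-of-the-envelope arithmetic. This assumes devices are spread evenly across workers and that each takes the same average time, which real pollers only approximate. The function name `cycle_headroom` and the figures are illustrative.

```shell
# cycle_headroom DEVICES AVG_POLL_SECONDS WORKERS INTERVAL_SECONDS
# Estimate poll cycle duration and how much of the interval it consumes.
cycle_headroom() {
  awk -v d="$1" -v t="$2" -v w="$3" -v i="$4" 'BEGIN {
    cycle = d * t / w                      # seconds of work per worker
    printf "cycle=%.1fs interval=%ds utilization=%.0f%%\n", cycle, i, 100 * cycle / i
  }'
}

# cycle_headroom 2000 0.5 20 60   prints: cycle=50.0s interval=60s utilization=83%
# cycle_headroom 2000 0.8 20 60   prints: cycle=80.0s interval=60s utilization=133%
```

In this example, a rise in average poll time from 0.5 to 0.8 seconds is enough to push the cycle past its 60-second interval, at which point the schedule starts to drift.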

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Scheduler over-subscription | Poll cycle duration exceeds configured interval; many devices go stale simultaneously | Compare poll cycle time to configured interval |
| Device control-plane CPU saturation | Latency spikes during polling windows on one device; ICMP unaffected | Check device CPU during the poll window |
| Expensive OID walks | Latency spikes on specific MIB tables, not on sysUpTime GETs | Walk the suspect OID in isolation and time it |
| SNMPv3 auth/priv overhead | v3 targets consistently slower than v2c on same device | Compare timed GETs for both versions |
| Management network congestion | ICMP RTT to device also elevated; multiple devices affected | ping -c 10 -i 0.2 <host> during poll window |
| Collector CPU or RSS bottleneck | One core pinned at 100%; aggregate CPU looks fine | mpstat -P ALL 1 during poll cycle |
| Device firmware vulnerability | Unexpected device reloads during SNMP polling on Cisco IOS/IOS XE | Check firmware version against CVE-2025-20352 |
| Hardware bus saturation | Delayed responses without visible CPU spike on control plane | Check vendor documentation for your hardware model |

Quick checks

All commands are read-only and safe to run in production, though the timed table walk adds the same agent load it is measuring, so run it sparingly.

# Measure baseline SNMP GET latency (sysUpTime is the cheapest OID)
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0

# Measure latency variance across 10 sequential polls
for i in {1..10}; do
  /usr/bin/time -f "%e" snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0 2>&1 | tail -1
done

# Compare SNMPv2c vs SNMPv3 latency to isolate auth/priv overhead
time snmpget -v2c -c <community> -t 5 <device> .1.3.6.1.2.1.1.3.0
time snmpget -v3 -u <user> -A <auth> -X <priv> -l authPriv -t 5 <device> .1.3.6.1.2.1.1.3.0

# Check management network path latency independently
# Note: iputils ping on Linux restricts short intervals for non-root users; if -i 0.2 is rejected, use -i 1
ping -c 10 -i 0.2 <host>

# Time a bulk walk of a suspect expensive MIB table (dot1dTpFdbTable)
# snmpbulkwalk issues GETBULK requests; plain snmpwalk would use GETNEXT
time snmpbulkwalk -v2c -c <community> -t 30 <device> .1.3.6.1.2.1.17.4.3.1.1

# Check device control-plane CPU (Cisco cpmCPUTotal5secRev, column .6 of cpmCPUTotalTable)
# Walk the column rather than GET it: each CPU entity has its own instance index
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.9.9.109.1.1.1.1.6

# Check device control-plane CPU (Juniper jnxOperatingCPU)
snmpwalk -v2c -c <community> <device> .1.3.6.1.4.1.2636.3.1.13.1.8

# Check collector per-core CPU for RSS funneling
mpstat -P ALL 1 5

How to diagnose it

  1. Isolate the scope. Determine whether latency is rising on one device, a group of devices, or the entire estate. One device means a device-side problem. Many devices simultaneously means a collector-side problem or management-network congestion.

  2. Rule out the network path. Run ping -c 10 -i 0.2 <host> during the poll window. If ICMP RTT is at baseline, the network path is not the bottleneck. Some Cisco devices rate-limit ICMP via CoPP, so slightly elevated ICMP RTT may be intentional de-prioritization rather than congestion.

  3. Time the cheapest OID. Run time snmpget against sysUpTime (.1.3.6.1.2.1.1.3.0). If this returns quickly but other OIDs are slow, the problem is specific to expensive MIB tables, not the agent in general.

  4. Identify expensive OIDs. Walk suspect tables individually with explicit timing. Known expensive OIDs on Cisco Catalyst switches include cefcFRUPowerStatusEntry, ciscoFlashFileEntry, cefcFanTrayStatusEntry, and large ifHCInOctets walks. On Cisco IOS/IOS XE, the device logs %SNMP-3-RESPONSE_DELAYED when a response exceeds the configurable threshold. The log entry includes the OID and the millisecond cost, which tells you exactly which OID to investigate.

  5. Check device CPU during the poll window. On Cisco, use show process cpu sorted or show proc cpu | i SNMP ENGINE to see whether the SNMP process is consuming disproportionate CPU. Brief spikes during a bulk walk are normal. Sustained elevation over minutes indicates the poller is overwhelming the device. Via SNMP, walk cpmCPUTotal5secRev at .1.3.6.1.4.1.9.9.109.1.1.1.1.6 (Cisco) or jnxOperatingCPU at .1.3.6.1.4.1.2636.3.1.13.1.8 (Juniper).

  6. Check collector capacity. Run mpstat -P ALL 1 during a poll cycle. If one core is at 100% while others are idle, the likely cause is RSS misconfiguration funneling all packet processing to a single CPU. See NIC RSS misconfiguration for diagnosis and fix.

  7. Check poll cycle duration. If your collector exposes poll cycle metrics, compare actual cycle time to the configured interval. A cycle at 95% of the interval with no headroom is a capacity problem that will become a schedule-drift problem under any additional load. For deeper analysis of collector resource saturation, see Collector CPU and TSDB write-queue saturation.
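
Steps 2 and 3 can be condensed into a first-pass triage, sketched below. It is a heuristic, not a definitive classifier: the 2x ICMP and 5x SNMP multipliers are borrowed from the thresholds suggested in this guide, and the function name `classify_latency` and its output labels are hypothetical.

```shell
# classify_latency ICMP_MS ICMP_BASELINE_MS SNMP_MS SNMP_BASELINE_MS
# First-pass triage from ICMP RTT and sysUpTime GET latency against baselines.
classify_latency() {
  awk -v icmp="$1" -v icmp_base="$2" -v snmp="$3" -v snmp_base="$4" 'BEGIN {
    if (icmp + 0 > 2 * icmp_base)      print "network-path"
    else if (snmp + 0 > 5 * snmp_base) print "agent-or-device"
    else                               print "check-oids-or-collector"
  }'
}

# classify_latency 1.1 1 3000 5   prints: agent-or-device
```

When both measurements sit near baseline, the remaining suspects are specific expensive OIDs (step 4) or collector capacity (steps 6 and 7).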

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| SNMP poll response latency per device | Identifies which device is slow before the cascade starts | p99 > 1s or > 5x rolling 1-hour baseline |
| SNMP timeout and retry rate | Counts polls that exceeded the timeout window | > 5% sustained across multiple devices |
| Poll cycle duration vs configured interval | Detects schedule drift before false-down alerts | Cycle duration exceeds configured interval |
| Device control-plane CPU | SNMP agent competes with routing and management processes | > 70% sustained during poll windows |
| ICMP RTT to device | Isolates network path from agent processing | p99 > 2x rolling baseline without routing change |
| Collector per-core CPU | Identifies RSS funneling or collector-side bottleneck | One core at 100% with others idle |
| Worker queue depth | Detects soft saturation before the cliff | Queue depth > 25% of max or growing |

Fixes

Reduce scheduler concurrency

Lower the number of parallel workers per poll cycle. This reduces simultaneous load on both the collector and the devices being polled. The trade-off is that total cycle time may increase, but individual device responses become more reliable. Target worker utilization below 50% of capacity with queue depth under 25% of maximum.

Increase per-poll timeout

Set the per-poll timeout higher than the slowest expected OID walk on your estate. Default timeouts of 1 to 2 seconds are adequate for simple GETs but routinely exceeded by bulk walks of large tables. Increasing timeout prevents premature retries that waste worker threads on devices that are simply processing a large request. The trade-off is longer detection time for genuinely dead devices. Use the -t flag to set timeout and -r to set retries:

# Example: 5-second timeout, 2 retries
snmpget -v2c -c <community> -t 5 -r 2 <device> .1.3.6.1.2.1.1.3.0
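
The cost of a timeout setting is easiest to see as worst-case worker hold time. The sketch below assumes the client waits the full timeout on the initial attempt and on every retry, with no backoff. Check how your collector's SNMP library actually schedules retries. The function name `worst_case_hold` is illustrative.

```shell
# worst_case_hold TIMEOUT_SECONDS RETRIES
# Seconds a worker stays blocked on a device that never answers.
worst_case_hold() {
  awk -v t="$1" -v r="$2" 'BEGIN { printf "%g\n", t * (r + 1) }'
}

# worst_case_hold 5 2    prints: 15
# worst_case_hold 30 2   prints: 90
```

Under that assumption, the 5-second, 2-retry example above blocks a worker for up to 15 seconds per dead device, and a 30-second walk timeout with the same retry count blocks it for up to 90 seconds. This is why long timeouts are better confined to a separate slow-poll schedule.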

Exclude or schedule expensive OIDs separately

Walk large MIB tables (dot1dTpFdbTable on switches with large MAC tables, cdpCacheTable, ifTable on high-interface-count devices) on a separate, less frequent schedule. For Cisco-specific expensive OIDs identified via %SNMP-3-RESPONSE_DELAYED logs, consider blocking them entirely via SNMP views on the device if the data is not operationally needed.

Use GETBULK instead of GETNEXT

SNMPv2c and later support GETBULK, which retrieves multiple variable bindings in a single round-trip. SNMPv1 requires iterative GETNEXT, with one round-trip per OID, inflating latency linearly with table size. Migrate devices to SNMPv2c or v3 where hardware supports it. If using SNMPv3, baseline latency separately from v2c due to auth/priv overhead.
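
To make the round-trip difference concrete, the sketch below estimates request counts and wire time for walking a single column. It is an approximation: it ignores the final request that detects the end of the table, agent processing time, and agents that return fewer rows than requested to fit the message size. The function name `walk_round_trips` is hypothetical.

```shell
# walk_round_trips ROWS MAX_REPETITIONS RTT_MS
# Compare request counts and pure network time for GETNEXT vs GETBULK.
walk_round_trips() {
  awk -v n="$1" -v m="$2" -v rtt="$3" 'BEGIN {
    bulk = int((n + m - 1) / m)            # ceil(rows / max-repetitions)
    printf "getnext=%d trips (%.1fs) getbulk=%d trips (%.1fs)\n",
           n, n * rtt / 1000, bulk, bulk * rtt / 1000
  }'
}

# walk_round_trips 50000 25 2
# prints: getnext=50000 trips (100.0s) getbulk=2000 trips (4.0s)

# With Net-SNMP, max-repetitions is set with -Cr on snmpbulkwalk, for example:
# snmpbulkwalk -v2c -c <community> -Cr25 -t 30 <device> .1.3.6.1.2.1.17.4.3.1.1
```

Larger max-repetitions values cut round trips further but make each response more expensive for the agent to build, so the value is worth tuning per device class rather than maximizing.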

Fix RSS distribution on the collector

If one CPU core is saturated while others are idle, RSS is funneling all packet processing to a single core. Check IRQ distribution with cat /proc/interrupts | grep -i eth (substitute your interface name) and reconfigure RSS to spread receive interrupts across available cores.
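
Raw /proc/interrupts output is hard to read on many-core hosts, so a small summary helps. The sketch below sums interrupt counts per CPU for every line that mentions the interface name. It assumes the usual layout (a header row of CPU names followed by one row per IRQ), and the function name `irq_per_cpu` is illustrative.

```shell
# irq_per_cpu IFACE < /proc/interrupts
# Print per-CPU interrupt totals and share for IRQ lines mentioning IFACE.
irq_per_cpu() {
  awk -v iface="$1" '
    NR == 1 { ncpu = NF; for (c = 1; c <= ncpu; c++) name[c] = $c; next }
    index($0, iface) {
      # Field 1 is the IRQ number, so CPU column c is field c + 1
      for (c = 1; c <= ncpu; c++) sum[c] += $(c + 1)
    }
    END {
      for (c = 1; c <= ncpu; c++) total += sum[c]
      for (c = 1; c <= ncpu; c++)
        printf "%s %.0f %.0f%%\n", name[c], sum[c], (total > 0) ? 100 * sum[c] / total : 0
    }'
}

# Usage (substitute your interface name):
# irq_per_cpu eth0 < /proc/interrupts
```

A result where one CPU carries nearly all of the interrupts matches the single-saturated-core pattern from mpstat. Counts are cumulative since boot, so compare two runs a few seconds apart to see the current distribution.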

Patch vulnerable devices

If Cisco IOS/IOS XE devices reload unexpectedly during SNMP polling, check firmware against CVE-2025-20352. An authenticated remote attacker with low-privilege SNMP credentials can trigger a device reload. Restrict SNMP access via ACLs as a partial mitigation until patching is complete.

Prevention

  • Monitor poll cycle duration as a first-class metric. If your collector does not expose it, instrument it externally. The cycle duration versus the configured interval is the single best leading indicator of schedule drift. Target cycle duration below 70% of the configured interval.
  • Baseline per-device SNMP latency. Without a per-device baseline, you cannot distinguish a device that was always slow from one that degraded. Track p50 and p99 per device and alert on p99 exceeding 5 times the rolling baseline.
  • Audit OID coverage. Walk your standard polling profile against representative devices and time each OID family. Remove OIDs that are not operationally needed, especially expensive vendor-specific tables.
  • Separate fast and slow polls. Poll sysUpTime and ifOperStatus on the default fast cycle. Move bulk table walks to a slower cycle with a longer timeout. This prevents one slow walk from blocking availability checks.
  • Baseline device control-plane CPU during poll windows. If CPU routinely exceeds 50% during polling, the poller is too aggressive for that device class. Reduce concurrency or OID scope for affected devices.

How Netdata helps

Netdata correlates SNMP poll latency with the signals that explain it:

  • Per-device SNMP latency tracking with p50 and p99 baselines identifies the long-tail slow device before schedule drift.
  • Device control-plane CPU monitoring via SNMP correlates CPU saturation with latency spikes during poll windows.
  • Collector per-core CPU metrics expose RSS funneling where aggregate CPU looks fine but one core is saturated.
  • ICMP RTT monitoring alongside SNMP latency isolates network path issues from agent processing issues.
  • Configurable alerting on latency p99 versus rolling baseline catches degradation early without paging on the first-poll-after-reboot cold cache miss.