Network monitoring checklist: the signals every production network needs
This checklist covers the signals production networks need, organized by detection priority and mapped to maturity levels from survival to expert.
A network performance monitoring (NPM) stack is a federation of collectors, parsers, enrichment services, storage tiers, and an analytics core. Most production incidents are not “the network broke” but “a collector’s UDP buffer dropped packets,” “the NetFlow v9 template cache went stale after a device reboot,” or “the polling worker pool fell behind and now a healthy device looks down.” The checklist is organized to surface those failure modes, not just the top-level symptoms.
The federation at a glance
When a signal is missing, stale, or wrong, the fault is usually one or two subsystems upstream of the dashboard. The typical NPM stack includes:
- Time synchronization substrate (NTP/PTP). Every cross-collector correlation depends on accurate, monotonic time across collectors, polled devices, and API endpoints.
- Polling transport. ICMP, UDP/161 (SNMP), TCP/22 (SSH/CLI scrape), and HTTPS (vendor APIs) reaching each managed endpoint.
- SNMP polling engine. A scheduler fanning OID requests across devices with timeouts, counter tables, and a device state machine (UP / STALE / UNKNOWN / DOWN).
- Flow collection subsystem. NetFlow v5/v9 and IPFIX collectors with template caches, sampling-rate awareness, and flow record storage. sFlow is sample-datagram oriented, not template-flow, and has a different failure profile.
- Topology inference engine. Fuses CDP/LLDP neighbor tables, FDB entries, ARP tables, STP state, and routing tables to derive Layer-2 and Layer-3 topology.
- BGP monitoring subsystem. Active or passive sessions tracking FSM state, prefix announcements, AS-path changes, and RPKI validity.
- Syslog and trap ingestion. UDP/TCP/TLS listeners with parser backpressure, facility/severity handling, and deduplication.
- Vendor API integration layer. Pull-mode clients for SD-WAN controllers, cloud platforms, and modern firewalls, each with their own auth, rate limit, and pagination semantics.
- Storage tiers. Counter TSDB (downsampled for long retention), full-resolution flow store, topology graph DB, raw syslog store, and event/alert log.
Signal domains by detection priority
The domains below are ordered by detection priority: the earliest surfacing of real issues with the best signal-to-noise comes first. Within each domain, the most operationally critical signals are listed first.
Availability
| Signal | Source | Why it matters |
|---|---|---|
| SNMP agent reachability (sysUpTime) | SNMP GET .1.3.6.1.2.1.1.3.0 | No response means agent down, partition, ACL block, or credential issue. Value decrease means reboot. SNMP down with healthy ICMP means agent problem, not device outage. |
| ICMP reachability | ping, fping | Liveness independent of SNMP. ICMP down plus SNMP down equals network problem. ICMP down plus SNMP up equals ICMP rate-limited or blocked (common on firewalls, CoPP). |
| Vendor API reachability and validity | HTTPS to vendor endpoint | For SD-WAN/cloud, the API may be the only telemetry source. HTTP 200 with empty or error payload (PAN-OS <response status="error"> inside HTTP 200) is a silent failure. |
| Flow UDP packet receipt rate | /proc/net/udp, collector stats, nstat | Drop to 0 from one exporter means exporter stopped or partitioned. Drop from all exporters means collector-side failure. |
| Syslog receipt rate and severity | UDP/TCP/TLS port 514 listener | Rate spike with severity escalation means device event. Spike without escalation means noise storm. Silence from a normally-chatty device means isolation or logging failure. |
| SNMP trap rate and type | UDP port 162 listener | linkDown/linkUp pairs mean flap. coldStart means reboot. Silence from a noisy device means trap path broken. |
| BGP session state (FSM) | BGP4-MIB .1.3.6.1.2.1.15.3.1.2, CLI, BMP | Established means exchanging routes. Established with no UPDATE traffic (stale session) is a worse failure than Idle. |
| Interface operational status | IF-MIB .1.3.6.1.2.1.2.2.1.8 (ifOperStatus) | Admin up plus oper down means physical or link-layer failure. Flapping means link instability. |
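The ICMP and SNMP rows above form a small decision matrix. A minimal sketch of it follows; the function name and the returned labels are illustrative assumptions, not part of any particular NPM product.

```python
def classify_reachability(icmp_ok: bool, snmp_ok: bool) -> str:
    """Map one device's ICMP and SNMP probe results to a likely cause."""
    if icmp_ok and snmp_ok:
        return "healthy"
    if icmp_ok:
        # Ping answers, agent does not: agent down, ACL block, or bad credentials.
        return "snmp-agent-problem"
    if snmp_ok:
        # Agent answers, ping does not: ICMP rate-limited or blocked (firewall, CoPP).
        return "icmp-filtered"
    # Neither answers: device outage or network partition.
    return "network-problem"


if __name__ == "__main__":
    for icmp_ok, snmp_ok in [(True, True), (True, False), (False, True), (False, False)]:
        print(icmp_ok, snmp_ok, classify_reachability(icmp_ok, snmp_ok))
```

Keeping the two probes separate in the device state machine is what makes the "SNMP down, ICMP healthy" case distinguishable from a real outage.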
Errors
| Signal | Source | Why it matters |
|---|---|---|
| Interface errors (ifInErrors, ifOutErrors) | IF-MIB .1.3.6.1.2.1.2.2.1.14, .20 | Incrementing counters mean cable/fiber degradation, SFP failure, duplex mismatch, or EMI. Rate of change matters more than absolute value. |
| Interface discards (ifInDiscards, ifOutDiscards) | IF-MIB .1.3.6.1.2.1.2.2.1.13, .19 | Queue or buffer overflow, or ACL drops. Often the leading indicator of congestion before utilization shows 100%. |
| UDP socket buffer drops | /proc/net/snmp (Udp: RcvbufErrors), nstat -az UdpRcvbufErrors | The number one silent killer for flow, trap, and syslog collectors. Datagrams arrive at the kernel but the application is too slow to drain the socket. The counter is cumulative, so any increase between polls means lost telemetry. |
| SNMP timeout and retry rate | Collector stats, time snmpget | Rising across many devices means collector-side issue. Rising on one device means device-side agent or CPU issue. |
| BGP NOTIFICATION and Cease messages | bgpBackwardTransition trap, CLI, syslog | Cease/1 is maximum prefixes reached. Cease/2 is administrative shutdown. Hold Timer Expired (NOTIFICATION code 4) often indicates control-plane CPU saturation or keepalive loss on the path. |
| License and feature validity | Vendor MIBs, PAN-OS API, Meraki API, Cato GraphQL | Feature silently disabled at midnight. Users complain at 09:00. The most common root cause of “the firewall stopped doing what we paid for.” |
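The UDP socket buffer drop counter in the table above lives on the Udp: lines of /proc/net/snmp, as a RcvbufErrors column. A minimal sketch of reading it follows. It assumes the usual two-line layout (a header line, then a value line); the sample text and function names are made up for illustration.

```python
SAMPLE_PROC_NET_SNMP = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 904213 12 37 15502 37 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
UdpLite: 0 0 0 0 0 0 0 0 0
"""


def parse_udp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the 'Udp:' header and value lines of /proc/net/snmp text."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]  # does not match 'UdpLite:'
    if len(udp_lines) != 2:
        raise ValueError("expected one header line and one value line for Udp")
    header, values = udp_lines
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def rcvbuf_drops_between(previous: dict[str, int], current: dict[str, int]) -> int:
    """Datagrams dropped at the socket buffer between two samples."""
    return current["RcvbufErrors"] - previous["RcvbufErrors"]


if __name__ == "__main__":
    counters = parse_udp_counters(SAMPLE_PROC_NET_SNMP)
    print("RcvbufErrors since boot:", counters["RcvbufErrors"])
```

On a live collector the text would come from reading /proc/net/snmp on each poll. The counter is cumulative since boot, so alerting should be on the delta between polls, not the absolute value.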
Saturation
| Signal | Source | Why it matters |
|---|---|---|
| Interface utilization (% of ifHighSpeed) | IF-MIB ifHCInOctets .1.3.6.1.2.1.31.1.1.1.6, ifHCOutOctets .10, ifHighSpeed .15 | 95% sustained for over 5 min on critical interface means congestion with drops and latency. Use 64-bit HC counters. 32-bit ifInOctets wraps in approximately 3.4 seconds at 10G line rate. |
| NIC RX/TX drops on collector | /proc/net/dev, ethtool -S | Ring buffer overflow before packets reach the socket layer. rx_missed_errors is the most actionable counter. |
| Collector CPU (per-core, %soft) | mpstat -P ALL, /proc/softirqs | High %soft on one core means RSS funneling all packet processing to one CPU. Total CPU may look fine while one core is pinned. |
| Collector disk and TSDB write queue | df, iostat, collector metrics | Cardinality inflation (new subnet, NAT pool, scanner traffic) can fill disk in hours. Write queue growing means TSDB cannot keep up with ingestion. |
| Device control-plane CPU | Cisco .1.3.6.1.4.1.9.9.109.1.1.1.1.7, Juniper .1.3.6.1.4.1.2636.3.1.13.1.8 | Sustained over 90% means SNMP starvation, BGP hold-time expiry, and session drops. |
| Device memory utilization | Cisco .1.3.6.1.4.1.9.9.48.1.1.1.5, HOST-RESOURCES-MIB | Free memory approaching 0 means OOM imminent. Rate of increase over 1%/min means memory leak. |
| BGP RIB and FIB size | BGP4-MIB prefix counts, CLI | Sudden change over 20% in 5 min means route leak or mass withdrawal. Full IPv4 DFZ in 2026 is approximately 940k prefixes. |
| NAT and session table utilization | PAN-OS API, vendor CLI | Approaching limit means new connections denied. Sustained growth means traffic outpacing NAT capacity. |
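| API rate-limit remaining | HTTP headers (Retry-After, X-RateLimit-Remaining) | Remaining quota approaching 0 means HTTP 429 responses and polling gaps are imminent. Meraki, for example, allows 10 req/sec/org. |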
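The interface utilization row above is arithmetic on two counter samples. The sketch below shows the calculation and the 32-bit wrap time under assumed inputs: octet counters from two polls, the poll interval in seconds, and ifHighSpeed in Mbit/s. The function names are illustrative.

```python
def counter_delta(previous: int, current: int, bits: int = 64) -> int:
    """Difference between two counter samples, tolerating a single wrap."""
    return (current - previous) % (1 << bits)


def utilization_percent(prev_octets: int, cur_octets: int, interval_s: float,
                        if_high_speed_mbps: int, bits: int = 64) -> float:
    """Percent of link capacity used between two octet-counter samples."""
    octets = counter_delta(prev_octets, cur_octets, bits)
    bits_per_second = octets * 8 / interval_s
    return 100.0 * bits_per_second / (if_high_speed_mbps * 1_000_000)


def wrap_seconds(bits: int, line_rate_bps: float) -> float:
    """Time for an octet counter of the given width to wrap at line rate."""
    return (1 << bits) * 8 / line_rate_bps


if __name__ == "__main__":
    # 10G interface, 5-minute poll, 356.25 GB transferred.
    print(utilization_percent(0, 356_250_000_000, 300, 10_000))
    # 32-bit octet counter at 10 Gbit/s line rate.
    print(round(wrap_seconds(32, 10e9), 2), "seconds")
```

The modular difference only recovers the right value when the counter wrapped at most once between polls and was not reset. At 10G a 32-bit counter cannot meet that condition at any practical polling interval, which is why the HC counters are the ones to poll.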
Internal state, replication, and correctness
| Signal | Source | Why it matters |
|---|---|---|
| Device uptime (sysUpTime) | SNMP .1.3.6.1.2.1.1.3.0 | Decrease means reboot. 32-bit wrap at approximately 497 days looks like reboot; track wraps separately. |
| Temperature, fan, power supply | ENTITY-SENSOR-MIB .1.3.6.1.2.1.99.1.1.1.4 | Thermal failure, cooling failure, or redundancy lost. Use vendor-defined thresholds, not arbitrary absolute numbers. |
| Interface counter discontinuity | ifCounterDiscontinuityTime .1.3.6.1.2.1.31.1.1.1.19 | Counter reset without sysUpTime reset means SNMP agent inconsistency or counter-source bug. |
| Cross-collector time skew | ntpq -p, chronyc tracking | Over 100ms skew degrades cross-site flow correlation. Over 1s breaks it entirely. |
| NTP offset on monitored devices | hrSystemDate .1.3.6.1.2.1.25.1.2.0 | Device clock drift causes postmortem correlation failure. Consistently the most under-monitored NTP signal. |
| Topology view consistency | CDP/LLDP vs FDB vs ARP cross-validation | Inconsistency means stale data, topology change in progress, or device bug. Three sources agreeing is high confidence; one source alone is low. |
| Flow sampling rate consistency | sFlow MIB, NetFlow v9 template fields | Mismatch means analytics wrong by orders of magnitude. Without sampling-rate correction, sFlow at 1:1000 reports 1/1000 of true traffic. |
| STP root bridge and TCN | BRIDGE-MIB .1.3.6.1.2.1.17.2 | Root bridge change means reconvergence. TCN rate over 5/min means instability. |
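The sysUpTime row above asks for wraps to be tracked separately from reboots. One possible heuristic, sketched below, compares the observed value with where the counter would be had it simply kept ticking through the wrap. The function name, the labels, and the 30-second tolerance are assumptions for illustration.

```python
TICKS_MODULUS = 1 << 32  # sysUpTime is a 32-bit TimeTicks value (1/100 s)


def classify_uptime_change(prev_ticks: int, cur_ticks: int, elapsed_s: float,
                           tolerance_s: float = 30.0) -> str:
    """Classify a sysUpTime sample pair as running, wrap, or reboot."""
    if cur_ticks >= prev_ticks:
        return "running"
    # Where the counter should be if the device kept running and wrapped.
    expected = (prev_ticks + round(elapsed_s * 100)) % TICKS_MODULUS
    # Circular distance between the observed and expected values.
    diff = abs(cur_ticks - expected)
    diff = min(diff, TICKS_MODULUS - diff)
    return "wrap" if diff <= tolerance_s * 100 else "reboot"


if __name__ == "__main__":
    print(round(TICKS_MODULUS / 100 / 86400, 1), "days until wrap")
    print(classify_uptime_change(TICKS_MODULUS - 3000, 3000, 60))  # wrap
    print(classify_uptime_change(1_000_000, 4500, 60))             # reboot
```

A reboot that happens within the tolerance window of a wrap is misclassified by this heuristic, so it is worth cross-checking against coldStart traps where they are available.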
Latency, throughput, and security
| Signal | Source | Why it matters |
|---|---|---|
| SNMP poll response latency | time snmpget, collector stats | Over 1s on a normally-fast device means agent or management-network degradation. |
| ICMP round-trip time | ping, fping | p99 over 2x rolling baseline means congestion or path change. High jitter means unstable path. |
| Active path probes | Cisco IPSLA RTTMON MIB, TWAMP, HTTP GET | RTT and loss per path, independent of application. Loss over 1% sustained is degraded. |
| Flow bytes per conversation | NetFlow/sFlow/IPFIX records | Top talkers, DDoS patterns, data exfiltration signals. sFlow requires sampling-rate multiplication for accurate byte counts. |
| Poller poll cycle duration | Collector internal stats | Cycle exceeding configured interval means data is drifting stale. The most under-monitored meta-signal in NPM. |
| Flow exporter drop rate (device-side) | Cisco cnfESPktsDropped .1.3.6.1.4.1.9.9.387.1.4.6 | Device dropped flows that never reached collector. Invisible to collector alone. Compare device-exported rate against collector inbound rate for end-to-end loss detection. |
| Unauthorized SNMP access | snmpInBadCommunityNames .1.3.6.1.2.1.11.4, USM stats | Burst from single source means scanning. Persistent events from many sources means community string “public” still configured. |
| BGP RPKI/ROA invalid acceptance | Vendor CLI show bgp rpki, validators | Any RPKI-invalid route accepted in production is a security event. Verify with public validators before alerting; stale cache produces false invalids. |
| Config changes without ticket | Syslog CONFIG-I, AAA logs, config diff | Change outside maintenance window without change ticket means unauthorized or emergency. Change followed within 30 min by incident is a high-correlation root-cause candidate. |
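The end-to-end flow loss comparison in the table above reduces to a ratio of two deltas taken over the same interval. This is a sketch under that assumption: the device-side exported count and the collector-side received count are already aligned in time. Both parameter names are hypothetical.

```python
def flow_loss_ratio(device_exported: int, collector_received: int) -> float:
    """Fraction of exported flow records that never reached the collector."""
    if device_exported <= 0:
        return 0.0
    # Clamp at zero: skewed sampling windows can make received exceed exported.
    lost = max(device_exported - collector_received, 0)
    return lost / device_exported


if __name__ == "__main__":
    ratio = flow_loss_ratio(device_exported=10_000, collector_received=9_500)
    print(f"{ratio:.1%} of exported records lost in transit")
```

Losses that the device itself reports as dropped before export are a separate term. This ratio only covers the path between exporter and collector.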
Monitoring maturity levels
These levels are sequential and cumulative. Each level includes everything below it.
flowchart TD
L4["L4 Expert
BMP, RPKI integrity, per-VRF,
sampling-rate forensics"] --> L3["L3 Mature
UDP drops, RSS, flow end-to-end loss,
NTP on devices, topology confidence"]
L3 --> L2["L2 Operational
CPU and memory, traps, licenses,
topology discovery, API status"]
L2 --> L1["L1 Survival
sysUpTime, ifOperStatus, BGP state,
utilization, errors, syslog, trap port"]
L1: survival
The absolute minimum to know if the network is alive and not on fire:
- SNMP reachability (sysUpTime GET) for every critical-path device
- Interface operational status (ifOperStatus) for critical interfaces
- BGP FSM state for critical eBGP and iBGP peers
- Interface utilization (ifHCInOctets / ifHCOutOctets vs ifHighSpeed) for top-10 interfaces
- Interface error counters (ifInErrors, ifOutErrors) for critical interfaces
- Syslog severity 0-3 (EMERG through ERR) forwarded from critical devices
- Flow collector port listening (UDP 2055 for NetFlow, 6343 for sFlow, 4739 for IPFIX)
- Trap receiver bound on UDP 162
A team at L1 catches hard outages. Nothing else.
L2: operational
Everything in L1, plus:
- All interfaces for status, utilization, errors
- All BGP peers for FSM state and prefix count
- SNMP poll latency and timeout rate per device
- Device control-plane CPU and memory
- Flow records received per second; syslog source count
- License days-to-expiry for all licensed features
- Temperature, fan, power supply state
- Topology discovery (CDP/LLDP)
- STP root bridge identity and topology change count
- Vendor API HTTP status for SD-WAN and cloud
- SNMP authentication failure rate
- coldStart/warmStart trap detection with alerting
A team at L2 has visibility into most failures. They still miss silent failures, license cliffs, and topology staleness.
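Checking only the vendor API HTTP status is what lets silent failures through. A stricter check for the PAN-OS case mentioned earlier might look like the sketch below. It assumes the XML API marks good responses with status="success" on the root response element; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def panos_response_ok(http_status: int, body: str) -> bool:
    """Treat HTTP 200 as healthy only if the XML payload also reports success."""
    if http_status != 200 or not body.strip():
        return False  # transport failure or empty payload
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # payload is not well-formed XML
    return root.tag == "response" and root.get("status") == "success"


if __name__ == "__main__":
    print(panos_response_ok(200, '<response status="success"><result/></response>'))
    print(panos_response_ok(200, '<response status="error"><msg>denied</msg></response>'))
```

The same pattern applies to JSON APIs: validate the payload shape and an expected non-empty field, not just the status code.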
L3: mature
Everything in L2, plus:
- UDP socket buffer drops (Udp_RcvbufErrors) on flow, trap, and syslog collectors
- NIC RX/TX drops and RSS IRQ distribution
- Collector CPU (per-core, %soft) and disk space
- TSDB write queue depth and series cardinality
- Flow export-to-ingest latency and sampling rate consistency
- Flow exporter drop rate (device-side) and inbound-vs-exported comparison
- Interface counter discontinuity detection
- NAT/session table utilization
- Vendor API request latency, error rate, and rate-limit remaining
- Active path probes (IPSLA/TWAMP/HTTP) on critical paths
- Cross-collector time skew and NTP offset on monitored devices
- Topology view consistency and inference confidence score
- BGP route advertisement vs reception symmetry
- Poller poll cycle duration vs configured interval
- ARP cache entry count and staleness
- RPKI/ROA validation state for all BGP sessions
- BGP NOTIFICATION Cease subcode parsing (RFC 4486, RFC 8538, RFC 9384)
- Configuration drift detection
- Endpoint positioning orphan rate
A team at L3 catches most incidents in their early stages.
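The Cease subcode parsing listed for L3 is mostly a lookup table. A minimal sketch follows, with error codes from RFC 4271 and Cease subcodes from RFC 4486, RFC 8538, and RFC 9384. The function name and output format are illustrative.

```python
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",
    2: "Administrative Shutdown",
    3: "Peer De-configured",
    4: "Administrative Reset",
    5: "Connection Rejected",
    6: "Other Configuration Change",
    7: "Connection Collision Resolution",
    8: "Out of Resources",
    9: "Hard Reset",   # RFC 8538
    10: "BFD Down",    # RFC 9384
}


def describe_notification(code: int, subcode: int = 0) -> str:
    """Render a BGP NOTIFICATION error code and subcode as readable text."""
    name = ERROR_CODES.get(code, f"Unknown error code {code}")
    if code == 6 and subcode:
        detail = CEASE_SUBCODES.get(subcode, f"unknown subcode {subcode}")
        return f"{name}: {detail}"
    return name


if __name__ == "__main__":
    print(describe_notification(6, 1))
    print(describe_notification(4))
```

Routing alerts by subcode lets an administrative shutdown be suppressed while a maximum-prefix Cease or a hold timer expiry pages someone.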
L4: expert
Everything in L3, plus the signals operators add after multiple major incidents:
- BMP (RFC 7854) for Adj-RIB-In visibility (pre-policy and post-policy routes). BGP4-MIB bgp4PathAttrTable only reflects best-path routes; Adj-RIB-In entries are not accessible over SNMP.
- BGP AS-path baseline deviation detection for own prefixes and upstreams
- RPKI validator health monitoring; alert on “Unknown” rate changes (signals validator outage)
- Sub-prefix hijack detection: alert when a more-specific appears without a less-specific in the RIB
- Smart License and vendor license server reachability monitored continuously
- Per-VRF and per-tenant isolation: BGP RIB size, flow volume, license utilization tracked per VRF
- Per-priority-queue discard counters (vendor QoS MIBs) revealing QoS queue saturation behind moderate utilization
- CoPP (control-plane policer) drop counters
- NIC per-queue drop counters via ethtool -S (rx_missed_errors, rx_no_dma_resources)
- /proc/net/softnet_stat for kernel packet processing backpressure
- FDB/ARP entry freshness (time since last refresh, computed from polling deltas)
- Flow template cache hit/miss ratio for NetFlow v9/IPFIX
- License grace-period state with feature-specific counter validation (for example, an IPS drop counter stuck at 0 while traffic still flows after license expiry)
- Asymmetric routing detection (forward vs reverse probe comparison)
The signals most teams miss
These are the systematic blind spots that keep causing incidents:
UDP socket buffer drops are not monitored. Udp_RcvbufErrors is the number one missed signal in flow collection. Charts show declining traffic during incidents that are actually traffic spikes. Production flow collectors need net.core.rmem_max raised to 16 MB or higher, tuned to actual ingress volume.
License expiry is monitored only when too late. A licensed feature (IPS, VPN, threat prevention) silently disables at midnight. The device stays up. The syslog message is low severity and buried. Users notice at 09:00.
NTP drift on monitored devices is not watched. Two devices 200ms apart on the same flap produce records that do not correlate. Postmortems fail to reconstruct events because timestamps are seconds apart.
BGP “Established but stale” is not detected. The FSM reports Established but UPDATE exchange stopped. Graceful Restart keeps the FSM green while the session is gone. Track bgpPeerInUpdates rate and the timestamp of last received prefix.
Trap receiver drops are invisible. During a trap flood, the highest-priority trap (root cause) is statistically the most likely to be dropped. There is no per-source drop counter.
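For the “Established but stale” case described above, a check could combine the FSM state with the bgpPeerInUpdates delta and the time since the last UPDATE. This is only a sketch. The one-hour quiet threshold is an arbitrary assumption, and a stable peer with few prefixes can legitimately stay quiet for longer.

```python
def established_but_stale(fsm_state: str, in_updates_delta: int,
                          seconds_since_last_update: float,
                          max_quiet_s: float = 3600.0) -> bool:
    """Flag a peer whose FSM is Established but which has stopped sending UPDATEs."""
    return (
        fsm_state.lower() == "established"
        and in_updates_delta == 0
        and seconds_since_last_update > max_quiet_s
    )


if __name__ == "__main__":
    # No UPDATEs between polls and two hours of silence on a transit peer.
    print(established_but_stale("Established", 0, 7200.0))
```

The threshold works best when set per peer from that peer's own baseline, since a full-table transit session and a single-prefix customer session have very different normal UPDATE rates.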
Vendor API silent failures are not detected. HTTP 200 with empty payload is treated as “no data” rather than “API is broken.” PAN-OS returns <response status="error"> inside HTTP 200.
NetFlow v9/IPFIX template desync is invisible. After a device reboot or upgrade, templates arrive on a 5-30 minute interval. Until then, all data records are silently discarded.
Sampling rate normalization is skipped. sFlow analytics report raw counts without scaling. Bandwidth charts are wrong by the sampling factor (often 1:1000 or worse).
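The correction itself is a single multiplication, which is why skipping it is so easy to miss. A minimal sketch follows. It assumes the sampling rate is expressed as N for 1-in-N sampling, and the function names are illustrative.

```python
def scale_sampled(sampled_value: int, sampling_rate: int) -> int:
    """Estimate the true byte or packet count from a 1-in-N sampled count."""
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1 (1 means unsampled)")
    return sampled_value * sampling_rate


def estimated_bps(sampled_bytes: int, sampling_rate: int, interval_s: float) -> float:
    """Estimated bits per second over the interval, after scaling."""
    return scale_sampled(sampled_bytes, sampling_rate) * 8 / interval_s


if __name__ == "__main__":
    # 1,500 sampled bytes at 1:1000 over 60 seconds.
    print(scale_sampled(1_500, 1_000), "bytes estimated")
    print(estimated_bps(1_500, 1_000, 60.0), "bit/s estimated")
```

The scaled value is a statistical estimate, so small conversations carry large relative error even when the aggregate is accurate.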
32-bit counter rollover is treated as a real spike. ifInOctets wraps in approximately 3.4 seconds at 10G line rate. Naive differencing produces terabit spikes or negative utilization.
Poller fall-behind is not detected. The scheduler oversubscribes devices, retries compound, control-plane CPU spikes, and healthy devices appear “down.” The platform is the problem; the network is not.
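Poller fall-behind can be caught by comparing each poll cycle's duration with the configured interval. The sketch below flags a poller only after several consecutive overruns, to avoid alerting on one slow cycle. The window of three is an assumed default, not a recommendation from any specific platform.

```python
def poller_falling_behind(cycle_durations_s: list[float], interval_s: float,
                          window: int = 3) -> bool:
    """True when the last `window` poll cycles all exceeded the interval."""
    recent = cycle_durations_s[-window:]
    return len(recent) == window and all(d > interval_s for d in recent)


if __name__ == "__main__":
    durations = [41.0, 48.5, 63.2, 71.9, 66.4]  # seconds per full poll cycle
    print(poller_falling_behind(durations, interval_s=60.0))
```

When this fires, device DOWN states derived from missed polls deserve less trust until the cycle time recovers.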
How Netdata helps
Netdata collects many of the collector-side signals in this checklist that most NPM platforms miss:
- SNMP data collection polls sysUpTime, ifOperStatus, ifHCInOctets/ifHCOutOctets, ifInErrors/ifOutErrors, ifInDiscards/ifOutDiscards, and device CPU/memory with configurable intervals down to 1 second.
- Linux system plugins expose Udp_RcvbufErrors, NIC RX/TX drops from /proc/net/dev, per-core softirq from /proc/softirqs, and /proc/net/softnet_stat for kernel packet processing backpressure.
- Network interface metrics include per-NIC ring buffer drops and ethtool -S counters like rx_missed_errors, so you can distinguish NIC-level drops from socket-buffer drops.
- Cross-layer correlation lets you join rising Udp_RcvbufErrors with rising flow receive rate and rising collector CPU in a single view, which is the diagnostic chain for silent UDP flow loss.
- NTP metrics surface offset and drift on collectors and, where SNMP exposes hrSystemDate, on monitored devices.







