Network monitoring checklist: the signals every production network needs

This checklist covers the signals production networks need, organized by detection priority and mapped to maturity levels from survival to expert.

A network performance monitoring (NPM) stack is a federation of collectors, parsers, enrichment services, storage tiers, and an analytics core. Most production incidents are not “the network broke” but “a collector’s UDP buffer dropped packets,” “the NetFlow v9 template cache went stale after a device reboot,” or “the polling worker pool fell behind and now a healthy device looks down.” The checklist is organized to surface those failure modes, not just the top-level symptoms.

The federation at a glance

When a signal is missing, stale, or wrong, the fault is usually one or two subsystems upstream of the dashboard. The typical NPM stack includes:

Time synchronization substrate (NTP/PTP). Every cross-collector correlation depends on accurate, monotonic time across collectors, polled devices, and API endpoints.
Polling transport. ICMP, UDP/161 (SNMP), TCP/22 (SSH/CLI scrape), and HTTPS (vendor APIs) reaching each managed endpoint.
SNMP polling engine. A scheduler fanning OID requests across devices with timeouts, counter tables, and a device state machine (UP / STALE / UNKNOWN / DOWN).
Flow collection subsystem. NetFlow v5/v9 and IPFIX collectors with template caches, sampling-rate awareness, and flow record storage. sFlow is sample-datagram oriented, not template-flow, and has a different failure profile.
Topology inference engine. Fuses CDP/LLDP neighbor tables, FDB entries, ARP tables, STP state, and routing tables to derive Layer-2 and Layer-3 topology.
BGP monitoring subsystem. Active or passive sessions tracking FSM state, prefix announcements, AS-path changes, and RPKI validity.
Syslog and trap ingestion. UDP/TCP/TLS listeners with parser backpressure, facility/severity handling, and deduplication.
Vendor API integration layer. Pull-mode clients for SD-WAN controllers, cloud platforms, and modern firewalls, each with their own auth, rate limit, and pagination semantics.
Storage tiers. Counter TSDB (downsampled for long retention), full-resolution flow store, topology graph DB, raw syslog store, and event/alert log.

Signal domains by detection priority

The domains below are ordered by detection priority: the earliest surfacing of real issues with the best signal-to-noise comes first. Within each domain, the most operationally critical signals are listed first.

Availability

Signal	Source	Why it matters
SNMP agent reachability (sysUpTime)	SNMP GET `.1.3.6.1.2.1.1.3.0`	No response means agent down, partition, ACL block, or credential issue. Value decrease means reboot. SNMP down with healthy ICMP means agent problem, not device outage.
ICMP reachability	`ping`, `fping`	Liveness independent of SNMP. ICMP down plus SNMP down equals network problem. ICMP down plus SNMP up equals ICMP rate-limited or blocked (common on firewalls, CoPP).
Vendor API reachability and validity	HTTPS to vendor endpoint	For SD-WAN/cloud, the API may be the only telemetry source. HTTP 200 with empty or error payload (PAN-OS `<response status="error">` inside HTTP 200) is a silent failure.
Flow UDP packet receipt rate	`/proc/net/udp`, collector stats, `nstat`	Drop to 0 from one exporter means exporter stopped or partitioned. Drop from all exporters means collector-side failure.
Syslog receipt rate and severity	UDP/TCP port 514 listener, TLS port 6514	Rate spike with severity escalation means device event. Spike without escalation means noise storm. Silence from a normally-chatty device means isolation or logging failure.
SNMP trap rate and type	UDP port 162 listener	`linkDown`/`linkUp` pairs mean flap. `coldStart` means reboot. Silence from a noisy device means trap path broken.
BGP session state (FSM)	BGP4-MIB `.1.3.6.1.2.1.15.3.1.2`, CLI, BMP	Established means exchanging routes. Established with no UPDATE traffic (stale session) is a worse failure than Idle.
Interface operational status	IF-MIB `.1.3.6.1.2.1.2.2.1.8` (`ifOperStatus`)	Admin up plus oper down means physical or link-layer failure. Flapping means link instability.
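
The ICMP and SNMP rows above combine into a four-way diagnosis. A minimal sketch of that decision table follows; the `Diagnosis` enum and `classify_reachability` function are hypothetical names introduced for illustration, not part of any collector API.

```python
from enum import Enum


class Diagnosis(Enum):
    """Outcomes of combining the ICMP and SNMP liveness probes (illustrative)."""
    HEALTHY = "ICMP up, SNMP up: device and agent reachable"
    AGENT_PROBLEM = "ICMP up, SNMP down: agent, ACL, or credential issue"
    ICMP_FILTERED = "ICMP down, SNMP up: ICMP rate-limited or blocked"
    NETWORK_PROBLEM = "ICMP down, SNMP down: device outage or partition"


def classify_reachability(icmp_ok: bool, snmp_ok: bool) -> Diagnosis:
    """Map two boolean probe results onto the four cases in the table."""
    if icmp_ok and snmp_ok:
        return Diagnosis.HEALTHY
    if icmp_ok:
        return Diagnosis.AGENT_PROBLEM
    if snmp_ok:
        return Diagnosis.ICMP_FILTERED
    return Diagnosis.NETWORK_PROBLEM
```

In practice each probe result would come from several consecutive attempts, so a single lost packet does not flip the diagnosis.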

Errors

Signal	Source	Why it matters
Interface errors (ifInErrors, ifOutErrors)	IF-MIB `.1.3.6.1.2.1.2.2.1.14`, `.20`	Incrementing counters mean cable/fiber degradation, SFP failure, duplex mismatch, or EMI. Rate of change matters more than absolute value.
Interface discards (ifInDiscards, ifOutDiscards)	IF-MIB `.1.3.6.1.2.1.2.2.1.13`, `.19`	Queue or buffer overflow, or ACL drops. Often the leading indicator of congestion before utilization shows 100%.
UDP socket buffer drops	`/proc/net/snmp`, `nstat -az Udp_RcvbufErrors`	The number one silent killer for flow, trap, and syslog collectors. Datagrams arrive at the kernel but the application was too slow to drain. Any nonzero value means lost telemetry.
SNMP timeout and retry rate	Collector stats, `time snmpget`	Rising across many devices means collector-side issue. Rising on one device means device-side agent or CPU issue.
BGP NOTIFICATION and Cease messages	bgpBackwardTransition trap, CLI, syslog	Cease/1 is maximum prefixes reached. Cease/2 is administrative shutdown. Hold Timer Expired (NOTIFICATION error code 4) often points to control-plane CPU saturation or loss on the peering path.
License and feature validity	Vendor MIBs, PAN-OS API, Meraki API, Cato GraphQL	Feature silently disabled at midnight. Users complain at 09:00. The most common root cause of “the firewall stopped doing what we paid for.”
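
The UDP socket buffer drop counter lives in the `Udp:` section of `/proc/net/snmp`, which is a header line of field names followed by a line of values. The sketch below parses that layout from a string; `parse_udp_counters`, `rcvbuf_drops_since`, and the sample text are illustrative assumptions, and a real collector would read the file on each poll and alert on the delta rather than the since-boot total.

```python
SAMPLE_PROC_NET_SNMP = """\
Ip: Forwarding DefaultTTL
Ip: 1 64
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 981234 12 7 450021 7 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
"""


def parse_udp_counters(proc_net_snmp: str) -> dict:
    """Return the Udp counters as {name: value} from /proc/net/snmp text."""
    # "Udp:" does not match the "UdpLite:" lines, so only two lines survive.
    udp_lines = [
        line.split() for line in proc_net_snmp.splitlines()
        if line.startswith("Udp:")
    ]
    if len(udp_lines) < 2:
        raise ValueError("no Udp section found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def rcvbuf_drops_since(previous: dict, current: dict) -> int:
    """Datagrams dropped for lack of socket buffer space between two polls."""
    return current["RcvbufErrors"] - previous["RcvbufErrors"]
```

Any positive result from `rcvbuf_drops_since` means datagrams reached the kernel but were discarded before the collector read them.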

Saturation

Signal	Source	Why it matters
Interface utilization (% of ifHighSpeed)	IF-MIB `ifHCInOctets` `.1.3.6.1.2.1.31.1.1.1.6`, `ifHCOutOctets` `.10`, `ifHighSpeed` `.15`	95% sustained for over 5 min on critical interface means congestion with drops and latency. Use 64-bit HC counters. 32-bit `ifInOctets` wraps in approximately 3.4 seconds at 10G line rate.
NIC RX/TX drops on collector	`/proc/net/dev`, `ethtool -S`	Ring buffer overflow before packets reach the socket layer. `rx_missed_errors` is the most actionable counter.
Collector CPU (per-core, %soft)	`mpstat -P ALL`, `/proc/softirqs`	High `%soft` on one core means RSS funneling all packet processing to one CPU. Total CPU may look fine while one core is pinned.
Collector disk and TSDB write queue	`df`, `iostat`, collector metrics	Cardinality inflation (new subnet, NAT pool, scanner traffic) can fill disk in hours. Write queue growing means TSDB cannot keep up with ingestion.
Device control-plane CPU	Cisco `.1.3.6.1.4.1.9.9.109.1.1.1.1.7`, Juniper `.1.3.6.1.4.1.2636.3.1.13.1.8`	Sustained over 90% means SNMP starvation, BGP hold-time expiry, and session drops.
Device memory utilization	Cisco `.1.3.6.1.4.1.9.9.48.1.1.1.5` (used) and `.6` (free), HOST-RESOURCES-MIB	Free memory approaching 0 means OOM imminent. Used memory rising by over 1%/min means memory leak.
BGP RIB and FIB size	BGP4-MIB prefix counts, CLI	Sudden change over 20% in 5 min means route leak or mass withdrawal. Full IPv4 DFZ in 2026 is approximately 940k prefixes.
NAT and session table utilization	PAN-OS API, vendor CLI	Approaching limit means new connections denied. Sustained growth means traffic outpacing NAT capacity.
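API rate-limit remaining	HTTP headers (`Retry-After`, `X-RateLimit-Remaining`)	Remaining budget approaching zero means polls are about to be throttled (HTTP 429) and API-sourced data will go stale. Meraki, for example, allows 10 req/sec/org.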
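
The utilization row and the wrap warning reduce to two small calculations. A 32-bit octet counter holds 2^32 octets, so at 10 Gbit/s it wraps after 2^32 × 8 / 10^10 ≈ 3.44 seconds. The sketch below shows that arithmetic and a utilization calculation from two `ifHCInOctets` samples; the function names are hypothetical, and `ifHighSpeed` is taken in its MIB unit of 1,000,000 bits per second.

```python
def wrap_seconds(counter_bits: int, line_rate_bps: float) -> float:
    """Seconds until an octet counter of the given width wraps at line rate."""
    return (2 ** counter_bits) * 8 / line_rate_bps


def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, if_high_speed_mbps: int) -> float:
    """Percent utilization from two 64-bit octet counter samples."""
    if interval_s <= 0 or if_high_speed_mbps <= 0:
        raise ValueError("interval and ifHighSpeed must be positive")
    # Modular subtraction tolerates a single wrap of the 64-bit counter.
    delta_octets = (curr_octets - prev_octets) % (2 ** 64)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / (if_high_speed_mbps * 1_000_000)
```

For example, 71,250,000,000 octets received in 60 seconds on a 10G interface (`ifHighSpeed` of 10000) works out to 95% utilization.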

Internal state, replication, and correctness

Signal	Source	Why it matters
Device uptime (sysUpTime)	SNMP `.1.3.6.1.2.1.1.3.0`	Decrease means reboot. 32-bit wrap at approximately 497 days looks like reboot; track wraps separately.
Temperature, fan, power supply	ENTITY-SENSOR-MIB `.1.3.6.1.2.1.99.1.1.1.4`	Thermal failure, cooling failure, or redundancy lost. Use vendor-defined thresholds, not arbitrary absolute numbers.
Interface counter discontinuity	`ifCounterDiscontinuityTime` `.1.3.6.1.2.1.31.1.1.1.19`	Counter reset without sysUpTime reset means SNMP agent inconsistency or counter-source bug.
Cross-collector time skew	`ntpq -p`, `chronyc tracking`	Skew over 100 ms degrades cross-site flow correlation. Skew over 1 s breaks it entirely.
NTP offset on monitored devices	`hrSystemDate` `.1.3.6.1.2.1.25.1.2.0`	Device clock drift causes postmortem correlation failure. Consistently the most under-monitored NTP signal.
Topology view consistency	CDP/LLDP vs FDB vs ARP cross-validation	Inconsistency means stale data, topology change in progress, or device bug. Three sources agreeing is high confidence; one source alone is low.
Flow sampling rate consistency	sFlow MIB, NetFlow v9 template fields	Mismatch means analytics wrong by orders of magnitude. Without sampling-rate correction, sFlow at 1:1000 reports 1/1000 of true traffic.
STP root bridge and TCN	BRIDGE-MIB `.1.3.6.1.2.1.17.2`	Root bridge change means reconvergence. TCN rate over 5/min means instability.
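
Telling a sysUpTime wrap apart from a reboot needs the wall-clock time between polls. sysUpTime is a 32-bit TimeTicks value in hundredths of a second, so it wraps after 2^32 / 100 seconds, roughly 497 days. The sketch below is one possible heuristic under that assumption: a decrease counts as a wrap only if the previous value plus elapsed time would have passed 2^32 and the new value lands near the projected remainder. The function name and the 60-second tolerance are illustrative choices.

```python
TIMETICKS_MODULUS = 2 ** 32   # sysUpTime is a 32-bit TimeTicks value
TICKS_PER_SECOND = 100        # TimeTicks are hundredths of a second


def classify_uptime_change(prev_ticks: int, curr_ticks: int,
                           elapsed_s: float, tolerance_s: float = 60.0) -> str:
    """Return "running", "wrap", or "reboot" for two sysUpTime samples."""
    if curr_ticks >= prev_ticks:
        return "running"
    projected = prev_ticks + int(elapsed_s * TICKS_PER_SECOND)
    if projected >= TIMETICKS_MODULUS:
        # The counter should have wrapped; check the new value is where
        # an uninterrupted device would be after the wrap.
        expected = projected % TIMETICKS_MODULUS
        if abs(curr_ticks - expected) <= tolerance_s * TICKS_PER_SECOND:
            return "wrap"
    return "reboot"
```

A device that reboots close to the 497-day mark still classifies as a reboot, because its new uptime falls well short of the projected post-wrap value.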

Latency, throughput, and security

Signal	Source	Why it matters
SNMP poll response latency	`time snmpget`, collector stats	Over 1s on a normally-fast device means agent or management-network degradation.
ICMP round-trip time	`ping`, `fping`	p99 over 2x rolling baseline means congestion or path change. High jitter means unstable path.
Active path probes	Cisco IPSLA RTTMON MIB, TWAMP, HTTP GET	RTT and loss per path, independent of application. Loss over 1% sustained is degraded.
Flow bytes per conversation	NetFlow/sFlow/IPFIX records	Top talkers, DDoS patterns, data exfiltration signals. sFlow requires sampling-rate multiplication for accurate byte counts.
Poller poll cycle duration	Collector internal stats	Cycle exceeding configured interval means data is drifting stale. The most under-monitored meta-signal in NPM.
Flow exporter drop rate (device-side)	Cisco `cnfESPktsDropped` `.1.3.6.1.4.1.9.9.387.1.4.6`	Device dropped flows that never reached collector. Invisible to collector alone. Compare device-exported rate against collector inbound rate for end-to-end loss detection.
Unauthorized SNMP access	`snmpInBadCommunityNames` `.1.3.6.1.2.1.11.4`, USM stats	Burst from single source means scanning. Persistent events from many sources usually mean clients are still trying a default or retired community string such as “public”.
BGP RPKI/ROA invalid acceptance	Vendor CLI `show bgp rpki`, validators	Any RPKI-invalid route accepted in production is a security event. Verify with public validators before alerting; stale cache produces false invalids.
Config changes without ticket	Syslog CONFIG-I, AAA logs, config diff	Change outside maintenance window without change ticket means unauthorized or emergency. Change followed within 30 min by incident is a high-correlation root-cause candidate.
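
The sampling-rate multiplication mentioned for flow byte counts is a one-line scaling step, but it has to be applied per record because exporters can use different rates. A minimal sketch follows, assuming a hypothetical `SampledFlow` record that carries the bytes seen in sampled packets and the N of a 1-in-N sampling rate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SampledFlow:
    """One sampled flow record (illustrative shape, not a wire format)."""
    exporter: str
    sampled_bytes: int   # bytes observed in sampled packets only
    sampling_rate: int   # N in "1 in N packets sampled"


def estimated_bytes(flow: SampledFlow) -> int:
    """Scale sampled bytes up to an estimate of true traffic."""
    if flow.sampling_rate < 1:
        raise ValueError("sampling_rate must be at least 1")
    return flow.sampled_bytes * flow.sampling_rate


def total_estimated_bytes(flows: Iterable[SampledFlow]) -> int:
    """Sum estimates across records that may use different sampling rates."""
    return sum(estimated_bytes(f) for f in flows)
```

Scaling per record rather than once on the total keeps the estimate correct when one exporter samples at 1:1000 and another at 1:4096.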

Monitoring maturity levels

These levels are sequential and cumulative. Each level includes everything below it.

flowchart TD
    L4["L4 Expert<br/>BMP, RPKI integrity, per-VRF,<br/>sampling-rate forensics"] --> L3["L3 Mature<br/>UDP drops, RSS, flow end-to-end loss,<br/>NTP on devices, topology confidence"]
    L3 --> L2["L2 Operational<br/>CPU and memory, traps, licenses,<br/>topology discovery, API status"]
    L2 --> L1["L1 Survival<br/>sysUpTime, ifOperStatus, BGP state,<br/>utilization, errors, syslog, trap port"]

L1: survival

The absolute minimum to know if the network is alive and not on fire:

SNMP reachability (sysUpTime GET) for every critical-path device
Interface operational status (ifOperStatus) for critical interfaces
BGP FSM state for critical eBGP and iBGP peers
Interface utilization (ifHCInOctets / ifHCOutOctets vs ifHighSpeed) for top-10 interfaces
Interface error counters (ifInErrors, ifOutErrors) for critical interfaces
Syslog severity 0-3 (EMERG through ERR) forwarded from critical devices
Flow collector port listening (UDP 2055 for NetFlow, 6343 for sFlow, 4739 for IPFIX)
Trap receiver bound on UDP 162

A team at L1 catches hard outages. Nothing else.
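
Filtering syslog to severity 0-3 depends on decoding the PRI value at the start of each message, where PRI = facility × 8 + severity. The sketch below extracts the severity and applies the L1 cutoff; the function names are hypothetical and the parser handles only the leading `<PRI>` field, not the rest of the message.

```python
import re
from typing import Optional

_PRI_PATTERN = re.compile(r"^<(\d{1,3})>")
SEVERITY_NAMES = ("EMERG", "ALERT", "CRIT", "ERR",
                  "WARNING", "NOTICE", "INFO", "DEBUG")


def severity_of(message: str) -> Optional[int]:
    """Return the syslog severity (0-7), or None if PRI is absent or invalid."""
    match = _PRI_PATTERN.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid PRI is facility 23 * 8 + severity 7
        return None
    return pri % 8


def is_l1_critical(message: str) -> bool:
    """True for severity 0-3 (EMERG through ERR)."""
    severity = severity_of(message)
    return severity is not None and severity <= 3
```

A message starting with `<187>` is facility 23 (local7) at severity 3 (ERR) and passes the filter; one starting with `<189>` is severity 5 (NOTICE) and does not.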

L2: operational

Everything in L1, plus:

All interfaces for status, utilization, errors
All BGP peers for FSM state and prefix count
SNMP poll latency and timeout rate per device
Device control-plane CPU and memory
Flow records received per second; syslog source count
License days-to-expiry for all licensed features
Temperature, fan, power supply state
Topology discovery (CDP/LLDP)
STP root bridge identity and topology change count
Vendor API HTTP status for SD-WAN and cloud
SNMP authentication failure rate
coldStart/warmStart trap detection with alerting

A team at L2 has visibility into most failures. They still miss silent failures, license cliffs, and topology staleness.

L3: mature

Everything in L2, plus:

UDP socket buffer drops (Udp_RcvbufErrors) on flow, trap, and syslog collectors
NIC RX/TX drops and RSS IRQ distribution
Collector CPU (per-core, %soft) and disk space
TSDB write queue depth and series cardinality
Flow export-to-ingest latency and sampling rate consistency
Flow exporter drop rate (device-side) and inbound-vs-exported comparison
Interface counter discontinuity detection
NAT/session table utilization
Vendor API request latency, error rate, and rate-limit remaining
Active path probes (IPSLA/TWAMP/HTTP) on critical paths
Cross-collector time skew and NTP offset on monitored devices
Topology view consistency and inference confidence score
BGP route advertisement vs reception symmetry
Poller poll cycle duration vs configured interval
ARP cache entry count and staleness
RPKI/ROA validation state for all BGP sessions
BGP NOTIFICATION Cease subcode parsing (RFC 4486, RFC 8538, RFC 9384)
Configuration drift detection
Endpoint positioning orphan rate

A team at L3 catches most incidents in their early stages.
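
Cease subcode parsing, listed above, amounts to a lookup on the NOTIFICATION error code and subcode. The sketch below covers the Cease subcodes from the three RFCs named in the list plus the Hold Timer Expired error code; `describe_notification` is a hypothetical helper, and it expects the two integers already extracted from the message, trap, or syslog line.

```python
HOLD_TIMER_EXPIRED = 4   # NOTIFICATION error code
CEASE = 6                # NOTIFICATION error code

CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",   # RFC 4486
    2: "Administrative Shutdown",              # RFC 4486
    3: "Peer De-configured",                   # RFC 4486
    4: "Administrative Reset",                 # RFC 4486
    5: "Connection Rejected",                  # RFC 4486
    6: "Other Configuration Change",           # RFC 4486
    7: "Connection Collision Resolution",      # RFC 4486
    8: "Out of Resources",                     # RFC 4486
    9: "Hard Reset",                           # RFC 8538
    10: "BFD Down",                            # RFC 9384
}


def describe_notification(error_code: int, subcode: int) -> str:
    """Human-readable summary of a BGP NOTIFICATION code and subcode."""
    if error_code == HOLD_TIMER_EXPIRED:
        return "Hold Timer Expired"
    if error_code == CEASE:
        name = CEASE_SUBCODES.get(subcode, f"unknown subcode {subcode}")
        return f"Cease: {name}"
    return f"error code {error_code}, subcode {subcode}"
```

Routing alerts on the decoded reason lets an administrative shutdown page nobody while a maximum-prefix Cease or a Hold Timer Expired goes straight to on-call.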

L4: expert

Everything in L3, plus the signals operators add after multiple major incidents:

BMP (RFC 7854) for Adj-RIB-In visibility (pre-policy and post-policy routes). BGP4-MIB bgp4PathAttrTable only reflects best-path routes; Adj-RIB-In entries are not accessible over SNMP.
BGP AS-path baseline deviation detection for own prefixes and upstreams
RPKI validator health monitoring; alert on “Unknown” rate changes (signals validator outage)
Sub-prefix hijack detection: alert when a more-specific appears without a less-specific in the RIB
Smart License and vendor license server reachability monitored continuously
Per-VRF and per-tenant isolation: BGP RIB size, flow volume, license utilization tracked per VRF
Per-priority-queue discard counters (vendor QoS MIBs) revealing QoS queue saturation behind moderate utilization
CoPP (control-plane policer) drop counters
NIC per-queue drop counters via ethtool -S (rx_missed_errors, rx_no_dma_resources)
/proc/net/softnet_stat for kernel packet processing backpressure
FDB/ARP entry freshness (time since last refresh, computed from polling deltas)
Flow template cache hit/miss ratio for NetFlow v9/IPFIX
License grace-period state with feature-specific counter validation (IPS drops at 0 when traffic flows after license expiry)
Asymmetric routing detection (forward vs reverse probe comparison)
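
For the `/proc/net/softnet_stat` item, each line holds hexadecimal counters for one CPU: the first column is packets processed, the second is packets dropped because the backlog queue was full, and the third is the number of times the softirq budget ran out (time squeeze). The sketch below parses those three columns from text; the function name and sample are illustrative, and treating the row index as the CPU number is an assumption that holds only when all CPUs are online.

```python
SAMPLE_SOFTNET_STAT = """\
0000272d 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0001f3a0 0000000c 0000004e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""


def parse_softnet_stat(text: str) -> list:
    """Return one dict per CPU row with processed, dropped, and time_squeeze."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    stats = []
    for cpu_index, fields in enumerate(rows):
        if len(fields) < 3:
            raise ValueError(f"row {cpu_index} has too few columns")
        stats.append({
            "cpu": cpu_index,  # assumption: row order matches CPU numbering
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),
            "time_squeeze": int(fields[2], 16),
        })
    return stats
```

A rising `dropped` or `time_squeeze` value on a single row while the others stay flat is the same one-core funneling pattern that high per-core `%soft` reveals.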

The signals most teams miss

These are the systematic blind spots that keep causing incidents:

UDP socket buffer drops are not monitored. Udp_RcvbufErrors is the number one missed signal in flow collection. Charts show declining traffic during incidents that are actually traffic spikes. Production flow collectors need net.core.rmem_max raised to 16 MB or higher, tuned to actual ingress volume.
License expiry is monitored only when too late. A licensed feature (IPS, VPN, threat prevention) silently disables at midnight. The device stays up. The syslog message is low severity and buried. Users notice at 09:00.
NTP drift on monitored devices is not watched. Two devices whose clocks are 200ms apart log the same flap with timestamps that do not correlate. Once drift reaches seconds, postmortems cannot reconstruct the order of events.
BGP “Established but stale” is not detected. The FSM reports Established but UPDATE exchange stopped. Graceful Restart keeps the FSM green while the session is gone. Track bgpPeerInUpdates rate and the timestamp of last received prefix.
Trap receiver drops are invisible. During a trap flood, the highest-priority trap (root cause) is statistically the most likely to be dropped. There is no per-source drop counter.
Vendor API silent failures are not detected. HTTP 200 with empty payload is treated as “no data” rather than “API is broken.” PAN-OS returns <response status="error"> inside HTTP 200.
NetFlow v9/IPFIX template desync is invisible. After a device reboot or upgrade, templates arrive on a 5-30 minute interval. Until then, all data records are silently discarded.
Sampling rate normalization is skipped. sFlow analytics report raw counts without scaling. Bandwidth charts are wrong by the sampling factor (often 1:1000 or worse).
32-bit counter rollover is treated as a real spike. ifInOctets wraps in approximately 3.4 seconds at 10G line rate. Naive differencing produces terabit spikes or negative utilization.
Poller fall-behind is not detected. The scheduler oversubscribes devices, retries compound, control-plane CPU spikes, and healthy devices appear “down.” The platform is the problem; the network is not.
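
The "Established but stale" check in the list above can be sketched as a comparison of two polls of the same peer: the FSM stays Established while the `bgpPeerInUpdates` counter does not move. The `PeerSample` record, the function name, and the one-hour quiet threshold below are all illustrative assumptions; a peer with a stable table can legitimately send no UPDATEs for long periods, so the threshold needs tuning per peer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerSample:
    """One poll of a BGP peer (illustrative shape)."""
    fsm_state: str       # e.g. "established", "idle"
    in_updates: int      # bgpPeerInUpdates counter value
    timestamp_s: float   # poll time in seconds


def is_established_but_stale(older: PeerSample, newer: PeerSample,
                             max_quiet_s: float = 3600.0) -> bool:
    """True if the peer stayed Established but received no UPDATEs."""
    if older.fsm_state != "established" or newer.fsm_state != "established":
        return False
    quiet_for_s = newer.timestamp_s - older.timestamp_s
    no_new_updates = newer.in_updates == older.in_updates
    return no_new_updates and quiet_for_s >= max_quiet_s
```

Comparing against the peer's own historical UPDATE rate, rather than a fixed threshold, would reduce false positives on quiet but healthy sessions.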

How Netdata helps

Netdata collects many of the collector-side signals in this checklist that most NPM platforms miss:

SNMP data collection polls sysUpTime, ifOperStatus, ifHCInOctets/ifHCOutOctets, ifInErrors/ifOutErrors, ifInDiscards/ifOutDiscards, and device CPU/memory with configurable intervals down to 1 second.
Linux system plugins expose Udp_RcvbufErrors, NIC RX/TX drops from /proc/net/dev, per-core softirq from /proc/softirqs, and /proc/net/softnet_stat for kernel packet processing backpressure.
Network interface metrics include per-NIC ring buffer drops and ethtool -S counters like rx_missed_errors, so you can distinguish NIC-level drops from socket-buffer drops.
Cross-layer correlation lets you join rising Udp_RcvbufErrors with rising flow receive rate and rising collector CPU in a single view, which is the diagnostic chain for silent UDP flow loss.
NTP metrics surface offset and drift on collectors and, where SNMP exposes hrSystemDate, on monitored devices.
