NETWORK · OPERATIONS PLAYBOOK

The link is up, the dashboard is green, and the data is already gone

SNMP polling, flow telemetry, BGP, traps, and topology — how a monitoring pipeline really works, the places it silently loses data, the signals worth watching, and a runbook for each incident.

Network monitoring fails differently from the things it watches. The network can be perfectly healthy while your monitoring quietly goes blind.

Almost every network signal arrives over UDP — SNMP on 161, traps on 162, syslog on 514, flow records on 2055/6343 — and UDP drops silently. When a collector's socket buffer fills, the kernel discards datagrams and increments a counter almost nobody watches. Your flow charts dip, an operator assumes traffic fell, and the truth is that the data never made it off the wire. The same blind spot hides in a dozen places: an SNMP poller that falls behind and reports devices as down, a BGP session that stays Established long after it stopped carrying routes, a NetFlow v9 template that desyncs after an exporter reboot and decodes every field wrong, a counter that rolls over and paints a 4-billion-packet spike that never happened.

These guides are for engineers who already run a network and the monitoring around it — not an introduction to subnetting. The goal is the mental model of how the monitoring pipeline actually behaves, the failure patterns that keep recurring, the signals that catch them before an outage, and the runbooks you wish you'd had during the last 2 a.m. incident where everything was green and nothing worked.

How network monitoring actually works in production

Network monitoring is not one tool. It is a stack of collectors, each speaking a different protocol to a different layer of the estate, fused into one picture. Most failures live in the seams between these layers — in the transport that goes silent, not in the device being watched.

time synchronization

NTP/PTP across every collector and device. Cross-collector correlation — a flow drop paired with a BGP NOTIFICATION, a license window, a trap timestamp — depends on monotonic, aligned clocks. A few seconds of drift makes postmortems unreconstructable.

TIME

polling transport

ICMP, UDP/161 (SNMP), TCP/22 (CLI scrape), and HTTPS (vendor APIs) reaching each managed endpoint. Without working transport, every higher signal is simply absent — and absence looks like silence, not an error.

TRANSPORT

SNMP polling engine

A scheduler fanning OID requests across devices and worker threads, holding the counter table and sysUpTime anchors used for every rate calculation, and tracking each device's UP / STALE / UNKNOWN / DOWN state.

POLLER

flow collection & templates

NetFlow v5/v9 and IPFIX collectors maintaining template caches, decoding records, normalizing sampling rates, and writing flow storage. sFlow is sample-datagram oriented — a different failure profile entirely.

FLOW

traps & syslog ingestion

UDP/162 trap listener and UDP/TCP/TLS syslog pipeline with MIB-resolved varbinds, RFC 3164/5424 framing, and parser backpressure. Push-based, lossy, and the only signal for many event-driven conditions.

EVENTS

BGP & routing monitoring

Active, passive, or BMP sessions tracking FSM state, prefix announcements, AS-path and RPKI validity, and per-prefix reachability. A session can be Established and carrying nothing.

ROUTING

topology inference

A graph builder fusing CDP/LLDP, FDB, ARP, STP, and routing tables into Layer-2/Layer-3 topology and endpoint positioning. Probabilistic — it degrades as input freshness degrades.

TOPOLOGY

storage & retention

Counter TSDB, full-resolution flow store, topology graph, raw syslog, and event log — each with its own disk, CPU, and IOPS profile. Slow storage backs up through the parser and becomes upstream packet loss.

STORAGE

Why this matters: 'traffic dropped' can mean the traffic actually dropped, or a full socket buffer, a poller fall-behind, a desynced flow template, a counter rollover, an SNMP timeout, a stale BGP session, or a disk too slow to drain the parser. Same symptom, eight different layers, eight different signals and fixes.

The failures you'll actually see

Real network-monitoring incidents fall into a small set of recurring shapes. Most of them are failures of the monitoring pipeline, not the network. Recognise the shape and triage gets much faster.

CRITICAL

The silent UDP flow-loss cascade

A collector stops draining its socket buffer fast enough — slow parser, slow disk, single-core RSS pinning — and the kernel silently discards flow datagrams. UdpRcvbufErrors climbs; the flow charts dip; everyone assumes traffic fell. The data was lost at the collector, not on the network.

UdpRcvbufErrors / UdpInErrors incrementing
flow records/sec drops with no device-side change
multiple exporters declining at once
one collector CPU core pinned at 100%

Investigate →
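As a minimal sketch of how the buffer-drop signal can be sampled, assuming a Linux collector where the kernel exposes UDP counters as a header line and a value line in /proc/net/snmp: the function names here are hypothetical, and a real deployment would more likely take this counter from an existing host agent.

```python
from pathlib import Path


def parse_udp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the two 'Udp:' lines (header, then values) of /proc/net/snmp."""
    udp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) != 2:
        raise ValueError("expected one header line and one value line for Udp")
    names, values = udp_lines
    return {name: int(value) for name, value in zip(names, values)}


def new_drops(previous: dict[str, int], current: dict[str, int]) -> int:
    """Datagrams discarded for a full receive buffer since the previous sample."""
    return current["RcvbufErrors"] - previous["RcvbufErrors"]


def read_udp_counters(path: str = "/proc/net/snmp") -> dict[str, int]:
    """Read the live counters (Linux only)."""
    return parse_udp_counters(Path(path).read_text())
```

Sampling read_udp_counters() on a fixed schedule and alerting when new_drops() is nonzero turns a dip in the flow chart into an explicit statement that the collector, not the network, lost the data.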

ACTIVE

The poller fall-behind

The SNMP scheduler can't complete its poll cycle within the interval. Polls queue, timeouts rise, and devices flap to UNKNOWN/DOWN even though they're perfectly healthy. This is the most common cause of an 'is the network down?' false alarm, and it gets worse exactly when the network is busiest.

poll cycle time exceeding the poll interval
SNMP timeout/retry rate climbing
devices oscillating UP/UNKNOWN with no real outage
poller worker pool saturated

Investigate →
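The first signal above (cycle time exceeding the interval) can be estimated before it happens. The sketch below is a rough capacity model under a simplifying assumption: every worker polls one device at a time and each device takes about the same time. The function names and figures are illustrative, not taken from any particular poller.

```python
import math


def estimated_cycle_seconds(devices: int, seconds_per_device: float,
                            workers: int) -> float:
    """Wall-clock time for one full poll cycle with a fixed worker pool."""
    if workers < 1:
        raise ValueError("at least one worker is required")
    # Each worker handles its share of devices sequentially.
    return math.ceil(devices / workers) * seconds_per_device


def is_falling_behind(cycle_seconds: float, interval_seconds: float) -> bool:
    """True when a cycle cannot finish before the next one is due."""
    return cycle_seconds > interval_seconds
```

With 2000 devices at 1.5 seconds each and 8 workers, the estimate is 375 seconds, which does not fit a 300-second interval. Timeouts make it worse, because a device that times out costs the full timeout multiplied by the retry count rather than the normal response time.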

IMMINENT

BGP Established but stale

The BGP FSM still reads Established, so every up/down check passes — but the session stopped exchanging updates. Routes age out or freeze, traffic blackholes for a prefix, and the one signal everyone trusts is lying. State alone is not health; you have to watch prefix counts and update activity.

Established session with frozen prefix counts
no UPDATE activity for an unusually long window
reachability loss for prefixes the peer should advertise
hold-timer near expiry without a state change

Investigate →
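A sketch of what "state alone is not health" can look like as a check. The PeerSample fields, the labels, and both thresholds are assumptions for illustration; a stable peer can legitimately send no UPDATEs for a long time, so the silence threshold has to be set per peer.

```python
from dataclasses import dataclass


@dataclass
class PeerSample:
    state: str                        # BGP FSM state as reported by the device
    prefixes_received: int            # current accepted prefix count
    seconds_since_last_update: float  # time since the last UPDATE message


def classify_peer(sample: PeerSample, expected_min_prefixes: int,
                  max_update_silence_seconds: float) -> str:
    """Label a peer using more than its FSM state."""
    if sample.state != "Established":
        return "down"
    if sample.prefixes_received < expected_min_prefixes:
        return "stale"    # session is up but carrying too few routes
    if sample.seconds_since_last_update > max_update_silence_seconds:
        return "suspect"  # up and populated, but unusually quiet
    return "healthy"
```

A plain up/down check collapses the last three outcomes into a single green light; the separation is what makes the stale case visible.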

ACTIVE

The trap & syslog flood

A flapping link or a reconvergence event makes hundreds of devices emit traps and syslog at once. The UDP/162 receiver and the syslog parser saturate, drop events under burst, and the one record you needed — the root-cause linkDown — is the one that got dropped. The storm hides its own cause.

trap/syslog receipt rate spiking orders of magnitude
trap receiver UDP drops under burst
syslog parser backpressure / queue growth
correlated with an interface or STP flap

Investigate →

WATCHFUL

Fake spikes from counter rollover

A 32-bit interface counter wraps past 4.29 billion, or a device reboot resets it, and the naive delta calculation paints an impossible traffic spike. Alerts fire on traffic that never happened; capacity reports are poisoned. The fix is 64-bit ifHC counters and a sysUpTime discontinuity check, not a higher threshold.

instantaneous spike to an implausible rate
spike coincides with a counter reset or reboot
32-bit counters still in use on fast links
sysUpTime discontinuity around the spike

Investigate →
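The rate calculation described above can be sketched as follows. It assumes sysUpTime is sampled alongside the counter (in TimeTicks, hundredths of a second) and that at most one wrap occurred between samples; the function name is hypothetical.

```python
from typing import Optional


def counter_rate(prev_value: int, curr_value: int,
                 prev_uptime_ticks: int, curr_uptime_ticks: int,
                 counter_bits: int = 64) -> Optional[float]:
    """Per-second rate between two samples, or None if the interval is untrustworthy."""
    if curr_uptime_ticks <= prev_uptime_ticks:
        # sysUpTime stalled or went backwards: treat as a reboot or agent restart
        # and discard this interval instead of reporting a spike.
        return None
    elapsed_seconds = (curr_uptime_ticks - prev_uptime_ticks) / 100.0
    delta = curr_value - prev_value
    if delta < 0:
        # Counter wrapped once between samples (assumption: never more than once).
        delta += 2 ** counter_bits
    return delta / elapsed_seconds
```

Returning None on a discontinuity costs one missing data point, which is far cheaper than a false alert. sysUpTime itself wraps after roughly 497 days, and this sketch simply skips that one interval as well.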

IMMINENT

The vendor-API silent gap

A Meraki, Cato, or PAN-OS pull returns HTTP 200 with an empty or truncated payload — a throttle, an expired token scope, or a pagination bug — and the collector records 'success' while ingesting nothing. The dashboard goes flat and nobody is paged, because 200 is not an error.

HTTP 200 with empty/short payload from a vendor API
metrics flat-line for an API-sourced device set
429 rate or rate-limit-remaining near zero
no corresponding SNMP/flow gap for the same devices

Investigate →
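A sketch of treating "200 with nothing in it" as a failure rather than a success. This is not any vendor's API: the function name, the labels, and the expected_min_records parameter are assumptions, and the expected count would come from your own inventory of what the API normally returns.

```python
def classify_pull(status_code: int, records: list,
                  expected_min_records: int) -> str:
    """Classify one API pull by payload content, not just by HTTP status."""
    if status_code == 429:
        return "throttled"
    if status_code != 200:
        return "error"
    if len(records) == 0:
        return "empty"      # HTTP 200, but nothing was ingested
    if len(records) < expected_min_records:
        return "truncated"  # HTTP 200, but fewer records than inventory predicts
    return "ok"
```

Counting "empty" and "truncated" results as collection failures is what gets someone paged when the dashboard goes flat.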

Network monitoring maturity levels

Network observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much the network matters. Most production networks should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor: is the device reachable and is the link up? You won't learn why anything broke, but you'll learn that it broke before users phone in. Enough for lab and low-stakes segments.

Device reachability (ICMP / SNMP): Does the device answer a ping and an SNMP get?
Interface operational status: Is ifOperStatus up on the links that matter?
Device uptime / unexpected reboot: Did sysUpTime reset without your permission?
Collector process alive: Is the poller / flow / trap collector actually running?
Interface utilization on uplinks: Is a critical link near saturation?
Environment (temperature / PSU / fan): Is hardware about to fail?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production networks should target. Survival says something is wrong; operational says what. With this coverage your team can usually localize an incident: errors vs discards, poller health, flow receipt, trap/syslog rates, BGP state.

Interface errors and discards: ifInErrors/ifOutErrors and ifInDiscards/ifOutDiscards per link.
Interface utilization vs ifHighSpeed: Real percent-of-capacity, not raw bits.
SNMP poll success / timeout / retry: Is the poller keeping up with its cycle?
Flow UDP receipt rate + UdpRcvbufErrors: Are flow datagrams arriving and being kept?
Trap & syslog receipt rate and severity: Is the event pipeline flowing and not dropping?
BGP session state per peer: Established, and for how long?
Collector CPU, memory, and disk: Is the monitoring host itself healthy?
NTP offset on collectors and devices: Are timestamps aligned enough to correlate?
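For the utilization signal in this list, percent-of-capacity can be computed as in the sketch below. It relies on ifHighSpeed being reported in units of 1,000,000 bits per second; the function name is hypothetical, and the octet delta is assumed to come from 64-bit counters over a trusted interval.

```python
def utilization_percent(delta_octets: int, elapsed_seconds: float,
                        if_high_speed_mbps: int) -> float:
    """Link utilization as a percentage of ifHighSpeed."""
    if elapsed_seconds <= 0 or if_high_speed_mbps <= 0:
        raise ValueError("elapsed time and ifHighSpeed must be positive")
    bits_per_second = delta_octets * 8 / elapsed_seconds
    capacity_bits_per_second = if_high_speed_mbps * 1_000_000
    return 100.0 * bits_per_second / capacity_bits_per_second
```

For example, 37,500,000,000 octets in 60 seconds on an interface reporting ifHighSpeed 10000 works out to 50 percent.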

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches the slow bleeds: a BGP session established but stale, a flow template drifting toward desync, sampling rates that aren't normalized, a license inching toward expiry, FDB/ARP tables going stale. None pages you today; each becomes an incident in a month.

BGP prefix counts + UPDATE activity: Is an Established session actually carrying routes?
Flow template freshness / desync: Did an exporter reboot break decoding?
Sampling-rate normalization: Are sFlow/NetFlow totals scaled correctly?
License days-to-expiry per feature: Months of headroom before a feature silently disables?
Counter discontinuity / rollover: Are rate calcs anchored to sysUpTime?
FDB / ARP / topology freshness: Is endpoint position based on current data?
Vendor API 429 rate + payload validity: Are pull-mode collectors getting real data?
Per-core collector CPU + NIC RX drops: Is one RSS core silently dropping packets?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after an incident proved you needed them: RPKI validity, asymmetric-path detection, NAT/session-table headroom, SD-WAN data-plane vs control-plane, audit-log gap detection. Most teams don't need every one — add the ones your incident history demands.

RPKI validity + AS-path change alerts: Is a route being leaked or hijacked?
SD-WAN data-plane loss/latency per tunnel: Tunnel up, but is the path actually healthy?
NAT / session-table utilization: Headroom before connections start failing?
Asymmetric-routing detection: Are path/latency measurements even valid?
Audit-log gap detection: Did syslog/trap loss create a blind window?
STP topology-change rate: Is the Layer-2 fabric reconverging repeatedly?
Cloud + on-prem flow correlation: Does traffic stay visible across the boundary?
Flow export-to-ingest latency: How stale is the flow picture you're trusting?

Operating mistakes worth avoiding

The traps network teams keep falling into. Each has a clear fix that most teams only learn after an incident.

Not monitoring UdpRcvbufErrors on collectors

Flow, trap, and syslog data all arrive over UDP and drop silently when the socket buffer fills. UdpRcvbufErrors is the only direct signal, and it's the one counter most teams never graph. Alert on any nonzero increment rate and raise net.core.rmem_max to 16 MB or more at deployment. That setting is only a ceiling, so also confirm the collector actually requests a larger receive buffer.
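A sketch of checking what receive buffer a collector socket really gets, using the standard SO_RCVBUF socket option. The helper name is hypothetical. On Linux the kernel caps the request at net.core.rmem_max and reports back double the value it stored, so the number returned is not expected to equal the number requested.

```python
import socket


def request_receive_buffer(sock: socket.socket, requested_bytes: int) -> int:
    """Ask for a receive buffer size and return what the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Logging the returned value at collector startup makes a silently clamped buffer visible: if a 16 MB request reports back a few hundred kilobytes, the sysctl ceiling was never raised.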

Treating BGP session state as session health

An Established peer can stop carrying routes and your up/down check stays green. Watch prefix counts and UPDATE activity, not just FSM state: 'Established but stale' is a silent blackhole.

Leaving 32-bit counters on fast links

A 32-bit ifInOctets wraps in about 3.4 seconds at 10G line rate and paints a fake multi-billion-octet spike. Use 64-bit ifHC counters and check sysUpTime for discontinuity before trusting any rate.
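The wrap time follows directly from the counter width and the link speed. A small worked calculation, assuming the link runs at full line rate (the function name is illustrative):

```python
def seconds_to_wrap(counter_bits: int, link_bits_per_second: float) -> float:
    """Time for an octet counter to wrap at sustained line rate."""
    octets_until_wrap = 2 ** counter_bits
    return octets_until_wrap * 8 / link_bits_per_second
```

seconds_to_wrap(32, 10e9) is about 3.4 seconds and seconds_to_wrap(32, 1e9) is about 34 seconds, both shorter than a typical poll interval, so several wraps can hide between two samples. The 64-bit counter at 10G takes centuries to wrap.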

Not watching NTP drift on devices

Every cross-collector correlation — flow + syslog, BGP + flow drop, license windows — depends on aligned clocks. A few seconds of drift makes incident reconstruction impossible. Monitor offset on collectors and managed devices.

Invisible trap receiver drops

A trap receiver on UDP/162 can drop thousands of traps a minute during a link-flap storm without anyone noticing. Monitor the receiver's drop counter and socket-buffer fill, and rate-limit at the source; the trap you lose is usually the root cause.

NetFlow v9 / IPFIX template desync goes undetected

After an exporter reboot the collector can decode records against a stale template and silently produce garbage fields. Correlate decode-error rate with exporter reboots and alert on template-cache misses.
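One way to make desync detectable, sketched under stated assumptions: templates are cached per exporter address, source ID, and template ID, and a decrease in the uptime value carried in the NetFlow v9 export header is treated as a likely restart. The class and method names are hypothetical and not taken from any collector.

```python
from typing import Optional


class TemplateCache:
    """Templates keyed by (exporter address, source ID, template ID)."""

    def __init__(self) -> None:
        self._templates: dict[tuple[str, int, int], list[str]] = {}
        self._last_uptime_ms: dict[tuple[str, int], int] = {}
        self.misses = 0  # export this as a metric and alert on growth

    def observe_header(self, exporter: str, source_id: int,
                       sys_uptime_ms: int) -> bool:
        """Flush the exporter's templates if its uptime went backwards."""
        key = (exporter, source_id)
        previous = self._last_uptime_ms.get(key)
        self._last_uptime_ms[key] = sys_uptime_ms
        if previous is None or sys_uptime_ms >= previous:
            return False
        for cached in [k for k in self._templates if k[:2] == key]:
            del self._templates[cached]
        return True

    def store(self, exporter: str, source_id: int, template_id: int,
              fields: list[str]) -> None:
        self._templates[(exporter, source_id, template_id)] = list(fields)

    def lookup(self, exporter: str, source_id: int,
               template_id: int) -> Optional[list[str]]:
        """Return the template fields, counting a miss when none is cached."""
        fields = self._templates.get((exporter, source_id, template_id))
        if fields is None:
            self.misses += 1
        return fields
```

Flushing on a suspected restart trades a short gap, until the exporter resends its templates, for never decoding against a stale layout. The header uptime is a 32-bit millisecond value, so it also wraps periodically; the sketch treats that as a restart, which costs only an unnecessary flush.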

Skipping sampling-rate normalization

sFlow and sampled NetFlow report 1-in-N; if analytics don't multiply by N (and N changes per exporter) your totals are off by orders of magnitude. Normalize at ingest and alert when an exporter's sampling rate changes.
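A minimal sketch of normalizing at ingest and noticing a rate change. The function names are hypothetical, and the sampling rate is assumed to be known per exporter, either from the export itself or from configuration.

```python
def scale_sampled(sampled_bytes: int, sampled_packets: int,
                  sampling_rate: int) -> tuple[int, int]:
    """Scale 1-in-N sampled totals up to estimated real totals."""
    if sampling_rate < 1:
        raise ValueError("sampling rate must be 1 or greater")
    return sampled_bytes * sampling_rate, sampled_packets * sampling_rate


def rate_changed(last_rates: dict[str, int], exporter: str,
                 sampling_rate: int) -> bool:
    """Record the exporter's rate and report whether it differs from the last one."""
    previous = last_rates.get(exporter)
    last_rates[exporter] = sampling_rate
    return previous is not None and previous != sampling_rate
```

A single 1500-byte packet sampled at 1-in-1000 stands for roughly 1.5 MB of traffic, which is why a missed or changed N distorts totals so badly.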

Monitoring license expiry only after a feature dies

Many platforms silently disable features — flow export, advanced routing, threat prevention — the day a license lapses, with no SNMP trap. Track days-to-expiry per feature and alert weeks ahead, not on the outage.
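Tracking days-to-expiry is simple date arithmetic once the expiry dates are collected. The sketch below assumes they are already available as a mapping of feature name to date; the function names, feature names, and the 45-day default are illustrative.

```python
from datetime import date


def days_to_expiry(expires_on: date, today: date) -> int:
    """Whole days remaining; negative once the license has lapsed."""
    return (expires_on - today).days


def expiring_features(licenses: dict[str, date], today: date,
                      warn_days: int = 45) -> list[str]:
    """Features whose license lapses within the warning window, sorted by name."""
    return sorted(feature for feature, expires_on in licenses.items()
                  if days_to_expiry(expires_on, today) <= warn_days)
```

Passing today explicitly keeps the check easy to test and makes it clear which clock the alert is based on.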

Network runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

Start here

▸ Network monitoring checklist →

NetFlow, sFlow & IPFIX

SNMP polling and collection

BGP and routing health

SNMP traps and syslog

Layer 2, STP, and topology

Interfaces, errors, and capacity

SD-WAN and overlay tunnels

Vendor APIs and cloud networking

Collector and pipeline health

Device health and environment

Licensing and entitlements

▸ License expiry silently disabling features →

Network security and integrity

WHERE TO GO NEXT

Setting up network monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.
