Interface flapping: link up/down storms and their blast radius

Interface flapping occurs when a network interface oscillates rapidly between up and down states. Each transition generates a linkDown or linkUp trap, syslog entries, and an STP topology change notification. At low rates this is operational noise. At high rates it becomes a multi-layer failure: the trap receiver overflows, the syslog pipeline saturates, STP reconvergence flushes MAC tables across the VLAN, and the monitoring platform reports misleading availability because the poll interval is longer than the flap period.

The core signal is ifOperStatus churn from IF-MIB. A single bad SFP, dirty fiber, duplex mismatch, or failing end-station NIC can produce dozens of transitions per minute. The blast radius extends beyond the physical port: every host in the same broadcast domain experiences transient flooding during STP reconvergence, and the trap and syslog flood can mask other critical events at the same receiver. When the UDP socket buffer overflows, the kernel drops datagrams without regard to their content, so the one trap that matters (often the root-cause hardware alarm) can be lost among the flap traps.

What this means

IF-MIB defines ifOperStatus at .1.3.6.1.2.1.2.2.1.8 and ifAdminStatus at .1.3.6.1.2.1.2.2.1.7. When ifAdminStatus is up but ifOperStatus oscillates, the cause is physical-layer or link-layer instability, not administrative action.

RFC 2863 defines linkDown and linkUp traps that carry three varbinds: ifIndex, ifAdminStatus, and ifOperStatus. Each flap generates one trap pair. On platforms with errdisable detection (Cisco IOS, NX-OS), exceeding the platform-specific flap threshold places the port into errdisabled state, which stops the flapping but also removes the port from service until manual intervention or configured recovery.

flowchart TD
    A[Flapping interface] --> B[linkDown/linkUp traps]
    A --> C[STP topology changes]
    A --> D[Syslog entries]
    B --> E[Trap receiver overflow]
    D --> F[Syslog backpressure]
    C --> G[MAC table flush]
    G --> H[Broadcast flooding]
    E --> I[Root-cause trap lost]
    F --> I

Each STP topology change notification causes a MAC address table flush across the broadcast domain. Until the MAC table repopulates, the switch floods unknown unicast to all ports in the VLAN, consuming bandwidth and exposing traffic to ports that should not receive it.

One further complication: many platforms log link up/down events at syslog severity “informational,” not “warning” or “error.” SIEM correlation rules and severity-based alerting under-prioritize these events. Correlate linkDown/linkUp trap frequency explicitly rather than relying on severity filtering.
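
A minimal sketch of that explicit correlation, assuming a hypothetical pipe-delimited trap log of the form timestamp|host|trapname|ifIndex. The layout and the function name count_flap_pairs are illustrative assumptions, not snmptrapd defaults. It counts completed linkDown/linkUp pairs per ifIndex over whatever window the input covers:

```shell
# Count linkDown/linkUp pairs per ifIndex from a trap log on stdin.
# ASSUMPTION: hypothetical log format "timestamp|host|trapname|ifIndex".
count_flap_pairs() {
  awk -F'|' '
    $3 == "linkDown" { down[$4]++; seen[$4] = 1 }
    $3 == "linkUp"   { up[$4]++;   seen[$4] = 1 }
    END {
      for (i in seen) {
        # A pair needs one linkDown and one linkUp, so take the smaller count.
        pairs = (down[i] < up[i]) ? down[i] : up[i]
        print i, pairs + 0
      }
    }
  ' | sort -k2,2nr -k1,1n   # busiest interface first
}
```

Feeding it one minute of log lines gives a per-interface pair count that can be compared against a pairs-per-minute threshold, independent of the syslog severity the platform assigned.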

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bad cable, SFP, or dirty fiber | ifInErrors incrementing alongside flapping; single interface affected | ifInErrors rate on the flapping port |
| Duplex mismatch | Very high error rates (0.1 to 1%); late collisions in error breakdown | Speed and duplex settings on both ends of the link |
| Speed or auto-negotiation failure | Interface comes up briefly then drops; low error counts | Negotiated speed on both ends; hard-code if auto-neg is unreliable |
| End-station NIC bug or power fluctuation | Flapping on an access port; no physical-layer errors on the switch | Host-side driver logs; correlate timing with host events |
| STP instability | Multiple ports transitioning; root bridge identity may change | Root bridge priority; dot1dStpTopChanges rate |
| Unidirectional fiber link | Port appears up on one side; receiving side loses signal silently | STP Loop Guard status; UDLD where available |

Quick checks

All commands below are read-only and safe to run during an active incident. Replace <community>, <device>, and <ifIndex> with your values. Prefer SNMPv3 credentials in production.

# Check current ifOperStatus for a specific interface
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>

# Poll ifOperStatus multiple times to catch rapid oscillation
for i in $(seq 1 10); do
  snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>
  sleep 1
done

# Check ifAdminStatus to distinguish admin-down from oper-down
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.7.<ifIndex>

# Check input errors on the flapping interface
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14.<ifIndex>

# Check STP topology change count
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.17.2.4.0

# Count linkDown/linkUp traps by interface from the trap log.
# Adjust field number and delimiter to match your trap log format.
awk -F'|' '{print $4}' /var/log/snmptrapd.log | sort | uniq -c | sort -rn

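# Observe incoming traps in real time (all trap types, not only linkDown).
# Replace eth0 with the receiver's interface; capture stops after 100 packets.
tcpdump -i eth0 -nn -c 100 'udp port 162'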

# Check for UDP socket buffer drops on the trap receiver (RcvbufErrors column)
grep '^Udp:' /proc/net/snmp

# Check syslog volume from the affected device
grep '<device>' /var/log/network-devices/*.log | wc -l
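
The polling loop above prints one status per iteration, and the transitions still have to be counted by eye. A small sketch of how that count could be automated follows. The function name count_transitions is illustrative, and the live-use comment assumes net-snmp's -Oqve output options, which print the bare numeric value.

```shell
# Count state changes in a stream of ifOperStatus values, one value per line.
count_transitions() {
  awk 'NR > 1 && $0 != prev { n++ } { prev = $0 } END { print n + 0 }'
}

# Canned samples (1 = up, 2 = down): up, down, up, up, down
printf '1\n2\n1\n1\n2\n' | count_transitions   # prints 3

# Live use (read-only; placeholders as in the commands above):
# for i in $(seq 1 60); do
#   snmpget -v2c -c <community> -Oqve <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>
#   sleep 1
# done | count_transitions
```

Because the loop samples once per second, transitions faster than that are still undercounted; the result is a lower bound on the real flap rate.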

How to diagnose it

  1. Identify the flapping interface. Parse the trap log for linkDown/linkUp pairs. Each trap carries ifIndex as a varbind. The interface with the highest pair count is the likely culprit. If trap logging is unavailable, poll ifOperStatus for all interfaces repeatedly and look for oscillation.

  2. Confirm the flap rate. Poll ifOperStatus on the identified interface at 1-second intervals for 30 to 60 seconds. Count the transitions. A common operational threshold for a link-flap event is more than 3 transitions per minute. More than 5 linkDown/linkUp trap pairs per minute on any interface warrants a ticket.

  3. Check for physical-layer errors. Poll ifInErrors at .1.3.6.1.2.1.2.2.1.14 and ifOutErrors at .1.3.6.1.2.1.2.2.1.20 for the interface index. Incrementing errors confirm a physical-layer fault. High error rates (0.1 to 1% of packets) strongly suggest duplex mismatch. Low but nonzero rates (1e-6 to 1e-4) suggest a single bad fiber strand or dirty connector.

  4. Assess the L2 blast radius. Poll dot1dStpTopChanges at .1.3.6.1.2.1.17.2.4.0. A rising count during the flap confirms STP is reacting. A topology change rate above 5 per minute indicates instability. A TCN burst above 1 per second indicates a link-flap cascade. Check whether the root bridge identity has changed unexpectedly, which would indicate STP priority misconfiguration or a root bridge failure.

  5. Check trap receiver health. Examine Udp_RcvbufErrors, which is the RcvbufErrors column of the Udp: rows in /proc/net/snmp. Any increase during the flap means the receiver is dropping datagrams, and the root-cause trap may already be lost. This is the most under-monitored signal during a flap incident.

  6. Inspect the physical layer directly. On the device, use vendor CLI to check transceiver diagnostics, interface error breakdowns, and negotiated speed and duplex. On Cisco NCS and ISR platforms, show controllers optics exposes SFP/QSFP detection, laser state, and power levels. Look for RX-LOS (receive loss of signal), which indicates the problem is toward the peer, versus TX-LOS, which indicates the local transceiver is not transmitting.

  7. Check both ends of the link. Duplex mismatch and auto-negotiation failures require inspecting both sides. One end may report clean operation while the other accumulates errors. This is especially common when one side is hard-coded and the other is set to auto-negotiate.
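
The error rate bands in step 3 can be applied mechanically to counter deltas from two polls. The sketch below is one way to do that under stated assumptions: the function name and labels are hypothetical, the packet delta is whatever input packet total you collect for the same window, rates above 1% are treated the same as the 0.1 to 1% band, and rates outside the two bands named in step 3 are reported as unclassified rather than guessed at.

```shell
# Classify an input error rate from counter deltas over one polling window.
# Args: <delta_ifInErrors> <delta_input_packets>
classify_error_rate() {
  awk -v e="$1" -v p="$2" 'BEGIN {
    if (p <= 0) { print "no-traffic"; exit }
    r = e / p
    if (r == 0)
      print "clean"
    else if (r >= 0.001)                       # 0.1% and above (assumption: no upper cap)
      print "duplex-mismatch-suspected"
    else if (r >= 0.000001 && r <= 0.0001)     # 1e-6 to 1e-4
      print "bad-fiber-or-connector-suspected"
    else
      print "unclassified"
  }'
}

classify_error_rate 50 10000     # 0.5% of packets
classify_error_rate 1 100000     # 1e-5 of packets
```

The labels are hints for where to look first, not verdicts; both ends of the link still need to be checked.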

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ifOperStatus transitions per interface | Core indicator of flapping; each transition is one up or down event | More than 3 transitions per minute on any interface |
| linkDown/linkUp trap pair rate | Push-based detection; faster than polling | More than 5 pairs per minute on any interface |
| STP topology change count (dot1dStpTopChanges) | Measures L2-wide blast radius beyond the port itself | TCN burst more than 1 per second |
| ifInErrors on the flapping port | Distinguishes physical-layer faults from logical issues | Any sustained increment; high rate suggests duplex mismatch |
| Udp_RcvbufErrors on trap receiver | Detects receiver overflow that hides root-cause traps | Any nonzero increment during the flap window |
| Syslog rate from affected device | Measures the noise burden on the logging pipeline | Rate more than 5x the rolling 1-hour average |
| Root bridge identity | Detects STP instability extending beyond a single port | Identity changed unexpectedly without a change ticket |
| Topology inference confidence | Flapping degrades endpoint positioning accuracy | Confidence dropping for endpoints in the affected VLAN |
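
The receive-buffer signal is cheap to sample on a Linux trap receiver. A rough sketch follows, assuming the usual /proc/net/snmp layout in which a header Udp: row is followed by a values Udp: row; the function name udp_rcvbuf_errors is illustrative. It locates the column by name instead of by position, since the set of columns can differ between kernels.

```shell
# Print the RcvbufErrors counter from /proc/net/snmp-formatted text on stdin.
udp_rcvbuf_errors() {
  awk '
    $1 == "Udp:" && !col {
      # First Udp: row is the header; find the RcvbufErrors column by name.
      for (i = 2; i <= NF; i++) if ($i == "RcvbufErrors") col = i
      next
    }
    $1 == "Udp:" && col { print $col; exit }
  '
}

# Sample across the flap window on the receiver (read-only):
# before=$(udp_rcvbuf_errors < /proc/net/snmp); sleep 60
# after=$(udp_rcvbuf_errors < /proc/net/snmp)
# echo "datagrams dropped during window: $((after - before))"
```

The counter is host-wide, so a nonzero delta shows that some UDP socket overflowed; confirming it was the trap listener takes a per-socket view.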

Fixes

Bad cable, SFP, or dirty fiber

Warning: shutting down the interface is disruptive. If there is no redundant path, this causes downtime for connected hosts.

The immediate priority is to stop the cascade. Administratively disable the interface (shutdown on Cisco platforms) to halt the trap and syslog flood and the STP reconvergence. Then replace the cable, clean or replace the fiber, or swap the SFP with a known-good optic. After replacement, monitor ifInErrors to confirm the rate returns to zero before re-enabling the port.

Tradeoff: shutting down the port stops the L2-wide disruption but removes connectivity for everything behind that port. If the port serves a single critical host with no redundancy, coordinate with the application owner before disabling.

Duplex mismatch

Duplex mismatch is the leading preventable cause of interface flapping. It produces late collisions that feed into the flapping cycle. Check speed and duplex on both ends of the link. Auto-negotiation failures are common on copper runs, especially when one side is hard-coded and the other is set to auto. Hard-code speed and duplex on both ends to eliminate negotiation ambiguity. Verify ifInErrors drops to zero after the change.

Tradeoff: hard-coding removes auto-negotiation flexibility. If the remote device is later replaced with one that does not match the hard-coded settings, the link will fail or operate at reduced performance.

End-station NIC bug or power fluctuation

When the switch side shows no physical-layer errors but the port keeps flapping, the problem is likely at the host. Check host-side driver logs for spurious link state messages. The ixgbe driver on RHEL 7 and RHEL 8 has a known issue where it logs bogus “NIC Link is Up” and “NIC Link is Down” messages even with no cable connected, flooding kernel logs without a genuine link event. If the host has redundant uplinks, shut down the affected port and rely on the backup link while investigating the driver or power issue.

Tradeoff: relying on the backup link reduces available bandwidth during the investigation. Updating the NIC driver or firmware may require a host reboot.

STP instability

If multiple ports are transitioning and the root bridge identity has changed, STP priority misconfiguration is likely. Verify the root bridge is the intended device. If a device with a numerically lower bridge priority value has joined the network, it wins the root election and traffic may flow through suboptimal paths. Correct the priority on the intended root bridge.

Enable STP Loop Guard on blocking ports. Without it, a unidirectional fiber failure (one strand cut) can cause a port to silently transition from blocking to forwarding, creating a broadcast storm. Loop Guard detects the absence of BPDUs on a blocking port and places it into loop-inconsistent state instead of allowing it to forward.

Legacy STP takes 30 to 50 seconds to reconverge. If your environment cannot tolerate this downtime window, consider RSTP, which converges faster but still produces topology change bursts during flapping events.

Tradeoff: Loop Guard may block ports that could otherwise forward traffic if BPDUs are delayed but not actually lost. RSTP is not compatible with all legacy switch platforms.
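
To check whether topology changes have settled after these changes, two samples of dot1dStpTopChanges (.1.3.6.1.2.1.17.2.4.0) can be turned into a rate and compared with the thresholds from the diagnosis steps (above 5 per minute, above 1 per second). This is a sketch only; the function name and the output labels are hypothetical.

```shell
# Classify an STP topology change rate from two dot1dStpTopChanges samples.
# Args: <count_before> <count_after> <seconds_between_samples>
classify_tcn_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    # A shrinking counter means a reset or wrap; do not guess a rate.
    if (s <= 0 || b < a) { print "invalid"; exit }
    per_sec = (b - a) / s
    per_min = per_sec * 60
    if (per_sec > 1)      print "cascade"
    else if (per_min > 5) print "unstable"
    else                  print "ok"
  }'
}

classify_tcn_rate 100 112 60   # 12 changes in 60 s
```

The result is an average over the sampling window, so a short burst can hide inside a long window; a window of a few seconds is more sensitive to per-second bursts.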

errdisable recovery

On platforms that support errdisable detection (Cisco IOS, NX-OS), a port that flaps more than the platform threshold enters errdisabled state. The port is shut down, which stops the cascade but also removes the port from service.

Recovery is disabled by default. Enable it explicitly on the device:

errdisable recovery cause link-flap

The default recovery interval is 300 seconds. Without recovery enabled, the port stays disabled until manually re-enabled with shutdown followed by no shutdown.

Tradeoff: automatic recovery means the port will try to come back up after the interval. If the underlying fault persists, the port will flap again and re-enter errdisabled state, producing another burst of traps and syslog. Tune the recovery interval to balance stability against unnecessary downtime.

Prevention

  • Monitor transition rate, not just current state. A single poll sees one snapshot of ifOperStatus and misses the churn between polls. Alert on transition count per minute.
  • Supplement polling with trap-based alerting. ifOperStatus polling has latency bounded by the poll interval. linkDown/linkUp traps provide sub-second detection. A 60-second poll interval can miss a 15-second flap cycle entirely.
  • Alert on ifInErrors proactively. Rising errors on a stable interface predict future flapping. Catch physical-layer degradation before it triggers the cascade.
  • Enable errdisable recovery for link-flap. This prevents permanent port shutdown without manual intervention. Tune the recovery interval based on your tolerance for downtime versus repeated flap cycles.
  • Standardize auto-negotiation policy. Document which links are hard-coded and which use auto-negotiation. Most duplex mismatches arise from inconsistent configuration between ends of a link.