Interface flapping: link up/down storms and their blast radius
Interface flapping occurs when a network interface oscillates rapidly between up and down states. Each flap generates a linkDown/linkUp trap pair, syslog entries, and STP topology change notifications. At low rates this is operational noise. At high rates it becomes a multi-layer failure: the trap receiver overflows, the syslog pipeline saturates, STP reconvergence flushes MAC tables across the VLAN, and the monitoring platform reports misleading availability because the poll interval is longer than the flap period.
The core signal is ifOperStatus churn from IF-MIB. A single bad SFP, dirty fiber, duplex mismatch, or failing end-station NIC can produce dozens of transitions per minute. The blast radius extends beyond the physical port: every host in the same broadcast domain experiences transient flooding during STP reconvergence, and the trap and syslog flood can mask other critical events at the same receiver. When the UDP socket buffer overflows, the kernel drops datagrams regardless of content, so a lone high-priority trap (often the root-cause hardware alarm) is easily lost among hundreds of link traps.
What this means
IF-MIB defines ifOperStatus at .1.3.6.1.2.1.2.2.1.8 and ifAdminStatus at .1.3.6.1.2.1.2.2.1.7. When ifAdminStatus is up but ifOperStatus oscillates, the cause is physical-layer or link-layer instability, not administrative action.
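The admin/oper distinction above can be sketched as a small shell helper. This is a minimal illustration, not part of any tool: the function name `classify_link_state` and its output labels are hypothetical, and it assumes the two values are passed as the integers defined in IF-MIB (1 = up, 2 = down).

```shell
# classify_link_state ADMIN OPER
# ADMIN and OPER are the integer values of ifAdminStatus and ifOperStatus
# (IF-MIB: 1 = up, 2 = down; other values such as 3 = testing also exist).
# Function name and labels are illustrative assumptions.
classify_link_state() {
    admin="$1"
    oper="$2"
    if [ "$admin" = "2" ]; then
        echo "admin-down"    # operator disabled the port; not a fault
    elif [ "$admin" = "1" ] && [ "$oper" = "1" ]; then
        echo "up"
    elif [ "$admin" = "1" ] && [ "$oper" = "2" ]; then
        echo "oper-down"     # admin up but link down: suspect physical or link layer
    else
        echo "other"         # testing, dormant, notPresent, and similar states
    fi
}

# Hypothetical usage: fetch both values numerically with Net-SNMP
# (-Oqve prints only the value, with enums as integers), then classify.
#   classify_link_state "$admin_value" "$oper_value"
```

A port that alternates between "up" and "oper-down" across successive polls, without ever reporting "admin-down", matches the instability pattern described above.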
RFC 2863 defines linkDown and linkUp traps that carry three varbinds: ifIndex, ifAdminStatus, and ifOperStatus. Each flap generates one trap pair. On platforms with errdisable detection (Cisco IOS, NX-OS), exceeding the platform-specific flap threshold places the port into errdisabled state, which stops the flapping but also removes the port from service until manual intervention or configured recovery.
flowchart TD
A[Flapping interface] --> B[linkDown/linkUp traps]
A --> C[STP topology changes]
A --> D[Syslog entries]
B --> E[Trap receiver overflow]
D --> F[Syslog backpressure]
C --> G[MAC table flush]
G --> H[Broadcast flooding]
E --> I[Root-cause trap lost]
F --> I

Each STP topology change notification causes a MAC address table flush across the broadcast domain. Until the MAC table repopulates, the switch floods unknown unicast to all ports in the VLAN, consuming bandwidth and exposing traffic to ports that should not receive it.
One further complication: many platforms log link up/down events at syslog severity “informational,” not “warning” or “error.” SIEM correlation rules and severity-based alerting under-prioritize these events. Correlate linkDown/linkUp trap frequency explicitly rather than relying on severity filtering.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Bad cable, SFP, or dirty fiber | ifInErrors incrementing alongside flapping; single interface affected | ifInErrors rate on the flapping port |
| Duplex mismatch | Very high error rates (0.1 to 1%); late collisions in error breakdown | Speed and duplex settings on both ends of the link |
| Speed or auto-negotiation failure | Interface comes up briefly then drops; low error counts | Negotiated speed on both ends; hard-code if auto-neg is unreliable |
| End-station NIC bug or power fluctuation | Flapping on an access port; no physical-layer errors on the switch | Host-side driver logs; correlate timing with host events |
| STP instability | Multiple ports transitioning; root bridge identity may change | Root bridge priority; dot1dStpTopChanges rate |
| Unidirectional fiber link | Port appears up on one side; receiving side loses signal silently | STP Loop Guard status; UDLD where available |
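The error-rate bands in this article (0.1 to 1% for duplex mismatch, 1e-6 to 1e-4 for a bad strand or dirty connector) can be turned into a rough triage helper. The sketch below is an assumption-laden illustration: the function name and labels are hypothetical, the inputs are expected to be error and packet deltas over the same polling window, and the unlabelled gap between 1e-4 and 1e-3 is arbitrarily grouped with the media-fault band.

```shell
# classify_error_ratio ERROR_DELTA PACKET_DELTA
# Both arguments are counter deltas measured over the same interval,
# for example ifInErrors and the input packet counters of the port.
# Band boundaries follow the figures quoted in this article; the handling
# of the 1e-4 to 1e-3 gap is an assumption.
classify_error_ratio() {
    awk -v errs="$1" -v pkts="$2" 'BEGIN {
        if (pkts <= 0) { print "no-traffic"; exit }
        ratio = errs / pkts
        if (ratio >= 0.001)          print "duplex-mismatch-suspected"
        else if (ratio >= 0.000001)  print "media-fault-suspected"
        else if (ratio > 0)          print "negligible"
        else                         print "clean"
    }'
}
```

The output is only a hint about where to look first; the "First thing to check" column above remains the authoritative next step.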
Quick checks
All commands below are read-only and safe to run during an active incident. Replace <community>, <device>, and <ifIndex> with your values. Prefer SNMPv3 credentials in production.
# Check current ifOperStatus for a specific interface
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>
# Poll ifOperStatus multiple times to catch rapid oscillation
for i in $(seq 1 10); do
    snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>
    sleep 1
done
# Check ifAdminStatus to distinguish admin-down from oper-down
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.7.<ifIndex>
# Check input errors on the flapping interface
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14.<ifIndex>
# Check STP topology change count
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.17.2.4.0
# Count linkDown/linkUp traps by interface from the trap log.
# Adjust field number and delimiter to match your trap log format.
awk -F'|' '{print $4}' /var/log/snmptrapd.log | sort | uniq -c | sort -rn
# Observe incoming traps (all SNMP notifications on UDP 162) in real time.
# Stops after 100 packets; replace eth0 with the receiver's interface.
tcpdump -i eth0 -nn 'udp port 162' -c 100
# Check for UDP socket buffer drops on the trap receiver.
# The first Udp: line is the header; read the RcvbufErrors column.
grep '^Udp:' /proc/net/snmp
# Check syslog volume from the affected device
grep '<device>' /var/log/network-devices/*.log | wc -l
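Reading the right column out of the two `Udp:` lines by eye is error-prone during an incident. The following sketch extracts the RcvbufErrors value by header name; the function name `udp_rcvbuf_errors` is hypothetical, and it assumes the standard Linux layout of a header line followed by a value line.

```shell
# udp_rcvbuf_errors [FILE]
# Prints the RcvbufErrors value from the Udp: rows of FILE
# (defaults to /proc/net/snmp). The column is located by header name
# rather than by position, since the column set varies between kernels.
udp_rcvbuf_errors() {
    awk '$1 == "Udp:" {
        if (!seen) {
            # First Udp: line is the header; find the column index.
            for (i = 2; i <= NF; i++) if ($i == "RcvbufErrors") col = i
            seen = 1
            next
        }
        if (col) print $col
    }' "${1:-/proc/net/snmp}"
}
```

Running it twice a few seconds apart and comparing the two numbers shows whether the receiver is dropping datagrams right now, as opposed to at some point since boot.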
How to diagnose it
Identify the flapping interface. Parse the trap log for linkDown/linkUp pairs. Each trap carries ifIndex as a varbind. The interface with the highest pair count is the likely culprit. If trap logging is unavailable, poll ifOperStatus for all interfaces repeatedly and look for oscillation.
Confirm the flap rate. Poll ifOperStatus on the identified interface at 1-second intervals for 30 to 60 seconds. Count the transitions. A common operational threshold for a link-flap event is more than 3 transitions per minute. More than 5 linkDown/linkUp trap pairs per minute on any interface warrants a ticket.
Check for physical-layer errors. Poll ifInErrors at .1.3.6.1.2.1.2.2.1.14 and ifOutErrors at .1.3.6.1.2.1.2.2.1.20 for the interface index. Incrementing errors confirm a physical-layer fault. High error rates (0.1 to 1% of packets) strongly suggest duplex mismatch. Low but nonzero rates (1e-6 to 1e-4) suggest a single bad fiber strand or dirty connector.
Assess the L2 blast radius. Poll dot1dStpTopChanges at .1.3.6.1.2.1.17.2.4.0. A rising count during the flap confirms STP is reacting. A topology change rate above 5 per minute indicates instability. A TCN burst above 1 per second indicates a link-flap cascade. Check whether the root bridge identity has changed unexpectedly, which would indicate STP priority misconfiguration or a root bridge failure.
Check trap receiver health. Examine Udp_RcvbufErrors (the RcvbufErrors column of the Udp: rows) in /proc/net/snmp. Any nonzero increment during the flap means the receiver is dropping datagrams. The root-cause trap may already be lost. This is the most under-monitored signal during a flap incident.
Inspect the physical layer directly. On the device, use vendor CLI to check transceiver diagnostics, interface error breakdowns, and negotiated speed and duplex. On Cisco NCS and ISR platforms, show controllers optics exposes SFP/QSFP detection, laser state, and power levels. Look for RX-LOS (receive loss of signal), which indicates the problem is toward the peer, versus TX-LOS, which indicates the local transceiver is not transmitting.
Check both ends of the link. Duplex mismatch and auto-negotiation failures require inspecting both sides. One end may report clean operation while the other accumulates errors. This is especially common when one side is hard-coded and the other is set to auto-negotiate.
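The flap-rate step above amounts to counting state changes in a series of polled samples. A minimal sketch follows, assuming one ifOperStatus value per line on standard input; the function names `count_transitions` and `transitions_per_minute` are hypothetical. Note that polling at 1-second intervals undercounts flaps that complete within a second.

```shell
# count_transitions
# Reads one status value per line on stdin and prints how many times
# consecutive samples differ. Blank lines are ignored.
count_transitions() {
    awk 'NF { if (have && $1 != prev) n++; prev = $1; have = 1 }
         END { print n + 0 }'
}

# transitions_per_minute COUNT WINDOW_SECONDS
# Scales a transition count observed over WINDOW_SECONDS to a per-minute rate.
transitions_per_minute() {
    awk -v n="$1" -v s="$2" 'BEGIN {
        if (s <= 0) exit 1
        printf "%.1f\n", n * 60 / s
    }'
}

# Hypothetical usage (placeholders as in the quick checks; -Oqve makes
# Net-SNMP print only the numeric value):
#   for i in $(seq 1 30); do
#       snmpget -v2c -c <community> -Oqve <device> .1.3.6.1.2.1.2.2.1.8.<ifIndex>
#       sleep 1
#   done | count_transitions
```

Comparing the resulting per-minute figure with the 3-transitions-per-minute threshold gives a quick yes/no on whether the port qualifies as flapping.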
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ifOperStatus transitions per interface | Core indicator of flapping; each transition is one up or down event | More than 3 transitions per minute on any interface |
| linkDown/linkUp trap pair rate | Push-based detection; faster than polling | More than 5 pairs per minute on any interface |
| STP topology change count (dot1dStpTopChanges) | Measures L2-wide blast radius beyond the port itself | TCN burst more than 1 per second |
| ifInErrors on the flapping port | Distinguishes physical-layer faults from logical issues | Any sustained increment; high rate suggests duplex mismatch |
| Udp_RcvbufErrors on trap receiver | Detects receiver overflow that hides root-cause traps | Any nonzero increment during the flap window |
| Syslog rate from affected device | Measures the noise burden on the logging pipeline | Rate more than 5x the rolling 1-hour average |
| Root bridge identity | Detects STP instability extending beyond a single port | Identity changed unexpectedly without a change ticket |
| Topology inference confidence | Flapping degrades endpoint positioning accuracy | Confidence dropping for endpoints in the affected VLAN |
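Several signals in this table (dot1dStpTopChanges, ifInErrors) are cumulative counters, so the warning signs apply to the rate of change between two polls rather than to the raw value. The sketch below computes a per-minute rate from two samples and assumes a 32-bit counter that may wrap; the function name `counter_rate_per_min` is hypothetical.

```shell
# counter_rate_per_min PREVIOUS CURRENT INTERVAL_SECONDS
# Computes the per-minute rate of a Counter32 between two polls.
# A negative delta is treated as a single wrap past 2^32 - 1. It could
# also mean the device rebooted and the counter reset; this sketch
# cannot tell the two cases apart.
counter_rate_per_min() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        if (s <= 0) { print "invalid-interval"; exit 1 }
        d = b - a
        if (d < 0) d += 4294967296
        printf "%.2f\n", d * 60 / s
    }'
}
```

For example, two dot1dStpTopChanges samples taken 60 seconds apart that differ by 10 yield 10.00 per minute, which is above the 5-per-minute instability threshold used in the diagnosis steps.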
Fixes
Bad cable, SFP, or dirty fiber
Warning: shutting down the interface is disruptive. If there is no redundant path, this causes downtime for connected hosts.
The immediate priority is to stop the cascade. Administratively disable the interface (shutdown on Cisco platforms) to halt the trap and syslog flood and the STP reconvergence. Then replace the cable, clean or replace the fiber, or swap the SFP with a known-good optic. After replacement, monitor ifInErrors to confirm the rate returns to zero before re-enabling the port.
Tradeoff: shutting down the port stops the L2-wide disruption but removes connectivity for everything behind that port. If the port serves a single critical host with no redundancy, coordinate with the application owner before disabling.
Duplex mismatch
Duplex mismatch is the leading preventable cause of interface flapping. It produces late collisions that feed into the flapping cycle. Check speed and duplex on both ends of the link. Auto-negotiation failures are common on copper runs, especially when one side is hard-coded and the other is set to auto. Hard-code speed and duplex on both ends to eliminate negotiation ambiguity. Verify ifInErrors drops to zero after the change.
Tradeoff: hard-coding removes auto-negotiation flexibility. If the remote device is later replaced with one that does not match the hardcoded settings, the link will fail or operate at reduced performance.
End-station NIC bug or power fluctuation
When the switch side shows no physical-layer errors but the port keeps flapping, the problem is likely at the host. Check host-side driver logs for spurious link state messages. The ixgbe driver on RHEL 7 and RHEL 8 has a known issue where it logs bogus “NIC Link is Up” and “NIC Link is Down” messages even with no cable connected, flooding kernel logs without a genuine link event. If the host has redundant uplinks, shut down the affected port and rely on the backup link while investigating the driver or power issue.
Tradeoff: relying on the backup link reduces available bandwidth during the investigation. Updating the NIC driver or firmware may require a host reboot.
STP instability
If multiple ports are transitioning and the root bridge identity has changed, STP priority misconfiguration is likely. Verify the root bridge is the intended device. If a device with a numerically lower bridge priority value has joined the network, it wins the root election and traffic may flow through suboptimal paths. Correct the priority on the intended root bridge.
Enable STP Loop Guard on blocking ports. Without it, a unidirectional fiber failure (one strand cut) can cause a port to silently transition from blocking to forwarding, creating a broadcast storm. Loop Guard detects the absence of BPDUs on a blocking port and places it into loop-inconsistent state instead of allowing it to forward.
Legacy STP takes 30 to 50 seconds to reconverge. If your environment cannot tolerate this downtime window, consider RSTP, which converges faster but still produces topology change bursts during flapping events.
Tradeoff: Loop Guard may block ports that could otherwise forward traffic if BPDUs are delayed but not actually lost. RSTP is not compatible with all legacy switch platforms.
errdisable recovery
On platforms that support errdisable detection (Cisco IOS, NX-OS), a port that flaps more than the platform threshold enters errdisabled state. The port is shut down, which stops the cascade but also removes the port from service.
Recovery is disabled by default. Enable it explicitly on the device:
errdisable recovery cause link-flap
The default recovery interval is 300 seconds. Without recovery enabled, the port stays disabled until manually re-enabled with shutdown followed by no shutdown.
Tradeoff: automatic recovery means the port will try to come back up after the interval. If the underlying fault persists, the port will flap again and re-enter errdisabled state, producing another burst of traps and syslog. Tune the recovery interval to balance stability against unnecessary downtime.
Prevention
- Monitor transition rate, not just current state. A single poll sees one snapshot of ifOperStatus and misses the churn between polls. Alert on transition count per minute.
- Supplement polling with trap-based alerting. ifOperStatus polling has latency bounded by the poll interval. linkDown/linkUp traps provide sub-second detection. A 60-second poll interval can miss a 15-second flap cycle entirely.
- Alert on ifInErrors proactively. Rising errors on a stable interface predict future flapping. Catch physical-layer degradation before it triggers the cascade.
- Enable errdisable recovery for link-flap. This prevents permanent port shutdown without manual intervention. Tune the recovery interval based on your tolerance for downtime versus repeated flap cycles.
- Standardize auto-negotiation policy. Document which links are hard-coded and which use auto-negotiation. Most duplex mismatches arise from inconsistent configuration between ends of a link.
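Alerting on trap frequency rather than severity can be prototyped with a per-interface, per-minute count. The sketch below assumes a hypothetical normalized trap log with three whitespace-separated fields per line (epoch seconds, ifIndex, and the trap name linkDown or linkUp); real snmptrapd output needs to be reshaped into that form first. The function name `flag_flapping` is also hypothetical. Each linkDown is counted as one flap pair.

```shell
# flag_flapping [THRESHOLD]
# Reads lines of "EPOCH_SECONDS IFINDEX TRAPNAME" on stdin and prints
# "IFINDEX MINUTE_START COUNT" for every interface and minute bucket
# with more than THRESHOLD linkDown traps (default 5).
# The input format is an assumption for illustration.
flag_flapping() {
    threshold="${1:-5}"
    awk -v t="$threshold" '
        $3 == "linkDown" {
            bucket = sprintf("%d", int($1 / 60) * 60)   # start of the minute
            n[$2 " " bucket]++
        }
        END { for (k in n) if (n[k] > t) print k, n[k] }
    ' | sort -n
}
```

Fixed one-minute buckets are a simplification: a burst that straddles a minute boundary is split across two buckets and may fall below the threshold in both, so a sliding window would be more sensitive.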







