SD-WAN tunnel up but degraded: when the control plane lies

The orchestrator shows your SD-WAN tunnel as UP. Control connections to vSmart or vBond are healthy. OMP sessions are Established. But users at the far end report slow applications, dropped voice calls, or timeouts.

The control plane reports a healthy tunnel while the data plane is degraded with packet loss, latency spikes, or silent traffic drops. Interface counters show UP/UP because the degradation is on the underlay path or inside the encapsulated data plane, not on the local interface.

In Cisco Catalyst SD-WAN, BFD runs automatically on every data plane tunnel once established and cannot be disabled. It continuously measures loss, latency, and jitter. If you are not tracking BFD metrics and SLA probe results, the tunnel looks fine until someone complains.

What this means

In Cisco SD-WAN, the control plane and data plane are architecturally separate. Control plane tunnels use DTLS or TLS between WAN Edge routers and controllers (vSmart, vBond, vManage). Data plane tunnels use IPsec or GRE between WAN Edge routers to carry user traffic. A control connection showing “up” proves only that the router can reach the controllers. It says nothing about whether the inter-router data plane tunnel can pass traffic.

BFD is the authoritative signal. Once a data plane tunnel forms between two WAN Edge routers, BFD starts automatically on top of it. BFD session state is the ground truth for tunnel health, not the control connection state.

BFD session states:

  • up: session is active and healthy.
  • down: session has failed, typically due to timeout.
  • NA: initial state before the session establishes.

The transitions counter tracks state changes. An increasing transition count on a session that currently shows “up” indicates intermittent degradation the control plane will never surface.
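
A minimal sketch of that check, assuming you already collect BFD session snapshots (from show bfd sessions output or an orchestrator API) into dictionaries keyed by a tunnel identifier. The key format and the field names state and transitions are illustrative, not a vendor schema.

```python
def flapping_sessions(previous, current):
    """Return tunnels that are 'up' now but whose transitions count rose.

    Both arguments map a tunnel key to {"state": str, "transitions": int}.
    The key format and field names are assumptions for illustration.
    """
    flapping = {}
    for tunnel, now in current.items():
        before = previous.get(tunnel)
        if before is None:
            continue  # new session, no baseline to compare against
        delta = now["transitions"] - before["transitions"]
        if now["state"] == "up" and delta > 0:
            flapping[tunnel] = delta
    return flapping


previous = {
    "mpls->10.0.0.2": {"state": "up", "transitions": 3},
    "biz-internet->10.0.0.2": {"state": "up", "transitions": 0},
}
current = {
    "mpls->10.0.0.2": {"state": "up", "transitions": 7},
    "biz-internet->10.0.0.2": {"state": "up", "transitions": 0},
}
print(flapping_sessions(previous, current))  # {'mpls->10.0.0.2': 4}
```

A negative delta (for example after a device reboot resets the counter) is ignored here; you may want to treat it as its own event.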

flowchart TD
    A["Control plane UP<br/>Users report degradation"] --> B{"BFD session state?"}
    B -->|up| C{"Transitions count rising?"}
    B -->|down| D["Data plane failed - check underlay"]
    B -->|empty or NA| E["No BFD session - check color config"]
    C -->|yes| F["Intermittent loss - check underlay errors"]
    C -->|no| G{"SLA probes show loss or latency?"}
    G -->|yes| H["Underlay path issue - run mtr both ways"]
    G -->|no| I{"IPsec replay drops increasing?"}
    I -->|yes| J["SA mismatch - request ipsec-rekey"]
    I -->|no| K["Check DSCP marking, app-route policy"]
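
The triage order in the flowchart can also be written as a small function, which is handy if you want to attach a suggested next step to an alert. This is a sketch only: the function name and its four inputs are hypothetical, and you would have to derive each input from your own telemetry.

```python
def triage(bfd_state, transitions_rising, sla_degraded, replay_drops_rising):
    """Walk the decision tree and return the suggested next action.

    bfd_state is "up", "down", "NA", or None when no session is listed.
    The three booleans are assumed to be computed elsewhere.
    """
    if bfd_state == "down":
        return "Data plane failed - check underlay"
    if bfd_state != "up":  # empty session list or NA
        return "No BFD session - check color config"
    if transitions_rising:
        return "Intermittent loss - check underlay errors"
    if sla_degraded:
        return "Underlay path issue - run mtr both ways"
    if replay_drops_rising:
        return "SA mismatch - request ipsec-rekey"
    return "Check DSCP marking, app-route policy"


print(triage("up", False, True, False))
```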

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Underlay path degradation | BFD up but loss, latency, or jitter elevated; app-route SLA breach | Run mtr to the tunnel endpoint; check underlay interface errors on both ends |
| Missing color or tunnel-interface config | Control connections up, BFD sessions list is empty | Verify both WAN interfaces under VPN0 have tunnel-interface and color configured |
| IPsec anti-replay mismatch | Tunnel up, traffic silently dropped, no BFD down event | Check show crypto ipsec stats for replay failures |
| PMTU discovery failure | Large packets dropped, small packets pass, BFD stays up | Check show tunnel statistics bfd for PMTU values; test with varying packet sizes |
| BFD DSCP handling by provider | BFD reports loss while data traffic flows normally | Check if MPLS provider treats CS6-marked BFD packets differently |
| Peer device CPU pegged | Intermittent BFD timeouts, transitions count rising | Check control-plane CPU on the peer WAN Edge |
| Asymmetric underlay routing | Forward path healthy, reverse path degraded | Run mtr in both directions; compare paths |

Quick checks

These commands use Viptela (vEdge) CLI syntax unless noted. On cEdge routers running IOS XE SD-WAN, most equivalents take an sdwan keyword, for example show sdwan bfd sessions:

# BFD session state, colors, source/dest IPs, transitions (Viptela CLI)
show bfd sessions

# Extended view including NAT-translated ports
show bfd sessions detail

# Historical BFD state changes with timestamps
show bfd history

# Per-tunnel TX/RX packet and byte counts
show tunnel statistics

# BFD-specific packet counts including echo TX/RX and PMTU
show tunnel statistics bfd

# OMP TLOC database entries and status codes
show omp tlocs

# IPsec anti-replay failures (cEdge / IOS XE)
show crypto ipsec stats

# System statistics diff for replay integrity drops
show system statistics diff

# Path discovery from a Linux host behind the WAN Edge to the tunnel endpoint
mtr -n -c 100 <tunnel-endpoint-ip>

How to diagnose it

  1. Verify BFD session state. Run show bfd sessions. If sessions are empty despite control connections being up, check that both WAN interfaces under VPN0 have tunnel-interface and color configured. Missing color on one interface is a common cause of zero BFD sessions. Also verify default routes exist for each WAN interface.

  2. Check BFD transitions count. In show bfd sessions, look at the transitions counter for each session. A session showing “up” with a high or increasing transition count means the tunnel is flapping intermittently. Cross-reference with show bfd history for timestamps of state changes.

  3. Examine SLA probe data. Check the orchestrator’s view of tunnel SLA metrics for loss, latency, and jitter. On Cisco vManage, use the Tunnel Health tool for a time-customizable view of operational data plane tunnels. On Cato, query accountSnapshot via the GraphQL API. These show the actual data plane quality measurements that application-aware routing decisions depend on.

  4. Run bidirectional path discovery. Run mtr from hosts at both ends of the tunnel. Asymmetric underlay routing is common. The forward path may be healthy while the reverse path is degraded. If the two directions show different paths or different loss characteristics, you have an asymmetric problem.

  5. Check underlay interface counters. On both WAN Edge routers, examine the underlay transport interfaces for errors, discards, and utilization. Rising ifInErrors points to physical-layer issues (cable, SFP, dirty fiber). Rising ifOutDiscards at high utilization points to congestion. Microbursts can cause discards even when 5-minute average utilization is low.

  6. Check for IPsec anti-replay issues. On cEdge routers, run show crypto ipsec stats and look for replay failures. Anti-replay check failures cause silent packet drops on the data plane while the tunnel remains up. Also check show system statistics diff for increasing rx_replay_integrity_drops, which indicates an IPsec SA mismatch.

  7. Verify PMTU behavior. Check show tunnel statistics bfd for PMTU values. If PMTU discovery fails or is blocked by ISP ACLs, large BFD PMTU probes and large encapsulated data packets get silently dropped. The BFD session may stay up because its small hello packets still pass, while user traffic with larger payloads fails.

  8. Consider DSCP handling. BFD control packets are marked CS6 (DSCP 48) in the outer IP header. Some MPLS L3VPN or L2VPN providers handle CS6-marked traffic differently from unmarked traffic. This can cause BFD to report loss while data traffic flows normally, or vice versa. If BFD loss does not correlate with user-reported degradation, DSCP handling mismatch is a candidate.
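
Step 5 comes down to turning two counter samples into a rate and reading that rate against utilization. A minimal sketch, assuming you poll ifInErrors and ifOutDiscards (both 32-bit counters in IF-MIB) at a known interval. The 0.8 utilization cut-off and the function names are assumptions for illustration, not recommended values.

```python
COUNTER32_MODULUS = 2**32


def counter_rate(old, new, seconds, modulus=COUNTER32_MODULUS):
    """Per-second rate between two counter samples, tolerating one wrap."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = (new - old) % modulus  # modulo handles a single counter wrap
    return delta / seconds


def classify_underlay(errors_per_s, discards_per_s, utilization):
    """Rough reading of step 5; utilization is a fraction from 0.0 to 1.0."""
    findings = []
    if errors_per_s > 0:
        findings.append("physical layer: check cable, SFP, fiber")
    if discards_per_s > 0:
        if utilization >= 0.8:  # assumed cut-off for "high utilization"
            findings.append("congestion on egress")
        else:
            findings.append("possible microbursts")
    return findings


# ifInErrors wrapped between polls taken 8 seconds apart.
rate = counter_rate(4294967290, 10, 8)
print(rate, classify_underlay(rate, 0.0, 0.3))
```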

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| BFD session state (up/down/NA) | Ground truth for data plane tunnel health | Any transition to “down” or sessions stuck at “NA” |
| BFD transitions count | Detects intermittent flapping invisible in current state | Count increasing on a session showing “up” |
| BFD loss, latency, jitter | Continuous quality measurement of the tunnel | Sustained values above SLA class thresholds |
| Application-aware routing decisions | Shows whether traffic is being steered away from degraded tunnels | Unexpected path changes or traffic pinned to a suboptimal tunnel |
| Underlay interface errors (ifInErrors) | Physical-layer degradation on the transport path | Counter incrementing at any rate on a critical underlay |
| Underlay interface discards (ifOutDiscards) | Congestion on the underlay egress | Discards rising, especially during peak traffic windows |
| IPsec anti-replay drops | Silent data plane packet drops with tunnel up | rx_replay_integrity_drops increasing in system statistics |
| SLA poll interval convergence | How fast degradation is detected and acted upon | Detection taking longer than expected for the configured poll interval |
| Vendor API response validity | Orchestrator data is the primary telemetry source for SD-WAN | HTTP 200 with empty payload or schema mismatch |
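
“Sustained” is the operative word for the loss, latency, and jitter signals: a single spike should not page anyone. One simple way to express it, sketched here with an assumed rule of three consecutive samples over the threshold. Both the rule and the sample values are illustrative.

```python
def sustained_breach(samples, threshold, min_consecutive=3):
    """True if at least min_consecutive samples in a row exceed threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


latency_ms = [42, 180, 175, 44, 190]      # spikes, but never three in a row
print(sustained_breach(latency_ms, 150))  # False
latency_ms = [42, 180, 175, 190, 44]      # three consecutive breaches
print(sustained_breach(latency_ms, 150))  # True
```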

Fixes

Underlay path degradation

If BFD loss and latency are elevated and mtr shows path issues, the problem is in the underlay. Check both ends for interface errors and discards. If the underlay is MPLS, engage the provider with specific loss and latency data. If the underlay is internet, check for routing changes using BGP monitoring on the WAN Edge. Application-aware routing should steer traffic to an alternate tunnel if one exists, but this depends on SLA class configuration and poll interval timing.

Missing color or tunnel-interface configuration

If show bfd sessions is empty despite control connections being up, verify the configuration. Both WAN interfaces under VPN0 must have tunnel-interface and color configured. Also verify default routes exist for each WAN interface. Two settings are easy to misread: color <color> restrict limits the interface to data plane tunnels with remote TLOCs of the same color, and max-control-connections 0 stops the interface from forming control connections. Either can leave you with fewer tunnels on that color than you expect, and operators sometimes set them thinking they only affect the control plane.

IPsec anti-replay mismatch

If show crypto ipsec stats shows replay failures or rx_replay_integrity_drops is increasing, the IPsec SAs are mismatched between peers. Force a rekey to clear the mismatch. If drops persist after rekey, verify the authentication-type configuration on both ends.

Warning: Forcing an IPsec rekey is disruptive. Existing flows through the tunnel will be interrupted during rekey. Schedule during a maintenance window if production traffic is affected.

PMTU discovery failure

If large packets are being silently dropped, check whether PMTU discovery is working. If an ISP is blocking ICMP fragmentation-needed messages, PMTU discovery fails silently. On Cisco SD-WAN, BFD automatically negotiates the largest MTU per transport connection when PMTU discovery is enabled. Work with the ISP to allow ICMP fragmentation-needed messages, or set a static MTU that accounts for encapsulation overhead.
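
When PMTU discovery is blocked, you can find the working size yourself by probing with the DF bit set and narrowing the range. A minimal sketch of that search, assuming a caller-supplied probe function (hypothetical) that sends one DF-bit packet of the given IP size and returns True if a reply came back. It also assumes the path behaves monotonically: every size at or below the path MTU passes and every size above it fails.

```python
IPV4_HEADER = 20  # bytes, no options
ICMP_HEADER = 8   # bytes


def ping_payload_for_mtu(mtu):
    """ICMP echo payload that produces an IPv4 packet of exactly mtu bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def largest_passing_size(probe, low=576, high=1500):
    """Binary search for the largest packet size where probe(size) is True.

    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Stand-in for a real DF-bit probe: pretend the path MTU is 1438 bytes.
def fake_probe(size):
    return size <= 1438


mtu = largest_passing_size(fake_probe)
print(mtu, ping_payload_for_mtu(mtu))  # 1438 1410
```

On a Linux host, ping -M do -s <payload> <tunnel-endpoint-ip> sends a DF-bit echo with that payload size, so a real probe can wrap that command. Remember that the result is the underlay path MTU; the static MTU you configure must still leave room for the tunnel encapsulation overhead.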

SLA convergence too slow

The default convergence time for detecting slowly degrading WAN circuits in Cisco IOS XE Catalyst SD-WAN 17.x is between 10 minutes and 1 hour. Even with the lowest recommended poll interval of 2 minutes and 6 intervals, convergence time is 2 to 12 minutes. The default BFD hello interval is 1 second, but the app-route SLA poll interval defaults to 10 minutes. Setting a very low poll interval can result in false positives due to insufficient sample data.
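
The arithmetic behind those ranges is just the poll interval multiplied by the number of intervals considered. A small worked example using the figures from the paragraph above; the function name is made up for illustration.

```python
def convergence_window_minutes(poll_interval_min, intervals):
    """Best case is one poll interval; worst case is all intervals."""
    return poll_interval_min, poll_interval_min * intervals


print(convergence_window_minutes(10, 6))  # (10, 60): 10 minutes to 1 hour
print(convergence_window_minutes(2, 6))   # (2, 12): 2 to 12 minutes
```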

Enhanced Application-Aware Routing (available in later releases) speeds detection of tunnel performance issues, allowing devices to redirect traffic away from tunnels that do not meet SLA requirements faster than legacy polling mechanisms.

Fortinet SD-WAN false positive SLA failures

If you are running Fortinet SD-WAN on FortiOS 7.2.8, 7.4.4, or 7.4.6, Performance SLAs may report failure even when there is no actual packet loss. This is confirmed bug #1023878. False positive SLA failure logs appear at random intervals after upgrades to affected versions. Community reports indicate the issue persists in 7.4.6 despite claims of resolution in 7.4.5. The workaround is to switch the health-check protocol from “ping” to “DNS” type. Additionally, using both packet-loss-based and latency-based Performance SLAs simultaneously can produce unreliable results on Fortinet. Testing shows packet-loss-only SLA behaves correctly in isolation, but combining metrics causes unpredictable interface state flaps.

Prevention

  • Monitor BFD metrics continuously, not just session state. Track transitions count, loss, latency, and jitter per BFD session. Alert on rising transitions even when the session shows “up.”
  • Track SLA probe results over time. Baseline loss, latency, and jitter per tunnel. Alert on sustained deviation from baseline, not just binary up/down.
  • Run active path probes in both directions. Asymmetric underlay routing is common. Single-direction probes miss half the picture.
  • Monitor underlay interface errors and discards. These are leading indicators of underlay degradation before BFD loss becomes severe enough to trigger SLA actions.
  • Verify SD-WAN configuration at deployment. Missing color, incorrect restrict settings, and missing default routes are the most common reasons for empty BFD sessions.
  • Track IPsec anti-replay counters. Include rx_replay_integrity_drops in regular monitoring. These increment silently and indicate SA mismatch before users notice.
  • Validate vendor API responses, not just HTTP status codes. SD-WAN orchestrator APIs can return HTTP 200 with empty or error payloads during maintenance windows. The HTTP status alone is misleading.
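
The last point is easy to automate. A minimal sketch of a response check that looks past the status code, assuming the API returns a JSON object and that you know which top-level keys must be present and non-empty. The key name data is a placeholder, not any vendor's schema.

```python
import json


def validate_api_response(status_code, body_text, required_keys=("data",)):
    """Return (ok, reason). required_keys is a hypothetical schema check."""
    if status_code != 200:
        return False, f"HTTP {status_code}"
    if not body_text or not body_text.strip():
        return False, "empty body"
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "unexpected top-level type"
    for key in required_keys:
        if key not in payload:
            return False, f"missing key: {key}"
        if payload[key] in (None, [], {}, ""):
            return False, f"empty value for key: {key}"
    return True, "ok"


print(validate_api_response(200, '{"data": []}'))
print(validate_api_response(200, '{"data": [{"state": "up"}]}'))
```

Alert on the reason string, not only on the boolean, so an empty payload during a maintenance window is distinguishable from a schema change.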

How Netdata helps

Diagnosing SD-WAN tunnel degradation means correlating several signal layers. Netdata brings them together:

  • BFD session metrics collected via vendor APIs or device CLI scraping, including session state, transitions count, and per-tunnel loss, latency, and jitter. Correlate BFD state changes with underlay events to pinpoint root cause.
  • Underlay interface errors and discards from SNMP polling, correlated with BFD degradation timestamps to determine whether the underlay is the source.
  • Vendor API response validity monitoring for SD-WAN orchestrators, flagging HTTP 200 responses with empty payloads that create blind spots in tunnel visibility.
  • Active path probe latency and loss correlated across both tunnel directions to detect asymmetric degradation that single-direction monitoring misses.
  • Control-plane CPU on WAN Edge routers, which when pegged causes intermittent BFD timeouts and delayed SLA convergence.
  • Cross-signal correlation between BFD state changes, underlay interface events, syslog messages, and application-aware routing decisions within the same time window to reconstruct the failure sequence.