BGP session Established but stale: detecting silent route loss
Your BGP session to a transit provider or iBGP peer is Established, but destinations are unreachable. The RIB is missing prefixes from that peer, or the routes it has are stale. No NOTIFICATION was sent, no session flap occurred, and your monitoring trusts the FSM state.
This is the “Established but stale” pattern. BGP KEEPALIVEs are still exchanged over the TCP session, but the UPDATE exchange has stopped. Typical reasons: the peer stopped sending routes, a middlebox is silently dropping UPDATE packets, or Graceful Restart is retaining routes after the remote side went down.
BGP session state is a binary “is the TCP session up” signal, not “am I receiving routes.” If you alert only on FSM transitions, you miss the case where the session is up but routing is stale or empty.
What happens during the failure
A BGP session in Established means the TCP connection succeeded and the BGP OPEN exchange completed. After initial synchronization, Adj-RIB-In is populated, End-of-RIB (EOR) markers are sent, and best-path selection installs routes into the RIB and FIB.
“Established but stale” means the session passed OPEN and remains Established, but UPDATE traffic has stopped. KEEPALIVEs (small TCP packets) continue to pass, so the FSM never transitions out of Established, no BGP NOTIFICATION is generated, and no syslog message appears for a session reset.
The routing data is hours old, or the peer stopped advertising entirely. The local RIB may still contain routes from the last successful UPDATE, but those routes may have been withdrawn upstream. Traffic follows stale paths or blackholes.
The divergence between KEEPALIVE health and UPDATE health is the core failure mode. KEEPALIVEs are small packets that pass through MTU bottlenecks and middleboxes. UPDATE messages carrying many prefixes can be larger and more susceptible to silent dropping.
flowchart TD
A["TCP session up<br/>KEEPALIVEs pass"] --> B["BGP FSM: Established"]
B --> C["UPDATE exchange stops"]
C --> D["Routes stale or missing in RIB"]
D --> E["Alert shows green<br/>FSM is Established"]
E --> F["Users hit stale paths<br/>or blackholes"]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Peer CPU pegged | KEEPALIVEs sent but no UPDATEs generated; peer control plane saturated | bgpPeerInUpdates flat; peer CPU near 100% |
| Middlebox dropping UPDATEs | Session up, KEEPALIVEs pass, larger UPDATE packets silently dropped (often MTU/MSS mismatch) | TCP MSS negotiation; packet capture on port 179 |
| Graceful Restart masking loss | FSM stays Established during restart window; stale routes retained in RIB | GR/LLGR state on the session; EOR marker timing |
| Route-flap damping suppressing routes | Session up, but all received routes damped; prefix count near zero | Damping state on received prefixes |
| Policy/filter change on peer | Session up, prefix count drops; peer changed export policy | Compare advertised-routes vs received-routes |
| Session hung in half-state | Vendor-specific bug; TCP session healthy but BGP process not processing | show tcp brief and BGP process state |
Quick checks
These are read-only diagnostic commands. Commands shown are Cisco IOS/XE syntax; Juniper and other vendors have equivalent show commands.
# Check FSM state and prefix counters per neighbor
ssh <router> 'show ip bgp summary | begin Neighbor'
# Check last UPDATE received and reset reason for a specific peer
ssh <router> 'show ip bgp neighbors <peer-ip> | include update|reset|received'
# Check for BGP NOTIFICATIONs in syslog for this peer
ssh <router> 'show log | include BGP.*<peer-ip>'
# Confirm the underlying TCP session on port 179 is healthy
ssh <router> 'show tcp brief | include <peer-ip>'
# Check if routes from the peer are actually in the RIB
ssh <router> 'show ip route <prefix-from-peer>'
# SNMP: poll bgpPeerState for all peers
snmpwalk -v2c -c <community> <router> .1.3.6.1.2.1.15.3.1.2
# SNMP: poll bgpPeerInUpdates to track UPDATE receive rate
snmpwalk -v2c -c <community> <router> .1.3.6.1.2.1.15.3.1.10
# Check Graceful Restart state on the session
ssh <router> 'show ip bgp neighbors <peer-ip> | include Graceful'
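To compare the two SNMP polls above in a script, the walk output can be parsed into a per-peer map. The following is a minimal sketch, assuming snmpwalk was run with -On so that OIDs print numerically and each line looks like `.1.3.6.1.2.1.15.3.1.10.192.0.2.1 = Counter32: 48211`. The function name and the sample addresses are illustrative and not part of any tool.

```python
import re

# Column OIDs from the BGP4-MIB bgpPeerTable (the same ones polled above).
BGP_PEER_STATE = ".1.3.6.1.2.1.15.3.1.2"
BGP_PEER_IN_UPDATES = ".1.3.6.1.2.1.15.3.1.10"

# Matches ".1.3...10.192.0.2.1 = Counter32: 48211" as well as
# "... = INTEGER: established(6)" and "... = INTEGER: 6".
_LINE = re.compile(r"^(?P<oid>\.[\d.]+) = \w+: (?:\w+\()?(?P<value>\d+)\)?\s*$")


def parse_walk(output: str, column_oid: str) -> dict[str, int]:
    """Return {peer_ip: value} for one bgpPeerTable column."""
    result = {}
    prefix = column_oid + "."
    for line in output.splitlines():
        m = _LINE.match(line.strip())
        if not m or not m.group("oid").startswith(prefix):
            continue
        # The table index is the peer's IPv4 address appended to the column OID.
        peer_ip = m.group("oid")[len(prefix):]
        result[peer_ip] = int(m.group("value"))
    return result
```

Running it on two walks taken a few minutes apart gives two dictionaries whose values can be compared peer by peer.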
How to diagnose it
Confirm the FSM is Established but stale. Check show ip bgp summary for the peer. The state should be Established (value 6 in the BGP4-MIB). Then check bgpPeerInUpdates: if the counter has not incremented in minutes or hours while the session is Established, UPDATEs from the peer are not arriving.
Check prefix count trends. Compare the current per-peer prefix count against a baseline. A drop of more than 50% without a session state change is a route-withdrawal event. A completely flat prefix count on a peer that normally shows churn (such as a transit provider with a full table) is suspicious.
Look for NOTIFICATIONs. Search syslog for BGP messages related to the peer. Absence of a NOTIFICATION confirms this is a silent stale session, not an explicit reset. If a NOTIFICATION is present, the failure is a different pattern; check its error code and subcode (for example, the Cease subcode) for the cause.
Verify TCP health. Run show tcp brief to confirm the underlying TCP connection on port 179 is established and has not reset. If the TCP session is gone but the FSM still shows Established, you likely have a vendor-specific half-state bug or a stuck BGP process.
Check Graceful Restart state. If GR is enabled, the FSM can remain in Established during the restart window while the peer is actually down. Check the GR timer state and whether the EOR marker has been received. On Cisco IOS/XE, the default GR restart-time is 120 seconds and stale-path-time is 360 seconds when GR is enabled. GR is disabled by default and must be explicitly configured.
Long-Lived Graceful Restart (LLGR), available on Juniper Junos 15.1 and later, can extend stale route retention to hours or longer. LLGR routes carry a community marker (llgr-stale) that distinguishes them from fresh routes.
Check for MTU/MSS issues. BGP KEEPALIVEs are small packets that pass through path-MTU bottlenecks. UPDATE messages carrying many prefixes can be much larger. If the path has an MTU mismatch and TCP MSS negotiation is broken, KEEPALIVEs pass but UPDATEs are silently dropped. Capture on port 179 on both ends to confirm.
Verify routes are actually in the RIB. Pick a prefix that should be announced by the peer and check show ip route <prefix>. If the route is present but the next-hop is unreachable, the issue is downstream of the peer. If the route is absent, the peer is not announcing it, or your import policy rejected it.
Compare with external visibility. If you have a BMP (RFC 7854) station, compare its Adj-RIB-In for the peer against the local router’s view. BMP provides pre-policy and post-policy route visibility that BGP4-MIB cannot expose. Without BMP, use show ip bgp neighbors <peer-ip> received-routes (requires soft reconfiguration inbound, which is memory-expensive on full-feed peers).
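The checks above form a rough decision sequence, which can be sketched as a function that takes the observations you gathered and returns the first matching explanation. This is only an illustration of the ordering. The `PeerObservation` fields and the returned strings are hypothetical names, and each field has to be filled in by hand or by your own tooling.

```python
from dataclasses import dataclass


@dataclass
class PeerObservation:
    fsm_established: bool         # bgpPeerState == 6
    in_updates_delta: int         # change in bgpPeerInUpdates over the observation window
    notification_in_syslog: bool  # a BGP NOTIFICATION was logged for this peer
    tcp_established: bool         # port 179 connection present in "show tcp brief"
    gr_active_without_eor: bool   # GR/LLGR retaining routes, no End-of-RIB seen
    large_packets_dropped: bool   # capture shows UPDATEs leave the peer but never arrive
    route_in_rib: bool            # sample prefix present in "show ip route"


def diagnose(obs: PeerObservation) -> str:
    """Walk the checks in the order listed above and return the first match."""
    if not obs.fsm_established:
        return "session down: not the stale-Established pattern"
    if obs.in_updates_delta > 0 and obs.route_in_rib:
        return "updates flowing and route present: look downstream (next-hop reachability)"
    if obs.notification_in_syslog:
        return "explicit reset: check the NOTIFICATION error code and subcode"
    if not obs.tcp_established:
        return "TCP gone but FSM Established: suspect half-state bug or stuck BGP process"
    if obs.gr_active_without_eor:
        return "Graceful Restart is retaining stale routes"
    if obs.large_packets_dropped:
        return "UPDATEs dropped in path: check MTU/MSS and middleboxes"
    if not obs.route_in_rib:
        return "peer not announcing the prefix, or import policy rejected it"
    return "stale-Established with no clear cause: compare against BMP or received-routes"
```

Returning the first match keeps the cheap checks (counters, syslog) ahead of the expensive ones (packet captures, BMP comparison).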
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| bgpPeerState (.1.3.6.1.2.1.15.3.1.2) | Base FSM state per peer | Any transition out of Established; but Established alone is insufficient |
| bgpPeerInUpdates (.1.3.6.1.2.1.15.3.1.10) | Rate of UPDATE messages received from peer | Counter flat while session is Established |
| Per-peer prefix count | Routes actually installed from the peer | Drop greater than 50% without session state change |
| Last UPDATE timestamp | Time since last route announcement from peer | Not advancing for minutes or hours |
| BGP NOTIFICATION in syslog | Explicit session reset or error code | Absence during a route-loss event confirms silent stale session |
| Graceful Restart state | Whether session is held up by stale-path retention | GR active with no EOR marker received |
| TCP session health | Underlying transport on port 179 | TCP reset or half-open despite FSM showing Established |
| BMP Adj-RIB-In (RFC 7854) | Pre-policy and post-policy per-prefix visibility | Mismatch between BMP view and local RIB |
Fixes
Peer CPU saturation
If the peer device’s control-plane CPU is pegged, it may send KEEPALIVEs but lack CPU to generate UPDATE messages. This is common during BGP reconvergence storms, SNMP polling overload, or on undersized hardware.
Check the peer’s control-plane CPU (Cisco: cpmCPUTotal5minRev at .1.3.6.1.4.1.9.9.109.1.1.1.1.8; Juniper: jnxOperatingCPU at .1.3.6.1.4.1.2636.3.1.13.1.8). If the peer is overloaded, coordinate with the peer’s operator. Do not reset the session unilaterally: a reset forces a full table re-send and worsens the CPU spike on both ends.
Middlebox dropping UPDATE packets
A firewall, load balancer, or other middlebox between the BGP speakers may be rate-limiting or dropping packets above a certain size. KEEPALIVEs (small) pass through; UPDATEs (larger) are dropped.
Capture traffic on TCP port 179 on both ends to confirm UPDATEs are leaving the peer but not arriving. Check for MTU/MSS mismatch and enable TCP path MTU discovery on the BGP group or session. If a middlebox cannot be reconfigured, move the BGP session to a path without the middlebox.
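The size asymmetry can be worked through with a few lines of arithmetic. This sketch assumes IPv4 and TCP headers without options and no extended-message capability, so a BGP message is at most 4096 bytes. The MSS and path MTU values are a hypothetical example of a tunnel whose ICMP "fragmentation needed" replies are filtered.

```python
IPV4_HEADER = 20        # bytes, no IP options
TCP_HEADER = 20         # bytes, no TCP options
BGP_KEEPALIVE = 19      # a KEEPALIVE is only the 19-byte BGP header (RFC 4271)
BGP_MAX_MESSAGE = 4096  # largest BGP message without the extended-message capability


def largest_packet(payload_bytes: int, mss: int) -> int:
    """Size of the largest IP packet TCP emits when sending payload_bytes."""
    segment = min(payload_bytes, mss)  # TCP never puts more than MSS bytes in a segment
    return IPV4_HEADER + TCP_HEADER + segment


def passes(payload_bytes: int, mss: int, path_mtu: int) -> bool:
    """True if every resulting packet fits the path MTU without fragmentation."""
    return largest_packet(payload_bytes, mss) <= path_mtu


# Hypothetical path: both ends negotiated MSS 1460 from a 1500-byte interface MTU,
# but a tunnel in the middle only carries 1400-byte packets.
MSS, PATH_MTU = 1460, 1400
print(passes(BGP_KEEPALIVE, MSS, PATH_MTU))    # True: a 59-byte packet fits
print(passes(BGP_MAX_MESSAGE, MSS, PATH_MTU))  # False: 1500-byte packets do not
```

With these numbers, lowering the MSS to 1360 or below would let full-size segments through the 1400-byte path.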
Graceful Restart masking session loss
GR and LLGR are designed to retain routes during a controlled restart. They can also mask a session loss if the remote side crashes and does not come back within the configured window.
Track GR state separately from FSM state. Monitor the EOR marker: if the peer restarted, the EOR marker signals the end of initial route exchange. Absence of EOR after the restart window means the peer did not complete resynchronization. Set GR timers to operational reality: if your restart window is too long, stale routes persist longer than necessary.
Route-flap damping suppressing routes
If route-flap damping is enabled on received prefixes, a flapping upstream can cause all routes from that peer to be suppressed. The session stays Established, but no routes are installed.
BGP route-flap damping has largely fallen out of favor at ISP scale because it can keep legitimate routes suppressed long after the flapping has stopped. If damping is active on a transit-facing session, evaluate whether it should be removed or scoped to specific prefix ranges only.
Policy or filter change on the peer
The peer may have changed its export policy and stopped advertising routes. The session stays Established because KEEPALIVEs still flow.
Compare received-routes count against the peer’s expected advertised-routes. If the peer inadvertently stopped exporting, contact them. This is an operational coordination issue, not a protocol issue.
BFD tracking the wrong interface
If BFD is configured for the BGP session but tracks the wrong interface or path, BFD may report the session as healthy while the actual data path for UPDATE messages is broken. Verify BFD session state and which interface it tracks. Confirm BFD and BGP reference the same transport path.
Prevention
Alert on bgpPeerInUpdates rate, not just FSM state. The single most important step is to alert when UPDATE rate goes to zero while the session is Established. This catches the stale-Established pattern that FSM monitoring alone misses.
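A minimal sketch of that check, assuming your collector hands you two consecutive bgpPeerInUpdates samples and the current bgpPeerState for a peer. bgpPeerInUpdates is a Counter32, so the delta is taken modulo 2^32 to tolerate a single wrap. The function names are illustrative.

```python
ESTABLISHED = 6              # bgpPeerState value for Established in BGP4-MIB
COUNTER32_MODULUS = 2 ** 32  # bgpPeerInUpdates is a Counter32 and wraps at 2^32


def updates_delta(previous: int, current: int) -> int:
    """UPDATEs received between two polls, tolerating one Counter32 wrap."""
    return (current - previous) % COUNTER32_MODULUS


def stale_established(state: int, previous: int, current: int) -> bool:
    """True when the session is Established but no UPDATE arrived between polls."""
    return state == ESTABLISHED and updates_delta(previous, current) == 0
```

A single flat interval is normal on a quiet peer, so in practice the condition should hold across several consecutive polls before it pages anyone.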
Baseline per-peer prefix counts. Track each peer’s normal prefix count and alert on deviations. A peer with a full Internet table that drops to near zero is page-worthy, even if the session is Established.
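One simple way to express such a baseline is the median of recent samples, compared against the current count. The sketch below uses the 50% drop threshold mentioned earlier as a default. The choice of median, the function names, and the requirement that `history` is non-empty are assumptions made for illustration.

```python
from statistics import median


def prefix_drop_ratio(history: list[int], current: int) -> float:
    """Fraction by which current is below the median of history (0.0 if not below)."""
    baseline = median(history)  # history must contain at least one sample
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def prefix_alert(history: list[int], current: int, session_established: bool,
                 threshold: float = 0.5) -> bool:
    """Alert on a drop above threshold while the session is still Established."""
    return session_established and prefix_drop_ratio(history, current) > threshold
```

The median resists a single bad sample better than a mean would, which matters when the history window itself contains part of an incident.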
Deploy BMP where possible. BMP (RFC 7854) provides Adj-RIB-In visibility (pre-policy and post-policy routes), per-prefix real-time streaming, and withdrawal tracking. BGP4-MIB’s bgp4PathAttrTable only reflects best-path routes, and Adj-RIB-In entries return “NA” over SNMP. BMP is the only way to get full pre-policy visibility without relying on soft-reconfiguration inbound.
Monitor Graceful Restart state separately. Do not treat a GR-active session the same as a healthy Established session. Alert if GR is active beyond the expected window.
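As a sketch of what "beyond the expected window" could mean, the function below compares the time since GR became active against the restart and stale-path timers, using the Cisco IOS/XE defaults quoted earlier (120 and 360 seconds). How you obtain the GR start time and the End-of-RIB time is left open, and the parameter names are hypothetical.

```python
from typing import Optional

# Cisco IOS/XE defaults when GR is enabled; replace with your configured values.
RESTART_TIME_S = 120
STALE_PATH_TIME_S = 360


def gr_overdue(gr_started_at: float, eor_received_at: Optional[float], now: float,
               restart_time: int = RESTART_TIME_S,
               stale_path_time: int = STALE_PATH_TIME_S) -> Optional[str]:
    """Return a reason string if GR has outlived its expected window, else None."""
    if eor_received_at is not None:
        return None  # the peer completed resynchronization
    elapsed = now - gr_started_at
    if elapsed > stale_path_time:
        return "stale-path window exceeded without End-of-RIB"
    if elapsed > restart_time:
        return "restart window exceeded, peer has not resynchronized"
    return None
```

Returning a reason rather than a boolean lets the alert text say which timer was exceeded.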
Track the last-UPDATE timestamp. If your collector can timestamp the last bgpPeerInUpdates increment, alert when it exceeds a threshold appropriate for the peer type (for example, 30 minutes for a transit peer, 5 minutes for an iBGP peer with active churn).
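If the collector does not record that timestamp itself, it can be derived by remembering when the counter last changed. The class below is a small sketch of that idea using the example thresholds from the paragraph above. The class name, the peer-type labels, and the in-memory storage are illustrative assumptions.

```python
# Example thresholds from the text, in seconds; the peer-type labels are illustrative.
THRESHOLDS = {"transit": 30 * 60, "ibgp": 5 * 60}


class LastUpdateTracker:
    """Remember when each peer's bgpPeerInUpdates counter last changed."""

    def __init__(self):
        self._last_value = {}
        self._last_change = {}

    def observe(self, peer: str, counter: int, now: float) -> None:
        # The first observation of a peer also counts as a change.
        if self._last_value.get(peer) != counter:
            self._last_value[peer] = counter
            self._last_change[peer] = now

    def is_stale(self, peer: str, peer_type: str, now: float) -> bool:
        last = self._last_change.get(peer)
        if last is None:
            return False  # never observed, nothing to compare against
        return now - last > THRESHOLDS[peer_type]
```

Call `observe` on every poll and `is_stale` when evaluating alerts; passing `now` explicitly keeps the class easy to test.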
Periodically verify RIB freshness. Check next-hop reachability for a sample of prefixes from each peer to catch stale routes.
How Netdata helps
- The SNMP collector polls bgpPeerState and bgpPeerInUpdates at sub-minute intervals, giving you the UPDATE-rate signal that catches stale-Established sessions before users notice.
- Correlate BGP session state with interface counters on the peer-facing interface. If the interface is up but bgpPeerInUpdates is flat, the session is stale.
- Track per-peer prefix counts over time. Anomaly detection surfaces gradual prefix declines that a static threshold would miss.
- Correlate BGP events with syslog. A NOTIFICATION paired with a session-state drop confirms a real reset; absence of NOTIFICATION during route loss confirms the stale-Established pattern.
- Cross-reference BGP session health with device control-plane CPU. If CPU spikes coincide with UPDATE rate dropping to zero, the peer is overloaded.
Related guides
- Network monitoring checklist: the signals every production network needs
- SNMP timeouts and retries: why devices show as down when they aren’t
- SNMP poll response latency: diagnosing a slow poller
- SNMP poller falling behind: the polling-storm cascade and how to catch it
- NetFlow vs sFlow vs IPFIX: what they measure and how each one fails
- SNMP counter discontinuity after reboot: bogus rate spikes explained







