Interface discards with low utilization: diagnosing ifInDiscards/ifOutDiscards

ifOutDiscards is climbing on a critical uplink. Utilization sits at 35%. No CRC errors, no input errors, no physical-layer alarms. The link is up and passing traffic, but something is silently dropping packets, and your averaged utilization metrics are not telling you why.

The gap between what the counters show and what the silicon is doing comes down to two things: the averaging window on utilization, and the fact that discards happen at buffer-queue granularity, not at link-rate granularity.

The IF-MIB gives you the signals you need. The challenge is knowing what those counters actually measure, where their blind spots are, and which vendor-specific counters fill the gaps.

What this means

ifInDiscards (OID .1.3.6.1.2.1.2.2.1.13) and ifOutDiscards (.1.3.6.1.2.1.2.2.1.19) are IF-MIB counters defined in RFC 2863. They count packets that the interface chose to discard even though no errors were detected; RFC 2863 gives freeing buffer space as one possible reason. They are distinct from ifInErrors (.1.3.6.1.2.1.2.2.1.14) and ifOutErrors (.1.3.6.1.2.1.2.2.1.20), which count frames with hardware-detected problems such as CRC failures, runts, or alignment errors.

Both discard counters are Counter32. Before trusting any rate calculation, check ifCounterDiscontinuityTime (.1.3.6.1.2.1.31.1.1.1.19) to rule out counter resets from device reboots, interface flaps, or manual counter clears. A naive differencing algorithm that does not handle wrap or discontinuity will produce phantom discard spikes that look exactly like real ones.
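A minimal sketch of wrap-safe differencing, assuming the poller stores the previous counter value and the previous ifCounterDiscontinuityTime for each interface. The function name and the choice to return None for an unusable interval are illustrative, not taken from any particular collector. It also assumes the counter wrapped at most once between polls.

```python
from typing import Optional

COUNTER32_MODULUS = 2 ** 32  # Counter32 wraps from 2^32 - 1 back to 0


def counter32_delta(prev: int, curr: int,
                    prev_discontinuity: int,
                    curr_discontinuity: int) -> Optional[int]:
    """Return the discards counted between two polls.

    Returns None when ifCounterDiscontinuityTime changed, because the
    counter was reset and the difference is meaningless.
    """
    if curr_discontinuity != prev_discontinuity:
        return None  # reset between polls: drop this sample
    # Modular subtraction absorbs a single wrap without special cases.
    return (curr - prev) % COUNTER32_MODULUS


if __name__ == "__main__":
    print(counter32_delta(100, 150, 0, 0))        # ordinary increase
    print(counter32_delta(4294967290, 10, 0, 0))  # counter wrapped once
    print(counter32_delta(100, 5, 0, 500))        # counter was reset
```

Dropping the sample on a discontinuity loses one interval of data, which is usually preferable to charting a spike that never happened.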

The core diagnostic problem is temporal resolution. Utilization is computed as 8 * (delta octets) / (delta time * ifHighSpeed * 10^6) using 64-bit HC counters (ifHCInOctets at .1.3.6.1.2.1.31.1.1.1.6, ifHCOutOctets at .1.3.6.1.2.1.31.1.1.1.10); the 10^6 factor is needed because ifHighSpeed is reported in Mb/s. But this is an average over the polling interval or the device load interval. On Cisco IOS/IOS-XE, the default load interval is 300 seconds and is adjustable from 30 to 600 seconds in 30-second steps. A microburst that fills the egress buffer for 50 milliseconds is invisible in a 30-second average, let alone a 5-minute one. A 10G egress interface fed by four independent 10G sources can exhaust its egress queue in milliseconds if those flows burst simultaneously, even though the 30-second average never exceeds 40% utilization.
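The utilization calculation can be sketched as follows. The function name and argument names are illustrative; the only facts relied on are that the HC octet counters are 64-bit and that ifHighSpeed is expressed in units of 1,000,000 bits per second.

```python
COUNTER64_MODULUS = 2 ** 64


def utilization_percent(prev_octets: int, curr_octets: int,
                        delta_seconds: float,
                        if_high_speed_mbps: int) -> float:
    """Average utilization over one polling interval, in percent."""
    if delta_seconds <= 0 or if_high_speed_mbps <= 0:
        raise ValueError("interval and interface speed must be positive")
    # Modular subtraction keeps the result correct across a counter wrap.
    delta_octets = (curr_octets - prev_octets) % COUNTER64_MODULUS
    bits_per_second = 8 * delta_octets / delta_seconds
    link_bits_per_second = if_high_speed_mbps * 1_000_000
    return 100.0 * bits_per_second / link_bits_per_second


if __name__ == "__main__":
    # Hypothetical poll: 13.125 GB sent in 30 s on a 10G interface.
    print(utilization_percent(0, 13_125_000_000, 30.0, 10_000))
```

The result is an average over delta_seconds and nothing finer, which is exactly the limitation described above.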

That is the signature pattern: discards at low sustained utilization almost always mean microbursts.
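A short worked example shows why averaging hides the burst. All numbers are hypothetical: a 35% baseline, a 50 ms burst at 100% of line rate, a 30-second window, and a 1 MB egress queue buffer (real buffer sizes vary by platform and queue configuration).

```python
def averaged_utilization(baseline_pct: float, burst_pct: float,
                         burst_seconds: float, window_seconds: float) -> float:
    """Utilization reported for a window containing one short burst."""
    quiet_seconds = window_seconds - burst_seconds
    return (burst_pct * burst_seconds + baseline_pct * quiet_seconds) / window_seconds


def buffer_fill_seconds(buffer_bytes: int, arrival_bps: float,
                        drain_bps: float) -> float:
    """Time for a queue to fill when traffic arrives faster than it drains."""
    excess_bps = arrival_bps - drain_bps
    if excess_bps <= 0:
        return float("inf")  # the queue never builds
    return buffer_bytes * 8 / excess_bps


if __name__ == "__main__":
    # A 50 ms line-rate burst moves the 30 s average by about 0.1 points.
    print(averaged_utilization(35.0, 100.0, 0.05, 30.0))
    # 40 Gb/s arriving at a 10 Gb/s port fills 1 MB in well under 1 ms.
    print(buffer_fill_seconds(1_000_000, 40e9, 10e9))
```

With these assumptions the window average moves from 35% to roughly 35.1%, while the buffer fills in about 0.27 ms, so the drops happen long before any averaged metric can react.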

flowchart TD
    A["Discards incrementing on interface"] --> B{"ifCounterDiscontinuityTime changed?"}
    B -- "Yes" --> C["Counter reset: spike may be calculation artifact"]
    B -- "No" --> D{"Utilization sustained above 80%?"}
    D -- "Yes" --> E["Capacity exhaustion: link is saturated"]
    D -- "No" --> F{"ifInErrors or ifOutErrors also rising?"}
    F -- "Yes" --> G["Physical-layer fault: cable, SFP, optics"]
    F -- "No" --> H{"Single queue class affected?"}
    H -- "Yes" --> I["QoS buffer threshold or policer drop"]
    H -- "No" --> J["Microburst: sub-second spike invisible to averaged counters"]
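The flowchart above can be expressed as a small function, which is handy for annotating alerts automatically. This is a sketch: the function name, argument names, and returned strings are illustrative, and the 80% threshold is the one used in the flowchart.

```python
def classify_discards(discontinuity_changed: bool,
                      utilization_pct: float,
                      errors_rising: bool,
                      single_queue_affected: bool) -> str:
    """Walk the decision tree in the same order as the flowchart."""
    if discontinuity_changed:
        return "counter reset: spike may be a calculation artifact"
    if utilization_pct > 80.0:
        return "capacity exhaustion: link is saturated"
    if errors_rising:
        return "physical-layer fault: cable, SFP, optics"
    if single_queue_affected:
        return "QoS buffer threshold or policer drop"
    return "microburst: sub-second spike invisible to averaged counters"


if __name__ == "__main__":
    print(classify_discards(False, 35.0, False, False))
```

The order of the checks matters: a counter reset invalidates every later test, so it is evaluated first.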

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Microburst congestion | ifOutDiscards rising, utilization under 70%, no errors | Per-queue drop counters via vendor QoS MIBs or CLI |
| Speed or duplex mismatch | Discards on one side, errors on the other | Interface speed and duplex negotiation on both ends |
| QoS buffer threshold | Discards concentrated in one queue class | Per-queue stats via vendor-specific commands |
| Input policer or ACL | ifInDiscards rising, no corresponding output drops | Applied policies and ACL hit counters |
| Counter wrap or discontinuity | Sudden massive spike with no traffic correlation | ifCounterDiscontinuityTime and sysUpTime |
| Undersized device buffer | Discards correlate with aggregate traffic, not one flow | Platform buffer allocation settings |

Quick checks

# Poll discard counters via SNMP
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.13  # ifInDiscards
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.19  # ifOutDiscards

# Check for counter discontinuity (value should stay unchanged between polls)
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.19  # ifCounterDiscontinuityTime

# Poll error counters to distinguish drops from physical faults
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.14  # ifInErrors
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.2.2.1.20  # ifOutErrors

# Check utilization using 64-bit HC counters
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.6   # ifHCInOctets
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.10  # ifHCOutOctets
snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.31.1.1.1.15  # ifHighSpeed

# On Cisco IOS/IOS-XE: detailed drop and queue breakdown
ssh <device> 'show interface <iface> | include drop|queue|buffer'
ssh <device> 'show platform hardware fed active qos queue stats interface <iface>'
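To turn the snmpwalk output into numbers a script can difference, something like the following may help. It is a sketch that assumes the usual Net-SNMP line layout (`OID.index = Type: value`, with Timeticks printed as `(ticks) d:hh:mm:ss`); other tools or output options will need a different pattern. The function name is illustrative.

```python
import re
from typing import Dict

# Matches e.g. "IF-MIB::ifOutDiscards.2 = Counter32: 18234" and
# "IF-MIB::ifCounterDiscontinuityTime.2 = Timeticks: (0) 0:00:00.00".
_LINE = re.compile(
    r"^(?P<oid>\S+)\.(?P<index>\d+) = (?P<type>\w+): \(?(?P<value>\d+)\)?"
)


def parse_snmpwalk(output: str) -> Dict[int, int]:
    """Map ifIndex to integer value for one walked column."""
    values: Dict[int, int] = {}
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:  # silently skip lines that do not carry an integer value
            values[int(match.group("index"))] = int(match.group("value"))
    return values


if __name__ == "__main__":
    sample = (
        "IF-MIB::ifOutDiscards.1 = Counter32: 0\n"
        "IF-MIB::ifOutDiscards.2 = Counter32: 18234\n"
    )
    print(parse_snmpwalk(sample))
```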

How to diagnose it

  1. Rule out counter artifacts. Poll ifCounterDiscontinuityTime. If it changed since the last poll, the discard spike may be a calculation artifact from a counter reset, not real drops. Also check sysUpTime (.1.3.6.1.2.1.1.3.0) to confirm the device did not reboot.

  2. Distinguish discards from errors. Poll ifInErrors and ifOutErrors alongside the discard counters. If errors are also rising, the problem is physical-layer: cable, SFP, dirty fiber, or duplex mismatch. Focus on the physical path, not the buffer queue.

  3. Confirm the utilization gap. Compute utilization from ifHCInOctets, ifHCOutOctets, and ifHighSpeed. If utilization is genuinely low (under 70%) and discards are rising, you are looking at microbursts, QoS policy drops, or an input policer. If utilization is actually high (above 80%), the link is saturated and the low utilization reading was an artifact of a long load interval.

  4. Shorten the load interval. On Cisco, set load-interval 30 on the affected interface in configuration mode. This is non-disruptive (affects statistics only, not forwarding) and tightens the averaging window from the 300-second default to 30 seconds. It will not reveal millisecond-scale bursts, but it catches multi-second spikes that a 5-minute interval hides.

  5. Inspect per-queue drops. The port-level ifOutDiscards counter aggregates all queue classes. On modern ASICs, buffer allocation is per-queue, not per-port. Use vendor-specific commands to see which queue is dropping:

    • Cisco Catalyst 9000: show platform hardware fed active qos queue stats interface <iface> shows per-queue enqueue and drop counters, broken out by drop threshold (TH0, TH1, TH2).
    • Arista EOS: show queue-monitor length (LANZ) reports egress queue depth on supported platforms. If it returns nothing, confirm that LANZ is enabled with queue-monitor length in global configuration.

    On Cisco IOS/IOS-XE, ifInDiscards counts only “No Buffer Drops,” while the legacy counter locIfInputQueueDrops equals “Queue Limit Drops + No Buffer Drops.” So ifInDiscards covers only a subset of the input queue pressure the device actually experienced, and your SNMP ifInDiscards value may undercount the real input queue problem. For output, locIfOutputQueueDrops equals ifOutDiscards.

  6. Capture the burst. Embedded Packet Capture (EPC) on Cisco is unsuitable for microburst analysis because it caps capture throughput at a rate well below line rate. Use a TX-only SPAN of the affected interface instead, collected while drops are actively incrementing. The SPAN destination port must run at the same speed as the source port or faster; otherwise the SPAN session introduces its own drops.

  7. Check for speed mismatch. Traffic that enters on a 10G interface and leaves on a 1G interface will drop at the 1G egress whenever a burst arrives faster than the 1G link can drain it. This is independent of overall average load. Verify interface speed and duplex on both ends.
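Once a TX-only SPAN capture from step 6 is in hand, the burst can be made visible by binning frames into millisecond windows. The sketch below assumes the capture has already been reduced to (timestamp in microseconds, frame length in bytes) pairs by whatever tool you use; the function name, the 1 ms window, and the 90% threshold are illustrative choices. Frame lengths in a capture exclude preamble and inter-frame gap, so the computed rate slightly understates wire occupancy.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_bursts(packets: Iterable[Tuple[int, int]],
                line_rate_bps: float,
                window_us: int = 1000,
                threshold: float = 0.9) -> List[Tuple[int, float]]:
    """Return (window start in us, fraction of line rate) for busy windows."""
    bytes_per_window = defaultdict(int)
    for timestamp_us, frame_bytes in packets:
        bytes_per_window[timestamp_us // window_us] += frame_bytes

    bursts = []
    for window in sorted(bytes_per_window):
        # bits in the window, scaled to bits per second
        rate_bps = bytes_per_window[window] * 8 * 1_000_000 / window_us
        fraction = rate_bps / line_rate_bps
        if fraction >= threshold:
            bursts.append((window * window_us, fraction))
    return bursts


if __name__ == "__main__":
    # Hypothetical capture: sparse traffic plus 80 full-size frames in 1 ms.
    quiet = [(i * 10_000, 1500) for i in range(10)]
    burst = [(5_000 + i * 10, 1500) for i in range(80)]
    print(find_bursts(quiet + burst, line_rate_bps=1e9))
```

Windows that approach or exceed line rate while the interface average stays low are direct evidence of microbursts.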

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ifOutDiscards rate | Leading indicator of egress buffer exhaustion | Any nonzero sustained rate on a critical interface |
| ifInDiscards rate | Input queue overflow or policer and ACL drops | Sustained nonzero rate |
| ifHCInOctets / ifHCOutOctets | Utilization computation; must use 64-bit for links at or above 100 Mbps | Sustained above 80% indicates genuine capacity issue |
| ifInErrors / ifOutErrors | Distinguishes physical-layer faults from buffer drops | Any nonzero rate changes the diagnosis from buffer tuning to hardware investigation |
| ifCounterDiscontinuityTime | Validates counter continuity before rate calculation | Any change between polls invalidates the delta |
| Per-queue drop counters | Reveals which QoS class is actually dropping | Single queue accounting for most port-level drops |
| sysUpTime | Correlates counter resets with device reboots | Decrease between polls indicates reboot event |

Fixes

Microburst congestion

The fundamental fix is to increase available buffer or reduce burstiness. On platforms that support it, increase the buffer allocated to the affected queue. On Cisco Catalyst 9000, adjust the buffer ratio per class using queue-buffers ratio <0-100> inside the policy-map class configuration. On platforms with intra-ASIC buffer sharing (Cisco UADP 3.0-based Catalyst 9500 HP and 9600, from IOS XE 17.2.1), enable qos share-buffer in global configuration to allow AQM buffers to be shared between ASIC cores, which reduces microburst-induced discards on multi-core designs.

If buffer tuning is insufficient, the traffic pattern itself may need shaping. Shape or pace the traffic at its source or at an upstream egress so the burst is smoothed before it reaches the congested interface.

Speed or duplex mismatch

Verify autonegotiation results on both ends. A speed mismatch (10G feeding 1G) creates inherent egress drops at the slower interface, because the faster ingress can outrun the egress queue. Fix by matching interface speeds, deploying shaping to the slower rate, or upgrading the slower side to match.

QoS policy drops

If discards are concentrated in a specific queue class, the QoS policy may be doing exactly what it was configured to do. Evaluate whether the drop rate is expected for that traffic class. If not, adjust the queue buffer ratio or the policer rate for that class.

Counter wrap

Use 64-bit HC octets counters (ifHCInOctets, ifHCOutOctets) for utilization on links at or above 100 Mbps. The 32-bit ifInOctets wraps in approximately 3.4 seconds at 10G line rate, 34 seconds at 1G, and 5.7 minutes at 100M. No 64-bit discard counters exist in IF-MIB, so make sure your polling system computes discard deltas with modulo-2^32 Counter32 arithmetic to absorb wraps, and uses ifCounterDiscontinuityTime to detect resets, which arithmetic alone cannot correct.
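The wrap times quoted above follow directly from the counter width: a 32-bit octet counter holds 2^32 bytes, and at line rate the link moves speed / 8 bytes per second. A minimal sketch, with an illustrative function name:

```python
def counter32_octet_wrap_seconds(link_bps: float) -> float:
    """Seconds for a 32-bit octet counter to wrap at sustained line rate."""
    return (2 ** 32) * 8 / link_bps


if __name__ == "__main__":
    for label, bps in (("10G", 10e9), ("1G", 1e9), ("100M", 100e6)):
        print(label, round(counter32_octet_wrap_seconds(bps), 1), "seconds")
```

Any polling interval longer than the wrap time can miss one or more full wraps, which is why 32-bit octet counters cannot be trusted on fast links.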

Prevention

  • Poll at the shortest interval your collector can sustain. Five-minute polling misses microbursts entirely. One-minute polling catches multi-second spikes. Sub-second bursts remain invisible to SNMP polling regardless of interval.
  • Use 64-bit HC counters for utilization. Never use 32-bit ifInOctets or ifOutOctets for links at or above 100 Mbps.
  • Set load intervals to 30 seconds on critical interfaces to tighten the averaging window and reduce the gap between what the chart shows and what the buffer experienced.
  • Monitor per-queue drop counters, not just port-level aggregates. Port-level ifOutDiscards hides which traffic class is affected.
  • Track ifCounterDiscontinuityTime alongside discard counters. A counter reset without a corresponding sysUpTime reset indicates a counter-source bug or interface flap.
  • Baseline discard behavior during normal operation. On Catalyst 9000, output drops may be reported in bytes by default, not packets. Calculate the ratio: (total output drops) / (total output bytes transmitted) x 100. A value below 0.01% over a multi-week counter lifetime is typically transient microburst noise rather than a sustained problem.
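The baseline ratio from the last item can be computed as below. This sketch assumes both counters are in the same unit (bytes, as on Catalyst 9000 when output drops are reported in bytes); the function names are illustrative, and the 0.01% threshold is the rule of thumb stated above rather than a vendor limit.

```python
def drop_ratio_percent(dropped_bytes: int, transmitted_bytes: int) -> float:
    """Output drops as a percentage of bytes transmitted."""
    if transmitted_bytes <= 0:
        raise ValueError("transmitted_bytes must be positive")
    return 100.0 * dropped_bytes / transmitted_bytes


def looks_like_transient_noise(dropped_bytes: int, transmitted_bytes: int,
                               threshold_pct: float = 0.01) -> bool:
    """True when the lifetime drop ratio is below the rule-of-thumb threshold."""
    return drop_ratio_percent(dropped_bytes, transmitted_bytes) < threshold_pct


if __name__ == "__main__":
    # Hypothetical lifetime counters read from the interface.
    print(drop_ratio_percent(5_000_000, 100_000_000_000))
    print(looks_like_transient_noise(5_000_000, 100_000_000_000))
```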

How Netdata helps

Netdata collects interface-level SNMP counters including ifInDiscards, ifOutDiscards, ifInErrors, ifOutErrors, and 64-bit HC octets counters. The value for this scenario is correlation:

  • Discard rate against utilization. Overlay ifOutDiscards delta against ifHCOutOctets-derived utilization on the same chart. Discards climbing while utilization stays low is the microburst signature.
  • Discards against errors. Correlate ifInDiscards with ifInErrors on the same interface. Errors rising alongside discards shifts the investigation from buffer tuning to physical-layer inspection.
  • Counter discontinuity detection. Netdata tracks counter continuity across polling intervals, filtering phantom spikes from wraps or resets.
  • Per-interface alerting with context. Configure discard-rate alerts per interface criticality. A rising discard rate with positive second derivative (accelerating drops) signals congestion cascade risk before impact manifests.
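The "accelerating drops" condition in the last item can be illustrated generically. The following is not Netdata's implementation or alert syntax; it is a standalone sketch, with a hypothetical function name, of what a positive second derivative means for three consecutive discard-rate samples.

```python
from typing import Sequence


def is_accelerating(rates: Sequence[float]) -> bool:
    """True when the discard rate is rising and rising faster than before."""
    if len(rates) < 3:
        return False  # need two successive changes to compare
    latest_change = rates[-1] - rates[-2]
    previous_change = rates[-2] - rates[-3]
    return latest_change > 0 and latest_change > previous_change


if __name__ == "__main__":
    print(is_accelerating([1.0, 2.0, 4.0]))  # change grew from +1 to +2
    print(is_accelerating([1.0, 3.0, 4.0]))  # still rising, but slowing
```

In practice the samples would be smoothed first, since raw per-poll discard rates are noisy enough to trigger this test by chance.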