Silent UDP flow data loss: why your NetFlow collector is dropping records
Your flow analytics show traffic declining on multiple exporters simultaneously. SNMP interface counters say traffic is rising. No device alarms, no exporter config changes, no visible network events. The most likely cause: your collector is silently dropping UDP datagrams at the kernel socket buffer boundary.
UDP has no delivery guarantee. When the socket receive buffer fills, the kernel discards incoming datagrams silently. No error is logged. The only signal is UdpRcvbufErrors in /proc/net/snmp, a counter most teams do not monitor. During a traffic spike or DDoS, your charts may show “normal” or declining traffic while actual packet rates are significantly higher.
What happens at the kernel
When the kernel receives a UDP datagram but cannot deliver it to the application because the socket receive buffer is full, it increments UdpRcvbufErrors and discards the packet. This is distinct from NIC-level drops (which happen before the kernel sees the packet) and application-level parsing failures (which happen after delivery).
The cascade: the exporter sends flow records correctly, the NIC receives them, but the application-layer parser or aggregator does not drain the socket buffer fast enough. The buffer fills. Every additional datagram is dropped. The flow bytes chart shows a decline because dropped packets never make it to the TSDB. Downstream slowness (slow parser, slow disk) causes upstream drops (socket buffer overflow). Increasing the socket buffer only buys time if the downstream bottleneck is not addressed.
flowchart TD
A[Exporter] -->|UDP datagrams| B[NIC ring buffer]
B --> C[Kernel backlog]
C --> D[Socket buffer]
D --> E[Parser thread]
E --> F[TSDB storage]
B -.->|rx_missed_errors| B1[NIC drops]
C -.->|RX errors| C1[Backlog drops]
D -.->|UdpRcvbufErrors| D1[Socket drops]
E -.->|backpressure fills buffer| D
F -.->|slow disk blocks parser| E
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Undersized socket buffer | UdpRcvbufErrors rising during traffic peaks; flat outside peaks | net.core.rmem_max and whether collector sets SO_RCVBUF |
| Single-core RSS bottleneck | Total CPU fine; one core at 100% during high packet rates | mpstat -P ALL 1 and cat /proc/interrupts |
| Parser or TSDB backpressure | UdpRcvbufErrors rising alongside growing TSDB write queue | Collector self-stats for write queue depth and parser throughput |
| GC or scheduler pauses | Periodic burst drops correlated with runtime GC events | Collector runtime GC logs or pause metrics |
| NIC ring buffer exhaustion | RX drops in /proc/net/dev and ethtool -S incrementing | ethtool -g <iface> and ethtool -S <iface> \| grep -i drop |
| Exporter sampling-rate change | Apparent flow volume drops without collector-side signal | Device-side export counters and exporter config |
Quick checks
All commands below are read-only and safe on a production collector.
# Kernel UDP receive buffer errors (cumulative counter)
cat /proc/net/snmp | grep '^Udp:'
# Preferred: nstat -a prints absolute counter values; -z also shows counters that are still zero
nstat -az UdpRcvbufErrors
# Alternative: netstat UDP statistics (the counter appears as "receive buffer errors")
netstat -su | grep -i 'receive buffer errors'
# Current socket buffer fill for the flow listener (port 2055 = NetFlow)
ss -lunm 'sport = :2055'
# NIC-level drops (happen before socket buffer)
cat /proc/net/dev
# Detailed NIC drop counters
ethtool -S eth0 | grep -i drop
# Current NIC ring buffer settings
ethtool -g eth0
# Per-core CPU (look for one core pinned at 100%)
mpstat -P ALL 1 5
# RSS IRQ distribution across cores
cat /proc/interrupts | grep eth0
# Softirq rates for packet processing
cat /proc/softirqs | grep -E 'NET_RX|NET_TX'
# Verify packets are arriving at the NIC
tcpdump -i eth0 -nn 'udp port 2055' -c 1000
How to diagnose it
Confirm drops exist. Run nstat -az UdpRcvbufErrors twice, 60 seconds apart. If the counter increments, datagrams are being dropped at the socket buffer. Any nonzero increment rate on a flow collector is lost telemetry.
Determine where the loss occurs. Compare the device-side export rate against the collector inbound rate. On a Cisco device, poll export counters via SNMP. If the device-exported rate exceeds the collector inbound rate, loss is in transit or at the collector.
Rule out NIC-level drops. Check /proc/net/dev for RX drops on the collector flow-ingress interface. If RX drops are incrementing alongside UdpRcvbufErrors, the NIC ring buffer is also overflowing. This is a separate problem (ring buffer size and RSS).
Check for single-core saturation. Run mpstat -P ALL 1 5 during a high-traffic window. If one core is at 100% softirq or sys while others are idle, RSS is funneling all packet processing to one CPU. This creates a serialization bottleneck regardless of socket buffer size.
Check collector-internal backpressure. If the collector exposes a TSDB write queue depth metric, check whether it is growing. A rising write queue means the parser or storage layer cannot keep up, which backs up into the socket buffer. The root cause is downstream.
Rule out exporter-side changes. If only a single exporter shows declining flow volume, check the exporter sampling rate. A change from 1:100 to 1:1000 reduces flow record volume by 10x without any collector-side drop. Multiple exporters declining simultaneously points to the collector.
Check for runtime pauses. If drops occur in periodic bursts rather than continuous streams, a managed-runtime collector (Java, Go) may be experiencing GC stop-the-world pauses. During a pause, the application stops draining the socket buffer.
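As a minimal sketch of the first step (confirming that the counter increments), the POSIX shell functions below read UdpRcvbufErrors, which is the RcvbufErrors column of the Udp: rows in /proc/net/snmp, and turn two samples into a per-second drop rate. The function names and the optional file argument are illustrative and not part of any collector.

```sh
# Print the RcvbufErrors value from a file in /proc/net/snmp format.
# The first "Udp:" row holds the column names, the second holds the values.
udp_rcvbuf_errors() {  # usage: udp_rcvbuf_errors [FILE]
    awk '$1 == "Udp:" {
        if (!seen) { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1 }
        else       { print $(col["RcvbufErrors"]); exit }
    }' "${1:-/proc/net/snmp}"
}

# Print drops per second from two samples taken SECONDS apart.
drop_rate() {  # usage: drop_rate BEFORE AFTER SECONDS
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# Example, left commented out because it sleeps for a minute:
#   before=$(udp_rcvbuf_errors); sleep 60; after=$(udp_rcvbuf_errors)
#   drop_rate "$before" "$after" 60
```

Looking the column up by name rather than by position keeps the sketch working if a kernel version adds columns to the Udp: row.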
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| UdpRcvbufErrors (/proc/net/snmp) | Only direct kernel signal for socket buffer overflow drops | Any nonzero increment rate on a flow collector |
| Flow packets received rate | Establishes whether the collector is receiving datagrams at all | Rate drops to zero from one exporter while others are normal |
| NIC RX drops (/proc/net/dev) | Pre-socket-buffer drops at the hardware level | Incrementing alongside UdpRcvbufErrors means full-stack overload |
| Per-core CPU utilization (mpstat) | Single-core saturation from RSS misconfiguration is invisible in aggregate | One core at 100% with others idle during high packet rates |
| Softirq NET_RX rate (/proc/softirqs) | Kernel packet processing load distribution | Concentrated on one core rather than distributed |
| Collector inbound vs. device exported rate | End-to-end loss detection | Device exported exceeds collector inbound with no device-side drops |
| TSDB write queue depth | Downstream backpressure that backs up into the socket buffer | Queue growing without bound |
Fixes
Undersized socket buffer
The default net.core.rmem_max is 212,992 bytes (208 KB) on most Linux distributions, which is inadequate for high-pps flow collectors. Production deployments should target 16 MB or higher, with 33 MB for very high-volume collectors.
# Immediate (runtime, non-persistent)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
# Persistent: add to /etc/sysctl.d/99-flow-collector.conf
# net.core.rmem_max=16777216
# net.core.rmem_default=16777216
Warning: rmem_default applies to every UDP socket on the system that does not set SO_RCVBUF itself, not only the flow listener. The memory is not preallocated, but each such socket can then queue up to 16 MB before the kernel starts dropping. On a dedicated flow collector this is usually fine; on a shared host, only raise rmem_max and let the application request the buffer size.
Raising rmem_max only sets the ceiling. The application must also request a larger buffer via setsockopt(SO_RCVBUF). Some collectors set this internally to a hardcoded value, regardless of how high the kernel ceiling is. For nfcapd, use the -B flag:
# Set nfcapd socket buffer size (bytes)
nfcapd -B 262144 -w /var/nfdump -p 2055
The nfcapd man page recommends raising this value above 100k for high-volume traffic. If the application’s hardcoded SO_RCVBUF value is below rmem_max, raising rmem_max alone has no effect on that collector.
Note: the kernel internally doubles the value passed to SO_RCVBUF for bookkeeping overhead. A socket requesting 128 KB gets 256 KB of actual buffer space (subject to the rmem_max ceiling). The effective buffer is larger than the application requests, not smaller.
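The interaction between the requested size, the rmem_max ceiling, and the doubling can be sketched as simple arithmetic. This is a simplified model: the function name is hypothetical, and it ignores the kernel's minimum buffer size and the privileged SO_RCVBUFFORCE option.

```sh
# Model of the value the kernel stores for SO_RCVBUF:
# clamp the request to net.core.rmem_max, then double it.
effective_rcvbuf() {  # usage: effective_rcvbuf REQUESTED_BYTES RMEM_MAX_BYTES
    req=$1
    max=$2
    if [ "$req" -gt "$max" ]; then
        req=$max
    fi
    echo $((req * 2))
}

effective_rcvbuf 131072 16777216   # 128 KB request, 16 MB ceiling: prints 262144
effective_rcvbuf 16777216 212992   # 16 MB request, default ceiling: prints 425984
```

The second call shows why a collector that asks for 16 MB on an untuned host still ends up with well under 1 MB. The doubled figure is what ss -m reports in the rb field of skmem, so the model can be compared against the listener's actual socket.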
Single-core RSS bottleneck
If mpstat shows one core at 100% and others idle, RSS is not distributing receive interrupts across CPUs.
- Verify current IRQ distribution: cat /proc/interrupts | grep eth0
- Check how many receive queues the NIC exposes: ethtool -l eth0
- Enable RSS on the NIC if supported, or configure RPS (Receive Packet Steering) to distribute packets to multiple CPUs in software via /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
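The rps_cpus file takes a hexadecimal CPU bitmask. A minimal sketch for building a mask that covers the first N CPUs follows; the function name is illustrative, and eth0 and rx-0 in the commented line are placeholders for your interface and queue.

```sh
# Hex bitmask selecting CPUs 0..N-1, the format rps_cpus expects.
# Valid for 1 to 32 CPUs; larger masks are written as comma-separated 32-bit groups.
rps_mask() {  # usage: rps_mask CPU_COUNT
    printf '%x\n' $(( (1 << $1) - 1 ))
}

rps_mask 4   # prints f
rps_mask 8   # prints ff

# Writing the mask changes kernel state and needs root, so it is left commented out:
#   rps_mask 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

Unlike the read-only quick checks, writing to rps_cpus takes effect immediately and does not persist across reboots.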
Alternatively, if the collector supports SO_REUSEPORT, run multiple collector processes sharing the same port. This lets the kernel distribute incoming datagrams across worker processes.
NIC ring buffer exhaustion
If /proc/net/dev shows RX drops incrementing, the NIC hardware ring buffer is overflowing before packets reach the kernel socket layer.
# Check current ring buffer settings
ethtool -g eth0
# Increase RX ring buffer (driver and NIC dependent)
ethtool -G eth0 rx 4096
Warning: ethtool -G resizes the ring buffer at runtime and may cause momentary packet loss during the transition. Apply during a maintenance window or low-traffic period. For persistence, add it to your interface bring-up script or systemd-networkd configuration.
Parser or TSDB backpressure
If socket buffer drops are caused by the collector not draining fast enough, increasing the buffer only delays the drops. The root cause is downstream:
- Parser bottleneck: check if the parser uses expensive regex or per-record string operations on every flow record.
- TSDB write blocking: if the storage layer is slow, the write queue grows, blocking the parser thread, which stops draining the socket buffer. Check disk I/O latency with iostat -xz 1 and TSDB series cardinality.
- Thread pool exhaustion: if the collector uses a fixed worker pool, verify it is sized for peak load.
Raising rmem_max buys time but does not solve the throughput problem.
GC or scheduler pauses
For collectors built on managed runtimes (Java, Go, .NET), GC stop-the-world pauses can stall the socket drain for tens of milliseconds. During high-pps periods, this is enough to overflow even a correctly sized buffer. The pattern is periodic burst drops rather than continuous loss.
Check the collector runtime for GC pause duration metrics. If pauses exceed the time it takes to fill the socket buffer at your peak packet rate, tune the GC (larger heap, different collector algorithm, lower allocation rate in the parser path) or switch to a collector with lower pause overhead.
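To compare a measured pause against the buffer, it helps to estimate how long the buffer can absorb traffic while the collector is stalled. The sketch below is a rough upper bound: the 50,000 pps rate and 1,400-byte datagram size are assumed figures for illustration, and the kernel charges per-packet overhead against the buffer, so the real time is shorter.

```sh
# Upper bound, in milliseconds, on how long a stalled collector can absorb traffic:
# buffer bytes * 1000 / (packets per second * bytes per datagram).
buffer_fill_ms() {  # usage: buffer_fill_ms BUFFER_BYTES PPS DATAGRAM_BYTES
    awk -v b="$1" -v p="$2" -v s="$3" 'BEGIN { printf "%.1f\n", b * 1000 / (p * s) }'
}

buffer_fill_ms 212992 50000 1400     # default-sized buffer: prints 3.0
buffer_fill_ms 16777216 50000 1400   # 16 MB buffer: prints 239.7
```

With these assumed figures, a pause in the tens of milliseconds overflows a default-sized buffer many times over but fits within a 16 MB one. Higher packet rates shrink both numbers proportionally, so rerun the estimate with your own peak rate.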
Prevention
- Monitor UdpRcvbufErrors continuously. This counter is the earliest signal for socket buffer overflow. Alert on any nonzero increment rate. A dropped-to-received ratio above 0.1% is actionable data loss.
- Set rmem_max to 16 MB or higher at deployment time. Do not rely on the default. For very high-volume sFlow collectors, target 33 MB.
- Verify RSS distribution at deployment. After configuring a new collector, confirm packet processing interrupts are distributed across multiple cores using cat /proc/interrupts.
- Track collector inbound rate against device-side export counters. This is the only reliable end-to-end loss detection method.
- Monitor per-core CPU, not just aggregate. A single saturated core from RSS misconfiguration is invisible in aggregate CPU metrics.
- Separate flow storage from log storage. Log files on the same volume as the TSDB can fill the disk and cause the collector to silently drop records.
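The 0.1% dropped-to-received threshold from the list above can be checked with a few lines of shell. This is a sketch under one assumption: DROPPED and RECEIVED are deltas of the RcvbufErrors and InDatagrams counters over the same interval. The function names are illustrative.

```sh
# Dropped-to-received ratio as a percentage.
drop_ratio_pct() {  # usage: drop_ratio_pct DROPPED RECEIVED
    awk -v d="$1" -v r="$2" 'BEGIN {
        if (r == 0) { print "n/a"; exit }
        printf "%.3f\n", d * 100 / r
    }'
}

# Exit status 0 when the ratio exceeds the threshold (default 0.1 percent).
drop_ratio_exceeds() {  # usage: drop_ratio_exceeds DROPPED RECEIVED [THRESHOLD_PCT]
    awk -v d="$1" -v r="$2" -v t="${3:-0.1}" 'BEGIN { exit !(r > 0 && d * 100 / r > t) }'
}

drop_ratio_pct 50 100000   # prints 0.050
```

Because drop_ratio_exceeds communicates through its exit status, it can gate an alert directly, for example: if drop_ratio_exceeds "$dropped" "$received"; then echo "UDP drop ratio above 0.1%"; fi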
How Netdata helps
Netdata provides the following signals relevant to diagnosing silent UDP flow loss:
- IPv4 UDP statistics including UdpRcvbufErrors from /proc/net/snmp, as part of net stack monitoring. Configure an alarm on any nonzero rate for collectors running flow listeners.
- Per-core CPU utilization with softirq breakdown, making single-core RSS saturation visible without manual mpstat sessions.
- Network interface error and drop counters from /proc/net/dev, covering the NIC ring buffer layer that precedes socket buffer drops.
- SoftIRQ rates (NET_RX and NET_TX) per CPU, showing whether packet processing is concentrated on one core.
- Disk I/O latency and utilization on the collector, for diagnosing TSDB write backpressure that backs up into the socket buffer.
Correlate these during incidents: a rising UdpRcvbufErrors rate alongside elevated per-core softirq on the receive core and growing disk I/O wait points to a cascade from storage backpressure through the parser into the socket buffer.







