Silent UDP flow data loss: why your NetFlow collector is dropping records

Your flow analytics show traffic declining on multiple exporters simultaneously. SNMP interface counters say traffic is rising. No device alarms, no exporter config changes, no visible network events. The most likely cause: your collector is silently dropping UDP datagrams at the kernel socket buffer boundary.

UDP has no delivery guarantee. When the socket receive buffer fills, the kernel discards incoming datagrams silently. No error is logged. The only signal is UdpRcvbufErrors in /proc/net/snmp, a counter most teams do not monitor. During a traffic spike or DDoS, your charts may show “normal” or declining traffic while actual packet rates are significantly higher.

What happens at the kernel

When the kernel receives a UDP datagram but cannot deliver it to the application because the socket receive buffer is full, it increments UdpRcvbufErrors and discards the packet. This is distinct from NIC-level drops (which happen before the kernel sees the packet) and application-level parsing failures (which happen after delivery).
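The counter lives in a two-line layout in /proc/net/snmp: a header line of field names followed by a line of values, both prefixed with `Udp:`. A minimal sketch of pulling one field out by name is shown here; the function name `udp_counter` and its optional file argument are illustrative, not part of any tool.

```shell
# Print one Udp counter (for example RcvbufErrors) from snmp-format input.
# Reads /proc/net/snmp by default; a different file can be passed as $2.
udp_counter() {
    awk -v name="$1" '
        # First "Udp:" line is the header: remember each field name position.
        $1 == "Udp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        # Second "Udp:" line holds the values in the same column order.
        $1 == "Udp:" {
            if (name in col) { print $(col[name]); found = 1 }
            exit
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/net/snmp}"
}

# Usage:
#   udp_counter RcvbufErrors
#   udp_counter InDatagrams
```

Matching on the field name rather than a fixed column position keeps the lookup working if a kernel version adds or reorders Udp fields.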

The cascade: the exporter sends flow records correctly, the NIC receives them, but the application-layer parser or aggregator does not drain the socket buffer fast enough. The buffer fills. Every additional datagram is dropped. The flow bytes chart shows a decline because dropped packets never make it to the TSDB. Downstream slowness (slow parser, slow disk) causes upstream drops (socket buffer overflow). Increasing the socket buffer only buys time if the downstream bottleneck is not addressed.

flowchart TD
    A[Exporter] -->|UDP datagrams| B[NIC ring buffer]
    B --> C[Kernel backlog]
    C --> D[Socket buffer]
    D --> E[Parser thread]
    E --> F[TSDB storage]
    B -.->|rx_missed_errors| B1[NIC drops]
    C -.->|RX errors| C1[Backlog drops]
    D -.->|UdpRcvbufErrors| D1[Socket drops]
    E -.->|backpressure fills buffer| D
    F -.->|slow disk blocks parser| E

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Undersized socket buffer | `UdpRcvbufErrors` rising during traffic peaks; flat outside peaks | `net.core.rmem_max` and whether collector sets `SO_RCVBUF` |
| Single-core RSS bottleneck | Total CPU fine; one core at 100% during high packet rates | `mpstat -P ALL 1` and `cat /proc/interrupts` |
| Parser or TSDB backpressure | `UdpRcvbufErrors` rising alongside growing TSDB write queue | Collector self-stats for write queue depth and parser throughput |
| GC or scheduler pauses | Periodic burst drops correlated with runtime GC events | Collector runtime GC logs or pause metrics |
| NIC ring buffer exhaustion | RX drops in `/proc/net/dev` and `ethtool -S` incrementing | `ethtool -g <iface>` and `ethtool -S <iface> \| grep -i drop` |
| Exporter sampling-rate change | Apparent flow volume drops without collector-side signal | Device-side export counters and exporter config |

Quick checks

All commands below are read-only and safe on a production collector.

# Kernel UDP receive buffer errors (cumulative counter)
cat /proc/net/snmp | grep '^Udp:'

# Preferred: nstat -a prints absolute totals instead of deltas since the last run; -z includes zero counters
nstat -az UdpRcvbufErrors

# Alternative: netstat UDP statistics section (look for "receive buffer errors")
netstat -su | grep -A7 '^Udp:'

# Current socket buffer fill for the flow listener (port 2055 = NetFlow)
ss -lun '( sport = :2055 )' -m

# NIC-level drops (happen before socket buffer)
cat /proc/net/dev

# Detailed NIC drop counters
ethtool -S eth0 | grep -i drop

# Current NIC ring buffer settings
ethtool -g eth0

# Per-core CPU (look for one core pinned at 100%)
mpstat -P ALL 1 5

# RSS IRQ distribution across cores
cat /proc/interrupts | grep eth0

# Softirq rates for packet processing
cat /proc/softirqs | grep -E 'NET_RX|NET_TX'

# Verify packets are arriving at the NIC
tcpdump -i eth0 -nn 'udp port 2055' -c 1000

How to diagnose it

Confirm drops exist. Run nstat -az UdpRcvbufErrors twice, 60 seconds apart. If the counter increments, datagrams are being dropped at the socket buffer. Any nonzero increment rate on a flow collector is lost telemetry.
Determine where the loss occurs. Compare the device-side export rate against the collector inbound rate. On a Cisco device, poll export counters via SNMP. If the device-exported rate exceeds the collector inbound rate, loss is in transit or at the collector.
Rule out NIC-level drops. Check /proc/net/dev for RX drops on the collector flow-ingress interface. If RX drops are incrementing alongside UdpRcvbufErrors, the NIC ring buffer is also overflowing. This is a separate problem (ring buffer size and RSS).
Check for single-core saturation. Run mpstat -P ALL 1 5 during a high-traffic window. If one core is at 100% softirq or sys while others are idle, RSS is funneling all packet processing to one CPU. This creates a serialization bottleneck regardless of socket buffer size.
Check collector-internal backpressure. If the collector exposes a TSDB write queue depth metric, check whether it is growing. A rising write queue means the parser or storage layer cannot keep up, which backs up into the socket buffer. The root cause is downstream.
Rule out exporter-side changes. If only a single exporter shows declining flow volume, check the exporter sampling rate. A change from 1:100 to 1:1000 reduces flow record volume by 10x without any collector-side drop. Multiple exporters declining simultaneously points to the collector.
Check for runtime pauses. If drops occur in periodic bursts rather than continuous streams, a managed-runtime collector (Java, Go) may be experiencing GC stop-the-world pauses. During a pause, the application stops draining the socket buffer.
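The first step, sampling the counter twice and comparing, can be sketched as two small helpers. The function names are illustrative, and the 60-second interval is simply the one suggested in the steps; the only external command assumed is `nstat` from iproute2.

```shell
# Read the cumulative UdpRcvbufErrors value
# (-a: absolute value, -z: print the counter even when it is zero).
read_rcvbuf_errors() {
    nstat -az UdpRcvbufErrors | awk '$1 == "UdpRcvbufErrors" { print $2 }'
}

# Average drops per second between two samples of a cumulative counter.
# Arguments: <before> <after> <interval_seconds>
drops_per_second() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# Usage:
#   before=$(read_rcvbuf_errors); sleep 60; after=$(read_rcvbuf_errors)
#   drops_per_second "$before" "$after" 60
```

Any result above 0.00 during the sampling window means datagrams were discarded at the socket buffer in that window.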

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `UdpRcvbufErrors` (`/proc/net/snmp`) | Only direct kernel signal for socket buffer overflow drops | Any nonzero increment rate on a flow collector |
| Flow packets received rate | Establishes whether the collector is receiving datagrams at all | Rate drops to zero from one exporter while others are normal |
| NIC RX drops (`/proc/net/dev`) | Pre-socket-buffer drops at the hardware level | Incrementing alongside `UdpRcvbufErrors` means full-stack overload |
| Per-core CPU utilization (`mpstat`) | Single-core saturation from RSS misconfiguration is invisible in aggregate | One core at 100% with others idle during high packet rates |
| Softirq NET_RX rate (`/proc/softirqs`) | Kernel packet processing load distribution | Concentrated on one core rather than distributed |
| Collector inbound vs. device exported rate | End-to-end loss detection | Device exported exceeds collector inbound with no device-side drops |
| TSDB write queue depth | Downstream backpressure that backs up into the socket buffer | Queue growing without bound |

Fixes

Undersized socket buffer

The Linux default net.core.rmem_max is typically 212,992 bytes (208 KB) on most distributions, which is inadequate for high-pps flow collectors. Production deployments should target 16 MB or higher, with 33 MB for very high-volume collectors.

# Immediate (runtime, non-persistent)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216

# Persistent: add to /etc/sysctl.d/99-flow-collector.conf
# net.core.rmem_max=16777216
# net.core.rmem_default=16777216

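Warning: rmem_default applies to every socket on the system that does not set its own SO_RCVBUF, not just the flow listener (TCP sockets take their default from net.ipv4.tcp_rmem instead). The memory is not reserved up front, but any such socket with a stalled reader can queue up to 16 MB. On a dedicated flow collector this is usually fine; on a shared host, only raise rmem_max and let the application request the buffer size.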

Raising rmem_max only sets the ceiling. The application must also request a larger buffer via setsockopt(SO_RCVBUF); a request above rmem_max is silently capped at the ceiling. Some collectors hardcode the size they request, so they never take advantage of a raised ceiling. For nfcapd, use the -B flag:

# Set nfcapd socket buffer size (bytes)
nfcapd -B 262144 -w /var/nfdump -p 2055

The nfcapd man page recommends raising this value above 100k for high-volume traffic. If the application’s hardcoded SO_RCVBUF value is below rmem_max, raising rmem_max alone has no effect on that collector.

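Note: the kernel doubles the value passed to SO_RCVBUF to leave room for bookkeeping overhead, and getsockopt() returns the doubled value. A socket requesting 128 KB is reported as 256 KB (the request is first capped at rmem_max, then doubled). The extra space is consumed by per-packet kernel metadata rather than payload, so plan for roughly the requested size in queued flow data, and less when datagrams are small.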
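One way to check which limit the collector's socket actually ended up with is the `skmem` field printed by `ss -m`, where `r` is the receive memory currently in use and `rb` is the receive buffer limit (the already-doubled value). The helper below is a sketch that only reformats that field; the name `skmem_rx` is made up for illustration.

```shell
# Extract receive-queue usage (r) and receive buffer limit (rb), in bytes,
# from `ss -m` output read on stdin.
skmem_rx() {
    sed -n 's/.*skmem:(r\([0-9]*\),rb\([0-9]*\),.*/queued=\1 limit=\2/p'
}

# Usage (port 2055 as in the quick checks above):
#   ss -lunm '( sport = :2055 )' | skmem_rx
```

If `limit` still shows a value near the distribution default after the sysctl change and a collector restart, the collector is most likely requesting its own smaller size.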

Single-core RSS bottleneck

If mpstat shows one core at 100% and others idle, RSS is not distributing receive interrupts across CPUs.

Verify current IRQ distribution: cat /proc/interrupts | grep eth0
Check how many receive queues the NIC exposes: ethtool -l eth0
Enable RSS on the NIC if supported, or configure RPS (Receive Packet Steering) to distribute packets to multiple CPUs in software via /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
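For the RPS option, the value written to rps_cpus is a hexadecimal bitmask with one bit per CPU. A small sketch for building that mask from a list of CPU numbers follows; `rps_mask` is a hypothetical helper, and it only handles CPUs 0 to 31 (larger systems use comma-separated 32-bit groups, which this sketch does not produce).

```shell
# Build the hex CPU bitmask expected by rps_cpus from a list of CPU indices.
# Limited to CPUs 0-31 in this sketch.
rps_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Usage (requires root; eth0 and rx-0 are placeholders for your interface and queue):
#   rps_mask 0 1 2 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

For example, CPUs 0 through 3 produce the mask `f`, and CPUs 1 and 3 produce `a`.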

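Alternatively, if the collector supports SO_REUSEPORT, run multiple collector processes sharing the same port. This lets the kernel distribute incoming datagrams across worker processes. By default the kernel picks the worker by hashing the source and destination address and port, so all datagrams from one exporter land on the same worker; the spread improves as the number of exporters grows.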

NIC ring buffer exhaustion

If /proc/net/dev shows RX drops incrementing, the NIC hardware ring buffer is overflowing before packets reach the kernel socket layer.

# Check current ring buffer settings
ethtool -g eth0

# Increase RX ring buffer (driver and NIC dependent)
ethtool -G eth0 rx 4096

Warning: ethtool -G resizes the ring buffer at runtime and may cause momentary packet loss during the transition. Apply during a maintenance window or low-traffic period. For persistence, add it to your interface bring-up script or systemd-networkd configuration.
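To confirm that a ring buffer change actually stopped the drops, the RX drop counter in /proc/net/dev can be read before and after. The sketch below assumes the standard column order of that file (interface, then RX bytes, packets, errs, drop); the function name `rx_dropped` and the optional file argument are illustrative.

```shell
# Print the cumulative RX drop counter for one interface.
# Reads /proc/net/dev by default; a different file can be passed as $2.
rx_dropped() {
    awk -v ifc="$1" '
        # Turn "eth0:" (or "eth0:12345") into a separate first field.
        { sub(/:/, " ") }
        # After the split: $1 iface, $2 bytes, $3 packets, $4 errs, $5 drop.
        $1 == ifc { print $5; found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/net/dev}"
}

# Usage:
#   rx_dropped eth0
```

Because the counter is cumulative, compare two readings taken some time apart rather than judging a single value.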

Parser or TSDB backpressure

If socket buffer drops are caused by the collector not draining fast enough, increasing the buffer only delays the drops. The root cause is downstream:

Parser bottleneck: check if the parser uses expensive regex or per-record string operations on every flow record.
TSDB write blocking: if the storage layer is slow, the write queue grows, blocking the parser thread, which stops draining the socket buffer. Check disk I/O latency with iostat -xz 1 and TSDB series cardinality.
Thread pool exhaustion: if the collector uses a fixed worker pool, verify it is sized for peak load.

Raising rmem_max buys time but does not solve the throughput problem.

GC or scheduler pauses

For collectors built on managed runtimes (Java, Go, .NET), GC stop-the-world pauses can stall the socket drain for tens of milliseconds. During high-pps periods, this is enough to overflow even a correctly sized buffer. The pattern is periodic burst drops rather than continuous loss.

Check the collector runtime for GC pause duration metrics. If pauses exceed the time it takes to fill the socket buffer at your peak packet rate, tune the GC (larger heap, different collector algorithm, lower allocation rate in the parser path) or switch to a collector with lower pause overhead.
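The comparison between pause length and buffer fill time is simple arithmetic: fill time is the buffer size divided by the incoming byte rate. The sketch below works that out; the packet rate and datagram size in the usage lines are assumed example figures, not measurements, and the result is an upper bound because the kernel also charges per-packet overhead against the buffer.

```shell
# Estimate how long a stalled reader has before the socket buffer fills.
# Arguments: <buffer_bytes> <packets_per_second> <bytes_per_datagram>
# Prints milliseconds; ignores kernel per-packet overhead, so real time is shorter.
buffer_fill_ms() {
    awk -v b="$1" -v p="$2" -v s="$3" 'BEGIN { printf "%.1f\n", b / (p * s) * 1000 }'
}

# Usage (100000 pps and 1500-byte datagrams are assumed example values):
#   buffer_fill_ms 16777216 100000 1500   # 16 MB buffer
#   buffer_fill_ms 212992 100000 1500     # typical default buffer
```

With those example figures, a 16 MB buffer absorbs a stall of about 111.8 ms, while the 212,992-byte default absorbs only about 1.4 ms, which is shorter than many GC pauses.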

Prevention

Monitor UdpRcvbufErrors continuously. This counter is the earliest signal for socket buffer overflow. Alert on any nonzero increment rate. A dropped-to-received ratio above 0.1% is actionable data loss.
Set rmem_max to 16 MB or higher at deployment time. Do not rely on the default. For very high-volume collectors, target 33 MB.
Verify RSS distribution at deployment. After configuring a new collector, confirm packet processing interrupts are distributed across multiple cores using cat /proc/interrupts.
Track collector inbound rate against device-side export counters. This is the only reliable end-to-end loss detection method.
Monitor per-core CPU, not just aggregate. A single saturated core from RSS misconfiguration is invisible in aggregate CPU metrics.
Separate flow storage from log storage. Log files on the same volume as the TSDB can fill the disk and cause the collector to silently drop records.
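The 0.1% dropped-to-received threshold from the first item can be checked with a small helper fed by the deltas of RcvbufErrors and InDatagrams over the same interval. This is a sketch: `drop_ratio_actionable` is a hypothetical name, and wiring it into an alerting system is left out.

```shell
# Succeed (exit 0) when dropped/received exceeds the threshold percentage.
# Arguments: <dropped_delta> <received_delta> [threshold_percent, default 0.1]
drop_ratio_actionable() {
    awk -v d="$1" -v r="$2" -v t="${3:-0.1}" 'BEGIN {
        # Nothing received: any drop at all counts as actionable.
        if (r == 0) exit (d > 0 ? 0 : 1)
        exit (d / r * 100 > t ? 0 : 1)
    }'
}

# Usage:
#   if drop_ratio_actionable "$dropped" "$received"; then
#       echo "UDP receive buffer drops above 0.1%"
#   fi
```

Using deltas over a fixed window rather than the cumulative totals keeps an old burst of drops from masking or inflating the current ratio.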

How Netdata helps

Netdata provides the following signals relevant to diagnosing silent UDP flow loss:

IPv4 UDP statistics including UdpRcvbufErrors from /proc/net/snmp, as part of net stack monitoring. Configure an alarm on any nonzero rate for collectors running flow listeners.
Per-core CPU utilization with softirq breakdown, making single-core RSS saturation visible without manual mpstat sessions.
Network interface error and drop counters from /proc/net/dev, covering the NIC ring buffer layer that precedes socket buffer drops.
SoftIRQ rates (NET_RX and NET_TX) per CPU, showing whether packet processing is concentrated on one core.
Disk I/O latency and utilization on the collector, for diagnosing TSDB write backpressure that backs up into the socket buffer.

Correlate these during incidents: a rising UdpRcvbufErrors rate alongside elevated per-core softirq on the receive core and growing disk I/O wait points to a cascade from storage backpressure through the parser into the socket buffer.
