Udp_RcvbufErrors: tuning kernel receive buffers for flow, trap, and syslog collectors
Udp_RcvbufErrors is incrementing on your flow collector. Flow charts show traffic declining during what is actually a traffic spike. The kernel is receiving datagrams from exporters but the socket receive buffer is full, so it drops them silently. No application-level counter moves. No error log fires. The dashboards lie downward while the real traffic goes upward.
Flow collectors (NetFlow v5/v9, IPFIX, sFlow), SNMP trap receivers (UDP 162), and syslog receivers (UDP 514) all depend on UDP socket buffers. When the buffer overflows, the kernel increments Udp_RcvbufErrors in /proc/net/snmp and discards the datagram. The application never sees it.
The common Linux default net.core.rmem_max of 212,992 bytes (208 KB) is the starting point for most incidents at this layer. Production flow collectors typically need 16 MB or more. Very high-pps sFlow collectors may need 33 MB. But raising the ceiling alone is not always the fix: the application must request a larger buffer via SO_RCVBUF, the parser must drain it fast enough, and the global UDP memory ceiling (net.ipv4.udp_mem) can impose a separate limit.
What this means
When a UDP datagram arrives, the kernel attempts to place it in the destination socket’s receive buffer. If the buffer is full because the application has not read from it fast enough, the kernel drops the datagram and increments Udp_RcvbufErrors. The counter is system-wide across all UDP sockets. It does not tell you which socket, which port, or which exporter was affected.
Two layers of drops exist, and they have different fixes:
- NIC ring buffer drops happen at the hardware level, before the packet reaches the socket layer. Check /proc/net/dev RX drop columns and ethtool -S <iface> for counters like rx_missed_errors.
- Socket buffer drops happen after the NIC has accepted the packet, at the kernel-to-application delivery boundary. Check Udp_RcvbufErrors and ss -lun -m Recv-Q.
Both must be monitored. If only one is rising, it narrows the problem. If both are rising, the entire receive path is saturated.
flowchart TD
A[UdpRcvbufErrors incrementing] --> B{NIC RX drops rising too?}
B -- Yes --> C[Fix NIC ring buffer and RSS first]
B -- No --> D[Problem is at socket layer]
D --> E{ss -m: Recv-Q near limit?}
E -- No --> F[Check udp_mem global pressure]
E -- Yes --> G{Collector CPU pattern?}
G -- One core at 100% --> H[RSS misconfiguration]
G -- System-wide high --> I[Parser or TSDB bottleneck]
G -- Low or normal --> J[Undersized rmem_max]
J --> K[Raise rmem_max + rmem_default]
K --> L[Verify app sets SO_RCVBUF]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Undersized rmem_max | Drops proportional to incoming packet rate; ss -m shows Recv-Q at limit | sysctl net.core.rmem_max |
| Slow consumer (parser or TSDB write blocked) | Drops during bursts; collector CPU not fully utilized (I/O bound) | Collector write queue depth or parser stats |
| RSS misconfiguration | One CPU core pinned at 100% while others are idle; drops during high pps | grep <iface> /proc/interrupts |
| Global UDP memory pressure | Drops continue even after raising rmem_max and SO_RCVBUF | cat /proc/sys/net/ipv4/udp_mem |
| Application not requesting larger buffer | rmem_max raised but ss -m shows buffer still at old size | getsockopt return value or app config |
Quick checks
All read-only and safe to run on a production collector:
# System-wide UdpRcvbufErrors counter
nstat -az UdpRcvbufErrors
# Same data via /proc/net/snmp (RcvbufErrors column)
cat /proc/net/snmp | grep '^Udp:'
# Current socket buffer fill for a flow listener on port 2055
ss -lun '( sport = :2055 )' -m
# Current rmem_max and rmem_default
sysctl net.core.rmem_max net.core.rmem_default
# Global UDP memory pressure limits (min pressure max, in pages)
cat /proc/sys/net/ipv4/udp_mem
# NIC RX drops (happen before socket layer)
cat /proc/net/dev
# Detailed NIC drop counters
ethtool -S eth0 | grep -i drop
# Per-core CPU utilization and softirq distribution
mpstat -P ALL 1 5
# IRQ distribution for the NIC
cat /proc/interrupts | grep eth0
# Kernel packet processing backpressure
cat /proc/net/softnet_stat
How to diagnose it
1. Confirm the counter is actively incrementing. Run nstat -az UdpRcvbufErrors twice, 30 seconds apart. The second value should be higher if drops are ongoing. A historically nonzero value that is not growing may represent a past incident already resolved.
2. Check whether NIC-level drops are also rising. Read /proc/net/dev and ethtool -S <iface> for rx_missed_errors. If NIC drops are rising alongside Udp_RcvbufErrors, fix the NIC ring buffer and RSS first. The socket buffer overflow is a downstream symptom of packets arriving faster than the kernel can process them at all.
3. Inspect the listener socket's current buffer state. Run ss -lun '( sport = :2055 )' -m (replace 2055 with 6343 for sFlow, 4739 for IPFIX, 162 for traps, 514 for syslog). If Recv-Q is near the buffer limit, the application is not draining fast enough.
4. Check the effective buffer size. The ss -m output shows the actual receive buffer allocated. If you raised rmem_max but the socket still shows the old size, the application has not called setsockopt(SO_RCVBUF) with the larger value, or it was started before the sysctl change. Already-running sockets do not pick up a new rmem_max automatically.
5. Examine CPU utilization per core. Run mpstat -P ALL 1 5. A single core at 100% in the %soft column indicates RSS is funneling all packet processing to one CPU. System-wide high CPU indicates a parser or TSDB bottleneck.
6. Check global UDP memory pressure. If drops persist after raising rmem_max and verifying SO_RCVBUF, run cat /proc/sys/net/ipv4/udp_mem. This sets the global UDP memory ceiling across all sockets (format: min pressure max, in pages). If aggregate UDP memory exceeds the pressure threshold, the kernel drops packets even when individual socket buffers have room.
7. Compare device-side export counts against collector inbound rate. On a Cisco device, snmpget -v2c -c <community> <device> .1.3.6.1.4.1.9.9.387.1.4.4 returns cnfESPktsExported. If the device exported significantly more than the collector received, the gap is silent loss in transit or at the socket buffer. This is the only reliable end-to-end loss detection method.
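Step 1 can be scripted so the comparison does not depend on reading two numbers by eye. The following is a minimal sketch, not a finished tool: the function names udp_counter and udp_rcvbuf_delta are made up for illustration, and it assumes the usual /proc/net/snmp layout in which the first Udp: row lists column names and the second lists values.

```shell
# Print one column (for example RcvbufErrors) from the "Udp:" rows of a
# /proc/net/snmp-style file. The first Udp: row holds the column names,
# the second holds the values.
# $1 = column name, $2 = file to read (defaults to /proc/net/snmp)
udp_counter() {
    awk -v col="$1" '
        $1 == "Udp:" && !seen { for (i = 2; i <= NF; i++) idx[$i] = i; seen = 1; next }
        $1 == "Udp:"          { if (col in idx) print $(idx[col]); exit }
    ' "${2:-/proc/net/snmp}"
}

# Sample RcvbufErrors twice and print how much it grew over the interval.
# $1 = seconds between samples (defaults to 30), $2 = file to read
udp_rcvbuf_delta() {
    before=$(udp_counter RcvbufErrors "$2")
    sleep "${1:-30}"
    after=$(udp_counter RcvbufErrors "$2")
    echo $(( after - before ))
}
```

Running udp_rcvbuf_delta 30 on the collector prints 0 when the counter is only historically nonzero, and a positive number when drops are ongoing.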
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| UdpRcvbufErrors | The only direct kernel signal for socket buffer drops | Any nonzero increment in production |
| ss -m Recv-Q | Shows real-time buffer fill per socket | Recv-Q approaching buffer limit |
| UdpInDatagrams | Total UDP datagrams received, for computing drop ratio | Drop ratio > 0.001 (0.1%) |
| NIC RX drops (/proc/net/dev) | Drops at hardware layer, before socket | Any nonzero RX drop rate on flow-ingress NIC |
| Per-core CPU %soft | Indicates RSS distribution problems | Single core at 100% while others idle |
| Collector write queue depth | Slow consumer backing up the buffer | Queue growing without bound |
| Flow packets received rate | Incoming load on the collector | Spike correlated with drop spike |
| udp_mem utilization | Global UDP memory pressure | Aggregate near pressure threshold |
| Flow inbound vs device exported | End-to-end loss detection | Inbound significantly less than exported |
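The drop-ratio warning in the table is simple arithmetic over two counter deltas. As a rough sketch, the hypothetical helper below (drop_ratio_status is not an existing tool) divides the UdpRcvbufErrors increase by the UdpInDatagrams increase over the same interval and compares the result with the 0.001 threshold. Treating UdpInDatagrams as the denominator follows the table above; it is an approximation, not a kernel-defined ratio.

```shell
# Compare a drop ratio against a threshold.
# $1 = RcvbufErrors delta, $2 = InDatagrams delta over the same interval,
# $3 = threshold (defaults to 0.001, i.e. 0.1%)
drop_ratio_status() {
    awk -v drops="$1" -v rx="$2" -v limit="${3:-0.001}" 'BEGIN {
        # Avoid dividing by zero when nothing was received in the interval.
        if (rx + 0 == 0) { print "no-traffic"; exit }
        ratio = drops / rx
        printf "%s ratio=%.4f\n", (ratio > limit ? "WARN" : "ok"), ratio
    }'
}
```

For example, 5 drops against 1,000 received datagrams is a ratio of 0.005, which is above the warning threshold.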
Fixes
Raise rmem_max and rmem_default
The immediate fix for an undersized ceiling:
# Runtime change (takes effect for new sockets immediately)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=8388608
# Persistent configuration
cat >> /etc/sysctl.d/99-udp-collector.conf << 'EOF'
net.core.rmem_max = 16777216
net.core.rmem_default = 8388608
EOF
sysctl --system
Start with 16 MB for rmem_max and 8 MB for rmem_default. For very high-volume sFlow collectors, 33 MB may be necessary. Already-running sockets do not pick up the new rmem_max automatically. The collector process must restart or re-bind its listener socket for the new ceiling to take effect.
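The byte values in the sysctl lines are plain MiB multiples (16 × 1024 × 1024 = 16777216). A small sketch for checking a host against the baseline follows; the helper names mib_to_bytes and rmem_meets_target are illustrative only.

```shell
# Convert MiB to bytes, the unit net.core.rmem_max and rmem_default use.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Report whether a current sysctl value meets a target given in MiB.
# $1 = current value in bytes, $2 = target in MiB
rmem_meets_target() {
    if [ "$1" -ge "$(mib_to_bytes "$2")" ]; then
        echo "ok"
    else
        echo "below target: $1 < $(mib_to_bytes "$2")"
    fi
}
```

On a collector, rmem_meets_target "$(sysctl -n net.core.rmem_max)" 16 would report whether the 16 MB baseline is in place.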
Verify the application sets SO_RCVBUF
Raising rmem_max sets the ceiling, but the application must explicitly request a larger buffer via setsockopt(SOL_SOCKET, SO_RCVBUF, size). The kernel internally doubles the requested value for bookkeeping overhead, so getsockopt() returns roughly 2x what was requested. This doubling is documented in socket(7) and is normal behavior.
If the application uses SO_RCVBUFFORCE (requires CAP_NET_ADMIN or root), it can exceed rmem_max. Some hardened or containerized builds disable SO_RCVBUFFORCE, causing the application to fall back to the unprivileged SO_RCVBUF path silently. Check the application documentation for how it configures receive buffers. For rsyslog’s imudp module, the rcvbufSize parameter controls this. If rsyslog drops privileges before opening the socket, the unprivileged SO_RCVBUF call may be capped at rmem_max.
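To check what the socket actually received, the skmem line printed by ss -m can be parsed. The sketch below assumes the field layout described in ss(8): r is receive memory currently allocated, rb is the receive buffer size, and d is the per-socket drop count. The helper names are hypothetical. expected_rcvbuf models only the unprivileged SO_RCVBUF path described above (request capped at rmem_max, then doubled) and ignores the kernel's minimum buffer size.

```shell
# Read `ss -m` output on stdin and print the receive-side skmem fields.
skmem_summary() {
    sed -n 's/.*skmem:(r\([0-9]*\),rb\([0-9]*\),.*,d\([0-9]*\)).*/queued=\1 rcvbuf=\2 drops=\3/p'
}

# Estimate the rb value ss should show after an unprivileged SO_RCVBUF call:
# the request is capped at rmem_max, then doubled by the kernel.
# $1 = requested bytes, $2 = net.core.rmem_max
expected_rcvbuf() {
    req=$1
    if [ "$req" -gt "$2" ]; then
        req=$2
    fi
    echo $(( req * 2 ))
}
```

Piping ss -lun '( sport = :2055 )' -m into skmem_summary and comparing rcvbuf with expected_rcvbuf shows whether the application's request took effect. The drops field is per socket, which helps because Udp_RcvbufErrors is system-wide.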
Raise udp_mem under global pressure
If UdpRcvbufErrors persists after raising rmem_max and verifying SO_RCVBUF, the system may be hitting the global UDP memory ceiling. Read cat /proc/sys/net/ipv4/udp_mem (values are in pages, typically 4 KB each). If aggregate UDP memory is near the pressure value, raise the max field proportionally. Raising rmem_max alone allows more sockets to request large buffers, which increases aggregate kernel memory pressure. Under udp_mem pressure, the kernel drops packets aggressively even within individual socket limits. The fix is to raise both rmem_max and udp_mem.max together.
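Because udp_mem is expressed in pages, it is easy to misjudge how much memory the thresholds represent. The sketch below converts the three fields to MiB and reads current UDP memory use from the mem field of the UDP: line in /proc/net/sockstat, which is also counted in pages. The function names are illustrative, and the 4096-byte default page size is an assumption; pass the output of getconf PAGESIZE to be sure.

```shell
# Convert the three udp_mem fields (pages) to MiB.
# $1 = "min pressure max" string, $2 = page size in bytes (defaults to 4096)
udp_mem_mib() {
    echo "$1" | awk -v page="${2:-4096}" '{
        printf "min=%d pressure=%d max=%d\n",
            $1 * page / 1048576, $2 * page / 1048576, $3 * page / 1048576
    }'
}

# Print UDP memory currently in use, in pages, from a sockstat-style file.
# $1 = file to read (defaults to /proc/net/sockstat)
udp_mem_in_use() {
    awk '$1 == "UDP:" { for (i = 2; i < NF; i++) if ($i == "mem") print $(i + 1) }' \
        "${1:-/proc/net/sockstat}"
}
```

For example, udp_mem_mib "$(cat /proc/sys/net/ipv4/udp_mem)" "$(getconf PAGESIZE)" prints the thresholds in MiB, and udp_mem_in_use can be compared directly with the pressure field since both are in pages.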
Fix RSS distribution
If one CPU core is at 100% in %soft while others are idle, RSS is funneling all flow traffic to a single core. Verify IRQ distribution with cat /proc/interrupts | grep <iface>. The fix is platform-specific. Some NICs require ethtool -X to set the RSS indirection table. Others need IRQ affinity adjustments via /proc/irq/<n>/smp_affinity. The goal is to distribute receive interrupts across multiple cores so no single core becomes the bottleneck.
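Reading /proc/interrupts by eye gets hard once a NIC has many queues. As a rough sketch, the hypothetical helper below sums interrupt counts per CPU for every row that mentions the interface. It assumes the standard layout: a header row of CPU names followed by rows whose first field is the IRQ number. The counts are totals since boot, so comparing two runs taken a few seconds apart gives a better picture than a single run.

```shell
# Sum interrupt counts per CPU for rows matching an interface name.
# $1 = interface pattern (for example eth0), $2 = file (defaults to /proc/interrupts)
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }      # header row: CPU0 CPU1 ...
        $0 ~ pat {                       # one row per NIC queue IRQ
            for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1)
        }
        END { for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, sum[i] }
    ' "${2:-/proc/interrupts}"
}
```

If nearly all of the total lands on one CPU, that matches the single-core %soft saturation pattern described above.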
Fix the consumer
If collector CPU is system-wide high (not just one core), the bottleneck is the parser or the TSDB write path, not the buffer size. Raising rmem_max buys time by absorbing bursts but does not fix the throughput problem. Identify whether the parser is CPU-bound (heavy regex on every record) or I/O-bound (TSDB write queue blocking the ingestion thread). Common fixes: simplify parsing logic, batch TSDB writes, move log files to a separate volume from the TSDB, or scale the collector horizontally.
Prevention
- Set rmem_max and rmem_default before deploying collectors. Apply the sysctl configuration as part of host provisioning, not as incident response. 16 MB is a safe baseline; 33 MB for high-volume sFlow.
- Monitor UdpRcvbufErrors continuously. Any nonzero increment is abnormal in production. Alert on it directly, not on a derived threshold.
- Verify SO_RCVBUF after every collector restart. Confirm the effective buffer size with ss -lun -m. Configuration changes during upgrades can silently reset buffer settings.
- In Kubernetes, apply sysctls inside the pod network namespace. Each pod has its own network namespace. Changing rmem_max on the host node does not affect pod containers unless the setting is applied inside the pod (privileged init container or DaemonSet). CNI plugins vary in whether they inherit host sysctls, so verify empirically.
- On Azure AKS, the default rmem_max is 1,048,576 bytes (1 MB). This is insufficient for any moderately busy collector. Use linuxOSConfig in the Node Pool API to raise netCoreRmemMax and netCoreRmemDefault before deploying UDP-based collectors.
- Separate the TSDB volume from log storage. Log growth on the same volume as the TSDB has caused collector outages when disk fills.
- Monitor per-core CPU. RSS misconfiguration is invisible in aggregate CPU utilization. Track per-core %soft to catch single-core saturation before it causes drops.
How Netdata helps
- Netdata collects UdpRcvbufErrors from /proc/net/snmp natively, with per-second resolution. Alert on any nonzero increment without manual instrumentation.
- The ipv4 collector exposes the full UDP SNMP table, including UdpInDatagrams, UdpRcvbufErrors, and UdpInErrors. Correlating receive rate against drop rate gives you the loss ratio directly.
- Per-core CPU metrics are collected by default, making RSS misconfiguration visible as one core at 100% while others are idle.
- NIC RX and TX drop counters from /proc/net/dev and ethtool -S are collected natively. Correlating NIC drops against socket buffer drops narrows the problem to the correct layer.
- If Netdata is your syslog or trap receiver, the same UdpRcvbufErrors counter applies. Netdata monitors its own ingestion health.
- Disk space and I/O metrics on the collector host help detect TSDB write bottlenecks before they back up the UDP receive buffer.







