Syslog parser backpressure: when one chatty device stalls the pipeline

A single device floods your syslog collector. The parser thread pool saturates, queues fill, and UDP datagrams start dropping at the kernel socket buffer. Critical messages from other devices, including BGP NOTIFICATIONS and hardware alarms, are silently lost. The dashboard shows a normal or slightly elevated syslog rate because dropped packets never reach the application layer.

The collector process is still running. The network is fine. The failure is inside the ingestion pipeline, at the seam between the kernel socket buffer and the parser, where backpressure builds and has nowhere to go.

How backpressure develops

Backpressure cascades through three stages.

Stage one: a single source (a flapping interface, a misconfigured debug level, a compromised host, a device in a boot loop) generates syslog at a rate that exceeds the parser’s drain capacity. The parser thread pool becomes CPU-bound on regex matching, RFC 3164/5424 framing, or enrichment lookups.

Stage two: the collector’s internal queue fills. In rsyslog, if a per-action queue cannot drain in time, messages back up into the main message queue. Once the main queue reaches its high-water mark, the collector attempts to throttle delayable inputs (TCP, RELP, imfile). But UDP syslog is inherently non-delayable. There is no flow control in UDP.

Stage three: the kernel socket receive buffer overflows. Datagrams arriving while the buffer is full are silently dropped, and the counter UdpRcvbufErrors (the RcvbufErrors column under Udp: in /proc/net/snmp) increments. The collector never sees these messages, and no log entry records the loss. The kernel drops by arrival time, not by severity or source, so a critical message from another device that arrives during the burst window is as likely to be dropped as one more line of the flood.

flowchart TD
    A[Chatty device flood] --> B[Parser thread pool saturates]
    B --> C[Internal queue fills to high-water mark]
    C --> D{Delayable input?}
    D -->|TCP/RELP| E[Sender throttled]
    D -->|UDP 514| F[Cannot throttle]
    F --> G[Kernel socket buffer overflows]
    G --> H[UdpRcvbufErrors increments]
    H --> I[Silent loss from ALL devices]

Because the main queue and parser pool are shared resources, backpressure affects every source sending to that collector. A BGP NOTIFICATION from a core router, a hardware alarm from a firewall, an authentication failure from a switch: all can be lost if they arrive during the window when the buffer is full.
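The shared-buffer effect can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real kernel or collector: the tick-based timing, the buffer capacity, and the drain rate are all made-up assumptions, and `simulate` is a hypothetical helper. It shows only that once one source exceeds the drain rate, the quiet source sharing the buffer starts losing messages too.

```python
import random
from collections import deque

def simulate(ticks, capacity, drain_per_tick, chatty_per_tick,
             quiet_per_tick=1, seed=0):
    """Toy model: two sources feed one bounded buffer drained at a fixed rate."""
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    buffer = deque()
    delivered = {"chatty": 0, "quiet": 0}
    dropped = {"chatty": 0, "quiet": 0}
    for _ in range(ticks):
        arrivals = ["chatty"] * chatty_per_tick + ["quiet"] * quiet_per_tick
        rng.shuffle(arrivals)          # arrival order within a tick is arbitrary
        for source in arrivals:
            if len(buffer) < capacity:
                buffer.append(source)
            else:
                dropped[source] += 1   # full buffer: dropped whatever the source
        # The "parser" drains a fixed number of messages per tick.
        for _ in range(min(drain_per_tick, len(buffer))):
            delivered[buffer.popleft()] += 1
    return delivered, dropped

calm = simulate(ticks=1000, capacity=50, drain_per_tick=10, chatty_per_tick=5)
flood = simulate(ticks=1000, capacity=50, drain_per_tick=10, chatty_per_tick=100)
print("calm  dropped:", calm[1])
print("flood dropped:", flood[1])
```

In the calm run nothing is dropped because arrivals stay below the drain rate. In the flood run the quiet source loses most of its messages even though its own rate never changed.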

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Link-flap or STP cascade | Burst of linkDown/linkUp syslog pairs, thousands per second | ifOperStatus history on the flapping interface; correlate with STP topology change count |
| Debug-level logging left on | Sustained high-volume DEBUG or INFO from one device, no severity escalation | Per-source syslog rate breakdown; check device running config for debug enable |
| Device boot loop | Repeating boot sequence messages from one device at regular intervals | sysUpTime for that device; coldStart trap rate |
| Compromised or scanning host | Spike in auth-failure or security syslog from one source IP | Source IP in syslog messages; correlate with SNMP auth failures |
| Collector-side parser bottleneck | CPU-bound parser, high %user or %soft on collector, all sources affected | mpstat per-core CPU; per-thread CPU via top -H |
| Undersized UDP socket buffer | UdpRcvbufErrors incrementing during bursts, receiver process not CPU-bound | sysctl net.core.rmem_max; ss -lun -m for current Recv-Q |

Quick checks

Run these on the syslog collector host.

# Primary silent-loss signal: datagrams dropped because the socket buffer was full
nstat -az UdpRcvbufErrors

# Raw UDP statistics from /proc (look at the RcvbufErrors column under Udp:)
grep '^Udp:' /proc/net/snmp

# Syslog listener socket state and buffer fill level
ss -lunm '( sport = :514 )'

# Per-core CPU saturation (RSS funneling shows as one core pinned)
mpstat -P ALL 1 5

# NIC ring buffer drops (pre-socket-buffer loss)
# Replace eth0 with your collector interface
ethtool -S eth0 | grep -iE 'drop|miss'

# Current kernel max receive buffer size
sysctl net.core.rmem_max net.core.rmem_default

# rsyslog queue stats (requires impstats module loaded and a destination configured)
# Adjust path to your impstats output file
grep -E 'queue|enqueued|full' /var/log/rsyslog-stats.log

# Per-source syslog volume (field position depends on your syslog format)
# $4 is typical for RFC 3164 traditional format; adjust for your layout
awk '{print $4}' /var/log/network-devices.log | sort | uniq -c | sort -rn | head -20
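For continuous checking, the same counter can be read programmatically. The following is a minimal sketch: the function names `parse_udp_counters` and `rcvbuf_error_delta` and the 5-second sampling interval are illustrative assumptions, and the sample text only mimics the two-line header/value layout of the Udp: rows in /proc/net/snmp.

```python
import time

def parse_udp_counters(snmp_text):
    """Return the Udp: counters from /proc/net/snmp text as a dict of ints."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp: header/value pair found")
    header, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(header, map(int, values)))

def rcvbuf_error_delta(interval_s=5.0, path="/proc/net/snmp"):
    """Sample RcvbufErrors twice and return the increase over the interval."""
    with open(path) as f:
        before = parse_udp_counters(f.read())["RcvbufErrors"]
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_udp_counters(f.read())["RcvbufErrors"]
    return after - before

# Sample text shaped like the Udp: rows of /proc/net/snmp (values are made up).
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 982113 14 5120 3310 5120 0 0 0 0
"""
print(parse_udp_counters(SAMPLE)["RcvbufErrors"])  # 5120
```

On a Linux collector, `rcvbuf_error_delta()` returning anything above zero means datagrams were dropped at a socket buffer during the sampling window.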

Systematic diagnosis

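  1. Confirm silent loss. Run nstat -az UdpRcvbufErrors twice, a few seconds apart. An increment between samples means datagrams reached the kernel but were dropped because a socket receive buffer was full. The counter is system-wide, so confirm that the syslog listener is the affected socket (its receive queue in the ss output for port 514 will be at or near the buffer limit) before concluding that the collector cannot keep up.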

  2. Determine whether the bottleneck is parser CPU or kernel packet processing. Run mpstat -P ALL 1 5. High %user indicates parser CPU. High %soft indicates kernel packet processing. A single core at 100% with others idle points to RSS funneling, not parser throughput.

  3. Identify the chatty source. Break down syslog volume by source device. The chatty source will dominate by orders of magnitude. Correlate with the device’s operational state: is an interface flapping? Is debug logging enabled? Did the device recently reboot?

  4. Check the collector’s internal queue depth. In rsyslog with impstats, look for queue size approaching the configured maximum and for full or delay indicators. In syslog-ng, check the stats counters for output queue length on each destination.

  5. Verify scope of impact. Compare the syslog receive rate from a known-quiet device during the burst versus outside it. If the quiet device’s messages are missing during the burst, the backpressure is global and the shared pipeline is stalled.

  6. Rule out exporter-side loss. Check the device’s own logging counters to confirm it is sending what you expect. If the device reports a higher send rate than the collector receives, the gap is in transit or at the collector.
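Step 3 can be sketched in a few lines. This assumes the traditional RFC 3164 layout used by the awk one-liner above, where the hostname is the fourth whitespace-separated field; `source_shares` is a hypothetical helper, and the log lines are invented for illustration.

```python
from collections import Counter

def source_shares(lines, host_field=3):
    """Count messages per source host and return (host, count, share) rows,
    largest first. host_field=3 is the fourth field (RFC 3164 layout)."""
    counts = Counter(
        fields[host_field]
        for fields in (line.split() for line in lines)
        if len(fields) > host_field
    )
    total = sum(counts.values())
    return [(host, n, n / total) for host, n in counts.most_common()]

# Invented sample: one flapping switch drowning out two quiet devices.
sample = (
    ["Mar  3 10:00:01 edge-sw-07 %LINK-3-UPDOWN: Interface Gi1/0/12, changed state to down"] * 950
    + ["Mar  3 10:00:01 core-rtr-01 %BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down"] * 30
    + ["Mar  3 10:00:02 fw-02 %ASA-4-106023: Deny tcp src outside"] * 20
)

for host, count, share in source_shares(sample):
    print(f"{host:12s} {count:6d} {share:.1%}")
```

Keep in mind that this only counts messages that reached the log file. During a burst the quiet devices may look even quieter than they really were, because part of their traffic was dropped before parsing.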

Metrics to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| UdpRcvbufErrors (/proc/net/snmp) | Only direct signal of UDP datagrams dropped at the socket buffer | Any nonzero increment; growth proportional to the incoming rate suggests a chronically undersized buffer |
| Syslog receive rate per source | Identifies which device is generating the burst | Single source sustaining more than 5x its rolling 1-hour average |
| Collector per-core CPU (mpstat) | Detects parser thread pool saturation or RSS funneling | Single core at 100% with others idle, or aggregate %user above 70% |
| rsyslog main queue depth (impstats) | Shows backpressure building before drops occur | Queue size approaching queue.size; full events logged |
| syslog-ng output queue length | Shows destination backpressure | Queue growing without bound for a specific destination |
| NIC RX drops (/proc/net/dev) | Pre-socket-buffer loss at the ring buffer | RX drop counters (such as rx_missed_errors) incrementing |
| Syslog severity distribution | Distinguishes real events from noise storms | Rate spike without severity escalation means noise (flap, debug) |
| /proc/net/softnet_stat | Kernel packet processing backpressure | Column 2 (dropped) incrementing; column 3 (time_squeeze) rising means the softirq budget is running out |
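The softnet_stat file is easy to misread because its values are hexadecimal and its columns are unlabeled. The sketch below assumes the usual layout, one row per CPU with processed, dropped, and time_squeeze as the first three columns. `parse_softnet_stat` is an illustrative name, and the sample rows are invented.

```python
def parse_softnet_stat(text):
    """Parse /proc/net/softnet_stat text into one dict per row.
    Values are hexadecimal. Rows usually map to CPUs in order, but the row
    index is only an approximation of the CPU id if CPUs are offline."""
    rows = []
    for index, line in enumerate(l for l in text.splitlines() if l.strip()):
        cols = [int(c, 16) for c in line.split()]
        if len(cols) >= 3:
            rows.append({
                "row": index,
                "processed": cols[0],     # packets handled by this CPU
                "dropped": cols[1],       # dropped: input backlog queue was full
                "time_squeeze": cols[2],  # softirq ran out of budget or time
            })
    return rows

# Invented sample with two rows (values are made up).
SAMPLE = """\
000003e8 0000000a 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000007d0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""
for row in parse_softnet_stat(SAMPLE):
    print(row)
```

As with the UDP counter, what matters is the change between two samples, not the absolute value since boot.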

Fixes

Isolate the chatty source in its own queue

The most effective fix is structural: prevent one source from consuming the shared parser and queue resources.

In rsyslog, assign the chatty device to a dedicated ruleset with its own action queue. Configure disk-assisted queuing (queue.filename) for that ruleset so bursts spill to disk rather than backing up into the main queue. Use RainerScript queue.* syntax; legacy dollar-sign directives still work but can produce nondeterministic behavior when mixed with advanced syntax in rsyslog 8.x.

In syslog-ng, route the chatty source to a separate log path with explicit flow-control and disk-based buffering. Without flow-control declared, a slow destination in a shared log path causes silent message drops across all sources in that path. With flow-control, syslog-ng spills to disk before dropping, buying time during downstream outages.

The tradeoff: isolating the source means its messages may be delayed during bursts (disk-assisted queuing adds latency). For a noisy source whose messages are low-value, this is acceptable. For a source whose messages are high-value but high-volume (a core firewall), you need a bigger queue or more parser threads, not isolation alone.

Size the UDP socket buffer correctly

The Linux default for net.core.rmem_max varies by distribution and is commonly as low as 212,992 bytes (208 KiB). For a syslog collector receiving bursts, this is frequently insufficient. The buffer must be large enough to absorb the peak burst while the parser catches up.

Set net.core.rmem_max to 16 MB or higher for production syslog collectors, and set the receive buffer size explicitly on the listener socket, because rmem_max is only a ceiling and a socket that never requests more keeps the default size. In rsyslog, use the rcvbufSize parameter on the imudp input. In syslog-ng, use so-rcvbuf() on the UDP source definition.

When sizing, target 1 to 2 seconds of peak-rate headroom. The kernel internally allocates roughly twice the value you request via SO_RCVBUF (capped at rmem_max), so account for that when calculating.
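The sizing arithmetic can be sketched as follows. The peak rate and average message size are made-up inputs, `size_rcvbuf` is a hypothetical helper, and the overhead factor of 2 is an assumption chosen to mirror the kernel's doubling. Real per-datagram overhead for small messages can be considerably higher, so treat the result as a floor, not a guarantee.

```python
import math

def size_rcvbuf(peak_msgs_per_s, avg_msg_bytes, headroom_s=2.0,
                overhead_factor=2.0):
    """Estimate the SO_RCVBUF request needed to hold headroom_s of peak traffic.

    overhead_factor is an ASSUMED ratio of kernel memory charged per datagram
    to its payload size. The kernel doubles the requested value, so the
    request is half of the kernel memory we want available.
    """
    payload_bytes = peak_msgs_per_s * avg_msg_bytes * headroom_s
    kernel_bytes = math.ceil(payload_bytes * overhead_factor)
    request_bytes = math.ceil(kernel_bytes / 2)
    return {
        "payload_bytes": int(payload_bytes),
        "kernel_bytes": kernel_bytes,
        "so_rcvbuf_request": request_bytes,  # rmem_max must be at least this
    }

# Hypothetical burst: 20,000 messages/s averaging 300 bytes, 2 s of headroom.
plan = size_rcvbuf(peak_msgs_per_s=20_000, avg_msg_bytes=300)
print(plan)
print("fits under 16 MiB rmem_max:", plan["so_rcvbuf_request"] <= 16 * 1024 * 1024)
```

With these inputs the request comes to 12,000,000 bytes, which fits under a 16 MiB rmem_max. A higher peak rate or a more pessimistic overhead factor pushes the requirement past it.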

Scale parser threads

The default number of worker threads per queue in rsyslog is 1. A single worker processing a burst from one source will serialize all parsing for that queue. Increase queue.workerThreads to allow parallel processing.

For syslog-ng versions before 4.2, UDP reception on a given port is single-threaded regardless of CPU cores. Even so-reuseport(yes) routes all packets from one source IP to the same thread. syslog-ng 4.2.0 introduces an ebpf(reuseport(sockets(N))) plugin that distributes a single high-rate UDP source across N worker threads using eBPF SO_REUSEPORT. This plugin is disabled by default at compile time and requires a recent kernel.

Apply rate limiting at ingress

If the chatty source is genuinely noisy and its messages are low-value, rate-limit it at the collector before messages enter the main queue.

In rsyslog, the imuxsock module can enforce per-PID rate limiting (200 messages per 5-second interval in releases where it is on by default; check your version, because newer releases may require enabling it explicitly). When the limit is exceeded, rsyslog logs "imuxsock begins to drop messages from pid XXXX due to rate-limiting". This applies to local Unix socket inputs, not remote UDP. For remote UDP sources, there is no built-in per-source-IP rate limiter in imudp.

Do not run with rate limiting disabled entirely (for example, with the rate-limit interval set to 0). Without any rate limit, a runaway source can fill /var and take down the collector’s host.
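Since imudp has no per-source limiter, per-source limiting has to happen somewhere else, for example in a relay or filter placed in front of the collector. The token-bucket sketch below shows the idea only. `PerSourceRateLimiter` is a hypothetical class, not an rsyslog or syslog-ng feature, and the rate and burst values are arbitrary.

```python
import time

class PerSourceRateLimiter:
    """Token bucket per source: each source may send `burst` messages at once
    and is refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # source -> (tokens, last_seen); grows with sources

    def allow(self, source):
        now = self.clock()
        tokens, last = self._buckets.get(source, (self.burst, now))
        # Refill for the elapsed time, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[source] = (tokens - 1, now)
            return True
        self._buckets[source] = (tokens, now)
        return False

# Deterministic demo with a fake clock so the output does not depend on timing.
fake_now = [0.0]
limiter = PerSourceRateLimiter(rate_per_s=10, burst=5, clock=lambda: fake_now[0])
chatty = [limiter.allow("10.0.0.5") for _ in range(8)]
print(chatty)                      # [True, True, True, True, True, False, False, False]
print(limiter.allow("10.0.0.9"))   # True: other sources have their own bucket
```

The point of the design is that the chatty source exhausts only its own bucket. A production version would also need to expire idle entries so the bucket table does not grow without bound.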

Prevention

  • Monitor UdpRcvbufErrors continuously. Any nonzero increment on a syslog collector is abnormal. Alert on it, not just chart it.
  • Track syslog receive rate per source. A single source dominating volume is a finding, not just noise.
  • Load impstats (rsyslog) or enable stats counters (syslog-ng) permanently. Queue depth is a leading indicator that precedes drops by minutes. Without it, you are blind until the kernel starts dropping.
  • Size net.core.rmem_max proactively. Do not wait for the first burst to discover the default is too small. 16 MB is a reasonable starting point for a production syslog collector.
  • Verify RSS IRQ distribution. One core at 100% during a syslog burst with other cores idle means RSS is funneling all UDP interrupts to one CPU. Check cat /proc/interrupts | grep eth0 to verify distribution.
  • Separate syslog storage from TSDB storage. A syslog flood that fills the disk volume can take down flow collection running on the same host.
  • Pre-configure isolation rulesets for known-noisy device classes. Wireless controllers, load balancers, and devices prone to debug-level logging should have dedicated queues from day one.

How Netdata helps

  • UdpRcvbufErrors monitoring. Netdata charts UDP socket buffer drops system-wide from /proc/net/snmp. Configure an alert on any nonzero increment for hosts running syslog collectors.
  • Per-core CPU breakdown. Netdata’s cpu collector provides per-core utilization including softirq time. A single core pinned at 100% during a syslog burst is visible without running mpstat manually.
  • NIC ring buffer drops. The netdev collector tracks /proc/net/dev RX drops, which precede socket-buffer drops in the loss cascade.
  • Disk space on syslog volumes. Netdata’s disk collector monitors free space and fill rate. A syslog flood filling /var is detected before it causes a hard failure.
  • Cross-signal correlation. During a syslog flood, Netdata’s unified timeline lets you correlate the burst with device control-plane CPU, interface state changes, and BGP events on the same dashboard, which helps confirm whether the syslog noise reflects a real network event or is purely a logging artifact.