Collector CPU and TSDB write-queue saturation: the capacity signals

When a network monitoring collector saturates, the first visible symptom is rarely high collector CPU. It is a traffic chart that dips during a traffic spike, an SNMP poll cycle drifting past its configured interval, or unexplained gaps in flow data. The visible degradation sits one to three subsystems downstream of the actual bottleneck, which is why collector-side incidents are frequently misdiagnosed.

This reference covers the capacity signals that precede data loss. The signals are organized by the data path through the collector: NIC receive, kernel socket buffer, parser and aggregator threads, TSDB write queue, and disk. Each stage has its own saturation signature, degradation curve, and leading indicators.

Use this as a checklist for capacity monitoring on flow collectors, SNMP pollers, syslog receivers, trap receivers, and the TSDB backing them. The vendor-specific section covers OpenTelemetry Collector, Prometheus remote_write, VictoriaMetrics, Telegraf, and ntopng/nProbe.

The saturation path

Collector saturation is a cascade, not a single event. A packet arrives at the NIC ring buffer, traverses the kernel socket buffer, reaches the parser and aggregator, enters the TSDB write queue, and is flushed to disk. A bottleneck at any stage backs up everything upstream of it. Each stage requires a different remediation: NIC ring buffer tuning, socket buffer sizing, parser optimization, queue capacity increases, or disk IOPS.

flowchart TD
    A["NIC ring buffer: rx_missed_errors"] --> B["UDP socket buffer: Udp_RcvbufErrors"]
    B --> C["Parser / aggregator: per-core %soft, %user"]
    C --> D["TSDB write queue: depth vs capacity"]
    D --> E["Disk / storage: free space, iostat await"]
    E -->|backpressure| D
    D -->|backpressure| C
    C -->|slow drain| B

Downward arrows represent data movement. Upward arrows represent backpressure. When disk I/O saturates, the TSDB write queue grows. When the queue grows, the parser blocks on enqueue. When the parser blocks, the socket buffer drains slowly. When the socket buffer overflows, the kernel drops packets silently and Udp_RcvbufErrors increments. Every signal below sits at one of these five stages.
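The socket-buffer drop counter can be read straight from the kernel. The sketch below assumes the Linux /proc/net/snmp layout, where each protocol has a header line followed by a value line and the counter appears as RcvbufErrors on the Udp: lines. The function names and the sample text are illustrative, not part of any collector.

```python
def udp_counters(snmp_text):
    """Parse the Udp header/value line pair from /proc/net/snmp-style text."""
    # "UdpLite:" lines do not start with "Udp:", so they are excluded.
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def rcvbuf_drops(before_text, after_text):
    """Packets dropped on full socket buffers between two snapshots."""
    return (udp_counters(after_text)["RcvbufErrors"]
            - udp_counters(before_text)["RcvbufErrors"])


# Illustrative snapshots; on a real host, read /proc/net/snmp twice.
HEADER = ("Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors "
          "SndbufErrors InCsumErrors IgnoredMulti MemErrors\n")
before = HEADER + "Udp: 1000 2 5 900 5 0 0 0 0\n"
after = HEADER + "Udp: 9000 2 45 950 45 0 0 0 0\n"
print(rcvbuf_drops(before, after))
```

Any nonzero delta between snapshots means the kernel discarded datagrams that the collector never saw.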

Collector CPU signals

Aggregate CPU is the least useful signal on a multi-core collector. A 16-core machine with one core pinned at 100% from Receive Side Scaling (RSS) funneling shows roughly 6% aggregate. The bottleneck is real, but the aggregate hides it. Always read per-core utilization.
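The arithmetic behind that 6% figure is easy to check. A minimal sketch, assuming per-core busy percentages are already available as a list (the function name is hypothetical):

```python
def summarize_cores(per_core_busy):
    """Return (aggregate %, index of hottest core, hottest core %)."""
    aggregate = sum(per_core_busy) / len(per_core_busy)
    hottest = max(range(len(per_core_busy)), key=per_core_busy.__getitem__)
    return aggregate, hottest, per_core_busy[hottest]


# 16 cores, one pinned at 100%, the rest idle.
cores = [100.0] + [0.0] * 15
aggregate, core, pct = summarize_cores(cores)
print(f"aggregate={aggregate:.2f}% hottest=cpu{core} at {pct:.0f}%")
```

Alerting on the hottest core rather than the mean keeps a single pinned core from disappearing into the average.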

| Signal | What it means | Where to read it | Threshold |
|---|---|---|---|
| Per-core %soft (softirq) | Kernel packet processing on a specific core. One core at 100% with others idle indicates RSS funneling all interrupts to one CPU. | mpstat -P ALL 1 5 | Any core at 100% sustained with others idle |
| Per-core %sys | Kernel overhead, often context switching or packet processing. | mpstat -P ALL 1 5 | Rising alongside %soft |
| Per-core %user on collector process | Parser or aggregator bottleneck in user space. Regex-heavy per-record processing is a common cause. | top -H -p $(pgrep -d, <collector>) | Sustained high %user on parser threads |
| Load average (1-min) | Sustained oversubscription. | /proc/loadavg | > 0.7 x core count sustained |
| NET_RX / NET_TX softirq rate | Kernel receive and transmit processing load. | watch -n1 'cat /proc/softirqs' | Rate proportional to incoming packet rate |
| IRQ distribution across cores | Verifies RSS distributes packet interrupts across cores rather than funneling to one. | grep <iface> /proc/interrupts | Single core receiving most interrupts |
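Per-core %soft can also be derived without mpstat by diffing two /proc/stat snapshots. This is a sketch under the assumption of the standard Linux field order on each cpuN line (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields); the function names and sample snapshots are illustrative.

```python
def parse_proc_stat(text):
    """Map 'cpuN' to its first eight counters: user, nice, system, idle,
    iowait, irq, softirq, steal (guest time is already counted in user)."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cores[parts[0]] = [int(v) for v in parts[1:9]]
    return cores


def softirq_percent(before, after):
    """Per-core softirq share of elapsed CPU time between two snapshots."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        total = sum(deltas)
        result[name] = 100.0 * deltas[6] / total if total > 0 else 0.0
    return result


# Illustrative snapshots: cpu0 handles nearly all packet processing.
snap1 = "cpu 200 0 100 1600 20 0 80 0 0 0\n" \
        "cpu0 100 0 50 800 10 0 40 0 0 0\n" \
        "cpu1 100 0 50 800 10 0 40 0 0 0\n"
snap2 = "cpu 220 0 115 1680 20 0 165 0 0 0\n" \
        "cpu0 110 0 60 800 10 0 120 0 0 0\n" \
        "cpu1 110 0 55 880 10 0 45 0 0 0\n"
print(softirq_percent(parse_proc_stat(snap1), parse_proc_stat(snap2)))
```

A large gap between the busiest core and the rest is the RSS funneling pattern described in the table.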

Collector CPU sustained above 90% for more than 5 minutes means data loss is imminent. Sustained above 70% warrants investigation. High %soft on the NIC-receive core is expected during high packet rates; high %soft on unrelated cores suggests RSS misconfiguration or a driver issue. Use ethtool -S <iface> to check rx_missed_errors, which indicates that the NIC hardware ring buffer itself is dropping packets before the kernel can process them.

Correlate CPU signals with UDP socket buffer drops and TSDB write queue depth. A rising flow receive rate combined with rising per-core %soft and rising Udp_RcvbufErrors unambiguously identifies collector-side overload.
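The three-way correlation can be written as a simple rule. The sketch below assumes each signal is a short list of recent samples, oldest first, and treats "rising" naively as non-decreasing with a net increase; a real detector would likely smooth the series first. All names are hypothetical.

```python
def is_rising(samples):
    """Naive trend check: never decreases and ends higher than it started."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]


def collector_overload(flow_rate, softirq_pct, rcvbuf_errors):
    """True only when all three signals rise over the same window."""
    return (is_rising(flow_rate)
            and is_rising(softirq_pct)
            and is_rising(rcvbuf_errors))


print(collector_overload(
    flow_rate=[1000, 1500, 2200],    # flows/sec received
    softirq_pct=[40.0, 70.0, 95.0],  # hottest core %soft
    rcvbuf_errors=[0, 12, 340],      # cumulative Udp_RcvbufErrors
))
```

If the flow rate rises but the drop counter stays flat, the collector is keeping up and the problem lies elsewhere.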

TSDB write-queue and disk signals

The TSDB write queue is the buffer between the parser and storage. When it grows, data is being produced faster than it can be persisted. Disk fills are cliff events: the TSDB stops accepting writes and data is lost. Write-queue growth is more gradual but ends the same way.

| Signal | What it means | Where to read it | Threshold |
|---|---|---|---|
| Disk free percentage | Approaching the cliff where TSDB stops accepting writes. | df -h /var/lib/<tsdb> | < 20% TICKET, < 10% PAGE |
| TSDB write queue depth | Backlog of unwritten samples. Growing without bound means the TSDB cannot keep up with ingestion. | Collector stats endpoint or vendor metric | > 2x rolling 1-hour average TICKET, unbounded growth PAGE |
| iostat %util | Disk busy percentage during write bursts. | iostat -xz 1 5 | Approaching 100% sustained |
| iostat await | Average I/O latency. Rising await means disk performance is degrading under load. | iostat -xz 1 5 | > 20ms is a leading indicator |
| Series cardinality | Number of distinct time series the TSDB tracks. The silent inflation driver. A single /24 subnet added to a flow collector can add tens of thousands of new series. | TSDB introspection or stats endpoint | Growth > 5%/week is concerning |
| Disk fill rate | Bytes written per day. Accelerating fill rate without a known cause often points to cardinality inflation. | df over time, or TSDB ingest metrics | Accelerating beyond 7-day trend |
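The disk-free and queue-depth thresholds translate into a small severity function. This sketch assumes queue depth is sampled at a fixed interval over the past hour, and it approximates "unbounded growth" as a strictly increasing window that the current reading extends; that approximation and all names are assumptions for illustration.

```python
from statistics import mean


def disk_severity(free_pct):
    """Below 10% free pages; below 20% free opens a ticket."""
    if free_pct < 10:
        return "PAGE"
    if free_pct < 20:
        return "TICKET"
    return "OK"


def queue_severity(depth, last_hour_depths):
    """Compare current depth with the rolling 1-hour window."""
    # Assumption: "unbounded growth" means every sample in the window
    # is higher than the one before it, and the current depth is higher still.
    growing = (len(last_hour_depths) >= 2
               and all(b > a for a, b in
                       zip(last_hour_depths, last_hour_depths[1:]))
               and depth > last_hour_depths[-1])
    if growing:
        return "PAGE"
    if depth > 2 * mean(last_hour_depths):
        return "TICKET"
    return "OK"


print(disk_severity(15.0), queue_severity(500, [100, 100, 100, 100]))
```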

Two production gotchas. First, logs written to the same volume as the TSDB can fill it, and this has caused outages; use separate volumes. Second, some TSDBs (Prometheus, VictoriaMetrics) compact periodically, causing disk I/O spikes that look like saturation but are normal. Correlate compaction windows with I/O spikes before alerting.

Leading indicators and runway estimation

Each contested resource has a degradation curve and a runway. The curve tells you how failure arrives: cliff or gradual. The runway tells you how long you have before impact.

| Resource | Leading indicator | Degradation curve | Headroom target |
|---|---|---|---|
| Collector CPU (parser/aggregator) | Parser throughput (records/sec) vs incoming rate (packets/sec x records/packet). Per-core %soft rising. | Gradual then cliff. Parser slows, queue grows, latency rises, eventually buffer drops begin. | Parser capacity > 2x peak incoming rate. CPU < 60%. |
| TSDB write queue | Queue depth trending up. Write latency rising. | Graceful then cliff. Latency increases, then backpressure, then drops. | Queue depth at baseline. No enqueue failures. |
| Disk space | Free bytes trending down. Fill rate accelerating. | Cliff at 0% free. TSDB stops accepting writes. | > 30% free. 7+ days runway at current growth rate. |
| TSDB cardinality | Series count trending up. Distinct label values increasing. TSDB process memory rising. | Gradual inflation then cliff on memory exhaustion or query performance collapse. | Growth rate < 1%/week sustained. No unexpected jumps. |
| Worker thread pool | Worker queue depth growing. Processing latency rising. Timeout rate per minute. | Soft saturation then cliff. Latency rises, timeouts appear, false device down alerts cascade. | Utilization < 50% of worker capacity. Queue depth < 25% of max. Timeout rate 0%. |
| UDP socket buffer | Udp_RcvbufErrors incrementing. ss -lun -m showing Recv-Q approaching buffer limit. | Cliff. Once the buffer overflows, every additional packet is dropped. | Buffer sized to absorb burst. 16 MB+ for high-pps collectors. |

Runway estimation formulas:

  • Disk: days_to_fill = free_bytes / bytes_per_day_trend, where the trend is calculated over the past 7 days to account for weekday/weekend variation.
  • TSDB cardinality: scale the disk trend by the cardinality growth rate. If cardinality is growing at 5% per week, expect bytes_per_day to grow by about 5% per week as well, so a runway computed from a flat trend overestimates the time left.
  • Poller workers: time_to_saturation = current_capacity * (1 - current_utilization) / recent_growth_rate. Always apply a 50% safety margin.
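The disk and poller formulas can be sketched as follows. The code assumes one used-bytes sample per day, oldest first, so eight samples span seven days, and it reads "50% safety margin" as halving the raw estimate. That interpretation, the units, and the function names are assumptions.

```python
def days_to_fill(free_bytes, daily_used_bytes):
    """free_bytes / bytes_per_day_trend over (up to) the last 7 days."""
    window = daily_used_bytes[-8:]  # 8 daily samples cover 7 days
    if len(window) < 2:
        raise ValueError("need at least two daily samples")
    bytes_per_day = (window[-1] - window[0]) / (len(window) - 1)
    if bytes_per_day <= 0:
        return float("inf")  # not filling at the moment
    return free_bytes / bytes_per_day


def time_to_saturation(current_capacity, current_utilization,
                       recent_growth_rate, safety_margin=0.5):
    """Spare capacity divided by growth, reduced by the safety margin.

    With capacity in polls/sec and growth in polls/sec per day, the
    result is in days.
    """
    raw = current_capacity * (1 - current_utilization) / recent_growth_rate
    return raw * (1 - safety_margin)


used = [0, 10, 20, 30, 40, 50, 60, 70]  # illustrative units per day
print(days_to_fill(700, used))
print(time_to_saturation(1000, 0.5, 25))
```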

Vendor-specific queue-depth signals

The capacity signals above apply to any collector. The metrics below are specific to the major open-source stacks.

OpenTelemetry Collector. The critical queue metrics are otelcol_exporter_queue_size (current queue depth, read against otelcol_exporter_queue_capacity), otelcol_exporter_enqueue_failed_spans (increments when the export queue rejects data because it is full), and otelcol_processor_refused_spans (data refused by the memory_limiter; should be zero in steady state). Scale up when the queue stays above 60-70% of capacity. Scale down when it is consistently below 20%. A separate signal, otelcol_exporter_send_failed_spans, increments on permanent export failures such as HTTP 4xx or connection refused; adding replicas does not fix these.
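The scale-up and scale-down guidance can be expressed as a decision over a window of queue samples. The sketch takes queue capacity as an input (otelcol_exporter_queue_capacity is the metric that typically reports it), treats "sustained" as every sample in the window, and uses the lower end of the 60-70% range by default. The function name and window handling are assumptions.

```python
def scaling_decision(queue_size_samples, queue_capacity,
                     up_ratio=0.6, down_ratio=0.2):
    """Return 'scale_up', 'scale_down', or 'hold' for a sample window."""
    if not queue_size_samples or queue_capacity <= 0:
        raise ValueError("need samples and a positive capacity")
    ratios = [size / queue_capacity for size in queue_size_samples]
    if all(r > up_ratio for r in ratios):
        return "scale_up"
    if all(r < down_ratio for r in ratios):
        return "scale_down"
    return "hold"


print(scaling_decision([700, 800, 900], 1000))
```

Requiring every sample to cross the threshold avoids reacting to a single burst. Queue depth alone does not distinguish a slow backend from a failing one, so send-failure counters are worth reading alongside it.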

The memory_limiter processor historically refuses data with a retryable error when the soft limit is breached, relying on upstream receivers to re-queue. This creates a feedback loop with bounded queues that can amplify data loss. A drop instead of refuse mode is tracked but . Pair the memory_limiter with GOMEMLIMIT set to 80-90% of the container or host memory limit to give the Go runtime proactive GC headroom.
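Deriving the GOMEMLIMIT value is simple arithmetic. A minimal sketch, assuming the container memory limit is known in bytes and that the value is emitted with the MiB suffix accepted by the Go runtime; the helper name is hypothetical.

```python
MIB = 1024 * 1024


def gomemlimit(container_limit_bytes, fraction=0.8):
    """Return a GOMEMLIMIT string at 80-90% of the container limit."""
    if not 0.8 <= fraction <= 0.9:
        raise ValueError("fraction should stay within 0.8-0.9")
    mib = int(container_limit_bytes * fraction) // MIB  # round down
    return f"{mib}MiB"


# A 2 GiB container limit at 80%.
print("GOMEMLIMIT=" + gomemlimit(2 * 1024**3))
```

Rounding down keeps the runtime's soft limit strictly below the container's hard limit.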

Prometheus remote_write. All backpressure knobs live under queue_config. The default capacity is 10,000 samples per shard. The default max_samples_per_send is 2,000. The default max_shards is 50. The key diagnostic metric is prometheus_remote_storage_samples_pending, which tracks samples waiting in the shard queue. Lag between prometheus_remote_storage_queue_highest_sent_timestamp_seconds and prometheus_remote_storage_highest_timestamp_in_seconds signals queue backlog.
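The lag signal is the difference between the two timestamp metrics. The sketch below assumes both values have already been fetched as Unix timestamps in seconds; it also shows the worst-case number of buffered samples for a given capacity and shard count. Function names and the sample values are illustrative.

```python
def remote_write_lag_seconds(highest_timestamp_in, queue_highest_sent):
    """Seconds between the newest ingested sample and the newest sent one."""
    return max(0.0, highest_timestamp_in - queue_highest_sent)


def worst_case_buffered_samples(capacity_per_shard, shards):
    """Upper bound on samples held in shard queues at once."""
    return capacity_per_shard * shards


lag = remote_write_lag_seconds(1_700_000_120.0, 1_700_000_000.0)
print(f"lag={lag:.0f}s",
      f"buffered<={worst_case_buffered_samples(10_000, 4)}")
```

A lag that keeps growing while samples pending stays near capacity suggests the remote endpoint, not the scrape side, is the constraint.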

A recent Prometheus commit tightened remote_write resharding logic to prevent deadlocks. Older configs with manually inflated capacity values may behave differently now. Validate that per-shard memory remains reasonable.

VictoriaMetrics. The primary signal is vmagent_remotewrite_pending_data_bytes: bytes scraped but not yet sent to the remote write target. Connection saturation is diagnosed with max(rate(vm_rpc_send_duration_seconds_total{}[1m])) by(addr). A value above 0.9 means the connection spends more than 0.9 seconds of every second sending, so it is more than 90% saturated. Under sustained write load, vminsert nodes can saturate on CPU, or on the network path to vmstorage nodes, causing the remote write client to fall behind without explicit queue-full errors.
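The saturation query is a per-second rate over a counter of seconds, so the result is a fraction of wall-clock time spent sending. A sketch of the same arithmetic on two counter readings, with hypothetical names and sample values:

```python
def send_saturation(prev_total_seconds, curr_total_seconds, interval_seconds):
    """Fraction of the interval the connection spent sending (0.0 to ~1.0)."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (curr_total_seconds - prev_total_seconds) / interval_seconds


# Counter advanced by 54 seconds of send time over a 60-second window.
ratio = send_saturation(1000.0, 1054.0, 60.0)
print(f"{ratio:.0%} saturated")
```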

Telegraf. Telegraf drops metrics when its internal buffer reaches metric_buffer_limit. The log message “Metric buffer limit exceeded” appears when the buffer cannot drain faster than data arrives. The default metric_buffer_limit is 10,000. When multiple InfluxDB outputs are configured and at least one is reachable, Telegraf resets the buffer size counter even if a failed output still has queued data. This means internal_buffer_size can mislead operators into believing the buffer is draining when one output is permanently backed up.

ntopng/nProbe. The critical socket receive buffer warning appears when /proc/sys/net/core/rmem_max is below 8,388,608 bytes (8 MB). The log message instructs operators to increase it. The Linux default net.core.rmem_max is often 212992 bytes, which is inadequate for high-pps flow collectors. Production deployments should target 16 MB or higher.
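A host can be checked against both figures before the collector starts. The sketch assumes a Linux host exposing /proc/sys/net/core/rmem_max and treats 16 MB as 16 x 1024 x 1024 bytes; the function names and return labels are hypothetical.

```python
WARN_BYTES = 8 * 1024 * 1024     # 8,388,608: the warning threshold
TARGET_BYTES = 16 * 1024 * 1024  # production target for high-pps collectors


def classify_rmem_max(value_bytes):
    """Place net.core.rmem_max relative to the two thresholds."""
    if value_bytes < WARN_BYTES:
        return "below_warning_threshold"
    if value_bytes < TARGET_BYTES:
        return "below_production_target"
    return "ok"


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Read the current limit in bytes (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


print(classify_rmem_max(212992))
```

On a live host, classify_rmem_max(read_rmem_max()) gives the current state. The sysctl only raises the ceiling; the collector still has to request a larger buffer on its socket.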

How Netdata helps

Netdata monitors the full saturation path end to end, correlating signals across stages that operators normally check in isolation:

  • Per-core CPU breakdown including %soft (softirq), %sys, and %user, so RSS funneling to a single core is visible without custom mpstat dashboards.
  • UDP socket buffer drops via kernel counters (Udp_RcvbufErrors), with anomaly detection that surfaces the first nonzero increment rather than waiting for a fixed threshold.
  • Disk utilization, free space, and I/O latency (await, %util) on the TSDB volume, with fill-rate trending for runway estimation.
  • Collector process metrics when Netdata runs alongside the NPM stack, exposing parser thread CPU and write-queue depth where the vendor exposes them.
  • Cross-signal correlation between rising flow receive rate, rising UDP buffer drops, and rising TSDB write queue depth, which together unambiguously identify collector-side saturation versus exporter-side or network-side issues.