Flow export-to-ingest latency: why your NetFlow data is minutes behind

Flow export-to-ingest latency accumulates across a pipeline: the exporter’s active timeout, the UDP transport path, the kernel socket buffer, the collector’s parser, and the storage write queue. Each stage can add seconds or minutes, and each has a different fix.

The most common cause is the active timeout default on most network devices: 30 minutes. Long-lived flows (VPN tunnels, database connections, bulk transfers) are not exported until the timer expires. The collector is not slow and the network is not congested. The device is behaving as configured. But if you need near-real-time visibility, a 30-minute export delay is indistinguishable from broken telemetry.

What this means

Flow export-to-ingest latency is the time between when a flow record leaves the exporter and when the collector ingests it. NetFlow v5, NetFlow v9, and IPFIX all carry the exporter’s wall-clock export time in the packet header, so you can compute latency directly by comparing that timestamp against the collector’s receive time. sFlow datagrams carry only the agent’s uptime, not a wall-clock timestamp, so for sFlow you must infer latency indirectly, for example by comparing collector traffic charts against wall clock time or SNMP-derived traffic patterns.

Sustained latency above 30 seconds means something in the pipeline is bottlenecked. Above 5 minutes, the data is too stale for real-time detection use cases such as DDoS visibility, anomaly detection, or operational troubleshooting. The causes fall into two categories: configuration-driven latency (active timeout too long, template refresh misaligned) and resource-driven latency (collector backlog, UDP buffer drops, parser saturation).
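
As a rough illustration of the timestamp comparison, the sketch below reads the export time from an IPFIX message header (the 16-byte layout in RFC 7011, section 3.1) and maps the result onto the 30-second and 5-minute thresholds. The function names are hypothetical, and a real check would look at sustained values rather than a single datagram.

```python
import struct

# IPFIX message header (RFC 7011, section 3.1), all fields big-endian:
# version, length, export time (seconds since the Unix epoch),
# sequence number, observation domain ID.
IPFIX_HEADER = struct.Struct("!HHIII")


def export_to_ingest_latency(datagram: bytes, received_at: float) -> float:
    """Seconds between the exporter's export time and collector receipt."""
    version, _length, export_time, _seq, _domain = IPFIX_HEADER.unpack_from(datagram)
    if version != 10:
        raise ValueError(f"not an IPFIX message (version {version})")
    return received_at - export_time


def classify_latency(latency_s: float) -> str:
    """Map one latency sample onto the thresholds used in this article."""
    if latency_s < 0:
        return "clock-skew"    # negative latency is impossible; suspect exporter clock
    if latency_s > 300:
        return "stale"         # above 5 minutes
    if latency_s > 30:
        return "bottlenecked"  # above 30 seconds
    return "ok"
```

Note that the header timestamp marks when the datagram was built, so this measures transport and collector delay only. Time a flow spent waiting in the exporter’s cache for the active timeout is not included.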

flowchart LR
    A["Device flow cache<br/>active timeout: 30 min default"] --> B["Export buffer"]
    B -->|"UDP datagram"| C["Kernel socket buffer<br/>rmem_max varies by distro"]
    C -->|"drain"| D["Parser / aggregator"]
    D --> E["TSDB write queue"]
    E --> F["Stored and queryable"]
    B -.->|"overflow = silent drops"| G["cnfESPktsDropped"]
    C -.->|"overflow = silent drops"| H["Udp_RcvbufErrors"]
    D -.->|"backlog = growing latency"| I["Queue depth rising"]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Active timeout too long (30 min default) | Long-lived flows arrive in batches every 30 minutes; short flows arrive fine | Check the exporter’s active timeout configuration |
| Collector UDP buffer overflow | Traffic charts show declining volume while SNMP interface counters show traffic is normal or rising | Run grep '^Udp:' /proc/net/snmp and read the RcvbufErrors column |
| Collector parser or write backlog | Flow latency grows steadily; queue depth metric rising | Check collector intake and write queue depth |
| Device-side export buffer drops | Device exports fewer records than expected; gap between device and collector counts | Cisco: snmpget .1.3.6.1.4.1.9.9.387.1.4.6.0 (cnfESPktsDropped) |
| Template desync (NetFlow v9 / IPFIX) | Datagrams arriving but decoded records are zero or anomalously low | Check collector logs for “template not found” or cache miss |
| Clock skew on exporter | Latency values are impossible (negative or wildly inconsistent) | Check NTP offset on the exporter |

Quick checks

# Kernel UDP socket buffer drops on the collector
cat /proc/net/snmp | grep '^Udp:'
# RcvbufErrors column: any nonzero value is silent data loss

# Current socket buffer fill for the flow listener
ss -lunm 'sport = :2055'
# NetFlow v5/v9 commonly uses port 2055; IPFIX uses 4739; sFlow uses 6343

# Current rmem_max setting
sysctl net.core.rmem_max

# Inspect flow record timestamps for latency (nfdump)
nfdump -R /var/nfdump/ -o fmt:'%ts %te %fl' | head -50
# Compare flow end time against current wall clock

# Device-side export drops (Cisco IOS-XE)
snmpget -v2c -c <community> <device> .1.3.6.1.4.1.9.9.387.1.4.6.0
# cnfESPktsDropped: should be near zero relative to exported count
# Note: these Cisco OIDs vary by platform, IOS version, and MIB support; verify them against your specific device

# Device-side total exported (Cisco IOS-XE)
snmpget -v2c -c <community> <device> .1.3.6.1.4.1.9.9.387.1.4.4.0
# cnfESPktsExported: compare rate against collector inbound rate

# Collector CPU per core (look for single-core saturation from RSS)
mpstat -P ALL 1 5
# Watch %soft (softirq) and %sys columns

# Collector write queue depth (vendor-specific stats endpoint)
curl -s http://localhost:<stats-port>/metrics | grep -E 'write|queue'

# Exporter clock, for estimating its offset (if SNMP accessible)
snmpget -v2c -c <community> <device> .1.3.6.1.2.1.25.1.2.0
# Returns device-local time (hrSystemDate); compare against collector time
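
The RcvbufErrors check is only meaningful as a change over time, since the counter is cumulative since boot. A minimal sketch of that comparison, assuming the usual two-line layout of /proc/net/snmp (a header line of column names followed by a line of values); the function names are hypothetical:

```python
def parse_udp_counters(snmp_text: str) -> dict[str, int]:
    """Pair the 'Udp:' header line with the 'Udp:' value line."""
    rows = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Udp:")  # does not match 'UdpLite:'
    ]
    if len(rows) != 2:
        raise ValueError("expected one header line and one value line for Udp")
    names, values = rows
    return dict(zip(names, (int(v) for v in values)))


def rcvbuf_errors_delta(before: str, after: str) -> int:
    """Datagrams dropped on full socket buffers between two snapshots."""
    return (
        parse_udp_counters(after)["RcvbufErrors"]
        - parse_udp_counters(before)["RcvbufErrors"]
    )
```

In practice you would read /proc/net/snmp twice, a few seconds apart, and pass both snapshots in. Any positive result means records were dropped during that interval.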

How to diagnose it

  1. Determine whether the latency is configuration-driven or resource-driven. If only long-lived flows from an exporter are delayed, arriving in batches at 30-minute intervals while short flows arrive promptly, the active timeout is the cause. If latency varies and correlates with traffic volume, the bottleneck is in the collector pipeline.

  2. Check the active timeout on the exporter. On Cisco IOS-XE, inspect the flow monitor configuration. The default is typically 30 minutes. If you need near-real-time visibility, set it to 60 seconds. The tradeoff is more export packets and higher collector load. Juniper and other vendors have equivalent settings under flow sampling or monitoring configuration.

  3. Check for UDP socket buffer drops. Run cat /proc/net/snmp | grep '^Udp:' and look at the RcvbufErrors column. Any nonzero increment means the kernel received datagrams but could not deliver them to the application because the socket buffer was full. This is silent data loss, not just latency. The dropped records never arrive.

  4. Compare device-side export counts against collector inbound counts. Poll cnfESPktsExported on the device and compare the rate against what the collector reports receiving. A gap means records are being lost in transit through the UDP buffer, NIC ring, or network path.

  5. Check collector parser and write queue depth. If the collector exposes a queue depth metric, watch it over time. A growing queue means the parser or storage layer cannot keep up with the incoming rate. The backlog adds latency to every record in the queue.

  6. Verify template state if using NetFlow v9 or IPFIX. After a device reboot or configuration change, templates must be re-received before data records can be decoded. Check collector logs for template-related messages. The blind window can be 5 to 30 minutes depending on the template refresh interval.

  7. Check for clock skew. If the exporter’s clock is off, the computed latency will be wrong. Verify NTP synchronization on the exporter. Even 200ms of drift can break cross-device correlation in postmortem analysis.
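
The export-versus-inbound comparison in step 4 comes down to two counter deltas over the same interval. The sketch below assumes cnfESPktsExported is polled as a 32-bit SNMP counter that can wrap between polls, and that the collector reports its own received-packet count; both function names are hypothetical.

```python
COUNTER32_MODULUS = 2**32


def counter_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two polls, tolerating a single counter wrap."""
    return (current - previous) % modulus


def loss_fraction(exported_delta: int, received_delta: int) -> float:
    """Share of exported packets that never reached the collector."""
    if exported_delta <= 0:
        return 0.0
    # Clamp at zero: polling jitter can make the collector appear slightly ahead.
    return max(0.0, (exported_delta - received_delta) / exported_delta)
```

For example, an exporter delta of 1000 packets against a collector delta of 990 gives a loss fraction of 0.01, or 1%. The two polls should cover the same time window as closely as possible, otherwise the ratio is noise.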

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Flow export-to-ingest latency | Direct measure of pipeline delay | Sustained above 30 sec; above 5 min means data is stale |
| UDP socket buffer drops (Udp_RcvbufErrors) | Silent data loss at the kernel layer | Any nonzero increment |
| Flow packets received rate | Confirms exporter is sending and collector is receiving | Sudden drop from one exporter means exporter or path issue |
| Flow exporter drops (cnfESPktsDropped) | Device-side overflow before records leave the device | Rising rate during traffic bursts |
| Collector inbound vs device exported rate | End-to-end loss detection | Gap above 0.1% of exported rate |
| Collector CPU per core | Single-core saturation from RSS funneling | One core at 100% while others are idle |
| TSDB write queue depth | Storage-layer backlog | Growing queue without bound |
| NTP offset on exporter | Corrupts latency computation and cross-device correlation | Offset above 100ms |
| Template cache state (v9 / IPFIX) | Records discarded if template is missing | Collector logs showing template cache misses |
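
The per-core CPU signal can be approximated without mpstat by diffing two snapshots of /proc/stat. This is a sketch under the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields); the function names are hypothetical.

```python
def parse_proc_stat(text: str) -> dict[str, list[int]]:
    """Return per-core tick counters, skipping the aggregate 'cpu' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cores[fields[0]] = [int(v) for v in fields[1:]]
    return cores


def softirq_share(before: str, after: str) -> dict[str, float]:
    """Percent of each core's elapsed ticks spent in softirq between snapshots."""
    start_by_core = parse_proc_stat(before)
    shares = {}
    for core, end in parse_proc_stat(after).items():
        start = start_by_core.get(core)
        if start is None:
            continue  # core appeared between snapshots
        deltas = [e - s for s, e in zip(start, end)]
        total = sum(deltas[:8])  # user..steal; guest time is already inside user
        shares[core] = 100.0 * deltas[6] / total if total else 0.0
    return shares
```

One core showing a high softirq share while the others sit near zero is the RSS funneling pattern described in the table.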

Fixes

Active timeout tuning

The active timeout controls how often long-lived flows are exported. The default of 30 minutes means a persistent flow is invisible until either it ends or the timer expires. For real-time use cases, set it to 60 seconds.

The tradeoff is increased export volume. Shorter timeouts generate more flow records and more UDP datagrams. A 60-second active timeout on a busy exporter can increase export packet rate by an order of magnitude compared to 30 minutes. Ensure your collector can handle the additional load before making this change globally.
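
To get a feel for the volume tradeoff, the sketch below counts how many records a single long-lived flow produces under a given active timeout. It is a deliberately simplified model (one record per expired timer, plus the final record at flow end) that ignores the inactive timeout and cache eviction; the function name is hypothetical.

```python
import math


def records_per_flow(flow_duration_s: float, active_timeout_s: float) -> int:
    """Records exported for one flow: one per active-timeout interval, at least one."""
    if active_timeout_s <= 0:
        raise ValueError("active timeout must be positive")
    return max(1, math.ceil(flow_duration_s / active_timeout_s))
```

Under this model a one-hour VPN tunnel yields 2 records with a 30-minute timeout and 60 records with a 60-second timeout. Flows shorter than the timeout still produce a single record, so the increase is concentrated in long-lived flows.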

Cisco ASA note: The flow-export active refresh-interval command was removed in ASA versions 8.5(1) through 9.1(1), causing persistent tunnels to export only at teardown. If you are running an affected version, upgrading to 9.1(2) or later restores the capability.

UDP buffer sizing

If Udp_RcvbufErrors is incrementing, the kernel socket buffer is overflowing. The fix has two parts:

  1. Raise the system maximum. Run sysctl -w net.core.rmem_max=16777216 to set it to 16 MB. For very high-volume collectors, use 32 MB (33554432). This is a runtime change; persist it in a file under /etc/sysctl.d/ so it survives a reboot.

  2. Ensure the collector application explicitly sets SO_RCVBUF on its listener socket. Without this, the socket uses rmem_default (typically around 212 KB on most Linux distributions), which is far too small for flow collection regardless of rmem_max.

Also check RSS (Receive Side Scaling) configuration. If all flow traffic is funneled to a single CPU core, that core becomes the bottleneck even though the system has spare capacity. Verify IRQ distribution with cat /proc/interrupts | grep eth0.
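
For collectors you write or can patch, the SO_RCVBUF step looks roughly like the following. This is a minimal sketch using Python’s standard socket module; the function name, default port, and default size are illustrative choices, not settings taken from any particular collector.

```python
import socket


def open_flow_listener(
    port: int = 2055,
    rcvbuf_bytes: int = 16 * 1024 * 1024,
    bind_addr: str = "0.0.0.0",
) -> tuple[socket.socket, int]:
    """Open a UDP listener, request a receive buffer, and report what was granted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Request the buffer size; the kernel silently caps it at net.core.rmem_max.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind((bind_addr, port))
    # Read the value back, because setsockopt does not fail when the request is capped.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective
```

On Linux the value read back is double the granted size, because the kernel reserves space for bookkeeping overhead. If the reported figure is well below twice what you requested, rmem_max is still too low and the request was capped.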

Collector parser backlog

If the write queue is growing, the parser or storage layer is the bottleneck. Options:

  • Increase parser thread count if the collector supports it.
  • Check whether the parser is doing expensive work per record (regex matching, enrichment lookups) that can be deferred or batched.
  • Verify disk IOPS are not saturated. Flow record storage is disk-intensive.
  • Separate flow storage from syslog storage on different volumes.

Template refresh alignment

For NetFlow v9 and IPFIX, templates must be received before data records can be decoded. After a device reboot or config change, the collector is blind until the next template arrives. Set the template refresh interval shorter than the active timeout. A misaligned configuration where template refresh is longer than active timeout causes template expiry before data records arrive, leaving the collector unable to decode flows.

RFC 7011 Section 8.4 requires that exporting processes using UDP periodically retransmit active templates. Ensure your exporter’s template refresh rate is configured and that the collector’s template cache lifetime is at least three times the retransmission interval.
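
The three-times rule can be enforced in the cache itself. The sketch below is a hypothetical in-memory template cache, not the design of any specific collector: it rejects a lifetime shorter than three refresh intervals and expires entries that have not been refreshed in time. The cache key (exporter address, observation domain ID, template ID) is an assumption about how templates are scoped.

```python
from typing import Any, Hashable, Optional


class TemplateCache:
    """Template store whose entries expire if the exporter stops refreshing them."""

    def __init__(self, refresh_interval_s: float, lifetime_s: Optional[float] = None):
        minimum = 3 * refresh_interval_s
        self.lifetime_s = minimum if lifetime_s is None else lifetime_s
        if self.lifetime_s < minimum:
            raise ValueError("lifetime must be at least 3x the refresh interval")
        self._entries: dict[Hashable, tuple[Any, float]] = {}

    def store(self, key: Hashable, template: Any, now: float) -> None:
        """Record a template, or refresh the timestamp of one already held."""
        self._entries[key] = (template, now)

    def lookup(self, key: Hashable, now: float) -> Optional[Any]:
        """Return the template, or None if it is unknown or has expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        template, stored_at = entry
        if now - stored_at > self.lifetime_s:
            del self._entries[key]
            return None
        return template
```

With a 600-second refresh interval the default lifetime is 1800 seconds, so the collector tolerates two lost template retransmissions before it stops decoding.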

Prevention

  • Set active timeout to 60 seconds on exporters where real-time visibility matters. Document the increased export load and size collectors accordingly.
  • Monitor Udp_RcvbufErrors continuously. This is the single most missed signal in flow collection. Any nonzero value represents silent data loss. Set rmem_max to 16 MB or higher and confirm the collector sets SO_RCVBUF.
  • Compare device-side export counts against collector inbound counts. This is the only reliable end-to-end loss detection method. A sustained gap means records are lost somewhere in the path.
  • Verify NTP on exporters. Clock skew corrupts latency computation and breaks cross-device correlation.
  • Test template cache persistence across collector restarts. Some collectors lose in-memory template state on restart, creating a 5 to 30 minute blind window.

How Netdata helps

  • UDP buffer drop detection. Netdata monitors Udp_RcvbufErrors from /proc/net/snmp with per-second resolution, catching socket buffer overflow before it becomes a data gap.
  • Per-core CPU breakdown. Netdata’s CPU collector shows softirq and system time per core, making RSS funneling visible without manual mpstat sessions.
  • Collector process metrics. Netdata monitors the collector’s own CPU, memory, and disk I/O, correlating collector-side bottlenecks with flow data latency.
  • NIC-level drop counters. Netdata tracks /proc/net/dev RX drops and ethtool counters such as rx_missed_errors, distinguishing NIC-level drops from socket-buffer drops.
  • Cross-signal correlation. When flow data latency spikes, Netdata’s unified timeline lets you correlate it with CPU saturation, disk I/O wait, and UDP buffer errors in a single view.