Cold-start topology: why your map is incomplete after a collector restart

You restart your flow collector or topology engine for a routine upgrade. The process comes back up cleanly. The dashboard loads. But the topology map is half-empty, endpoint positions are wrong, and within minutes someone pages you asking why a security investigation points to the wrong switch port.

The root cause is not a bug. After a restart, the topology inference engine has no cached neighbor tables, no FDB entries, no ARP data, and potentially no flow templates. It must rebuild all of these from live polling and flow data before it can produce a reliable view. The window between restart and first complete topology ranges from a few minutes to over 30 minutes, depending on poll cadence, template refresh intervals, and which sources your topology engine fuses.

The danger: topology engines often return answers during this warmup window without flagging them as incomplete. Confidence scores may be low, or absent from the UI altogether. Operators query endpoint positions, get results, and act on them. A security investigation traces a MAC address to a stale or partially constructed switch port, and the wrong team gets paged.

What happens during cold start

Cold-start topology is the state where a topology inference engine has been restarted and has not yet accumulated enough data from its input sources to produce a complete and reliable topology view.

The engine fuses multiple independent data sources to derive Layer-2 and Layer-3 topology. These include CDP/LLDP neighbor tables, FDB entries, ARP tables, STP state, routing tables, and flow records. Each source repopulates at its own cadence after a restart:

  • CDP/LLDP neighbor data repopulates when the next SNMP poll cycle reaches each device and walks the neighbor tables.
  • FDB and ARP entries are re-read on the next poll, but a device only holds entries for the MACs and IPs it has learned from traffic. A switch port with no active traffic will have no FDB entries regardless of what is physically connected.
  • Flow records for NetFlow v9/IPFIX cannot be decoded until the collector has received the matching template. Over UDP, exporters resend templates on a configurable interval, typically every 5 to 30 minutes. Until the first template arrives, data records from that exporter cannot be decoded and are typically discarded without an error.
  • Endpoint positioning (which switch port a given MAC or IP is connected to) is probabilistic, derived from the agreement of multiple sources. With partial input, confidence is low.

Until enough sources converge, the topology view is partial, confidence scores are low, and endpoint positioning queries may return stale, cached, or incorrect data.
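
To make "agreement of multiple sources" concrete, here is a minimal sketch in Python. It is not how any particular topology engine scores positions: the source names, the weights, and the `position_confidence` function are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical weights: how much each source counts when it votes for a port.
SOURCE_WEIGHTS = {"fdb": 5, "lldp": 3, "flow": 2}

def position_confidence(votes):
    """Return (best_port, confidence) from per-source port votes.

    `votes` maps a source name to the switch port it points at, or None
    if that source has not repopulated yet. Confidence is the weight
    behind the winning port divided by the total weight of all sources,
    so sources that are still empty pull confidence down.
    """
    totals = defaultdict(float)
    for source, port in votes.items():
        if port is not None:
            totals[port] += SOURCE_WEIGHTS.get(source, 0)
    if not totals:
        return None, 0.0
    best_port = max(totals, key=totals.get)
    return best_port, totals[best_port] / sum(SOURCE_WEIGHTS.values())

# Shortly after restart: only flow data has reported a candidate port.
cold = position_confidence({"fdb": None, "lldp": None, "flow": "sw1:Gi1/0/7"})
# After warmup: every source agrees on the same port.
warm = position_confidence({"fdb": "sw1:Gi1/0/7", "lldp": "sw1:Gi1/0/7",
                            "flow": "sw1:Gi1/0/7"})
```

In this toy model the cold-start query still returns a port, just with low confidence, which is exactly the kind of answer that is easy to act on by mistake.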

flowchart TD
    A[Collector or topology engine restart] --> B[Template cache wiped]
    A --> C[Neighbor table cache cleared]
    A --> D[FDB and ARP cache cleared]
    B --> E[Waiting for template refresh from exporter]
    E -->|5 to 30 min typical| F[Templates received]
    F --> G[Flow records decodable]
    C --> H[First poll cycle completes]
    D --> I[Devices repopulate FDB and ARP as traffic flows]
    H --> J[CDP and LLDP neighbors mapped]
    I --> K[Endpoint positions inferable]
    G --> L[Flow-derived topology available]
    J --> M{Multiple sources converging?}
    K --> M
    L --> M
    M -->|poll cycle x 3 typical| N[Full topology with high confidence]

Common causes

Cause | What it looks like | First thing to check
Template cache eviction after collector restart | Flow datagrams arriving but zero records decoded; collector logs show template not found or cache miss | Collector logs for template-related messages
Topology engine restart with no persistent state | All confidence scores at zero; endpoint queries return unknown or stale cached results | Topology confidence score endpoint or dashboard
Slow poll cycle on a large device estate | Topology slowly fills in over many minutes; some devices still missing after the first cycle | Poll cycle duration vs configured poll interval
FDB/ARP not yet repopulated on devices | Endpoints show as orphaned or positioned at the wrong port | FDB entry count and freshness on key switches
UDP template packet loss during cold start | Templates expected but not received; gap persists beyond nominal refresh interval | tcpdump on collector NIC for template datagrams

Quick checks

All commands are read-only and safe during an active investigation.

# Check if flow records are being decoded vs just received
curl -s http://localhost:<stats-port>/metrics | grep -E 'flow.*received|flow.*decoded'

# Look for template cache miss messages in collector logs
grep -i 'template' /var/log/<collector>.log | tail -20

# Check topology inference confidence score
curl -s http://localhost:<port>/api/topology/confidence | jq

# Check poll cycle duration vs configured interval
curl -s http://localhost:<port>/metrics | grep -E 'poll.*cycle|poll.*duration'

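# Verify FDB is repopulating on a key switch (Q-BRIDGE-MIB dot1qTpFdbPort, VLAN-aware)
# Note: dot1qTpFdbAddress (...2.2.1.1) is an index column and cannot be walked directly
snmpwalk -v2c -c <community> <switch> .1.3.6.1.2.1.17.7.1.2.2.1.2 | wc -l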

# Confirm UDP datagrams are arriving at the collector (NetFlow on 2055, IPFIX on 4739)
tcpdump -i <interface> -nn -c 100 'udp port 2055 or udp port 4739'

# Check UDP socket buffer drops (data may be arriving but being dropped)
# Run twice and compare the RcvbufErrors column to see whether it is incrementing
grep '^Udp:' /proc/net/snmp

# Verify NTP sync on the collector (clock skew widens the template gap)
chronyc tracking 2>/dev/null || ntpq -p
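
If you want to script the UDP drop and decode-ratio checks rather than read them by eye, the sketch below shows one way to do it in Python. It assumes the standard Linux `/proc/net/snmp` layout (a header line of field names followed by a line of values); the function names and the sample numbers are illustrative, not taken from any collector.

```python
def parse_udp_counters(snmp_text):
    """Parse the two 'Udp:' lines of /proc/net/snmp into a dict of ints.

    The first 'Udp:' line holds the field names, the second the values.
    'UdpLite:' lines are ignored because they do not start with 'Udp:'.
    """
    udp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_lines[0], udp_lines[1]
    return dict(zip(names, map(int, values)))

def decode_ratio(received, decoded):
    """Fraction of received flow records that were decoded (None if none received)."""
    return decoded / received if received else None

# Sample input in the /proc/net/snmp format; the numbers are made up.
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 120934 12 431 88211 431 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""
counters = parse_udp_counters(SAMPLE)
```

On a live collector you would pass `open("/proc/net/snmp").read()` instead of `SAMPLE`, take two readings a few seconds apart, and compare `RcvbufErrors` between them.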

How to diagnose it

  1. Confirm the restart happened and note the timestamp. Check collector process start time or sysUpTime. The warmup window starts from the restart, not from when you first noticed the problem.

  2. Determine whether the gap is template-related, topology-engine-related, or both. If flow records are being received but the decoded count is zero, the template cache is the bottleneck. If flow decoding is working but the topology is still sparse, the topology engine is still warming up from polling data.

  3. Check which input sources have repopulated. Run CDP/LLDP walks, FDB walks, and ARP walks on key devices. Compare entry counts to your known baseline. If a switch that normally holds 500 MACs shows only 50, its FDB has not repopulated.

  4. Query the topology confidence score. If your engine exposes it, check whether average confidence is still near zero or climbing. A confidence score below baseline after a restart means the topology is not yet trustworthy.

  5. Calculate the expected warmup window. Allow poll cycle duration multiplied by 3 for a first complete topology view. Add the template refresh interval if your topology engine depends on flow data for endpoint positioning. Platforms with a 60-second template refresh produce a short gap; platforms with a 30-minute refresh produce a substantial one.

  6. Verify that templates are actually arriving. Use tcpdump to confirm template datagrams are reaching the collector. If the exporter’s template refresh interval has passed and no template has arrived, suspect UDP packet loss or a network path issue between the exporter and collector.
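
Steps 2 and 5 above reduce to a small amount of arithmetic, sketched here in Python. The function names and return labels are hypothetical, and the warmup estimate simply encodes the "poll cycle x 3, plus template refresh" rule of thumb from this article rather than a guarantee from any product.

```python
def classify_gap(received, decoded, topology_complete):
    """Step 2: decide where the cold-start gap is coming from."""
    template_gap = received > 0 and decoded == 0
    if template_gap and not topology_complete:
        return "both"
    if template_gap:
        return "template"
    if not topology_complete:
        return "topology-engine"
    return "none"

def expected_warmup_seconds(poll_cycle_s, template_refresh_s=0, cycles=3):
    """Step 5: poll cycle duration x 3, plus the template refresh interval.

    Pass template_refresh_s=0 when endpoint positioning does not depend
    on flow data.
    """
    return poll_cycle_s * cycles + template_refresh_s

# A 5-minute poll cycle with a 30-minute template refresh: 45 minutes.
worst_case = expected_warmup_seconds(300, template_refresh_s=1800)
```

With a 60-second template refresh the same estimate drops to 16 minutes, which is why shortening the refresh interval before planned maintenance pays off.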

Metrics and signals to monitor

Signal | Why it matters | Warning sign
Topology inference confidence score | Tells you whether endpoint positions are reliable enough to act on | Average confidence below baseline or near zero after restart
Flow records decoded vs received ratio | Template cache miss means data records are silently discarded | Received greater than zero but decoded equals zero
FDB/MAC table freshness | Stale or empty FDB means endpoint positioning will be wrong or missing | Entry count well below baseline; entries older than 3 to 4 times the refresh interval
Poll cycle duration vs configured interval | Slow cycles delay the topology rebuild proportionally | Cycle duration approaching or exceeding the configured poll interval
ARP cache entry count and staleness | Stale ARP means wrong IP-to-MAC mappings for endpoint positioning | Entry count well below baseline after restart
Template cache hit/miss ratio | Misses indicate the collector cannot decode incoming flow records | Miss ratio climbing after restart without recovery
UDP socket buffer drops (Udp_RcvbufErrors) | Template packets may be dropped before the collector processes them | Nonzero and incrementing during the cold-start window

Fixes

Template cache gaps

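The most impactful fix is shortening the template refresh interval on exporters before planned maintenance. On Cisco IOS with traditional NetFlow v9 export, you can temporarily force frequent template resends from global configuration mode. The value is a count of export packets between template resends, not a time. Flexible NetFlow sets its template timer in the flow exporter configuration instead.

! Resend templates after every export packet - restore the normal rate afterwards
ip flow-export template refresh-rate 1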

This is safe and non-disruptive to data forwarding. Restore the normal refresh rate after the collector confirms template receipt.

For collectors that support it, configure template cache persistence across restarts. nfdump and similar tools require explicit cache persistence configuration.

If your collector does not persist templates, the gap is unavoidable on restart. Plan restarts outside of security-critical windows.
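
To show what template cache persistence amounts to, here is a toy sketch in Python. It does not reflect any real collector's implementation or file format: the `TemplateCache` class, the JSON file, and the field names are assumptions, and a real cache would also key on the exporter's source ID or observation domain.

```python
import json
import os
import tempfile

class TemplateCache:
    """Toy NetFlow v9/IPFIX template cache keyed by exporter and template ID."""

    def __init__(self, path):
        self.path = path
        self.templates = {}
        if os.path.exists(path):
            # Reload templates saved before the restart.
            with open(path) as fh:
                self.templates = json.load(fh)

    @staticmethod
    def _key(exporter, template_id):
        # JSON object keys must be strings, so combine both parts into one.
        return f"{exporter}|{template_id}"

    def learn(self, exporter, template_id, fields):
        """Store a template and write the whole cache to disk."""
        self.templates[self._key(exporter, template_id)] = fields
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.templates, fh)
        os.replace(tmp, self.path)  # swap in the new file in one step

    def lookup(self, exporter, template_id):
        """Return the field list, or None on a cache miss (record undecodable)."""
        return self.templates.get(self._key(exporter, template_id))

# Simulate a restart: a second instance reads what the first one saved.
path = os.path.join(tempfile.mkdtemp(), "templates.json")
TemplateCache(path).learn("10.0.0.1", 256,
                          ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "IN_BYTES"])
restarted = TemplateCache(path)
```

The trade-off is staleness: if an exporter changed its template while the collector was down, a persisted entry will decode records incorrectly until the next refresh replaces it.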

Topology engine warmup

The primary fix is operational discipline, not configuration:

  • Wait for poll cycle x 3 before trusting topology queries. This is the standard guidance for first complete topology view after a collector restart.
  • Check confidence scores before acting on endpoint positioning results. If confidence is low, the answer is not trustworthy.
  • Suppress automated actions that depend on topology during the warmup window. If your incident response automation pages a team based on endpoint positioning, add a confidence check or a post-restart cooldown period.
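
A confidence check combined with a post-restart cooldown can be as small as the Python sketch below. The threshold of 0.8, the function names, and the action labels are assumptions for illustration; tune them to whatever your engine and automation actually expose.

```python
def topology_answer_trusted(seconds_since_restart, poll_cycle_s, confidence,
                            min_confidence=0.8, cycles=3):
    """True only when the warmup window has passed AND confidence is adequate.

    `confidence` may be None when the engine does not report a score;
    that is treated as untrusted.
    """
    warmed_up = seconds_since_restart >= poll_cycle_s * cycles
    return warmed_up and confidence is not None and confidence >= min_confidence

def paging_decision(seconds_since_restart, poll_cycle_s, confidence):
    """Page the owning team automatically, or hold the result for a human."""
    if topology_answer_trusted(seconds_since_restart, poll_cycle_s, confidence):
        return "page-owner"
    return "hold-for-review"
```

For example, with a 300-second poll cycle, a query made 120 seconds after restart is held for review even if it reports 0.95 confidence, because fewer than three poll cycles have completed.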

FDB and ARP repopulation delays

FDB and ARP entries only repopulate as traffic flows. On quiet switch ports, the FDB may remain empty for extended periods. There is no safe way to force population without generating traffic.

If your topology engine depends on FDB freshness for endpoint positioning, ensure your poll cadence is fast enough relative to the FDB aging timeout on your switches. If the aging timeout is 4 hours and your poll cycle is 30 minutes, entries are polled often enough under normal operation. Many switches default to a MAC aging time of 300 seconds, however, and with that setting a 30-minute poll cycle can miss endpoints that went quiet between polls. In either case, after a restart the first complete view still requires waiting for the poll to complete and for devices to have learned MACs.
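
The relationship between aging timeout and poll cadence can be checked with a few lines of Python. This is a rough heuristic, not a vendor rule: the `fdb_poll_risk` name and the "two polls per aging period" threshold are assumptions.

```python
def fdb_poll_risk(aging_timeout_s, poll_interval_s):
    """Classify how likely a poll is to miss an FDB entry that aged out.

    If the poll interval is longer than the aging timeout, a MAC can be
    learned and aged out entirely between two polls.
    """
    if poll_interval_s > aging_timeout_s:
        return "high"      # entries can appear and vanish between polls
    if poll_interval_s * 2 > aging_timeout_s:
        return "moderate"  # one slow or failed poll is enough to miss entries
    return "low"

# The article's example: 4-hour aging, 30-minute poll cycle.
example = fdb_poll_risk(4 * 3600, 30 * 60)
```

Running the same check with your switches' actual aging timeout shows whether the poll interval needs to come down before FDB data can be relied on for positioning.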

Prevention

  • Mark cold-start state explicitly. If your topology engine does not flag incomplete views, add a wrapper or dashboard annotation that shows time since last restart and expected warmup completion time.
  • Monitor confidence scores as a first-class signal. Alert when average confidence drops below baseline for sustained periods, not just after restarts.
  • Use NTP synchronization on collectors and exporters. Clock skew can cause template refresh timestamp comparisons to reject otherwise valid templates, extending the effective blind spot beyond the nominal refresh interval.
  • Plan restarts during low-risk windows. The cold-start gap is most dangerous when it coincides with a security event that requires flow forensics.
  • Consider NetFlow v5 for fixed-format export where template gaps are unacceptable. NetFlow v5 has a fixed record format and does not require template exchange. However, v5 is deprecated on many modern platforms in favor of v9/IPFIX.
  • Shorten template refresh intervals on exporters. A 60-second refresh produces a shorter gap than a 30-minute refresh. Balance against the increased management-plane traffic from more frequent template sends.
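
As a starting point for marking cold-start state explicitly, the Python sketch below builds a banner that a wrapper or dashboard annotation could display. The function name, the banner wording, and the use of "poll cycle x 3" as the completion estimate are assumptions; timestamps are assumed to be timezone-aware UTC.

```python
from datetime import datetime, timedelta, timezone

def cold_start_banner(restart_time, poll_cycle_s, now=None, cycles=3):
    """Return a warning banner during warmup, or None once it has passed."""
    now = now or datetime.now(timezone.utc)
    ready_at = restart_time + timedelta(seconds=poll_cycle_s * cycles)
    if now >= ready_at:
        return None
    elapsed = int((now - restart_time).total_seconds())
    return (f"COLD START: restarted {elapsed}s ago, "
            f"topology expected complete at {ready_at:%H:%M:%S} UTC")

# Example with fixed timestamps: 4 minutes after a restart, 5-minute poll cycle.
restart = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
banner = cold_start_banner(restart, 300, now=restart + timedelta(seconds=240))
```

Attaching this banner to every endpoint positioning result during warmup gives operators the context that the topology engine itself may not provide.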

How Netdata helps

  • Monitors the collector’s flow packet receive rate alongside the decoded record rate, making template cache gaps visible as a divergence between received and decoded counters.
  • Collects Udp_RcvbufErrors from /proc/net/snmp by default on Linux, catching template packets dropped at the kernel socket buffer before the application sees them.
  • Per-core CPU metrics help verify the collector is not CPU-starved during the ingestion burst that follows restart, when all exporters resume sending simultaneously.
  • Collector disk space and I/O metrics catch the TSDB write spike that accompanies cold-start ingestion.
  • SNMP poll latency and timeout metrics help verify the poller is completing cycles fast enough for the topology engine to rebuild within the expected window.
  • Cross-metric correlation in Netdata dashboards lets you align collector restart timestamps with confidence score drops, flow decode gaps, and FDB repopulation curves in a single view.