Correlating cloud VPC flow logs with on-prem NetFlow

Cloud flow logs and on-prem flow records share the 5-tuple concept but diverge in nearly every dimension that matters for correlation: transport, latency, sampling, timestamps, topology, and NAT visibility. Cloud providers emit VPC flow logs via push to object storage with implicit sampling and aggregation intervals measured in minutes. On-premises devices export NetFlow v5/v9, IPFIX, or sFlow over UDP with configurable sampling and near-real-time delivery.

This gap is a recurring contributor to operational incidents. An attacker pivoting from a compromised cloud workload to on-prem via VPN is invisible across the boundary if no join exists between the two telemetry sources. The same gap hides legitimate operational issues: cross-boundary packet loss, asymmetric routing through cloud transit gateways, and NAT translation mismatches between cloud NAT and on-prem firewalls.

This reference covers the format incompatibilities, sampling semantics, timestamp behavior, and NAT opacity that make cross-environment flow correlation operationally difficult. It is aimed at operators building or maintaining a normalization layer across these two domains.

Why cloud and on-prem flow data resist correlation

Cloud flow logs are not flow records in the NetFlow/IPFIX sense. They are aggregated, sampled, and delivered asynchronously. There is no SNMP, no CDP, no LLDP. The flow record is the only native signal. Topology is the cloud provider’s graph. Delivery is push-based to object storage. Lag is minutes, not seconds. Sampling is implicit.

| Dimension | On-prem NetFlow/IPFIX | Cloud flow logs |
|---|---|---|
| Transport | UDP push to collector | Push to object storage, polled by consumer |
| Latency | Near real-time (seconds) | Minutes (aggregation plus delivery) |
| Sampling | Explicit, configurable per exporter | Implicit, provider-controlled |
| Timestamps | Exporter clock (NTP-dependent) | Provider-aggregated start and end epochs |
| Topology | CDP/LLDP/FDB/ARP available | Cloud graph only |
| NAT visibility | Post-NAT at perimeter | Pre-NAT inside VPC (if pkt-* fields enabled) |
| Template or cache | NetFlow v9/IPFIX template cache | No template concept |

flowchart LR
    CL["Cloud flow logs<br/>AWS / GCP / Azure<br/>push to object storage<br/>minutes of lag"] -->|"timestamp skew<br/>up to 60s (AWS)"| NORM["Normalization layer<br/>windowed 5-tuple join<br/>NAT translation logs<br/>direction inference"]
    OP["On-prem NetFlow / IPFIX<br/>UDP export<br/>near real-time"] -->|"NTP-dependent<br/>sampling-aware"| NORM
    NORM --> CORR["Cross-boundary correlation<br/>same conversation<br/>different vantage points"]

Provider reference: cloud flow log semantics

Each cloud provider’s flow log format has distinct semantics that affect how records can be joined with on-prem data.

AWS VPC Flow Logs

AWS VPC Flow Logs aggregate captured packets into intervals. The default aggregation interval is 10 minutes, reducible to 1 minute. Format versions 2 through 11 exist, each adding fields without removing prior ones.

Core 5-tuple fields: srcaddr, dstaddr, srcport, dstport, protocol (IANA number). Additional fields include packets, bytes, start and end (Unix epoch seconds), action (ACCEPT or REJECT), and log-status.

For correlation through NAT gateways or EKS pods, pkt-srcaddr and pkt-dstaddr are essential. Without them, srcaddr and dstaddr reflect the translated IP, not the original. EKS pods have separate pod IPs from node ENI IPs. pkt-srcaddr exposes the pod IP while srcaddr shows the node ENI IP.

The log-status field distinguishes data gaps:

  • SKIPDATA: records were dropped internally by AWS due to capacity constraints. One SKIPDATA record can represent multiple uncaptured flows.
  • NODATA: no traffic on that ENI during the interval. Not a gap, but operators frequently confuse it with SKIPDATA.
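
As a rough illustration of how a consumer might separate the two gap types, the sketch below parses one record in the default (version 2) space-delimited format and labels it. The function names and the "data_loss" / "idle" / "ok" labels are hypothetical, and custom formats with fields such as pkt-srcaddr would need their own field list.

```python
# Field order of the AWS default (version 2) flow log format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"srcport", "dstport", "protocol", "packets", "bytes", "start", "end"}


def parse_default_record(line: str) -> dict:
    """Split one space-delimited record; '-' placeholders become None."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    record = {}
    for name, raw in zip(DEFAULT_FIELDS, values):
        if raw == "-":
            record[name] = None          # NODATA / SKIPDATA records carry no 5-tuple
        elif name in INT_FIELDS:
            record[name] = int(raw)
        else:
            record[name] = raw
    return record


def classify_gap(record: dict) -> str:
    """Hypothetical labels: only SKIPDATA is treated as cloud-side data loss."""
    status = record["log_status"]
    if status == "SKIPDATA":
        return "data_loss"
    if status == "NODATA":
        return "idle"
    return "ok"
```

Keeping the two labels apart at parse time means later correlation stages never have to guess whether an empty interval was quiet or lossy.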

The tcp-flags field is a bitmask OR’d across the entire aggregation interval: FIN=1, SYN=2, RST=4, SYN-ACK=18. A short-lived connection that opens and closes within a single aggregation interval may appear as a single record with combined flags (SYN+FIN = 3, SYN-ACK+FIN = 19). On-prem NetFlow v5 also records a single OR’d flags value, but that value is the union of all flags seen during the flow’s lifetime rather than during a provider-chosen interval. In both cases the order of the flags is lost, so TCP state machine reconstruction is unreliable, and more so on the cloud side. For packet-level TCP handshake analysis, neither cloud nor NetFlow records are a substitute for a packet capture.
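
A minimal sketch of decoding the aggregated bitmask follows. The bit positions are the standard TCP header flag bits; whether a provider reports bits beyond the FIN, SYN, RST, and SYN-ACK values listed above is an assumption, and the helper names are hypothetical.

```python
from typing import List

# Standard TCP header flag bits, lowest bit first.
TCP_FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def decode_tcp_flags(value: int) -> List[str]:
    """Return the names of all flags set in an OR'd tcp-flags value."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]


def opened_and_closed(value: int) -> bool:
    """True when both SYN and FIN were seen in the same aggregated record."""
    both = TCP_FLAG_BITS["SYN"] | TCP_FLAG_BITS["FIN"]
    return value & both == both


print(decode_tcp_flags(18))   # ['SYN', 'ACK']
print(decode_tcp_flags(19))   # ['FIN', 'SYN', 'ACK']
print(opened_and_closed(3))   # True
```

The decoded list says which flags occurred, not in what order, which is why it cannot stand in for a handshake trace.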

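The flow-direction field (added in version 5) records whether a flow is ingress or egress relative to the capturing network interface. This removes interface-level direction ambiguity for AWS-side flows, although it does not by itself identify which endpoint initiated the conversation.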

GCP VPC Flow Logs

GCP uses dual-stage sampling. The primary sampling rate is opaque and dynamic, varying with host load. The secondary rate is configurable from 0.0 to 1.0. Primary sampling is uncontrollable by the operator.

Aggregation intervals are configurable: 5 seconds, 30 seconds, 1 minute, 5 minutes, 10 minutes, or 15 minutes.

GCP flow logs do not indicate which endpoint initiated a flow. They identify packet direction relative to the interface only. This complicates correlation with on-prem NetFlow, which also lacks initiator context natively.

A critical asymmetry exists in firewall interaction:

  • Egress packets are sampled before egress firewall rules are evaluated. Denied packets can still appear in logs.
  • Ingress packets are sampled after ingress firewall rules are evaluated. Dropped packets are not logged.

This means GCP flow logs can include egress traffic that was ultimately denied and omit ingress traffic that was blocked. Compared with on-prem firewalls that log both directions, they overstate allowed egress and understate blocked ingress.

Azure NSG Flow Logs and VNet Flow Logs

Azure NSG Flow Logs version 2 introduces flow state tracking with B (Begin), C (Continue), and E (End) states, plus bidirectional byte and packet counters.
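
The sketch below shows one way to split a version 2 flow tuple into named fields. The 13-field comma-separated order (timestamp, addresses, ports, protocol, direction, decision, state, then four counters) reflects the published tuple layout as understood here and should be checked against current Azure documentation; the function and key names are hypothetical.

```python
from typing import Optional

FLOW_STATES = {"B": "begin", "C": "continue", "E": "end"}


def _to_int(raw: str) -> Optional[int]:
    """Counters are empty strings on Begin records; map those to None."""
    return int(raw) if raw else None


def parse_nsg_v2_tuple(flow_tuple: str) -> dict:
    parts = flow_tuple.split(",")
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields in a v2 tuple, got {len(parts)}")
    (ts, src, dst, sport, dport, proto, direction, decision, state,
     pkts_fwd, bytes_fwd, pkts_rev, bytes_rev) = parts
    return {
        "timestamp": int(ts),            # Unix epoch seconds
        "src": src,
        "dst": dst,
        "sport": int(sport),
        "dport": int(dport),
        "protocol": proto,               # T = TCP, U = UDP
        "direction": direction,          # I = inbound, O = outbound
        "decision": decision,            # A = allowed, D = denied
        "state": FLOW_STATES.get(state, state),
        "packets_src_to_dst": _to_int(pkts_fwd),
        "bytes_src_to_dst": _to_int(bytes_fwd),
        "packets_dst_to_src": _to_int(pkts_rev),
        "bytes_dst_to_src": _to_int(bytes_rev),
    }
```

Mapping empty counters to None rather than 0 keeps "not recorded" distinguishable from "zero bytes" in later volumetric comparisons.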

Byte and packet counts are not recorded for flows affected by non-default (user-defined) inbound rules, and those flows are also logged as non-terminating. Reported totals will be lower than actual traffic for those flows. This affects chargeback correlation and volumetric validation.

Azure NSG Flow Logs are officially retiring on 30 September 2027. The successor is VNet Flow Logs, which operates at the virtual network level rather than per-NSG and captures platform-rule traffic that NSG flow logs miss. Teams staying on NSG flow logs should use version 2 only. Version 1 lacks byte and packet counters entirely.

Timestamp skew: the biggest correlation killer

Timestamp alignment is the single hardest problem in cross-environment flow correlation. Each source introduces skew differently.

AWS VPC Flow Logs: the start and end timestamps can be up to 60 seconds off from actual packet receipt or transmission. AWS documentation describes each value as possibly being up to 60 seconds after the packet was transmitted or received on the network interface.

GCP VPC Flow Logs: the primary sampling stage adds interpolation layers. Missed packets are compensated by interpolation from captured packets, which introduces additional timestamp uncertainty.

On-prem NetFlow/IPFIX: timestamps depend on exporter clock accuracy. Juniper has documented IPFIX timestamp inaccuracy on certain platforms where timestamps diverge from system time despite apparent NTP sync.

NetFlow v5: flow records do not carry absolute timestamps. Flow timing is derived from system uptime offsets (the FIRST and LAST fields relative to export uptime), which means accuracy depends on the exporter’s clock and uptime counter. NetFlow v9, IPFIX, and sFlow include timestamps that can be used directly to compute export-to-ingest latency.
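
For NetFlow v5, the absolute time of a flow can be recovered from the export header, which carries both the exporter’s uptime and its wall clock at export time. A minimal sketch, assuming the header values (SysUptime in milliseconds, unix_secs, unix_nsecs) and the record’s First/Last uptime values have already been decoded; the function name is hypothetical.

```python
UPTIME_MODULUS = 2 ** 32  # SysUptime, First and Last are 32-bit millisecond counters


def v5_absolute_ms(unix_secs: int, unix_nsecs: int,
                   sys_uptime_ms: int, flow_uptime_ms: int) -> int:
    """Convert a v5 First/Last uptime value to Unix epoch milliseconds.

    absolute = export wall clock - (uptime at export - uptime at flow event)
    """
    export_ms = unix_secs * 1000 + unix_nsecs // 1_000_000
    # Modular subtraction tolerates a counter wrap between the event and the export.
    age_ms = (sys_uptime_ms - flow_uptime_ms) % UPTIME_MODULUS
    return export_ms - age_ms


# Flow started 60 s and ended 5 s before the export packet was built.
start = v5_absolute_ms(1_700_000_000, 500_000_000, 3_600_000, 3_540_000)
end = v5_absolute_ms(1_700_000_000, 500_000_000, 3_600_000, 3_595_000)
print(start, end)  # 1699999940500 1699999995500
```

The result is only as accurate as the exporter’s wall clock, so an exporter with poor NTP sync shifts every derived timestamp by the same offset.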

The operational rule: always use a windowed join, never an exact timestamp match. A 2-minute tolerance is conservative for most paths. Five minutes is safer for high-latency cross-VPN paths where aggregation and delivery lag compound. NTP drift between collectors is a separate problem from flow-log aggregation lag. On its own, a sub-second NTP offset between a cloud flow log epoch and an on-prem NetFlow exporter clock is negligible against a join window measured in minutes. Larger drift adds to aggregation lag, though, and the two together can push matching records outside the window.
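
A windowed join can be sketched as an interval-overlap test with a tolerance, applied per 5-tuple. The Flow class and function names below are hypothetical, timestamps are assumed to be UTC epoch seconds, and both sides are assumed to already describe the flow with the same addresses and direction (no NAT rewrite between vantage points).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    sport: int
    dport: int
    proto: int
    start: int  # UTC epoch seconds
    end: int    # UTC epoch seconds


def five_tuple(flow: Flow) -> tuple:
    return (flow.src, flow.dst, flow.sport, flow.dport, flow.proto)


def windows_overlap(a: Flow, b: Flow, tolerance_s: int) -> bool:
    """True when the two time ranges overlap once each is widened by the tolerance."""
    return a.start - tolerance_s <= b.end and b.start - tolerance_s <= a.end


def windowed_join(cloud: List[Flow], onprem: List[Flow],
                  tolerance_s: int = 120) -> List[Tuple[Flow, Flow]]:
    """Pair cloud and on-prem records that share a 5-tuple and overlap in time."""
    index: Dict[tuple, List[Flow]] = defaultdict(list)
    for flow in onprem:
        index[five_tuple(flow)].append(flow)
    pairs = []
    for c in cloud:
        for o in index.get(five_tuple(c), []):
            if windows_overlap(c, o, tolerance_s):
                pairs.append((c, o))
    return pairs
```

Indexing one side by 5-tuple keeps the time comparison limited to candidate records, which matters once either side holds millions of flows.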

Sampling and completeness gaps

Cloud flow logs introduce data gaps that have no direct equivalent in on-prem flow collection.

AWS SKIPDATA: one SKIPDATA record can represent multiple uncaptured flows. The on-prem side shows traffic. The cloud side shows nothing. Treat any SKIPDATA record during a known traffic window as a data-quality finding.

GCP secondary sampling: when set below 1.0, flow entries are discarded randomly. Correlation against on-prem NetFlow will show missing entries proportional to the secondary sample rate. At 0.5 secondary sampling, expect roughly half the cloud-side flows to be absent.
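
As a back-of-the-envelope check, the expected number of surviving cloud-side entries can be compared with what the join actually matched. This is a sketch under simplifying assumptions: the on-prem side is unsampled, primary sampling is ignored because its rate is not published, and the 20% slack is an arbitrary, hypothetical threshold.

```python
def expected_cloud_entries(onprem_flow_count: int, secondary_rate: float) -> float:
    """Expected number of cloud entries that survive secondary sampling."""
    if not 0.0 < secondary_rate <= 1.0:
        raise ValueError("secondary_rate must be in (0.0, 1.0]")
    return onprem_flow_count * secondary_rate


def coverage_is_plausible(matched: int, onprem_flow_count: int,
                          secondary_rate: float, slack: float = 0.2) -> bool:
    """True when the matched count is within +/- slack of the expectation."""
    expected = expected_cloud_entries(onprem_flow_count, secondary_rate)
    return abs(matched - expected) <= slack * expected


print(expected_cloud_entries(1000, 0.5))      # 500.0
print(coverage_is_plausible(480, 1000, 0.5))  # True
print(coverage_is_plausible(200, 1000, 0.5))  # False
```

A match count far below the expectation points to something other than secondary sampling, such as timestamp skew or a NAT rewrite breaking the join.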

Cloud delivery lag: cloud flow logs arrive via object storage poll, not real-time UDP push. The cloud side of a correlated view is always delayed relative to the on-prem side. Real-time alerting on cross-boundary patterns is not feasible with cloud flow logs alone. Retrospective correlation is the realistic use case.

NAT and identity correlation

NAT boundaries break 5-tuple joins. The endpoint IP inside the flow record is the NAT’d address. Security teams investigate the wrong host. Topology inference places the endpoint at the NAT device port, not the actual endpoint.

Inside the VPC: cloud flow logs may show pre-NAT IPs. AWS pkt-srcaddr and pkt-dstaddr expose the original pod or instance IP before NAT gateway translation. Without these fields, only the translated IP is visible.

At the perimeter: on-prem NetFlow typically reflects post-NAT IPs at the perimeter device. The cloud side may show pre-NAT IPs inside the VPC. The on-prem side shows the translated address as traffic exits the cloud.

To correlate across the NAT boundary, operators must normalize to whichever vantage point they are correlating from, and integrate NAT translation logs (cloud NAT logs, firewall session logs) into the enrichment pipeline. Without translation logs, identity recovery is impossible for flows that crossed the NAT boundary.
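
The enrichment step can be sketched as a lookup from the post-NAT address seen on one side back to the pre-NAT address seen on the other. The Translation record below is a hypothetical schema, not any vendor’s log format; real NAT or firewall session logs would need to be mapped into it first.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Translation:
    """One NAT session, as a hypothetical normalized record."""
    pre_ip: str
    pre_port: int
    post_ip: str
    post_port: int
    proto: int
    start: int  # UTC epoch seconds, session created
    end: int    # UTC epoch seconds, session torn down


def recover_pre_nat(post_ip: str, post_port: int, proto: int, ts: int,
                    translations: Iterable[Translation]) -> Optional[Tuple[str, int]]:
    """Return the original (ip, port) for a translated endpoint at time ts."""
    for t in translations:
        same_endpoint = (t.post_ip, t.post_port, t.proto) == (post_ip, post_port, proto)
        if same_endpoint and t.start <= ts <= t.end:
            return (t.pre_ip, t.pre_port)
    return None  # no session log covers this flow: identity is unrecoverable
```

The time check matters because NAT ports are reused: the same translated address and port can map to different internal hosts at different times.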

See locating endpoints behind NAT and wireless for the related problem of placing endpoints whose IP appears only behind a NAT device.

Direction ambiguity

Direction is ambiguous in both cloud and on-prem flow data, but in different ways.

GCP: no initiator direction at all. Flow logs identify packet direction relative to the interface, not which endpoint started the conversation.

AWS: the flow-direction field (version 5 and later) reports ingress or egress relative to the capturing interface, which narrows the ambiguity for AWS-side flows. Earlier versions lack it.

On-prem NetFlow: lacks initiator context natively. Flow direction is typically inferred from port assignment (low port = server) or from template metadata, not from packet inspection.

For cross-environment correlation, a flow initiated from cloud to on-prem appears as ingress on the on-prem side and egress on the cloud side. Without explicit direction fields, the join must rely on 5-tuple symmetry (same source and destination pair, ports swapped) rather than directional matching.
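
One way to join on 5-tuple symmetry is to reduce both directions of a conversation to the same key by ordering the two endpoints. A minimal sketch; the function name is hypothetical, and the ordering is plain tuple comparison, chosen only because it is consistent, not because it means anything.

```python
from typing import Tuple

Endpoint = Tuple[str, int]


def conversation_key(src: str, sport: int, dst: str, dport: int,
                     proto: int) -> Tuple[int, Endpoint, Endpoint]:
    """Direction-independent key: A->B and B->A produce the same value."""
    first, second = sorted([(src, sport), (dst, dport)])
    return (proto, first, second)


forward = conversation_key("10.0.0.5", 40000, "192.0.2.10", 443, 6)
reverse = conversation_key("192.0.2.10", 443, "10.0.0.5", 40000, 6)
print(forward == reverse)  # True
```

The key deliberately discards direction, so any initiator inference (earliest start time, port heuristics) has to be carried alongside it rather than derived from it.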

The normalization layer

Cross-domain correlation requires a normalization layer that translates cloud flow log formats and on-prem flow records into a common schema. Commercial platforms exist for this. For teams building their own normalization, the minimum requirements are:

  • Common timestamp field: normalize all timestamps to UTC epoch seconds. Apply a windowed join with tolerance appropriate to the path. Two to five minutes is the practical range for cloud-to-on-prem paths.
  • Sampling-rate awareness: scale sampled counts by the inverse of the sampling probability for both cloud and on-prem sources (multiply by N for 1-in-N packet sampling, divide by the rate when it is expressed as a fraction). GCP’s opaque primary sampling rate means cloud-side byte and packet counts may be inherently unreliable for volumetric comparison against on-prem data.
  • NAT translation integration: join cloud NAT logs and on-prem firewall session logs into the enrichment pipeline to recover pre-NAT and post-NAT identity.
  • Direction normalization: infer conversation direction from 5-tuple symmetry rather than relying on provider-specific direction fields.
  • Completeness tracking: track SKIPDATA (AWS), sampling gaps (GCP), and template-cache misses (NetFlow v9/IPFIX) as data-quality signals, not just as absent records. A gap on one side with traffic on the other is itself a finding.
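
The completeness requirement can be sketched as a check that turns a cloud-side gap into an explicit finding when the on-prem side saw traffic in the same window. The dictionary shapes are hypothetical, and the sketch assumes on-prem flows have already been grouped by the cloud interface they are known to traverse, since SKIPDATA records carry no addresses.

```python
from typing import Dict, List


def skipdata_findings(skip_windows: List[dict],
                      onprem_by_interface: Dict[str, List[dict]]) -> List[dict]:
    """Report SKIPDATA windows that overlap on-prem traffic for the same interface."""
    findings = []
    for window in skip_windows:
        flows = onprem_by_interface.get(window["interface_id"], [])
        overlapping = [
            f for f in flows
            if f["start"] <= window["end"] and f["end"] >= window["start"]
        ]
        if overlapping:
            findings.append({
                "interface_id": window["interface_id"],
                "start": window["start"],
                "end": window["end"],
                "onprem_flows": len(overlapping),
                "onprem_bytes": sum(f["bytes"] for f in overlapping),
            })
    return findings
```

Emitting a record per gap, rather than silently joining around it, lets an analyst see that a missing cloud-side match was caused by data loss and not by absent traffic.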

Signals to watch across the boundary

| Signal | Why it matters | Warning sign |
|---|---|---|
| Timestamp offset between sources | Windowed joins fail silently when skew exceeds tolerance | Flows present on one side, absent on the other despite known traffic |
| Cloud log delivery lag | Prevents real-time cross-boundary alerting | Cloud records arrive 5 to 10 minutes after on-prem records for the same conversation |
| SKIPDATA or NODATA frequency | Indicates cloud-side data loss that creates false gaps in correlation | Sudden increase in SKIPDATA records during a traffic spike |
| GCP secondary sampling rate | Below 1.0, random flow entries are discarded | Correlation shows missing cloud entries proportional to the configured rate |
| NAT translation log retention | Without translation logs, pre-NAT identity is unrecoverable | Investigation window exceeds NAT log retention period |
| NTP offset on on-prem exporters | Drift compounds with aggregation lag to shift flow records outside the join window | Flow records from different devices do not align for the same event |
| Azure non-terminating flow counts | Byte and packet totals are silently absent for affected flows | Cloud-side volumetric totals consistently lower than on-prem for same path |

How Netdata helps

Netdata can serve as the on-prem half of the correlation equation:

  • On-prem flow collection: Netdata collects NetFlow v5/v9, IPFIX, and sFlow data from network devices, providing the on-prem telemetry that cloud flow logs must be joined against.
  • Per-second metric resolution: Netdata’s collection frequency allows tight temporal correlation between on-prem flow data and contextual signals such as interface counters, BGP state, and syslog events.
  • Gap corroboration: when cloud flow logs show a gap, Netdata’s on-prem signals (interface utilization, error counters, discard counters) can confirm whether traffic actually flowed during the gap or whether the cloud-side absence reflects a real outage.
  • NTP monitoring: Netdata tracks NTP offset on collectors and can alert when clock skew exceeds thresholds that would compound with cloud-side aggregation lag.
  • UDP buffer health: for on-prem flow collectors, Netdata monitors Udp_RcvbufErrors and NIC RX drops, ensuring the on-prem half of the correlation is not silently losing data before it reaches storage.