NETWORK · OPERATIONS PLAYBOOK

The link is up, the dashboard is green, and the data is already gone

SNMP polling, flow telemetry, BGP, traps, and topology: how a monitoring pipeline really works, the places it silently loses data, the signals worth watching, and a runbook for each incident.

Network monitoring fails differently from the things it watches. The network can be perfectly healthy while your monitoring quietly goes blind.

Almost every network signal arrives over UDP — SNMP on 161, traps on 162, syslog on 514, flow records on 2055/6343 — and UDP drops silently. When a collector's socket buffer fills, the kernel discards datagrams and increments a counter almost nobody watches. Your flow charts dip, an operator assumes traffic fell, and the truth is that the data never made it off the wire. The same blind spot hides in a dozen places: an SNMP poller that falls behind and reports devices as down, a BGP session that stays Established long after it stopped carrying routes, a NetFlow v9 template that desyncs after an exporter reboot and decodes every field wrong, a counter that rolls over and paints a 4-billion-packet spike that never happened.

These guides are for engineers who already run a network and the monitoring around it — not an introduction to subnetting. The goal is the mental model of how the monitoring pipeline actually behaves, the failure patterns that keep recurring, the signals that catch them before an outage, and the runbooks you wish you'd had during the last 2 a.m. incident where everything was green and nothing worked.

How network monitoring actually works in production

Network monitoring is not one tool. It is a stack of collectors, each speaking a different protocol to a different layer of the estate, fused into one picture. Most failures live in the seams between these layers — in the transport that goes silent, not in the device being watched.

01 · time synchronization (TIME)
NTP/PTP across every collector and device. Cross-collector correlation (a flow drop paired with a BGP NOTIFICATION, a license window, a trap timestamp) depends on monotonic, aligned clocks. A few seconds of drift makes postmortems unreconstructable.

02 · polling transport (TRANSPORT)
ICMP, UDP/161 (SNMP), TCP/22 (CLI scrape), and HTTPS (vendor APIs) reaching each managed endpoint. Without working transport, every higher signal is simply absent, and absence looks like silence, not an error.

03 · SNMP polling engine (POLLER)
A scheduler fanning OID requests across devices and worker threads, holding the counter table and sysUpTime anchors used for every rate calculation, and tracking each device's UP / STALE / UNKNOWN / DOWN state.

04 · flow collection & templates (FLOW)
NetFlow v5/v9 and IPFIX collectors maintaining template caches, decoding records, normalizing sampling rates, and writing flow storage. sFlow is sample-datagram oriented, which gives it a different failure profile entirely.

05 · traps & syslog ingestion (EVENTS)
UDP/162 trap listener and UDP/TCP/TLS syslog pipeline with MIB-resolved varbinds, RFC 3164/5424 framing, and parser backpressure. Push-based, lossy, and the only signal for many event-driven conditions.

06 · BGP & routing monitoring (ROUTING)
Active, passive, or BMP sessions tracking FSM state, prefix announcements, AS-path and RPKI validity, and per-prefix reachability. A session can be Established and carrying nothing.

07 · topology inference (TOPOLOGY)
A graph builder fusing CDP/LLDP, FDB, ARP, STP, and routing tables into Layer-2/Layer-3 topology and endpoint positioning. It is probabilistic: it degrades as input freshness degrades.

08 · storage & retention (STORAGE)
Counter TSDB, full-resolution flow store, topology graph, raw syslog, and event log, each with its own disk, CPU, and IOPS profile. Slow storage backs up through the parser and becomes upstream packet loss.

Why this matters: 'traffic dropped' can mean the traffic actually dropped, or a full socket buffer, a poller fall-behind, a desynced flow template, a counter rollover, an SNMP timeout, a stale BGP session, or a disk too slow to drain the parser. Same symptom, eight different layers, eight different signals and fixes.

The failures you'll actually see

Real network-monitoring incidents fall into a small set of recurring shapes. Most of them are failures of the monitoring pipeline, not the network. Recognise the shape and triage gets much faster.

CRITICAL

The silent UDP flow-loss cascade

A collector stops draining its socket buffer fast enough — slow parser, slow disk, single-core RSS pinning — and the kernel silently discards flow datagrams. UdpRcvbufErrors climbs; the flow charts dip; everyone assumes traffic fell. The data was lost at the collector, not on the network.

  • UdpRcvbufErrors / Udp InErrors incrementing
  • flow records/sec drops with no device-side change
  • multiple exporters declining at once
  • one collector CPU core pinned at 100%
ACTIVE

The poller fall-behind

The SNMP scheduler can't complete its poll cycle within the interval. Polls queue, timeouts rise, and devices flap to UNKNOWN/DOWN even though they're perfectly healthy. This is the most common cause of an 'is the network down?' false alarm, and it gets worse exactly when the network is busiest.

  • poll cycle time exceeding the poll interval
  • SNMP timeout/retry rate climbing
  • devices oscillating UP/UNKNOWN with no real outage
  • poller worker pool saturated
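
Whether a poller is likely to fall behind can be estimated before it happens. The sketch below is a back-of-the-envelope model, not any particular poller's scheduler: it assumes every device takes the same average time to poll (timeouts and retries included) and that workers share devices evenly. The function names and parameters are hypothetical.

```python
import math


def estimated_cycle_seconds(device_count, workers, seconds_per_device):
    """Rough time for one full poll cycle when devices are split evenly across workers."""
    if workers < 1:
        raise ValueError("at least one worker is required")
    devices_per_worker = math.ceil(device_count / workers)
    return devices_per_worker * seconds_per_device


def workers_needed(device_count, seconds_per_device, target_cycle_seconds):
    """Smallest worker count that keeps total poll work within the target cycle time."""
    if target_cycle_seconds <= 0:
        raise ValueError("target cycle time must be positive")
    total_work_seconds = device_count * seconds_per_device
    return math.ceil(total_work_seconds / target_cycle_seconds)
```

With 2,000 devices, 16 workers, and 0.6 seconds per device, the estimated cycle is about 75 seconds, which already overruns a 60-second interval. Setting the target below the interval leaves headroom for the timeouts that pile up when the network is busy.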
IMMINENT

BGP Established but stale

The BGP FSM still reads Established, so every up/down check passes — but the session stopped exchanging updates. Routes age out or freeze, traffic blackholes for a prefix, and the one signal everyone trusts is lying. State alone is not health; you have to watch prefix counts and update activity.

  • Established session with frozen prefix counts
  • no UPDATE activity for an unusually long window
  • reachability loss for prefixes the peer should advertise
  • hold-timer near expiry without a state change
ACTIVE

The trap & syslog flood

A flapping link or a reconvergence event makes hundreds of devices emit traps and syslog at once. The UDP/162 receiver and the syslog parser saturate, drop events under burst, and the one record you needed — the root-cause linkDown — is the one that got dropped. The storm hides its own cause.

  • trap/syslog receipt rate spiking orders of magnitude
  • trap receiver UDP drops under burst
  • syslog parser backpressure / queue growth
  • correlated with an interface or STP flap
WATCHFUL

Fake spikes from counter rollover

A 32-bit interface counter wraps past 4.29 billion, or a device reboot resets it, and the naive delta calculation paints an impossible traffic spike. Alerts fire on traffic that never happened; capacity reports are poisoned. The fix is 64-bit ifHC counters and a sysUpTime discontinuity check, not a higher threshold.

  • instantaneous spike to an implausible rate
  • spike coincides with a counter reset or reboot
  • 32-bit counters still in use on fast links
  • sysUpTime discontinuity around the spike
IMMINENT

The vendor-API silent gap

A Meraki, Cato, or PAN-OS pull returns HTTP 200 with an empty or truncated payload — a throttle, an expired token scope, or a pagination bug — and the collector records 'success' while ingesting nothing. The dashboard goes flat and nobody is paged, because 200 is not an error.

  • HTTP 200 with empty/short payload from a vendor API
  • metrics flat-line for an API-sourced device set
  • 429 rate or rate-limit-remaining near zero
  • no corresponding SNMP/flow gap for the same devices
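
One way to stop treating HTTP 200 as success is to classify each pull by its payload as well as its status. This is a minimal sketch under stated assumptions: the collector already knows roughly how many items a healthy pull returns (expected_min_items), and the result labels and function name are hypothetical rather than taken from any vendor SDK.

```python
def classify_pull(status_code, items, expected_min_items, rate_limit_remaining=None):
    """Label one vendor-API pull using status, payload size, and rate-limit headroom."""
    if status_code == 429:
        return "throttled"
    if status_code != 200:
        return "error"
    if len(items) < expected_min_items:
        # 200 with an empty or truncated payload: record it as a gap, not a success.
        return "suspect_empty"
    if rate_limit_remaining is not None and rate_limit_remaining <= 0:
        return "ok_rate_limit_exhausted"
    return "ok"
```

Alerting on a run of "suspect_empty" results for a device set catches the flat-line case that a status-code check alone never sees.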

Network monitoring maturity levels

Network observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much the network matters. Most production networks should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor: is the device reachable and is the link up? You won't learn why anything broke, but you'll learn that it broke before users phone in. Enough for lab and low-stakes segments.

  • Device reachability (ICMP / SNMP): Does the device answer a ping and an SNMP get?
  • Interface operational status: Is ifOperStatus up on the links that matter?
  • Device uptime / unexpected reboot: Did sysUpTime reset without your permission?
  • Collector process alive: Is the poller / flow / trap collector actually running?
  • Interface utilization on uplinks: Is a critical link near saturation?
  • Environment (temperature / PSU / fan): Is hardware about to fail?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production networks should target. Survival says something is wrong; operational says what. With this coverage your team can usually localize an incident: errors vs discards, poller health, flow receipt, trap/syslog rates, BGP state.

  • Interface errors and discards: ifInErrors/ifOutErrors and ifInDiscards/ifOutDiscards per link.
  • Interface utilization vs ifHighSpeed: Real percent-of-capacity, not raw bits.
  • SNMP poll success / timeout / retry: Is the poller keeping up with its cycle?
  • Flow UDP receipt rate + UdpRcvbufErrors: Are flow datagrams arriving and being kept?
  • Trap & syslog receipt rate and severity: Is the event pipeline flowing and not dropping?
  • BGP session state per peer: Established, and for how long?
  • Collector CPU, memory, and disk: Is the monitoring host itself healthy?
  • NTP offset on collectors and devices: Are timestamps aligned enough to correlate?
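
For the utilization item in the list above, percent-of-capacity comes from an octet delta and ifHighSpeed, which IF-MIB reports in units of 1,000,000 bits per second. A minimal sketch, assuming the delta has already been corrected for rollover and resets; the function name is hypothetical.

```python
def utilization_percent(delta_octets, interval_seconds, if_high_speed_mbps):
    """Link utilization as a percentage of ifHighSpeed over one polling interval."""
    if interval_seconds <= 0 or if_high_speed_mbps <= 0:
        raise ValueError("interval and ifHighSpeed must be positive")
    bits_per_second = delta_octets * 8 / interval_seconds
    capacity_bits_per_second = if_high_speed_mbps * 1_000_000
    return 100.0 * bits_per_second / capacity_bits_per_second
```

For example, 37,500,000,000 octets in 60 seconds on a link reporting ifHighSpeed 10000 is 5 Gbit/s, or 50 percent.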

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches the slow bleeds: a BGP session established but stale, a flow template drifting toward desync, sampling rates that aren't normalized, a license inching toward expiry, FDB/ARP tables going stale. None pages you today; each becomes an incident in a month.

  • BGP prefix counts + UPDATE activity: Is an Established session actually carrying routes?
  • Flow template freshness / desync: Did an exporter reboot break decoding?
  • Sampling-rate normalization: Are sFlow/NetFlow totals scaled correctly?
  • License days-to-expiry per feature: Months of headroom before a feature silently disables?
  • Counter discontinuity / rollover: Are rate calcs anchored to sysUpTime?
  • FDB / ARP / topology freshness: Is endpoint position based on current data?
  • Vendor API 429 rate + payload validity: Are pull-mode collectors getting real data?
  • Per-core collector CPU + NIC RX drops: Is one RSS core silently dropping packets?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after an incident proved you needed them: RPKI validity, asymmetric-path detection, NAT/session-table headroom, SD-WAN data-plane vs control-plane, audit-log gap detection. Most teams don't need every one — add the ones your incident history demands.

  • RPKI validity + AS-path change alerts: Is a route being leaked or hijacked?
  • SD-WAN data-plane loss/latency per tunnel: Tunnel up, but is the path actually healthy?
  • NAT / session-table utilization: Headroom before connections start failing?
  • Asymmetric-routing detection: Are path/latency measurements even valid?
  • Audit-log gap detection: Did syslog/trap loss create a blind window?
  • STP topology-change rate: Is the Layer-2 fabric reconverging repeatedly?
  • Cloud + on-prem flow correlation: Does traffic stay visible across the boundary?
  • Flow export-to-ingest latency: How stale is the flow picture you're trusting?

Operating mistakes worth avoiding

The traps network teams keep falling into. Each has a clear fix that most teams only learn after an incident.

Not monitoring UdpRcvbufErrors on collectors

Flow, trap, and syslog data all arrive over UDP and drop silently when the socket buffer fills. UdpRcvbufErrors is the only direct signal, and it's the one counter most teams never graph. Alert on any nonzero increment rate and size net.core.rmem_max to 16 MB+ at deployment.
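
On Linux the counter can be read from /proc/net/snmp, where the Udp header row lists it as RcvbufErrors (tools such as nstat show it as UdpRcvbufErrors). The sketch below parses that text and turns two samples into a drop rate; the function names are hypothetical, and reading the file on a schedule is left to the caller.

```python
def parse_udp_counters(proc_net_snmp_text):
    """Return the Udp counters from /proc/net/snmp content as a dict of ints."""
    # The file holds a header row and a value row per protocol; "UdpLite:" is skipped.
    udp_rows = [line.split() for line in proc_net_snmp_text.splitlines()
                if line.startswith("Udp:")]
    if len(udp_rows) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_rows[0][1:], udp_rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def rcvbuf_drop_rate(prev, curr, interval_seconds):
    """Datagrams dropped per second on full receive buffers between two samples."""
    delta = curr["RcvbufErrors"] - prev["RcvbufErrors"]
    # A negative delta means the counters were reset (for example by a reboot).
    return max(delta, 0) / interval_seconds


# Typical use: sample open("/proc/net/snmp").read() every N seconds and
# alert whenever rcvbuf_drop_rate(...) is above zero.
```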

Treating BGP session state as session health

An Established peer can stop carrying routes and your up/down check stays green. Watch prefix counts and UPDATE activity, not just FSM state: 'Established but stale' is a silent blackhole.
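
A staleness check can be sketched as a rule over recent samples per peer. This is an illustration under assumptions, not a reference to any BGP daemon's API: PeerSample and its fields are hypothetical, and the one-hour quiet threshold is a placeholder that needs tuning, since a peer advertising a single stable default route can legitimately be silent for long periods.

```python
from dataclasses import dataclass


@dataclass
class PeerSample:
    state: str                        # FSM state as reported by the router or a BMP feed
    prefixes_received: int            # prefixes currently accepted from the peer
    seconds_since_last_update: float  # age of the most recent UPDATE message


def is_established_but_stale(samples, max_quiet_seconds=3600.0):
    """True if the peer stayed Established while prefix count and UPDATE activity froze."""
    if len(samples) < 2:
        return False
    if any(sample.state != "Established" for sample in samples):
        return False  # a real state change is a different alert
    prefix_count_frozen = len({sample.prefixes_received for sample in samples}) == 1
    quiet_too_long = samples[-1].seconds_since_last_update >= max_quiet_seconds
    return prefix_count_frozen and quiet_too_long
```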

Leaving 32-bit counters on fast links

A 32-bit ifInOctets wraps in about 3.4 seconds on a 10G link at line rate, and a naive delta paints a fake multi-billion-octet spike. Use 64-bit ifHC counters and check sysUpTime for discontinuity before trusting any rate.
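
The rate calculation with both safeguards can be sketched as follows. It assumes at most one wrap between samples, which only holds when the poll interval is shorter than the wrap time, and it treats any backwards step in sysUpTime as a reset. The function names are hypothetical.

```python
def counter_rate(prev_value, curr_value, prev_uptime_ticks, curr_uptime_ticks,
                 interval_seconds, counter_bits=32):
    """Per-second rate from two counter samples, or None if the delta is untrustworthy."""
    if curr_uptime_ticks < prev_uptime_ticks:
        # sysUpTime went backwards: the agent restarted and the counter was reset.
        # (sysUpTime itself wraps after roughly 497 days; discarding that one sample is harmless.)
        return None
    delta = curr_value - prev_value
    if delta < 0:
        # Counter wrapped; assume exactly one wrap within the interval.
        delta += 2 ** counter_bits
    return delta / interval_seconds


def seconds_to_wrap(link_bits_per_second, counter_bits=32):
    """Time for an octet counter to wrap at full line rate."""
    return (2 ** counter_bits) * 8 / link_bits_per_second
```

seconds_to_wrap(10e9) is about 3.44 seconds, far shorter than a typical poll interval, which is why the one-wrap assumption fails on fast links and 64-bit counters are the real fix.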

Not watching NTP drift on devices

Every cross-collector correlation — flow + syslog, BGP + flow drop, license windows — depends on aligned clocks. A few seconds of drift makes incident reconstruction impossible. Monitor offset on collectors and managed devices.

Invisible trap receiver drops

UDP/162 drops thousands of traps a minute under a link-flap storm and no one notices. Monitor the trap receiver's drop counter and socket-buffer fill, and rate-limit at the source — the trap you lose is usually the root cause.

NetFlow v9 / IPFIX template desync goes undetected

After an exporter reboot the collector can decode records against a stale template and silently produce garbage fields. Correlate decode-error rate with exporter reboots and alert on template-cache misses, where a data record arrives with no matching template.
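
One defensive pattern for NetFlow v9 is to key the cache by exporter, source ID, and template ID, and to flush an exporter's templates when the sysUpTime field in its packet header goes backwards. The class below is a minimal sketch of that idea with hypothetical names; it is not how any specific collector implements it, and IPFIX would need a different restart signal because its header carries no sysUpTime.

```python
class TemplateCache:
    """Template cache that drops an exporter's templates when it appears to have rebooted."""

    def __init__(self):
        self._templates = {}    # (exporter, source_id, template_id) -> list of field names
        self._last_uptime = {}  # (exporter, source_id) -> last header sysUpTime in ms

    def observe_header(self, exporter, source_id, sys_uptime_ms):
        """Record header uptime; flush and return True if it went backwards."""
        key = (exporter, source_id)
        previous = self._last_uptime.get(key)
        self._last_uptime[key] = sys_uptime_ms
        if previous is not None and sys_uptime_ms < previous:
            for stale_key in [k for k in self._templates if k[:2] == key]:
                del self._templates[stale_key]
            return True
        return False

    def store(self, exporter, source_id, template_id, fields):
        self._templates[(exporter, source_id, template_id)] = list(fields)

    def lookup(self, exporter, source_id, template_id):
        """Return the field list, or None on a miss (count misses and alert on them)."""
        return self._templates.get((exporter, source_id, template_id))
```

After a flush, data records are held or dropped until the exporter resends its templates, which is preferable to decoding them against the old layout.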

Skipping sampling-rate normalization

sFlow and sampled NetFlow report 1-in-N; if analytics don't multiply by N (and N changes per exporter) your totals are off by orders of magnitude. Normalize at ingest and alert when an exporter's sampling rate changes.
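
Normalizing at ingest amounts to scaling each sampled record by its exporter's current 1-in-N rate and noticing when N changes. A minimal sketch with hypothetical names, assuming the sampling rate is already known per exporter from the flow metadata or configuration:

```python
def normalize_flow(sampled_bytes, sampled_packets, sampling_rate):
    """Scale a sampled record up to estimated totals for a 1-in-N sampling rate."""
    if sampling_rate < 1:
        raise ValueError("sampling rate must be 1 (unsampled) or greater")
    return sampled_bytes * sampling_rate, sampled_packets * sampling_rate


class SamplingRateTracker:
    """Remembers the last sampling rate per exporter so changes can raise an alert."""

    def __init__(self):
        self._rates = {}

    def update(self, exporter, sampling_rate):
        """Record the rate; return True if it differs from the previously seen value."""
        previous = self._rates.get(exporter)
        self._rates[exporter] = sampling_rate
        return previous is not None and previous != sampling_rate
```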

Monitoring license expiry only after a feature dies

Many platforms silently disable features — flow export, advanced routing, threat prevention — the day a license lapses, with no SNMP trap. Track days-to-expiry per feature and alert weeks ahead, not on the outage.
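
Tracking days-to-expiry needs nothing more than the expiry date per feature and a warning window. The sketch below uses hypothetical feature names and a 45-day default that is only a placeholder; how the expiry dates are collected from each platform is outside its scope.

```python
from datetime import date


def days_to_expiry(expiry, today):
    """Whole days remaining until a license lapses (negative once it has lapsed)."""
    return (expiry - today).days


def licenses_needing_alert(licenses, today, warn_days=45):
    """Names of features whose license expires within the warning window, sorted."""
    return sorted(name for name, expiry in licenses.items()
                  if days_to_expiry(expiry, today) <= warn_days)
```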

Network runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

WHERE TO GO NEXT

Setting up network monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.