POSTGRESQL · OPERATIONS PLAYBOOK

Running PostgreSQL in production, without the 3 a.m. surprises

MVCC, WAL, autovacuum, replication. The mental model of how the server actually works, where it tends to break, what to monitor as your operation matures, and the runbooks for the incidents you'll see.

"PostgreSQL is famously easy to run for the first year, and famously hard to run for the fifth."

The defaults work. Until autovacuum cannot keep up with a high-churn table and dead tuples pile up. Until a forgotten replication slot retains WAL forever and fills the disk. Until age(datfrozenxid) approaches 2 billion and the database refuses writes to avoid wraparound corruption. Until a long-running transaction silently holds back dead-tuple cleanup across the cluster. Until one migration takes an AccessExclusiveLock on a hot table and blocks every other transaction that touches it. Until a checkpoint storm turns a steady write workload into a stop-the-world I/O spike.

These guides are written for engineers who already run PostgreSQL, not for people learning what an index is. The goal is to give you the mental model of how the server actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How PostgreSQL actually runs in production

PostgreSQL is not a single process. It is a postmaster supervising one backend process per connection, several background processes, a chunk of shared memory, and a strict contract with the storage layer. Most production failures live between these layers, not inside any one of them.

01 · applications / ORMs · USER
Whatever opens connections: application servers, batch jobs, CI scripts, BI tools, replication consumers. Each connection eventually becomes one Postgres backend process.

02 · connection pooler · POOL
PgBouncer, Pgpool-II, Odyssey. Multiplexes thousands of client connections onto a small server pool. Architecturally mandatory at scale.

03 · postmaster + backends · BACKEND
One backend process per server connection. Each backend uses ~5–10 MB of memory even when idle. <code>work_mem</code> is per-operation, not per-backend, so a single complex query can multiply allocations.

04 · shared memory · SHARED
<code>shared_buffers</code>, WAL buffers, the lock table, and the procarray. The piece of PostgreSQL that survives across queries.

05 · background workers · BACKGROUND
Autovacuum launcher + workers, walwriter, bgwriter, checkpointer, walsender, walreceiver, logical replication apply workers. They run the server's hygiene and replication contracts.

06 · storage layout · STORAGE
Heap files, indexes, TOAST tables, pg_wal, temp files, replication slots. The on-disk shape of the database.

07 · OS page cache · KERNEL
The kernel caches PostgreSQL data files. PostgreSQL double-caches deliberately. Above ~40% of RAM in <code>shared_buffers</code> you starve this cache and lose more than you gain.

08 · block storage · DISK
Local NVMe, EBS, ZFS, or whatever sits under the data directory. WAL fsync latency on this layer sets the ceiling on commit throughput.

Why this matters: a query can be slow because of a missing index, a stale plan, a lock wait, a temp-file spill, a checkpoint flush, an autovacuum I/O storm, an OS page-cache miss, or a slow disk fsync. The symptom is the same — slow query — but each layer has a different signal and a different fix.

The failures you'll actually see

Most PostgreSQL incidents fall into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The connection exhaustion cliff

FATAL: sorry, too many clients already. Applications fail to acquire connections; new sessions are refused. Underneath it is usually max_connections set too low for the workload, an application leak, idle-in-transaction sessions piling up, or no PgBouncer in front of the database.

  • too many connections errors at the driver
  • pg_stat_activity hits max_connections
  • idle in transaction sessions piling up
  • PgBouncer waiting_client_count climbs
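
A minimal sketch of how the first three symptoms could be checked together. It assumes you have already fetched one `state` value per session from pg_stat_activity; the function name, the 80% warning ratio, and the sample data are illustrative assumptions, not part of any PostgreSQL or Netdata API.

```python
from collections import Counter

def connection_pressure(states, max_connections, warn_ratio=0.8):
    """Summarise session states and flag usage close to max_connections.

    states: one string per session, e.g. the `state` column of
    pg_stat_activity. warn_ratio is an assumed threshold, tune it.
    """
    counts = Counter(states)
    used = len(states)
    return {
        "used": used,
        "ratio": used / max_connections,
        "idle_in_transaction": counts.get("idle in transaction", 0),
        "near_limit": used >= warn_ratio * max_connections,
    }

# Invented sample: 85 sessions against a ceiling of 100.
sample = ["active"] * 10 + ["idle"] * 50 + ["idle in transaction"] * 25
report = connection_pressure(sample, max_connections=100)
print(report)
```

A large `idle_in_transaction` count next to a high ratio points at the application holding transactions open rather than at a ceiling that is simply too low.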
IMMINENT

The lock cascade

One slow transaction takes a lock; everything else queues behind it. A migration takes AccessExclusiveLock on a hot table; the entire app stalls. Row-level locks contend; the deadlock detector runs each time a session has waited deadlock_timeout. The database keeps running while the workload grinds to a halt.

  • active sessions climbing without throughput
  • pg_blocking_pids shows a deep wait chain
  • deadlock_timeout logs spike
  • AccessExclusiveLock held by a DDL session
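
To make the "deep wait chain" symptom concrete, here is a small sketch that finds the sessions at the head of a chain. It assumes you have collected, for each session, the pid list that pg_blocking_pids(pid) returns; the helper names and the sample pids are hypothetical.

```python
def root_blockers(blocked_by):
    """Pids that block someone but are not themselves waiting.

    blocked_by maps pid -> list of pids blocking it (empty if not waiting).
    """
    waiting = {pid for pid, blockers in blocked_by.items() if blockers}
    blockers = {b for pids in blocked_by.values() for b in pids}
    return sorted(blockers - waiting)

def chain_depth(pid, blocked_by, seen=None):
    """Longest wait chain starting at pid; 0 if pid is not waiting."""
    seen = seen or set()
    if pid in seen or not blocked_by.get(pid):
        return 0  # not waiting, or a cycle (deadlock) already visited
    seen = seen | {pid}
    return 1 + max(chain_depth(b, blocked_by, seen) for b in blocked_by[pid])

# Invented sample: 200 holds the lock, 301 waits on it, 302/303 wait on 301.
sample = {200: [], 301: [200], 302: [301], 303: [301]}
print(root_blockers(sample), chain_depth(302, sample))
```

The root blocker is usually the session to inspect first; cancelling sessions further down the chain does not release anything.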
ACTIVE

The autovacuum starvation spiral

A long-running transaction prevents dead tuple cleanup. Bloat accumulates on hot tables. Sequential scans get slower. Indexes balloon. Autovacuum eventually catches up — at the worst possible time, competing with peak load. The fix is rarely "tune autovacuum harder"; it is "find the long transaction."

  • n_dead_tup growing without n_live_tup matching
  • pg_stat_activity has a transaction older than 30 minutes
  • table size grows faster than row count
  • VACUUM runs that don't reclaim dead tuples
CRITICAL

The transaction ID wraparound emergency

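PostgreSQL stops accepting writes when transaction IDs come within ~3 million of wraparound (1 million on releases before PostgreSQL 14). WARNING: database must be vacuumed within X transactions escalates to ERROR: database is not accepting commands. The traditional recovery is single-user mode and VACUUM FREEZE; recent PostgreSQL documentation advises against single-user mode and recommends a plain VACUUM in normal operation once long-running transactions, old prepared transactions, and stale replication slots are cleared. Prevention is monitoring age(datfrozenxid) long before it matters.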

  • log warnings about transaction ID wraparound
  • age(datfrozenxid) above 1 billion
  • autovacuum_freeze_max_age frequently triggered
  • anti-wraparound vacuums running against multiple tables
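
As a rough sketch of the prevention side, the function below classifies an age(datfrozenxid) value and reports the remaining headroom. The 500M and 1B thresholds follow the alerting advice given later in this guide; the function name and the embedded query string are illustrative, and the query is only shown, not executed.

```python
# Query you could run to obtain the input (shown for context only).
XID_AGE_SQL = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

XID_WRAP_LIMIT = 2**31  # about 2.147 billion transaction IDs

def wraparound_status(xid_age, warn=500_000_000, crit=1_000_000_000):
    """Return (level, headroom) for one database's age(datfrozenxid)."""
    headroom = XID_WRAP_LIMIT - xid_age
    if xid_age >= crit:
        level = "critical"
    elif xid_age >= warn:
        level = "warning"
    else:
        level = "ok"
    return level, headroom

for age in (200_000_000, 600_000_000, 1_200_000_000):
    print(age, wraparound_status(age))
```

Headroom is measured against the raw 2^31 limit; the server stops assigning transaction IDs a few million before that point.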
IMMINENT

The replication slot disk-fill

A logical or physical replication slot stops being consumed. The primary cannot recycle WAL because the slot retains it. pg_wal grows without bound until the disk fills. The primary then can no longer write WAL and stops accepting writes. The fix in the moment is to drop the slot; the prevention is alerting on slot lag and capping retention with max_slot_wal_keep_size (available since PostgreSQL 13).

  • pg_wal directory growing steadily
  • pg_replication_slots shows active=false on a retained slot
  • slot_lag_bytes > a few GB
  • checkpoints occurring but WAL not recycling
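
Slot lag in bytes is the distance between the current WAL position and the slot's restart_lsn. On the server, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) computes it; the sketch below does the same arithmetic client-side from the textual LSN form (two hexadecimal halves separated by a slash). The function names and sample LSNs are illustrative.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '1/C0000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_bytes(current_lsn, restart_lsn):
    """WAL bytes a slot forces the primary to keep."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

# Invented sample: the slot is exactly 2 GiB behind the current position.
lag = retained_bytes("2/40000000", "1/C0000000")
print(lag, lag / 2**30)
```

Alerting on this value together with active = false catches the abandoned-slot case before pg_wal growth becomes visible on the disk graph.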
WATCHFUL

The checkpoint storm

A burst of writes generates enough WAL to reach max_wal_size and forces a requested checkpoint (counted in checkpoints_req) ahead of schedule. Buffered writes drain to disk in a spike; fsync latency climbs; query latency follows. Logs show "checkpoints are occurring too frequently". The fix is almost always raising max_wal_size, not changing checkpoint_timeout.

  • checkpoints_req >> checkpoints_timed
  • log warning: checkpoints are occurring too frequently
  • I/O spikes aligned with checkpoint completion
  • p99 commit latency climbs during checkpoints
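
The first symptom compares two cumulative counters, so it only makes sense on deltas between samples. A minimal sketch, assuming two snapshots of the checkpoints_timed and checkpoints_req columns of pg_stat_bgwriter (PostgreSQL 17 moved these counters to pg_stat_checkpointer as num_timed and num_requested); the function name and sample values are invented.

```python
def checkpoint_mix(prev, curr):
    """Share of requested checkpoints between two counter snapshots."""
    timed = curr["checkpoints_timed"] - prev["checkpoints_timed"]
    requested = curr["checkpoints_req"] - prev["checkpoints_req"]
    total = timed + requested
    share = requested / total if total else 0.0
    return {"timed": timed, "requested": requested, "requested_share": share}

# Invented sample: 4 timed and 16 requested checkpoints in the interval.
before = {"checkpoints_timed": 100, "checkpoints_req": 10}
after = {"checkpoints_timed": 104, "checkpoints_req": 26}
mix = checkpoint_mix(before, after)
print(mix)
```

A requested share that stays high over a sustained interval suggests WAL volume, not the timer, is driving checkpoints.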

PostgreSQL monitoring maturity levels

PostgreSQL observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your database matters. Most production databases should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the database still functioning? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev environments and hobby clusters.

  • Database reachability: Can a probe connect and run SELECT 1?
  • Server uptime / unexpected restarts: Did the postmaster restart without your permission?
  • Disk free on the data directory: Is the volume hosting pg_wal and base/ near full?
  • Connection count vs max_connections: Are you within the connection ceiling?
  • Replication, replicas connected: Are the expected replicas attached to the primary?
  • Backup last-success age: When did pg_basebackup or pgBackRest last succeed?

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production databases should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: bloat, replication lag, slow queries, checkpoint pressure, lock waits.

  • Transactions per second (commits + rollbacks): Is the workload doing what it should?
  • Cache hit ratio per database: Are reads served from shared_buffers?
  • Replication lag (write/flush/replay): How far behind is each replica, in bytes and seconds?
  • Dead tuples and table bloat: Is autovacuum keeping up with churn?
  • Active vs idle vs waiting sessions: What are the sessions in pg_stat_activity actually doing?
  • Lock waits and blocking sessions: Is anything in a multi-second wait?
  • Long-running transactions (>5 min): Is anything holding xmin back from cleanup?
  • Checkpoints, timed vs requested: Is max_wal_size sized correctly?
  • WAL generation rate: Is the write workload growing?
  • pg_stat_statements top by total_exec_time (total_time before PostgreSQL 13): Which queries actually cost the most?
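
The first two signals in this list can be derived from the same pair of pg_stat_database snapshots. The sketch below assumes you sample the cumulative columns xact_commit, xact_rollback, blks_hit, and blks_read at a known interval; the function name and numbers are illustrative.

```python
def database_rates(prev, curr, seconds):
    """Transactions per second and cache hit ratio between two snapshots."""
    delta = {key: curr[key] - prev[key] for key in prev}
    reads = delta["blks_hit"] + delta["blks_read"]
    return {
        "tps": (delta["xact_commit"] + delta["xact_rollback"]) / seconds,
        # None when no blocks were requested in the interval.
        "cache_hit_ratio": delta["blks_hit"] / reads if reads else None,
    }

# Invented sample taken 60 seconds apart.
t0 = {"xact_commit": 1000, "xact_rollback": 10,
      "blks_hit": 90_000, "blks_read": 10_000}
t1 = {"xact_commit": 1600, "xact_rollback": 10,
      "blks_hit": 189_000, "blks_read": 11_000}
rates = database_rates(t0, t1, seconds=60)
print(rates)
```

Computing the ratio on deltas rather than lifetime totals keeps it responsive: a cluster with months of uptime can show a healthy lifetime ratio while the last minute is mostly disk reads.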

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up: age(datfrozenxid) climbing, replication slot lag drifting, statistics going stale, the plan cache regressing to a generic plan, temp file rate creeping. None of these will page you on day one. They become incidents that page someone on day thirty.

  • age(datfrozenxid) per database: Are there months of headroom against wraparound?
  • Replication slot lag (bytes retained): Is a stale slot accumulating WAL?
  • Autovacuum worker utilisation: Are workers saturated? Is anything blocked?
  • Temp file generation rate and size: Is work_mem too small for real queries?
  • Buffer eviction rate (bgwriter + backend writes): Is shared_buffers thrashing?
  • Heap fetches per index-only scan: Is the visibility map stale?
  • WAL fsync p99 latency: How fast does the underlying disk really fsync?
  • Connection age distribution: Are PgBouncer transaction-pool connections rotating?
  • Plan cache hit ratio (prepared statements): Is the planner choosing between generic and custom plans correctly?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them: wait-event sampling, autovacuum I/O accounting per table, B-tree split rates, ProcArray contention, recovery conflicts on hot standbys. Most teams never need every signal here. Add the ones your incident history says you do.

  • wait_event sampling from pg_stat_activity: Where is the server spending its waiting time?
  • Per-table autovacuum I/O and duration: Which tables consume vacuum budget?
  • B-tree split and fillfactor effectiveness: Are HOT updates winning, or are indexes bloating?
  • Hot standby recovery conflicts: Is replay being interrupted by replica queries?
  • Logical replication apply latency by table: Which subscriber tables fall behind?
  • shared_buffers dirty rate vs flush rate: Are checkpoints flushing what bgwriter should?
  • Page cache pressure on the data volume: Is the OS evicting Postgres pages?
  • auto_explain captures of slow queries: Plan and actual rows for every slow path.

Operating mistakes worth avoiding

The traps PostgreSQL teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

max_connections set to 500+ instead of using a pooler

PostgreSQL is process-per-connection. Each backend costs ~5–10 MB even idle. Five hundred backends is 2.5–5 GB of memory before any query runs, plus serious context-switch overhead. PgBouncer in transaction mode lets you serve thousands of clients with 50 server connections.
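
The memory side of this trade-off is simple multiplication. A sketch using the ~5–10 MB idle cost per backend quoted above; the function name is hypothetical and real per-backend usage varies with workload and settings.

```python
def idle_backend_memory_mb(backends, per_backend_mb=(5, 10)):
    """Low and high estimate of idle backend memory, in MB."""
    low, high = per_backend_mb
    return backends * low, backends * high

direct = idle_backend_memory_mb(500)   # max_connections = 500, no pooler
pooled = idle_backend_memory_mb(50)    # 50 server connections behind a pooler
print(direct, pooled)
```

This counts only the idle baseline; work_mem allocations from active queries come on top of it.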

Not monitoring age(datfrozenxid)

Wraparound is the silent killer. Default <code>autovacuum_freeze_max_age</code> is 200M. The hardcoded shutdown threshold is around 2.147B. Alert at 500M and 1B; ignore both and you will eventually meet a database that refuses writes.

Replication slots without monitoring

A slot retains WAL until consumed. A forgotten or stalled slot is the #1 root cause of pg_wal filling the disk. Alert on slot lag bytes and active=false on any persistent slot.

fsync = off "for performance"

fsync is what makes PostgreSQL durable. Disabling it can corrupt the cluster on any unclean shutdown. If you genuinely need extra write performance, tune synchronous_commit, not fsync.

pg_basebackup or pgBackRest backups never restore-tested

An untested backup is not a backup. Schedule a quarterly restore drill on a separate host. The first time you discover that backups don't restore must not be during an incident.

Treating autovacuum as something to disable

Disabling autovacuum on "hot" tables to "avoid I/O" is how teams meet wraparound emergencies. Tune <code>autovacuum_vacuum_scale_factor</code> and <code>autovacuum_vacuum_cost_delay</code> per table; never set <code>autovacuum_enabled = off</code> in production.
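
Why the scale factor matters on large tables: autovacuum vacuums a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples, with defaults of 50 and 0.2. A small sketch of that formula; the function name and the 100-million-row table are illustrative.

```python
def autovacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples required before autovacuum vacuums a table."""
    return threshold + scale_factor * reltuples

rows = 100_000_000
default_trigger = autovacuum_trigger(rows)                   # stock settings
tuned_trigger = autovacuum_trigger(rows, scale_factor=0.01)  # per-table override
print(default_trigger, tuned_trigger)
```

With stock settings a table of this size accumulates roughly 20 million dead tuples before a vacuum starts; lowering the scale factor for that table makes vacuums smaller and more frequent instead of rare and heavy.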

Ignoring idle in transaction sessions

An idle-in-transaction session that holds a snapshot pins xmin, so vacuum cannot remove any dead tuple that snapshot might still need to see. Set <code>idle_in_transaction_session_timeout</code> on every production cluster (60s–5min depending on workload).

Tuning shared_buffers to 80% of RAM

The OS page cache also caches Postgres pages. Above ~40% of RAM in shared_buffers, the kernel cache starves and you pay double for the same data. 25–40% of RAM is the commonly recommended range.

PostgreSQL runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

WHERE TO GO NEXT

Setting up PostgreSQL monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.