COCKROACHDB · OPERATIONS PLAYBOOK

Running CockroachDB in production, without the 3 a.m. surprises

CockroachDB is a distributed SQL layer over a transactional key-value store, replicated by Raft across ranges, persisted to a Pebble LSM tree, and coordinated by synchronized clocks. This playbook covers the mental model of how it actually behaves under load, where it tends to break, what to monitor as your operation matures, and the runbooks for the incidents you'll see.

CockroachDB is famously easy to scale horizontally and famously unforgiving once write rate, range count, or clock drift pushes past what the defaults assumed.

The defaults work. Until writes outpace compaction, the L0 sublevel count climbs past 20, and Pebble stalls writes while read latency climbs nonlinearly. Until a node's Go GC pause exceeds the liveness heartbeat interval and it loses its leases, then recovers, then loses them again. Until NTP drifts and a node logs "clock synchronization error: this node is more than 500ms away from at least half of the known nodes" and self-terminates. Until a sequential primary key funnels every write through one leaseholder while the rest of the cluster idles. Until a stalled changefeed holds a protected timestamp that silently blocks MVCC garbage collection and the disk fills with dead data.

These guides are written for engineers who already run CockroachDB, not for people learning what a range is. The goal is to give you the mental model of how the cluster actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How CockroachDB actually runs in production

CockroachDB is not just a SQL database. It is a distributed SQL engine over a transactional KV store, where every write travels through Raft consensus to a quorum of replicas, lands in a Pebble LSM tree, and is ordered by clocks that must stay synchronized. Most production failures live between these layers, not inside any one of them.

01 · SQL gateway + DistSQL · GATEWAY
Application connections terminate on a gateway node over the PostgreSQL wire protocol; each costs a goroutine and memory. The gateway parses and plans the statement, then may distribute it across nodes as a flow of processors. A single analytical query can saturate inter-node bandwidth and consume <code>--max-sql-memory</code> on several nodes, spilling to disk or failing with <code>53200</code>.

02 · transaction layer (MVCC) · TXN
Serializable snapshot isolation over MVCC timestamps. Conflicting transactions are pushed or restarted with <code>TransactionRetryWithProtoRefreshError</code>. Uncommitted writes leave intents that other transactions must resolve; abandoned intents accumulate and add latency cluster-wide.

03 · ranges + leaseholder + Raft · RAFT
The keyspace is split into ~512 MiB ranges, each replicated (default 3x). One replica holds the lease and serves reads; one is the Raft leader. Every write is proposed through Raft and committed by a quorum. A node with 10,000 ranges runs 10,000 Raft state machines, a non-obvious CPU multiplier.

04 · node liveness · LIVENESS
Each node renews a liveness record on a short heartbeat. Miss the expiry (because of a GC pause, disk stall, or CPU starvation) and the cluster declares it dead, redistributing its leases. Flapping liveness is worse than a clean failure: it creates oscillating availability.

05 · Hybrid Logical Clocks · CLOCK
HLC combines wall-clock time with a logical counter and enforces a maximum offset (default 500 ms). Skew within the window causes <code>readwithinuncertainty</code> restarts; skew past 80% of max-offset makes a node self-terminate. Shared NTP failure can drift a quorum at once.

06 · admission control · ADMISSION
An internal flow-control system queues SQL, KV, and storage-write work to prevent overload. The <code>store-write</code> queue is tied directly to LSM L0 health: it begins shaping regular traffic at 5 sublevels. Sustained queuing means zero burst headroom.

07 · Pebble / LSM storage · STORAGE
Writes hit an in-memory memtable, flush to L0 SSTables, and compact down through L6. When compaction falls behind ingestion, L0 sublevels grow, read amplification rises nonlinearly, and Pebble eventually stalls writes, the single most common performance cliff.

08 · disk (WAL + compaction) · DISK
Local NVMe, EBS, or a PD volume holds the WAL, SSTables, and snapshots. WAL <code>fsync</code> latency sits on the critical path of every write; a detected disk stall makes the node self-terminate. Free space below ~15% can starve compaction and trigger a death spiral.

Why this matters: a latency spike can come from LSM read amplification, lock contention, a hot range, a DistSQL shuffle, a clock-skew uncertainty restart, an admission-control queue, or a saturated disk. The symptom is the same — CockroachDB is slow — but each layer has a different signal and a different fix.

The failures you'll actually see

Most CockroachDB incidents fall into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The LSM compaction death spiral

Write rate exceeds disk compaction throughput. L0 SSTables accumulate, storage_l0_sublevels climbs past 10 then 20+, and read amplification rises — which makes compaction itself slower, a positive feedback loop. Eventually Pebble stalls writes, the node can't service its Raft log, loses leases, and appears partially unavailable. If several nodes hit this at once, the cluster goes down.

  • storage_l0_sublevels rising past 20 and not decreasing
  • storage_write_stalls incrementing (rate above 1/second)
  • KV write latency climbing from milliseconds to seconds
  • admission store-write queue deep, disk I/O pinned at 100%
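
The sublevel thresholds above can be sketched as a small classifier. This is only an illustration: the function name `classify_l0` and the idea of passing a short window of samples are assumptions, and collecting storage_l0_sublevels is left to whatever monitoring stack you already run.

```python
from typing import Sequence

WARN_SUBLEVELS = 10      # from the text: sublevels climbing past 10
CRITICAL_SUBLEVELS = 20  # from the text: past 20 and not decreasing


def classify_l0(samples: Sequence[float]) -> str:
    """Classify recent storage_l0_sublevels samples (oldest first) for one store."""
    if not samples:
        return "unknown"
    latest = samples[-1]
    # "Not decreasing" is approximated by comparing the newest sample
    # with the oldest one in the window (an assumption of this sketch).
    not_decreasing = len(samples) < 2 or latest >= samples[0]
    if latest > CRITICAL_SUBLEVELS and not_decreasing:
        return "critical"
    if latest > WARN_SUBLEVELS:
        return "warning"
    return "ok"
```

A store that is above 20 but clearly draining is reported as "warning" rather than "critical", which matches the "and not decreasing" qualifier in the signal list.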
CRITICAL

Memory pressure to GC thrashing to liveness loss

The Go heap grows from large queries or misconfigured memory budgets. GC runs more often and longer. During a pause the node can't process Raft heartbeats; if a pause exceeds the liveness heartbeat interval, the node loses liveness and its leases redistribute. It recovers, regains leases, and the cycle repeats — oscillating availability that's hard to pin down.

  • Go GC pause durations above 500 ms, GC CPU above 15%
  • node liveness flapping in lockstep with GC pauses
  • lease transfers spiking each time liveness drops
  • sys_rss approaching the cgroup limit
IMMINENT

The clock-skew crisis

NTP fails or a VM drifts. First the uncertainty interval widens, so readwithinuncertainty restarts climb and tail latency rises. If drift passes 80% of --max-offset (over 400 ms by default), the node logs "clock synchronization error: this node is more than 500ms away from at least half of the known nodes" and self-terminates, then crash-loops until the clock is fixed. A shared NTP source can drift a quorum of nodes at once.

  • clock_offset_meannanos rising toward 400 ms on one or more nodes
  • readwithinuncertainty restart rate climbing (near-diagnostic)
  • a node self-terminating, then failing to rejoin
  • multiple nodes in the same NTP domain drifting together
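
The arithmetic behind the 400 ms figure is 80% of the 500 ms default maximum offset. A minimal sketch of that calculation follows; the function names are hypothetical, and the 250 ms ticket threshold is the proactive value this guide suggests in the operating-mistakes section, not a CockroachDB setting.

```python
MAX_OFFSET_NS = 500_000_000   # default --max-offset of 500 ms, per the text
SELF_TERMINATE_PERCENT = 80   # a node self-terminates past 80% of max offset
TICKET_NS = 250_000_000       # suggested ticket threshold (250 ms), per this guide


def self_terminate_limit_ns(max_offset_ns: int = MAX_OFFSET_NS) -> int:
    """Offset at which a node is expected to self-terminate."""
    return max_offset_ns * SELF_TERMINATE_PERCENT // 100


def clock_offset_status(offset_ns: float, max_offset_ns: int = MAX_OFFSET_NS) -> str:
    """Classify one node's clock_offset_meannanos reading."""
    magnitude = abs(offset_ns)  # drift in either direction counts
    if magnitude >= self_terminate_limit_ns(max_offset_ns):
        return "self-termination risk"
    if magnitude >= TICKET_NS:
        return "ticket"
    return "ok"


def headroom_ms(offset_ns: float, max_offset_ns: int = MAX_OFFSET_NS) -> float:
    """Milliseconds of drift left before the self-termination limit."""
    return (self_terminate_limit_ns(max_offset_ns) - abs(offset_ns)) / 1_000_000
```

Integer arithmetic is used for the limit so the default works out to exactly 400,000,000 ns.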
ACTIVE

The transaction contention storm

Infrastructure is healthy — good liveness, zero unavailable ranges — but the workload creates serialized hot paths. Transactions collide on the same keys, retry with RETRY_WRITE_TOO_OLD, leave intents others must resolve, and the cluster spends its time waiting and retrying rather than doing work. Under load it becomes a positive feedback loop.

  • txn_restarts rising, dominated by writetooold
  • intentcount and intentbytes growing
  • SQL P99 latency rising while CPU stays moderate
  • contention isolated to specific tables or indexes
CRITICAL

Lost quorum and unavailable ranges

Simultaneous node failures, a partition bisecting a replica group, or a stuck Raft group leave ranges with no leaseholder or no quorum. ranges_unavailable goes nonzero and clients see "replica unavailable" errors for the affected keyspace. If the unavailable ranges back system metadata (meta, liveness, jobs), the impact is cluster-wide even though the count is small.

  • ranges_unavailable nonzero, sustained beyond brief lease transfers
  • replica unavailable and context deadline exceeded errors to clients
  • node liveness changes preceding the unavailability
  • ranges_underreplicated elevated and not healing
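
Quorum here is a strict majority of a range's replicas. The sketch below works through that arithmetic for common replication factors; the helper names are illustrative and not part of any CockroachDB API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("a range needs at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still survives."""
    return replicas - quorum_size(replicas)


def has_quorum(live_replicas: int, replicas: int) -> bool:
    """True if the live replicas can still commit writes through Raft."""
    return live_replicas >= quorum_size(replicas)
```

With the default replication factor of 3, quorum is 2 and a range survives one lost replica; losing two at once leaves it unavailable. A factor of 5 raises that to two tolerated failures.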
IMMINENT

The disk-full death spiral

A store runs low on space from data growth, MVCC garbage, or a protected timestamp that blocks GC. Below ~15% free, compaction can't stage its output and starts failing, which drives L0 growth and write stalls. Deleting data does not help immediately: tombstones only clear through compaction, which is exactly what's now broken. The store reports "store is full".

  • capacity_available below 10% of total and still falling
  • MVCC garbage bytes diverging from live bytes
  • compaction failing or stalled, L0 sublevels climbing
  • protected timestamp records present with no active backup or CDC
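
"Still falling" can be turned into a rough time-to-threshold estimate. The sketch below uses a straight line between the oldest and newest samples, which is a deliberate simplification; the function name and the (timestamp, bytes) sample shape are assumptions, and the threshold would typically be 15% of the store's total capacity.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, capacity_available in bytes)


def seconds_until_threshold(samples: Sequence[Sample], threshold_bytes: float) -> Optional[float]:
    """Estimate seconds until free space reaches threshold_bytes.

    Returns None when there is not enough data or free space is not shrinking.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    slope = (b1 - b0) / (t1 - t0)  # bytes per second, negative when space shrinks
    if slope >= 0:
        return None
    if b1 <= threshold_bytes:
        return 0.0  # already at or below the threshold
    return (b1 - threshold_bytes) / -slope
```

A linear fit ignores bursts and compaction cycles, so treat the result as an order-of-magnitude warning rather than a forecast.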

CockroachDB monitoring maturity levels

CockroachDB observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your cluster matters. Most production clusters should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the cluster still serving? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev clusters and low-stakes workloads.

  • Node liveness: Does the cluster consider every node alive and renewing its heartbeat?
  • ranges_unavailable: Is any part of the keyspace unable to serve reads or writes?
  • Disk space per store: Is any store below 20% free (capacity_available)?
  • SELECT 1 synthetic probe: Can a client actually connect and execute over pgwire?
  • Certificate expiration: Will mutual TLS break with no grace period?
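
Of the survival signals, certificate expiration is the easiest to compute ahead of time. The sketch below assumes you already have the certificate's notAfter string (for example from `ssl.SSLSocket.getpeercert()["notAfter"]`); the function names and the 30-day warning window are illustrative choices, not recommendations from CockroachDB.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter time; negative if already expired."""
    # cert_time_to_seconds parses strings such as "Jan  5 09:34:43 2018 GMT".
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: float = 30.0, now: Optional[float] = None) -> bool:
    """True when the certificate expires within warn_days (assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days
```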

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production clusters should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: storage debt, contention, clock skew, replication risk, connection pressure.

  • SQL statement latency (P50/P99): Per node, ideally per fingerprint, because a new slow query barely moves P99.
  • Transaction restart rate by cause: writetooold (schema), readwithinuncertainty (clock), txnpush (app).
  • SQL error rate by code class: XX000 is a genuine fault; 40001, 53200, 08006 mean different things.
  • storage_l0_sublevels per store: The earliest predictor of write stalls, with 10–30 min of warning.
  • clock_offset_meannanos: How close is any node to the self-termination threshold?
  • Under-replicated range count: Is the cluster's replication safety margin holding?
  • round_trip_latency between nodes: Is inter-node RPC health slowing Raft and DistSQL?
  • CPU and RSS per node: Headroom to absorb the loss of one node, per node not aggregate.
  • WAL fsync latency per store: The most direct write-path health signal.
  • Admission control queue depth: Is internal flow control throttling, i.e. are you at capacity?

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up: L0 creeping up a sublevel a week, MVCC garbage diverging from live bytes, a protected timestamp aging for days, intents accumulating, a changefeed falling behind. None of these will page you on day one. They become incidents that page someone on day thirty.

  • LSM read amplification per store: rocksdb_read_amplification above 25 means compaction debt.
  • Pebble write stall count: Any stall during normal workload is abnormal.
  • Block cache hit ratio: Has the working set outgrown --cache?
  • MVCC garbage bytes: Is GC keeping pace, or is dead data accumulating silently?
  • Intent count and bytes: Are abandoned transactions leaving unresolved work?
  • Protected timestamp count / age: Is a stalled job blocking GC and filling the disk?
  • Go GC pause duration / CPU: Is GC pressure trending toward a liveness threat?
  • Range count per node: Is Raft ticking overhead growing as a scaling dimension?
  • Raft snapshot and lease transfer rate: Is the cluster churning or healing cleanly?
  • Changefeed lag (if using CDC): Are consumers behind, and is GC at risk downstream?

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them: Raft proposal drops, per-range request distribution, intent resolution throughput, closed timestamp lag, queue processor errors. Most teams never need every signal here. Add the ones your incident history says you do.

  • Raft proposal drop rate: Dropped proposals are silently retried writes with latency cost.
  • Per-range request distribution: Which range is hot, i.e. over 10x the average QPS?
  • Intent resolution throughput: Is cleanup keeping pace during an intent cascade?
  • Closed timestamp lag: How fresh can follower reads be?
  • Queue processor error counts: Split, merge, replicate, and GC queue failures.
  • Disk stall detection metrics: storage_disk_stalled and storage_disk_slow before utilization reacts.
  • Admission token exhaustion: Are tokens running out under sustained overload?
  • SQL plan cache hit rate: Are plan regressions or churn hurting specific endpoints?

Operating mistakes worth avoiding

The traps CockroachDB teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

Ignoring L0 sublevel count until write stalls hit

Teams watch disk utilization and IOPS but not LSM tree health. <code>storage_l0_sublevels</code> gives 10–30 minutes of warning before Pebble stalls writes, and it's almost never instrumented. Alert when sublevels climb past 10, and treat 20+ and rising as an emergency.

Not monitoring clock offset proactively

NTP is set-and-forget for most teams, so the first sign of drift is a node self-terminating at 3 a.m. Meanwhile <code>readwithinuncertainty</code> restarts have been quietly inflating tail latency for days. Watch <code>clock_offset_meannanos</code> and ticket at 250 ms, well before the 400 ms self-termination threshold.

Alarming on total retry rate without breaking down the cause

<code>writetooold</code> means contention (a schema problem), <code>readwithinuncertainty</code> means clock skew (an infra problem), and <code>txnpush</code> means transaction conflicts (an application problem). Treating <code>txn_restarts</code> as one number wastes diagnosis time — each cause needs a different response.
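
One way to act on this is to route on the dominant cause rather than the total. The sketch below is a minimal illustration: the cause-to-owner mapping restates the paragraph above, while the function name and the shape of the input (restart counts per cause over some window) are assumptions.

```python
from typing import Mapping, Optional, Tuple

# Mapping taken from the text above; keys are restart-cause labels.
RESTART_CAUSES = {
    "writetooold": "contention (schema problem)",
    "readwithinuncertainty": "clock skew (infrastructure problem)",
    "txnpush": "transaction conflicts (application problem)",
}


def dominant_restart_cause(counts: Mapping[str, float]) -> Optional[Tuple[str, str, float]]:
    """Return (cause, diagnosis, share_of_total) for the largest restart cause."""
    total = sum(counts.values())
    if total <= 0:
        return None  # no restarts in this window
    cause = max(counts, key=lambda name: counts[name])
    diagnosis = RESTART_CAUSES.get(cause, "unclassified")
    return cause, diagnosis, counts[cause] / total
```

For example, a window with 80 writetooold, 15 readwithinuncertainty, and 5 txnpush restarts points at schema contention with an 80% share, which is a different conversation from the same total dominated by clock skew.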

Watching only aggregate latency instead of per fingerprint

A new slow query among fast ones barely moves cluster P99 but kills the endpoint that runs it. Per-statement-fingerprint tracking catches plan regressions and missing indexes that aggregate <code>sql_service_latency</code> hides.

Not monitoring MVCC garbage and protected timestamps

Data gets deleted but nobody checks that GC actually runs. A stalled changefeed or hung backup holds a protected timestamp that blocks GC entirely, and the disk fills with several times the live data in tombstones — completely silent until the store is full.

Using TCP health checks instead of /health?ready=1

A plain TCP check keeps routing traffic to nodes that are draining, write-stalled, or GC-thrashing. CockroachDB exposes <code>/health?ready=1</code>, which returns 503 when a node is impaired. Most deployments never wire their load balancer to it.
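
A readiness probe against this endpoint can be sketched with the standard library alone. The function name is hypothetical and the base URL is whatever address your node's HTTP endpoint listens on; on a secure cluster you would also need to supply TLS trust settings, which this sketch leaves out.

```python
import urllib.error
import urllib.request


def node_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if /health?ready=1 answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health?ready=1"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Non-2xx answers such as 503 mean the node should not take traffic.
        return False
    except OSError:
        # Connection refused, timeouts, and TLS failures also count as not ready.
        return False
```

Unlike a TCP check, this treats a node that accepts connections but answers 503 as out of rotation.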

Trusting cluster averages over per-node signals

One hot leaseholder or one overloaded region hides under healthy global metrics. A single node can be melting at 95% CPU while the cluster average looks fine. Always alert per node, never on the aggregate alone.
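
A small numeric illustration of how the average hides a hot node. The node names and CPU values are made up for the example, and the 90% threshold is an assumed alert level rather than a CockroachDB default.

```python
from statistics import mean
from typing import List, Mapping

HOT_CPU = 0.90  # assumed per-node alert threshold


def hot_nodes(cpu_by_node: Mapping[str, float], threshold: float = HOT_CPU) -> List[str]:
    """Return the nodes whose CPU utilization is at or above the threshold."""
    return sorted(node for node, cpu in cpu_by_node.items() if cpu >= threshold)


# Hypothetical four-node cluster: one node melting, three mostly idle.
cpu = {"n1": 0.95, "n2": 0.35, "n3": 0.30, "n4": 0.40}
cluster_average = mean(cpu.values())  # roughly 0.50, which looks healthy
overloaded = hot_nodes(cpu)           # only n1 crosses the threshold
```

An alert on `cluster_average` would stay quiet here, while the per-node check flags n1 immediately.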

Treating recovery activity as automatically good

Snapshot and rebalance storms during healing compete with foreground traffic and can degrade the cluster further. Confirm that under-replication is actually decreasing, and throttle background work if recovery I/O is starving live queries.

CockroachDB runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

WHERE TO GO NEXT

Setting up CockroachDB monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.