MONGODB · OPERATIONS PLAYBOOK

Running MongoDB in production, without the 3 a.m. surprises

A WiredTiger cache that hates giving RAM back, 60-second checkpoints, a fixed-size oplog, semaphore-based admission tickets, and a replica set that elects a new primary once heartbeats go unanswered past the election timeout. This playbook covers the mental model of how MongoDB actually behaves under load, where it tends to break, what to monitor as your operation matures, and the runbooks for the incidents you'll see.

MongoDB is famously easy to start and famously unforgiving once the working set, the write rate, or the secondary count grows past what the defaults assumed.

The defaults work. Until the WiredTiger cache fills, background eviction can't keep up, and application threads are forced to evict pages inline, adding latency to every operation at once. Until a write surge turns the oplog over faster than a secondary can consume it, the secondary falls off the window, and the log reads "too stale to catch up" on a node now stuck in RECOVERING. Until a checkpoint falls behind, the journal fills, and WiredTiger blocks every write with no error at all, just infinite latency. Until all 128 write tickets are held by operations waiting on slow disk, and every new query queues behind them. Until an election storm flips the primary back and forth and your writes are rejected with "not master".

These guides are written for engineers who already run MongoDB, not for people learning what a document is. The goal is to give you the mental model of how the server actually behaves under load, the failure patterns that keep recurring, the monitoring story that catches problems before they page anyone, and the runbooks you wish someone had handed you before your last incident.

How MongoDB actually runs in production

MongoDB is not just a document store. It is a replica set tailing an oplog, fronted by a WiredTiger storage engine with a managed cache, periodic checkpoints, a write-ahead journal, and a fixed pool of admission tickets. Most production failures live between these layers, not inside any one of them.

01 · drivers / connection pool · CLIENT
Application drivers, replica set peers, <code>mongos</code> routers, and monitoring tools. MongoDB uses one thread per connection (~1 MB stack each), so 10,000 connections is 10,000 threads. The total counts toward <code>maxIncomingConnections</code> and the file-descriptor limit; exceed either and new connections are refused.

02 · mongos / replica set routing · ROUTING
In a sharded cluster, stateless <code>mongos</code> routers read chunk maps from config servers and fan out queries. In a replica set, the driver routes writes to the PRIMARY. A stale topology after an election sends writes to a former primary, which answers <code>not master</code>.

03 · admission control (tickets) · TICKETS
Every operation touching storage must acquire a read or write ticket: 128 of each by default (≤6.x), dynamically tuned in 7.0+, surfaced as <code>queues.execution</code> in 8.0. When tickets run out, operations queue. Ticket exhaustion is the most under-monitored cause of MongoDB latency crises.

04 · WiredTiger cache · CACHE
A managed buffer pool, by default the larger of 50% of (RAM minus 1 GB) or 256 MB, holding documents and indexes uncompressed. Background eviction starts at 80% fill; dirty pages evict aggressively past 20% dirty ratio. When background eviction falls behind, application threads evict inline and latency spikes 10–100x.

05 · checkpoints + journal · PERSIST
Every 60 seconds a checkpoint flushes dirty pages to disk; the journal (write-ahead log) syncs every ~100 ms for crash recovery. If a checkpoint takes longer than the interval, dirty data accumulates and the journal fills, and WiredTiger freezes all writes until it drains.

06 · oplog + replication · REPLICA
The PRIMARY records every write to a capped <code>local.oplog.rs</code> collection. Secondaries tail it and apply entries. The oplog window is your safety margin: if a secondary falls behind it, it cannot catch up and needs a multi-hour full resync. Flow control (4.2+) throttles the primary to protect the window.

07 · elections + heartbeats · ELECT
Members heartbeat every 2 seconds. Miss them past <code>electionTimeoutMillis</code> (default 10s) and an election runs, causing a 2–12 second write outage. A node that accepted writes which never replicated to a majority must <code>ROLLBACK</code> them at failover, writing the discarded documents to a rollback directory.

08 · OS memory + page cache · KERNEL
RSS should be roughly cache + ~1 MB per connection + ~1 GB overhead. The Linux OOM killer scores processes largely by memory use, so <code>mongod</code> is usually its first target. Transparent Huge Pages and swap both wreck latency; MongoDB should never swap.

09 · disk (checkpoints / journal / data) · DISK
Local NVMe, EBS, or a container volume holding data files, journal, and oplog. Journal sync latency and checkpoint duration set the floor on write durability. On cloud disks, depleted burst credits spike I/O latency 10–100x and stall everything above.

Why this matters: a latency spike can come from cache eviction, a checkpoint stall, a journal-sync delay, ticket exhaustion, a collection scan from a dropped index, replication lag, or a saturated disk. The symptom is the same — MongoDB is slow — but each layer has a different signal and a different fix.
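
The sizing rules in the cache and OS memory layers can be turned into a quick estimate. This is a minimal sketch, not MongoDB's own code: the function names are hypothetical, and the per-connection and overhead figures are the rough rules of thumb quoted above rather than measured values.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def default_wiredtiger_cache_bytes(ram_bytes: int) -> int:
    """Default cache size: the larger of 50% of (RAM - 1 GiB) or 256 MiB."""
    return max((ram_bytes - GIB) // 2, 256 * MIB)

def expected_rss_bytes(cache_bytes: int, connections: int,
                       per_connection_bytes: int = MIB,
                       overhead_bytes: int = GIB) -> int:
    """Rule of thumb: cache + ~1 MB per connection + ~1 GB overhead (assumed figures)."""
    return cache_bytes + connections * per_connection_bytes + overhead_bytes

ram = 64 * GIB
cache = default_wiredtiger_cache_bytes(ram)
print(f"{cache / GIB:.1f} GiB cache")  # 31.5 GiB cache
print(f"{expected_rss_bytes(cache, 10_000) / GIB:.1f} GiB expected RSS")
```

Comparing the estimate with the RSS you actually observe is a cheap way to spot a connection count or allocator overhead that has drifted from what the host was sized for.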

The failures you'll actually see

Most MongoDB incidents fall into a small set of recurring patterns. Recognise the shape, and triage gets dramatically faster.

CRITICAL

The cache pressure cascade

Write volume exceeds the rate WiredTiger can flush dirty pages. The cache fills, background eviction can't keep up, and application threads start evicting pages inline — adding latency to every operation. Tickets are held longer, new operations queue, application timeouts trigger reconnections, and the reconnections create more threads competing for the same tickets. A self-reinforcing degradation spiral that affects reads and writes alike.

  • tracked dirty bytes / maximum bytes configured above 15%
  • pages evicted by application threads incrementing at a sustained rate
  • globalLock.currentQueue readers and writers growing
  • opLatencies rising on both reads and writes with connection count climbing
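
The first two signals can be derived from two samples of the <code>wiredTiger.cache</code> section of <code>serverStatus</code>. The sketch below assumes the samples are already plain dictionaries (for example fetched through a driver) and that the statistic names match the ones quoted above; the function names and the sample values are made up for illustration.

```python
GIB = 1024 ** 3

def cache_ratios(cache: dict) -> tuple[float, float]:
    """Return (fill_ratio, dirty_ratio) relative to the configured cache size."""
    limit = cache["maximum bytes configured"]
    fill = cache["bytes currently in the cache"] / limit
    dirty = cache["tracked dirty bytes in the cache"] / limit
    return fill, dirty

def app_eviction_rate(prev: dict, curr: dict, interval_s: float) -> float:
    """Pages per second evicted inline by application threads between two samples."""
    key = "pages evicted by application threads"
    return (curr[key] - prev[key]) / interval_s

def cache_pressure_warning(prev: dict, curr: dict, interval_s: float,
                           dirty_limit: float = 0.15) -> bool:
    """Warn when dirty ratio is above the limit and inline eviction is happening."""
    _, dirty = cache_ratios(curr)
    return dirty > dirty_limit and app_eviction_rate(prev, curr, interval_s) > 0

# Illustrative samples taken 60 seconds apart.
prev = {"maximum bytes configured": 10 * GIB, "bytes currently in the cache": 8 * GIB,
        "tracked dirty bytes in the cache": 1 * GIB, "pages evicted by application threads": 100}
curr = {"maximum bytes configured": 10 * GIB, "bytes currently in the cache": 8 * GIB,
        "tracked dirty bytes in the cache": 2 * GIB, "pages evicted by application threads": 400}
print(cache_pressure_warning(prev, curr, 60))  # True
```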
IMMINENT

The oplog window collapse

A write surge turns the oplog over faster than a secondary can consume it. The window shrinks, the secondary has less and less time to catch up, and once its position wraps past the oldest oplog entry it falls off entirely, enters RECOVERING, and needs a multi-hour full initial sync. The cluster loses a member, the survivors absorb more load, and the risk of a second secondary falling off rises.

  • oplog window shrinking from hours toward minutes (rs.printReplicationInfo)
  • replication lag increasing linearly, not stabilizing
  • secondary metrics.repl.apply rate below the primary's write rate
  • too stale to catch up in the log; a member stuck in RECOVERING
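
To make the window and lag signals comparable, it helps to express lag as a fraction of the window. This is a rough sketch under the assumption that you already have the timestamps of the oldest and newest oplog entries (the same information <code>rs.printReplicationInfo</code> reports); the function names are hypothetical.

```python
from datetime import datetime, timedelta

def oplog_window_hours(oldest_entry: datetime, newest_entry: datetime) -> float:
    """Time span covered by the oplog, in hours."""
    return (newest_entry - oldest_entry).total_seconds() / 3600.0

def lag_fraction(lag_seconds: float, window_hours: float) -> float:
    """Share of the oplog window a secondary has already used up."""
    return lag_seconds / (window_hours * 3600.0)

newest = datetime(2024, 1, 1, 12, 0, 0)
oldest = newest - timedelta(hours=6)
window = oplog_window_hours(oldest, newest)
print(window)                      # 6.0
print(lag_fraction(5400, window))  # 0.25
```

A secondary at 0.25 has burned a quarter of its safety margin; a value approaching 1.0 means a full resync is close.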
CRITICAL

The connection storm spiral

A trigger — an election, a deploy, a network blip, a DNS failure — causes connection pools across every application instance to reconnect at once. Each new connection spawns a ~1 MB thread, RSS spikes, tickets contend, existing operations slow, more timeouts fire, and more reconnections follow. MongoDB eventually refuses new connections at maxIncomingConnections or runs out of file descriptors entirely.

  • connections.totalCreated rate spiking (churn) with current climbing fast
  • memory RSS spiking in step with the connection count
  • connection refused / error accepting new connection in the log
  • current / (current + available) above 80%
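
Both the utilization and the churn signal come from the <code>connections</code> section of <code>serverStatus</code>. A minimal sketch, assuming two samples taken a known number of seconds apart and passed in as plain dictionaries; the function names and sample numbers are illustrative only.

```python
def connection_utilization(conn: dict) -> float:
    """current / (current + available), as described in the signal list."""
    return conn["current"] / (conn["current"] + conn["available"])

def connection_churn_per_second(prev: dict, curr: dict, interval_s: float) -> float:
    """New connections opened per second between two samples."""
    return (curr["totalCreated"] - prev["totalCreated"]) / interval_s

prev = {"current": 850, "available": 150, "totalCreated": 10_000}
curr = {"current": 900, "available": 100, "totalCreated": 16_000}

print(connection_utilization(curr))                 # 0.9
print(connection_churn_per_second(prev, curr, 60))  # 100.0
```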
IMMINENT

The checkpoint stall write freeze

The checkpoint process falls critically behind. Dirty data accumulates, the journal reaches its size limit, and WiredTiger blocks every new write until the checkpoint drains enough to recycle the journal. Writes simply stop — no error, just infinite latency — while reads may still serve from cache. When the checkpoint finally completes, all queued writes execute at once and can trigger the next stall.

  • transaction checkpoint most recent time (msecs) exceeding the 60s interval and growing
  • WiredTiger cache dirty ratio high and rising
  • journal sync latency spiking then flatlining
  • opLatencies.writes climbing toward infinity while reads continue
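
One way to watch the first signal is to express the most recent checkpoint time as headroom against the 60-second interval. The sketch assumes the statistic name quoted above is present in the <code>wiredTiger.transaction</code> section of your server version; the function name is hypothetical.

```python
CHECKPOINT_INTERVAL_MS = 60_000  # default interval described in the text

def checkpoint_headroom(transaction_stats: dict,
                        interval_ms: int = CHECKPOINT_INTERVAL_MS) -> float:
    """Fraction of the interval left over after the last checkpoint.

    Values near 0 mean no margin; negative values mean the checkpoint
    took longer than the interval and dirty data is accumulating.
    """
    recent_ms = transaction_stats["transaction checkpoint most recent time (msecs)"]
    return 1.0 - recent_ms / interval_ms

healthy = {"transaction checkpoint most recent time (msecs)": 12_000}
stalled = {"transaction checkpoint most recent time (msecs)": 75_000}
print(checkpoint_headroom(healthy))      # 0.8
print(checkpoint_headroom(stalled) < 0)  # True
```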
WATCHFUL

The silent index regression

An index is accidentally dropped, a background build fails, or the planner picks a worse plan. Queries silently switch to collection scans. Latency rises gradually, proportional to collection growth, until scan I/O reaches a tipping point and overwhelms the storage subsystem. Nothing errors — the same queries that were fast last month are now the slowest thing on the box.

  • slow query log showing COLLSCAN on collections over 10,000 documents
  • metrics.queryExecutor.scanned / scannedObjects rate increasing
  • docsExamined / docsReturned ratio climbing for specific queries
  • $indexStats showing a previously-busy index with zero recent ops
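
The scan-ratio signal can be checked mechanically once slow query entries are parsed. In this sketch each entry is a dictionary with <code>planSummary</code>, <code>docsExamined</code>, and <code>nreturned</code> keys, which is an assumption about how your log parser shapes the data; the threshold of 100 examined documents per returned document is an arbitrary illustrative value, not a MongoDB default.

```python
def examined_per_returned(docs_examined: int, n_returned: int) -> float:
    """Documents examined per document returned (guards against division by zero)."""
    return docs_examined / max(n_returned, 1)

def suspicious_queries(entries: list[dict], max_ratio: float = 100.0) -> list[dict]:
    """Entries that ran a collection scan or examined far more than they returned."""
    return [
        e for e in entries
        if e.get("planSummary") == "COLLSCAN"
        or examined_per_returned(e["docsExamined"], e["nreturned"]) > max_ratio
    ]

# Hypothetical parsed slow-query entries.
entries = [
    {"ns": "shop.orders", "planSummary": "COLLSCAN", "docsExamined": 250_000, "nreturned": 12},
    {"ns": "shop.users", "planSummary": "IXSCAN { email: 1 }", "docsExamined": 1, "nreturned": 1},
    {"ns": "shop.events", "planSummary": "IXSCAN { ts: 1 }", "docsExamined": 90_000, "nreturned": 30},
]
print([e["ns"] for e in suspicious_queries(entries)])  # ['shop.orders', 'shop.events']
```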
ACTIVE

The election storm

The primary repeatedly steps down or loses elections because of resource exhaustion delaying heartbeats, network instability, or a priority-takeover loop. Each election is a 2–12 second write outage, members flip between PRIMARY and SECONDARY, and writes sent to a node that has stepped down are rejected with "not master". Worst case, a node that accepted writes before stepping down must roll them back, losing data.

  • rs.status() showing different members claiming PRIMARY over time
  • Starting an election / Stepping down repeating in the log
  • more than 2 elections within 10 minutes outside maintenance
  • intermittent write failures and connection resets during transitions
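
The "more than 2 elections within 10 minutes" rule is a sliding-window count. The sketch below assumes you have already extracted election timestamps from the log or from your monitoring system; the function name is hypothetical and the defaults simply mirror the rule above.

```python
from datetime import datetime, timedelta

def election_storm(election_times: list[datetime],
                   window: timedelta = timedelta(minutes=10),
                   max_elections: int = 2) -> bool:
    """True if more than max_elections fall inside any window-sized span."""
    times = sorted(election_times)
    start = 0
    for end, current in enumerate(times):
        # Slide the left edge forward until the span fits inside the window.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > max_elections:
            return True
    return False

base = datetime(2024, 1, 1, 3, 0, 0)
burst = [base, base + timedelta(minutes=2), base + timedelta(minutes=4)]
spread = [base, base + timedelta(minutes=20), base + timedelta(minutes=40)]
print(election_storm(burst))   # True
print(election_storm(spread))  # False
```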

MongoDB monitoring maturity levels

MongoDB observability works in four practical levels. Each is a complete operation, not a stepping stone. Pick the level that matches how much your cluster matters. Most production replica sets should land at the second level.

Level 1: Survival

Know that something is wrong

Survival monitoring is the floor. With these signals you can answer one question: is the cluster still functioning? You will not learn what broke, but you will learn that something broke before users do. Survival is enough for dev environments and non-critical workloads.

  • Process liveness (ping): Does mongod answer db.adminCommand({ping:1}) within a couple of seconds?
  • Replica set member state: Is there exactly one PRIMARY, and is every member PRIMARY or SECONDARY?
  • Replication lag: How far behind is each secondary, in seconds and as a fraction of the oplog window?
  • Oplog window (hours of coverage): How long can a secondary be offline before it needs a full resync?
  • Disk space on data + journal: Are you about to fill the disk and freeze writes?
  • Connection count vs available: Are you approaching maxIncomingConnections?
  • Slow query log enabled (slowms:100): Will a slow query actually be recorded when it happens?
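
The member-state check is easy to automate against the document that rs.status() returns. This is a minimal sketch that works on an already-fetched status dictionary and relies only on the <code>members</code>, <code>name</code>, and <code>stateStr</code> fields; the function name is hypothetical. Note that it follows the rule above literally, so an arbiter (state ARBITER) would be reported and may need to be allowed explicitly.

```python
def replica_set_problems(status: dict) -> list[str]:
    """Return human-readable problems; an empty list means the set looks healthy."""
    members = status["members"]
    problems = []
    primaries = sum(1 for m in members if m["stateStr"] == "PRIMARY")
    if primaries != 1:
        problems.append(f"expected exactly 1 PRIMARY, found {primaries}")
    for m in members:
        if m["stateStr"] not in ("PRIMARY", "SECONDARY"):
            problems.append(f"{m['name']} is {m['stateStr']}")
    return problems

# Hypothetical status document with one member stuck in RECOVERING.
status = {"members": [
    {"name": "db1:27017", "stateStr": "PRIMARY"},
    {"name": "db2:27017", "stateStr": "SECONDARY"},
    {"name": "db3:27017", "stateStr": "RECOVERING"},
]}
print(replica_set_problems(status))  # ['db3:27017 is RECOVERING']
```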

Level 2: Operational

Diagnose most incidents on your own

Operational monitoring is what most production clusters should target. Survival tells you something is wrong; operational tells you what. With this coverage your team can usually diagnose an incident on its own: cache pressure, checkpoint stalls, replication lag, election churn, connection pressure.

  • WiredTiger cache fill AND dirty ratio: Fill near 80% with dirty above 15% is the cache-pressure warning.
  • opcounters (insert/query/update/delete): A sudden drop means something is blocking operations.
  • opLatencies (reads / writes): Average and approximate p99, expressed as a multiple of baseline.
  • globalLock.currentQueue depth: Are readers or writers queuing for the storage engine?
  • Oplog window trend: Is your safety margin shrinking month over month?
  • Memory RSS vs expected: Is RSS approaching the system limit (OOM-kill risk)?
  • Election events: Is the primary stepping down more than it should?
  • Page faults rate: Does the working set still fit in memory after warmup?
  • OS disk I/O latency: Is the storage device keeping up with checkpoints and journal?
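
The opLatencies counters are cumulative (total latency in microseconds and total operation count), so an average for a period has to be computed from the difference between two samples. A small sketch, assuming the two samples of <code>opLatencies.reads</code> or <code>opLatencies.writes</code> are plain dictionaries; the function names and the baseline value are illustrative.

```python
from typing import Optional

def average_latency_ms(prev: dict, curr: dict) -> Optional[float]:
    """Average latency per operation between two samples, in milliseconds."""
    ops = curr["ops"] - prev["ops"]
    if ops <= 0:
        return None  # no operations in the interval, or counters were reset
    return (curr["latency"] - prev["latency"]) / ops / 1000.0

def multiple_of_baseline(latency_ms: float, baseline_ms: float) -> float:
    """How many times slower than the known-good baseline."""
    return latency_ms / baseline_ms

prev = {"latency": 1_000_000, "ops": 1_000}
curr = {"latency": 4_000_000, "ops": 2_000}
avg = average_latency_ms(prev, curr)
print(avg)                             # 3.0
print(multiple_of_baseline(avg, 1.5))  # 2.0
```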

Level 3: Mature

Catch problems before they become incidents

Mature monitoring catches problems before they wake anyone up. Tickets depleting during peaks, checkpoint duration creeping toward the interval, journal sync latency drifting, a noTimeout cursor pinning a snapshot, an index quietly going unused. None of these will page you on day one. They become incidents that page you on day thirty.

  • WiredTiger ticket utilization (read + write): Available tickets below 25% means operations are about to queue.
  • Application-thread eviction rate: Any sustained nonzero rate means users are feeling the cache.
  • Checkpoint duration: A 55s checkpoint every 60s looks stable but has zero margin.
  • Journal sync latency: A storage-health signal that warns 30–60s before app latency.
  • scanned / scannedObjects vs returned: Rising ratios mean inefficient or missing indexes.
  • Connection churn (totalCreated delta): Stable count can still hide expensive create/destroy churn.
  • Cursor counts (especially noTimeout): noTimeout cursors pin snapshots and cause silent cache pressure.
  • currentOp longest-running operation: One runaway query holding a ticket makes the whole server look slow.
  • Flow control status (isLagged): Is the primary throttling itself to protect the oplog window?
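
The ticket check reduces to the share of tickets still available per type. The sketch below assumes a dictionary shaped like the <code>wiredTiger.concurrentTransactions</code> section of <code>serverStatus</code> on versions up to 6.x, with <code>out</code>, <code>available</code>, and <code>totalTickets</code> per type; newer versions report the same idea elsewhere, so treat the field layout as an assumption. The function names are hypothetical.

```python
def ticket_availability(tickets: dict) -> dict[str, float]:
    """Fraction of tickets still available, per ticket type."""
    return {kind: t["available"] / t["totalTickets"] for kind, t in tickets.items()}

def low_ticket_types(tickets: dict, min_available: float = 0.25) -> list[str]:
    """Ticket types whose available share is below the alert threshold."""
    return sorted(kind for kind, share in ticket_availability(tickets).items()
                  if share < min_available)

# Illustrative sample: write tickets nearly exhausted.
tickets = {
    "read": {"out": 10, "available": 118, "totalTickets": 128},
    "write": {"out": 100, "available": 28, "totalTickets": 128},
}
print(low_ticket_types(tickets))  # ['write']
```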

Level 4: Expert

Reactive instrumentation after real incidents

Expert signals enter your stack the day after a specific incident proved you needed them. History-store activity, plan-cache evictions, jumbo-chunk growth, per-shard contention heat maps, tcmalloc fragmentation, oplog entry-size distribution. Most teams never need every signal here. Add the ones your incident history says you do.

  • Plan cache eviction events: Sudden plan changes that turn a fast query into a collection scan.
  • WiredTiger history store activity: Old-version retention pressure from long snapshots (replaced cache overflow in ~4.4).
  • Ticket utilization trended per-minute: A declining peak-time minimum forecasts the next ticket crisis.
  • Jumbo chunk count and growth: Chunks that can't split or migrate cause permanent shard imbalance.
  • Chunk migration latency / failures: moveChunk I/O pressure and range locks on busy shards.
  • Config server operation latency: Slow config servers stall splits and migrations cluster-wide.
  • tcmalloc fragmentation ratio: heap_size / current_allocated_bytes inflating RSS over time.
  • Oplog entry size distribution: Large multi-document transactions producing oversized oplog entries.
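
The tcmalloc fragmentation ratio named in the list is a single division. As a rough sketch, assuming the <code>tcmalloc.generic</code> section of <code>serverStatus</code> is available as a dictionary with <code>heap_size</code> and <code>current_allocated_bytes</code> (the function name and sample values are made up):

```python
GIB = 1024 ** 3

def tcmalloc_fragmentation_ratio(tcmalloc: dict) -> float:
    """heap_size / current_allocated_bytes; 1.0 means no allocator overhead."""
    generic = tcmalloc["generic"]
    return generic["heap_size"] / generic["current_allocated_bytes"]

sample = {"generic": {"heap_size": 12 * GIB, "current_allocated_bytes": 8 * GIB}}
print(tcmalloc_fragmentation_ratio(sample))  # 1.5
```

The absolute number matters less than its trend: a ratio that keeps rising while allocated bytes stay flat points at fragmentation rather than real data growth.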

Operating mistakes worth avoiding

The traps MongoDB teams keep falling into. Each has a clear, well-known fix. Most teams only learn it after an incident.

Not monitoring the WiredTiger dirty ratio

Teams watch cache fill percentage and ignore the dirty ratio — yet dirty is the stronger leading indicator. It reveals checkpoint-stall risk 10–30 minutes before any latency degradation. Alert on <code>tracked dirty bytes / maximum bytes configured</code> above 15%, not on fill alone (75–80% fill is normal and healthy).

Never watching ticket utilization

Ticket exhaustion is a top cause of MongoDB latency crises and is the single most under-monitored signal. Teams debug "slow queries" when the real cause is all 128 write tickets held by operations waiting on slow disk. Graph available read and write tickets; alert below 25%. Raising the ticket limit is almost never the right fix.

Sizing the oplog once and never trending it

Teams size the oplog at deployment and forget. As write volume grows organically, the window shrinks month by month, and the failure is discovered when a secondary needs maintenance and can't catch up. Trend the minimum window during peak writes; keep it above 2x your longest expected secondary downtime.
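
The 2x rule can be checked against trended data with one comparison. This sketch assumes you record the oplog window (in hours) during peak write periods; the function name and the sample figures are illustrative.

```python
def oplog_headroom_ok(peak_window_hours: list[float],
                      longest_downtime_hours: float,
                      factor: float = 2.0) -> bool:
    """True if the smallest observed peak-time window exceeds factor x downtime."""
    return min(peak_window_hours) > factor * longest_downtime_hours

# Windows observed during the busiest hours of the last three months.
observed = [30.0, 22.0, 18.0]
print(oplog_headroom_ok(observed, longest_downtime_hours=8))   # True  (18 > 16)
print(oplog_headroom_ok(observed, longest_downtime_hours=10))  # False (18 < 20)
```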

Trusting w:1 writes as durable

<code>w:1</code> means the primary acknowledged the write in memory — not that it reached disk or replicated. A crash or a rollback at failover loses those writes silently. Use <code>w:"majority"</code> for data you can't lose, and monitor <code>wtimeouts</code> so you know when durability isn't being met.

Leaving Transparent Huge Pages enabled and allowing swap

THP causes latency spikes and fragmentation; swap turns the process into a 1/1000th-speed zombie. Both are easy to miss. Confirm <code>cat /sys/kernel/mm/transparent_hugepage/enabled</code> shows <code>[never]</code>, set <code>vm.swappiness=1</code>, and protect <code>mongod</code> from the OOM killer with <code>oom_score_adj</code>.
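
The THP check is easy to script because the kernel marks the active mode with square brackets. A minimal sketch that parses the file's text; the function name is hypothetical, and reading the real file is left to the caller so the logic can be tested on any machine.

```python
def thp_disabled(enabled_file_text: str) -> bool:
    """True if the bracketed (active) THP mode is 'never'."""
    return "[never]" in enabled_file_text.split()

print(thp_disabled("always madvise [never]\n"))  # True
print(thp_disabled("[always] madvise never\n"))  # False

# On a Linux host the text would come from the path mentioned above, e.g.:
#   from pathlib import Path
#   text = Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text()
```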

Monitoring connection count but not churn

A count of 500/10,000 looks fine — but if those 500 connections are created and destroyed 100 times a minute, thread-creation overhead is devastating. The <code>totalCreated</code> delta is more informative than <code>current</code> in many failure modes. Fix the pool, don't just raise the ceiling.

Assuming secondaries are healthy because lag is zero

Teams watch the primary obsessively and assume zero lag means secondaries are fine. But a secondary can have degraded storage, memory pressure, or competing read traffic that won't surface as lag until a load spike pushes it over, taking out both the secondary and your redundancy at once. Verify each secondary's own resource health.

Monitoring a sharded cluster blind to per-shard skew

Aggregate dashboards hide a hot shard sitting at 90% I/O while the others idle at 20%. Poor shard keys and jumbo chunks cause skew the balancer can't fix. Compare latency, throughput, and resource use per shard, and watch jumbo chunk count — aggregate alarms will never catch it.
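
One simple way to surface skew is to compare each shard with the median of all shards. The sketch below is an illustration, not a standard MongoDB metric: the per-shard utilization values are assumed to come from your own monitoring, and the 2x factor is an arbitrary starting point.

```python
from statistics import median

def hot_shards(utilization: dict[str, float], factor: float = 2.0) -> list[str]:
    """Shards whose utilization exceeds factor x the median across shards."""
    typical = median(utilization.values())
    return sorted(name for name, value in utilization.items()
                  if value > factor * typical)

# Hypothetical I/O utilization per shard (0.0 to 1.0).
io_util = {"shard-a": 0.90, "shard-b": 0.20, "shard-c": 0.20}
print(hot_shards(io_util))  # ['shard-a']
```

The same comparison works for latency or throughput per shard; the point is that the check runs per shard instead of on the cluster-wide average.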

MongoDB runbooks in this section

Each guide is a focused runbook for one symptom or topic. Pick one when you have an incident, or use the categories to learn the area.

WHERE TO GO NEXT

Setting up MongoDB monitoring, or putting out a fire?

If you're starting from scratch, the monitoring checklist is the path of least regret. If you're mid-incident, jump straight to the symptom that matches what you're seeing.