MongoDB page faults high: working set exceeding memory after warmup

Hard page faults long after startup mean the active data set exceeds resident memory. On Linux, extra_info.page_faults counts major faults: the OS read data from disk because the page was missing from both the WiredTiger cache and the OS page cache. A brief spike after restart is normal during warmup, but sustained faults mean the working set does not fit. On EBS gp3, 50 faults per second can degrade latency. On NVMe, hundreds per second may be tolerable, but neither is free. Confirm the cause, distinguish warmup from pressure, and reduce the fault rate without guessing.

What this means

MongoDB uses a two-tier memory hierarchy. WiredTiger maintains its own uncompressed cache, which defaults to the larger of 50% of (RAM minus 1 GB) or 256 MB. When a document is not in the WiredTiger cache, WiredTiger may still find the compressed on-disk page in the OS page cache. A page fault fires only when neither layer holds the data, forcing a physical disk read. Sustained faults after warmup mean the active data set exceeds the combined memory of both tiers. This is worse than a WiredTiger cache miss served by the OS page cache: it is an OS-level signal that the node is memory-bound, and every fault adds disk I/O latency directly to the operation.
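As a rough worked example of the default sizing rule, the sketch below estimates the cache size mongod would choose for a given amount of RAM and how much is left for the OS page cache and everything else. It assumes the documented default, the larger of 50% of (RAM minus 1 GB) or 256 MB, and no explicit cache size setting. The names `defaultWiredTigerCacheBytes` and `memorySplit` are hypothetical helpers, not MongoDB APIs.

```javascript
// Hypothetical helpers: estimate the default WiredTiger cache size.
// Assumes the documented default: max(50% of (RAM - 1 GiB), 256 MiB).
const GiB = 1024 ** 3;
const MiB = 1024 ** 2;

function defaultWiredTigerCacheBytes(ramBytes) {
  return Math.max(0.5 * (ramBytes - GiB), 256 * MiB);
}

// Splits RAM into the default cache and the remainder shared by the
// OS page cache, connection threads, and any other processes.
function memorySplit(ramBytes) {
  const cache = defaultWiredTigerCacheBytes(ramBytes);
  return { cacheGiB: cache / GiB, remainderGiB: (ramBytes - cache) / GiB };
}

console.log(memorySplit(16 * GiB)); // a 16 GiB host
console.log(memorySplit(1 * GiB));  // a small host falls back to the 256 MiB floor
```

The remainder is an upper bound on the second tier: the OS page cache only gets what other consumers leave free.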

flowchart TD
    A[Query requests page] --> B{In WiredTiger cache?}
    B -->|Yes| C[Serve from WT cache]
    B -->|No| D{In OS page cache?}
    D -->|Yes| E[Read into WT cache]
    D -->|No| F["Major page fault<br/>disk I/O required"]
    E --> C
    F --> C

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Working set growth or unindexed queries | Faults rise with disk read IOPS; docsExamined far exceeds nreturned in slow queries. | WiredTiger cache fill ratio and db.currentOp() for collection scans. |
| WiredTiger cache undersized or container limit ignored | Faults are high despite a modest active set; cache is capped far below available RAM. | wiredTiger.cache["maximum bytes configured"] against the host or container memory limit. |
| Long-running snapshots pinning old versions | Cache fill is high but dirty ratio is low; faults persist with few new writes. | db.currentOp() for open transactions and the metrics.cursor.open.noTimeout count. |
| External memory pressure or swap | Faults spike alongside system-level memory exhaustion; mongod RSS is stable but available memory is low. | free -m and vmstat 1 for swap activity and system reclaim. |
| Inadequate storage for unavoidable faults | Fault rate is acceptable for NVMe but painful on EBS gp3; latency spikes correlate with fault spikes. | Storage device type and iostat -x 1 for await and utilization. |

Quick checks

Run these read-only commands to baseline the current state.

# Check system memory and swap pressure
free -m && vmstat 1 3
# Major page faults per mongod process
pgrep mongod | while read pid; do
  awk '{print "pid "$1" majflt:", $12}' /proc/$pid/stat
done
// Check WiredTiger cache fill, dirty ratio, and configured size
var c = db.serverStatus().wiredTiger.cache;
var max = c["maximum bytes configured"];
var used = c["bytes currently in the cache"];
var dirty = c["tracked dirty bytes in the cache"];
print("Cache used: " + (100 * used / max).toFixed(1) + "%");
print("Cache dirty: " + (100 * dirty / max).toFixed(1) + "%");
print("Max configured: " + (max / 1024 / 1024 / 1024).toFixed(1) + " GB");
// Check cumulative page faults (compute delta over 60s for a rate)
db.serverStatus().extra_info.page_faults
// Check for long-running operations and open transactions
db.currentOp({ "active": true, "secs_running": { "$gt": 60 } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});
// Check for cursors that never time out and can pin snapshots
printjson(db.serverStatus().metrics.cursor)
# Check disk I/O latency and utilization
iostat -x 1 5
// Check resident memory vs expected baseline (one serverStatus call for both values)
var status = db.serverStatus();
print("RSS MB: " + status.mem.resident);
print("Connections: " + status.connections.current);
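
The page fault counter is cumulative, so a rate needs two readings. A minimal sketch of that delta calculation follows; `faultsPerSecond` and the sample object shape are hypothetical, and the numbers are illustrative rather than measured.

```javascript
// Hypothetical helper: converts two cumulative counter readings into a rate.
// Each sample is { faults: <extra_info.page_faults>, atMs: <timestamp in ms> }.
function faultsPerSecond(first, second) {
  const elapsedSec = (second.atMs - first.atMs) / 1000;
  if (elapsedSec <= 0) throw new Error("samples must be taken in order");
  const delta = second.faults - first.faults;
  // The counter starts over when mongod restarts; a negative delta is not a rate.
  if (delta < 0) return null;
  return delta / elapsedSec;
}

// In mongosh, a sample could be taken like this (run it twice, about 60 s apart):
//   ({ faults: Number(db.serverStatus().extra_info.page_faults), atMs: Date.now() })
const before = { faults: 120000, atMs: 0 };    // illustrative values
const after = { faults: 123600, atMs: 60000 };
console.log(faultsPerSecond(before, after) + " faults/s");
```

Returning `null` on a counter reset keeps a restart from showing up as a large negative or misleading rate.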

How to diagnose it

  1. Confirm the fault rate is abnormal. Sample extra_info.page_faults twice over 60 seconds and compute the delta. If the node recently restarted, high faults are expected while the cache warms. Wait until the working set should have loaded before treating faults as abnormal.

  2. Check the two-tier memory state. Inspect WiredTiger cache fill ratio. If it is below 70% and faults are high, the working set likely exceeds the OS page cache because other processes are consuming RAM or the OS is reclaiming cache aggressively. If cache fill is above 80%, WiredTiger itself is under pressure.

  3. Identify snapshot retention. Run db.currentOp() filtered for transactions and aggregations running longer than 60 seconds. Check metrics.cursor.open.noTimeout. If either is elevated, old snapshots are preventing WiredTiger from evicting historical versions, reducing the effective cache available for the working set.

  4. Correlate with query efficiency. Scan the slow query log for COLLSCAN or queries where docsExamined vastly exceeds nreturned. A new unindexed query can pull far more data into memory than necessary, displacing the real working set and causing faults on subsequent accesses.

  5. Validate the cache sizing. Compare maximum bytes configured to the host’s physical RAM. In containers, set the cache size explicitly based on the container limit, because the default formula may use host RAM rather than the container limit. A container with a 4 GB limit on a 64 GB host can experience OOM kills if the cache is sized to host RAM, or suffer cache pressure if capped too low.

  6. Check storage backend latency. Run iostat -x 1. If await is high during fault spikes, the disk subsystem is the bottleneck. On EBS gp2, check burst balance. On gp3, verify provisioned IOPS and throughput are not saturated.
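
Two of the steps above reduce to simple arithmetic, sketched here under stated assumptions. `interpretCacheFill` encodes the 70% and 80% readings from step 2, and `scanRatio` computes the examined-to-returned ratio from step 4 using the slow query log fields docsExamined and nreturned. Both functions are hypothetical helpers, and the middle band between 70% and 80% is labeled inconclusive as an assumption of this sketch.

```javascript
// Hypothetical helpers, not MongoDB APIs.

// Step 2: read the cache fill ratio. The 70% and 80% cut-offs come from the
// text; the "inconclusive" middle band is an assumption of this sketch.
function interpretCacheFill(usedBytes, maxBytes) {
  const fillPct = (100 * usedBytes) / maxBytes;
  let reading = "inconclusive: check snapshots and query efficiency";
  if (fillPct > 80) reading = "WiredTiger cache under pressure";
  else if (fillPct < 70) reading = "suspect OS page cache pressure";
  return { fillPct, reading };
}

// Step 4: documents examined per document returned for one slow query entry.
// Assumes the slow query log field names docsExamined and nreturned.
function scanRatio(entry) {
  // Guard against division by zero when a query returns nothing.
  return entry.docsExamined / Math.max(entry.nreturned, 1);
}

console.log(interpretCacheFill(9, 10));
console.log(interpretCacheFill(5, 10));
console.log(scanRatio({ docsExamined: 500000, nreturned: 20 }));
```

The readings only make sense when the fault rate is already known to be high after warmup; a low fill ratio on a quiet node is not a problem.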

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| extra_info.page_faults rate | Hard faults mean disk I/O on every miss. | Sustained rate above 50/s on EBS gp3, or trending upward after warmup. |
| WiredTiger cache fill ratio | Shows if the working set exceeds the internal cache. | Above 80% sustained, especially with rising eviction rates. |
| WiredTiger cache dirty ratio | Dirty data accumulation can displace clean pages and worsen faults. | Above 10% sustained; above 20% risks checkpoint stalls. |
| metrics.cursor.open.noTimeout | Each cursor can hold a snapshot, pinning old versions. | Above zero is a risk; above 10 strongly indicates cache pressure from snapshots. |
| currentOp max age | One runaway query can flood the cache with irrelevant pages. | Any non-background operation above 300 seconds. |
| System available memory / swap | External memory pressure steals page cache from MongoDB. | Available memory near zero or any swap activity. |
| Disk read await (iostat) | Confirms whether faults are actually causing queueing. | await above 20 ms sustained during fault spikes. |
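
The warning signs in the table can be expressed as a small rule set. The sketch below is one possible encoding, not a monitoring product's configuration: `WARNING_RULES`, `warnings`, and the snapshot keys are hypothetical names, and the limits are the first threshold listed for each signal, which should be tuned per deployment.

```javascript
// Hypothetical thresholds copied from the table above; tune them per deployment.
const WARNING_RULES = [
  { signal: "page faults/s", key: "faultsPerSec", limit: 50 },
  { signal: "cache fill %", key: "cacheFillPct", limit: 80 },
  { signal: "cache dirty %", key: "cacheDirtyPct", limit: 10 },
  { signal: "noTimeout cursors", key: "noTimeoutCursors", limit: 0 },
  { signal: "oldest op (s)", key: "maxOpAgeSec", limit: 300 },
  { signal: "read await (ms)", key: "readAwaitMs", limit: 20 },
];

// Returns the names of signals whose value exceeds its warning limit.
// Signals missing from the snapshot are skipped rather than guessed.
function warnings(snapshot) {
  return WARNING_RULES
    .filter((r) => typeof snapshot[r.key] === "number" && snapshot[r.key] > r.limit)
    .map((r) => r.signal);
}

console.log(warnings({ faultsPerSec: 75, cacheFillPct: 72, cacheDirtyPct: 12, noTimeoutCursors: 0 }));
```

A point-in-time check like this ignores the "sustained" qualifier in the table; in practice each value would be an average over a window.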

Fixes

Reduce the working set or improve locality

Add missing indexes or optimize queries so MongoDB touches fewer pages. Use db.collection.aggregate([{ $indexStats: {} }]) to verify indexes are being used. A single new collection scan can displace a previously stable working set. Tradeoff: write amplification from additional indexes and the I/O cost of background builds.
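
To illustrate reading $indexStats output, the sketch below filters for indexes with no recorded accesses. `unusedIndexes` is a hypothetical helper and the sample array is made up; only the `name`, `accesses.ops`, and `accesses.since` fields are taken from the $indexStats output format.

```javascript
// Hypothetical helper: picks out indexes with no recorded accesses.
// Input is the array from db.<collection>.aggregate([{ $indexStats: {} }]).toArray().
// accesses.ops counts operations since accesses.since, which resets when
// mongod restarts, so zero only means "unused in that window on this node".
function unusedIndexes(indexStats) {
  return indexStats
    // Number() is used because mongosh may return ops as a Long, not a number.
    .filter((s) => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map((s) => ({ name: s.name, since: s.accesses.since }));
}

// Illustrative sample, not real output.
const sampleStats = [
  { name: "_id_", accesses: { ops: 0, since: "2024-01-01T00:00:00Z" } },
  { name: "status_1", accesses: { ops: 48213, since: "2024-01-01T00:00:00Z" } },
  { name: "legacy_flag_1", accesses: { ops: 0, since: "2024-01-01T00:00:00Z" } },
];
console.log(unusedIndexes(sampleStats));
```

Check every replica set member before dropping anything, since each node keeps its own counters.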

Right-size the WiredTiger cache

If the cache is too small for the working set, increase it with --wiredTigerCacheSizeGB or storage.wiredTiger.engineConfig.cacheSizeGB in the configuration file. Do not exceed roughly 80% of available RAM; the OS page cache and connection thread stacks also need space. In containers, set this explicitly based on the container limit, not the host’s. Tradeoff: less RAM for the OS page cache, which can paradoxically increase faults if overdone.
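
A quick sanity check on a proposed cache size can catch the container mismatch before it is deployed. In this sketch, `checkCacheSize` is a hypothetical helper, and the 80% ceiling is the rough guideline from the paragraph above rather than a limit MongoDB enforces.

```javascript
// Hypothetical check of an explicit cacheSizeGB against a memory limit.
// The 80% ceiling is a rough guideline, not a MongoDB-enforced limit.
function checkCacheSize(cacheSizeGB, memoryLimitGB) {
  const fraction = cacheSizeGB / memoryLimitGB;
  return { fraction, withinGuideline: fraction <= 0.8 };
}

// A 2 GB cache inside a 4 GB container.
console.log(checkCacheSize(2, 4));
// A cache sized from a 64 GB host, 0.5 * (64 - 1) = 31.5 GB, inside a 4 GB container.
console.log(checkCacheSize(31.5, 4));
```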

Free pinned snapshots

Kill unnecessarily long-running operations via db.killOp(). Identify applications leaving noCursorTimeout cursors open and close them. This immediately increases the pool of evictable pages. Warning: killing operations is disruptive to clients and can interrupt in-flight transactions or ETL jobs.
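
Because killing operations is disruptive, it helps to build a review list first instead of killing in a loop. The sketch below shows one way to do that; `killCandidates` is a hypothetical helper, and skipping the local, admin, and config databases is a conservative assumption of this sketch to avoid touching replication and internal work.

```javascript
// Hypothetical helper: lists operations worth reviewing before db.killOp().
// Input is the inprog array from db.currentOp({ active: true }).
function killCandidates(inprog, minSeconds) {
  return inprog
    .filter((op) => op.secs_running >= minSeconds)
    // Assumption: skip ops without a namespace and those on internal databases.
    .filter((op) => op.ns && !/^(local|admin|config)\./.test(op.ns))
    .sort((a, b) => b.secs_running - a.secs_running) // oldest first
    .map((op) => ({ opid: op.opid, op: op.op, ns: op.ns, secs: op.secs_running }));
}

// Illustrative sample, not real output.
const sampleOps = [
  { opid: 1, op: "query", ns: "shop.orders", secs_running: 400 },
  { opid: 2, op: "getmore", ns: "local.oplog.rs", secs_running: 9000 },
  { opid: 3, op: "command", ns: "shop.items", secs_running: 12 },
  { opid: 4, op: "command", ns: "shop.events", secs_running: 700 },
];
console.log(killCandidates(sampleOps, 60));
```

After reviewing the list with the owning team, each chosen opid can be passed to db.killOp() individually.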

Reduce memory competition

Shrink application connection pool sizes to reduce thread stack overhead, or move non-MongoDB workloads off the node. Ensure vm.swappiness is set to 1 so the OS prefers reclaiming page cache over swapping. If swap is active, faults become far more expensive.

Scale out or archive cold data

If the working set exceeds what can fit in memory economically, shard the collection to spread the working set across nodes, or archive cold data to reduce the active set. Tradeoff: operational complexity.

Upgrade storage if faults are unavoidable

If the working set cannot be reduced and memory cannot be increased, ensure the storage layer can absorb the fault rate. Moving from EBS gp3 to NVMe-backed instances turns a latency crisis into manageable background noise.

Prevention

  • Trend cache fill and dirty ratio over weeks. A steady climb from 60% to 75% gives early warning that the working set is approaching limits.
  • Audit index usage monthly. Unused indexes consume cache and write bandwidth. Missing indexes cause scans that bloat the effective working set.
  • Monitor connection churn, not just connection count. High totalCreated rates increase memory fragmentation and RSS pressure.
  • Gate alerts on uptime. Suppress page fault alerts during the first 30 minutes after restart to avoid false positives during warmup.
  • Track currentOp max age continuously. Catching a runaway query at 60 seconds prevents it from flooding cache and causing a fault storm.
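
Two of the practices above are easy to sketch in code: gating page fault alerts on uptime and measuring connection churn as a rate. The function names are hypothetical; the 30-minute window and the 50 faults/s default come from this guide, and the inputs are assumed to come from serverStatus (uptime in seconds and connections.totalCreated).

```javascript
// Hypothetical alert gate: suppresses page fault alerts during warmup.
// The 30-minute window and 50 faults/s default come from this guide.
const WARMUP_SECONDS = 30 * 60;

function shouldAlertOnFaults(uptimeSec, faultsPerSec, threshold = 50) {
  if (uptimeSec < WARMUP_SECONDS) return false; // cache is still warming
  return faultsPerSec > threshold;
}

// Connection churn as a rate: delta of connections.totalCreated
// between two samples divided by the seconds between them.
function connectionsPerSecond(createdBefore, createdAfter, elapsedSec) {
  return (createdAfter - createdBefore) / elapsedSec;
}

console.log(shouldAlertOnFaults(600, 400));   // 10 minutes after restart
console.log(shouldAlertOnFaults(86400, 120)); // one day after restart
console.log(connectionsPerSecond(10000, 10600, 60));
```

A fixed window is a simplification; a node with a large working set on slow storage may need longer than 30 minutes to warm.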

How Netdata helps

Netdata correlates extra_info.page_faults with WiredTiger cache fill, dirty ratio, and eviction rates. OS-level disk latency and mongod RSS on the same dashboard distinguish external memory pressure from internal cache saturation. Historical tracking of long-running operation age and cursor counts shows which query or noTimeout cursor preceded a fault spike. Connection churn is shown as a rate, surfacing thread-creation overhead that competes with the page cache. Second-granularity collection catches brief fault bursts that slower tools average away.