MongoDB exceeded memory limit for $group — aggregation spills and allowDiskUse
Application logs show error 16945, or an aggregation pipeline slows by an order of magnitude. In MongoDB, every aggregation stage not backed by an index is limited to 100 megabytes of RAM. When a stage exceeds this limit and disk spilling is not enabled, the operation fails immediately. If spilling is enabled, MongoDB writes temporary files to disk, which keeps the pipeline alive but adds unpredictable latency and extra I/O load.
Before MongoDB 6.0, you had to explicitly opt in to disk spilling with { allowDiskUse: true }. Starting in 6.0, the allowDiskUseByDefault server parameter is true, so eligible stages spill automatically. That removes the hard failure for many pipelines, but it also makes it easier for heavy workloads to hide behind disk I/O instead of failing fast. Some stages and accumulators, such as $graphLookup and the $push and $addToSet accumulators inside $group, cannot spill to disk at all regardless of the setting.
What this means
The 100MB limit applies per stage. When a $group, $sort, $bucket, $bucketAuto, $setWindowFields, or $sortByCount stage accumulates more than 100MB of data in memory, MongoDB has two choices: abort the operation or write intermediate results to temporary files on disk. The allowDiskUse setting controls which one happens. In MongoDB 6.0 and later, the server default (allowDiskUseByDefault) is true, so these stages spill automatically unless the command explicitly passes { allowDiskUse: false }. In earlier versions, the default is false, and the pipeline aborts unless the client explicitly requests disk spilling.
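The version and override rules above can be written as a small decision function. This is a minimal sketch, not MongoDB's own logic. The function name and parameters are hypothetical, and the function only answers whether spilling is permitted, not whether the stage in question is able to spill.

```javascript
// Sketch only: decides whether an aggregation is allowed to spill, using the
// version rules described above. It does not check whether a stage supports
// spilling. Hypothetical inputs:
//   serverVersion         - "major.minor.patch" string, e.g. from db.version()
//   allowDiskUseByDefault - getParameter result on 6.0+, undefined before 6.0
//   commandOption         - allowDiskUse as sent by the client, or undefined
function effectiveAllowDiskUse(serverVersion, allowDiskUseByDefault, commandOption) {
  // An explicit allowDiskUse on the command overrides the server default.
  if (typeof commandOption === "boolean") {
    return commandOption;
  }
  const major = parseInt(serverVersion.split(".")[0], 10);
  if (major >= 6) {
    // 6.0+: use the server parameter, which defaults to true.
    return allowDiskUseByDefault !== false;
  }
  // Before 6.0: no opt-in means no spilling.
  return false;
}

console.log(effectiveAllowDiskUse("6.0.0", true, undefined));      // true
console.log(effectiveAllowDiskUse("5.0.0", undefined, undefined)); // false
console.log(effectiveAllowDiskUse("7.0.0", true, false));          // false
```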
Not everything can spill. The $graphLookup stage ignores allowDiskUse entirely and is hard-capped at 100MB. If it exceeds the limit, it throws its own memory error. Inside a $group stage, the $push and $addToSet accumulators also cannot spill to disk. Even with allowDiskUse: true, unbounded arrays in these accumulators will hit the memory ceiling and fail.
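You can catch the non-spillable cases before they reach production by linting the pipeline definition. The sketch below assumes the pipeline is the plain array of stage documents passed to aggregate(). The function name is hypothetical. It only inspects top-level stages, so sub-pipelines inside $lookup or $facet are not checked.

```javascript
// Hypothetical lint: report top-level stages and accumulators that cannot
// spill to disk, regardless of allowDiskUse.
function findNonSpillable(pipeline) {
  const findings = [];
  pipeline.forEach((stage, index) => {
    if ("$graphLookup" in stage) {
      findings.push({ index, reason: "$graphLookup ignores allowDiskUse" });
    }
    if ("$group" in stage) {
      for (const [field, expr] of Object.entries(stage.$group)) {
        // _id is the group key, not an accumulator.
        if (field === "_id" || expr === null || typeof expr !== "object") continue;
        for (const op of ["$push", "$addToSet"]) {
          if (op in expr) {
            findings.push({ index, reason: `${op} in $group field "${field}" cannot spill` });
          }
        }
      }
    }
  });
  return findings;
}

// Example pipeline with hypothetical field names.
const pipeline = [
  { $match: { status: "active" } },
  { $group: { _id: "$customerId", orders: { $push: "$_id" }, total: { $sum: "$amount" } } },
];
console.log(findNonSpillable(pipeline));
```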
When spilling does occur, MongoDB reports usedDisk: true for the operation in the profiler and the slow-query log, which tells you the pipeline did not run purely in memory. Temporary spill files exist for the duration of the pipeline execution and consume space on the instance’s storage volume. On instances with small root volumes, aggressive spilling can fill the disk and trigger secondary failures. In a sharded cluster, enabling allowDiskUse for a sorting or grouping stage can cause the merge step to run on a randomly selected shard, which can concentrate disk I/O on an unexpected node.
```mermaid
flowchart TD
A[Stage exceeds 100MB RAM] --> B{allowDiskUse?}
B -->|Yes| C{Stage supports spill?}
B -->|No| D["Error 16945 ($group) / 16819 ($sort)"]
C -->|Yes| E["Write temp files<br/>usedDisk: true"]
C -->|No| F["Error: cannot spill"]
```
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Missing allowDiskUse on pre-6.0, or an explicit opt-out | Error 16945 (16819 for $sort) in driver logs; pipeline fails immediately | db.adminCommand({ getParameter: 1, allowDiskUseByDefault: 1 }) and the command options |
| $push or $addToSet with unbounded arrays | Memory error even with allowDiskUse: true | Accumulators inside the $group stage |
| High-cardinality $group or large blocking $sort | Pipeline slows dramatically; usedDisk: true in profiler | db.currentOp() for active aggregations |
| No early $match before the heavy stage | Spikes in scanned objects; temp files grow | explain("executionStats") for totalDocsExamined |
| Blocking $sort without index support | $sort stage hits the limit when not index-backed | explain("executionStats") for a blocking SORT stage |
Quick checks
Run these read-only commands to assess the current state without risking further impact.
```shell
# Check active aggregations running longer than 10 seconds
mongosh --quiet --eval 'db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(o => { if (o.command && o.command.aggregate) print(o.ns + " | " + o.secs_running + "s") })'

# Search the MongoDB log for memory-limit errors or disk use
grep -iE "Exceeded memory limit|usedDisk" /var/log/mongodb/mongod.log | tail -10

# Check free space on the data volume (adjust the path to your storage.dbPath)
df -h /data/db
```

```javascript
// Check the server default for disk spilling (6.0+)
db.adminCommand({ getParameter: 1, allowDiskUseByDefault: 1 })

// Query the profiler for recent disk spills (requires profiling to be enabled on this database)
db.system.profile.find({ usedDisk: true }).sort({ ts: -1 }).limit(5)

// Average command latency since startup (cumulative, so it smooths over recent spikes)
var l = db.serverStatus().opLatencies;
print("Command avg µs:", Math.floor(Number(l.commands.latency) / Number(l.commands.ops)));

// Check WiredTiger cache dirty ratio for I/O pressure
var c = db.serverStatus().wiredTiger.cache;
var dirty = 100 * Number(c["tracked dirty bytes in the cache"]) / Number(c["maximum bytes configured"]);
print("Cache dirty %:", dirty.toFixed(1));

// Check queue depth to see if spills are causing contention
var q = db.serverStatus().globalLock.currentQueue;
print("Queued readers:", q.readers, "writers:", q.writers);
```
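The Command avg µs check divides cumulative counters. It therefore reports the average since mongod started and can hide a recent spill-driven spike. Below is a minimal sketch of a windowed average, assuming you take two opLatencies.commands samples a few seconds apart. The function name is hypothetical.

```javascript
// Hypothetical helper: average latency (µs) per command between two samples
// of db.serverStatus().opLatencies.commands. Number() converts plain numbers
// and 64-bit Long values alike.
function intervalAvgMicros(before, after) {
  const ops = Number(after.ops) - Number(before.ops);
  const latency = Number(after.latency) - Number(before.latency);
  // No commands completed in the window: return null instead of dividing by zero.
  return ops > 0 ? latency / ops : null;
}

// Usage in mongosh (sleep() takes milliseconds):
//   var a = db.serverStatus().opLatencies.commands;
//   sleep(10000);
//   var b = db.serverStatus().opLatencies.commands;
//   print("Command avg µs (last 10s):", intervalAvgMicros(a, b));

// Offline check with made-up counter values: 600000 µs over 100 commands.
console.log(intervalAvgMicros({ latency: 5000000, ops: 1000 }, { latency: 5600000, ops: 1100 })); // 6000
```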
How to diagnose it
- Confirm the error and the failing stage. Check application logs and the MongoDB log for “Exceeded memory limit”. Identify whether the failing stage is $group, $sort, or another stage. Note the error code: 16945 for $group, 16819 for $sort.
- Check whether allowDiskUse is enabled. Run getParameter for allowDiskUseByDefault on 6.0+. On earlier versions, the server default is false. If the client explicitly passed { allowDiskUse: false }, the pipeline aborts instead of spilling.
- Review the pipeline shape. Use db.currentOp() to find active aggregations. Look at the stages before $group or $sort. If there is no selective $match near the start, the stage may be processing the entire collection.
- Run explain("executionStats"). Look for COLLSCAN, a high totalDocsExamined, and whether $sort is using an index. An index-backed sort does not count toward the 100MB limit. If the plan shows a blocking SORT stage, or $sort remains a separate pipeline stage, instead of an IXSCAN that supplies the order, the sort is consuming the memory budget.
- Inspect the profiler for usedDisk. If usedDisk: true appears, the pipeline is spilling. Correlate the timestamp with disk I/O latency and WiredTiger cache pressure to determine whether the spill is saturating storage.
- Check for non-spillable accumulators. If the error persists despite allowDiskUse: true, inspect the $group stage for $push or $addToSet. These accumulators cannot spill. The fix is restructuring the pipeline, not changing a flag.
- Verify driver and framework defaults. Some drivers, ODMs, and frameworks set allowDiskUse explicitly, which overrides the server default. The profiler's command field shows the options that were actually sent.
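The explain checks in the steps above can be scripted. This sketch walks an explain("executionStats") document recursively, because the output shape differs between versions, query engines, and sharded clusters. The function name, the skipped keys, and the simplified sample document are assumptions for illustration, not a documented schema.

```javascript
// Keys skipped while walking: the echoed command (it contains the original
// $sort even when the sort was pushed down) and plans that were not chosen.
const SKIP_KEYS = new Set(["command", "rejectedPlans", "allPlansExecution"]);

// Hypothetical helper: collect COLLSCAN, blocking SORT, leftover $sort stages,
// and the sum of totalDocsExamined anywhere in the explain output.
function scanExplain(explain) {
  const found = { collscan: false, blockingSort: false, sortStage: false, totalDocsExamined: 0 };
  const walk = (node) => {
    if (Array.isArray(node)) { node.forEach(walk); return; }
    if (node === null || typeof node !== "object") return;
    if (node.stage === "COLLSCAN") found.collscan = true;
    if (node.stage === "SORT") found.blockingSort = true; // in-memory sort in the query plan
    if (Object.prototype.hasOwnProperty.call(node, "$sort")) found.sortStage = true; // separate pipeline $sort
    if (node.totalDocsExamined !== undefined) found.totalDocsExamined += Number(node.totalDocsExamined);
    for (const [key, value] of Object.entries(node)) {
      if (!SKIP_KEYS.has(key)) walk(value);
    }
  };
  walk(explain);
  return found;
}

// In mongosh (collection and pipeline are hypothetical):
//   scanExplain(db.orders.explain("executionStats").aggregate(pipeline))
// Simplified, made-up explain shape for an offline check:
const sample = {
  stages: [
    { $cursor: { queryPlanner: { winningPlan: { stage: "COLLSCAN" } },
                 executionStats: { totalDocsExamined: 250000 } } },
    { $sort: { sortKey: { createdAt: -1 } } },
  ],
};
console.log(scanExplain(sample));
```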
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| system.profile usedDisk | Detects aggregations writing temporary spill files | Any usedDisk: true on production query patterns |
| opLatencies.commands | Aggregations run as commands; rising latency reveals spill cost | Command avg or tail latency spikes during batch windows |
| wiredTiger.cache dirty ratio | Disk spills compete with checkpoint I/O and raise dirty data | Dirty ratio sustained above 10% during aggregation peaks |
| wiredTiger.log sync latency | Temp-file I/O and journal pressure show up as sync latency | Average sync latency above 30 ms sustained |
| globalLock.currentQueue | Slow disk spills hold tickets and cause queuing | readers + writers sustained above 20 |
| metrics.document.returned | Unfiltered pipelines examine and return excessive documents | returned rate spikes without a matching rise in opcounters.query |
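The point-in-time checks from the table can be evaluated from a single serverStatus document. This is a sketch under several assumptions. The function name is hypothetical. The field names are those used in the quick checks, plus the WiredTiger log sync counters as they appear in recent serverStatus output; verify them on your version. The log sync average is cumulative since startup unless you pass in differenced counters. "Sustained" still means repeating the check over time.

```javascript
// Warning thresholds taken from the table above.
const THRESHOLDS = { dirtyPct: 10, syncMs: 30, queued: 20 };

// Hypothetical helper: compute the three gauges and list which ones exceed
// their threshold. Number() also handles 64-bit Long values from mongosh.
function evaluateSignals(status) {
  const cache = status.wiredTiger.cache;
  const log = status.wiredTiger.log;
  const q = status.globalLock.currentQueue;

  const dirtyPct = 100 * Number(cache["tracked dirty bytes in the cache"]) /
                   Number(cache["maximum bytes configured"]);
  const syncOps = Number(log["log sync operations"]);
  // Average µs per sync converted to ms; cumulative unless counters are diffed.
  const syncMs = syncOps > 0 ? Number(log["log sync time duration (usecs)"]) / syncOps / 1000 : 0;
  const queued = Number(q.readers) + Number(q.writers);

  const warnings = [
    dirtyPct > THRESHOLDS.dirtyPct && "cache dirty ratio",
    syncMs > THRESHOLDS.syncMs && "log sync latency",
    queued > THRESHOLDS.queued && "global lock queue",
  ].filter(Boolean);
  return { dirtyPct, syncMs, queued, warnings };
}

// In mongosh: evaluateSignals(db.serverStatus())
// Offline check with made-up values:
const sampleStatus = {
  wiredTiger: {
    cache: { "tracked dirty bytes in the cache": 1200000000, "maximum bytes configured": 8000000000 },
    log: { "log sync operations": 1000, "log sync time duration (usecs)": 45000000 },
  },
  globalLock: { currentQueue: { readers: 5, writers: 3 } },
};
console.log(evaluateSignals(sampleStatus).warnings); // [ 'cache dirty ratio', 'log sync latency' ]
```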
Fixes
Enable or verify allowDiskUse
Before MongoDB 6.0, pass { allowDiskUse: true } in the aggregation command to allow eligible stages to spill. On MongoDB 6.0+, the server defaults to true, but explicitly passing { allowDiskUse: false } opts out. Some application frameworks may pass false explicitly, overriding the server default. Check your driver and framework documentation, and set the flag explicitly to match your intent.
In a sharded cluster, remember that allowDiskUse: true can move the merge stage to a randomly selected shard, concentrating I/O there.
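One way to make the intent explicit is to send the raw aggregate command yourself. This sketch builds the command document. The function name and the orders collection are hypothetical. Because allowDiskUse is part of the command, it also shows up in the profiler's command field, which makes framework overrides easy to spot.

```javascript
// Hypothetical helper: build an aggregate command with allowDiskUse stated
// explicitly. Refuses to build a command that leaves the flag implicit.
function buildAggregateCommand(collection, pipeline, allowDiskUse) {
  if (typeof allowDiskUse !== "boolean") {
    throw new TypeError("allowDiskUse must be set explicitly to true or false");
  }
  return {
    aggregate: collection,
    pipeline,
    cursor: {}, // the aggregate command requires a cursor document
    allowDiskUse,
  };
}

const cmd = buildAggregateCommand("orders", [{ $group: { _id: "$customerId", n: { $sum: 1 } } }], true);
// In mongosh: db.runCommand(cmd)
// With the Node.js driver, the equivalent option is:
//   db.collection("orders").aggregate(pipeline, { allowDiskUse: true })
console.log(JSON.stringify(cmd));
```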
Reduce stage input with early filtering
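Move $match stages as early as possible in the pipeline to reduce the number of documents entering $group or $sort. If your use case allows it, add a $limit before the heavy stage. For $sort, every document that reaches the stage counts against the 100MB budget. For $group, the cost grows with the number of distinct groups and the size of each group's accumulated values.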
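A simple lint can flag pipelines where nothing narrows the input before the first memory-heavy stage. In this sketch, the function name is hypothetical and the stage list is the spill-capable set named earlier in this guide. It only checks ordering, not how selective the $match actually is.

```javascript
// Memory-heavy stages named earlier in this guide.
const HEAVY_STAGES = ["$group", "$sort", "$bucket", "$bucketAuto", "$setWindowFields", "$sortByCount"];

// Hypothetical lint: find the first heavy stage and report whether a $match
// or $limit runs before it.
function checkEarlyFiltering(pipeline) {
  // Each stage document has exactly one key: the stage name.
  const stageName = (stage) => Object.keys(stage)[0];
  const heavyIndex = pipeline.findIndex((stage) => HEAVY_STAGES.includes(stageName(stage)));
  if (heavyIndex === -1) {
    return { heavyStage: null, filteredBefore: null };
  }
  const earlier = pipeline.slice(0, heavyIndex).map(stageName);
  return {
    heavyStage: stageName(pipeline[heavyIndex]),
    filteredBefore: earlier.includes("$match") || earlier.includes("$limit"),
  };
}

// Filtering after $group does not reduce what $group must hold in memory.
console.log(checkEarlyFiltering([
  { $group: { _id: "$region", total: { $sum: "$amount" } } },
  { $match: { total: { $gt: 1000 } } },
]));
```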
Ensure $sort is index-backed
An index-backed $sort bypasses the 100MB memory restriction entirely because documents are read in index order, so no in-memory sort is needed. In an aggregation, the $sort can generally use an index only when it is the first stage or is preceded only by $match stages. If explain shows a blocking SORT stage, add an index that matches the sort fields and direction. If the query also has equality filters, place those fields before the sort fields in the index.
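The two rules above can be sketched as helper functions: equality fields go before sort fields, and the $sort must sit where it can use an index. The function names and example fields are hypothetical. The placement check is deliberately conservative; explain remains the authority on whether the index is actually used.

```javascript
// Hypothetical helpers; field names below are examples only.

// Build an index key: equality fields first (ascending), then the sort fields
// in the same order and direction as the $sort.
function sortIndexKey(equalityFields, sortSpec) {
  const key = {};
  for (const field of equalityFields) key[field] = 1;
  for (const [field, direction] of Object.entries(sortSpec)) key[field] = direction;
  return key;
}

// Conservative placement check: true only if $sort is the first stage or is
// preceded exclusively by $match stages.
function sortCanUseIndex(pipeline) {
  const sortIndex = pipeline.findIndex((stage) => "$sort" in stage);
  if (sortIndex === -1) return false;
  return pipeline.slice(0, sortIndex).every((stage) => "$match" in stage);
}

const pipeline = [{ $match: { status: "open" } }, { $sort: { createdAt: -1 } }];
console.log(JSON.stringify(sortIndexKey(["status"], { createdAt: -1 }))); // {"status":1,"createdAt":-1}
console.log(sortCanUseIndex(pipeline)); // true
// In mongosh: db.orders.createIndex(sortIndexKey(["status"], { createdAt: -1 }))
```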
Replace accumulators that cannot spill
If you are using $push or $addToSet inside $group and the grouped arrays grow large, allowDiskUse will not help. Restructure the pipeline to return grouped identifiers without building the full arrays server-side, or move the array assembly to the application layer. For very large grouping operations, consider an out-of-band process instead; map-reduce is deprecated as of MongoDB 5.0 and is not a good target for new work.
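Here is a sketch of moving array assembly to the application layer. The pipeline emits one small document per value instead of building arrays with $push, and the client groups them. The collection, field names, and function name are hypothetical. Application memory now holds the arrays, so this trades a server-side failure for client-side memory use.

```javascript
// Hypothetical collection and fields: orders with customerId and status.
// One small document per order, and no $push, so no unbounded array is built.
const pipeline = [
  { $match: { status: "shipped" } },
  { $project: { _id: 0, customerId: 1, orderId: "$_id" } },
];

// Build the per-key arrays that $push would have produced, on the client.
function assembleGroups(docs, keyField, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    // String() so that ObjectId keys compare by value, not by object identity.
    const key = String(doc[keyField]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc[valueField]);
  }
  return groups;
}

// In mongosh: assembleGroups(db.orders.aggregate(pipeline).toArray(), "customerId", "orderId")
const sample = [
  { customerId: "c1", orderId: 101 },
  { customerId: "c2", orderId: 102 },
  { customerId: "c1", orderId: 103 },
];
console.log(assembleGroups(sample, "customerId", "orderId").get("c1")); // [ 101, 103 ]
```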
Address disk spill latency
If usedDisk is true and latency is unacceptable, enabling the flag is only a stopgap. The real fix is reducing the data volume the stage must process. If the pipeline is already optimized, the bottleneck is disk throughput. Ensure your storage layer can handle the temporary I/O without degrading journal sync latency or checkpoint performance. On cloud block storage, watch for burst-credit depletion during spill-heavy windows.
Prevention
- Restructure pipelines to filter early. A $match at the start of the pipeline reduces the document count entering $group and $sort.
- Index sort fields. An index-backed $sort does not count toward the 100MB limit.
- Monitor the profiler for usedDisk. Any production aggregation writing temp files is a candidate for pipeline optimization.
- Set allowDiskUse explicitly in driver code. Do not rely on driver defaults; declare the intent to avoid framework overrides.
- Size disks for temp file overhead. Spilled aggregations write temporary files to the instance storage. Ensure the volume has enough free space to avoid secondary failures.
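To act on the profiler check, it helps to see which namespaces spill most often. Below is a minimal sketch, assuming profiling is enabled and you pass in system.profile documents (for example via toArray()). The function name is hypothetical.

```javascript
// Hypothetical helper: count usedDisk entries per namespace and keep the
// slowest duration seen, most frequent spillers first.
function summarizeSpills(profileDocs) {
  const byNs = new Map();
  for (const doc of profileDocs) {
    if (doc.usedDisk !== true) continue;
    const entry = byNs.get(doc.ns) || { ns: doc.ns, spills: 0, maxMillis: 0 };
    entry.spills += 1;
    entry.maxMillis = Math.max(entry.maxMillis, Number(doc.millis) || 0);
    byNs.set(doc.ns, entry);
  }
  return [...byNs.values()].sort((a, b) => b.spills - a.spills);
}

// In mongosh (profiling must be enabled on the database):
//   summarizeSpills(db.system.profile.find({ usedDisk: true }, { ns: 1, usedDisk: 1, millis: 1 }).toArray())
const sampleProfile = [
  { ns: "shop.orders", usedDisk: true, millis: 900 },
  { ns: "shop.events", usedDisk: true, millis: 300 },
  { ns: "shop.orders", usedDisk: true, millis: 450 },
  { ns: "shop.orders", usedDisk: false, millis: 20 },
];
console.log(summarizeSpills(sampleProfile));
```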
How Netdata helps
Netdata collects serverStatus metrics including opLatencies.commands and wiredTiger.cache dirty ratio. Disk latency and utilization charts show temp-file I/O saturation. Alerts on WiredTiger cache dirty ratio and application-thread evictions flag cache pressure cascades. Queue depth and ticket utilization metrics reveal when slow aggregations consume concurrency tickets.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection churn: high totalCreated rate and thread creation overhead
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB flow control throttling writes: when the primary slows itself down
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early







