MongoDB exceeded memory limit for $group — aggregation spills and allowDiskUse

Application logs show error 16945, or an aggregation pipeline slows by an order of magnitude. In MongoDB, every aggregation stage not backed by an index is limited to 100 megabytes of RAM. When a stage exceeds this limit and disk spilling is not enabled, the operation fails immediately. If spilling is enabled, MongoDB writes temporary files to disk, which keeps the pipeline alive but adds unpredictable latency and extra I/O load.

Before MongoDB 6.0, you had to explicitly opt in to disk spilling with { allowDiskUse: true }. Starting in 6.0, the allowDiskUseByDefault server parameter is true, so eligible stages spill automatically. That removes the hard failure for many pipelines, but it also makes it easier for heavy workloads to hide behind disk I/O instead of failing fast. Some stages and accumulators, such as $graphLookup and the $push and $addToSet accumulators inside $group, cannot spill to disk at all regardless of the setting.

What this means

The 100MB limit applies per stage. When a $group, $sort, $bucket, $bucketAuto, $setWindowFields, or $sortByCount stage accumulates more than 100MB of data in memory, MongoDB has two choices: abort the operation or write intermediate results to temporary files on disk. The decision is controlled by allowDiskUse. In MongoDB 6.0 and later, the server default is true, so most eligible stages spill automatically. In earlier versions, the default is false, and the pipeline aborts unless the client explicitly requests disk spilling.

Not everything can spill. The $graphLookup stage ignores allowDiskUse entirely and is hard-capped at 100MB. If it exceeds the limit, it throws its own memory error. Inside a $group stage, the $push and $addToSet accumulators also cannot spill to disk. Even with allowDiskUse: true, unbounded arrays in these accumulators will hit the memory ceiling and fail.
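The decision described above can be modeled as a small pure function. This is only a sketch of the rules as this article states them, not server code: the helper name spillOutcome and its arguments are hypothetical, and the two lists simply restate the stages and accumulators named here.

```javascript
// Stages this article lists as able to spill when disk use is allowed.
const SPILL_CAPABLE_STAGES = new Set([
  "$group", "$sort", "$bucket", "$bucketAuto", "$setWindowFields", "$sortByCount",
]);
// Accumulators this article describes as unable to spill inside $group.
const NON_SPILL_ACCUMULATORS = new Set(["$push", "$addToSet"]);

// Hypothetical helper: what happens once a stage passes the 100MB limit.
//   stage          - one pipeline stage, e.g. { $group: { ... } }
//   explicitOption - allowDiskUse passed by the client (true, false, undefined)
//   serverDefault  - allowDiskUseByDefault (true on 6.0+, treat as false before)
function spillOutcome(stage, explicitOption, serverDefault) {
  const name = Object.keys(stage)[0];
  // An explicit client option wins over the server default.
  const allow = explicitOption === undefined ? serverDefault : explicitOption;
  if (!allow) return "fails: disk use not allowed";
  if (!SPILL_CAPABLE_STAGES.has(name)) return "fails: stage cannot spill";
  if (name === "$group") {
    const usesNonSpill = Object.values(stage.$group).some(
      (spec) => spec !== null && typeof spec === "object" &&
        Object.keys(spec).some((op) => NON_SPILL_ACCUMULATORS.has(op))
    );
    if (usesNonSpill) return "fails: accumulator cannot spill";
  }
  return "spills to disk";
}
```

For example, a $group that only uses $sum spills when the server default is true, while the same stage with a $push accumulator is reported as unable to spill even with allowDiskUse: true.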

When spilling does occur, MongoDB marks the operation with usedDisk: true in the profiler and slow-query log, which tells you the pipeline did not run purely in memory. Temporary spill files exist for the duration of the pipeline execution and consume disk space on the instance’s storage volume. On small root-volume instances, aggressive spilling can fill the disk and trigger secondary failures. In a sharded cluster, enabling allowDiskUse for a pipeline with a sorting or grouping stage causes the merge step to run on a randomly selected shard rather than on mongos, which can concentrate disk I/O on an unexpected node.

flowchart TD
    A[Stage exceeds 100MB RAM] --> B{allowDiskUse?}
    B -->|Yes| C{Stage supports spill?}
    B -->|No| D[Error 16945]
    C -->|Yes| E[Write temp files<br/>usedDisk true]
    C -->|No| F[Error cannot spill]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Missing allowDiskUse on pre-6.0 or explicit opt-out | Error 16945 in driver logs; pipeline fails immediately | db.adminCommand({ getParameter: 1, allowDiskUseByDefault: 1 }) and the command options |
| $push or $addToSet with unbounded arrays | Memory error even when allowDiskUse: true | Accumulators inside the $group stage |
| High-cardinality $group or large blocking $sort | Pipeline slows dramatically; usedDisk: true in profiler | db.currentOp() for active aggregations |
| Missing early $match before heavy stage | Spikes in scanned objects; temp files grow | explain("executionStats") for docsExamined |
| Blocking $sort without index support | $sort stage hits limit when not index-backed | explain("executionStats"): the stage value of the sort stage |

Quick checks

Run these read-only commands to assess the current state without risking further impact.

# Shell: list active aggregations that have been running for more than 10 seconds
mongosh --quiet --eval 'db.currentOp({ active: true, "secs_running": { $gt: 10 } }).inprog.forEach(o => { if(o.command && o.command.aggregate) print(o.ns + " | " + o.secs_running + "s") })'

// mongosh: check the server default for disk spilling (6.0+)
db.adminCommand({ getParameter: 1, allowDiskUseByDefault: 1 })

# Shell: search the MongoDB log for memory limit errors or disk use
grep -iE "Exceeded memory limit|usedDisk" /var/log/mongodb/mongod.log | tail -10

// mongosh: query the system profiler for recent disk spills
db.system.profile.find({ usedDisk: true }).sort({ ts: -1 }).limit(5)

// mongosh: check average command latency (aggregations are counted as commands)
var l = db.serverStatus().opLatencies;
print("Command avg µs:", Math.floor(l.commands.latency / l.commands.ops));

// mongosh: check WiredTiger cache dirty ratio for I/O pressure
var c = db.serverStatus().wiredTiger.cache;
var dirty = 100 * c["tracked dirty bytes in the cache"] / c["maximum bytes configured"];
print("Cache dirty %:", dirty.toFixed(1));

# Shell: check disk space on the data volume (adjust the path to your dbPath)
df -h /data/db

// mongosh: check queue depth to see if spills are causing contention
var q = db.serverStatus().globalLock.currentQueue;
print("Queued readers:", q.readers, "writers:", q.writers);

How to diagnose it

  1. Confirm the error and the failing stage. Check application logs and the MongoDB log for “Exceeded memory limit”. Identify whether the failing stage is $group, $sort, or another stage. Note the error code: 16945 for $group, 16819 for $sort.
  2. Check whether allowDiskUse is enabled. Run getParameter for allowDiskUseByDefault on 6.0+. On earlier versions, the server default is false. If the client explicitly passed { allowDiskUse: false }, the pipeline will abort instead of spilling.
  3. Review the pipeline shape. Use db.currentOp() to find active aggregations. Look for stages before $group or $sort. If there is no selective $match near the start, the stage may be processing the entire collection.
  4. Run explain("executionStats"). Look for COLLSCAN, high docsExamined, and whether a $sort is using an index. An index-backed sort does not count toward the 100MB limit. If the sort stage shows a blocking SORT instead of an index stage, it is consuming the memory budget.
  5. Inspect the profiler for usedDisk. If usedDisk: true appears, the pipeline is spilling. Correlate the timestamp with disk I/O latency and WiredTiger cache pressure to determine if the spill is saturating storage.
  6. Check for non-spill accumulators. If the error persists despite allowDiskUse: true, inspect the $group stage for $push or $addToSet. These accumulators cannot spill. The fix is pipeline restructuring, not a flag change.
  7. Verify driver and framework defaults. Some drivers and ORMs override the server default.
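Step 4 can be partly automated. The sketch below walks an explain document and reports whether any plan stage is COLLSCAN or a blocking SORT. The helper names and the sample document are hypothetical, and the sample is hand-written rather than real explain output. Two caveats: the walk also visits rejected plans if they are present, and a $sort that executes in the aggregation layer appears as a $sort entry in the stages array rather than as a SORT plan stage, which this sketch does not look for.

```javascript
// Collect every plan-stage name (the string stored under a "stage" key)
// found anywhere in an explain document, regardless of nesting depth.
function collectPlanStages(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => collectPlanStages(child, found));
  } else if (node !== null && typeof node === "object") {
    if (typeof node.stage === "string") found.push(node.stage);
    Object.values(node).forEach((child) => collectPlanStages(child, found));
  }
  return found;
}

// Summarize the two plan signals mentioned in step 4.
function summarizeExplain(explainDoc) {
  const stages = collectPlanStages(explainDoc);
  return {
    collectionScan: stages.includes("COLLSCAN"),
    blockingSort: stages.includes("SORT"),
  };
}

// Hand-written sample shaped like part of an explain result (not real output).
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "SORT",
      inputStage: { stage: "COLLSCAN" },
    },
  },
};
console.log(summarizeExplain(sampleExplain));
```

In mongosh, the same function could be applied to db.orders.explain("executionStats").aggregate(pipeline), where orders and pipeline stand in for your own collection and pipeline.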

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| system.profile usedDisk | Detects aggregations writing temporary spill files | Any usedDisk: true on production query patterns |
| opLatencies.commands | Aggregations run as commands; rising latency reveals spill cost | Command avg or tail latency spikes during batch windows |
| wiredTiger.cache dirty ratio | Disk spills compete with checkpoint I/O and raise dirty data | Dirty ratio sustained above 10% during aggregation peaks |
| wiredTiger.log sync latency | Temp-file I/O and journal pressure show up as sync latency | Average sync latency above 30 ms sustained |
| globalLock.currentQueue | Slow disk spills hold tickets and cause queuing | readers + writers sustained above 20 |
| metrics.document.returned | Unfiltered pipelines examine and return excessive documents | returned rate spikes without a matching rise in opcounters.query |
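Several of these signals can be derived from a single serverStatus() result. The sketch below is a hypothetical helper (spillSignals is not a MongoDB function) that computes the same values as the quick checks and compares them with the thresholds in the table. It evaluates one sample only, so the "sustained" part of each warning sign still requires repeated sampling over time.

```javascript
// Derive spill-related signals from a serverStatus() document.
// Number() converts 64-bit values that a shell may return as wrapper objects.
function spillSignals(status) {
  const cache = status.wiredTiger.cache;
  const commands = status.opLatencies.commands;
  const queue = status.globalLock.currentQueue;

  const ops = Number(commands.ops);
  const dirtyPct = 100 * Number(cache["tracked dirty bytes in the cache"]) /
    Number(cache["maximum bytes configured"]);
  const queued = Number(queue.readers) + Number(queue.writers);

  return {
    // Average command latency in microseconds; 0 when no commands have run.
    avgCommandMicros: ops > 0 ? Number(commands.latency) / ops : 0,
    dirtyPct,
    queued,
    // Thresholds taken from the table above (single sample, not sustained).
    warnings: [
      dirtyPct > 10 ? "cache dirty ratio above 10%" : null,
      queued > 20 ? "queued operations above 20" : null,
    ].filter(Boolean),
  };
}
```

In mongosh this could be called as spillSignals(db.serverStatus()) from a script that runs at a fixed interval.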

Fixes

Enable or verify allowDiskUse

Before MongoDB 6.0, pass { allowDiskUse: true } in the aggregation command to allow eligible stages to spill. On MongoDB 6.0+, the server defaults to true, but explicitly passing { allowDiskUse: false } opts out. Some application frameworks override the default to false. Check your driver documentation and set the flag explicitly to match your intent.

In a sharded cluster, remember that allowDiskUse: true can move the merge stage to a randomly selected shard, concentrating I/O there.
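One way to make the intent explicit is to build the aggregate command through a small wrapper that refuses to run without a stated allowDiskUse value. This is a sketch under assumptions: buildAggregateCommand is a hypothetical helper, and the orders collection and its fields are illustrative.

```javascript
// Build an aggregate command document with allowDiskUse always stated,
// so neither a driver default nor a framework override decides for you.
function buildAggregateCommand(collectionName, pipeline, allowDiskUse) {
  if (typeof allowDiskUse !== "boolean") {
    throw new TypeError("allowDiskUse must be set explicitly to true or false");
  }
  return {
    aggregate: collectionName,
    pipeline,
    cursor: {}, // the aggregate command requires a cursor document
    allowDiskUse,
  };
}

const groupCommand = buildAggregateCommand(
  "orders", // hypothetical collection name
  [{ $group: { _id: "$customerId", total: { $sum: "$amount" } } }],
  true
);
// In mongosh: db.runCommand(groupCommand)
// Helper form: db.orders.aggregate(groupCommand.pipeline, { allowDiskUse: true })
```

Passing false instead of true gives a fail-fast variant for latency-sensitive paths where a spill should surface as an error rather than as slow I/O.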

Reduce stage input with early filtering

Move $match stages as early as possible in the pipeline to reduce the document count entering $group or $sort. If your use case allows it, add a $limit before the heavy stage. Every document that reaches the stage consumes part of the 100MB budget.
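As an illustration, the sketch below builds a pipeline in which a selective $match runs before the $group. The collection schema (status, createdAt, customerId, amount) and the helper name are assumptions made for the example.

```javascript
// Hypothetical "orders" documents: { status, createdAt, customerId, amount }.
function buildRevenuePipeline(since) {
  return [
    // Filter first so only recent, completed orders ever reach $group.
    { $match: { status: "completed", createdAt: { $gte: since } } },
    // The heavy stage now sees a fraction of the collection.
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
    { $sort: { total: -1 } },
  ];
}

const revenuePipeline = buildRevenuePipeline(new Date("2024-01-01T00:00:00Z"));
// In mongosh: db.orders.aggregate(revenuePipeline)
```

A leading $match can also use an index on the filtered fields, which a $match placed after $group cannot.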

Ensure $sort is index-backed

An index-backed $sort bypasses the 100MB memory restriction entirely because the sort is performed during the index traversal. If explain shows a blocking SORT stage, add an index that matches the sort fields and direction. If the query also has equality filters, place those fields before the sort fields in the index.
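The field ordering described above can be expressed as a small helper that builds the index key document. This is a sketch: indexKeysFor is a hypothetical name, and the status and createdAt fields are illustrative.

```javascript
// Build an index key document: equality-filter fields first, then the sort
// fields in the order and direction of the sort specification.
// Note: JavaScript objects keep insertion order only for non-numeric keys,
// so this assumes ordinary field names such as "status" or "createdAt".
function indexKeysFor(equalityFields, sortSpec) {
  const keys = {};
  for (const field of equalityFields) keys[field] = 1;
  for (const [field, direction] of Object.entries(sortSpec)) {
    if (!(field in keys)) keys[field] = direction;
  }
  return keys;
}

const sortIndexKeys = indexKeysFor(["status"], { createdAt: -1 });
// In mongosh: db.orders.createIndex(sortIndexKeys)
```

After creating the index, rerun explain to confirm that the blocking SORT stage is gone from the winning plan.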

Replace accumulators that cannot spill

If you are using $push or $addToSet inside $group and the grouped arrays grow large, allowDiskUse will not help. Restructure the pipeline to return grouped identifiers without building the full arrays server-side, or move the array assembly to the application layer. For very large grouping operations, consider whether an out-of-band process is more appropriate; map-reduce is another option, though it has been deprecated since MongoDB 5.0.
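One way to move array assembly to the application layer is to have the server return documents sorted by the group key and build each array client-side as the results stream in. The generator below is a minimal sketch of that idea, not a drop-in replacement for $group: groupSorted is a hypothetical helper, it assumes the input is already sorted by the key, and it compares keys with ===, so keys must be primitives such as strings or numbers.

```javascript
// Group consecutive documents that share a key. Assumes the input is already
// sorted by that key, for example by { $sort: { customerId: 1 } } on the server.
function* groupSorted(docs, keyOf) {
  let currentKey;
  let items = null; // null means no group is open yet
  for (const doc of docs) {
    const key = keyOf(doc);
    if (items !== null && key !== currentKey) {
      // Key changed: emit the finished group before starting the next one.
      yield { _id: currentKey, items };
      items = null;
    }
    if (items === null) {
      currentKey = key;
      items = [];
    }
    items.push(doc);
  }
  if (items !== null) yield { _id: currentKey, items };
}
```

Only one group is held in application memory at a time. With a driver cursor, which is usually consumed asynchronously, the same logic would need an async generator and for await.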

Address disk spill latency

If usedDisk is true and latency is unacceptable, enabling the flag is only a stopgap. The real fix is reducing the data volume the stage must process. If the pipeline is already optimized, the bottleneck is disk throughput. Ensure your storage layer can handle the temporary I/O without degrading journal sync latency or checkpoint performance. On cloud block storage, watch for burst-credit depletion during spill-heavy windows.

Prevention

  • Restructure pipelines to filter early. A $match at the start of the pipeline reduces the document count entering $group and $sort.
  • Index sort fields. An index-backed $sort does not count toward the 100MB limit.
  • Monitor profiler for usedDisk. Any production aggregation writing temp files is a candidate for pipeline optimization.
  • Set explicit allowDiskUse in driver code. Do not rely on driver defaults; declare the intent explicitly to avoid framework overrides.
  • Size disks for temp file overhead. Spilled aggregations write temporary files to the instance storage. Ensure the volume has enough free space to avoid secondary failures.

How Netdata helps

Netdata collects serverStatus metrics including opLatencies.commands and wiredTiger.cache dirty ratio. Disk latency and utilization charts show temp-file I/O saturation. Alerts on WiredTiger cache dirty ratio and application-thread evictions flag cache pressure cascades. Queue depth and ticket utilization metrics reveal when slow aggregations consume concurrency tickets.