MongoDB operation exceeded time limit (MaxTimeMSExpired): maxTimeMS and killed operations

Error code 50, MaxTimeMSExpired, means the server killed an operation that exceeded its processing budget. Raising the timeout without fixing the root cause turns acute failures into chronic resource exhaustion. The operation was already pathologically slow; maxTimeMS ended it before it consumed more resources or held locks and tickets indefinitely.

maxTimeMS sets a cumulative processing budget in milliseconds. MongoDB enforces it using the same interrupt mechanism as killOp, terminating the operation only at designated interrupt points. Idle time between cursor batches does not count toward the limit, and on direct connections network latency is excluded from the server-side clock. On sharded clusters, however, latency between mongos and shard mongod instances counts against the limit. Distinguish a true MaxTimeMSExpired from a client-side socket timeout, where the client gives up before the server responds.
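A minimal sketch of telling the two failure modes apart in application code. The helper name `classifyTimeoutError` and the sample error objects are illustrative assumptions; the only facts taken from the server are error code 50 and the code name MaxTimeMSExpired. Real driver exceptions differ by language and version, so treat the shapes below as placeholders.

```javascript
// Hypothetical helper: classify an error caught around a query that set maxTimeMS.
// Assumption: server errors carry a numeric `code`, while driver-side
// socket or network timeouts do not.
function classifyTimeoutError(err) {
  if (err && (err.code === 50 || err.codeName === "MaxTimeMSExpired")) {
    return "server-side: MaxTimeMSExpired";
  }
  if (err && typeof err.code === "number") {
    return "other server error";
  }
  return "client-side: no MongoDB error code (likely socket or network timeout)";
}

// Illustrative error shapes, not captured from a real deployment.
const serverKill = { code: 50, codeName: "MaxTimeMSExpired", message: "operation exceeded time limit" };
const socketGiveUp = { name: "MongoNetworkTimeoutError", message: "connection timed out" };

console.log(classifyTimeoutError(serverKill));
console.log(classifyTimeoutError(socketGiveUp));
```

In mongosh the budget itself is set per operation, for example `db.orders.find({ status: "open" }).maxTimeMS(5000)`, where `orders` and the filter are placeholder names.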

What this means

When the server kills the operation, it releases whatever resources the operation held: WiredTiger read or write tickets, cache space, and locks. Up to that point, the operation may have been scanning millions of documents or pinning an old snapshot.

This error is a symptom. Long-running operations can block eviction and trigger cache pressure cascades, so a higher limit only gives a slow operation more time to do damage. Find why the operation was slow and fix that.

flowchart TD
    A[Client sees MaxTimeMSExpired] --> B{Server-side or socket timeout?}
    B -->|Error code 50| C[Server killed operation]
    B -->|Network exception| D[Client timed out first]
    C --> E[currentOp shows long-running op]
    E --> F{Slow query plan?}
    F -->|COLLSCAN or bad IXSCAN| G[Missing index or plan regression]
    F -->|Plan is good| H[System saturation]
    H --> I[Cache dirty ratio high or tickets exhausted]
    D --> J[Raise socketTimeoutMS above maxTimeMS]
    G --> K[Build index or fix query]
    I --> L[Kill runaway ops or reduce load]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Missing or dropped index | Slow query log shows COLLSCAN or keysExamined:docsReturned > 100:1 | db.collection.getIndexes() and compare to query predicates |
| Query plan regression | Query was fast yesterday, slow today; same shape, different plan | explain("executionStats") or plan cache state |
| Cache pressure or ticket exhaustion | opLatencies spiking for all operations, not just one; app-thread evictions rising | serverStatus().wiredTiger.cache and concurrentTransactions |
| Runaway aggregation or large $lookup | currentOp shows aggregate with huge docsReturned or long secs_running | db.currentOp({ "active": true, "secs_running": { "$gt": 60 } }) |
| Heavy load on secondary | Timeouts appear only on secondary reads while primary is healthy | rs.status() for lag and serverStatus().flowControl |
| Long-lived cursor with high maxTimeMS | Cursor killed after extended runtime despite high limit | Session idle lifetime and metrics.cursor |

Quick checks

These are read-only unless otherwise noted.

# Check operations running longer than 10 seconds
mongosh --quiet --eval 'db.currentOp({ "active": true, "secs_running": { "$gt": 10 } }).inprog.forEach(function(op) { print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns + " | " + JSON.stringify(op.command || {}).substring(0,120)); })'

# Tail the slow query log for recent timeouts
grep -E "MaxTimeMSExpired|Slow query" /var/log/mongodb/mongod.log | tail -20

// Check WiredTiger cache pressure and dirty ratio
var c = db.serverStatus().wiredTiger.cache;
print("Cache fill: " + (100 * c["bytes currently in the cache"] / c["maximum bytes configured"]).toFixed(1) + "%");
print("Dirty ratio: " + (100 * c["tracked dirty bytes in the cache"] / c["maximum bytes configured"]).toFixed(1) + "%");

// Check available read and write tickets
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read tickets available: " + t.read.available + " / " + t.read.totalTickets);
print("Write tickets available: " + t.write.available + " / " + t.write.totalTickets);

// Check replication lag if secondaries are timing out
rs.printSecondaryReplicationInfo()

// Sample the system profiler for slow operations
db.system.profile.find().sort({ ts: -1 }).limit(5).forEach(function(doc) { print(doc.ts + " | " + doc.ns + " | " + doc.millis + "ms | " + doc.planSummary); });
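
Profiler documents also carry the counters behind the examined-to-returned ratios discussed in this guide. The sketch below computes them for one entry; the helper name `scanRatios` and the sample entry are illustrative assumptions, and the field names (`keysExamined`, `docsExamined`, `nreturned`) are those used in system.profile documents, where the returned count is named `nreturned`.

```javascript
// Hypothetical helper: compute examined-to-returned ratios for one
// profiler entry. Missing counters are treated as zero.
function scanRatios(entry) {
  // Guard against division by zero when the operation returned nothing.
  const returned = Math.max(entry.nreturned || 0, 1);
  return {
    keysPerReturned: (entry.keysExamined || 0) / returned,
    docsPerReturned: (entry.docsExamined || 0) / returned,
  };
}

// Illustrative entry, not real profiler output.
const sampleEntry = {
  ns: "shop.orders",
  millis: 4200,
  planSummary: "COLLSCAN",
  keysExamined: 0,
  docsExamined: 250000,
  nreturned: 25,
};

const ratios = scanRatios(sampleEntry);
console.log(sampleEntry.planSummary + " docsExamined:returned = " + ratios.docsPerReturned + ":1");
```

In mongosh, the same function could be applied inside the `forEach` over `db.system.profile.find()` to flag entries whose ratio exceeds 100:1.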

How to diagnose it

  1. Confirm it is server-side. A MaxTimeMSExpired response includes error code 50 and "codeName": "MaxTimeMSExpired". Client-side socket timeouts manifest as network exceptions in the driver without a MongoDB error code. If socketTimeoutMS is less than or equal to maxTimeMS, the client may give up before the server returns the error, masking the root cause.

  2. Capture the operation in currentOp. Run the currentOp query from the quick checks. Look for:

    • High secs_running
    • waitingForLock: true
    • op: "query" or "command" with aggregation stages
    • Large docsExamined vs docsReturned ratios in the slow log
  3. Correlate with the slow query log. Filter for the same ns (namespace) and time window. Key ratios:

    • keysExamined / docsReturned should be near 1:1 for indexed queries (the slow query log records the returned count as nreturned). A ratio of 100:1 indicates a badly targeted index scan.
    • docsExamined / docsReturned near 1:1 is healthy. A ratio of 1000:1 means nearly every document examined was discarded, which is typical of a missing index or a collection scan.
  4. Check for system-wide pressure. If many unrelated operations are timing out, look at:

    • WiredTiger cache dirty ratio > 15%
    • Application-thread evictions incrementing
    • Available tickets below 25% of total
    • Queue depths (globalLock.currentQueue) sustained above 20

    If these are elevated, the root cause is saturation, not a single bad query.
  5. Check replication state for secondary timeouts. If reads with secondaryPreferred are failing while primary reads succeed, check replication lag. A secondary under heavy oplog application load may be slow to respond. Also verify the secondary is not in RECOVERING.

  6. Inspect cursors. If the timed-out operation is a long-running analytical cursor, check db.serverStatus().metrics.cursor. If noTimeout cursors are high, or if the session has been idle, the operation may have been killed by the session idle timeout rather than maxTimeMS.
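
The system-wide pressure check in step 4 can be sketched as a function over a serverStatus()-shaped document. This is an illustration under stated assumptions: the helper name `saturationSignals` and the sample numbers are invented, the thresholds are the ones quoted in step 4, and the `wiredTiger.concurrentTransactions` path follows the quick checks above (it may differ on other server versions). A single sample cannot show that a value is sustained, so repeat the check over time.

```javascript
// Hypothetical helper: flag saturation signals in one serverStatus() sample.
function saturationSignals(status) {
  const cache = status.wiredTiger.cache;
  const tickets = status.wiredTiger.concurrentTransactions;
  const flags = [];

  // Dirty ratio above 15% of the configured cache size.
  const dirtyPct = 100 * cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"];
  if (dirtyPct > 15) flags.push("cache dirty ratio " + dirtyPct.toFixed(1) + "%");

  // Available tickets below 25% of the total, checked for reads and writes.
  for (const kind of ["read", "write"]) {
    const t = tickets[kind];
    if (t.available / t.totalTickets < 0.25) flags.push(kind + " tickets low");
  }

  // Queue depth above 20.
  const queued = status.globalLock.currentQueue.total;
  if (queued > 20) flags.push("queue depth " + queued);

  return flags;
}

// Illustrative numbers only.
const sampleStatus = {
  wiredTiger: {
    cache: { "tracked dirty bytes in the cache": 200, "maximum bytes configured": 1000 },
    concurrentTransactions: {
      read: { available: 10, totalTickets: 128 },
      write: { available: 100, totalTickets: 128 },
    },
  },
  globalLock: { currentQueue: { total: 35 } },
};

console.log(saturationSignals(sampleStatus));
```

In mongosh, `saturationSignals(db.serverStatus())` would run the same check against a live node. An empty array suggests looking for a single bad query instead.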

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Slow query rate | Directly precedes MaxTimeMSExpired spikes | Sustained increase above baseline |
| docsExamined:docsReturned ratio | Reveals wasted work per operation | Ratio > 100:1 for OLTP queries |
| WiredTiger cache dirty ratio | Dirty data accumulation causes checkpoint stalls and global slowdown | > 10% sustained |
| Application-thread evictions | Indicates background eviction cannot keep up; latency spikes follow | Any sustained nonzero rate |
| Available read/write tickets | Ticket exhaustion makes all operations queue | < 25% of total available |
| currentOp max operation age | Catches runaway queries before they cascade | Any non-background op > 300s |
| Replication lag | Explains secondary-only timeouts | > 10s sustained or > 25% of oplog window |
| opcounters throughput | Sudden drop suggests global blocking | > 50% drop from baseline |

Fixes

Fix the query, not the timeout

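If currentOp and the slow log show a collection scan or an inefficient index scan, add or restore the correct index. On MongoDB 4.2 and later, the background option is deprecated and ignored: every index build holds an exclusive collection lock only briefly at the start and end of the build. On older versions, pass { background: true } so the build does not block reads and writes for its full duration.

// MongoDB 4.2 and later: no background option needed
db.collection.createIndex({ field: 1 });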

If the query planner has regressed, evict the bad plan from the cache or force an index with hint() as a temporary measure. Compare the winning plan in explain("executionStats") to the expected index.
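
Comparing the winning plan to the expected index can be sketched as a small tree walk. The helper name `planUsesIndex`, the index name, and the sample plans are illustrative assumptions; the walk relies on explain output nesting child stages under `inputStage` or `inputStages`, and on IXSCAN stages carrying an `indexName` field. Some server versions nest the plan under `queryPlan`, which the sketch also follows as an assumption.

```javascript
// Hypothetical helper: report whether any IXSCAN stage in a winning plan
// uses the expected index name.
function planUsesIndex(plan, indexName) {
  if (!plan || typeof plan !== "object") return false;
  if (plan.stage === "IXSCAN" && plan.indexName === indexName) return true;

  const children = [];
  if (plan.inputStage) children.push(plan.inputStage);
  if (Array.isArray(plan.inputStages)) children.push(...plan.inputStages);
  if (plan.queryPlan) children.push(plan.queryPlan); // assumption: newer explain layout

  return children.some((child) => planUsesIndex(child, indexName));
}

// Illustrative plans, not real explain output.
const goodPlan = { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "status_1_createdAt_-1" } };
const regressedPlan = { stage: "COLLSCAN" };

console.log(planUsesIndex(goodPlan, "status_1_createdAt_-1"));
console.log(planUsesIndex(regressedPlan, "status_1_createdAt_-1"));
```

In mongosh, the plan to inspect comes from `db.collection.find(filter).explain("executionStats").queryPlanner.winningPlan`. If the expected index is missing from it, `db.collection.getPlanCache().clear()` drops the cached plans for that collection, and `.hint()` on the query forces an index while the regression is investigated.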

Reduce resource consumption

For aggregations that time out due to data volume:

  • Push $match stages as early as possible in the pipeline.
  • Use $project or aggregation $unset to reduce document size.
  • Add $limit if the application only needs a subset.
  • For large $lookup operations, ensure the foreign collection has an index on the foreignField.
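
The ordering advice above can be illustrated with a pipeline definition. All collection and field names here (`orders`, `customers`, `customerId`, `status`, `createdAt`, `total`) are placeholders, and the stage order is one reasonable arrangement rather than a rule. Placing $limit before $lookup changes which documents are joined, so it only fits when the application wants a subset of the matched documents.

```javascript
// Illustrative pipeline with placeholder names.
const since = new Date("2024-01-01T00:00:00Z");

const pipeline = [
  // Filter first so later stages see as few documents as possible.
  { $match: { status: "open", createdAt: { $gte: since } } },
  // Keep only the fields the application needs.
  { $project: { customerId: 1, total: 1, createdAt: 1 } },
  // Cap the number of documents that reach the join.
  { $limit: 500 },
  // Join on _id, which always has an index in the foreign collection.
  { $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" } },
];

// Print the stage order as a quick sanity check.
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

In mongosh this could be run as `db.orders.aggregate(pipeline, { maxTimeMS: 5000 })`, with the 5-second budget being an example value.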

Kill and reroute

If an operation is already running and blocking others, kill it:

db.killOp(<opid>)

Warning: killOp is best-effort and may not terminate immediately. Killing a write operation may leave multi-document writes partially completed. After killing a long-running write, verify data consistency in the affected collection.
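
Before calling killOp, it helps to shortlist candidates rather than kill by hand from raw currentOp output. The sketch below filters an `inprog` array; the helper name `killCandidates`, the sample operations, and the namespace filter are assumptions. In particular, skipping `local.` and `admin.` namespaces is only a rough heuristic for avoiding internal operations, so review every opid before killing it.

```javascript
// Hypothetical helper: shortlist long-running user operations from
// db.currentOp().inprog, longest first.
function killCandidates(inprog, minSecs) {
  return inprog
    .filter((op) => op.active && op.secs_running > minSecs)
    // Rough heuristic (assumption): skip ops without a namespace or on
    // internal databases.
    .filter((op) => op.ns && !op.ns.startsWith("local.") && !op.ns.startsWith("admin."))
    .sort((a, b) => b.secs_running - a.secs_running)
    .map((op) => ({ opid: op.opid, ns: op.ns, secs: op.secs_running }));
}

// Illustrative operations, not real currentOp output.
const sampleOps = [
  { opid: 101, active: true, secs_running: 420, ns: "shop.orders", op: "command" },
  { opid: 102, active: true, secs_running: 900, ns: "local.oplog.rs", op: "getmore" },
  { opid: 103, active: true, secs_running: 5, ns: "shop.carts", op: "query" },
  { opid: 104, active: false, secs_running: 0, ns: "shop.orders", op: "none" },
  { opid: 105, active: true, secs_running: 610, ns: "shop.events", op: "command" },
];

console.log(killCandidates(sampleOps, 60));
```

In mongosh, `killCandidates(db.currentOp({ active: true }).inprog, 60)` would produce the shortlist from a live node.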

If the workload is legitimate but heavy, move it to a hidden secondary or an analytics node, or schedule it during low-traffic windows.

Address saturation

If the root cause is cache pressure or ticket exhaustion:

  • Pause batch jobs or bulk imports to reduce write pressure.
  • Kill unnecessary long-running transactions or noCursorTimeout cursors that pin snapshots.
  • Check storage health with iostat -x 1 for elevated await or %util.
  • If storage is degraded, step down the primary to shift writes to a healthier member.

Warning: Stepping down the primary triggers an election and interrupts writes. Use only during a maintenance window or confirmed storage degradation.

Prevention

  • Monitor slow query trends, not just max age. A query that drifts from 10 ms to 500 ms over a week will eventually hit any reasonable maxTimeMS. Trend the 95th-percentile duration of entries in the slow query log.
  • Set operation-class timeouts. OLTP reads should have a tight maxTimeMS (for example, 5 seconds). Long analytical queries can have a higher limit, but only if the query is efficient and the infrastructure can support it.
  • Audit indexes after every deployment. Use $indexStats to confirm critical indexes are being used. If a key index shows zero operations after restart, investigate before the plan cache warms with a bad plan.
  • Keep headroom in cache and tickets. Operate WiredTiger cache below 80% fill and below 5% dirty during peak. Keep available tickets above 25% of total. These margins absorb transient slowdowns without cascading into timeouts.
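
Operation-class timeouts can be kept in one place and checked against the driver's socket timeout. This is a sketch under assumptions: the class names, the write and analytics budgets, and the helper name `checkSocketTimeout` are hypothetical; only the 5-second OLTP read figure comes from the example above. The check also assumes that a socket timeout of 0 disables the timeout, as it does in several drivers.

```javascript
// Hypothetical per-class budgets in milliseconds.
const timeoutsMs = {
  oltpRead: 5000,    // example value from this guide
  oltpWrite: 5000,   // placeholder
  analytics: 120000, // placeholder
};

// The socket timeout should exceed the largest maxTimeMS so the server's
// MaxTimeMSExpired error arrives before the client gives up.
function checkSocketTimeout(socketTimeoutMS, budgets) {
  const largest = Math.max(...Object.values(budgets));
  // Assumption: 0 means "no socket timeout".
  const ok = socketTimeoutMS === 0 || socketTimeoutMS > largest;
  return { largest: largest, ok: ok };
}

console.log(checkSocketTimeout(30000, timeoutsMs));
console.log(checkSocketTimeout(150000, timeoutsMs));
```

A failed check means some operation classes would surface as network exceptions instead of error code 50, which hides the server-side cause.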

How Netdata helps

  • Correlate MaxTimeMSExpired spikes with per-second opLatencies, scanned/returned ratios, and slow query rates to distinguish a single bad query from global pressure.
  • Alert on WiredTiger cache dirty ratio and application-thread evictions before they drive operations into timeout.
  • Surface ticket utilization and queue depth to catch storage engine saturation before queries start dying.
  • Track currentOp age and replication lag to catch secondary-side timeouts early.