MongoDB long-running operations: finding and killing the query holding a ticket
Your application latency just spiked. opLatencies show reads and writes climbing. globalLock.currentQueue is no longer zero. You check db.serverStatus().wiredTiger.concurrentTransactions: available tickets are near zero, but throughput has not increased. An operation is holding a ticket without making progress.
A collection scan, an unbounded aggregation, or a stalled write can hold a WiredTiger read or write ticket for minutes. The default is 128 read and 128 write tickets in most versions, so one long-running operation can cascade into system-wide queuing, connection pileup, and application timeouts. Find it and kill it, but killing the wrong operation can crash a node or leave data inconsistent.
What this means
WiredTiger uses ticket-based admission control: every storage operation must acquire a ticket before it can run. When one operation holds a ticket for minutes because of a missing index or an oversized aggregation, it starves every other operation in that class. The symptoms look like general saturation, but the root cause is usually one or two specific operations.
db.currentOp() and $currentOp show active operations, age, lock status, and ticket ownership. db.killOp() terminates by opid. The challenge is picking the right target, confirming it is safe, and knowing what happens after.
flowchart TD
A[Latency spikes and tickets near zero] --> B[currentOp: find ops >60s]
B --> C{waitingForLock?}
C -->|true| D[Victim, find the holder]
C -->|false| E[Holder, inspect command]
E --> F{Internal op?}
F -->|yes| G[Do not kill]
F -->|no| H{Write or read?}
H -->|read| I[killOp generally safe]
H -->|write| J[killOp with caution<br/>check transaction status]
D --> B
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Missing index causing collection scan | secs_running growing, slow query log shows COLLSCAN, docsExamined far exceeds docsReturned | db.currentOp() filtered by secs_running, then explain() on the same query shape |
| Large aggregation pipeline | op: "command", command.aggregate present, no progress fields, high microsecs_running | Namespace and pipeline stages in currentOp output |
| Runaway bulk write | op in ["insert", "update", "remove"], waitingForLock: false, secs_running high | currentOp output for write operations and metrics.document counters |
| Lock contention from DDL | waitingForLock: true on many ops, one op holding Database or Collection lock | currentOp filtered by waitingForLock: false to find the holder |
| Background index build or backup | desc contains index build details or backup cursor, progress indicators present | currentOp msg or progress fields; these are expected to be long-running |
Quick checks
Run these read-only commands to confirm the pattern before taking any destructive action.
// Active operations running longer than 10 seconds
db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(function(op) {
print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});
// Available WiredTiger tickets (most versions)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read: " + t.read.available + "/" + t.read.totalTickets);
print("Write: " + t.write.available + "/" + t.write.totalTickets);
// Queue depths
db.serverStatus().globalLock.currentQueue
// Server-side latency averages
var lat = db.serverStatus().opLatencies;
print("Read avg µs: " + (lat.reads.latency / lat.reads.ops));
print("Write avg µs: " + (lat.writes.latency / lat.writes.ops));
// Operations waiting for locks
db.currentOp({ waitingForLock: true }).inprog.forEach(function(op) {
print(op.opid + " | " + op.op + " | " + op.ns);
});
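// $currentOp aggregation stage (MongoDB 3.6+); it must run against the admin database
db.getSiblingDB("admin").aggregate([
  { $currentOp: { allUsers: true } },
  { $match: { active: true, secs_running: { $gt: 10 } } }
]).forEach(function(op) {
  // op.op is the operation kind (query, update, command, ...); op.type is only "op", "idleSession", or "idleCursor"
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s");
});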
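The ticket check above can be turned into a small summary function. This is a minimal sketch, not a definitive implementation: `ticketPressure` and the sample document are hypothetical names, the field layout assumes the `wiredTiger.concurrentTransactions` shape used in the quick check, and the thresholds (fewer than 10 tickets, or less than 25% of the total) are the warning signs used in this guide.

```javascript
// Summarize ticket pressure from a serverStatus-shaped document.
// Assumes the wiredTiger.concurrentTransactions.{read,write} layout.
function ticketPressure(status, minAbsolute = 10, minFraction = 0.25) {
  const tickets = status.wiredTiger.concurrentTransactions;
  const summary = {};
  for (const cls of ["read", "write"]) {
    const available = Number(tickets[cls].available);
    const total = Number(tickets[cls].totalTickets);
    summary[cls] = {
      available: available,
      total: total,
      inUse: total - available,
      // Flag the class when either threshold is crossed.
      low: available < minAbsolute || available < total * minFraction,
    };
  }
  return summary;
}

// Hypothetical sample; in mongosh you would pass db.serverStatus() instead.
const sampleStatus = {
  wiredTiger: {
    concurrentTransactions: {
      read: { out: 125, available: 3, totalTickets: 128 },
      write: { out: 8, available: 120, totalTickets: 128 },
    },
  },
};

const pressure = ticketPressure(sampleStatus);
console.log(JSON.stringify(pressure, null, 2));
```

Reporting read and write classes separately keeps the result aligned with the two ticket pools, so a drained read pool is not hidden by a healthy write pool.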
How to diagnose it
1. Confirm ticket exhaustion. Check db.serverStatus().wiredTiger.concurrentTransactions. If available tickets in either class are below 10, operations are queuing.
2. Find the longest-running operations. Use db.currentOp() or $currentOp filtered to active: true and sort by secs_running descending. Operations running longer than 300 seconds are suspicious unless they are index builds, backups, or validated maintenance tasks.
3. Distinguish holders from waiters. waitingForLock: false means the operation holds its locks and may be consuming the ticket. waitingForLock: true means it is a victim, not the cause.
4. Inspect the query shape. Read the command field for namespace and predicate. If planSummary or the slow query log shows COLLSCAN on a large collection, you have likely found the culprit.
5. Correlate with queue depth. If globalLock.currentQueue is growing while one operation holds a ticket, the link is clear.
6. Account for sharding. On a mongos, currentOp shows router-level operations. Run db.adminCommand({ currentOp: 1, $all: true }) on individual shards to see storage engine work. Shard-level opid values appear as prefixed strings such as shardB:79214.
7. Verify the operation is not internal. Cross-reference op and desc. Internal operations must not be killed. Visual confirmation in currentOp is required.
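The holder-versus-waiter triage described in these steps can be sketched as a pure function over the `inprog` array. The function name `triage`, the 60-second cutoff, and the sample operations are assumptions for illustration; the fields it reads (`active`, `secs_running`, `waitingForLock`, `planSummary`) are the ones named in this guide.

```javascript
// Rank long-running active operations and label each as holder or victim.
function triage(inprog, minSeconds = 60) {
  return inprog
    .filter((op) => op.active && Number(op.secs_running || 0) >= minSeconds)
    .sort((a, b) => Number(b.secs_running) - Number(a.secs_running))
    .map((op) => ({
      opid: op.opid,
      ns: op.ns,
      secs_running: Number(op.secs_running),
      // A waiting operation is a victim; a non-waiting one may be the holder.
      role: op.waitingForLock ? "victim" : "holder",
      collscan:
        typeof op.planSummary === "string" &&
        op.planSummary.includes("COLLSCAN"),
    }));
}

// Hypothetical sample; in mongosh you would pass db.currentOp().inprog.
const sampleOps = [
  { opid: 202, active: true, secs_running: 95, waitingForLock: true,
    ns: "shop.orders", op: "update" },
  { opid: 101, active: true, secs_running: 412, waitingForLock: false,
    ns: "shop.orders", op: "query", planSummary: "COLLSCAN" },
  { opid: 303, active: true, secs_running: 2, waitingForLock: false,
    ns: "shop.carts", op: "query", planSummary: "IXSCAN { _id: 1 }" },
];

const ranked = triage(sampleOps);
ranked.forEach((r) =>
  console.log(r.opid + " | " + r.role + " | " + r.secs_running + "s | collscan=" + r.collscan)
);
```

The output is only a shortlist. It does not decide whether an operation is internal, so the manual check in the last step still applies.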
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| currentOp longest-running operation age | Catches runaway queries before they exhaust tickets and cascade | >60 seconds in OLTP workloads; >300 seconds for any non-maintenance operation |
| WiredTiger available tickets | Direct measure of storage engine admission control saturation | Available drops below 25% of total sustained, or below 10 absolute |
| globalLock.currentQueue depth | Shows operations blocked waiting for resources | Sustained >20 with an upward trend |
| opLatencies reads and writes | User-visible latency degradation | Sustained >2x baseline for >5 minutes |
| Slow query log rate | Reveals inefficient query plans that will eventually hold tickets | Sudden spike from rolling baseline |
| Application-thread evictions | Indicates cache pressure often caused by long-lived snapshots | Any sustained nonzero rate |
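The opLatencies counters are cumulative since server start, so comparing against a baseline needs the average over an interval rather than the lifetime average. A minimal sketch, assuming two opLatencies snapshots taken some time apart; the helper names and the sample numbers are hypothetical.

```javascript
// Average latency per operation (microseconds) between two opLatencies snapshots.
function intervalLatencyMicros(before, after) {
  const result = {};
  for (const cls of ["reads", "writes"]) {
    const ops = Number(after[cls].ops) - Number(before[cls].ops);
    const latency = Number(after[cls].latency) - Number(before[cls].latency);
    // No operations in the interval means there is nothing to average.
    result[cls] = ops > 0 ? latency / ops : 0;
  }
  return result;
}

// True when the current value exceeds the baseline by the given factor.
function exceedsBaseline(current, baseline, factor = 2) {
  return current > baseline * factor;
}

// Hypothetical snapshots; in mongosh each would be db.serverStatus().opLatencies.
const earlier = {
  reads: { latency: 1000000, ops: 1000 },
  writes: { latency: 500000, ops: 500 },
};
const later = {
  reads: { latency: 1900000, ops: 1300 },
  writes: { latency: 500000, ops: 500 },
};

const interval = intervalLatencyMicros(earlier, later);
console.log("Read avg µs over interval: " + interval.reads);
console.log("Write avg µs over interval: " + interval.writes);
console.log("Reads above 2x baseline of 1200 µs: " + exceedsBaseline(interval.reads, 1200));
```

A single interval crossing the factor is not the "sustained" condition from the table; that would need the same check to hold across consecutive intervals.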
Fixes
Kill a read-only operation
Killing a read operation (find, aggregate, count) is the lowest-risk fix. It releases its ticket and locks at the next yield point. The client receives a cursor-killed or interrupted error.
// Identify the opid from currentOp, then:
db.killOp(12345)
Kill a write operation
Killing a write carries more risk. Single-document writes are atomic, but a killed multi-document write (for example, updateMany) stops after modifying some documents. There is no automatic rollback for non-transactional partial progress. If the write is inside a multi-document transaction, the transaction aborts and rolls back.
Only kill writes when:
- The write is not inside a multi-document transaction, or you accept the abort
- You have verified the opid is a client-initiated operation, not internal
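The checks in this section can be sketched as a pre-kill classifier. This is a heuristic under stated assumptions, not an authoritative rule: it treats an operation as client-initiated only when `desc` starts with "conn" and a `client` (or `client_s`) field is present, treats the presence of a `transaction` field as "inside a multi-document transaction", and uses hypothetical function names and result labels.

```javascript
const READ_COMMANDS = ["find", "aggregate", "count", "distinct"];

// Assumption: client connections have desc "conn<N>" plus a client address.
function looksClientInitiated(op) {
  return (
    typeof op.desc === "string" &&
    op.desc.startsWith("conn") &&
    Boolean(op.client || op.client_s)
  );
}

// Treat anything that is not clearly a read as a write, to stay cautious.
function writesData(op) {
  if (["insert", "update", "remove"].includes(op.op)) return true;
  const cmd = op.command || {};
  if (Array.isArray(cmd.pipeline)) {
    // Aggregations ending in $out or $merge write data.
    return cmd.pipeline.some((stage) => "$out" in stage || "$merge" in stage);
  }
  const isReadCommand = READ_COMMANDS.some((name) => name in cmd);
  return !isReadCommand && !["query", "getmore"].includes(op.op);
}

function killDecision(op) {
  if (op.waitingForLock) return "victim";
  if (!looksClientInitiated(op)) return "do-not-kill";
  if (!writesData(op)) return "kill-read";
  return op.transaction
    ? "kill-write-aborts-transaction"
    : "kill-write-partial-progress";
}

// Hypothetical operations shaped like currentOp output.
const candidates = [
  { opid: 1, op: "query", desc: "conn12", client: "10.0.0.5:51234",
    waitingForLock: false, command: { find: "orders", filter: { status: "open" } } },
  { opid: 2, op: "update", desc: "conn13", client: "10.0.0.6:40222",
    waitingForLock: false, command: { q: {}, u: { $set: { archived: true } }, multi: true } },
  { opid: 3, op: "update", desc: "conn14", client: "10.0.0.7:40900",
    waitingForLock: false, transaction: { parameters: { txnNumber: 7 } }, command: {} },
  { opid: 4, op: "none", desc: "ReplBatcher", waitingForLock: false },
  { opid: 5, op: "insert", desc: "conn15", client: "10.0.0.8:41000", waitingForLock: true },
];

const decisions = candidates.map((op) => ({ opid: op.opid, decision: killDecision(op) }));
decisions.forEach((d) => console.log(d.opid + " -> " + d.decision));
```

The classifier errs toward caution: an unrecognized command is treated as a write, and anything without a client connection is reported as do-not-kill. It is a filter for the operator to review, not a replacement for reading the currentOp output.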
Kill operations in sharded clusters
On a mongos, db.killOp() propagates to shards for many operations. For writes inside a session, use killSessions with the session lsid instead. For writes without a session, issue db.killOp() once for each affected shard, passing the shard-prefixed opid string, such as shardB:79214.
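When several shards are involved, it helps to split the shard-prefixed opid strings and group them per shard before issuing any kill. A minimal sketch; `parseOpid` and `groupByShard` are hypothetical helpers, and the format assumed is the `shardName:number` form shown above.

```javascript
// Split "shardB:79214" into its shard name and numeric opid.
// A plain numeric opid (unsharded or local) yields shard: null.
function parseOpid(opid) {
  if (typeof opid === "number") return { shard: null, opid: opid };
  const match = /^([^:]+):(\d+)$/.exec(String(opid));
  if (!match) throw new Error("Unrecognized opid format: " + opid);
  return { shard: match[1], opid: Number(match[2]) };
}

// Group opids by shard so each shard's operations can be reviewed together.
function groupByShard(opids) {
  const groups = {};
  for (const raw of opids) {
    const parsed = parseOpid(raw);
    const key = parsed.shard === null ? "(local)" : parsed.shard;
    if (!groups[key]) groups[key] = [];
    groups[key].push(parsed.opid);
  }
  return groups;
}

// Hypothetical opids as they might be collected from currentOp output.
const grouped = groupByShard(["shardB:79214", "shardA:1001", "shardB:79300", 12345]);
console.log(JSON.stringify(grouped));
```

Rejecting unrecognized formats with an error, rather than guessing, avoids passing a malformed identifier to a destructive command.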
Do not kill replication internal operations
Avoid killing any operation whose desc or op fields indicate replication internals.
Reduce load if killing is not safe
If the long-running operation is a legitimate bulk job and killing it would corrupt application state, reduce pressure instead:
- Pause or throttle the application job
- Add a missing index to prevent the collection scan on future runs
- Temporarily redirect reads to secondaries if the primary is overloaded
Prevention
- Track the maximum secs_running continuously. A metric that exposes the age of the oldest active operation catches runaway queries before ticket exhaustion occurs.
- Set driver-side maxTimeMS so queries cannot run indefinitely if the server is slow.
- Monitor ticket availability as a first-class signal. Declining available tickets during peak hours means operations are slowing inside the storage engine.
- Review slow query logs for COLLSCAN. Every collection scan on a production collection is a candidate for an index or a rewrite.
- Kill abandoned transactions and noCursorTimeout cursors. Long-running transactions and cursors pin WiredTiger snapshots and indirectly cause ticket contention by preventing eviction.
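One way to apply the maxTimeMS advice consistently is to add the limit wherever command documents are built. The sketch below assumes a hypothetical helper, `withTimeLimit`, and a hypothetical 5000 ms budget; `maxTimeMS` itself is the standard field accepted by commands such as find and aggregate.

```javascript
// Return a copy of a command document with a maxTimeMS limit applied.
// An explicit limit already present on the command is left untouched.
function withTimeLimit(command, maxTimeMS) {
  if (!Number.isInteger(maxTimeMS) || maxTimeMS <= 0) {
    throw new RangeError("maxTimeMS must be a positive integer");
  }
  if ("maxTimeMS" in command) return { ...command };
  return { ...command, maxTimeMS: maxTimeMS };
}

// Hypothetical commands; in mongosh the result could be passed to db.runCommand().
const findCmd = withTimeLimit({ find: "orders", filter: { status: "open" } }, 5000);
const aggCmd = withTimeLimit(
  { aggregate: "orders", pipeline: [{ $match: { status: "open" } }], cursor: {}, maxTimeMS: 30000 },
  5000
);

console.log(JSON.stringify(findCmd));
console.log(JSON.stringify(aggCmd));
```

Keeping an existing limit intact lets a known long-running job carry its own larger budget while everything else inherits the default.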
How Netdata helps
- Netdata charts longest-running operation age from currentOp alongside available WiredTiger tickets. You can see when one query starves the system.
- opLatencies and globalLock.currentQueue are collected automatically. Use them to see whether latency is server-wide or tied to a specific ticket class.
- WiredTiger cache dirty ratio and application-thread eviction counters provide context to distinguish a single bad query from a checkpoint stall or cache pressure cascade.
- Alerts on available tickets dropping below thresholds fire before applications time out, giving time to investigate instead of reacting to an outage.
Related guides
- How MongoDB actually works in production: a mental model for operators: /guides/mongodb/how-mongodb-works-in-production/
- MongoDB pages evicted by application threads: when eviction becomes user latency: /guides/mongodb/mongodb-application-thread-evictions/
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches: /guides/mongodb/mongodb-cache-dirty-ratio-high/
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes: /guides/mongodb/mongodb-cache-pressure-cascade/
- MongoDB cache too small: sizing the WiredTiger cache for your working set: /guides/mongodb/mongodb-cache-undersized-working-set/
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints: /guides/mongodb/mongodb-checkpoint-duration-high/
- MongoDB checkpoint stall write freeze: when all writes stop with no error: /guides/mongodb/mongodb-checkpoint-stall-write-freeze/
- MongoDB connection churn: high totalCreated rate and thread creation overhead: /guides/mongodb/mongodb-connection-churn/
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling: /guides/mongodb/mongodb-connection-limit-reached/
- MongoDB connection storm spiral: reconnection floods after an election or deploy: /guides/mongodb/mongodb-connection-storm-spiral/
- MongoDB flow control throttling writes: when the primary slows itself down: /guides/mongodb/mongodb-flow-control-throttling-writes/
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early: /guides/mongodb/mongodb-journal-sync-latency-high/







