MongoDB long-running operations: finding and killing the query holding a ticket

Your application latency just spiked. opLatencies show reads and writes climbing. globalLock.currentQueue is no longer zero. You check db.serverStatus().wiredTiger.concurrentTransactions: available tickets are near zero, but throughput has not increased. An operation is holding a ticket without making progress.

A collection scan, an unbounded aggregation, or a stalled write can hold a WiredTiger read or write ticket for minutes. The default is 128 read and 128 write tickets in most versions, so one long-running operation can cascade into system-wide queuing, connection pileup, and application timeouts. Find it and kill it, but killing the wrong operation can crash a node or leave data inconsistent.

What this means

WiredTiger uses ticket-based admission control: every storage operation must acquire a read or write ticket before it runs, and the pool in each class is finite. When one operation holds a ticket for minutes because of a missing index or an oversized aggregation, other operations queue behind it and the remaining tickets in that class drain. The symptoms look like general saturation, but the root cause is usually one or two specific operations.

db.currentOp() and $currentOp show active operations, age, lock status, and ticket ownership. db.killOp() terminates by opid. The challenge is picking the right target, confirming it is safe, and knowing what happens after.

flowchart TD
    A[Latency spikes and tickets near zero] --> B[currentOp: find ops >60s]
    B --> C{waitingForLock?}
    C -->|true| D[Victim, find the holder]
    C -->|false| E[Holder, inspect command]
    E --> F{Internal op?}
    F -->|yes| G[Do not kill]
    F -->|no| H{Write or read?}
    H -->|read| I[killOp generally safe]
    H -->|write| J[killOp with caution, check transaction status]
    D --> B
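
The decision flow above can be sketched as a small triage function over a single currentOp document. This is a minimal sketch, not a definitive rule set: the function name classifyOp and its return labels are hypothetical, and it assumes that client-initiated operations report a desc of the form conn<N>, treating everything else as internal.

```javascript
// Triage one currentOp document, following the flowchart.
// Assumption: client connections have desc like "conn42"; any other
// desc is treated as internal and therefore never a kill candidate.
function classifyOp(op, minSecs) {
  // Ignore idle or short-lived operations.
  if (!op.active || (op.secs_running || 0) <= minSecs) return "ignore";
  // A waiter is a victim; the holder is somewhere else.
  if (op.waitingForLock) return "victim";
  if (!/^conn\d+$/.test(op.desc || "")) return "internal";
  // Heuristic only: op "command" can also write (for example findAndModify),
  // so a "read" result still needs a look at the command field.
  var isWrite = ["insert", "update", "remove"].indexOf(op.op) !== -1;
  if (isWrite || op.transaction) return "write";
  return "read";
}

// Usage in mongosh (not executed here):
// db.currentOp({ active: true }).inprog.forEach(function (op) {
//   print(op.opid + " -> " + classifyOp(op, 60));
// });
```

A result of "read" or "write" only narrows the list. Read the full operation document before acting on it.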

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Missing index causing collection scan | secs_running growing, slow query log shows COLLSCAN, docsExamined far exceeds nreturned | db.currentOp() filtered by secs_running, then explain() on the same query shape |
| Large aggregation pipeline | op: "command", command.aggregate present, no progress fields, high microsecs_running | Namespace and pipeline stages in currentOp output |
| Runaway bulk write | op in ["insert", "update", "remove"], waitingForLock: false, secs_running high | currentOp output for write operations and metrics.document counters |
| Lock contention from DDL | waitingForLock: true on many ops, one op holding Database or Collection lock | currentOp filtered by waitingForLock: false to find the holder |
| Background index build or backup | desc contains index build details or backup cursor, progress indicators present | currentOp msg or progress fields; these are expected to be long-running |

Quick checks

Run these read-only commands to confirm the pattern before taking any destructive action.

// Active operations running longer than 10 seconds
db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});
// Available WiredTiger tickets (most versions)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read: " + t.read.available + "/" + t.read.totalTickets);
print("Write: " + t.write.available + "/" + t.write.totalTickets);
// Queue depths
db.serverStatus().globalLock.currentQueue
// Server-side latency averages (cumulative since startup, so compare two samples to see a recent spike)
var lat = db.serverStatus().opLatencies;
print("Read avg µs: " + (lat.reads.latency / lat.reads.ops));
print("Write avg µs: " + (lat.writes.latency / lat.writes.ops));
// Operations waiting for locks
db.currentOp({ waitingForLock: true }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.ns);
});
// $currentOp aggregation stage (preferred on MongoDB 6.2+, where the currentOp command is deprecated)
db.getSiblingDB("admin").aggregate([
  { $currentOp: { allUsers: true } },
  { $match: { active: true, secs_running: { $gt: 10 } } }
]).forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});
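
Because opLatencies counters accumulate from server start, a lifetime average can hide a spike that began a few minutes ago. A minimal sketch of an interval average follows; the helper name intervalAvgMicros is hypothetical, and it assumes each sample is the reads or writes subdocument of opLatencies with ops and latency fields.

```javascript
// Average latency in microseconds between two opLatencies samples.
// `before` and `after` are e.g. serverStatus().opLatencies.reads taken
// some seconds apart. Returns 0 when no operations ran in the interval.
function intervalAvgMicros(before, after) {
  var ops = Number(after.ops) - Number(before.ops);
  var latency = Number(after.latency) - Number(before.latency);
  return ops > 0 ? latency / ops : 0;
}

// Usage in mongosh (not executed here):
// var a = db.serverStatus().opLatencies;
// sleep(10000); // wait 10 seconds
// var b = db.serverStatus().opLatencies;
// print("Read avg µs, last 10s: " + intervalAvgMicros(a.reads, b.reads));
```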

How to diagnose it

  1. Confirm ticket exhaustion. Check db.serverStatus().wiredTiger.concurrentTransactions. If available tickets in either class are below 10, operations are queuing.

  2. Find the longest-running operations. Use db.currentOp() or $currentOp filtered to active: true and sort by secs_running descending. Operations running longer than 300 seconds are suspicious unless they are index builds, backups, or validated maintenance tasks.

  3. Distinguish holders from waiters. waitingForLock: false means the operation holds its locks and may be consuming the ticket. waitingForLock: true means it is a victim, not the cause.

  4. Inspect the query shape. Read the command field for namespace and predicate. If planSummary or the slow query log shows COLLSCAN on a large collection, you have likely found the culprit.

  5. Correlate with queue depth. If globalLock.currentQueue is growing while one operation holds a ticket, the link is clear.

  6. Account for sharding. On a mongos, currentOp output covers operations running on the shards, and shard-level opid values appear as prefixed strings such as shardB:79214. To see storage engine work directly, connect to the individual shard and run db.adminCommand({ currentOp: 1, $all: true }) there.

  7. Verify the operation is not internal. Cross-reference op and desc. Internal operations must not be killed, so read the full currentOp document yourself before issuing killOp instead of relying on a filter alone.
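
Steps 2 through 4 can be combined into one pass over the currentOp output. The following is a sketch under stated assumptions: rankSuspects is a hypothetical helper name, and it relies on the planSummary field being present in the operation document, which is not guaranteed for every operation type.

```javascript
// Rank likely ticket holders: active, not waiting for a lock,
// running at least `minSecs`, longest-running first.
function rankSuspects(inprog, minSecs) {
  return inprog
    .filter(function (op) {
      return op.active && !op.waitingForLock && (op.secs_running || 0) >= minSecs;
    })
    .sort(function (a, b) { return b.secs_running - a.secs_running; })
    .map(function (op) {
      return {
        opid: op.opid,
        ns: op.ns,
        op: op.op,
        secs_running: op.secs_running,
        // Flag collection scans when the plan summary is available.
        collscan: /COLLSCAN/.test(op.planSummary || "")
      };
    });
}

// Usage in mongosh (not executed here):
// printjson(rankSuspects(db.currentOp({ active: true }).inprog, 300));
```

The output is a shortlist for a human to review. It does not check whether an operation is internal.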

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| currentOp longest-running operation age | Catches runaway queries before they exhaust tickets and cascade | >60 seconds in OLTP workloads; >300 seconds for any non-maintenance operation |
| WiredTiger available tickets | Direct measure of storage engine admission control saturation | Available drops below 25% of total sustained, or below 10 absolute |
| globalLock.currentQueue depth | Shows operations blocked waiting for resources | Sustained >20 with an upward trend |
| opLatencies reads and writes | User-visible latency degradation | Sustained >2x baseline for >5 minutes |
| Slow query log rate | Reveals inefficient query plans that will eventually hold tickets | Sudden spike from rolling baseline |
| Application-thread evictions | Indicates cache pressure often caused by long-lived snapshots | Any sustained nonzero rate |
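
The ticket thresholds in the table can be expressed as a small check. This is an illustrative sketch: ticketBreach and sustainedBreach are hypothetical names, and "sustained" is interpreted here as every sample in a caller-supplied window breaching the threshold.

```javascript
// True when one sample of available tickets is below 10 absolute
// or below 25% of the total for that class.
function ticketBreach(available, total) {
  return available < 10 || available < 0.25 * total;
}

// True when every sample in the window breaches the threshold.
// `samples` is an array of available-ticket readings, oldest first.
function sustainedBreach(samples, total) {
  return samples.length > 0 && samples.every(function (available) {
    return ticketBreach(available, total);
  });
}

// Usage in mongosh (not executed here):
// var t = db.serverStatus().wiredTiger.concurrentTransactions;
// print(ticketBreach(t.read.available, t.read.totalTickets));
```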

Fixes

Kill a read-only operation

Killing a read operation (find, aggregate, count) is the lowest-risk fix. It releases its ticket and locks at the next yield point. The client receives a cursor-killed or interrupted error.

// Identify the opid from currentOp, then:
db.killOp(12345)
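
Between reading currentOp and issuing the kill, the operation may already have finished. A hedged sketch of a re-check before killing is shown below; killIfStillRunning is a hypothetical wrapper, and it takes the database handle as a parameter so the logic can be exercised outside the shell.

```javascript
// Re-check that the opid is still running against the expected
// namespace, then request the kill. Returns a short status string.
function killIfStillRunning(db, opid, expectedNs) {
  var match = db.currentOp({ opid: opid }).inprog[0];
  if (!match) return "not found";              // already finished
  if (match.ns !== expectedNs) return "namespace mismatch";
  db.killOp(opid);                             // takes effect at the next interrupt point
  return "kill requested";
}

// Usage in mongosh (not executed here):
// print(killIfStillRunning(db, 12345, "shop.orders"));
```

A "kill requested" result means the request was sent, not that the operation has stopped. Run currentOp again to confirm it is gone.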

Kill a write operation

Killing a write carries more risk. Single-document writes are atomic, but a killed multi-document write (for example, updateMany) stops after modifying some documents. There is no automatic rollback for non-transactional partial progress. If the write is inside a multi-document transaction, the transaction aborts and rolls back.

Only kill writes when:

  • The write is not inside a multi-document transaction, or you accept the abort
  • You have verified the opid is a client-initiated operation, not internal

Kill operations in sharded clusters

On a mongos, db.killOp() propagates to shards for many operations, including reads that span shards. For writes inside a session, use killSessions with the session lsid. For writes without a session, find the operation on every affected shard and issue db.killOp() from the mongos once per shard-prefixed opid string, such as shardB:79214.
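
For the session case, the killSessions command takes a list of session identifiers. The sketch below builds that command document from currentOp output; buildKillSessions is a hypothetical helper, and it assumes each operation's lsid subdocument carries the session UUID in its id field. It returns null instead of an empty list as a precaution, so that a command with no identifiers is never sent.

```javascript
// Build a killSessions command from the operations selected for removal.
// Deduplicates sessions, since several ops can share one session.
function buildKillSessions(ops) {
  var seen = {};
  var ids = [];
  ops.forEach(function (op) {
    if (!op.lsid || !op.lsid.id) return;   // op is not running in a session
    var key = String(op.lsid.id);
    if (seen[key]) return;
    seen[key] = true;
    ids.push({ id: op.lsid.id });
  });
  return ids.length > 0 ? { killSessions: ids } : null;
}

// Usage in mongosh (not executed here):
// var cmd = buildKillSessions(selectedOps);
// if (cmd) printjson(db.adminCommand(cmd));
```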

Do not kill replication internal operations

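Do not kill any operation whose desc or op fields indicate replication internals. Client-initiated operations normally report a desc of the form conn<N>; treat anything else as internal until you have confirmed otherwise.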

Reduce load if killing is not safe

If the long-running operation is a legitimate bulk job and killing it would corrupt application state, reduce pressure instead:

  • Pause or throttle the application job
  • Add a missing index to prevent the collection scan on future runs
  • Temporarily redirect reads to secondaries if the primary is overloaded

Prevention

  • Track the maximum secs_running continuously. A metric that exposes the age of the oldest active operation catches runaway queries before ticket exhaustion occurs.
  • Set driver-side maxTimeMS so queries cannot run indefinitely if the server is slow.
  • Monitor ticket availability as a first-class signal. Declining available tickets during peak hours means operations are slowing inside the storage engine.
  • Review slow query logs for COLLSCAN. Every collection scan on a production collection is a candidate for an index or a rewrite.
  • Kill abandoned transactions and noCursorTimeout cursors. Long-running transactions and cursors pin WiredTiger snapshots and indirectly cause ticket contention by preventing eviction.
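
The first item in the list above, the age of the oldest active operation, reduces to a single number that a monitoring job can export. A minimal sketch, with the hypothetical name oldestActiveOpAge; note that it does not exclude index builds or backups, so a collector built on it would need its own allowlist for expected long-running work.

```javascript
// Largest secs_running among active operations, or 0 when none are active.
function oldestActiveOpAge(inprog) {
  return inprog.reduce(function (max, op) {
    return op.active && op.secs_running > max ? op.secs_running : max;
  }, 0);
}

// Usage in mongosh (not executed here):
// print(oldestActiveOpAge(db.currentOp({ active: true }).inprog));
```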

How Netdata helps

  • Netdata charts longest-running operation age from currentOp alongside available WiredTiger tickets. You can see when one query starves the system.
  • opLatencies and globalLock.currentQueue are collected automatically. Use them to see whether latency is server-wide or tied to a specific ticket class.
  • WiredTiger cache dirty ratio and application-thread eviction counters provide context to distinguish a single bad query from a checkpoint stall or cache pressure cascade.
  • Alerts on available tickets dropping below thresholds fire before applications time out, giving time to investigate instead of reacting to an outage.

Related guides

  • How MongoDB actually works in production: a mental model for operators: /guides/mongodb/how-mongodb-works-in-production/
  • MongoDB pages evicted by application threads: when eviction becomes user latency: /guides/mongodb/mongodb-application-thread-evictions/
  • MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches: /guides/mongodb/mongodb-cache-dirty-ratio-high/
  • MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes: /guides/mongodb/mongodb-cache-pressure-cascade/
  • MongoDB cache too small: sizing the WiredTiger cache for your working set: /guides/mongodb/mongodb-cache-undersized-working-set/
  • MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints: /guides/mongodb/mongodb-checkpoint-duration-high/
  • MongoDB checkpoint stall write freeze: when all writes stop with no error: /guides/mongodb/mongodb-checkpoint-stall-write-freeze/
  • MongoDB connection churn: high totalCreated rate and thread creation overhead: /guides/mongodb/mongodb-connection-churn/
  • MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling: /guides/mongodb/mongodb-connection-limit-reached/
  • MongoDB connection storm spiral: reconnection floods after an election or deploy: /guides/mongodb/mongodb-connection-storm-spiral/
  • MongoDB flow control throttling writes: when the primary slows itself down: /guides/mongodb/mongodb-flow-control-throttling-writes/
  • MongoDB journal sync latency high: the storage signal that warns 60 seconds early: /guides/mongodb/mongodb-journal-sync-latency-high/