MongoDB long-running operations: finding and killing the query holding a ticket

Your application latency just spiked. opLatencies show reads and writes climbing. globalLock.currentQueue is no longer zero. You check db.serverStatus().wiredTiger.concurrentTransactions: available tickets are near zero, but throughput has not increased. An operation is holding a ticket without making progress.

A collection scan, an unbounded aggregation, or a stalled write can hold a WiredTiger read or write ticket for minutes. The default is 128 read and 128 write tickets in most versions, so one long-running operation can cascade into system-wide queuing, connection pileup, and application timeouts. Find it and kill it, but killing the wrong operation can crash a node or leave data inconsistent.

What this means

WiredTiger uses ticket-based admission control: every storage operation must acquire a read or write ticket before it runs, and the pool is finite (128 per class by default in most versions). When one operation holds a ticket for minutes because of a missing index or an oversized aggregation, it starves every other operation in that class. The symptoms look like general saturation, but the root cause is usually one or two specific operations.

db.currentOp() and $currentOp show active operations with their age, lock status, and command, which is usually enough to infer which operation is holding a ticket. db.killOp() terminates an operation by opid. The challenge is picking the right target, confirming it is safe to kill, and knowing what happens afterward.

flowchart TD
    A[Latency spikes and tickets near zero] --> B[currentOp: find ops >60s]
    B --> C{waitingForLock?}
    C -->|true| D[Victim, find the holder]
    C -->|false| E[Holder, inspect command]
    E --> F{Internal op?}
    F -->|yes| G[Do not kill]
    F -->|no| H{Write or read?}
    H -->|read| I[killOp generally safe]
    H -->|write| J[killOp with caution<br/>check transaction status]
    D --> B
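
The decision flow above can be sketched as a small pure function over a single currentOp entry. This is an illustration under stated assumptions, not a MongoDB API: classifyOp, its return labels, and the isInternal flag (which the caller decides by inspecting the entry) are hypothetical names. It assumes reads appear as op "query" or "getmore", or as a command carrying find, count, distinct, or an aggregate without $out or $merge. Anything else is treated conservatively as a write.

```javascript
// Hypothetical helper: maps one currentOp entry to a branch of the flowchart.
function classifyOp(op, isInternal) {
  // Waiters are victims of the problem, not the cause.
  if (op.waitingForLock === true) return "victim";
  // Internal operations are never kill candidates.
  if (isInternal) return "do-not-kill";

  const cmd = op.command || {};
  const pipeline = Array.isArray(cmd.pipeline) ? cmd.pipeline : [];
  // An aggregation containing $out or $merge writes data.
  const writesViaPipeline = pipeline.some(
    (stage) => "$out" in stage || "$merge" in stage
  );
  const readCommands = ["find", "count", "distinct", "aggregate"];
  const isReadCommand =
    op.op === "command" &&
    readCommands.some((name) => name in cmd) &&
    !writesViaPipeline;
  const isRead = op.op === "query" || op.op === "getmore" || isReadCommand;

  // Anything not positively identified as a read is handled as a write.
  return isRead ? "read-kill-generally-safe" : "write-kill-with-caution";
}
```

In mongosh this could be applied to each entry of db.currentOp({ active: true }).inprog. The label is a prompt for a human decision, not an instruction to kill.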

Common causes

Cause	What it looks like	First thing to check
Missing index causing collection scan	`secs_running` growing, slow query log shows `COLLSCAN`, `docsExamined` far exceeds `nreturned`	`db.currentOp()` filtered by `secs_running`, then `explain()` on the same query shape
Large aggregation pipeline	`op: "command"`, `command.aggregate` present, no progress fields, high `microsecs_running`	Namespace and pipeline stages in `currentOp` output
Runaway bulk write	`op` in `["insert", "update", "remove"]`, `waitingForLock: false`, `secs_running` high	`currentOp` output for write operations and `metrics.document` counters
Lock contention from DDL	`waitingForLock: true` on many ops, one op holding `Database` or `Collection` lock	`currentOp` filtered by `waitingForLock: false` to find the holder
Background index build or backup	`desc` contains index build details or backup cursor, progress indicators present	`currentOp` `msg` or `progress` fields; these are expected to be long-running
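
For the first row, the explain() check can be automated with a short tree walk. The sketch below assumes a find-style explain document in which queryPlanner.winningPlan is either the plan tree itself or, on servers using the slot-based engine, wraps it under queryPlan. Aggregation explain output has a different shape and is not handled. usesCollScan is a hypothetical name.

```javascript
// Returns true if any stage in the winning plan is a collection scan.
// Assumes a find-style explain document; aggregation explains differ.
function usesCollScan(explainOutput) {
  const winning = explainOutput.queryPlanner.winningPlan;
  // Some server versions nest the classic plan tree under `queryPlan`.
  const root = winning.queryPlan || winning;
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.stage === "COLLSCAN") return true;
    // Stages have either a single child or a list of children.
    if (node.inputStage) stack.push(node.inputStage);
    for (const child of node.inputStages || []) stack.push(child);
  }
  return false;
}
```

A possible use in mongosh, with a made-up collection and filter: usesCollScan(db.orders.find({ status: "open" }).explain()).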

Quick checks

Run these read-only commands to confirm the pattern before taking any destructive action.

// Active operations running longer than 10 seconds
db.currentOp({ active: true, secs_running: { $gt: 10 } }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});

// Available WiredTiger tickets (most versions)
var t = db.serverStatus().wiredTiger.concurrentTransactions;
print("Read: " + t.read.available + "/" + t.read.totalTickets);
print("Write: " + t.write.available + "/" + t.write.totalTickets);

// Queue depths
db.serverStatus().globalLock.currentQueue

// Server-side latency averages
var lat = db.serverStatus().opLatencies;
print("Read avg µs: " + (lat.reads.latency / lat.reads.ops));
print("Write avg µs: " + (lat.writes.latency / lat.writes.ops));

// Operations waiting for locks
db.currentOp({ waitingForLock: true }).inprog.forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.ns);
});

// $currentOp aggregation stage (available since 3.6; preferred from 6.2, where the currentOp command is deprecated)
db.getSiblingDB("admin").aggregate([
  { $currentOp: { allUsers: true } },
  { $match: { active: true, secs_running: { $gt: 10 } } }
]).forEach(function(op) {
  print(op.opid + " | " + op.op + " | " + op.secs_running + "s | " + op.ns);
});
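
The opLatencies counters are cumulative since server start, so dividing latency by ops gives a lifetime average that can hide a recent spike. A rough sketch of an interval average from two samples follows. intervalLatencyMicros is a hypothetical helper, and Number() is applied defensively on the assumption that the shell may return 64-bit wrapper values.

```javascript
// Average latency per operation (microseconds) between two opLatencies samples.
function intervalLatencyMicros(before, after) {
  const result = {};
  for (const cls of ["reads", "writes", "commands"]) {
    const ops = Number(after[cls].ops) - Number(before[cls].ops);
    const latency = Number(after[cls].latency) - Number(before[cls].latency);
    // No operations in the interval means there is no average to report.
    result[cls] = ops > 0 ? latency / ops : 0;
  }
  return result;
}

// Possible use in mongosh (10-second window):
//   var a = db.serverStatus().opLatencies;
//   sleep(10000);
//   var b = db.serverStatus().opLatencies;
//   printjson(intervalLatencyMicros(a, b));
```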

How to diagnose it

Confirm ticket exhaustion. Check db.serverStatus().wiredTiger.concurrentTransactions. If available tickets in either class are below 10, operations are likely queuing.
Find the longest-running operations. Use db.currentOp() or $currentOp filtered to active: true and sort by secs_running descending. Operations running longer than 300 seconds are suspicious unless they are index builds, backups, or validated maintenance tasks.
Distinguish holders from waiters. waitingForLock: false means the operation holds its locks and may be consuming the ticket. waitingForLock: true means it is a victim, not the cause.
Inspect the query shape. Read the command field for namespace and predicate. If planSummary or the slow query log shows COLLSCAN on a large collection, you have likely found the culprit.
Correlate with queue depth. If globalLock.currentQueue is growing while one operation holds a ticket, the link is clear.
Account for sharding. On a mongos, $currentOp reports operations running on the shards by default, with opid values prefixed by the shard name, such as shardB:79214. Pass localOps: true to see only operations local to the router. To see all storage engine work on one shard, run db.adminCommand({ currentOp: 1, $all: true }) directly on that shard.
Verify the operation is not internal. Cross-reference op and desc, and read the full currentOp entry yourself before acting rather than relying on a filter alone. Internal operations must not be killed.
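
Steps 2 and 3 above (longest-running first, holders only) can be sketched as a filter over the inprog array. findSuspects and the shape of its output are hypothetical. The field names are the ones this guide already reads from currentOp.

```javascript
// Lists active, non-waiting operations at or above `minSecs`, oldest first.
function findSuspects(inprog, minSecs) {
  return inprog
    // Keep holders only: active and not blocked on a lock.
    .filter((op) => op.active === true && op.waitingForLock !== true)
    .filter((op) => Number(op.secs_running || 0) >= minSecs)
    // Sort by age, descending.
    .sort((a, b) => Number(b.secs_running) - Number(a.secs_running))
    .map((op) => ({
      opid: op.opid,
      op: op.op,
      ns: op.ns,
      secs: Number(op.secs_running),
      plan: op.planSummary || "n/a",
    }));
}

// Possible use in mongosh:
//   printjson(findSuspects(db.currentOp({ active: true }).inprog, 300));
```

The result is a shortlist for manual inspection. The internal-operation check in the last step still applies to every entry.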

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`currentOp` longest-running operation age	Catches runaway queries before they exhaust tickets and cascade	>60 seconds in OLTP workloads; >300 seconds for any non-maintenance operation
WiredTiger available tickets	Direct measure of storage engine admission control saturation	Available tickets sustained below 25% of the total, or fewer than 10 in absolute terms
`globalLock.currentQueue` depth	Shows operations blocked waiting for resources	Sustained >20 with an upward trend
`opLatencies` reads and writes	User-visible latency degradation	Sustained >2x baseline for >5 minutes
Slow query log rate	Reveals inefficient query plans that will eventually hold tickets	Sudden spike from rolling baseline
Application-thread evictions	Indicates cache pressure often caused by long-lived snapshots	Any sustained nonzero rate
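
The numeric thresholds in the table can be expressed as a single check over one sample. This is a sketch only: evaluateSignals and its input field names are hypothetical, and it evaluates a single point in time, so the "sustained" conditions still need several consecutive samples before they count as a warning.

```javascript
// Applies the table's point-in-time thresholds to one sample.
// Input shape is an assumption: plain numbers gathered by the caller.
function evaluateSignals(s) {
  const warnings = [];
  if (s.longestOpSecs > 60) warnings.push("longest-op-age");
  const pctAvailable = s.ticketsTotal > 0 ? s.ticketsAvailable / s.ticketsTotal : 0;
  if (s.ticketsAvailable < 10 || pctAvailable < 0.25) warnings.push("tickets-low");
  if (s.queueDepth > 20) warnings.push("queue-depth");
  // Latency is compared against a baseline the caller maintains.
  if (s.latencyMicros > 2 * s.baselineLatencyMicros) warnings.push("latency-over-2x-baseline");
  return warnings;
}
```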

Fixes

Kill a read-only operation

Killing a read operation (find, aggregate, count) is the lowest-risk fix. It releases its ticket and locks at the next yield point. The client receives a cursor-killed or interrupted error.

// Identify the opid from currentOp, then:
db.killOp(12345)

Kill a write operation

Killing a write carries more risk. Single-document writes are atomic, but a killed multi-document write (for example, updateMany) stops after modifying some documents. There is no automatic rollback for non-transactional partial progress. If the write is inside a multi-document transaction, the transaction aborts and rolls back.

Only kill writes when:

The write is not inside a multi-document transaction, or you accept the abort
You have verified the opid is a client-initiated operation, not internal
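
Before killing a write, it can help to summarize what the currentOp entry says about sessions and transactions. The sketch below relies on the lsid field (present when the operation belongs to a session) and the transaction field (present when it is part of a multi-document transaction). writeKillContext and its outcome strings are hypothetical.

```javascript
// Summarizes what killing this write is likely to leave behind.
function writeKillContext(op) {
  const inTransaction = op.transaction !== undefined && op.transaction !== null;
  const hasSession = op.lsid !== undefined && op.lsid !== null;
  const outcome = inTransaction
    ? "transaction aborts and rolls back"
    // Applies when the statement touches more than one document.
    : "documents already modified stay modified; no automatic rollback";
  return { opid: op.opid, inTransaction, hasSession, outcome };
}
```

The summary does not replace the check that the opid is client-initiated.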

Kill operations in sharded clusters

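On a mongos, db.killOp() can kill read operations that span shards. For writes inside a session, use killSessions with the session lsid. For writes without a session, find the operation on every affected shard and kill each one, either by passing the shard-prefixed opid string, such as shardB:79214, to db.killOp() on the mongos, or by running db.killOp() with the numeric opid directly on the shard.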
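
Two small helpers illustrate the mechanics: splitting a shard-prefixed opid, and building a killSessions command from the lsid in a currentOp entry. parseShardOpid and killSessionsCommand are hypothetical names. The command shape (an array of documents with an id field) follows the killSessions command, and the lsid.id value is assumed to be passed through unchanged from currentOp.

```javascript
// Splits "shardB:79214" into its shard name and numeric opid.
function parseShardOpid(opid) {
  if (typeof opid === "number") return { shard: null, opid: opid };
  const text = String(opid);
  const idx = text.lastIndexOf(":");
  if (idx === -1) return { shard: null, opid: Number(text) };
  return { shard: text.slice(0, idx), opid: Number(text.slice(idx + 1)) };
}

// Builds the command document for killing the session that owns `op`.
function killSessionsCommand(op) {
  if (!op.lsid || !op.lsid.id) {
    throw new Error("operation " + op.opid + " has no session (lsid)");
  }
  return { killSessions: [{ id: op.lsid.id }] };
}

// Possible use in mongosh, after confirming the target by hand:
//   db.runCommand(killSessionsCommand(op));
```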

Do not kill replication internal operations

Avoid killing any operation whose desc or op fields indicate replication internals.
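
A conservative pre-filter can flag entries that do not look client-initiated. The rules below are heuristics and assumptions, not an official classification: client connections are assumed to have a desc of the form conn<number> and a recorded client address, and anything reading local.oplog.rs is assumed to be replication traffic. looksInternal is a hypothetical name, and a false result does not prove an operation is safe to kill.

```javascript
// Heuristic only: returns true when an entry should be treated as internal.
function looksInternal(op) {
  const desc = op.desc || "";
  const ns = op.ns || "";
  // Assumption: client connection threads are named "conn<number>".
  if (!/^conn\d+$/.test(desc)) return true;
  // Assumption: client operations record a client address
  // (`client` on mongod, `client_s` on mongos).
  if (!op.client && !op.client_s) return true;
  // Assumption: readers of the oplog are replication-related.
  if (ns === "local.oplog.rs") return true;
  return false;
}
```

When in doubt, treat the operation as internal and leave it alone.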

Reduce load if killing is not safe

If the long-running operation is a legitimate bulk job and killing it would corrupt application state, reduce pressure instead:

Pause or throttle the application job
Add a missing index to prevent the collection scan on future runs
Temporarily redirect reads to secondaries if the primary is overloaded

Prevention

Track the maximum secs_running continuously. A metric that exposes the age of the oldest active operation catches runaway queries before ticket exhaustion occurs.
Set maxTimeMS on queries from the driver so the server interrupts them instead of letting them run indefinitely.
Monitor ticket availability as a first-class signal. Declining available tickets during peak hours means operations are slowing inside the storage engine.
Review slow query logs for COLLSCAN. Every collection scan on a production collection is a candidate for an index or a rewrite.
Kill abandoned transactions and noCursorTimeout cursors. Long-running transactions and cursors pin WiredTiger snapshots and indirectly cause ticket contention by preventing eviction.
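
As one way to apply the maxTimeMS advice consistently, a wrapper can attach the limit to every command document before it is sent. withTimeLimit is a hypothetical helper, and the collection name and filter in the usage comment are made up.

```javascript
// Returns a copy of `command` with a server-enforced time limit attached.
function withTimeLimit(command, maxTimeMS) {
  if (!Number.isInteger(maxTimeMS) || maxTimeMS <= 0) {
    throw new RangeError("maxTimeMS must be a positive integer");
  }
  // Copy so the caller's command document is not mutated.
  return Object.assign({}, command, { maxTimeMS: maxTimeMS });
}

// Possible use in mongosh:
//   db.runCommand(withTimeLimit({ find: "orders", filter: { status: "open" } }, 5000));
```

Most drivers also accept maxTimeMS as a per-operation option, which avoids building command documents by hand.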

How Netdata helps

Netdata charts longest-running operation age from currentOp alongside available WiredTiger tickets. You can see when one query starves the system.
opLatencies and globalLock.currentQueue are collected automatically. Use them to see whether latency is server-wide or tied to a specific ticket class.
WiredTiger cache dirty ratio and application-thread eviction counters provide context to distinguish a single bad query from a checkpoint stall or cache pressure cascade.
Alerts on available tickets dropping below thresholds fire before applications time out, giving time to investigate instead of reacting to an outage.

Related guides

How MongoDB actually works in production: a mental model for operators: /guides/mongodb/how-mongodb-works-in-production/
MongoDB pages evicted by application threads: when eviction becomes user latency: /guides/mongodb/mongodb-application-thread-evictions/
MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches: /guides/mongodb/mongodb-cache-dirty-ratio-high/
MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes: /guides/mongodb/mongodb-cache-pressure-cascade/
MongoDB cache too small: sizing the WiredTiger cache for your working set: /guides/mongodb/mongodb-cache-undersized-working-set/
MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints: /guides/mongodb/mongodb-checkpoint-duration-high/
MongoDB checkpoint stall write freeze: when all writes stop with no error: /guides/mongodb/mongodb-checkpoint-stall-write-freeze/
MongoDB connection churn: high totalCreated rate and thread creation overhead: /guides/mongodb/mongodb-connection-churn/
MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling: /guides/mongodb/mongodb-connection-limit-reached/
MongoDB connection storm spiral: reconnection floods after an election or deploy: /guides/mongodb/mongodb-connection-storm-spiral/
MongoDB flow control throttling writes: when the primary slows itself down: /guides/mongodb/mongodb-flow-control-throttling-writes/
MongoDB journal sync latency high: the storage signal that warns 60 seconds early: /guides/mongodb/mongodb-journal-sync-latency-high/

The Netdata solution

MongoDB monitoring with Netdata

Netdata monitors MongoDB with per-second metrics and automatic dashboards. Watch WiredTiger cache pressure, oplog window, connection counts, checkpoint stalls, and replication health in one place, correlated with the underlying host.

