MongoDB silent index regression: when a dropped index quietly becomes a collection scan
Read latency on the primary doubles while connection counts and write throughput stay flat. There are no election events or cache pressure alerts. Traffic is unchanged. Yet p99 read latency climbs until operations time out.
The slow query log shows queries that used to finish in milliseconds now taking seconds. The plans show COLLSCAN. An index that existed last week is gone, or the query planner switched to a less efficient index after a cache invalidation. Because queries still return correct results, the regression is silent until it becomes an outage.
What this means
MongoDB’s query planner evaluates available indexes for each query shape and caches the winning plan. When an index supporting that shape is dropped, or when a plan cache entry becomes stale, the planner may fall back to a collection scan (COLLSCAN). Unlike an outright failure, the query succeeds but examines every document to return a small result set.
Latency grows with collection size. A 10,000-document collection may tolerate a scan. A 10-million-document collection will not. Scans pull more data into the WiredTiger cache, displace the working set, and increase I/O pressure. The system reaches a tipping point where cache eviction and ticket contention amplify the regression into a cluster-wide latency spike.
flowchart TD
A[Index dropped or planner regression] --> B[Query replans to COLLSCAN]
B --> C[docsExamined rises with collection size]
C --> D[Read latency increases gradually]
D --> E[Cache fills with scanned pages]
E --> F[Tipping point: I/O and ticket saturation]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Accidental dropIndexes during maintenance | Slow query log shows COLLSCAN on collections that previously used an index | db.collection.getIndexes() compared to a known-good schema |
| Failed background index build | Index exists in catalog but is incomplete or ignored by planner | MongoDB logs for index build failures around the time latency changed |
| Query plan cache invalidation choosing a worse plan | Same query shape switches from IXSCAN to COLLSCAN or a less selective index | explain("executionStats") on the query shape to inspect the winning plan |
| Schema migration changing index selectivity | New field or type makes an existing index less effective; the ratio of keysExamined to nreturned degrades | Slow query log entries comparing keysExamined to nreturned |
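The "known-good schema" comparison in the first row can be sketched as a small helper. This is a minimal illustration, not a definitive tool: `missingIndexes`, the `expected` list, and the sample index names are all hypothetical, and the `actual` array only imitates the shape returned by `getIndexes()`.

```javascript
// Hypothetical helper: report expected index key patterns that are absent
// from the live index list. Field order matters in compound indexes, so key
// patterns are compared as ordered JSON strings.
function missingIndexes(expected, actual) {
  const live = new Set(actual.map((ix) => JSON.stringify(ix.key)));
  return expected.filter((key) => !live.has(JSON.stringify(key)));
}

// Illustrative data in the shape returned by db.collection.getIndexes().
const actual = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  { v: 2, key: { status: 1 }, name: "status_1" },
];

// Assumed known-good schema, for example a list kept in version control.
const expected = [{ _id: 1 }, { status: 1 }, { customerId: 1, createdAt: -1 }];

const missing = missingIndexes(expected, actual);
console.log(JSON.stringify(missing));
```

In mongosh, `actual` would come from a call such as `db.orders.getIndexes()`, where `orders` is a placeholder collection name. A non-empty result means an expected index is no longer in the catalog.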
Quick checks
Run these safe, read-only checks to confirm the regression.
# Recent collection scans in the log (requires slow operation logging)
grep "COLLSCAN" /var/log/mongodb/mongod.log | tail -20
// Query efficiency counters since server start
var qe = db.serverStatus().metrics.queryExecutor;
print("scanned (keysExamined): " + qe.scanned);
print("scannedObjects (docsExamined): " + qe.scannedObjects);
// Index access counters since server start
db.collection.aggregate([{ $indexStats: {} }]).forEach(function(i) {
print(i.name + " | ops: " + i.accesses.ops + " | since: " + i.accesses.since);
});
// If profiling is enabled, compare docs examined to docs returned
db.system.profile.find().sort({ ts: -1 }).limit(20).forEach(function(p) {
var ratio = p.docsExamined / Math.max(p.nreturned || 1, 1);
print(p.ns + " | examined: " + p.docsExamined + " | returned: " + (p.nreturned || 0) + " | ratio: " + ratio.toFixed(0) + ":1");
});
// Current indexes on the affected collection
db.collection.getIndexes().forEach(function(i) {
print(i.name + " | " + JSON.stringify(i.key));
});
// Average read latency in microseconds, cumulative since server start
var lat = db.serverStatus().opLatencies.reads;
print("Read avg (us): " + (lat.latency / lat.ops).toFixed(0));
// Long-running read operations
db.currentOp({ active: true, secs_running: { $gt: 10 }, op: "query" })
// Check winning plan for a suspect query shape
db.collection.explain("executionStats").find({ field: "value" })
How to diagnose it
1. Isolate read efficiency from load. Verify opLatencies.reads.latency is rising while opcounters.query and connections.current remain stable. This rules out a traffic spike or connection storm.
2. Inspect the slow query log. Look for planSummary: COLLSCAN or a high docsExamined to nreturned ratio. A ratio above 100:1 on a large collection indicates the executor is discarding most examined documents.
3. Sample metrics.queryExecutor twice. Record scanned and scannedObjects, wait five minutes, then sample again. A rising ratio of scanned objects to queries means the executor is traversing more documents than it returns.
4. Run $indexStats on suspect collections. An index that previously served a query shape but shows zero ops since restart is either missing or bypassed. Compare accesses.since to the server uptime.
5. Verify index existence and state. Use db.collection.getIndexes(). If the expected index is gone, it was dropped. If it exists but is unused, check whether it is hidden or was left incomplete by a failed build.
6. Inspect the winning plan. Run explain("executionStats") on the affected query shape. If the plan shows COLLSCAN on a large collection, or an IXSCAN on a low-selectivity index followed by a large docsExamined, the planner has regressed. Note the executionTimeMillis and totalDocsExamined.
7. Check for plan cache pollution. Query the plan cache to see if a COLLSCAN plan is cached while a selective index exists. Clear the cache for the collection to force replanning before considering a restart.
8. Correlate with DDL events. Search MongoDB logs for dropIndexes, createIndexes, failed index builds, or collMod near the time the latency trend changed. DDL clears the plan cache and forces replanning.
9. Validate collection size. MongoDB may legitimately choose COLLSCAN for very small collections. Confirm the document count is high enough that a scan is pathological, for example with db.collection.estimatedDocumentCount().
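The plan inspection step can be sketched as a walk over the explain output. This is a rough illustration under stated assumptions: `planStages` and `summarizeExplain` are hypothetical helper names, the sample documents are invented, and the code assumes an unsharded collection where the plan tree sits under `queryPlanner.winningPlan` (or under `winningPlan.queryPlan` on servers using the slot-based engine).

```javascript
// Collect every stage name in an explain() plan tree. Child stages hang off
// inputStage (one child) or inputStages (several children, such as OR).
function planStages(node, out = []) {
  if (!node || typeof node !== "object") return out;
  if (node.stage) out.push(node.stage);
  if (node.inputStage) planStages(node.inputStage, out);
  (node.inputStages || []).forEach((child) => planStages(child, out));
  return out;
}

// Hypothetical summary: flag a collection scan and compute how many
// documents were examined per document returned.
function summarizeExplain(explain) {
  const wp = explain.queryPlanner.winningPlan;
  const root = wp.queryPlan || wp; // assumption: slot-based plans nest under queryPlan
  const stages = planStages(root);
  const stats = explain.executionStats;
  return {
    stages: stages,
    collscan: stages.includes("COLLSCAN"),
    examinedPerReturned: stats.totalDocsExamined / Math.max(stats.nReturned, 1),
    executionTimeMillis: stats.executionTimeMillis,
  };
}

// Invented sample outputs: one regressed plan, one healthy plan.
const regressed = {
  queryPlanner: { winningPlan: { stage: "COLLSCAN", direction: "forward" } },
  executionStats: { nReturned: 3, executionTimeMillis: 4120, totalKeysExamined: 0, totalDocsExamined: 9000000 },
};
const healthy = {
  queryPlanner: { winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "field_1" } } },
  executionStats: { nReturned: 3, executionTimeMillis: 1, totalKeysExamined: 3, totalDocsExamined: 3 },
};

const regressedSummary = summarizeExplain(regressed);
const healthySummary = summarizeExplain(healthy);
console.log(JSON.stringify(regressedSummary));
console.log(JSON.stringify(healthySummary));
```

In mongosh the input would be the result of `db.collection.explain("executionStats").find({ field: "value" })`. Sharded clusters and aggregation pipelines return a different layout, so the walker would need adapting there.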
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| opLatencies.reads.latency / opLatencies.reads.ops | Server-side read latency trend | Average doubles from baseline for more than 5 minutes with no corresponding write or connection spike |
| metrics.queryExecutor.scanned / scannedObjects | Query executor efficiency | Sustained increase in the ratio without a workload change |
| Slow query log COLLSCAN rate | Direct evidence of collection scans | New COLLSCAN entries on collections with more than 100,000 documents |
| wiredTiger.cache.pages read into cache | I/O volume driven by scans | Sustained increase correlating with COLLSCAN onset in logs |
| $indexStats.ops per index | Whether expected indexes remain active | A previously used index drops to zero ops after a restart or DDL event |
| globalLock.currentQueue.readers | Read queuing caused by slow scans | Sustained queue depth greater than 20 correlating with scan onset |
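The serverStatus counters in this table are cumulative since server start, so a trend is easier to see from the difference between two samples. The sketch below shows one way to compute interval values. `intervalReadStats` is a hypothetical helper and the sample numbers are invented for illustration.

```javascript
// Counters may arrive as 64-bit wrapper objects in the shell, so coerce them
// to plain numbers before subtracting.
const num = (v) => Number(v);

// Hypothetical helper: derive interval metrics from two serverStatus() samples.
function intervalReadStats(before, after) {
  const dOps = num(after.opLatencies.reads.ops) - num(before.opLatencies.reads.ops);
  const dLatency = num(after.opLatencies.reads.latency) - num(before.opLatencies.reads.latency);
  const dDocs = num(after.metrics.queryExecutor.scannedObjects) - num(before.metrics.queryExecutor.scannedObjects);
  const dQueries = num(after.opcounters.query) - num(before.opcounters.query);
  return {
    // Average read latency over the interval, in microseconds.
    avgReadLatencyUs: dOps > 0 ? dLatency / dOps : 0,
    // Documents examined per query over the interval.
    docsExaminedPerQuery: dQueries > 0 ? dDocs / dQueries : 0,
  };
}

// Invented samples containing only the fields the helper reads.
const before = {
  opLatencies: { reads: { ops: 1000000, latency: 2000000000 } },
  metrics: { queryExecutor: { scannedObjects: 5000000 } },
  opcounters: { query: 800000 },
};
const after = {
  opLatencies: { reads: { ops: 1030000, latency: 2450000000 } },
  metrics: { queryExecutor: { scannedObjects: 65000000 } },
  opcounters: { query: 820000 },
};

const interval = intervalReadStats(before, after);
console.log(JSON.stringify(interval));
```

In mongosh, each sample would be a saved `db.serverStatus()` result taken a few minutes apart. Comparing the interval values against a baseline from a healthy period avoids the smoothing effect of the since-startup average.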
Fixes
Force a query hint (immediate mitigation)
If the correct index exists but the planner is ignoring it, apply a query hint to buy time while you fix the root cause.
db.collection.find({ field: "value" }).hint({ field: 1 })
Warning: Hints override the planner's index selection. If the hinted index does not exist, the operation fails; if it exists but fits the query poorly, MongoDB still uses it. Remove the hint after the root cause is resolved so future index additions are considered.
Recreate a missing index
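If getIndexes() shows the index is gone, rebuild it. The background option below only matters on MongoDB versions before 4.2; from 4.2 onward it is ignored because every index build uses an optimized process that locks the collection only briefly at the start and end of the build.
db.collection.createIndex({ field: 1 }, { background: true })
Warning: Index builds consume CPU, I/O, and disk space. They generate oplog entries that secondaries apply, which can increase replication lag. Monitor lag with rs.printSecondaryReplicationInfo() during the build. Avoid starting builds during peak traffic.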
Clear a stale plan cache
If the index exists but the planner avoids it, clear the plan cache for the collection. This forces replanning on the next execution.
db.runCommand({ planCacheClear: "collection" })
If the regression persists across the whole cluster and you cannot identify the offending entry, a rolling restart clears all cached plans. This is disruptive and should be a last resort.
Prevention
- Review all index changes in a staging environment that mirrors production data volume. Run explain("executionStats") on critical query shapes before and after the change.
- Monitor $indexStats deltas weekly. An index that drops out of the top-used list warrants investigation.
- Alert on the ratio of metrics.queryExecutor.scanned to scannedObjects. A sustained increase is an early warning of regressing query efficiency.
- Keep the slow query threshold low enough to catch new COLLSCAN patterns before they become outages.
- Before dropping any index, verify it is not used by checking $indexStats across a full business cycle and reviewing slow query logs for the query shapes it serves.
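The $indexStats delta check from the list above can be sketched by comparing two snapshots taken some time apart. `idleIndexes` is a hypothetical helper and the index names and counts are invented. The sketch assumes both snapshots come from the same node, since the counters are tracked per mongod and reset when it restarts.

```javascript
// Hypothetical helper: list indexes whose access counter did not move between
// two $indexStats snapshots. A negative delta suggests the counters were reset
// by a restart, so those indexes are not reported as idle.
function idleIndexes(earlier, later) {
  const prev = new Map(earlier.map((s) => [s.name, Number(s.accesses.ops)]));
  return later
    .filter((s) => s.name !== "_id_") // the _id index cannot be dropped
    .filter((s) => prev.has(s.name) && Number(s.accesses.ops) - prev.get(s.name) === 0)
    .map((s) => s.name);
}

// Invented snapshots in the shape produced by the $indexStats stage.
const lastWeek = [
  { name: "_id_", accesses: { ops: 900 } },
  { name: "status_1", accesses: { ops: 100 } },
  { name: "legacy_flag_1", accesses: { ops: 7 } },
];
const thisWeek = [
  { name: "_id_", accesses: { ops: 900 } },
  { name: "status_1", accesses: { ops: 240 } },
  { name: "legacy_flag_1", accesses: { ops: 7 } },
];

const idle = idleIndexes(lastWeek, thisWeek);
console.log(JSON.stringify(idle));
```

In mongosh, each snapshot would be `db.collection.aggregate([{ $indexStats: {} }]).toArray()`. An index that looks idle on one node may still serve reads on another, so it is worth checking every member before treating an index as unused.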
How Netdata helps
- Correlate rising opLatencies.reads with flat opcounters and connections to isolate query-efficiency regressions from load spikes.
- Alert on metrics.queryExecutor.scanned rate increasing while opcounters.query stays flat, catching collection scans early.
- Track per-node read latency. A silent index regression on a secondary shows up as elevated read latency for that node without primary impact.
- Surface slow query frequency and planSummary patterns from MongoDB logs, flagging COLLSCAN as it appears.
- Visualize $indexStats usage trends over time to spot index abandonment before latency degrades.
Related guides
- How MongoDB actually works in production: a mental model for operators
- MongoDB pages evicted by application threads: when eviction becomes user latency
- MongoDB WiredTiger cache dirty ratio high: the leading indicator nobody watches
- MongoDB WiredTiger cache pressure cascade: eviction stalls and latency spikes
- MongoDB cache too small: sizing the WiredTiger cache for your working set
- MongoDB checkpoint duration climbing: diagnosing slow WiredTiger checkpoints
- MongoDB checkpoint stall write freeze: when all writes stop with no error
- MongoDB connection churn: high totalCreated rate and thread creation overhead
- MongoDB connection refused at maxIncomingConnections: hitting the connection ceiling
- MongoDB connection storm spiral: reconnection floods after an election or deploy
- MongoDB flow control throttling writes: when the primary slows itself down
- MongoDB journal sync latency high: the storage signal that warns 60 seconds early







