MongoDB silent index regression: when a dropped index quietly becomes a collection scan

Read latency on the primary doubles while connection counts and write throughput stay flat. There are no election events or cache pressure alerts. Traffic is unchanged. Yet p99 read latency climbs until operations time out.

The slow query log shows queries that used to finish in milliseconds now taking seconds. The plans show COLLSCAN. An index that existed last week is gone, or the query planner switched to a less efficient index after a cache invalidation. Because queries still return correct results, the regression is silent until it becomes an outage.

What this means

MongoDB’s query planner evaluates available indexes for each query shape and caches the winning plan. When an index supporting that shape is dropped, or when a plan cache entry becomes stale, the planner may fall back to a collection scan (COLLSCAN). Unlike an outright failure, the query succeeds but examines every document to return a small result set.

Latency grows with collection size. A 10,000-document collection may tolerate a scan. A 10-million-document collection will not. Scans pull more data into the WiredTiger cache, displace the working set, and increase I/O pressure. The system reaches a tipping point where cache eviction and ticket contention amplify the regression into a cluster-wide latency spike.

flowchart TD
    A[Index dropped or planner regression] --> B[Query replans to COLLSCAN]
    B --> C[docsExamined rises with collection size]
    C --> D[Read latency increases gradually]
    D --> E[Cache fills with scanned pages]
    E --> F[Tipping point: I/O and ticket saturation]

Common causes

CauseWhat it looks likeFirst thing to check
Accidental dropIndexes during maintenanceSlow query log shows COLLSCAN on collections that previously used an indexdb.collection.getIndexes() compared to a known-good schema
Failed background index buildIndex exists in catalog but is incomplete or ignored by plannerMongoDB logs for index build failures around the time latency changed
Query plan cache invalidation choosing a worse planSame query shape switches from IXSCAN to COLLSCAN or a less selective indexexplain("executionStats") on the query shape to inspect the winning plan
Schema migration changing index selectivityNew field or type makes an existing index less effective; keysExamined:docsReturned degradesSlow query log entries comparing keysExamined to docsReturned

Quick checks

Run these safe, read-only checks to confirm the regression.

# Recent collection scans in the log (requires slow operation logging)
grep "COLLSCAN" /var/log/mongodb/mongod.log | tail -20
// Query efficiency counters since server start
var qe = db.serverStatus().metrics.queryExecutor;
print("scanned (keysExamined): " + qe.scanned);
print("scannedObjects (docsExamined): " + qe.scannedObjects);
// Index access counters since server start
db.collection.aggregate([{ $indexStats: {} }]).forEach(function(i) {
  print(i.name + " | ops: " + i.accesses.ops + " | since: " + i.accesses.since);
});
// If profiling is enabled, compare docs examined to docs returned
db.system.profile.find().sort({ ts: -1 }).limit(20).forEach(function(p) {
  var ratio = p.docsExamined / Math.max(p.nreturned || 1, 1);
  print(p.ns + " | examined: " + p.docsExamined + " | returned: " + (p.nreturned || 0) + " | ratio: " + ratio.toFixed(0) + ":1");
});
// Current indexes on the affected collection
db.collection.getIndexes().forEach(function(i) {
  print(i.name + " | " + JSON.stringify(i.key));
});
// Average read latency in microseconds
var lat = db.serverStatus().opLatencies.reads;
print("Read avg (us): " + (lat.latency / lat.ops).toFixed(0));
// Long-running read operations
db.currentOp({ active: true, secs_running: { $gt: 10 }, op: "query" })
// Check winning plan for a suspect query shape
db.collection.explain("executionStats").find({ field: "value" })

How to diagnose it

  1. Isolate read efficiency from load. Verify opLatencies.reads.latency is rising while opcounters.query and connections.current remain stable. This rules out a traffic spike or connection storm.

  2. Inspect the slow query log. Look for planSummary: COLLSCAN or a high docsExamined:docsReturned ratio. A ratio above 100:1 on a large collection indicates the executor is discarding most examined documents.

  3. Sample metrics.queryExecutor twice. Record scanned and scannedObjects, wait five minutes, then sample again. A rising ratio of scanned objects to queries means the executor is traversing more documents than it returns.

  4. Run $indexStats on suspect collections. An index that previously served a query shape but shows zero ops since restart is either missing or bypassed. Compare accesses.since to the server uptime.

  5. Verify index existence and state. Use db.collection.getIndexes(). If the expected index is gone, it was dropped. If it exists but is unused, check whether it is hidden or was left incomplete by a failed build.

  6. Inspect the winning plan. Run explain("executionStats") on the affected query shape. If the plan shows COLLSCAN on a large collection, or an IXSCAN on a low-selectivity index followed by a large docsExamined, the planner has regressed. Note the executionTimeMillis and totalDocsExamined.

  7. Check for plan cache pollution. Query the plan cache to see if a COLLSCAN plan is cached while a selective index exists. Clear the cache for the collection to force replanning before considering a restart.

  8. Correlate with DDL events. Search MongoDB logs for dropIndexes, createIndexes, failed index builds, or collMod near the time the latency trend changed. DDL clears the plan cache and forces replanning.

  9. Validate collection size. MongoDB may legitimately choose COLLSCAN for very small collections. Confirm the document count is high enough that a scan is pathological.

    db.collection.estimatedDocumentCount()
    

Metrics and signals to monitor

SignalWhy it mattersWarning sign
opLatencies.reads.latency / opLatencies.reads.opsServer-side read latency trendAverage doubles from baseline for more than 5 minutes with no corresponding write or connection spike
metrics.queryExecutor.scanned / scannedObjectsQuery executor efficiencySustained increase in the ratio without a workload change
Slow query log COLLSCAN rateDirect evidence of collection scansNew COLLSCAN entries on collections with more than 100,000 documents
wiredTiger.cache.pages read into cacheI/O volume driven by scansSustained increase correlating with COLLSCAN onset in logs
$indexStats.ops per indexWhether expected indexes remain activeA previously used index drops to zero ops after a restart or DDL event
globalLock.currentQueue.readersRead queuing caused by slow scansSustained queue depth greater than 20 correlating with scan onset

Fixes

Force a query hint (immediate mitigation)

If the correct index exists but the planner is ignoring it, apply a query hint to buy time while you fix the root cause.

db.collection.find({ field: "value" }).hint({ field: 1 })

Warning: Hints bypass the planner entirely. If the hinted index cannot support the query, the operation fails. Remove the hint after the root cause is resolved so future index additions are considered.

Recreate a missing index

If getIndexes() shows the index is gone, rebuild it with a background build.

db.collection.createIndex({ field: 1 }, { background: true })

Warning: Background builds consume CPU, I/O, and disk space. They generate oplog entries that secondaries apply, which can increase replication lag. Monitor lag with rs.printSecondaryReplicationInfo() during the build. Avoid starting builds during peak traffic.

Clear a stale plan cache

If the index exists but the planner avoids it, clear the plan cache for the collection. This forces replanning on the next execution.

db.runCommand({ planCacheClear: "collection" })

If the regression persists across the whole cluster and you cannot identify the offending entry, a rolling restart clears all cached plans. This is disruptive and should be a last resort.

Prevention

  • Review all index changes in a staging environment that mirrors production data volume. Run explain("executionStats") on critical query shapes before and after the change.
  • Monitor $indexStats deltas weekly. An index that drops out of the top-used list warrants investigation.
  • Alert on the ratio of metrics.queryExecutor.scanned to scannedObjects. A sustained increase is an early warning of regressing query efficiency.
  • Keep the slow query threshold low enough to catch new COLLSCAN patterns before they become outages.
  • Before dropping any index, verify it is not used by checking $indexStats across a full business cycle and reviewing slow query logs for the query shapes it serves.

How Netdata helps

  • Correlate rising opLatencies.reads with flat opcounters and connections to isolate query-efficiency regressions from load spikes.
  • Alert on metrics.queryExecutor.scanned rate increasing while opcounters.query stays flat, catching collection scans early.
  • Track per-node read latency. A silent index regression on a secondary shows up as elevated read latency for that node without primary impact.
  • Surface slow query frequency and planSummary patterns from MongoDB logs, flagging COLLSCAN as it appears.
  • Visualize $indexStats usage trends over time to spot index abandonment before latency degrades.