MongoDB chunk migration storms: moveChunk I/O pressure and range locks

Write latency spikes across multiple shards simultaneously. Queries time out while mongos logs show no election events. Donor shard primaries report growing queue depths, and config.changelog shows a wall of moveChunk.error entries interleaved with retries. The balancer is hammering the cluster instead of helping.

A single moveChunk operation is expensive. It copies documents to the recipient, enters a critical section that holds a range lock on the donor and blocks writes to the chunk, then deletes the orphaned documents left on the donor. When the balancer triggers many migrations in quick succession, or individual migrations fail and retry in a tight loop, overlapping I/O pressure and lock contention compound into a cluster-wide latency event.

flowchart TD
  A[Hot shard or jumbo chunk] --> B[Chunk imbalance]
  B --> C[Balancer triggers moveChunk]
  C --> D[Clone phase: heavy I/O on recipient]
  C --> E[Critical section: range lock on donor]
  D --> F[Recipient cache dirty ratio rises]
  E --> G[Donor write latency spikes]
  F --> H{Catch-up fails or config server slow}
  G --> H
  H --> I[Migration fails or stalls]
  I --> J[Balancer retries]
  J --> K[Migration storm]

What this means

A chunk migration proceeds in three phases. Clone: the donor copies documents matching the chunk bounds to the recipient via a cursor. Recipient insertions flow through WiredTiger and can pressure the cache if the working set is cold. Critical section: the donor holds a range lock over the chunk’s shard key interval and queues writes targeting that range until the metadata commit completes. Cleanup: after the metadata commit, the donor deletes orphaned documents.

Under normal conditions each shard takes part in at most one migration at a time, so the impact is brief. A storm develops when:

  • The balancer retries a failed migration repeatedly.
  • Config server latency delays migration commits, extending the critical section.
  • A hot shard or jumbo chunks force the balancer to run continuously.
  • Recipient I/O saturation slows the clone and catch-up, so the donor holds the range lock longer.

The result is latency spikes on both the donor (range lock queuing) and the recipient (clone-phase write I/O), with the config server gating the critical section.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Jumbo chunks blocking migration | Balancer is active but chunk counts never equalize; repeated moveChunk.error entries | db.getSiblingDB("config").chunks.find({jumbo: true}) |
| Config server latency gating commits | Write latency spikes outlast the clone phase; critical sections hang | db.serverStatus().opLatencies on the config primary |
| Hot shard forcing aggressive rebalancing | One shard holds >20% more chunks than the average; donor and recipient I/O both saturated | Per-shard chunk aggregation from config.chunks |
| Recipient cache saturation during clone | Recipient WiredTiger dirty ratio climbs and application threads start evicting | WiredTiger cache stats on the recipient primary |
| Sustained write load extending catch-up | moveChunk.start entries without matching commits; donor currentQueue.writers grows | db.serverStatus().globalLock.currentQueue on the donor |

Quick checks

Run these read-only checks to confirm a migration storm is in progress.

// Run on a mongos unless a comment says otherwise.
// Check whether a balancer round is currently in progress
sh.isBalancerRunning()
// Review recent migration history and outcomes
db.getSiblingDB("config").changelog.find(
  { what: /^moveChunk\./ },
  { time: 1, what: 1, details: 1 }
).sort({ time: -1 }).limit(20)
// Count migration failures (exact match; an unescaped "." in a regex matches any character)
db.getSiblingDB("config").changelog.countDocuments({ what: "moveChunk.error" })
// Check for unmigratable jumbo chunks
db.getSiblingDB("config").chunks.countDocuments({ jumbo: true })
// See chunk distribution across shards
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
// Check donor write queue depth (run on the donor primary)
db.serverStatus().globalLock.currentQueue
# Check recipient disk health and I/O saturation (OS shell on the recipient host, not mongosh)
iostat -x 1 3
// Check config server primary latency (run on the config server primary)
db.serverStatus().opLatencies
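
The changelog output is easier to act on when grouped by outcome and shard pair. The following is a minimal sketch rather than part of any MongoDB tooling: `summarizeMigrations` is a hypothetical helper, and it assumes each changelog entry carries a `what` string plus `details.from` and `details.to` where they are available. It should run in mongosh or Node.js.

```javascript
// Summarize moveChunk outcomes from an array of config.changelog documents.
// Assumption: entries look like { what: "moveChunk.error", details: { from, to } }.
function summarizeMigrations(entries) {
  const summary = { commits: 0, errors: 0, errorsByPair: {} };
  for (const entry of entries) {
    if (entry.what === "moveChunk.commit") {
      summary.commits += 1;
    } else if (entry.what === "moveChunk.error") {
      summary.errors += 1;
      const details = entry.details || {};
      const pair = `${details.from || "?"} -> ${details.to || "?"}`;
      summary.errorsByPair[pair] = (summary.errorsByPair[pair] || 0) + 1;
    }
  }
  // Heuristic from this guide: more errors than commits suggests a retry storm.
  summary.stormLikely = summary.errors > summary.commits;
  return summary;
}

// Usage in mongosh, connected to a mongos (30-minute lookback is an example):
// const since = new Date(Date.now() - 30 * 60 * 1000);
// const entries = db.getSiblingDB("config").changelog
//   .find({ what: /^moveChunk\./, time: { $gte: since } }).toArray();
// printjson(summarizeMigrations(entries));
```

The shard pair with the most errors is usually the first place to look at disk and queue metrics.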

How to diagnose it

  1. Confirm the storm from config.changelog. Inspect the last 30 minutes. If moveChunk.error entries outnumber successful commits, the balancer is retrying faster than migrations complete. Note details.from and details.to to identify the affected shard pair, and details.errmsg for the failure reason.

  2. Correlate donor latency with lock queuing. On the donor primary, compare db.serverStatus().opLatencies.writes against baseline. If write latency spikes while globalLock.currentQueue.writers is elevated, the critical section is serializing writes.

  3. Check recipient cache pressure from the clone phase. On the recipient primary, inspect db.serverStatus().wiredTiger.cache. A rising dirty ratio or nonzero pages evicted by application threads means the clone write stream is overwhelming WiredTiger’s flush capacity. This slows catch-up and extends the donor lock.

  4. Measure config server command latency. On the config server primary, check opLatencies.commands. Elevated latency delays the metadata commit that ends the critical section. Even fast clone phases turn into long lock holds if the config server is slow.

  5. Identify the imbalance trigger. Run the chunk distribution aggregation. If skew exceeds 20% and jumbo chunks exist, the balancer is stuck attempting to move unmigratable chunks.

  6. Map the timeline. Overlay changelog timestamps with per-shard opcounters and disk I/O metrics. If migration start times correlate with I/O saturation and latency spikes, the causal chain is confirmed.
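
Step 5 can be turned into a number. The sketch below takes the output of the chunk distribution aggregation and computes skew as (max - min) / average, which is the definition this guide uses for its 20% threshold. `chunkSkew` and `balancerLikelyStuck` are hypothetical helper names, and the default threshold is only this guide's rule of thumb.

```javascript
// Compute chunk-count skew from the per-shard aggregation output.
// Input shape: [{ _id: "<shard name>", count: <number of chunks> }, ...]
function chunkSkew(perShard) {
  if (!Array.isArray(perShard) || perShard.length === 0) return null;
  const counts = perShard.map((s) => Number(s.count));
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const total = counts.reduce((a, b) => a + b, 0);
  const average = total / counts.length;
  // Skew as a percentage of the per-shard average.
  const skewPct = average === 0 ? 0 : ((max - min) * 100) / average;
  return { max, min, average, skewPct };
}

// Skew above the threshold while jumbo chunks exist suggests the balancer
// keeps attempting moves it cannot complete.
function balancerLikelyStuck(perShard, jumboCount, thresholdPct = 20) {
  const skew = chunkSkew(perShard);
  return skew !== null && skew.skewPct > thresholdPct && jumboCount > 0;
}

// Usage in mongosh, connected to a mongos:
// const cfg = db.getSiblingDB("config");
// const perShard = cfg.chunks.aggregate([
//   { $group: { _id: "$shard", count: { $sum: 1 } } }
// ]).toArray();
// const jumbo = cfg.chunks.countDocuments({ jumbo: true });
// print(balancerLikelyStuck(perShard, jumbo));
```

This aggregates across all sharded collections; grouping per collection first gives a sharper signal when only one collection is skewed.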

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| config.changelog moveChunk error rate | Failed migrations waste I/O and retry in tight loops | Errors exceed successes over any 10-minute window |
| Donor shard write opLatencies | The critical section range lock blocks concurrent writes to the chunk | p99 write latency spikes correlating with changelog entries |
| Recipient WiredTiger cache dirty ratio | The clone phase writes a burst of new data into the recipient cache; if the disk cannot absorb it, dirty pages accumulate | Dirty ratio >15% or application-thread evictions incrementing |
| Config server primary command latency | Metadata commits gate how long the donor holds the range lock | Average command latency >2x baseline during balancing windows |
| Chunk distribution skew | Persistent imbalance forces the balancer to run continuously | Max-min chunk count >20% of the per-shard average |
| Jumbo chunk count | Unmigratable chunks create permanent hot spots and balancer retry loops | Any jumbo chunks on actively growing collections |
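
Two of these signals need a little arithmetic before they can be compared against a threshold. The sketch below derives the dirty ratio from `db.serverStatus().wiredTiger.cache` and the mean latency between two `opLatencies` samples. The helper names are hypothetical. It assumes `latency` is a cumulative total in microseconds and `ops` a cumulative count, so only the difference between two samples is meaningful.

```javascript
// Dirty ratio as a percentage of the configured WiredTiger cache size.
// `cache` is db.serverStatus().wiredTiger.cache from the recipient primary.
function dirtyRatioPct(cache) {
  const dirty = Number(cache["tracked dirty bytes in the cache"]);
  const max = Number(cache["maximum bytes configured"]);
  if (!(max > 0)) return null;
  return (dirty * 100) / max;
}

// Mean latency in microseconds per operation between two samples of one
// opLatencies bucket, for example opLatencies.writes or opLatencies.commands.
function meanLatencyMicros(prev, curr) {
  const ops = Number(curr.ops) - Number(prev.ops);
  if (ops <= 0) return null; // no operations, or counters were reset
  return (Number(curr.latency) - Number(prev.latency)) / ops;
}

// Usage in mongosh on the relevant primary (10-second gap is an example):
// const a = db.serverStatus().opLatencies.writes;
// sleep(10000);
// const b = db.serverStatus().opLatencies.writes;
// print(meanLatencyMicros(a, b));
// print(dirtyRatioPct(db.serverStatus().wiredTiger.cache));
```

This yields an average, not a percentile, so it fits the config server's 2x-baseline check better than the donor's p99 signal.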

Fixes

Stop the balancer immediately

Run sh.stopBalancer() to halt new migrations. A migration already in flight runs to completion first; after that, no new range locks or clone I/O start, which breaks the retry loop.

Warning: This pauses rebalancing. Chunk imbalance will persist until you re-enable the balancer with sh.startBalancer(), but stopping it gives immediate relief.
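
Stopping the balancer is worth verifying, since an in-flight migration can keep running for a while. The sketch below wraps the stop-and-check sequence in a hypothetical `pauseBalancer` helper that takes the mongosh `sh` object as an argument. The return shape of `sh.isBalancerRunning()` differs between shell versions, so the result is passed through unchanged.

```javascript
// Stop the balancer and report its state afterwards.
// `sh` is the mongosh sharding helper, passed in so the function has no globals.
function pauseBalancer(sh) {
  sh.stopBalancer();
  return {
    // false once the balancer is disabled
    enabled: sh.getBalancerState(),
    // still reports activity while an in-flight migration finishes;
    // may be a boolean or a status document depending on shell version
    running: sh.isBalancerRunning(),
  };
}

// Usage in mongosh, connected to a mongos:
// printjson(pauseBalancer(sh));
```

If `running` still reports activity, wait for the current migration to finish before judging whether latency has recovered.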

Resolve jumbo chunks

Jumbo chunks cannot be migrated by the balancer. If your shard key cardinality supports it, manually split the chunk range. If the shard key itself is the problem, plan a reshard.

Warning: Resharding is I/O-intensive. Schedule it outside peak traffic.
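
A manual split can be sketched as follows. This is an illustration under strong assumptions, not a general tool: the shard key is a single numeric field, the chunk bounds are plain numbers rather than MinKey or MaxKey, and `splitJumboAtMidpoint` is a hypothetical name. It uses the mongosh helper `sh.splitAt(namespace, query)` to split at an explicit key value.

```javascript
// Split a jumbo chunk at the midpoint of its key range.
// `chunk` is a config.chunks document with numeric min/max for `keyField`.
function splitJumboAtMidpoint(sh, namespace, chunk, keyField) {
  const lo = chunk.min[keyField];
  const hi = chunk.max[keyField];
  if (typeof lo !== "number" || typeof hi !== "number") {
    throw new Error("this sketch only handles numeric chunk bounds");
  }
  const mid = Math.floor(lo + (hi - lo) / 2);
  // The split point must fall strictly inside the chunk range.
  if (mid <= lo || mid >= hi) {
    throw new Error("chunk range is too narrow to split at a midpoint");
  }
  sh.splitAt(namespace, { [keyField]: mid });
  return mid;
}

// Usage in mongosh (namespace and field name are placeholders):
// splitJumboAtMidpoint(sh, "mydb.events", jumboChunkDoc, "userId");
```

The midpoint of the key range is not the median of the data, so the two halves may be uneven; `sh.splitFind()` splits at the median instead. Neither helps when every document in the chunk shares one shard key value, which is the case that calls for a reshard.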

Tune the balancer window

Restrict balancing to off-peak hours so migration I/O does not compete with application traffic. Tradeoff: data distribution lags behind traffic shifts during the day, but you avoid peak-hour latency spikes.
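
The window is stored as an `activeWindow` field on the `balancer` document in `config.settings`. The sketch below builds that update and validates the time format first; `balancerWindowUpdate` is a hypothetical helper, and the times shown are placeholders, not a recommendation.

```javascript
// Build the config.settings update that restricts the balancer to a window.
// Times are "HH:MM" strings in 24-hour format.
function balancerWindowUpdate(start, stop) {
  const hhmm = /^([01]\d|2[0-3]):[0-5]\d$/;
  if (!hhmm.test(start) || !hhmm.test(stop)) {
    throw new Error("start and stop must be HH:MM in 24-hour time");
  }
  if (start === stop) {
    throw new Error("start and stop must differ");
  }
  return {
    filter: { _id: "balancer" },
    update: { $set: { activeWindow: { start, stop } } },
    options: { upsert: true },
  };
}

// Usage in mongosh, connected to a mongos:
// const w = balancerWindowUpdate("02:00", "06:00");
// db.getSiblingDB("config").settings.updateOne(w.filter, w.update, w.options);
```

The window is evaluated in the time zone of the config server primary, and it only takes effect while the balancer is enabled.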

Throttle writes during catch-up

If sustained write volume keeps extending the critical section, temporarily pause bulk ingestion or lower application write concurrency. Tradeoff: slower pipeline, but shorter range-lock hold time.

Fix config server storage latency

If the config server primary shows elevated opLatencies, investigate its underlying disk with iostat or host-level storage metrics. Do not restart config servers during a storm. Resolve the storage contention first so metadata commits can flow.

Prevention

  • Monitor chunk distribution trends. Alert when skew exceeds 15%, before the balancer storms.
  • Size the recipient WiredTiger cache for clone load. Ensure the cache can absorb migration writes without dirty-ratio spikes.
  • Choose a high-cardinality shard key. This minimizes jumbo chunks and reduces the frequency of rebalancing.
  • Watch config server latency as a leading indicator. Elevated command latency on the config primary predicts migration stalls before they cascade to shards.
  • Restrict the balancer window. Limit automatic balancing to maintenance or low-traffic periods.

How Netdata helps

  • WiredTiger cache dirty ratio and application-thread eviction charts on recipient shards reveal clone-phase I/O saturation.
  • opLatencies across shard primaries and config servers in one view surface cross-shard write latency patterns from range-lock queuing.
  • Disk latency and utilization alerts flag when migration I/O saturates donor or recipient storage.
  • Ticket utilization and queue-depth charts on donor shards highlight admission-control backlog from range locks.