MongoDB chunk migration storms: moveChunk I/O pressure and range locks

Write latency spikes across multiple shards simultaneously. Queries time out while mongos logs show no election events. Donor shard primaries report growing queue depths, and config.changelog shows a wall of moveChunk.error entries interleaved with retries. The balancer is hammering the cluster instead of helping.

A single moveChunk operation is expensive. It copies documents to the recipient, enters a critical section with a shared range lock on the donor, then deletes orphaned documents. When the balancer triggers many migrations in quick succession, or individual migrations fail and retry in a tight loop, overlapping I/O pressure and lock contention compound into a cluster-wide latency event.

flowchart TD
  A[Hot shard or jumbo chunk] --> B[Chunk imbalance]
  B --> C[Balancer triggers moveChunk]
  C --> D[Clone phase: heavy I/O on recipient]
  C --> E[Critical section: range lock on donor]
  D --> F[Recipient cache dirty ratio rises]
  E --> G[Donor write latency spikes]
  F --> H{Catch-up fails or config server slow}
  G --> H
  H --> I[Migration fails or stalls]
  I --> J[Balancer retries]
  J --> K[Migration storm]

What this means

A chunk migration proceeds in three phases. Clone: the donor copies documents matching the chunk bounds to the recipient via a cursor. Recipient insertions flow through WiredTiger and can pressure the cache if the working set is cold. The clone ends with a catch-up step, in which the recipient applies writes that landed in the range while the copy was running. Critical section: the donor holds a shared range lock over the chunk’s shard key interval and queues writes targeting that range until the metadata commit completes. Cleanup: after the metadata commit, the donor deletes orphaned documents.

Under normal conditions the balancer runs migrations sequentially and impact is brief. A storm develops when:

The balancer retries a failed migration repeatedly.
Config server latency delays migration commits, extending the critical section.
A hot shard or jumbo chunks force the balancer to run continuously.
Recipient I/O saturation slows the clone, causing the donor to hold the range lock longer.

The result is latency spikes on both the donor (range lock queuing) and the recipient (clone-phase write I/O), with the config server gating the critical section.

Common causes

Cause	What it looks like	First thing to check
Jumbo chunks blocking migration	Balancer is active but chunk counts never equalize; repeated `moveChunk.error` entries	`db.getSiblingDB("config").chunks.find({jumbo: true})`
Config server latency gating commits	Write latency spikes outlast the clone phase; critical sections hang	`db.serverStatus().opLatencies` on the config primary
Hot shard forcing aggressive rebalancing	One shard holds >20% more chunks than the average; donor and recipient I/O both saturated	Per-shard chunk aggregation from `config.chunks`
Recipient cache saturation during clone	Recipient WiredTiger dirty ratio climbs and application threads start evicting	WiredTiger cache stats on the recipient primary
Sustained write load extending catch-up	`moveChunk.start` entries without matching commits; donor `currentQueue.writers` grows	`db.serverStatus().globalLock.currentQueue` on the donor

Quick checks

Run these read-only checks to confirm a migration storm is in progress.

// Check if the balancer is running
sh.isBalancerRunning()

// Review recent migration history and outcomes
db.getSiblingDB("config").changelog.find(
  { what: /moveChunk/ },
  { time: 1, what: 1, details: 1 }
).sort({ time: -1 }).limit(20)

// Count migration failures
db.getSiblingDB("config").changelog.countDocuments({ what: "moveChunk.error" })

// Check for unmigratable jumbo chunks
db.getSiblingDB("config").chunks.countDocuments({ jumbo: true })

// See chunk distribution across shards
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

// Check donor write queue depth
db.serverStatus().globalLock.currentQueue

# Check recipient disk health and I/O saturation (run in a host shell on the recipient, not in mongosh)
iostat -x 1 3

// Check config server primary latency
db.serverStatus().opLatencies
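The failure count above covers the whole changelog, while a storm is defined by what happened recently. The following is a minimal sketch, not a definitive tool, of how errors and commits might be compared inside a time window. The function name `summarizeMigrations` and the sample shard names are hypothetical; the `what` values and the `details.from` / `details.to` fields are the ones referenced in this guide.

```javascript
// Sketch: summarize moveChunk outcomes inside a recent time window.
// `entries` is an array of config.changelog documents ({ time, what, details }).
function summarizeMigrations(entries, now, windowMinutes) {
  const cutoff = now.getTime() - windowMinutes * 60 * 1000;
  const summary = { errors: 0, commits: 0, errorsByPair: {} };
  for (const e of entries) {
    if (new Date(e.time).getTime() < cutoff) continue; // older than the window
    if (e.what === "moveChunk.error") {
      summary.errors += 1;
      const d = e.details || {};
      const pair = `${d.from} -> ${d.to}`;
      summary.errorsByPair[pair] = (summary.errorsByPair[pair] || 0) + 1;
    } else if (e.what === "moveChunk.commit") {
      summary.commits += 1;
    }
  }
  // More failures than commits suggests the balancer is retrying in a loop.
  summary.retryLoopSuspected = summary.errors > summary.commits;
  return summary;
}

// Hand-written sample data (hypothetical shard names), so the sketch runs anywhere.
const sampleNow = new Date("2024-01-01T12:00:00Z");
const sampleEntries = [
  { time: new Date("2024-01-01T11:55:00Z"), what: "moveChunk.error", details: { from: "shardA", to: "shardB" } },
  { time: new Date("2024-01-01T11:50:00Z"), what: "moveChunk.error", details: { from: "shardA", to: "shardB" } },
  { time: new Date("2024-01-01T11:45:00Z"), what: "moveChunk.commit", details: { from: "shardC", to: "shardB" } },
  { time: new Date("2024-01-01T09:00:00Z"), what: "moveChunk.error", details: { from: "shardA", to: "shardC" } },
];
console.log(summarizeMigrations(sampleEntries, sampleNow, 30));
```

In mongosh, the same function could be fed live data, for example with `summarizeMigrations(db.getSiblingDB("config").changelog.find({ what: /^moveChunk\./ }).toArray(), new Date(), 30)`.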

How to diagnose it

Confirm the storm from config.changelog. Inspect the last 30 minutes. If moveChunk.error entries outnumber successful commits, the balancer is retrying faster than migrations complete. Note details.from and details.to to identify the affected shard pair, and details.errmsg for the failure reason.
Correlate donor latency with lock queuing. On the donor primary, compare db.serverStatus().opLatencies.writes against baseline. If write latency spikes while globalLock.currentQueue.writers is elevated, the critical section is serializing writes.
Check recipient cache pressure from the clone phase. On the recipient primary, inspect db.serverStatus().wiredTiger.cache. A rising dirty ratio or nonzero pages evicted by application threads means the clone write stream is overwhelming WiredTiger’s flush capacity. This slows catch-up and extends the donor lock.
Measure config server command latency. On the config server primary, check opLatencies.commands. Elevated latency delays the metadata commit that ends the critical section. Even fast clone phases turn into long lock holds if the config server is slow.
Identify the imbalance trigger. Run the chunk distribution aggregation. If skew exceeds 20% and jumbo chunks exist, the balancer is probably stuck retrying chunks it cannot move.
Map the timeline. Overlay changelog timestamps with per-shard opcounters and disk I/O metrics. If migration start times correlate with I/O saturation and latency spikes, the causal chain is confirmed.
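The recipient cache check in the steps above can be reduced to two numbers: the dirty ratio and whether application-thread evictions are increasing between samples. The sketch below assumes the statistic names `tracked dirty bytes in the cache`, `maximum bytes configured`, and `pages evicted by application threads` are present in `db.serverStatus().wiredTiger.cache`; names can differ between server versions, so treat them as assumptions to verify. The function names and the 15% default (taken from the warning level in this guide) are illustrative.

```javascript
// Sketch: derive cache pressure from one wiredTiger.cache section of serverStatus.
// Number() is used because mongosh may return 64-bit values as Long objects.
function cachePressure(cache) {
  const dirtyBytes = Number(cache["tracked dirty bytes in the cache"]);
  const maxBytes = Number(cache["maximum bytes configured"]);
  // Assumed stat name; defaults to 0 if the server does not report it.
  const appEvictions = Number(cache["pages evicted by application threads"] || 0);
  return { dirtyRatio: maxBytes > 0 ? dirtyBytes / maxBytes : 0, appEvictions };
}

// Compare two samples taken a few seconds apart on the recipient primary.
function cloneIsOverwhelmingCache(prevCache, currCache, dirtyThreshold = 0.15) {
  const prev = cachePressure(prevCache);
  const curr = cachePressure(currCache);
  return curr.dirtyRatio > dirtyThreshold || curr.appEvictions > prev.appEvictions;
}

// mongosh usage (not executed here):
//   const a = db.serverStatus().wiredTiger.cache; sleep(5000);
//   const b = db.serverStatus().wiredTiger.cache;
//   cloneIsOverwhelmingCache(a, b);
```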

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`config.changelog` `moveChunk` error rate	Failed migrations waste I/O and retry in tight loops	Errors exceed successes over any 10-minute window
Donor shard write `opLatencies`	The critical section range lock blocks concurrent writes to the chunk	p99 write latency spikes correlating with changelog entries
Recipient WiredTiger cache dirty ratio	Clone-phase writes arrive in bulk; if the disk cannot absorb them, dirty pages accumulate	Dirty ratio >15% or application-thread evictions incrementing
Config server primary command latency	Metadata commits gate how long the donor holds the range lock	Average command latency >2x baseline during balancing windows
Chunk distribution skew	Persistent imbalance forces the balancer to run continuously	Max-min chunk count >20% of the per-shard average
Jumbo chunk count	Unmigratable chunks create permanent hot spots and balancer retry loops	Any jumbo chunks on actively growing collections
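The skew warning in the table can be computed directly from the per-shard chunk aggregation. This is a rough sketch under the table's definition, (max - min) divided by the per-shard average; `chunkSkew` is a hypothetical helper name. Note that recent MongoDB releases balance on data size rather than chunk count, so on those versions chunk counts are only a proxy for imbalance.

```javascript
// Sketch: compute chunk-count skew from the output of the $group aggregation,
// i.e. an array of { _id: <shard name>, count: <chunk count> } documents.
function chunkSkew(distribution) {
  if (distribution.length === 0) {
    return { avg: 0, max: 0, min: 0, skew: 0 };
  }
  const counts = distribution.map((d) => Number(d.count));
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  // Skew as defined in the table: (max - min) relative to the per-shard average.
  return { avg, max, min, skew: avg === 0 ? 0 : (max - min) / avg };
}

// Hypothetical three-shard cluster.
const sampleDistribution = [
  { _id: "shardA", count: 120 },
  { _id: "shardB", count: 100 },
  { _id: "shardC", count: 80 },
];
const sampleSkew = chunkSkew(sampleDistribution);
console.log(sampleSkew.skew > 0.2 ? "skew above 20% warning level" : "skew within bounds");
```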

Fixes

Stop the balancer immediately

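Run sh.stopBalancer() to halt new migrations. This breaks the retry loop and stops new range locks and clone I/O. A balancing round that is already in progress is allowed to finish first, so relief may lag by the length of that migration.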

Warning: This pauses rebalancing. Chunk imbalance will persist until you re-enable the balancer, but it gives immediate relief.
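A minimal sketch of pausing the balancer and confirming that it is disabled. `sh.stopBalancer()` and `sh.getBalancerState()` are mongosh helpers; the wrapper name `pauseBalancer` is hypothetical, and the helper object is passed in as a parameter only so the function can be exercised outside mongosh.

```javascript
// Sketch: stop the balancer and report whether it is still enabled.
// `sh` is the mongosh sharding helper object, passed in explicitly.
function pauseBalancer(sh) {
  sh.stopBalancer(); // disables balancing; waits for an in-progress round
  const balancerEnabled = sh.getBalancerState(); // false once disabled
  return { balancerEnabled };
}

// mongosh usage (not executed here):
//   pauseBalancer(sh)      // expect { balancerEnabled: false }
//   sh.startBalancer()     // re-enable once the cluster has recovered
```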

Resolve jumbo chunks

Jumbo chunks cannot be migrated by the balancer. If your shard key cardinality supports it, manually split the chunk range. If the shard key itself is the problem, plan a reshard.

Warning: Resharding is heavy I/O. Schedule it outside peak traffic.
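For the manual split path, one possible workflow is to list the jumbo ranges and then split each one at a shard key value that falls inside it. The sketch below assumes `config.chunks` documents carry `jumbo`, `shard`, `min`, and `max` fields, and uses the mongosh helper `sh.splitAt(namespace, query)`. The function names, the `app.events` namespace, and the `customerId` key in the usage comment are hypothetical, and choosing a sensible split point remains a manual decision.

```javascript
// Sketch: extract the ranges of chunks flagged as jumbo.
function jumboRanges(chunks) {
  return chunks
    .filter((c) => c.jumbo === true)
    .map((c) => ({ shard: c.shard, min: c.min, max: c.max }));
}

// Sketch: split a chunk at an explicit shard key value via the mongosh helper.
// `sh` is passed in so the function can be tested without a cluster.
function splitChunkAt(sh, namespace, splitPoint) {
  if (!/^[^.]+\..+$/.test(namespace)) {
    throw new Error("namespace must look like <database>.<collection>");
  }
  return sh.splitAt(namespace, splitPoint);
}

// mongosh usage (not executed here; names are hypothetical):
//   jumboRanges(db.getSiblingDB("config").chunks.find({ jumbo: true }).toArray())
//   splitChunkAt(sh, "app.events", { customerId: 500000 })
```

A split only helps if the range contains more than one distinct shard key value; a chunk holding a single hot key cannot be split, which is the case where a reshard becomes the remaining option.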

Tune the balancer window

Restrict balancing to off-peak hours so migration I/O does not compete with application traffic. Tradeoff: data distribution lags behind traffic shifts during the day, but you avoid peak-hour latency spikes.
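The balancing window is stored as an `activeWindow` field on the `balancer` document in `config.settings`. The following sketch builds and validates that update before it is sent; `balancerWindowUpdate` is a hypothetical helper name and the 02:00 to 06:00 window is only an example. Window times are interpreted on the config server side, so it is worth confirming which time zone applies in your deployment.

```javascript
// Sketch: build the config.settings update that restricts the balancer window.
// Times are 24-hour "HH:MM" strings.
function balancerWindowUpdate(start, stop) {
  const hhmm = /^([01]\d|2[0-3]):[0-5]\d$/;
  if (!hhmm.test(start) || !hhmm.test(stop)) {
    throw new Error("start and stop must be HH:MM in 24-hour format");
  }
  if (start === stop) {
    throw new Error("start and stop must differ");
  }
  return {
    filter: { _id: "balancer" },
    update: { $set: { activeWindow: { start, stop } } },
    options: { upsert: true },
  };
}

// mongosh usage (not executed here):
//   const u = balancerWindowUpdate("02:00", "06:00");
//   db.getSiblingDB("config").settings.updateOne(u.filter, u.update, u.options);
console.log(JSON.stringify(balancerWindowUpdate("02:00", "06:00").update));
```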

Throttle writes during catch-up

If sustained write volume keeps extending the critical section, temporarily pause bulk ingestion or lower application write concurrency. Tradeoff: slower pipeline, but shorter range-lock hold time.

Fix config server storage latency

If the config server primary shows elevated opLatencies, investigate its underlying disk with iostat or host-level storage metrics. Do not restart config servers during a storm. Resolve the storage contention first so metadata commits can flow.
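The `opLatencies` counters are cumulative since startup, so a single reading says little about the current storm. A rough sketch of turning two samples into an average command latency for the interval between them follows. It assumes each `opLatencies.commands` section exposes a cumulative `latency` (in microseconds) and an `ops` count; the function names and the 2x-baseline factor (taken from the warning level in this guide) are illustrative.

```javascript
// Sketch: average per-command latency (microseconds) between two samples of
// db.serverStatus().opLatencies.commands on the config server primary.
// Number() is used because mongosh may return 64-bit counters as Long objects.
function avgLatencyMicros(prev, curr) {
  const ops = Number(curr.ops) - Number(prev.ops);
  const latency = Number(curr.latency) - Number(prev.latency);
  return ops > 0 ? latency / ops : 0;
}

// Flag the config server when the interval average exceeds a multiple of baseline.
function configLatencyElevated(prev, curr, baselineMicros, factor = 2) {
  return avgLatencyMicros(prev, curr) > baselineMicros * factor;
}

// mongosh usage (not executed here):
//   const a = db.serverStatus().opLatencies.commands; sleep(10000);
//   const b = db.serverStatus().opLatencies.commands;
//   avgLatencyMicros(a, b);
```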

Prevention

Monitor chunk distribution trends. Alert when skew exceeds 15%, before the balancer storms.
Size the recipient WiredTiger cache for clone load. Ensure the cache can absorb migration writes without dirty-ratio spikes.
Choose a high-cardinality shard key. This minimizes jumbo chunks and reduces the frequency of rebalancing.
Watch config server latency as a leading indicator. Elevated command latency on the config primary predicts migration stalls before they cascade to shards.
Restrict the balancer window. Limit automatic balancing to maintenance or low-traffic periods.

How Netdata helps

WiredTiger cache dirty ratio and application-thread eviction charts on recipient shards reveal clone-phase I/O saturation.
opLatencies across shard primaries and config servers in one view surface cross-shard write latency patterns from range-lock queuing.
Disk latency and utilization alerts flag when migration I/O saturates donor or recipient storage.
Ticket utilization and queue-depth charts on donor shards highlight admission-control backlog from range locks.

