MongoDB balancer stuck and jumbo chunks: permanent imbalance and how to fix it
One shard is hot while the others idle. sh.status() shows chunk counts skewed more than 20%. The balancer is either stopped or running without closing the gap. Until you fix the root cause, the imbalance persists.
Two failure modes cause this. The balancer itself can be disabled, restricted to a narrow window, or blocked by an unhealthy config server. Or the cluster has jumbo chunks: ranges that exceed the configured chunkSize but cannot split because too many documents share the exact same shard key value. MongoDB marks those chunks jumbo in config.chunks and the balancer skips them. That heavy chunk pins load and storage on a single shard and creates a floor on how balanced the cluster can become.
This guide covers diagnosing which mode is active, read-only checks to confirm it, and fixes that restore balance.
What this means
The balancer is a background process that runs on the config server primary. It moves chunks between shards so each shard owns roughly equal data. Chunk migrations are expensive: they copy documents to the destination, synchronize changes, and commit metadata, briefly blocking writes on the migrating range during the commit phase.
A chunk normally splits when it exceeds chunkSize (64 MB by default before MongoDB 6.0, 128 MB from 6.0 onward). After splitting, the balancer migrates the smaller pieces. A jumbo chunk breaks this pipeline. If many documents share an identical shard key value, MongoDB cannot find a valid split point inside the range. The chunk is marked jumbo and will not migrate. Even if every other chunk is balanced, that one heavy chunk leaves disproportionate load on one shard.
When total skew is above 20% and the balancer is inactive or cannot redistribute chunks, you are in this failure mode.
flowchart TD
A[Chunk skew exceeds 20%] --> B{Balancer enabled?}
B -->|No| C[Check state and window]
B -->|Yes| D{Jumbo chunks?}
D -->|Yes| E[Cannot split or migrate]
D -->|No| F[Check config server and I/O]
C --> G[Re-enable or widen window]
E --> H[Reshard or redesign key]
F --> I[Resolve config or saturation]
Common causes
| Cause | What it looks like | First check |
|---|---|---|
| Balancer disabled | sh.status() shows no recent migrations; skew persists | sh.getBalancerState() |
| Jumbo chunks blocking migration | Specific ranges stuck on one shard despite active balancer | db.getSiblingDB("config").chunks.find({ jumbo: true }) |
| Config server unavailable | No splits or migrations proceed; metadata operations hang | rs.status() on the config server replica set |
| Balancer window too narrow | Skew accumulates during the day; minimal rebalancing overnight | The configured balancer active window |
| Migration I/O saturation | Migrations start but stall; donor or recipient latency spikes | Per-shard WiredTiger ticket utilization and cache pressure |
Quick checks
Run these from mongos unless noted. All are read-only.
// Balancer state. getBalancerState returns whether balancing is enabled.
// isBalancerRunning is true only during an active round, so a false value
// by itself does not mean the balancer is stuck.
sh.getBalancerState()
sh.isBalancerRunning()
// Chunk count per shard
db.getSiblingDB("config").chunks.aggregate([
{ $group: { _id: "$shard", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
])
// Jumbo chunks that block migration.
// MongoDB 5.0+ identifies the collection by uuid rather than ns, so project both.
db.getSiblingDB("config").chunks.find(
{ jumbo: true },
{ ns: 1, uuid: 1, min: 1, max: 1, shard: 1 }
)
// Recent migration attempts and outcomes
db.getSiblingDB("config").changelog.find(
{ what: /^moveChunk/ },
{ time: 1, what: 1, "details.errmsg": 1, "details.from": 1, "details.to": 1 }
).sort({ time: -1 }).limit(20)
// Config server health. Run against a config server member.
rs.status()
// Compare workload intensity across shard primaries
db.serverStatus().opcounters
// WiredTiger concurrency saturation on the hot shard
db.serverStatus().wiredTiger.concurrentTransactions
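The chunk count aggregation above returns one document per shard. As a minimal sketch, a small helper can turn that output into a single skew percentage. The function name chunkSkew and the choice to express skew as the spread between the most-loaded and least-loaded shard relative to the average are assumptions for illustration, not a MongoDB API.

```javascript
// Hypothetical helper: compute chunk-count skew from per-shard counts.
// Input shape matches the $group output: [{ _id: "shardA", count: 120 }, ...]
function chunkSkew(perShard) {
  if (perShard.length === 0) {
    throw new Error("no shards in input");
  }
  const counts = perShard.map((s) => s.count);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  // Spread between most- and least-loaded shard, as a percentage of the average.
  const skewPct = avg === 0 ? 0 : ((max - min) * 100) / avg;
  return { max, min, avg, skewPct };
}

// In mongosh, feed it the aggregation result:
// chunkSkew(db.getSiblingDB("config").chunks.aggregate([
//   { $group: { _id: "$shard", count: { $sum: 1 } } }
// ]).toArray())
```

A result with skewPct above 20 corresponds to the imbalance threshold used in this guide.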
How to diagnose it
- Quantify imbalance. Run the chunk count aggregation. If the difference between the most-loaded and least-loaded shard exceeds 20% of the average, the cluster is operationally imbalanced.
- Check balancer configuration. sh.getBalancerState() must return true. sh.isBalancerRunning() only reports an active round; if it is false, look at config.changelog over the last hour to see whether any moveChunk events committed. If no migrations occur during an enabled period, the balancer is stalled.
- Hunt jumbo chunks. db.getSiblingDB("config").chunks.find({ jumbo: true }) returns ranges that cannot migrate. Count them and note which shards own them. If jumbo chunks sit on the hot shard, they explain at least part of the skew.
- Inspect the config server. Run rs.status() against the config server replica set. The balancer needs a healthy primary. Any member not in PRIMARY or SECONDARY, or replication lag above a few seconds, can stall metadata operations.
- Read the changelog. Look for recent moveChunk.commit and moveChunk.abort entries. Repeated aborts with details.errmsg mean migrations are failing. Correlate timestamps with per-shard metrics. If failures cluster during peak traffic, donor or recipient I/O saturation is the likely cause.
- Correlate shard health. Compare opLatencies, cache dirty ratio, and available WiredTiger tickets between the hot shard and the others. If the hot shard shows ticket exhaustion or application-thread evictions while idle shards are comfortable, the imbalance has already caused local saturation.
- Decide the root cause. If the balancer is disabled, enable it. If it is enabled but jumbo chunks exist, the data distribution is the root cause. If it is enabled, no jumbos exist, and migrations abort, look at config server health or shard-level resource exhaustion.
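The final decision step can be sketched as a small function. This is only an illustration of the ordering described above: the function name triage, its input fields, and the returned labels are all hypothetical, and the inputs are values you would collect yourself from the quick checks.

```javascript
// Hypothetical triage helper mirroring the "decide the root cause" step.
// All field names and return labels are illustrative, not MongoDB output.
function triage({ balancerEnabled, jumboCount, recentCommits, recentAborts }) {
  if (!balancerEnabled) {
    return "balancer-disabled"; // re-enable the balancer first
  }
  if (jumboCount > 0) {
    return "jumbo-chunks"; // data distribution is the root cause
  }
  if (recentAborts > 0) {
    return "migrations-aborting"; // check config server health and shard resources
  }
  if (recentCommits === 0) {
    return "balancer-stalled"; // enabled, but nothing committed recently
  }
  return "balancing-in-progress";
}

// Example: balancer on, three jumbo chunks found on the hot shard.
// triage({ balancerEnabled: true, jumboCount: 3, recentCommits: 5, recentAborts: 0 })
```

The checks run in the same order as the list: balancer state first, then jumbo chunks, then migration failures.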
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Chunk distribution skew | Predicts hot shard and query latency bias | >20% count difference between any two shards |
| Balancer state | Tells you whether rebalancing is attempted | sh.getBalancerState() returns false or no migrations in changelog for >1 hour |
| Jumbo chunk count | Permanent migration blockers | Any jumbo: true chunk persisting >1 hour |
| Config server member state | Metadata operations need a healthy primary | Any config server not PRIMARY or SECONDARY |
| Per-shard opLatencies p99 | Detects hot shard before saturation | One shard p99 >2x the others for >5 minutes |
| WiredTiger ticket utilization | Migrations add I/O pressure on donor and recipient | Available tickets <25% of total during active migrations |
| WiredTiger cache dirty ratio | Checkpoint pressure rises during large moves | Dirty ratio >10% sustained during migration windows |
| Migration changelog abort rate | Direct evidence of balancer failures | Repeated moveChunk.abort entries with details.errmsg |
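To track the changelog abort rate as a number, one option is to tally recent entries from config.changelog. The sketch below is an assumption-laden illustration: the function name migrationOutcomes is hypothetical, and it treats an entry as failed if it carries details.errmsg or if its what value ends in .abort or .error. The exact what values recorded can differ between MongoDB versions, so verify them against your own changelog before relying on the result.

```javascript
// Hypothetical helper: tally migration outcomes from config.changelog entries.
// Assumption: a failure is any entry with details.errmsg, or whose `what`
// ends in ".abort" or ".error". Check the values your version actually logs.
function migrationOutcomes(entries) {
  let commits = 0;
  let failures = 0;
  for (const e of entries) {
    const hasErr = Boolean(e.details && e.details.errmsg);
    const failedKind = /\.(abort|error)$/.test(e.what);
    if (hasErr || failedKind) {
      failures += 1;
    } else if (e.what === "moveChunk.commit") {
      commits += 1;
    }
    // Other entries (for example moveChunk.start) are ignored.
  }
  const total = commits + failures;
  return { commits, failures, failureRate: total === 0 ? 0 : failures / total };
}

// In mongosh:
// migrationOutcomes(db.getSiblingDB("config").changelog
//   .find({ what: /^moveChunk/ }).sort({ time: -1 }).limit(200).toArray())
```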
Fixes
Re-enable or widen the balancer
If the balancer was disabled during maintenance, re-enable it from mongos:
sh.startBalancer()
To restrict balancing to off-peak hours, set an active window.
Avoid widening the window into peak traffic until donor and recipient shards have enough headroom, because migrations add read and write load to both sides.
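The active window is stored on the balancer document in config.settings as activeWindow with start and stop times in HH:MM format. As a hedged sketch, the helper below only builds and validates the update arguments; the function name balancerWindowUpdate and the example times are assumptions, and the times are interpreted in the time zone of the config server primary.

```javascript
// Hypothetical helper: build the arguments for setting the balancer window.
// It validates the HH:MM (24-hour) format and returns plain objects.
function balancerWindowUpdate(start, stop) {
  const hhmm = /^([01]\d|2[0-3]):[0-5]\d$/;
  if (!hhmm.test(start) || !hhmm.test(stop)) {
    throw new Error("times must be HH:MM in 24-hour format");
  }
  return {
    filter: { _id: "balancer" },
    update: { $set: { activeWindow: { start: start, stop: stop } } },
    options: { upsert: true },
  };
}

// In mongosh, connected to mongos (example times are illustrative):
// const w = balancerWindowUpdate("23:00", "06:00");
// db.getSiblingDB("config").settings.updateOne(w.filter, w.update, w.options);
```

The balancer must also be enabled for the window to have any effect.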
Resolve config server pressure
The balancer cannot commit metadata changes without a config server primary. If the config server replica set has no primary, is in election churn, or suffers replication lag, fix the replica set first. Do not attempt chunk migrations during config server instability. See How MongoDB actually works in production for config server failure modes.
Address jumbo chunks
If the chunk is splittable: some jumbo chunks contain distinct shard key values but grew too large because the splitter lagged. Force a split at a valid boundary:
// Split the chunk containing this document at a midpoint
sh.splitFind("database.collection", { shardKeyField: "value" })
// Or split at an explicit boundary
sh.splitAt("database.collection", { shardKeyField: "value" })
After the split, the resulting chunks are no longer jumbo and the balancer can migrate them.
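Whether a split can work depends on how documents are distributed over shard key values. One way to check, sketched here under the assumption of a single-field shard key, is to group by that field and list the values that hold the most documents. The function name heavyKeyPipeline is hypothetical, and the aggregation scans the whole collection, so it is better run off-peak.

```javascript
// Hypothetical helper: build a pipeline listing the shard key values
// that hold the most documents. Assumes a single-field shard key.
function heavyKeyPipeline(shardKeyField, limit = 10) {
  return [
    { $group: { _id: "$" + shardKeyField, docs: { $sum: 1 } } },
    { $sort: { docs: -1 } },
    { $limit: limit },
  ];
}

// In mongosh (database, collection, and field names are placeholders):
// db.getSiblingDB("database").collection.aggregate(
//   heavyKeyPipeline("shardKeyField"), { allowDiskUse: true })
```

If one value dominates the output, no split point exists inside that value's range and the chunk holding it will stay oversized.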
If documents share an identical shard key value: MongoDB cannot split the chunk, and the balancer will not migrate it. The permanent fix is a shard key that can distribute the documents. In MongoDB 5.0 and later, use live resharding:
db.adminCommand({
reshardCollection: "database.collection",
key: { newShardKeyField: 1 }
})
Live resharding rewrites chunks in the background, but it is I/O intensive and should be scheduled during a low-traffic window.
In MongoDB 4.4 and earlier, live resharding is not available. The collection must be rebuilt:
- Choose the new shard key.
- Dump the collection or copy it to an unsharded staging collection.
- Drop the original collection.
- Create a new empty collection with the desired shard key using shardCollection.
- Restore the data.
- Update application references if necessary.
For large collections this takes time and disk space. Plan the cutover carefully; for some workloads it is easier to dual-write to a new sharded collection and migrate reads gradually.
Reduce migration pressure
If migrations abort because donor or recipient shards saturate, either add capacity to those shards or move the balancer window to a quieter period. You can also reduce the cost of each migration by temporarily lowering chunkSize so individual chunks are smaller, but this increases the total number of migrations. Use it only when shards have spare I/O and the network between shards is healthy.
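The trade-off in lowering chunkSize can be shown with rough arithmetic. The sketch below assumes, for simplicity, that every chunk fills to exactly chunkSize; real chunks are usually smaller, so treat the result as an order-of-magnitude estimate. The function name estimateChunks is hypothetical.

```javascript
// Hypothetical helper: rough chunk count for a collection at a given chunkSize.
// Assumes chunks fill completely, which real clusters only approximate.
function estimateChunks(dataSizeMB, chunkSizeMB) {
  if (chunkSizeMB <= 0) {
    throw new Error("chunkSizeMB must be positive");
  }
  return Math.ceil(dataSizeMB / chunkSizeMB);
}

// 100 GB of data (102400 MB):
// estimateChunks(102400, 64) gives 1600 chunks
// estimateChunks(102400, 32) gives 3200 chunks
```

Halving chunkSize roughly doubles the number of chunks the balancer may need to move, while each individual move copies about half as much data.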
Validate
After each change, re-run the chunk count aggregation and the jumbo query. Expect skew to drop below 10% over the next few balancer rounds. If skew stalls above 10% and no jumbo chunks remain, check the changelog for aborts and the config server for instability.







