

MongoDB balancer stuck and jumbo chunks: permanent imbalance and how to fix it

One shard is hot while the others idle. sh.status() shows chunk counts skewed more than 20%. The balancer is either stopped or running without closing the gap. Until you fix the root cause, the imbalance persists.

Two failure modes cause this. The balancer itself can be disabled, restricted to a narrow window, or blocked by an unhealthy config server. Or the cluster has jumbo chunks: ranges that exceed the configured chunkSize but cannot split because too many documents share the exact same shard key value. MongoDB marks those chunks jumbo in config.chunks and the balancer skips them. That heavy chunk pins load and storage on a single shard and creates a floor on how balanced the cluster can become.

This guide covers diagnosing which mode is active, read-only checks to confirm it, and fixes that restore balance.

What this means

The balancer is a background process that runs on the config server primary. It moves chunks between shards so each shard owns roughly equal data. Chunk migrations are expensive: they copy documents to the destination, synchronize changes, and commit metadata, briefly blocking writes on the migrating range during the commit phase.

A chunk normally splits when it exceeds chunkSize (64 MB by default before MongoDB 6.0, 128 MB from 6.0 onward). After splitting, the balancer migrates the smaller pieces. A jumbo chunk breaks this pipeline. If many documents share an identical shard key value, MongoDB cannot find a valid split point inside the range. The chunk is marked jumbo and will not migrate. Even if every other chunk is balanced, that one heavy chunk leaves disproportionate load on one shard.

When total skew is above 20% and the balancer is inactive or cannot redistribute chunks, you are in this failure mode.

flowchart TD
    A[Chunk skew exceeds 20%] --> B{Balancer enabled?}
    B -->|No| C[Check state and window]
    B -->|Yes| D{Jumbo chunks?}
    D -->|Yes| E[Cannot split or migrate]
    D -->|No| F[Check config server and I/O]
    C --> G[Re-enable or widen window]
    E --> H[Reshard or redesign key]
    F --> I[Resolve config or saturation]

Common causes

Cause	What it looks like	First check
Balancer disabled	`sh.status()` shows no recent migrations; skew persists	`sh.getBalancerState()`
Jumbo chunks blocking migration	Specific ranges stuck on one shard despite active balancer	`db.getSiblingDB("config").chunks.find({ jumbo: true })`
Config server unavailable	No splits or migrations proceed; metadata operations hang	`rs.status()` on the config server replica set
Balancer window too narrow	Skew accumulates during the day; minimal rebalancing overnight	`db.getSiblingDB("config").settings.find({ _id: "balancer" })` for the configured `activeWindow`
Migration I/O saturation	Migrations start but stall; donor or recipient latency spikes	Per-shard WiredTiger ticket utilization and cache pressure

Quick checks

Run these from mongos unless noted. All are read-only.

// Balancer state. getBalancerState returns whether balancing is enabled.
// isBalancerRunning is true only during an active round, so a false value
// by itself does not mean the balancer is stuck.
sh.getBalancerState()
sh.isBalancerRunning()

// Chunk count per shard
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

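// Jumbo chunks that block migration.
// config.chunks identifies the collection by ns before MongoDB 5.0
// and by uuid from 5.0 onward, so project both.
db.getSiblingDB("config").chunks.find(
  { jumbo: true },
  { ns: 1, uuid: 1, min: 1, max: 1, shard: 1 }
)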

// Recent migration attempts and outcomes
db.getSiblingDB("config").changelog.find(
  { what: /^moveChunk/ },
  { time: 1, what: 1, "details.errmsg": 1, "details.from": 1, "details.to": 1 }
).sort({ time: -1 }).limit(20)

// Config server health. Run against a config server member.
rs.status()

// Compare workload intensity across shard primaries
db.serverStatus().opcounters

// WiredTiger concurrency saturation on the hot shard
db.serverStatus().wiredTiger.concurrentTransactions
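
The chunk count aggregation returns one document per shard, which still has to be turned into a single skew figure. A minimal sketch of that arithmetic follows, using the definition from this guide (spread between the fullest and emptiest shard, relative to the average). The function name chunkSkew and its return shape are hypothetical, not part of mongosh. Note that a shard owning zero chunks does not appear in the aggregation output, so add it with a count of 0 before calling.

```javascript
// Hypothetical helper: takes documents shaped like the aggregation output
// ({ _id: <shard name>, count: <chunk count> }) and reports relative skew.
function chunkSkew(perShard) {
  if (perShard.length === 0) {
    throw new Error("no shards in aggregation output");
  }
  const counts = perShard.map((s) => s.count);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  const avg = counts.reduce((a, b) => a + b, 0) / counts.length;
  return {
    max,
    min,
    avg,
    // Spread between the fullest and emptiest shard, relative to the average.
    skew: avg === 0 ? 0 : (max - min) / avg,
  };
}

// In mongosh, connected to mongos (not executed here):
//   const perShard = db.getSiblingDB("config").chunks.aggregate([
//     { $group: { _id: "$shard", count: { $sum: 1 } } }
//   ]).toArray();
//   chunkSkew(perShard).skew > 0.2   // true means operationally imbalanced
```

On releases where the balancer works from data size rather than chunk counts, treat this number as a proxy and confirm with per-shard data size.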

How to diagnose it

1. Quantify imbalance. Run the chunk count aggregation. If the difference between the most-loaded and least-loaded shard exceeds 20% of the average, the cluster is operationally imbalanced.
2. Check balancer configuration. sh.getBalancerState() must return true. sh.isBalancerRunning() only reports an active round; if it is false, look at config.changelog over the last hour to see whether any moveChunk events committed. If no migrations occur during an enabled period, the balancer is stalled.
3. Hunt jumbo chunks. db.getSiblingDB("config").chunks.find({ jumbo: true }) returns ranges that cannot migrate. Count them and note which shards own them. If jumbo chunks sit on the hot shard, they explain at least part of the skew.
4. Inspect the config server. Run rs.status() against the config server replica set. The balancer needs a healthy primary. Any member not in PRIMARY or SECONDARY, or replication lag above a few seconds, can stall metadata operations.
5. Read the changelog. Look for recent moveChunk.commit and moveChunk.abort entries. Repeated aborts with details.errmsg mean migrations are failing. Correlate timestamps with per-shard metrics. If failures cluster during peak traffic, donor or recipient I/O saturation is the likely cause.
6. Correlate shard health. Compare opLatencies, cache dirty ratio, and available WiredTiger tickets between the hot shard and the others. If the hot shard shows ticket exhaustion or application-thread evictions while idle shards are comfortable, the imbalance has already caused local saturation.
7. Decide the root cause. If the balancer is disabled, enable it. If it is enabled but jumbo chunks exist, the data distribution is the root cause. If it is enabled, no jumbos exist, and migrations abort, look at config server health or shard-level resource exhaustion.
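
The decision in the last step can be written down as a small triage function, which is handy if you want to run the same logic from a script or a runbook. This is only a sketch of the ordering described above. The function name classifyImbalance, the field names on facts, and the returned labels are all hypothetical; you would fill the inputs from the quick checks yourself.

```javascript
// Hypothetical triage helper. Every field on `facts` is an illustrative
// name chosen for this sketch, not something MongoDB returns directly.
function classifyImbalance(facts) {
  const {
    balancerEnabled, // result of sh.getBalancerState()
    jumboChunks,     // number of config.chunks documents with jumbo: true
    configHealthy,   // your own verdict from rs.status() on the config servers
    recentAborts,    // failed migrations seen in config.changelog
    recentCommits,   // committed migrations seen in config.changelog
  } = facts;

  if (!balancerEnabled) return "balancer-disabled";
  if (jumboChunks > 0) return "jumbo-chunks";
  if (!configHealthy) return "config-server";
  if (recentAborts > 0) return "shard-resource-exhaustion";
  // Enabled, healthy, nothing failing, yet nothing moving: check the window.
  if (recentCommits === 0) return "no-migrations-check-window";
  return "balancing";
}
```

The checks are ordered from cheapest to confirm to most involved, so the first match is the first thing to fix, not necessarily the only problem.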

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Chunk distribution skew	Predicts hot shard and query latency bias	Difference between the most-loaded and least-loaded shard >20% of the average chunk count
Balancer state	Tells you whether rebalancing is attempted	`sh.getBalancerState()` returns `false` or no migrations in changelog for >1 hour
Jumbo chunk count	Permanent migration blockers	Any `jumbo: true` chunk persisting >1 hour
Config server member state	Metadata operations need a healthy primary	Any config server not `PRIMARY` or `SECONDARY`
Per-shard `opLatencies` p99	Detects hot shard before saturation	One shard p99 >2x the others for >5 minutes
WiredTiger ticket utilization	Migrations add I/O pressure on donor and recipient	Available tickets <25% of total during active migrations
WiredTiger cache dirty ratio	Checkpoint pressure rises during large moves	Dirty ratio >10% sustained during migration windows
Migration changelog abort rate	Direct evidence of balancer failures	Repeated `moveChunk.abort` entries with `details.errmsg`
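
To turn the changelog into an abort rate you can alert on, something like the following sketch may help. Event names and detail fields in config.changelog differ between server versions, so this does not depend on a specific abort event name. It assumes that donor-side moveChunk.from entries carry details.note set to "success" on success and details.errmsg on failure; verify that against your own changelog before relying on it. The function name migrationOutcomes is hypothetical.

```javascript
// Hypothetical helper: summarizes migration outcomes from changelog entries.
// Assumption: one "moveChunk.from" entry per migration attempt, with
// details.note === "success" on success and details.errmsg on failure.
function migrationOutcomes(entries) {
  let succeeded = 0;
  let failed = 0;
  const errors = {}; // error message -> occurrences
  for (const e of entries) {
    if (e.what !== "moveChunk.from") continue;
    const details = e.details || {};
    if (details.errmsg === undefined && details.note === "success") {
      succeeded += 1;
    } else {
      failed += 1;
      const key = details.errmsg || details.note || "unknown";
      errors[key] = (errors[key] || 0) + 1;
    }
  }
  const total = succeeded + failed;
  return {
    succeeded,
    failed,
    abortRate: total === 0 ? 0 : failed / total,
    errors,
  };
}

// In mongosh (not executed here):
//   const since = new Date(Date.now() - 60 * 60 * 1000);
//   const entries = db.getSiblingDB("config").changelog
//     .find({ what: /^moveChunk/, time: { $gt: since } }).toArray();
//   migrationOutcomes(entries);
```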

Fixes

Re-enable or widen the balancer

If the balancer was disabled during maintenance, re-enable it from mongos:

sh.startBalancer()

To restrict balancing to off-peak hours, set an active window.
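
The window lives in the balancer document of config.settings as an activeWindow with start and stop times in 24-hour HH:MM form. A minimal sketch follows: it builds and validates the update, and the commented lines show how it might be applied from mongos. The helper name balancerWindowUpdate and the 01:00 to 05:00 window are illustrative assumptions. Times are evaluated in the time zone of the config server primary, so check that before choosing values.

```javascript
// Hypothetical helper: builds the update that sets the balancer active window.
// start and stop are "HH:MM" strings on a 24-hour clock.
function balancerWindowUpdate(start, stop) {
  const hhmm = /^([01]\d|2[0-3]):[0-5]\d$/;
  if (!hhmm.test(start) || !hhmm.test(stop)) {
    throw new Error("start and stop must be HH:MM on a 24-hour clock");
  }
  if (start === stop) {
    throw new Error("start and stop must differ");
  }
  return {
    filter: { _id: "balancer" },
    update: { $set: { activeWindow: { start, stop } } },
    options: { upsert: true },
  };
}

// In mongosh, connected to mongos (not executed here):
//   const w = balancerWindowUpdate("01:00", "05:00");
//   db.getSiblingDB("config").settings.updateOne(w.filter, w.update, w.options);
```

The balancer must also be enabled for the window to have any effect.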

Avoid widening the window into peak traffic until donor and recipient shards have enough headroom, because migrations add read and write load to both sides.

Resolve config server pressure

The balancer cannot commit metadata changes without a config server primary. If the config server replica set has no primary, is in election churn, or suffers replication lag, fix the replica set first. Do not attempt chunk migrations during config server instability. See "How MongoDB actually works in production" for config server failure modes.
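
Before resuming migrations, it can help to reduce the rs.status() output to a short list of problems. The sketch below assumes the usual members array with name, stateStr, and optimeDate fields. The function name configServerProblems and the 5-second default lag threshold are assumptions for illustration; the guide only says "a few seconds".

```javascript
// Hypothetical helper: lists reasons the config server replica set may be
// unfit for balancing, based on an rs.status()-shaped object.
function configServerProblems(status, maxLagSeconds = 5) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  const problems = [];
  if (!primary) problems.push("no PRIMARY");
  for (const m of members) {
    if (m.stateStr !== "PRIMARY" && m.stateStr !== "SECONDARY") {
      problems.push(`${m.name} is ${m.stateStr}`);
    } else if (primary && m.stateStr === "SECONDARY") {
      // Lag is the gap between the primary's and the member's last applied op.
      const lagMs = primary.optimeDate.getTime() - m.optimeDate.getTime();
      const lag = lagMs / 1000;
      if (lag > maxLagSeconds) problems.push(`${m.name} lags ${lag}s`);
    }
  }
  return problems;
}

// In mongosh, connected to a config server member (not executed here):
//   configServerProblems(rs.status())   // empty array means no findings
```

An empty result only means these two checks passed; election churn shows up over time, so run it more than once.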

Address jumbo chunks

If the chunk is splittable: some jumbo chunks contain distinct shard key values but grew too large because the splitter lagged. Force a split at a valid boundary:

// Split the chunk containing the first matching document at its median point
sh.splitFind("database.collection", { shardKeyField: "value" })

// Or split at an explicit boundary
sh.splitAt("database.collection", { shardKeyField: "value" })

After the split, the resulting chunks are no longer jumbo and the balancer can migrate them.

If documents share an identical shard key value: MongoDB cannot split the chunk, and the balancer will not migrate it. The permanent fix is a shard key that can distribute the documents. In MongoDB 5.0 and later, use live resharding:

db.adminCommand({
  reshardCollection: "database.collection",
  key: { newShardKeyField: 1 }
})

Live resharding rewrites chunks in the background, but it is I/O intensive and should be scheduled during a low-traffic window.

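In MongoDB 4.4 and earlier, live resharding is not available. On 4.4, refineCollectionShardKey can append suffix fields to the existing shard key, which helps only when the added field separates the documents that currently share a key value. Otherwise the collection must be rebuilt: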

1. Choose the new shard key.
2. Dump the collection or copy it to an unsharded staging collection, and verify that the copy is complete.
3. Drop the original collection.
4. Create a new empty collection with the desired shard key using shardCollection.
5. Restore the data.
6. Update application references if necessary.

For large collections this takes time and disk space. Plan the cutover carefully; for some workloads it is easier to dual-write to a new sharded collection and migrate reads gradually.

Reduce migration pressure

If migrations abort because donor or recipient shards saturate, either add capacity or move the migration window to a cooler period. You can also reduce the impact per migration by temporarily lowering chunkSize so individual chunks are smaller, but this increases the total number of migrations. Use it only when shards have spare I/O and the network between shards is healthy.
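
The cluster-wide chunk size is stored in the chunksize document of config.settings, with value expressed in megabytes. A minimal sketch of building that update is shown here. The helper name chunkSizeUpdate and the 32 MB figure are illustrative assumptions, and the 1 to 1024 MB bounds reflect the documented allowed range as far as I know. How quickly existing chunks shrink after the change depends on the server version, so do not expect an immediate effect.

```javascript
// Hypothetical helper: builds the update that sets the cluster chunk size.
// sizeMB is a whole number of megabytes.
function chunkSizeUpdate(sizeMB) {
  if (!Number.isInteger(sizeMB) || sizeMB < 1 || sizeMB > 1024) {
    throw new RangeError("chunk size must be an integer between 1 and 1024 MB");
  }
  return {
    filter: { _id: "chunksize" },
    update: { $set: { value: sizeMB } },
    options: { upsert: true },
  };
}

// In mongosh, connected to mongos (not executed here):
//   const c = chunkSizeUpdate(32);
//   db.getSiblingDB("config").settings.updateOne(c.filter, c.update, c.options);
```

Record the previous value first so the temporary change can be reverted once the cluster is balanced.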

Validate

After each change, re-run the chunk count aggregation and the jumbo query. Expect skew to drop below 10% over the next few balancer rounds. If skew stalls above 10% and no jumbo chunks remain, check the changelog for aborts and the config server for instability.

The Netdata solution

MongoDB monitoring with Netdata

Netdata monitors MongoDB with per-second metrics and automatic dashboards. Watch WiredTiger cache pressure, oplog window, connection counts, checkpoint stalls, and replication health in one place, correlated with the underlying host.
