Elasticsearch too many shards per node: overallocation and the heap tax

Searches slow from milliseconds to seconds. Nodes drop out after long GC pauses, triggering shard relocations that overload the remaining nodes. Cluster health stays green while _cat/nodes shows climbing heap and segments.memory growing steadily. The cause is usually shard overallocation: thousands of open shards with individual nodes carrying more than their heap can sustain.

Each shard is a self-contained Lucene index. Elasticsearch keeps per-segment metadata in heap memory, and that overhead is charged per field, not per segment. A node with 1,000 shards may be spending multiple gigabytes of heap on metadata alone, leaving less room for caches, aggregations, and in-flight requests.

Small shards are worse than large ones: they accumulate more segments per gigabyte of data, and because merges are less efficient on small indices, the segment count and per-field overhead remain high.

The default cluster.max_shards_per_node is 1,000 for normal data nodes. Hitting it blocks new index creation with an HTTP 400 error. However, the practical comfort threshold is much lower. Per-node counts above 800 are critical; below 200 is comfortable. Between 500 and 800, cluster state management slows, master nodes burn CPU on allocation decisions, and heap pressure becomes visible in old GC metrics.

flowchart TD
    A[Too many shards per node] -- segment metadata --> B[Heap pressure rises]
    B -- stop-the-world GC --> C[Long old-gen pauses]
    C -- misses checks --> D[Node removed by master]
    D -- triggers --> E[Shard reallocation]
    E -- adds overhead --> F[Remaining nodes overloaded]
    F -- feeds back --> A

Common causes

CauseWhat it looks likeFirst thing to check
Time-series indices with short retention and no ILM cleanupShard count grows daily; old indices remain open_cat/indices?v&s=index:desc and ILM policy status
Over-sharded indices (many primaries for small data volumes)Many tiny shards with low document count_cat/indices?v&h=index,pri,rep,docs.count,store.size
High replica count on many small indicesReplica shards double the count without proportional value_cat/shards?v&h=index,shard,prirep,state
Age-based ILM rollover without size checksLow-document indices that still consume a full shard quota_cat/indices?v&h=index,pri,rep,docs.count&s=docs.count:asc
Missing ILM delete or shrink phasesIndices accumulate in hot/warm foreverGET /*/_ilm/explain?only_errors=true&only_managed=true

Quick checks

Run these safe, read-only commands to assess the scope of overallocation.

# Count shards per node (includes both primary and replica)
curl -s 'http://localhost:9200/_cat/shards?h=node,shard,prirep,index' | awk '{print $1}' | sort | uniq -c | sort -rn | head

# Check segment memory and total segments per node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory,heap.percent'

# Verify the cluster-wide shard limit
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node'

# List indices by primary shard count and store size
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=pri:desc' | head -20

# Find low-document indices that may be empty or near-empty
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=docs.count:asc' | head -20

# Identify the largest heap consumers on each node
curl -s 'http://localhost:9200/_nodes/stats/indices/segments,fielddata,query_cache,request_cache,completion'

# Check if unassigned shards are blocked by the limit
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'

How to diagnose it

  1. Establish the per-node shard count. Use _cat/shards to count how many shards each node is actually hosting. Include both primary and replica shards. If any non-frozen data node is above 500, you are in the danger zone.

  2. Correlate shard count with heap usage. Pull _cat/nodes for segments.memory and heap.percent. If segments.memory is growing linearly with shard count and heap is sustained above 75%, the shards are the likely heap consumer.

  3. Identify the indices driving the count. Use _cat/indices sorted by primary shard count or by ascending document count. Look for indices with many primary shards but tiny store sizes (over-sharded), or large numbers of old time-series indices that should have been deleted.

  4. Check for low-document indices. Indices created by age-based rollover with little or no data still count toward the shard limit. Use _cat/indices sorted by docs.count to find them.

  5. Check ILM status. If ILM manages your time-series indices, run _ilm/explain to see whether indices are stuck before the delete or shrink phase. A stuck ILM policy is a common hidden driver of shard accumulation.

  6. Confirm the cluster state is not the primary heap consumer. Large cluster states from mapping explosions can also spike heap. Use _cluster/stats to check mapping field counts. If field count is under control and shards are high, overallocation is the primary cause.

  7. Check for allocation blocks. If the cluster has already hit max_shards_per_node, new indices will be rejected. Use _cluster/allocation/explain on an unassigned shard to confirm whether the limit is the blocker.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Shards per nodeDirect measure of overallocation burden>500 sustained; approaching 800 critical
segments.memory per nodeHeap consumed by segment metadataGrowing linearly with shard count; >10% of heap
JVM heap used percentOverall memory pressure on the nodeSustained >75%; old GC frequency increasing
Old GC collection timeStop-the-world pauses that trigger node removalIndividual pauses >10s; frequency increasing
Cluster state field countRules out mapping explosion as the heap driverField count growing without bound
Unassigned shard countIndicates the hard limit has been reachedNew indices rejected; unassigned primaries

Fixes

Close or delete obsolete indices

Deleting an index is the fastest way to reclaim shard quota and heap.

  • Destructive: Verify the index name and ensure it is not needed for compliance or recovery before deleting.
  • Check document count first: _cat/indices?v&h=index,pri,rep,docs.count,store.size
  • Delete: DELETE /<index>
  • For indices you must keep but do not search frequently: POST /<index>/_close. Closed indices do not count toward max_shards_per_node.

Reduce replica count on low-value indices

If you have many small indices with replicas you do not need for redundancy, lowering index.number_of_replicas removes one set of replica shards per index.

  • Tradeoff: This reduces redundancy. Do not do this on critical data without accepting the risk.
  • Example: PUT /<index>/_settings {"index.number_of_replicas": 0}

Shrink existing indices

For indices that are no longer written to, use the Shrink API to reduce the primary shard count. This creates a new index with fewer shards and copies the data.

  • Shrink requires the source index to be read-only and all primary shards on one node.
  • The new index must have a factor of the original primary shard count (for example, 6 shards can shrink to 3, 2, or 1).

Reindex into fewer, larger shards

For indices that are still active or cannot be shrunk, reindexing into a new index with a lower number_of_shards is the cleanest fix.

  • Shards are immutable after creation. You cannot resize a shard in place.
  • Create the target index with the correct shard count.
  • Use the Reindex API with "source": {"index": "<old>"} and "dest": {"index": "<new>"}.
  • After reindexing, update aliases or write targets, then delete the old index.

Forcemerge read-only indices

For time-series indices that are no longer written, forcemerge to one segment reduces segment metadata heap and improves search speed.

  • Warning: Only forcemerge indices that are fully written. Running forcemerge on a live-writing index consumes significant heap and I/O.
  • Command: POST /<index>/_forcemerge?max_num_segments=1

Tune ILM to prevent recurrence

  • Use rollover based on max_size or max_docs instead of max_age alone to avoid creating low-value indices.
  • Ensure the delete phase is configured and functioning.
  • Add a shrink action in the warm phase to consolidate shards on older indices.

Prevention

  • Target <200 shards per data node. If you are routinely above 500, you need fewer indices, fewer primary shards per index, or more data nodes.
  • Size primary shards to match data volume. Avoid creating five primary shards for an index that will never exceed a few gigabytes. One or two shards is often enough for small indices.
  • Use ILM rollover with size or doc limits. Age-based rollover alone creates wasteful indices during low-traffic periods. Size-based rollover ensures each shard reaches a reasonable threshold before rolling over.
  • Audit shard count before adding nodes. Adding nodes spreads shards but does not reduce per-shard overhead. If your indices are over-sharded, fix the index template first.
  • Monitor segments.memory and shard count as first-class metrics. They are leading indicators; heap percentage is a lagging one.

How Netdata helps

  • Per-node shard count: Chart shards per data node to spot imbalance before heap pressure appears.
  • Segment memory correlation: Plot segments.memory against JVM heap to confirm heap growth is driven by segment metadata rather than caches or aggregations.
  • Old GC pause alerting: Alert on old GC duration spikes, which are often the first visible symptom of per-shard heap overhead crossing into the danger zone.
  • Index growth tracking: Track index count growth against ILM execution to catch policies that are not cleaning up old shards.