Elasticsearch this action would add too many shards: max_shards_per_node limit

Creating an index or rolling over a data stream returns HTTP 400 with a validation_exception: “this action would add [N] shards, but this cluster currently has [X]/[Y] maximum normal shards open”. The cluster has reached its shard limit, which is cluster.max_shards_per_node (default 1000) multiplied by the number of non-frozen data nodes. Raising the limit via _cluster/settings unblocks index creation but only postpones the outage. The durable fix is consolidation.

What this means

Every shard is a Lucene index. Each consumes file descriptors, heap for segment metadata, and cluster state entries that the master publishes to every node on every change. cluster.max_shards_per_node guards against over-sharding, where excessive shard counts slow cluster state updates and pressure master and data node heap. When the limit is reached, Elasticsearch rejects any request that would add open shards. Existing indices remain searchable and continue to accept writes, but index creation, rollovers, snapshot restores, and reopening closed indices are blocked.
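The check behind the error message is simple arithmetic, and a small sketch can make it concrete. The function names below are illustrative helpers, not part of any Elasticsearch tooling, and the numbers in the example are made up. The sketch assumes the limit is evaluated against the cluster-wide total: cluster.max_shards_per_node multiplied by the number of non-frozen data nodes.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names) for the shard limit arithmetic.

# shard_limit <max_shards_per_node> <non_frozen_data_nodes>
# Prints the cluster-wide ceiling, the [Y] in the error message.
shard_limit() {
  echo $(( $1 * $2 ))
}

# would_exceed <open_shards> <shards_to_add> <max_shards_per_node> <data_nodes>
# Exit status 0 when the request would push the total past the ceiling.
would_exceed() {
  ceiling=$(shard_limit "$3" "$4")
  [ $(( $1 + $2 )) -gt "$ceiling" ]
}

# Example: 3 data nodes at the default of 1000, 2998 shards already open,
# and a new index with 2 primaries and 1 replica (4 shards in total).
if would_exceed 2998 4 1000 3; then
  echo "rejected: 2998 + 4 > $(shard_limit 1000 3)"
else
  echo "accepted"
fi
```

Replicas count toward the total, so an index with 2 primaries and 1 replica adds 4 shards, not 2.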

flowchart TD
    A[Index creation fails with max_shards_per_node] --> B[Check active shards per data node]
    B --> C{Nearing 1000 per node?}
    C -->|Yes| D[Identify largest index consumers via _cat/indices]
    C -->|No| E[Check _cluster/allocation/explain for other blocks]
    D --> F{Are indices old or empty?}
    F -->|Yes| G[Delete or close abandoned indices]
    F -->|No| H{Can they be made read-only?}
    H -->|Yes| I[Shrink to fewer primary shards]
    H -->|No| J[Reindex into fewer shards or reduce replicas]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Time-series indices accumulating without deletion | Shard count grows linearly; old daily or weekly indices remain open | GET /_cat/indices?v&h=index,pri,rep,store.size&s=index:desc for old, small indices |
| Index templates defaulting to too many primary shards | Every new index creates multiple primaries regardless of data volume | Check the active template for the index pattern and its default number_of_shards |
| Excess replica counts for the current node count | Replicas multiply total shards without adding usable redundancy on small clusters | GET /_cluster/health?filter_path=active_shards,active_primary_shards |
| Abandoned empty or tiny indices | Many indices with near-zero documents still consuming shard slots | GET /_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc |

Quick checks

Run these in sequence to triage:

# Cluster health and total shard counts
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,number_of_data_nodes,active_shards,active_primary_shards,unassigned_shards'

# Shards and their states
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state'

# Indices sorted by name, descending (header plus the first 19 rows), with shard counts and sizes
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.store.size&s=index:desc' | head -20

# Per-node shard allocation and disk usage
curl -s 'http://localhost:9200/_cat/allocation?v'

# Segment memory per node
curl -s 'http://localhost:9200/_nodes/stats/segments?filter_path=nodes.*.name,nodes.*.segments.memory_in_bytes'

# Master task backlog
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# Estimate cluster state complexity
curl -s 'http://localhost:9200/_cluster/stats?filter_path=indices.mappings.total_field_count'

How to diagnose it

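  1. Confirm the breach. Run _cluster/health and compare the total of open shards (active plus unassigned) with cluster.max_shards_per_node multiplied by number_of_data_nodes; the [X]/[Y] pair in the error message reports the same two numbers. If the average is near 1000 per data node, the limit is the binding constraint. The limit applies to the cluster-wide total rather than to each node, so skew in _cat/allocation does not trigger it on its own, but check for it anyway because concentrated shards concentrate heap and disk pressure.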
  2. Find the fastest wins. Use _cat/indices sorted by age or size. Look for time-series prefixes with many small indices. Indices older than your retention requirement that are still open are immediate deletion candidates.
  3. Check for stuck ILM policies. If you use ILM, an index that should have been deleted or shrunk may be stalled. Run GET /<index>/_ilm/explain on the oldest managed indices. Look for an ERROR step or a stuck shrink action. Resolve the blocker, then call POST /<index>/_ilm/retry.
  4. Validate index template defaults. An outdated template may set a high number_of_shards for every new data stream. Review the template matching the failing index pattern and ensure the primary count aligns with actual data volume. Use GET /_index_template/<name> for composable templates or GET /_template/<name> for legacy templates.
  5. Assess heap and cluster state impact. Check _nodes/stats/segments for memory per node. If segment memory rises with shard count, over-sharding is already pressuring heap. Check _cluster/stats for total field count; if it is also elevated, the cluster state is bloated.
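  6. Check allocation explain for blocked shards. If some shards are unassigned, run _cluster/allocation/explain to find the reason. The cluster-wide shard limit rejects requests up front rather than leaving shards unassigned, so an unassigned shard points to a separate problem, such as a disk watermark, that is compounding the situation.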

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Active shards per data node | Directly approaches the hard limit | Sustained >800 per node against a 1000 default |
| Cluster state field count | Many shards usually mean many indices and mappings, which bloat state | indices.mappings.total_field_count growing without bound |
| Segment memory per node | Each open shard carries segment metadata overhead in heap | segments.memory_in_bytes rising in lockstep with shard count |
| Pending cluster tasks | A large cluster state slows the master’s ability to publish changes | >20 tasks or any task older than 30 seconds |
| JVM heap used percent | Shard metadata accumulates in heap; pressure leads to GC death spiral | Sustained >75% with a rising post-GC floor |
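The first signal in the table can be turned into a simple classification. This is a minimal sketch, assuming you already have the open shard total and data node count (for example from _cluster/health). The function name, the 80% default (mirroring the >800 of 1000 warning sign), and the output format are illustrative choices, not Netdata or Elasticsearch behavior.

```shell
#!/bin/sh
# shard_pressure <open_shards> <data_nodes> [max_shards_per_node] [warn_percent]
# Hypothetical helper: prints "ok", "warning", or "critical" plus the usage.
shard_pressure() {
  open=$1
  nodes=$2
  per_node=${3:-1000}   # default of cluster.max_shards_per_node
  warn=${4:-80}         # assumed warning threshold, in percent
  if [ "$nodes" -le 0 ]; then
    echo "error: need at least one data node" >&2
    return 1
  fi
  limit=$(( nodes * per_node ))
  pct=$(( open * 100 / limit ))   # integer percent, rounded down
  if [ "$open" -ge "$limit" ]; then
    state=critical
  elif [ $(( open * 100 )) -gt $(( warn * limit )) ]; then
    state=warning
  else
    state=ok
  fi
  echo "$state ${pct}% (${open}/${limit})"
}

shard_pressure 2500 3   # warning 83% (2500/3000)
```

The comparison multiplies before comparing so that integer rounding of the percentage cannot hide a cluster sitting just above the threshold.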

Fixes

Delete or close abandoned indices

The fastest recovery is removing data you no longer need. Identify old, empty, or superseded indices via _cat/indices and delete them. Deletion frees shard slots, disk, and cluster state immediately. Closing an index removes its active shards from the cluster, though the metadata remains in state. Prefer deletion for true orphans.

WARNING: DELETE /<index> is destructive and cannot be undone without a snapshot. Verify the index name and retention policy before executing.

Reduce replica counts

If indices carry more replicas than needed for the current node count, lower number_of_replicas. Each replica removed drops one full copy of every primary: going from 2 replicas to 1 cuts an index's shard count by a third, and going from 1 to 0 halves it. The tradeoff is reduced redundancy and potentially slower reads. Do not reduce replicas on critical indices during an active node outage.

curl -X PUT 'http://localhost:9200/<index>/_settings' -H 'Content-Type: application/json' -d '{
  "index": { "number_of_replicas": 1 }
}'
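To estimate how many shard slots a replica change frees before applying it, the arithmetic can be sketched as below. An index holds primaries × (1 + replicas) shards. The function names are hypothetical and the example values are made up.

```shell
#!/bin/sh
# total_shards <primaries> <replicas>
# Shards an index occupies: every primary plus one copy per replica.
total_shards() {
  echo $(( $1 * (1 + $2) ))
}

# shards_freed <primaries> <old_replicas> <new_replicas>
# Slots released by lowering number_of_replicas on one index.
shards_freed() {
  before=$(total_shards "$1" "$2")
  after=$(total_shards "$1" "$3")
  echo $(( before - after ))
}

# Example: 5 primaries going from 2 replicas to 1 (15 shards down to 10).
shards_freed 5 2 1   # 5
```

Summing this over the indices you plan to change gives the headroom recovered without deleting any data.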

Shrink read-only indices

For indices that are no longer written, use the Shrink API to reduce primary shard count. You must first set index.blocks.write=true and relocate a copy of every shard to a single node. The target shard count must be a factor of the original; 12 primaries can shrink to 6, 4, 3, 2, or 1. The tradeoff is temporary disk space for the new index and a brief maintenance window. After shrinking, delete the source index to reclaim slots.
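The factor rule is easy to get wrong for less round numbers, so a small helper can list the valid targets for a given primary count. This is an illustrative sketch with a hypothetical function name; it only does the arithmetic and does not call the Shrink API.

```shell
#!/bin/sh
# shrink_targets <primary_count>
# Prints every smaller primary count that divides the original evenly,
# largest first, separated by spaces. Prints an empty line for 1 primary.
shrink_targets() {
  n=$1
  out=""
  i=$(( n - 1 ))
  while [ "$i" -ge 1 ]; do
    if [ $(( n % i )) -eq 0 ]; then
      out="${out:+$out }$i"
    fi
    i=$(( i - 1 ))
  done
  echo "$out"
}

shrink_targets 12   # 6 4 3 2 1
```

An index with a prime number of primaries, such as 7, can only shrink to 1.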

Reindex active indices into fewer shards

For indices still receiving writes, create a new index with fewer primary shards and use the Reindex API to copy data. Set refresh_interval=-1 and number_of_replicas=0 on the destination during the copy to reduce overhead, then restore them after cutover. Switch aliases or data streams to the new index once caught up. The tradeoff is duplicated disk usage and additional I/O. Run this during low-traffic hours.

Fix ILM retention

If ILM is supposed to manage index lifecycle but has stalled, the root cause is often a missing rollover alias, insufficient disk space for a shrink step, or a policy conflict. Resolve the specific error, then call POST /<index>/_ilm/retry on stuck indices. Long-term, ensure your ILM delete phase aligns with actual retention needs.

Temporarily raise the limit (emergency only)

You can raise cluster.max_shards_per_node dynamically to unblock index creation while you consolidate. Treat this as a stopgap, not a fix. The cluster will continue to degrade from metadata overhead, and you will hit the new ceiling again. Raise it only to buy minutes, not days.

WARNING: This masks the root cause. Use only to prevent a complete write outage while you delete or shrink indices.

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": { "cluster.max_shards_per_node": 2000 }
}'

Prevention

Monitor shards per node as a first-class capacity metric alongside disk and heap. Review index templates quarterly to ensure default primary shard counts match expected data volume. Consolidate time-series data into fewer, larger indices instead of many small ones. Let ILM delete or shrink indices on schedule, and alert when ILM transitions fail. On clusters with dedicated master nodes, watch pending tasks and cluster state size; they are early indicators that shard accumulation is becoming a coordination problem.
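Treating shards as a capacity metric means projecting when the ceiling will be reached. The sketch below assumes steady time-series growth with no deletions, which is the worst case the text warns about. The function name and the example figures are hypothetical.

```shell
#!/bin/sh
# days_until_limit <open_shards> <shard_limit> <primaries_per_index> <replicas> <new_indices_per_day>
# Whole days of index creation that still fit under the limit,
# assuming nothing is deleted, closed, or shrunk in the meantime.
days_until_limit() {
  headroom=$(( $2 - $1 ))
  per_day=$(( $3 * (1 + $4) * $5 ))
  if [ "$per_day" -le 0 ]; then
    echo "n/a"          # no growth, so no projected date
    return 0
  fi
  if [ "$headroom" -le 0 ]; then
    echo 0              # already at or over the limit
    return 0
  fi
  echo $(( headroom / per_day ))
}

# Example: 2400 of 3000 shards open, 3 daily indices with 5 primaries and 1 replica.
days_until_limit 2400 3000 5 1 3   # 20
```

An alert on this number falling below your ILM retention window is one way to catch a stalled delete phase before index creation starts failing.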

How Netdata helps

  • Correlate the shard limit breach with cluster health state, JVM heap, and segment memory. Netdata collects these out of the box.
  • Per-node charts for segment memory and heap percent show which nodes are paying metadata overhead before the allocator blocks.
  • Alerts on pending task backlog and thread pool rejections fire while the cluster is still functional, giving you runway to delete or shrink indices instead of reacting to HTTP 400 errors.
  • Per-node disk and shard counts expose uneven distribution that concentrates shards and accelerates the limit hit.