Elasticsearch cluster health red: unassigned primaries and how to recover
Cluster health red means at least one primary shard is unassigned. Queries against affected indices return partial results or fail, and writes routed to the unassigned shards are rejected. Yellow signals only missing replicas; red signals that data is actually unavailable.
Cluster health is a lagging indicator. A red status sustained longer than two minutes after the cluster has formed is a real fault; a brief flash during startup is normal. By the time the status turns red, a node has likely departed, a disk has crossed a watermark, or a shard copy has been rejected as corrupt.
The master allocates shards based on disk watermarks (low 85%, high 90%, flood stage 95%), allocation filtering rules, awareness attributes, and the validity of existing shard copies. When a primary goes unassigned, the allocator has evaluated every candidate node and found none acceptable. Your job is to discover which constraint blocked placement, then either remove the constraint or recover the data through other means.
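As a rough illustration of the watermark thresholds above, the following sketch classifies a node's disk usage. The function name and the use of `>=` at the boundaries are assumptions made for illustration, and the percentages are the defaults quoted in the text; a cluster may override them.

```python
# Default disk watermark thresholds quoted in the text (percent of disk used).
# Clusters can override these, so treat the numbers as assumptions.
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE = 95.0


def watermark_stage(used_percent: float) -> str:
    """Return which watermark, if any, a node's disk usage has reached."""
    if used_percent >= FLOOD_STAGE:
        return "flood_stage"
    if used_percent >= HIGH_WATERMARK:
        return "high"
    if used_percent >= LOW_WATERMARK:
        return "low"
    return "ok"


for pct in (72.0, 86.5, 91.0, 96.2):
    print(pct, watermark_stage(pct))
```

A node reported as low no longer receives new shard copies, high means shards are moved away, and flood_stage is where write blocks appear.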
flowchart TD
A[Cluster health red] --> B{Sustained >2m<br/>uptime >600s}
B -- No --> C[Transient startup<br/>or restart]
B -- Yes --> D[GET /_cluster/health<br/>?level=indices]
D --> E[List unassigned<br/>primaries]
E --> F[POST /_cluster/<br/>allocation/explain]
F --> G{Root cause}
G -- NODE_LEFT --> H[Check node logs<br/>for GC or OOM]
G -- WATERMARK --> I[Free disk and clear<br/>read_only blocks]
G -- ALLOC_FAILED --> J[reroute?retry_failed]
G -- NO_VALID_COPY --> K[Restore snapshot or<br/>accept data loss]
G -- FILTER --> L[Fix allocation<br/>settings]
Common causes
| Cause | What it looks like | First check |
|---|---|---|
| Node loss (crash, OOM kill, GC pause, network partition) | Node count drops in /_cluster/health; unassigned reason shows NODE_LEFT | GET /_cat/nodes and node logs for OOM killer messages or fatal GC errors |
| Disk watermark exceeded | Shards refuse to allocate; indices become read-only at flood stage | GET /_cat/allocation?v for disk usage percent on every data node |
| Corrupt shard copy or max retries exceeded | Unassigned reason ALLOCATION_FAILED; shard repeatedly fails to initialize | POST /_cluster/allocation/explain for can_allocate and failure details |
| Allocation filtering or awareness misconfiguration | Shards stay unassigned despite healthy nodes and adequate disk space | POST /_cluster/allocation/explain for the blocking decider; GET /_cluster/settings |
| Insufficient nodes to host all copies | All nodes that held valid copies have left; remaining nodes cannot satisfy replication rules | POST /_cluster/allocation/explain per shard for no_valid_shard_copy |
Quick checks
Run these read-only commands to scope the incident.
# Check cluster health and identify red indices
curl -s 'http://localhost:9200/_cluster/health?level=indices&filter_path=status,unassigned_shards,indices.*.status'
# List unassigned shards and their reasons
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep UNASSIGNED
# Explain why the first unassigned shard is blocked
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# Check node membership and basic health
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m,disk.used_percent'
# Check per-node disk usage against watermarks
curl -s 'http://localhost:9200/_cat/allocation?v'
# Verify allocation has not been disabled globally
curl -s 'http://localhost:9200/_cluster/settings?flat_settings=true&filter_path=persistent.cluster.routing.allocation.enable,transient.cluster.routing.allocation.enable'
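If you prefer to post-process the shard listing in a script rather than with grep, a minimal sketch might look like the following. It assumes the `_cat/shards` call was made with `format=json` and the same `h=` columns as above, and that the JSON keys mirror those column names; the function name and the sample rows are hypothetical.

```python
import json


def unassigned_primaries(cat_shards_json: str) -> list:
    """Return rows for unassigned primary shards from _cat/shards JSON output."""
    rows = json.loads(cat_shards_json)
    return [
        row for row in rows
        if row.get("state") == "UNASSIGNED" and row.get("prirep") == "p"
    ]


# Hypothetical sample of what the API might return with format=json.
sample = json.dumps([
    {"index": "logs-1", "shard": "0", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-1", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2", "shard": "1", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
])

for row in unassigned_primaries(sample):
    print(row["index"], row["shard"], row["unassigned.reason"])
```

Filtering on primaries first keeps attention on the shards that make the cluster red; unassigned replicas only make it yellow.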
How to diagnose it
- Confirm the state is sustained. If nodes have been running for less than 600 seconds, wait briefly. Initial cluster formation and shard discovery can transiently show red.
- Identify affected indices. Use GET /_cluster/health?level=indices. Note which indices report red; these own the unassigned primaries.
- List unassigned primaries. Query /_cat/shards and filter to UNASSIGNED. Focus on rows where prirep is p. Note the unassigned.reason value.
- Get the allocator’s exact reasoning. Call POST /_cluster/allocation/explain. The response contains can_allocate (no, throttled, no_valid_shard_copy, allocation_delayed) and a per-node breakdown of why each node rejected the shard.
- Correlate with node and disk health. A NODE_LEFT reason paired with a lower node count points to a departed node. If nodes are present but disk usage is high, watermark deciders are blocking placement.
- Inspect logs. On departed or target nodes, check Elasticsearch logs for OutOfMemoryError, long GC pauses exceeding the fault detection timeout, or disk I/O errors that caused the allocator to reject a shard copy.
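The steps above can be sketched as a small triage function over a parsed allocation explain response. This is an illustration, not an official mapping: the function name and returned strings are hypothetical, and the decider names it checks (`max_retry`, `disk_threshold`, `filter`, `awareness`) are the ones the explain output is assumed to use, so verify them against your own cluster's response.

```python
def next_step(explain: dict) -> str:
    """Map an allocation explain response onto a branch of the triage flow."""
    if explain.get("can_allocate") == "no_valid_shard_copy":
        return "restore snapshot or accept data loss"

    # Collect the names of deciders that returned NO on any node.
    blocking = {
        decider.get("decider")
        for node in explain.get("node_allocation_decisions", [])
        for decider in node.get("deciders", [])
        if decider.get("decision") == "NO"
    }
    if "max_retry" in blocking:
        return "reroute with retry_failed"
    if "disk_threshold" in blocking:
        return "free disk space"
    if blocking & {"filter", "awareness"}:
        return "fix allocation settings"

    # No decider stood out, so fall back to the recorded unassigned reason.
    reason = explain.get("unassigned_info", {}).get("reason")
    if reason == "NODE_LEFT":
        return "check node logs"
    return "inspect explain output manually"


# Hypothetical, heavily trimmed explain response.
example = {
    "can_allocate": "no",
    "unassigned_info": {"reason": "NODE_LEFT"},
    "node_allocation_decisions": [
        {"node_name": "data-1",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
    ],
}
print(next_step(example))
```

The order of the checks is a design choice: the sketch reports the most specific blocker it finds and only falls back to the unassigned reason when no decider explains the refusal.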
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Cluster health status | Red means at least one primary is unavailable | red sustained longer than 2 minutes after JVM uptime exceeds 600 seconds |
| Unassigned shard count | Quantifies scope; reason points to root cause | Any unassigned primary sustained beyond the startup window |
| Node count | A drop means a node left and triggered reallocation | Unexpected decrease in number_of_data_nodes |
| Disk usage and watermarks | Disk above 85% blocks new allocation; 95% blocks writes | Disk usage above 85% on any data node |
| JVM heap and GC activity | Long stop-the-world GC causes node removal via fault detection | Heap usage above 85% with increasing old GC duration |
| Master stability | Master instability stalls all allocation decisions | Pending cluster tasks growing or master identity changing |
| Pending cluster tasks | Backlogged tasks delay shard allocation decisions | More than 20 pending tasks or any task older than 30 seconds |
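Two of the warning signs in the table reduce to simple threshold checks. The sketch below shows one way to express them; the function and parameter names are hypothetical, and which node's JVM uptime you feed in (for example, the youngest node) is left as an assumption for the reader to decide.

```python
RED_GRACE_SECONDS = 120        # "sustained longer than 2 minutes"
STARTUP_WINDOW_SECONDS = 600   # ignore red while JVM uptime is below this
MAX_PENDING_TASKS = 20
MAX_TASK_AGE_SECONDS = 30


def red_is_actionable(status: str, red_for_seconds: float,
                      jvm_uptime_seconds: float) -> bool:
    """True when red has outlasted both the grace period and the startup window."""
    return (
        status == "red"
        and red_for_seconds > RED_GRACE_SECONDS
        and jvm_uptime_seconds > STARTUP_WINDOW_SECONDS
    )


def pending_tasks_warning(task_ages_seconds: list) -> bool:
    """True when the pending task queue is too long or holds a stale task."""
    return (
        len(task_ages_seconds) > MAX_PENDING_TASKS
        or any(age > MAX_TASK_AGE_SECONDS for age in task_ages_seconds)
    )


print(red_is_actionable("red", red_for_seconds=45, jvm_uptime_seconds=3600))
print(red_is_actionable("red", red_for_seconds=300, jvm_uptime_seconds=3600))
print(pending_tasks_warning([2, 5, 41]))
```

Gating on uptime as well as duration is what filters out the brief red flash during cluster formation.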
Fixes
Transient node restart or rolling maintenance
If a node restarted, Elasticsearch delays reallocating its shards by index.unassigned.node_left.delayed_timeout (default 1 minute) to give the node time to rejoin. During rolling restarts, set cluster.routing.allocation.enable: none before stopping nodes to prevent a rebalancing storm, and restore it to all (the default) once the nodes have rejoined; allocation left disabled keeps shards unassigned. If the delay has passed and primaries remain unassigned, move to the specific cause below.
Disk watermark and flood stage
If /_cat/allocation shows nodes above the high watermark (90%) or flood stage (95%), free disk space immediately. Delete old indices, force-merge read-only indices to reclaim space, or remove unneeded snapshots. When flood stage is reached, affected indices are automatically set to index.blocks.read_only_allow_delete. Elasticsearch 7.4 and later release the block automatically once disk usage falls below the high watermark; on earlier versions, or if the block persists, clear it manually after freeing disk space:
# Clear flood-stage read-only block after freeing disk space
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'
Do not clear the block before freeing space, or writes will immediately re-trigger it.
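One way to guard against clearing the block too early is to check per-node disk usage first. The sketch below assumes the rows come from `_cat/allocation` requested with `format=json`, where each row is assumed to carry `node` and `disk.percent` keys with string values; the function names and sample rows are hypothetical.

```python
def nodes_at_or_above(rows: list, threshold_percent: float) -> list:
    """Return names of nodes whose disk usage is at or above the threshold."""
    names = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            # Rows without disk figures (such as a summary row for
            # unassigned shards) are skipped.
            continue
        if float(percent) >= threshold_percent:
            names.append(row["node"])
    return names


def safe_to_clear_block(rows: list, flood_stage_percent: float = 95.0) -> bool:
    """True when no node is still at or above the flood stage threshold."""
    return not nodes_at_or_above(rows, flood_stage_percent)


# Hypothetical sample rows.
rows = [
    {"node": "data-1", "disk.percent": "96"},
    {"node": "data-2", "disk.percent": "71"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_at_or_above(rows, 95.0))
print(safe_to_clear_block(rows))
```

The default of 95 matches the flood stage figure in the text; pass a lower threshold, such as the high watermark, if you want more headroom before clearing the block.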
Max retries exceeded or corrupt shard
Shards with reason ALLOCATION_FAILED have exhausted automatic retries (index.allocation.max_retries, default 5). Once the underlying failure is fixed, trigger a new allocation attempt:
# Retry shards that failed automatic allocation
curl -X POST 'http://localhost:9200/_cluster/reroute?retry_failed'
If the shard copy is corrupt and no valid copy exists on another node, choose between restoring from snapshot or accepting data loss with a manual override.
No valid shard copy
When allocation/explain returns no_valid_shard_copy, the cluster has no intact primary. If you have a recent snapshot, restore the index. Snapshot restore is always preferable to forcing a partial allocation.
If no snapshot exists and you must recover the index, you can allocate a stale copy or an empty primary. Both require explicitly accepting data loss. The following example forces allocation of a stale primary to a node that holds an older copy:
# Force allocate a stale primary - DESTRUCTIVE, may lose data
curl -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
"commands": [
{
"allocate_stale_primary": {
"index": "my-index",
"shard": 0,
"node": "target-node-name",
"accept_data_loss": true
}
}
]
}'
If no stale copy exists anywhere, allocate_empty_primary creates a new empty shard on the named node. These commands are destructive. They may result in partial or total data loss for that shard and should only be used when snapshot restore is impossible.
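Because both commands are destructive, a script that issues them benefits from an explicit guard. The following sketch only builds the request body for `POST /_cluster/reroute` and refuses to do so unless the caller opts in; the helper name `reroute_body` is hypothetical, and the index and node names are the placeholders from the example above.

```python
import json

ALLOWED_COMMANDS = ("allocate_stale_primary", "allocate_empty_primary")


def reroute_body(command: str, index: str, shard: int, node: str,
                 accept_data_loss: bool = False) -> dict:
    """Build a reroute request body for a forced primary allocation."""
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"unsupported command: {command}")
    if not accept_data_loss:
        # Force the caller to acknowledge the risk in code, not by accident.
        raise ValueError("destructive command requires accept_data_loss=True")
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


body = reroute_body("allocate_stale_primary", "my-index", 0,
                    "target-node-name", accept_data_loss=True)
print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to print and review the exact command before anything is sent to the cluster.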
Allocation filtering or awareness misconfiguration
If the allocation explain output names a filter or awareness decider, review cluster.routing.allocation.* and index.routing.allocation.* settings. Correct the attribute mismatch or remove the errant filter, then allow the allocator to retry.
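To review those settings quickly, a script can pick the filter and awareness keys out of a flat settings map, such as the `persistent` or `transient` object returned with `flat_settings=true`. This is a minimal sketch; the function name is hypothetical and the key pattern covers only the include, exclude, require, and awareness families.

```python
import re

# Matches cluster-level and index-level allocation filter and awareness keys.
FILTER_KEY = re.compile(
    r"^(cluster|index)\.routing\.allocation\."
    r"(include|exclude|require|awareness)\."
)


def allocation_filters(flat_settings: dict) -> dict:
    """Return only the settings that can restrict where shards are placed."""
    return {
        key: value
        for key, value in flat_settings.items()
        if FILTER_KEY.match(key)
    }


# Hypothetical flat settings for illustration.
settings = {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.exclude._ip": "10.0.0.5",
    "cluster.routing.allocation.awareness.attributes": "zone",
    "index.routing.allocation.require.box_type": "hot",
    "index.number_of_replicas": "1",
}
for key, value in sorted(allocation_filters(settings).items()):
    print(key, "=", value)
```

Anything the function returns is a candidate for the attribute mismatch; compare the values against the node attributes actually present in the cluster.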
Prevention
- Monitor per-node disk usage and project time-to-watermark; keep routine usage below 70% to absorb merge spikes.
- Use ILM to roll over and delete time-series indices before disks fill.
- Maintain tested snapshots; a successful snapshot does not guarantee a successful restore.
- Deploy dedicated master nodes to avoid master instability causing allocation stalls.
- Monitor JVM heap floor and old GC frequency to predict node loss from GC pressure before fault detection removes the node.
- Keep Elasticsearch versions uniform across the cluster; version mismatch after upgrades can block shard allocation.
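Projecting time-to-watermark, as suggested in the first item above, can be as simple as a linear extrapolation from two disk usage samples. The sketch below assumes steady growth, which real clusters rarely have, so treat the result as a rough estimate; the function and parameter names are hypothetical.

```python
from typing import Optional


def hours_until(threshold_percent: float, earlier_percent: float,
                current_percent: float, hours_between: float) -> Optional[float]:
    """Linearly project hours until disk usage reaches the threshold.

    Returns 0.0 if the threshold is already reached and None if usage
    is flat or shrinking.
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    if current_percent >= threshold_percent:
        return 0.0
    growth_per_hour = (current_percent - earlier_percent) / hours_between
    if growth_per_hour <= 0:
        return None
    return (threshold_percent - current_percent) / growth_per_hour


# Usage grew from 60% to 70% over the last 24 hours; project to the 85% low watermark.
estimate = hours_until(85.0, earlier_percent=60.0, current_percent=70.0,
                       hours_between=24.0)
print(f"about {estimate:.0f} hours to the low watermark")
```

Running the projection against each data node separately matters, because a single node crossing a watermark is enough to block allocation.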
How Netdata helps
Netdata collects Elasticsearch metrics that correlate red health with its leading indicators:
- Cluster health status and uptime: Alert on red sustained longer than 2 minutes when JVM uptime exceeds 600 seconds, filtering out startup noise.
- Per-node disk usage: Correlate unassigned shards with nodes crossing the 85%, 90%, or 95% disk watermarks before flood stage blocks writes.
- JVM heap usage and GC latency: Rising old-generation GC duration predicts node departures that trigger unassigned primaries.
- Node count and unassigned shard count: Surface unexpected node drops and quantify how many primaries are affected.
- Thread pool rejections and circuit breaker trips: Identify heap pressure and saturation that precede node removal and cascading failures.
Related guides
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor
- Elasticsearch monitoring checklist: the signals every production cluster needs
- Elasticsearch monitoring maturity model: from survival to expert
- Elasticsearch long GC pauses: old-generation stop-the-world and node drops
- How Elasticsearch actually works in production: a mental model for operators







