
Elasticsearch Yellow Cluster Status: Unassigned Shards, Split-Brain, and Recovery Strategies

A guide to resolving the common Elasticsearch yellow state by investigating shard allocation, understanding master election, and preventing data divergence


You run a health check on your production cluster and the result comes back: status: yellow. It’s not the dreaded red status, so your application is likely still serving requests, but this is a critical warning sign. An Elasticsearch yellow cluster status is a direct indication that your data’s high availability is compromised. While all your primary shards are active, one or more replica shards have failed to be assigned to a node. Ignoring this warning can lead to data loss if another node fails, or worse, it could be a symptom of a network partition risking an Elasticsearch split-brain.

Understanding how to diagnose the cause of unassigned shards is a fundamental skill for anyone managing an Elasticsearch deployment. This guide will walk you through the process of investigating a yellow cluster, understanding the risks of split-brain, and implementing recovery and prevention strategies to ensure your cluster remains healthy and resilient.

Understanding Cluster Health: Green, Yellow, and Red

Elasticsearch uses a simple color-coded system to represent the health of your cluster. The _cluster/health API is used to retrieve this status.

The status field will be one of the following:

  • Green: All primary and replica shards are allocated and active. Your cluster is fully healthy and operational.
  • Yellow: All primary shards are allocated, but at least one replica shard is not. Your data is fully available, and search and indexing operations will function correctly. However, your high availability is at risk. If a node holding a primary shard fails, you could lose data because no replica is available to take its place.
  • Red: At least one primary shard is unassigned. This is a critical state. The cluster is missing data, and searches hitting that shard will fail. Indexing new documents into the missing shard is not possible.

A yellow status is your cue to investigate immediately. It’s the cluster telling you, “I’m working, but I have no safety net.”
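
To see this for yourself, the _cluster/health API returns the status along with shard-level counters. The following is a minimal sketch, assuming Python with the requests library and an unsecured cluster at http://localhost:9200 (adjust the URL, authentication, and TLS settings for your environment):

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust host, port, and auth for your cluster

    # GET _cluster/health returns the overall status plus shard-level counters.
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()

    print(f"status:                    {health['status']}")
    print(f"number_of_nodes:           {health['number_of_nodes']}")
    print(f"active_primary_shards:     {health['active_primary_shards']}")
    print(f"active_shards:             {health['active_shards']}")
    print(f"unassigned_shards:         {health['unassigned_shards']}")
    print(f"delayed_unassigned_shards: {health['delayed_unassigned_shards']}")

    if health["status"] == "yellow":
        print("All primaries are active, but some replicas are unassigned. Investigate.")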

The Root Cause of a Yellow Cluster: Unassigned Shards

The direct cause of a yellow status is always one or more unassigned shards. When a node leaves the cluster—whether due to a planned restart, a crash, or a network failure—the replica shards that were hosted on that node become unassigned. Elasticsearch will then attempt to re-allocate these replicas to other available nodes in the cluster. If it cannot, the cluster status remains yellow.

Why Do Shards Become Unassigned?

Several factors can prevent Elasticsearch from assigning a replica shard:

  1. Node Failure or Departure: The most common reason. A node goes offline, and there isn’t another suitable node to host its replica shards.
  2. Insufficient Nodes: You have configured more replicas than you have data nodes. For example, if you have an index with 1 primary and 2 replica shards (a total of 3 copies) but only 2 data nodes in your cluster, one replica can never be assigned, as Elasticsearch will not place a replica on the same node as its primary.
  3. Disk Space Issues: The remaining nodes may not have enough disk space. Elasticsearch has built-in disk watermarks. If a node’s disk usage exceeds the cluster.routing.allocation.disk.watermark.low threshold, it will not be assigned new shards.
  4. Shard Allocation Awareness: You may have configured rules (e.g., to spread shards across different availability zones or racks) that cannot be satisfied with the current set of available nodes.
  5. Delayed Allocation: By default, Elasticsearch waits one minute (index.unassigned.node_left.delayed_timeout) before re-allocating the replica shards of a node that has left the cluster. This prevents a massive re-shuffling of data during a brief node restart.
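
For cases 2 and 5 above, the remedy is usually a settings change rather than a new node. The sketch below uses the same Python/requests assumptions as earlier and a hypothetical index name my-index; it shows how you might lower the replica count to match your data-node count, and how you might lengthen the delayed-allocation window ahead of a planned restart:

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster
    INDEX = "my-index"                # hypothetical index name

    # If more replicas are configured than the data nodes can host,
    # reduce number_of_replicas so every remaining copy can be placed.
    requests.put(
        f"{ES_URL}/{INDEX}/_settings",
        json={"index": {"number_of_replicas": 1}},
        timeout=10,
    ).raise_for_status()

    # Before a planned node restart, lengthen the delayed-allocation window
    # so Elasticsearch does not start copying shards around immediately.
    requests.put(
        f"{ES_URL}/{INDEX}/_settings",
        json={"index": {"unassigned.node_left.delayed_timeout": "5m"}},
        timeout=10,
    ).raise_for_status()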

Diagnosing Unassigned Shards

To find out exactly why a shard is unassigned, Elasticsearch provides two powerful APIs.

First, use the _cat/shards API to list every shard with its state, filtering for any that are UNASSIGNED. This gives you a quick summary of which shards are unassigned and a brief reason code. For a much more detailed explanation, use the Cluster Allocation Explain API.

This API provides a rich, human-readable JSON output detailing the exact reason a shard allocation failed. It will check each node in the cluster and explain why it was or was not a viable candidate for hosting the shard, mentioning things like disk watermarks, version incompatibilities, or allocation awareness rules.
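
As a rough sketch of that workflow, again assuming Python with requests against a local cluster, you might list unassigned shards via _cat/shards and then ask the allocation explain API why the first of them cannot be placed:

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

    # List shards with their state and the reason they became unassigned.
    shards = requests.get(
        f"{ES_URL}/_cat/shards",
        params={"h": "index,shard,prirep,state,unassigned.reason", "format": "json"},
        timeout=10,
    ).json()
    unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]
    for shard in unassigned:
        print(shard)

    # Ask the Cluster Allocation Explain API for a detailed, per-node verdict
    # on the first unassigned shard found above.
    if unassigned:
        target = unassigned[0]
        explain = requests.post(
            f"{ES_URL}/_cluster/allocation/explain",
            json={
                "index": target["index"],
                "shard": int(target["shard"]),
                "primary": target["prirep"] == "p",
            },
            timeout=10,
        ).json()
        print(explain.get("unassigned_info", {}).get("reason"))
        print(explain.get("allocate_explanation"))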

The Lurking Danger: Elasticsearch Split-Brain

While an unassigned shard is the immediate problem, it can be a symptom of a much more dangerous condition: a split-brain. This occurs when a network partition splits your cluster into two or more isolated groups, and each group, unable to communicate with the other, elects its own master node.

You now have two independent clusters accepting writes to what each believes is the “correct” version of the data. When the network partition heals and the two sides attempt to merge, there is no easy way to reconcile the divergent data: one master must step down, and the writes accepted by its side during the partition are lost.

Quorum-Based Decision Making to Prevent Split-Brain

Modern Elasticsearch versions (7.0+) have robust mechanisms to prevent split-brain through a process called quorum-based decision making. The cluster requires a majority of master-eligible nodes (a quorum) to be present to elect a master or make any changes to the cluster state.

The quorum is a strict majority of the master-eligible nodes: floor(N / 2) + 1, where N is the number of master-eligible nodes in the cluster’s voting configuration. Elasticsearch manages this voting configuration automatically; the cluster.initial_master_nodes setting is only used to bootstrap the very first master election when the cluster forms.

This is why best practice is to always have an odd number of master-eligible nodes, typically three.

  • With 3 master-eligible nodes, the quorum is floor(3 / 2) + 1 = 2. The cluster can tolerate the failure of one master-eligible node and still form a quorum to elect a new master. If a network partition splits the cluster into a group of 1 and a group of 2, only the group of 2 can elect a master, preventing a split-brain.
  • With 2 master-eligible nodes, the quorum is floor(2 / 2) + 1 = 2. This setup has no fault tolerance. If one node fails, no quorum can be formed, and the cluster becomes unavailable.

In older versions, this was managed by the discovery.zen.minimum_master_nodes setting, which had to be manually configured. Modern Elasticsearch handles this much more safely, provided the cluster is bootstrapped correctly.
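
If you want to sanity-check the coordination layer yourself, a small sketch like the one below (same Python/requests assumptions) lists each node’s roles and marks the elected master; the quorum arithmetic at the end simply restates the majority rule described above:

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

    # List nodes with their roles; the "master" column marks the elected master with "*".
    nodes = requests.get(
        f"{ES_URL}/_cat/nodes",
        params={"h": "name,node.role,master", "format": "json"},
        timeout=10,
    ).json()

    for node in nodes:
        elected = " (elected master)" if node["master"] == "*" else ""
        print(f"{node['name']}: roles={node['node.role']}{elected}")

    # Majority rule: a strict majority of master-eligible nodes must agree.
    eligible = [node for node in nodes if "m" in node["node.role"]]
    print(f"master-eligible nodes: {len(eligible)}, quorum: {len(eligible) // 2 + 1}")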

Recovery and Prevention Strategies

Once you’ve diagnosed why your shards are unassigned, you can take action.

1. Address the Root Cause

First, fix the underlying issue identified by the Allocation Explain API.

  • If a node is down, bring it back online.
  • If disk space is the problem, free up space or add a new node (the sketch after this list shows one way to check per-node disk usage).
  • If an allocation awareness rule is the cause, you may need to adjust your cluster topology or the rule itself.
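
For the disk-space case in particular, it helps to see per-node usage the way the allocator sees it before deciding whether to delete data, grow volumes, or add a node. A minimal sketch using the _cat/allocation API, under the same assumptions as before:

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

    # Per-node shard counts and disk usage, as reported by _cat/allocation.
    allocation = requests.get(
        f"{ES_URL}/_cat/allocation",
        params={"h": "node,shards,disk.used,disk.avail,disk.percent", "format": "json"},
        timeout=10,
    ).json()

    for row in allocation:
        print(
            f"{row['node']}: shards={row['shards']} "
            f"used={row['disk.used']} avail={row['disk.avail']} ({row['disk.percent']}%)"
        )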

2. Forcing Shard Allocation (Use with Caution)

If you understand the risks and have resolved the underlying issue, but the shard remains unassigned, you can manually force an allocation using the _cluster/reroute API. This is a powerful command and should be treated as a last resort. Forcing a shard onto a node that has underlying issues (like low disk space) will only postpone the problem.
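
If you do go down this road, the sketch below (same assumptions as earlier; the index, shard number, and node name are placeholders you would take from the allocation explain output) shows the two most common forms: asking the cluster to retry allocations that hit their failure limit, and explicitly placing an unassigned replica on a named node.

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

    # Retry allocations that were given up on after repeated failures,
    # e.g. once the underlying disk or mapping problem has been fixed.
    requests.post(
        f"{ES_URL}/_cluster/reroute",
        params={"retry_failed": "true"},
        timeout=30,
    ).raise_for_status()

    # Explicitly allocate an unassigned replica on a specific node.
    # Placeholder index, shard number, and node name.
    requests.post(
        f"{ES_URL}/_cluster/reroute",
        json={
            "commands": [
                {
                    "allocate_replica": {
                        "index": "my-index",
                        "shard": 0,
                        "node": "data-node-2",
                    }
                }
            ]
        },
        timeout=30,
    ).raise_for_status()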

3. Proper Cluster Configuration for High Availability

The best way to handle a yellow cluster is to prevent it from happening in the first place with a resilient architecture.

  • Dedicated Node Roles: In a production environment, use dedicated Elasticsearch node roles. Have at least three dedicated master nodes that do nothing but manage the cluster state. Have dedicated data nodes for storing and searching data, and ingest nodes for processing documents. This prevents a high search load from impacting the stability of the master election process.
  • Shard Allocation Awareness: Configure your cluster to be aware of your physical infrastructure. By tagging nodes with their availability zone or rack ID, you can configure Elasticsearch to place a primary shard and its replicas in different physical locations. This way, the failure of an entire rack or AZ will not result in data loss. A configuration sketch follows this list.
  • Regularly Review Cluster Settings: Ensure your Elasticsearch discovery settings (discovery.seed_hosts and cluster.initial_master_nodes) are correctly configured and that your cluster has successfully bootstrapped.
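
Tying back to the shard allocation awareness point above, here is a minimal configuration sketch. The node attribute itself (for example node.attr.zone) is set in each node’s elasticsearch.yml; the dynamic half below tells the allocator to spread shard copies across zones. The zone names are placeholders, and the same Python/requests assumptions apply:

    import requests

    ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

    # Each node must be tagged in its own config, e.g. in elasticsearch.yml:
    #   node.attr.zone: us-east-1a
    # The cluster-level setting tells the allocator to spread copies across zones.
    requests.put(
        f"{ES_URL}/_cluster/settings",
        json={
            "persistent": {
                "cluster.routing.allocation.awareness.attributes": "zone",
                # Optional forced awareness: replicas stay unassigned rather than
                # piling into a single surviving zone (placeholder zone values).
                "cluster.routing.allocation.awareness.force.zone.values": "us-east-1a,us-east-1b",
            }
        },
        timeout=10,
    ).raise_for_status()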

Beyond Yellow: Proactive Cluster Monitoring

Reacting to a yellow cluster status is good, but proactively monitoring the leading indicators of failure is better. An advanced monitoring solution like Netdata gives you the real-time visibility you need to catch problems before they compromise your cluster’s health.

Netdata automatically discovers your Elasticsearch nodes and provides per-second metrics on hundreds of critical health indicators, including:

  • Unassigned Shard Count: Get alerted the instant a shard becomes unassigned, without waiting for the next health check.
  • Disk Usage and Watermarks: Track disk usage on every node and set proactive alerts to notify you long before you hit the high disk watermark.
  • Master Election Events: Monitor the cluster state for any unexpected master node changes or failures in the node fault detection process.
  • JVM Heap and Garbage Collection: Correlate unassigned shards with memory pressure or long GC pauses on a specific data node.

By having this level of detail, you can move from troubleshooting a yellow cluster to preventing one. You can identify nodes under stress, predict disk space shortages, and ensure your cluster topology is always resilient.

To gain a deeper understanding of your Elasticsearch cluster’s health and performance, sign up for a free Netdata account.