Elasticsearch master_not_discovered_exception: no elected master and stalled writes

HTTP 503 and master_not_discovered_exception mean that bulk indexing, index creation, mapping updates, and shard allocation are being rejected. With the default settings, search requests can still be served from each node's last known cluster state, so results may be stale or partial, but the cluster cannot process writes or administrative work. The data nodes may be healthy, but without an elected master, the cluster cannot update shard routing, publish state changes, or acknowledge document writes. The root cause is usually one of four problems inside the master-eligible cohort.

Elasticsearch 7.0 and later use a consensus protocol called Zen2 for master election. Only master-eligible nodes vote. The elected master maintains the cluster state, which describes every index, shard, mapping, alias, and node. That state is serialized and published to all nodes on every change. Without a master, state updates halt and the default cluster.no_master_block setting (write) rejects writes and metadata operations until an election completes.

flowchart TD
    A[master_not_discovered_exception] --> B{Master-eligible quorum available?}
    B -->|No| C[Quorum loss<br/>departed nodes still in voting config]
    B -->|Yes| D{Master node healthy?}
    D -->|No| E[Master GC stalls<br/>or resource exhaustion]
    D -->|Yes| F[Network partition<br/>between master-eligible nodes]
    C --> G[Writes and admin ops stall]
    E --> G
    F --> G
    G --> H[Check GET /_cat/master<br/>Check voting config<br/>Check master logs]

What this means

The master_not_discovered_exception error means the node handling your request cannot locate an elected master and cannot win an election itself. You may also see cluster_block_exception with no master in the error detail. During this state, the cluster cannot allocate shards, create or delete indices, update mappings, or process ingest pipelines that require cluster state changes.

Master elections require a strict majority of the voting configuration. In a three-node master-eligible setup, two nodes must be available. A critical detail in 7.x and later is that the voting configuration can keep listing master-eligible nodes that have departed. Elasticsearch shrinks the voting configuration automatically only while an elected master exists, and never below three nodes. If you permanently remove master-eligible nodes without first excluding them from the voting configuration, they still count toward quorum. For example, if half or more of the master-eligible nodes are removed at once, the remaining nodes cannot form a majority, even though they are all healthy.
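The majority rule is plain integer arithmetic. The sketch below uses two hypothetical shell helpers, majority and has_quorum (names chosen here for illustration, not part of Elasticsearch), to show that the size of the voting configuration, not the number of running nodes, decides whether an election can succeed.

```shell
# Hypothetical helpers for illustration only; not an Elasticsearch API.

# majority SIZE: smallest strict majority of a voting configuration with SIZE nodes.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum SIZE ONLINE: succeeds if ONLINE voting nodes form a strict majority of SIZE.
has_quorum() {
  [ "$2" -ge "$(majority "$1")" ]
}
```

For example, `has_quorum 3 2` succeeds, while `has_quorum 3 1` fails: one surviving node cannot win an election while the voting configuration still lists three.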

Elections are also sensitive to master node health. Cluster fault detection uses follower and leader checks with a 10-second timeout and 1-second interval. Three consecutive failures trigger node removal; a hard TCP disconnect removes the node immediately. If the master suffers a long GC pause, it may miss these checks and be removed, triggering a new election. An oversized cluster state or rapid metadata changes can also cause election timeouts even when all nodes are online.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Quorum loss from departed master-eligible nodes | GET /_cat/master returns nothing; node count dropped below the voting configuration majority; nodes left during a rolling restart or after a crash | GET /_cat/nodes?v&h=name,node.role to count present master-eligible nodes |
| Master node GC stalls or heap pressure | Master drops out and rejoins intermittently; old-generation GC pauses exceed the 10-second fault-detection timeout; pending cluster tasks pile up | Master node JVM heap percentage and GC duration in node logs |
| Network partition between master-eligible nodes | Node count fluctuates as nodes join and leave; master identity changes rapidly in logs; cluster behaves as split even though all processes are running | Network connectivity and latency between master-eligible nodes on the transport port |
| Oversized cluster state overwhelming master | Rapid index creation, mapping explosions, or excessive template changes generate constant state updates; master CPU and heap spike under low query volume; pending tasks grow without clearing | GET /_cluster/pending_tasks and indicators of cluster state version churn |

Quick checks

Run these read-only commands to orient:

# Confirm whether a master is elected
curl -s 'http://localhost:9200/_cat/master?v'

# Check cluster health and visible node count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,unassigned_shards'

# List nodes and roles; identify missing master-eligible nodes
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m'

# Inspect the master task backlog
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# Identify the current master node ID
curl -s 'http://localhost:9200/_cluster/state?filter_path=master_node'

# Review node logs for GC pause warnings around the incident time
grep -i "gc" /var/log/elasticsearch/*.log | tail -20

If /_cat/master returns nothing and number_of_nodes is lower than expected, you are dealing with a node departure or network partition. If the node count looks correct but there is still no master, investigate GC pauses or cluster state overload on the previous master.
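The decision rule in the previous paragraph can be written down as a small shell function. This is a minimal sketch: triage is a hypothetical helper name, and it assumes, as described above, that an empty /_cat/master response means no master is elected. If your version returns an error body instead, the caller would need to map that to an empty string first.

```shell
# triage MASTER_LINE NODES EXPECTED
#   MASTER_LINE: output of /_cat/master (empty when no master is elected)
#   NODES:       number_of_nodes reported by the cluster
#   EXPECTED:    node count you expect to see
triage() {
  if [ -n "$1" ]; then
    echo "master elected"
  elif [ "$2" -lt "$3" ]; then
    echo "node departure or network partition"
  else
    echo "check GC pauses or cluster state overload"
  fi
}
```

A call such as `triage "" 2 3` prints the first branch to investigate when a three-node cluster shows only two nodes and no master.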

How to diagnose it

  1. Confirm the master is missing. GET /_cat/master returning empty means no master is elected. Record number_of_nodes from GET /_cluster/health and compare against your expected master-eligible count.

  2. Determine whether you have a quorum. A strict majority of the voting configuration must be online. With three master-eligible nodes, at least two must be available. If the visible count dropped from three to one, the cluster stalls until enough of the missing nodes return to restore a majority.

  3. Check for permanently departed nodes. In 7.x and later, the voting configuration retains master-eligible nodes that have left. If you decommissioned a node without POST /_cluster/voting_config_exclusions, it may still count toward quorum, preventing election even though the offline node is not returning.

  4. Inspect the previous master’s health. If nodes are present but no master is elected, identify the most recent master from logs or the last /_cat/master output. Check JVM heap usage and GC logs. Old-generation GC pauses longer than 10 seconds cause the node to miss fault-detection checks. After three consecutive failures, the master is removed and a new election must occur. If the master is also a data node, heavy indexing or search load can starve coordination threads and trigger similar symptoms.

  5. Verify network paths. Ensure all master-eligible nodes reach each other on the transport interface. Check for asymmetric routes, firewall changes, or packet loss. A partition that isolates the master from other master-eligible nodes triggers a new election on the quorum side, while the isolated side reports master_not_discovered_exception.

  6. Look for cluster state overload. Run GET /_cluster/pending_tasks. If the backlog exceeds 100 tasks or any task is older than 5 minutes, the master is falling behind. Correlate with rapid index creation, dynamic mapping explosions, or template churn. An oversized cluster state slows serialization and publication, which can cause election timeouts if the master becomes unresponsive while publishing updates.
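The backlog threshold from step 6 can be checked with a short script. The sketch below assumes the /_cat/pending_tasks endpoint prints one task per line when called without the v parameter; count_pending and backlog_exceeded are hypothetical helper names, and the default limit of 100 is the threshold used in this guide, not an Elasticsearch setting.

```shell
# count_pending: read /_cat/pending_tasks output on stdin and print the
# number of non-empty lines (one line per pending task, no header).
count_pending() {
  awk 'NF { n++ } END { print n + 0 }'
}

# backlog_exceeded COUNT [LIMIT]: succeeds if COUNT is above LIMIT.
# LIMIT defaults to 100, the warning threshold used in this guide.
backlog_exceeded() {
  [ "$1" -gt "${2:-100}" ]
}
```

In practice the count could come from `curl -s 'http://localhost:9200/_cat/pending_tasks' | count_pending`, with the result passed to backlog_exceeded inside a monitoring loop.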

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Master stability (GET /_cat/master) | Proves the cluster has a recognized leader | Blank response for more than 30 seconds after startup |
| Node count vs. expected | Loss of master-eligible nodes below quorum blocks elections | Drop in master-eligible count below majority |
| Master node JVM heap used percent | Long GC pauses trigger fault detection timeouts and master removal | Sustained heap above 85 percent or old GC pauses exceeding 10 seconds |
| Pending cluster tasks | Backlog indicates the master cannot process state changes fast enough | More than 100 pending tasks or any task older than 5 minutes |
| Cluster state version and field count | Rapid growth slows the master and increases publication time | Version incrementing more than 10 times per second or field count growing without bound |

Fixes

Restore quorum after node loss

If offline master-eligible nodes can be restarted, bring them back first. Once a majority is online, the cluster elects a master automatically.

If the nodes are permanently gone, exclude them from the voting configuration:

curl -X POST 'localhost:9200/_cluster/voting_config_exclusions?node_names=node_name_1,node_name_2'

Warning: This reduces fault tolerance. Only exclude nodes you do not intend to return. The voting config exclusions API is itself a cluster state operation; if the cluster currently has no master, you must restore enough nodes to establish quorum before the call will succeed.
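Exclusions are meant to be temporary: once the excluded nodes have been shut down, the exclusions list should be cleared with a DELETE request to the same endpoint. The sketch below wraps both calls; the function names are hypothetical, and ES_URL is an assumed variable that defaults to the localhost endpoint used elsewhere in this guide.

```shell
# Assumed endpoint; override ES_URL for your environment.
ES_URL=${ES_URL:-http://localhost:9200}

# exclusions_url NAMES: build the request URL for a comma-separated list of node names.
exclusions_url() {
  echo "${ES_URL}/_cluster/voting_config_exclusions?node_names=$1"
}

# exclude_voters NAMES: add the named nodes to the voting config exclusions list.
exclude_voters() {
  curl -s -X POST "$(exclusions_url "$1")"
}

# clear_exclusions: empty the exclusions list after the excluded nodes are shut down.
clear_exclusions() {
  curl -s -X DELETE "${ES_URL}/_cluster/voting_config_exclusions"
}
```

A typical sequence would be `exclude_voters node_name_1,node_name_2`, then shutting those nodes down, then `clear_exclusions`. Both calls need an elected master, so they only work once quorum has been restored.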

Resolve master node resource exhaustion

If the master steps down due to GC pauses or heap pressure, treat it as a memory incident on the coordination plane. Identify the heap consumer: segment metadata from too many shards, fielddata from text-field aggregations, or an oversized cluster state. If the master is also a data node, migrate to dedicated master nodes to isolate coordination work from indexing and search load. Pause rapid index creation, ILM transitions, and mapping updates until the master recovers. Reduce cluster state size by consolidating indices and capping total field counts.

Repair network partitions

Fix the underlying network path between master-eligible nodes. This may involve resolving DNS issues, removing unexpected firewall rules, or repairing failed network interfaces. Do not restart nodes to force an election until connectivity is fully restored; sequential restarts without a quorum can prolong the outage and trigger additional election failures.
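Before restarting anything, it helps to confirm that each master-eligible node can open a TCP connection to the others on the transport port. The following is a rough sketch, assuming bash with its /dev/tcp pseudo-device and the coreutils timeout command are available; port_open and check_transport are hypothetical helper names, and 9300 is the default transport port, which your cluster may override.

```shell
# port_open HOST PORT: succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_transport PORT HOST...: report reachability of each host on the given port.
check_transport() {
  port=$1
  shift
  for host in "$@"; do
    if port_open "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port UNREACHABLE"
    fi
  done
}
```

Running something like `check_transport 9300 master-1 master-2 master-3` from every master-eligible node (the host names here are placeholders) shows one-way failures that a check from a single machine would miss. A successful connection only proves the port is open, not that latency or packet loss are acceptable.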

Reduce cluster state pressure

Stop automatic index creation and disable dynamic mapping until the master stabilizes. Set index.mapping.total_fields.limit to prevent mapping explosions from unstructured data. Replace per-minute or per-hour time-series indices with ILM-managed rollover to slow the rate of cluster state changes. If pending tasks back up due to snapshot or ILM activity, pause those operations temporarily to let the master clear its queue.
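As an illustration of the field-limit step, the sketch below builds the request body for index.mapping.total_fields.limit and sends it through the update index settings API. The helper names and the ES_URL variable are assumptions made for this example, and the value you choose should reflect your own mappings rather than the number shown in the usage note.

```shell
# fields_limit_body LIMIT: JSON body that sets index.mapping.total_fields.limit.
fields_limit_body() {
  printf '{"index.mapping.total_fields.limit": %d}\n' "$1"
}

# set_fields_limit INDEX LIMIT: apply the limit to an existing index.
# ES_URL is an assumed variable; it defaults to the localhost endpoint.
set_fields_limit() {
  curl -s -X PUT "${ES_URL:-http://localhost:9200}/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(fields_limit_body "$2")"
}
```

For instance, `set_fields_limit my-logs 500` would cap a hypothetical index named my-logs at 500 mapped fields. The request is itself a cluster state change, so it only succeeds once a master is elected.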

Prevention

  • Maintain an odd number of dedicated master-eligible nodes. Three is the standard minimum.
  • Always exclude master-eligible nodes from the voting configuration before permanently decommissioning them.
  • Monitor master node JVM heap, GC duration, and pending cluster tasks separately from data node metrics. Low baseline resource usage is normal for dedicated masters; any sustained CPU or heap spike signals coordination trouble.
  • Monitor cluster state size indicators, such as total field count and state version churn rate, as leading indicators of master overload.
  • Keep master-eligible nodes on stable, low-latency network paths. Avoid stretching them across unreliable segments or regions where partitions are likely.

How Netdata helps

  • Correlate master node JVM heap with election failures. Netdata tracks per-node heap utilization and GC activity. A master step-down preceded by an old-generation GC pause points to resource exhaustion rather than a network partition.
  • Detect node departures in real time. Netdata alerts on node reachability and process uptime changes, catching master-eligible node loss as it happens instead of relying on intermittent API polling.
  • Surface network problems early. OS-level network latency, retransmission, and TCP connection-state metrics reveal transport-layer issues between master-eligible nodes before they trigger quorum loss.
  • Track master overload signals. Combine Netdata CPU and memory metrics with Elasticsearch pending-task trends to spot when the master is falling behind on cluster state updates.