Elasticsearch monitoring maturity model: from survival to expert

Incidents escalate when teams monitor only the survival layer and miss leading indicators that predict cascades. A rising heap floor, a growing segment count, or a master task backlog surface long before cluster health turns red.

This guide organizes Elasticsearch monitoring into four levels: survival, operational, mature, and expert. Each level adds signals that reduce mean time to detection and prevent composite failure patterns. Build level 1 before going live, level 2 before handling production traffic, and levels 3 and 4 after your first serious incident.

flowchart TD
    L1[Level 1 Survival] -->|add throughput, errors, and saturation| L2[Level 2 Operational]
    L2 -->|add leading indicators and per-index rates| L3[Level 3 Mature]
    L3 -->|add state churn, per-shard stats, and pressure| L4[Level 4 Expert]

Level 1: survival

Level 1 is the minimum viable monitoring. These signals answer one question: is the cluster alive and accepting work? A node that fails to respond on port 9200, a cluster that stays red for more than a few minutes, or a heap above 85 percent with old GC firing are page-worthy conditions. Reachability checks should exercise the HTTP API, not just TCP, because the process can accept connections while the HTTP layer is hung or returning 503. Never operate without these baselines. Even a single-node instance needs reachability and disk checks; a flood-stage watermark blocks writes regardless of cluster size.

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Node reachability | Confirms the process is alive and accepting HTTP connections. | No response on port 9200 for more than 30 seconds. |
| Cluster health status | Green, yellow, or red indicates whether all primaries and replicas are assigned. | Red sustained longer than 2 minutes; yellow sustained longer than 30 minutes outside of rolling restarts. |
| JVM heap used percent | Heap pressure precedes GC death spirals and OOM kills. | Sustained above 75 percent; page if above 85 percent with old GC or breaker trips. |
| Node count | A drop means a node left the cluster and shard reallocation will begin. | Any unplanned decrease, or loss of master-eligible quorum. |
| Disk usage vs. watermarks | Low (85%), high (90%), and flood-stage (95%) triggers control allocation and write availability. | Above low watermark; immediate attention at flood stage when indices receive read-only blocks. |
| Indexing rate | Baseline confirmation that the cluster is ingesting data. | Drop to zero while upstream data sources remain active. |
| Search rate | Baseline confirmation that the cluster is serving queries. | Drop to zero while client applications remain active. |
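
The level 1 thresholds can be written down as a handful of checks. The sketch below is a minimal illustration rather than a production alerting rule. It assumes the default percentage watermarks (clusters can override them or use absolute byte values) and an already-parsed /_cluster/health response body. The names disk_stage, survival_findings, expected_nodes, and seconds_in_status are hypothetical and not part of any Elasticsearch or Netdata API.

```python
from typing import List

# Default disk watermarks as percent of disk used.
# Assumption: the cluster has not overridden them.
LOW_WATERMARK, HIGH_WATERMARK, FLOOD_STAGE = 85.0, 90.0, 95.0


def disk_stage(used_percent: float) -> str:
    """Map a disk usage percentage to the highest watermark it has reached."""
    if used_percent >= FLOOD_STAGE:
        return "flood_stage"
    if used_percent >= HIGH_WATERMARK:
        return "high"
    if used_percent >= LOW_WATERMARK:
        return "low"
    return "ok"


def survival_findings(health: dict, expected_nodes: int,
                      seconds_in_status: float) -> List[str]:
    """Check a parsed /_cluster/health body against the level 1 warning signs.

    seconds_in_status is how long the current status has held; the caller
    is assumed to track it across polls.
    """
    findings = []
    status = health.get("status")
    if status == "red" and seconds_in_status > 120:
        findings.append("page: cluster red for more than 2 minutes")
    elif status == "yellow" and seconds_in_status > 1800:
        findings.append("warn: cluster yellow for more than 30 minutes")
    if health.get("number_of_nodes", 0) < expected_nodes:
        findings.append("page: node count below expected")
    return findings
```

The yellow check does not know about rolling restarts, so a real rule would need a way to suppress it during planned maintenance.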

Level 2: operational

Level 2 adds the signals that explain why a surviving cluster is slow or failing. Thread pool rejections show which subsystem is saturated: the write pool for indexing, search for queries, and management for background coordination. Latency splits into query and fetch phases so you know whether shards are slow to scan or slow to retrieve source. Unassigned shards and pending tasks reveal coordination and allocation health. ILM execution status catches silent index accumulation that eventually triggers disk watermark cascades. If you cannot identify which node is hot-spotted or why bulk requests return HTTP 429, you are not yet at level 2.

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Thread pool rejections (check /_nodes/stats/thread_pool) | A full queue means the node is pushing back. Write rejections return HTTP 429; search rejections fail user queries. | Sustained nonzero rate on write, search, or management pools for more than 5 minutes. |
| Old GC count and time | Old-generation pauses are stop-the-world events that drive latency and node removal. Long pauses trigger the master to mark the node as failed, initiating reallocation. | Frequency increasing, or individual pauses exceeding 5 seconds. |
| Search latency (query and fetch) | The slowest shard determines end-to-end latency in the scatter-gather model. | Sustained elevation above 2x baseline; query-phase latency rising independently of fetch. |
| Indexing latency | Rising latency under constant load signals disk I/O contention, merge storms, or pipeline overhead. | Sustained increase above 2x baseline. |
| Unassigned shard count and reason | Unassigned primaries mean unavailable data; unassigned replicas mean degraded redundancy. | Any unassigned primary; replicas unassigned longer than 30 minutes. |
| Pending cluster tasks (check /_cluster/pending_tasks) | Backlog on the master delays allocation, mapping updates, and index creation. A single slow task can block all subsequent metadata changes. | More than 20 tasks, or any task older than 30 seconds. |
| Merge activity and segment count | Merges reclaim deleted documents and improve search speed; too many segments degrade performance and consume heap. | Segment count per shard exceeding 100; total segment memory growing on nodes. |
| Snapshot status and duration | Backups that are not completing compromise recoverability. | Last successful snapshot older than 2x the scheduled interval; failures accumulating. |
| Circuit breaker trips (check /_nodes/stats/breaker) | The system is rejecting operations to prevent OOM. Repeated request breaker trips under query load often mean aggregations are too expensive. | Any trip of the parent breaker; repeated fielddata or request breaker trips. |
| ILM execution status | Stuck policies cause index and shard accumulation that eventually drives disk and heap pressure. | Indices stuck in a phase for longer than the expected transition window. |
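
Thread pool rejections are cumulative counters, so the "sustained nonzero rate" warning sign needs two samples and a subtraction. The following sketch shows one way to do that, assuming both inputs are parsed responses from /_nodes/stats/thread_pool taken interval_s seconds apart. The function names and the restart handling are illustrative assumptions, not an official client API.

```python
from typing import Dict, Iterable, Tuple

PoolKey = Tuple[str, str]  # (node name, thread pool name)


def rejected_counts(stats: dict) -> Dict[PoolKey, int]:
    """Flatten a parsed /_nodes/stats/thread_pool body into per-pool counters."""
    counts: Dict[PoolKey, int] = {}
    for node_id, node in stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        for pool, counters in node.get("thread_pool", {}).items():
            counts[(name, pool)] = counters.get("rejected", 0)
    return counts


def rejection_rates(prev: dict, curr: dict, interval_s: float,
                    pools: Iterable[str] = ("write", "search", "management"),
                    ) -> Dict[PoolKey, float]:
    """Return rejections per second for the watched pools between two samples."""
    watched = set(pools)
    before = rejected_counts(prev)
    rates: Dict[PoolKey, float] = {}
    for key, now in rejected_counts(curr).items():
        if key[1] not in watched:
            continue
        # A node missing from the previous sample has no baseline: treat as 0.
        delta = now - before.get(key, now)
        if delta < 0:
            # Counter went backwards, most likely a node restart:
            # count only what has accumulated since the reset.
            delta = now
        rates[key] = delta / interval_s
    return rates
```

To match the 5-minute condition in the table, a caller would keep the last several results and alert only when a pool's rate stays above zero across all of them.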

Level 3: mature

Level 3 shifts from reactive to predictive. The heap sawtooth floor, not the peak, is the true indicator of memory pressure. Watch the post-GC minimum over a 24-hour window; a floor climbing toward 50 percent of max heap means long-lived objects are leaking or caches are unbounded. Per-index rates expose hot-spotting that cluster-wide averages hide. Cluster state size and version churn warn that the master is approaching instability long before elections flap. Every mapping update or dynamic index creation increments the state version and forces a publish to all nodes. Replica lag measured by sequence number gaps shows redundancy degrading in real time. These signals require historical trending and lower alert thresholds.
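
As a rough illustration of tracking the sawtooth floor, the sketch below picks out the troughs in a series of heap-used-percent samples (for example jvm.mem.heap_used_percent from /_nodes/stats/jvm) and compares the earlier troughs with the later ones. It assumes the samples are frequent enough to catch post-GC lows. The function names, the min_rise threshold, and the half-versus-half comparison are hypothetical choices, not an established algorithm.

```python
from statistics import mean
from typing import List


def gc_floors(samples: List[float]) -> List[float]:
    """Return the troughs of the sawtooth.

    A trough is a sample lower than the one before it and no higher than
    the one after it, which approximates the heap level right after a GC.
    """
    return [cur for prev, cur, nxt in zip(samples, samples[1:], samples[2:])
            if cur < prev and cur <= nxt]


def floor_is_climbing(samples: List[float], limit_percent: float = 50.0,
                      min_rise: float = 2.0) -> bool:
    """Flag a floor that is rising or has already reached limit_percent.

    min_rise (percentage points between the mean of the earlier troughs and
    the mean of the later ones) is an assumed tuning value.
    """
    floors = gc_floors(samples)
    if len(floors) < 4:
        return False  # not enough GC cycles observed to judge a trend
    half = len(floors) // 2
    rise = mean(floors[half:]) - mean(floors[:half])
    return rise >= min_rise or floors[-1] >= limit_percent
```

Running this over a 24-hour window per node keeps the alert tied to the post-GC minimum instead of the peak, which is the distinction the paragraph above draws.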

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Heap sawtooth floor | The post-GC minimum heap is the best leading indicator of long-lived object accumulation. | Floor trending upward over days; approaching 50 percent of max heap. |
| Fielddata cache size and evictions | Fielddata on text fields loads terms into heap and should be near zero in modern deployments. | Size above 10 percent of heap or any evictions occurring. |
| Cluster state size and version | Every mapping, index, and alias inflates the state every node holds in heap. Rapid version increments signal churn. | State consuming more than 5 percent of master heap; version incrementing faster than 10 per second sustained. |
| Translog size and uncommitted operations | Large translogs extend recovery time and indicate flush problems. | Uncommitted size well above the configured flush threshold or growing monotonically. |
| Refresh and flush times | Slow refresh delays when new documents become searchable; slow flush delays Lucene commits and translog truncation. | Average refresh time above 1 second or flush time above 30 seconds sustained. |
| Shard recovery activity | Recovery competes with production traffic for disk I/O and network bandwidth. | Recovery stalled at the same percentage for longer than 30 minutes. |
| Per-index indexing and search rates | Cluster-wide averages hide hot-spotted indices or shards. | Asymmetry where one index dominates node load or one shard carries disproportionate traffic. |
| Per-node segment memory | Segment metadata lives in heap and scales with segment count and field count. | segments.memory consuming more than 10 percent of node heap. |
| File descriptor utilization | Each segment consists of multiple files; exhaustion causes I/O and connection failures. | Above 80 percent of the configured maximum. |
| Replica lag (sequence number gap) | A growing gap between primary and replica means reduced redundancy and potential sync failures. | Global checkpoint trailing max sequence number by more than 10,000 operations and growing. |
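
The replica lag warning sign combines a threshold with a trend, which takes two samples to evaluate. The sketch below assumes parsed responses from the index stats API requested with level=shards, where each shard copy carries a seq_no object with max_seq_no and global_checkpoint. The function names and the (index, shard, node) key are illustrative choices.

```python
from typing import Dict, List, Tuple

ShardKey = Tuple[str, str, str]  # (index, shard id, node id)


def checkpoint_gaps(stats: dict) -> Dict[ShardKey, int]:
    """Compute max_seq_no minus global_checkpoint for every shard copy."""
    gaps: Dict[ShardKey, int] = {}
    for index, body in stats.get("indices", {}).items():
        for shard_id, copies in body.get("shards", {}).items():
            for copy in copies:
                seq = copy.get("seq_no")
                if not seq:
                    continue  # copy did not report sequence numbers
                node = copy.get("routing", {}).get("node", "unknown")
                gaps[(index, shard_id, node)] = (
                    seq["max_seq_no"] - seq["global_checkpoint"])
    return gaps


def lagging_copies(prev: dict, curr: dict,
                   threshold: int = 10_000) -> List[ShardKey]:
    """Return copies whose gap exceeds the threshold and grew since prev."""
    before = checkpoint_gaps(prev)
    return sorted(key for key, gap in checkpoint_gaps(curr).items()
                  if gap > threshold and gap > before.get(key, 0))
```

A copy with no previous sample is compared against a gap of zero, so a shard that first appears already far behind is still reported.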

Level 4: expert

Level 4 is for teams that have debugged enough incidents to know that averages lie. Per-shard segment distributions reveal the single oversized shard behind a latency spike. Hot threads and task cancellation isolate the exact query or merge consuming CPU. Indexing pressure stats break down memory usage by coordinating, primary, and replica stages. Adaptive replica selection metrics show which nodes the cluster is already avoiding. These signals are verbose and expensive to collect continuously, so sample them during incidents or bake them into automated diagnostics that fire when lower-level thresholds breach.
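
To make the "averages lie" point concrete, here is a small sketch that flags shards whose segment count is both high in absolute terms and far above their siblings. It assumes the caller has already built a mapping from a shard label to its segment count, for example from the segments listing APIs. The outlier_shards name and the ratio parameter are hypothetical.

```python
from statistics import median
from typing import Dict, List


def outlier_shards(segment_counts: Dict[str, int], limit: int = 100,
                   ratio: float = 3.0) -> List[str]:
    """Return shards above `limit` segments that also exceed `ratio` times
    the median count of all shards passed in.

    The median is used instead of the mean so that one oversized shard
    does not drag the baseline up and hide itself.
    """
    if not segment_counts:
        return []
    typical = median(segment_counts.values())
    return sorted(shard for shard, count in segment_counts.items()
                  if count > limit and count > ratio * typical)
```

For counts of 24, 31, 412, and 28 segments across four shards, the mean is roughly 124 and suggests a uniformly busy index, while the function singles out only the shard with 412.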

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Cluster state version churn | Rapid version increments indicate unstable routing or excessive metadata changes. | Version rate exceeding 10 per second sustained without planned topology changes. |
| Per-shard segment count distribution | Averages hide individual problem shards with hundreds of segments. | Outlier shards above 100 segments while siblings remain low. |
| Merge throttle time | merges.total_throttled_time_in_millis indicates I/O pressure forced Lucene to slow merges. | Throttle time growing while segment count also grows. |
| OS page cache effectiveness | Elasticsearch relies on the kernel page cache, not heap, for segment access. | Available memory for page cache shrinking relative to total segment data size. |
| Adaptive replica selection stats | ARS routes searches away from struggling nodes; monitoring it reveals hidden hot-spotting. | Specific nodes consistently marked as poor targets by selection heuristics. |
| Indexing pressure stats | Memory-based backpressure at coordinating, primary, and replica stages warns before heap failure. | Current bytes sustained above 80 percent of the 10 percent heap limit, or rejections increasing. |
| Hot threads | CPU consumption breakdown by thread during incidents identifies what the JVM is doing right now. | Persistent hot threads in merge, search, or OTHER_CPU (GC) categories. |
| Long-running tasks | Searches or bulk operations that exceed expected duration hold resources and block queues. | Tasks running longer than 30 seconds on a low-latency cluster. |
| Total mapping field count | Unbounded field growth from dynamic mapping inflates cluster state and heap. | Field count growing without bound or approaching index.mapping.total_fields.limit. |
| Snapshot incremental size | Tracks backup storage growth and segment churn between snapshots. | Incremental size spiking after force merges or indicating unexpected data growth. |
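
The indexing pressure warning sign can be evaluated from two consecutive samples of /_nodes/stats/indexing_pressure. The sketch below is one possible reading of it. It assumes the parsed response exposes memory.limit_in_bytes, memory.current.combined_coordinating_and_primary_in_bytes, and the per-stage rejection counters under memory.total, and it compares the combined coordinating and primary bytes against the limit. The function names are hypothetical.

```python
from typing import List

STAGES = ("coordinating", "primary", "replica")


def _memory(node: dict) -> dict:
    """Return the indexing_pressure.memory section of one node's stats."""
    return node.get("indexing_pressure", {}).get("memory", {})


def _rejections(node: dict) -> int:
    """Sum the cumulative rejection counters across all three stages."""
    total = _memory(node).get("total", {})
    return sum(total.get(f"{stage}_rejections", 0) for stage in STAGES)


def indexing_pressure_findings(prev: dict, curr: dict,
                               ratio_limit: float = 0.8) -> List[str]:
    """Report nodes near the indexing memory limit or with new rejections."""
    findings = []
    for node_id, node in curr.get("nodes", {}).items():
        name = node.get("name", node_id)
        memory = _memory(node)
        limit = memory.get("limit_in_bytes", 0)
        in_flight = memory.get("current", {}).get(
            "combined_coordinating_and_primary_in_bytes", 0)
        if limit and in_flight / limit > ratio_limit:
            findings.append(
                f"{name}: in-flight indexing bytes at "
                f"{in_flight / limit:.0%} of limit")
        before = prev.get("nodes", {}).get(node_id)
        if before is not None:
            grew = _rejections(node) - _rejections(before)
            if grew > 0:
                findings.append(
                    f"{name}: {grew} new indexing pressure rejections")
    return findings
```

Because this kind of check is cheap, it is a reasonable candidate for the automated diagnostics described earlier, triggered when a level 2 threshold such as write rejections breaches.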

Advancing through these levels is not about collecting more metrics for their own sake. It is about reducing the time between symptom and mechanism. Level 1 tells you that something broke. Level 2 tells you which subsystem broke. Level 3 tells you it is breaking before it fails. Level 4 tells you exactly which shard, segment, or query is responsible.

How Netdata helps

Netdata collects Elasticsearch metrics from the JSON stats APIs and correlates them with system-level data, which matters because Elasticsearch performance is inseparable from OS resources.

  • Per-node heap and GC correlation: Netdata surfaces JVM heap alongside OS memory, letting you distinguish heap pressure from page cache starvation without switching tools.
  • Thread pool saturation visibility: Netdata tracks queue depth and rejection rates per pool, so you can spot the transition from queuing to rejection before clients fail.
  • Disk and I/O context: Disk watermark alerts in Elasticsearch are clearer when paired with OS-level I/O wait and throughput, showing whether saturation is from merges, recovery, or co-located workloads.
  • Cluster state and indexing pressure: Netdata collects metrics such as cluster health and indexing pressure bytes, enabling dashboards that show master-level signals alongside data-node saturation.
