Elasticsearch monitoring checklist: the signals every production cluster needs

Elasticsearch failures cascade. A long GC pause on one node causes it to miss fault detection checks; the master removes it. Shards relocate to survivors, increasing heap pressure and thread pool load. If disk is near the high watermark, relocation I/O pushes other nodes toward flood stage, which sets indices to read-only and blocks writes. By the time GET /_cluster/health returns red, the leading indicators fired minutes ago.

Cluster health is a lagging indicator. It tells you damage is done, not that it is coming. This checklist organizes signals into four levels. Use it to audit coverage or build instrumentation. Start with survival, but build dashboards that show all four levels so correlations are obvious.

flowchart TD
    L4[Level 4: Expert - Per-shard, per-task, per-query]
    L3[Level 3: Mature - Leading indicators, caches, state size]
    L2[Level 2: Operational - Queues, latency, rejections, segments]
    L1[Level 1: Survival - Health, heap, nodes, disk]

    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1 - survival

These are the minimum signals. Without them, you learn about failures from users.

  • Node reachability. Poll GET / on every node. No response means the process is down, the network path is broken, or the node is stuck in a GC pause long enough to miss health checks.

  • Cluster health status. GET /_cluster/health. Red means at least one primary shard is unassigned; data is unavailable. Yellow means replica shards are missing; redundancy is degraded. Green does not mean fast.

  • JVM heap used percent. From _nodes/stats/jvm, track mem.heap_used_percent. Sustained usage above 75% indicates memory pressure. Above 85% combined with rising old GC time or circuit breaker trips signals imminent instability.

  • Node count. GET /_cluster/health returns number_of_nodes and number_of_data_nodes. An unplanned drop means the master is reassigning shards, adding load to survivors.

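  • Disk usage relative to watermarks. Read the disk.percent column from GET /_cat/allocation, or compute usage from the fs.total byte counts in _nodes/stats/fs. Compare it to the cluster settings cluster.routing.allocation.disk.watermark.low (default 85%), .high (90%), and .flood_stage (95%). Track time-to-watermark, not just current usage. At flood stage, Elasticsearch applies the index.blocks.read_only_allow_delete block to every index with a shard on the affected node, which rejects writes.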
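
Time-to-watermark can be estimated with simple linear extrapolation between two disk samples. The sketch below is a minimal illustration, not a tuned forecaster: the function names are hypothetical, the thresholds are the default watermarks listed above, and it assumes growth stays roughly linear over the projection horizon.

```python
from typing import Dict, Optional

# Default disk watermarks from the checklist above (percent of disk used).
WATERMARKS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def hours_to_watermark(
    used_pct_then: float,
    used_pct_now: float,
    hours_between: float,
    watermark_pct: float,
) -> Optional[float]:
    """Linearly extrapolate the hours left until a watermark is reached.

    Returns 0.0 if the watermark is already reached, and None if usage is
    flat or shrinking (no projected crossing).
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    if used_pct_now >= watermark_pct:
        return 0.0
    growth_per_hour = (used_pct_now - used_pct_then) / hours_between
    if growth_per_hour <= 0:
        return None
    return (watermark_pct - used_pct_now) / growth_per_hour


def project_all(
    used_pct_then: float, used_pct_now: float, hours_between: float
) -> Dict[str, Optional[float]]:
    """Project the time to each default watermark for one node."""
    return {
        name: hours_to_watermark(used_pct_then, used_pct_now, hours_between, pct)
        for name, pct in WATERMARKS.items()
    }


# Hypothetical node: 70% used yesterday, 76% used now.
print(project_all(70.0, 76.0, 24.0))
```

Alerting on the projected hours rather than the raw percentage gives a fast-filling node at 76% more urgency than a static node at 84%.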

Level 2 - operational

These answer why the cluster is slow and which node is the problem.

  • Thread pool rejection rate. From _nodes/stats/thread_pool, read rejected for the write and search pools. The values are cumulative counters that reset when a node restarts, so alert on the rate of change over one minute, not the raw value. A rising rate means the node is returning HTTP 429.

  • Thread pool queue depth. From the same endpoint, track queue for write, search, and management. Sustained queue growth precedes rejections. A growing management queue deserves particular attention: that pool serves cluster management work, including the stats requests your monitoring depends on, so a backlog there slows diagnosis just when you need it.

  • Search latency. Calculate per-node query phase latency as delta(query_time_in_millis) / delta(query_total) from _nodes/stats/indices/search. Calculate fetch phase latency as delta(fetch_time_in_millis) / delta(fetch_total). High query latency means slow shard execution; high fetch latency means disk-bound reads.

  • Indexing latency. Calculate delta(indexing.index_time_in_millis) / delta(indexing.index_total) from _nodes/stats/indices/indexing. Rising latency under constant load indicates disk I/O contention, merge pressure, or expensive ingest pipelines.

  • Unassigned shard count. GET /_cluster/health returns unassigned_shards and delayed_unassigned_shards. Any unassigned primary is abnormal. GET /_cluster/allocation/explain shows why the allocator has not placed a shard. Shards that exhaust their allocation retries (ALLOCATION_FAILED) stay unassigned until an operator fixes the cause and retries, for example with POST /_cluster/reroute?retry_failed=true.

  • Pending cluster tasks. GET /_cluster/pending_tasks. A backlog means the master cannot process cluster state changes fast enough. This is an early warning of coordination instability even when data operations look healthy.

  • Circuit breaker trips. From _nodes/stats/breaker, track tripped for parent, fielddata, and request. The parent breaker firing even once warrants immediate investigation because it tracks real heap usage. Also compare estimated_size_in_bytes to limit_size_in_bytes to see headroom.

  • Segment count and merge activity. From _nodes/stats/indices/segments, track segments.count and segments.memory_in_bytes. Hundreds of segments per shard, or linear growth in segment memory, degrades search and consumes heap. Check _nodes/stats/indices/merges to see whether merges.current stays pinned at the maximum merge concurrency while segment counts keep climbing, which means merging is falling behind.

  • Snapshot age and SLM health. If using SLM, GET /_slm/policy and check last_success.time. Alert when the last successful snapshot is older than twice the scheduled interval. Snapshot success does not guarantee restore success; test restores periodically.

  • ILM execution. GET /<index>/_ilm/explain. Look for indices blocked in a phase by unmet rollover conditions, missing aliases, or insufficient disk space. Stuck indices silently accumulate shards and consume disk.
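
Several of the signals above are derived from cumulative counters, so they need two samples and a delta. The following sketch shows one way to compute the rejection rate and the average latency per operation. The helper names are hypothetical; the dictionary keys match the counter names quoted above, and the sample values are made up. A counter that goes backwards is assumed to mean the node restarted.

```python
from typing import Mapping, Optional


def counter_delta(prev: int, curr: int) -> int:
    """Delta of a cumulative counter; a drop is treated as a reset to zero."""
    return curr - prev if curr >= prev else curr


def rejection_rate_per_s(prev_rejected: int, curr_rejected: int, interval_s: float) -> float:
    """Rejections per second between two samples of a thread pool's rejected counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return counter_delta(prev_rejected, curr_rejected) / interval_s


def avg_latency_ms(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    time_key: str,
    total_key: str,
) -> Optional[float]:
    """Average milliseconds per operation between two samples.

    Returns None when no operations completed in the interval, so an idle
    node is not reported as having zero latency.
    """
    ops = counter_delta(prev[total_key], curr[total_key])
    if ops == 0:
        return None
    return counter_delta(prev[time_key], curr[time_key]) / ops


# Two made-up samples of one node's search counters, taken 60 seconds apart.
prev_search = {"query_total": 1000, "query_time_in_millis": 5000}
curr_search = {"query_total": 1500, "query_time_in_millis": 12500}

print(avg_latency_ms(prev_search, curr_search, "query_time_in_millis", "query_total"))
print(rejection_rate_per_s(40, 100, 60.0))
```

The same avg_latency_ms helper works for fetch, indexing, refresh, and flush by swapping the key names.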

Level 3 - mature

These predict failures before they cascade.

  • Heap sawtooth floor. Track the minimum heap usage after old GC. A rising floor means long-lived objects are accumulating and the death spiral is approaching regardless of what the peak percentage shows.

  • Fielddata cache size. From _nodes/stats/indices/fielddata, memory_size_in_bytes should be near zero on modern clusters because doc_values handle sorting and aggregations. Significant usage indicates that text fields with fielddata enabled are being sorted or aggregated instead of a keyword sub-field.

  • Query cache hit ratio. From _nodes/stats/indices/query_cache, calculate hit_count / (hit_count + miss_count). Low hit rate on cacheable workloads suggests ineffective caching or excessive refresh invalidation. Because the cache is per-segment, high segment counts amplify misses.

  • Cluster state complexity. A large cluster state consumes heap on every node and slows master publication. Monitor total field count across all indices as a proxy for state complexity.

  • Refresh and flush duration. From _nodes/stats/indices/refresh and indices/flush, calculate delta(total_time_in_millis) / delta(total). Sustained increases point to disk I/O contention or too many shards refreshing independently. If you set refresh_interval to -1 for a bulk load, restore it afterwards.

  • Shard recovery progress. GET /_cat/recovery?v&active_only=true. Stuck recoveries consume network and disk bandwidth that competes with production traffic. Recovery stuck below 50% for more than 30 minutes indicates a blockage such as a disk watermark or network partition.

  • Replica lag. Sequence number gaps between primary and replica indicate replication is falling behind, which weakens redundancy and can lead to removal of the replica from the in-sync set.

  • Per-node segment memory. From _nodes/stats/indices/segments, memory_in_bytes growing linearly is a leading indicator of heap pressure that raw shard count can miss.

  • File descriptor utilization. From _nodes/stats/process, compare open_file_descriptors to max_file_descriptors. Approaching 80% risks IOException: Too many open files even if other metrics look healthy.

  • Indexing pressure. Since ES 7.9, _nodes/stats/indexing_pressure tracks bytes at coordinating, primary, and replica stages. Pressure near the default 10% heap limit warns before thread pool rejections begin.
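
The heap sawtooth floor from the first item in this list can be approximated without GC logs. The sketch below takes periodic mem.heap_used_percent samples, uses the minimum of each fixed window as a stand-in for the post-GC floor, and checks whether that floor keeps climbing. This is an approximation under stated assumptions: the function names, the window size, and the 5-point rise threshold are all illustrative, and each window must be long enough to contain at least one old GC cycle.

```python
from typing import List


def window_floors(heap_pct_samples: List[float], window: int) -> List[float]:
    """Minimum heap percentage in each consecutive, non-overlapping window.

    A trailing partial window is dropped so every floor covers the same span.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    last_start = len(heap_pct_samples) - window + 1
    return [
        min(heap_pct_samples[i:i + window])
        for i in range(0, last_start, window)
    ]


def floor_is_rising(floors: List[float], min_total_rise_pct: float = 5.0) -> bool:
    """True when floors never decrease and rise by at least the threshold overall."""
    if len(floors) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(floors, floors[1:]))
    return non_decreasing and (floors[-1] - floors[0]) >= min_total_rise_pct


# Made-up samples: the peak moves only from 75 to 80, but the floor climbs.
samples = [40, 60, 75, 42, 63, 76, 45, 66, 78, 49, 70, 80]
floors = window_floors(samples, window=3)
print(floors, floor_is_rising(floors))
```

In this made-up series a peak-based alert at 85% never fires, while the floor has already risen from 40 to 49.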

Level 4 - expert

These isolate exact queries, shards, and fields.

  • Cluster state version churn. Poll GET /_cluster/state/version, which returns the version without the full state body, and track the rate of change of the version field. Sustained increments faster than ten per second indicate excessive metadata churn that overwhelms the master even if the cluster looks healthy.

  • Per-shard segment distribution. Use _cat/segments/<index> or _stats?level=shards to find outlier shards with far more segments than the median. This pinpoints specific indexing or merge problems that cluster-wide averages hide.

  • Hot threads. GET /_nodes/hot_threads?threads=9999 reveals exactly what is consuming CPU during an incident. It is essential for live diagnosis and rarely useful for passive alerting.

  • Active tasks. GET /_tasks?detailed=true exposes long-running searches, bulk operations, and stuck internal tasks; add &actions=*search* to narrow the list to search tasks. You can cancel a task with POST /_tasks/<task_id>/_cancel. Warning: cancelling tasks is disruptive and can leave partial state; use it only during emergencies.

  • Adaptive replica selection stats. From _nodes/stats/adaptive_selection, these show which data nodes are being avoided due to poor performance, explaining asymmetric load that uniform node-level averages obscure.

  • Script compilation rate. From _nodes/stats/script, track compilations and cache_evictions. Excessive compilations indicate dynamic script abuse or missing parameterization.

  • OS page cache headroom. ES relies on the kernel page cache for segment access. Starvation produces high search latency with low CPU. Latency remains elevated after restart until the cache warms. Monitor via OS-level tools or Netdata system collectors; the Elasticsearch API does not expose page cache.

  • Total mapping field count. Growth without bound signals an impending mapping explosion. Dynamic mapping on unstructured JSON can create tens of thousands of fields that bloat cluster state and heap.
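
Total mapping field count can be derived by walking the mappings object that GET /<index>/_mapping returns. The sketch below is one hedged way to do it: count_fields is a hypothetical helper, the sample mapping is made up, and the count (every entry under properties, including object fields, plus every multi-field) only approximates what Elasticsearch itself counts toward index.mapping.total_fields.limit.

```python
from typing import Any, Mapping


def count_fields(mapping_node: Mapping[str, Any]) -> int:
    """Recursively count fields in a mapping node.

    Counts each entry under "properties" (object fields included), descends
    into object and nested sub-properties, and adds multi-fields listed
    under "fields".
    """
    total = 0
    for field_def in mapping_node.get("properties", {}).values():
        total += 1                                   # the field itself
        total += count_fields(field_def)             # sub-fields of objects
        total += len(field_def.get("fields", {}))    # multi-fields
    return total


# Made-up mapping in the shape found under <index>.mappings in the response.
sample_mappings = {
    "properties": {
        "@timestamp": {"type": "date"},
        "message": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        },
        "user": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "text"},
            }
        },
    }
}

print(count_fields(sample_mappings))
```

Summing this per index and charting the total over time shows unbounded growth long before a field limit is hit.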

How Netdata helps

Netdata collects Elasticsearch metrics from _cluster/health, _nodes/stats, _cat/shards, and related endpoints. The value is in correlation, not just collection.

  • Correlate JVM heap usage with old GC pause duration and thread pool rejections on the same chart to spot heap pressure before nodes drop out.
  • Overlay disk utilization and I/O wait with _cat/allocation watermark proximity to catch disk watermark cascades before flood stage blocks writes.
  • Baseline indexing and search latency per node from _nodes/stats to flag degradations that cumulative averages hide.
  • Track segment memory and file descriptor utilization per node to surface shard overallocation that cluster-wide health status masks.
  • Separate master node CPU and heap from data nodes, because master instability from metadata overload is invisible in cluster-wide averages.