Elasticsearch monitoring checklist: the signals every production cluster needs
Elasticsearch failures cascade. A long GC pause on one node causes it to miss fault detection checks; the master removes it. Shards relocate to survivors, increasing heap pressure and thread pool load. If disk is near the high watermark, relocation I/O pushes other nodes toward flood stage, which sets indices to read-only and blocks writes. By the time GET /_cluster/health returns red, the leading indicators fired minutes ago.
Cluster health is a lagging indicator. It tells you damage is done, not that it is coming. This checklist organizes signals into four levels. Use it to audit coverage or build instrumentation. Start with survival, but build dashboards that show all four levels so correlations are obvious.
flowchart TD
L4[Level 4: Expert - Per-shard, per-task, per-query]
L3[Level 3: Mature - Leading indicators, caches, state size]
L2[Level 2: Operational - Queues, latency, rejections, segments]
L1[Level 1: Survival - Health, heap, nodes, disk]
L1 --> L2
L2 --> L3
L3 --> L4
Level 1 - survival
These are the minimum signals. Without them, you learn about failures from users.
Node reachability. Poll `GET /` on every node. No response means the process is down, the network path is broken, or the node is stuck in a GC pause long enough to miss health checks.
Cluster health status. `GET /_cluster/health`. Red means at least one primary shard is unassigned; data is unavailable. Yellow means replica shards are missing; redundancy is degraded. Green does not mean fast.
JVM heap used percent. From `_nodes/stats/jvm`, track `mem.heap_used_percent`. Sustained usage above 75% indicates memory pressure. Above 85% combined with rising old GC time or circuit breaker trips signals imminent instability.
Node count. `GET /_cluster/health` returns `number_of_nodes` and `number_of_data_nodes`. An unplanned drop means the master is reassigning shards, adding load to survivors.
Disk usage relative to watermarks. Use `GET /_cat/allocation` or `_nodes/stats/fs` to see `disk.used_percent`. Compare this to the cluster settings `cluster.routing.allocation.disk.watermark.low` (default 85%), `high` (90%), and `flood_stage` (95%). Track time-to-watermark, not just current usage. Flood stage sets indices to read-only.
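
Time-to-watermark can be estimated from two disk usage samples. The sketch below is a minimal illustration, assuming linear growth between the oldest and newest sample; the function name `hours_to_watermark` and the sample format are hypothetical, not part of any Elasticsearch or Netdata API.

```python
from typing import Optional, Sequence, Tuple


def hours_to_watermark(
    samples: Sequence[Tuple[float, float]],
    watermark_pct: float = 90.0,
) -> Optional[float]:
    """Estimate hours until disk usage reaches watermark_pct.

    samples: (unix_seconds, disk_used_percent) pairs, oldest first.
    Returns 0.0 if usage is already at or above the watermark, and None
    when there is too little data or usage is flat or falling.
    Assumption: growth is linear between the first and last sample.
    """
    if len(samples) < 2:
        return None
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if p1 >= watermark_pct:
        return 0.0
    if t1 <= t0 or p1 <= p0:
        return None
    slope_pct_per_sec = (p1 - p0) / (t1 - t0)
    return (watermark_pct - p1) / slope_pct_per_sec / 3600.0


# Example: usage grew from 80% to 82% in one hour, high watermark at 90%.
eta = hours_to_watermark([(0.0, 80.0), (3600.0, 82.0)], watermark_pct=90.0)
```

Alerting on an estimate like this (for example, fewer than 24 hours to the high watermark) gives more warning than a static percentage threshold, at the cost of false alarms during short write bursts.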
Level 2 - operational
These answer why the cluster is slow and which node is the problem.
Thread pool rejection rate. From `_nodes/stats/thread_pool`, calculate `rate(rejected)` over one minute for `write` and `search` pools. These are cumulative counters. A rising rate means the node is returning HTTP 429.
Thread pool queue depth. From the same endpoint, track `queue` for `write`, `search`, and `management`. Sustained queue growth precedes rejections. A growing `management` queue is especially dangerous because it handles cluster state application.
Search latency. Calculate per-node query phase latency as `delta(query_time_in_millis) / delta(query_total)` from `_nodes/stats/indices/search`. Calculate fetch phase latency as `delta(fetch_time_in_millis) / delta(fetch_total)`. High query latency means slow shard execution; high fetch latency means disk-bound reads.
Indexing latency. Calculate `delta(indexing.index_time_in_millis) / delta(indexing.index_total)` from `_nodes/stats/indices/indexing`. Rising latency under constant load indicates disk I/O contention, merge pressure, or expensive ingest pipelines.
Unassigned shard count. `GET /_cluster/health` returns `unassigned_shards` and `delayed_unassigned_shards`. Any unassigned primary is abnormal. `GET /_cluster/allocation/explain` shows why the allocator has not placed a shard. Shards stuck in `ALLOCATION_FAILED` after max retries will not self-heal without operator intervention.
Pending cluster tasks. `GET /_cluster/pending_tasks`. A backlog means the master cannot process cluster state changes fast enough. This is an early warning of coordination instability even when data operations look healthy.
Circuit breaker trips. From `_nodes/stats/breaker`, track `tripped` for `parent`, `fielddata`, and `request`. The `parent` breaker firing even once warrants immediate investigation because it tracks real heap usage. Also compare `estimated_size_in_bytes` to `limit_size_in_bytes` to see headroom.
Segment count and merge activity. From `_nodes/stats/indices/segments`, track `segments.count` and `segments.memory_in_bytes`. Hundreds of segments per shard or linear growth in segment memory degrades search and consumes heap. Check `_nodes/stats/indices/merges` to see if `current` is consistently at max concurrency but falling behind.
Snapshot age and SLM health. If using SLM, call `GET /_slm/policy` and check `last_success.time`. Alert when the last successful snapshot is older than twice the scheduled interval. Snapshot success does not guarantee restore success; test restores periodically.
ILM execution. `GET /<index>/_ilm/explain`. Look for indices blocked in a phase by unmet rollover conditions, missing aliases, or insufficient disk space. Stuck indices silently accumulate shards and consume disk.
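
Several of the signals above are derived from cumulative counters, so they only become meaningful as deltas between two polls. The following is a minimal sketch of that arithmetic; the helper names `per_op_latency_ms` and `rate_per_second` are hypothetical, and the dictionaries stand in for values already extracted from two `_nodes/stats` responses.

```python
from typing import Mapping, Optional


def per_op_latency_ms(
    prev: Mapping[str, float],
    curr: Mapping[str, float],
    time_key: str,
    count_key: str,
) -> Optional[float]:
    """Average milliseconds per operation between two counter snapshots."""
    d_count = curr[count_key] - prev[count_key]
    d_time = curr[time_key] - prev[time_key]
    if d_count <= 0 or d_time < 0:
        # No operations in the window, or counters reset (node restart).
        return None
    return d_time / d_count


def rate_per_second(
    prev_value: float, curr_value: float, interval_seconds: float
) -> Optional[float]:
    """Per-second rate of a cumulative counter such as `rejected`."""
    if interval_seconds <= 0 or curr_value < prev_value:
        return None
    return (curr_value - prev_value) / interval_seconds


# Illustrative values only, not real cluster output.
prev_search = {"query_total": 1000, "query_time_in_millis": 50000}
curr_search = {"query_total": 1500, "query_time_in_millis": 80000}
query_latency = per_op_latency_ms(
    prev_search, curr_search, "query_time_in_millis", "query_total"
)
write_rejections = rate_per_second(12, 42, 60.0)
```

Returning `None` on a counter reset avoids reporting a large negative or nonsensical value after a node restart.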
Level 3 - mature
These predict failures before they cascade.
Heap sawtooth floor. Track the minimum heap usage after old GC. A rising floor means long-lived objects are accumulating and the death spiral is approaching regardless of what the peak percentage shows.
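
One rough way to track the floor is to take the local minima of a sampled heap series and check whether they keep climbing. This sketch assumes the samples are frequent enough that each trough approximates heap usage right after a collection; a more precise version would key on GC collection counts from `_nodes/stats/jvm`. The function names are hypothetical.

```python
from typing import List, Sequence


def sawtooth_floors(heap_pct: Sequence[float]) -> List[float]:
    """Return the local minima (troughs) of a heap usage series."""
    floors = []
    for i in range(1, len(heap_pct) - 1):
        if heap_pct[i] < heap_pct[i - 1] and heap_pct[i] <= heap_pct[i + 1]:
            floors.append(heap_pct[i])
    return floors


def floor_is_rising(heap_pct: Sequence[float], min_troughs: int = 3) -> bool:
    """True if the last min_troughs troughs are strictly increasing."""
    floors = sawtooth_floors(heap_pct)
    if len(floors) < min_troughs:
        return False
    recent = floors[-min_troughs:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Illustrative series: peaks stay near 75-78% while the floor creeps upward.
leaking = [40, 60, 75, 42, 62, 76, 48, 65, 78, 55, 70]
steady = [40, 70, 40, 70, 40, 70, 40, 70]
```

In the `leaking` series the peak barely moves, which is why a peak-based threshold alone would miss the trend.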
Fielddata cache size. From `_nodes/stats/indices/fielddata`, `memory_size_in_bytes` should be near zero on modern clusters because doc_values handles aggregations. Significant usage indicates text fields are being sorted or aggregated without a `keyword` sub-field.
Query cache hit ratio. From `_nodes/stats/indices/query_cache`, calculate `hit_count / (hit_count + miss_count)`. Low hit rate on cacheable workloads suggests ineffective caching or excessive refresh invalidation. Because the cache is per-segment, high segment counts amplify misses.
Cluster state complexity. A large cluster state consumes heap on every node and slows master publication. Monitor total field count across all indices as a proxy for state complexity.
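
For the query cache hit ratio, computing the formula over the delta between two polls shows the recent hit rate instead of the lifetime average since node start. The sketch below assumes the two dictionaries hold `hit_count` and `miss_count` values already read from the query cache stats; the function name is hypothetical.

```python
from typing import Mapping, Optional


def query_cache_hit_ratio(
    prev: Mapping[str, int], curr: Mapping[str, int]
) -> Optional[float]:
    """Hit ratio over the window between two cumulative snapshots."""
    hits = curr["hit_count"] - prev["hit_count"]
    misses = curr["miss_count"] - prev["miss_count"]
    lookups = hits + misses
    if hits < 0 or misses < 0 or lookups == 0:
        # Counter reset or no cache lookups in the window.
        return None
    return hits / lookups


# Illustrative values only.
ratio = query_cache_hit_ratio(
    {"hit_count": 1000, "miss_count": 1000},
    {"hit_count": 1300, "miss_count": 1100},
)
```

Here the lifetime ratio would be about 0.54, while the windowed ratio is 0.75, so the two can diverge noticeably after a workload change.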
Refresh and flush duration. From `_nodes/stats/indices/refresh` and `indices/flush`, calculate `delta(total_time_in_millis) / delta(total)`. Sustained increases point to disk I/O contention or too many shards refreshing independently. Re-enable `refresh_interval` after bulk loads if you disabled it.
Shard recovery progress. `GET /_cat/recovery?v&active_only=true`. Stuck recoveries consume network and disk bandwidth that competes with production traffic. Recovery stuck below 50% for more than 30 minutes indicates a blockage such as a disk watermark or network partition.
Replica lag. Sequence number gaps between primary and replica indicate replication is falling behind, which weakens redundancy and can lead to removal of the replica from the in-sync set.
Per-node segment memory. From `_nodes/stats/indices/segments`, `memory_in_bytes` growing linearly is a leading indicator of heap pressure that raw shard count can miss.
File descriptor utilization. From `_nodes/stats/process`, compare `open_file_descriptors` to `max_file_descriptors`. Approaching 80% risks `IOException: Too many open files` even if other metrics look healthy.
Indexing pressure. Since ES 7.9, `_nodes/stats/indexing_pressure` tracks bytes at coordinating, primary, and replica stages. Pressure near the default 10% heap limit warns before thread pool rejections begin.
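
The file descriptor check above can be sketched as a pure function over a parsed `_nodes/stats/process` response. The function name `fd_utilization_over` and the sample payload are illustrative assumptions; fetching and JSON decoding are left out.

```python
from typing import Any, Dict


def fd_utilization_over(
    nodes_stats: Dict[str, Any], threshold: float = 0.8
) -> Dict[str, float]:
    """Map node name to open/max file descriptor ratio where ratio >= threshold."""
    flagged: Dict[str, float] = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        proc = node.get("process", {})
        open_fds = proc.get("open_file_descriptors")
        max_fds = proc.get("max_file_descriptors")
        # Skip nodes where the values are missing or not reported.
        if open_fds is None or max_fds is None or open_fds < 0 or max_fds <= 0:
            continue
        ratio = open_fds / max_fds
        if ratio >= threshold:
            flagged[node.get("name", node_id)] = ratio
    return flagged


# Illustrative payload shaped like a trimmed _nodes/stats/process response.
sample_stats = {
    "nodes": {
        "abc": {"name": "data-1", "process": {"open_file_descriptors": 900, "max_file_descriptors": 1000}},
        "def": {"name": "data-2", "process": {"open_file_descriptors": 100, "max_file_descriptors": 1000}},
    }
}
at_risk = fd_utilization_over(sample_stats)
```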
Level 4 - expert
These isolate exact queries, shards, and fields.
Cluster state version churn. Poll `GET /_cluster/state` and track the `version` field rate of change. Sustained increments faster than ten per second indicate excessive metadata churn that overwhelms the master even if the cluster looks healthy.
Per-shard segment distribution. Use `_cat/segments/<index>` or `_stats?level=shards` to find outlier shards with far more segments than the median. This pinpoints specific indexing or merge problems that cluster-wide averages hide.
Hot threads. `GET /_nodes/hot_threads?threads=9999` reveals exactly what is consuming CPU during an incident. It is essential for live diagnosis and rarely useful for passive alerting.
Active tasks. `GET /_tasks?detailed=true&actions=*search*` exposes long-running searches, bulk operations, or stuck internal tasks. You can cancel tasks with `POST /_tasks/<task_id>/_cancel`. Warning: Cancelling tasks is disruptive and can leave partial state; use only during emergencies.
Adaptive replica selection stats. These show which data nodes are being avoided due to poor performance, explaining asymmetric load that uniform node-level averages obscure.
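
To find the long-running entries in an active tasks response, a small filter over the parsed JSON is usually enough. This sketch assumes the response has already been fetched and decoded; the function name and the 30-second default are hypothetical choices, and the payload below is illustrative.

```python
from typing import Any, Dict, List, Tuple


def long_running_tasks(
    tasks_response: Dict[str, Any], min_seconds: float = 30.0
) -> List[Tuple[str, str, float]]:
    """Return (task_id, action, running_seconds), longest first."""
    found: List[Tuple[str, str, float]] = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                found.append((task_id, task.get("action", ""), seconds))
    return sorted(found, key=lambda item: item[2], reverse=True)


# Illustrative payload shaped like a trimmed task list response.
sample_tasks = {
    "nodes": {
        "n1": {
            "tasks": {
                "n1:101": {"action": "indices:data/read/search", "running_time_in_nanos": 45_000_000_000},
                "n1:102": {"action": "indices:data/read/search", "running_time_in_nanos": 2_000_000_000},
            }
        },
        "n2": {
            "tasks": {
                "n2:7": {"action": "indices:data/read/search", "running_time_in_nanos": 120_000_000_000},
            }
        },
    }
}
slow = long_running_tasks(sample_tasks)
```

The returned task IDs are the values the cancel endpoint expects, so the output can feed a manual review step before anything is cancelled.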
Script compilation rate. From `_nodes/stats/script`, track `compilations` and `cache_evictions`. Excessive compilations indicate dynamic script abuse or missing parameterization.
OS page cache headroom. ES relies on the kernel page cache for segment access. Starvation produces high search latency with low CPU. Latency remains elevated after restart until the cache warms. Monitor via OS-level tools or Netdata system collectors; the Elasticsearch API does not expose page cache.
Total mapping field count. Growth without bound signals an impending mapping explosion. Dynamic mapping on unstructured JSON can create tens of thousands of fields that bloat cluster state and heap.
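
A per-index field count can be approximated by walking the output of the get-mapping API. The sketch below counts every mapped entry, including object fields and multi-fields; that counting rule is an assumption made for trend tracking and may not match exactly how Elasticsearch enforces its own field limit. The function names are hypothetical.

```python
from typing import Any, Dict


def count_mapped_fields(properties: Dict[str, Any]) -> int:
    """Recursively count fields, object fields, and multi-fields."""
    total = 0
    for field in properties.values():
        total += 1
        total += count_mapped_fields(field.get("properties", {}))  # sub-objects
        total += count_mapped_fields(field.get("fields", {}))  # multi-fields
    return total


def fields_per_index(mapping_response: Dict[str, Any]) -> Dict[str, int]:
    """Map index name to field count for a parsed get-mapping response."""
    return {
        index: count_mapped_fields(body.get("mappings", {}).get("properties", {}))
        for index, body in mapping_response.items()
    }


# Illustrative payload shaped like a trimmed get-mapping response.
sample_mapping = {
    "logs": {
        "mappings": {
            "properties": {
                "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "user": {"properties": {"id": {"type": "keyword"}, "name": {"type": "text"}}},
            }
        }
    }
}
counts = fields_per_index(sample_mapping)
```

Summing the per-index values gives the cluster-wide total, and charting the per-index values shows which index is driving the growth.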
How Netdata helps
Netdata collects Elasticsearch metrics from _cluster/health, _nodes/stats, _cat/shards, and related endpoints. The value is in correlation, not just collection.
- Correlate JVM heap usage with old GC pause duration and thread pool rejections on the same chart to spot heap pressure before nodes drop out.
- Overlay disk utilization and I/O wait with `_cat/allocation` watermark proximity to catch disk watermark cascades before flood stage blocks writes.
- Baseline indexing and search latency per node from `_nodes/stats` to flag degradations that cumulative averages hide.
- Track segment memory and file descriptor utilization per node to surface shard overallocation that cluster-wide health status masks.
- Separate master node CPU and heap from data nodes, because master instability from metadata overload is invisible in cluster-wide averages.







