Elasticsearch indexing rate dropped to zero: where the write path stalls
Your ingestion pipeline reports healthy connections, but Elasticsearch has stopped accepting writes. The index_total counter is flat, upstream queues are building, and documents are erroring or disappearing. Because index_total increments for every indexed document, including updates and each individual index item in a bulk request (deletes are tracked separately in delete_total), a sustained rate of zero while sources are still sending means the write path is stalled. The cluster may still report green health, nodes may still respond to pings, and search may still work, but the pipeline is backed up.
The write path is a chain: the coordinating node routes the document to the correct primary shard. The primary writes to the translog and an in-memory buffer. A refresh creates a Lucene segment, a flush makes it durable, and merges consolidate segments in the background. Replication follows. If any stage blocks, the chain stops. A cluster-wide drop to zero usually points to a global block, saturation event, or resource exhaustion. A partial drop, where some nodes index normally while others report zero, indicates hot-spotting or localized primary unavailability.
flowchart TD
A[Indexing rate near zero] --> B{Cluster health red?}
B -->|Yes| C[Unassigned primary]
B -->|No| D{Disk above 95%?}
D -->|Yes| E[Flood-stage block]
D -->|No| F{Write rejections?}
F -->|Yes| G[Pool saturation or breaker]
F -->|No| H{Merges high?}
H -->|Yes| I[Merge storm]
H -->|No| J[Ingest pipeline or hot threads]
What this means
indices.indexing.index_total is the definitive signal that documents are completing the full write path. It is a cumulative counter that increments for every successfully indexed document, including versioned updates and each index item inside a bulk request; deletions are counted separately in delete_total. A 1000-document bulk contributes 1000 to index_total, not one. When this rate falls to zero while sources are still sending, documents are being rejected, circuit-broken, or stalled before the primary shard acknowledges them. The stall can happen at the coordinating node, the primary, or during replication. Per-node asymmetry is critical: if one node shows zero indexing while others carry load, the cause is a hot-spotted primary or node-level resource exhaustion, not a cluster-wide policy block.
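The delta check described above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function names are hypothetical, and it assumes both samples are parsed JSON from _nodes/stats/indices/indexing and that you have already limited them to data nodes (non-data nodes report zero by design).

```python
from typing import Dict


def index_total_by_node(stats: dict) -> Dict[str, int]:
    """Extract the cumulative index_total per node from a _nodes/stats response."""
    return {
        # With filter_path applied the node name is dropped, so fall back to the node ID.
        node.get("name", node_id): node["indices"]["indexing"]["index_total"]
        for node_id, node in stats["nodes"].items()
    }


def classify_stall(before: dict, after: dict) -> str:
    """Compare two samples and report whether the stall is cluster-wide or partial."""
    first = index_total_by_node(before)
    second = index_total_by_node(after)
    deltas = {name: second[name] - first[name] for name in first if name in second}
    if not deltas:
        return "no data"
    # A delta of zero (or negative, after a node restart resets the counter) counts as stalled.
    stalled = sorted(name for name, delta in deltas.items() if delta <= 0)
    if len(stalled) == len(deltas):
        return "cluster-wide stall"
    if stalled:
        return "partial stall: " + ", ".join(stalled)
    return "indexing"
```

A "partial stall" result names the nodes to inspect first for hot-spotted or unavailable primaries.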
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Write thread pool saturation | HTTP 429 or EsRejectedExecutionException; write queue near max | GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected |
| Merge storm eating I/O | merges.current at max; segment count growing; indexing latency rising | GET /_cat/nodes?v&h=name,merges.current,segments.count |
| Slow ingest pipeline | High CPU on ingest nodes; indexing latency high but queue low | GET /_nodes/stats/ingest?filter_path=nodes.*.ingest |
| Primary shard unavailable | Cluster health red; queries against affected indices fail | GET /_cluster/health and GET /_cluster/allocation/explain |
| Circuit breaker rejecting bulk | HTTP 429; breakers.parent.tripped increasing | GET /_nodes/stats/breaker?filter_path=nodes.*.breakers |
| Flood-stage disk block | Disk >95%; index.blocks.read_only_allow_delete set; writes blocked | GET /_cat/allocation?v and index settings |
Quick checks
Run these read-only commands to narrow the failure domain.
# Confirm cluster health and node count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,unassigned_shards'
# Verify indexing is stalled (take two samples 30s apart)
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'
# Check write thread pool for saturation
curl -s 'http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'
# Check disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'
# Check circuit breaker trips and estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'
# Check for active read-only blocks
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'
# Check merge activity and segment counts
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,merges.current,segments.count,segments.memory'
# Check ingest pipeline processor timing
curl -s 'http://localhost:9200/_nodes/stats/ingest?filter_path=nodes.*.ingest'
# Check indexing pressure memory usage (ES 7.9+)
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?filter_path=nodes.*.indexing_pressure'
How to diagnose it
Confirm the stall. Take two samples of _nodes/stats/indices/indexing 30 seconds apart. If index_total delta is zero across all data nodes, the write path is fully stalled. If only some nodes show zero, note which ones; asymmetry points to hot-spotting or local primary failure.
Check cluster health. If status is red, unassigned primaries are blocking writes. Run GET /_cluster/allocation/explain to identify the exact shard and reason. If health is green or yellow, the stall is not from missing primaries.
Check disk watermarks. Run GET /_cat/allocation. If any data node shows disk usage above 95%, Elasticsearch has likely applied the flood-stage block. Check GET /_all/_settings for index.blocks.read_only_allow_delete. This is the most common cause of a sudden, cluster-wide indexing stop.
Check write thread pool rejections. Run GET /_cat/thread_pool/write. Sustained rejected increases mean the pool is overwhelmed. Note the queue depth. If the queue is full and rejections are climbing, the cluster cannot keep up with ingest volume.
Check circuit breakers. Run GET /_nodes/stats/breaker. If parent.tripped, request.tripped, or in_flight_requests.tripped are increasing, memory protection is rejecting bulk requests. Compare estimated_size_in_bytes to limit_size_in_bytes to see how thin the margin is.
Check merge activity and segment counts. Run GET /_cat/nodes for merges.current and segments.count. If merges.current is persistently at the scheduler’s max thread count and segments.count is growing, a merge storm is consuming all disk I/O and throttling indexing.
Check ingest pipeline stats. Run GET /_nodes/stats/ingest. Look for a specific processor with disproportionately high time_in_millis relative to its count. Grok, enrich, and script processors are the usual suspects. High pipeline time creates back-pressure even when the write thread pool is not saturated.
Check indexing pressure (ES 7.9+). Run GET /_nodes/stats/indexing_pressure. If coordinating or primary stage memory usage is near limit_in_bytes, or if rejection counters are increasing, memory-based admission control is blocking writes before they reach the thread pool.
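The order of these checks mirrors the flowchart at the top of this guide. As a rough sketch, the same decision order can be written as a small Python function. The function name and parameters are hypothetical; the inputs are values you would collect from the commands above, and the 95% threshold assumes the default flood-stage watermark.

```python
def triage(
    health_status: str,
    max_disk_used_pct: float,
    write_rejected_delta: int,
    breaker_tripped_delta: int,
    merges_at_max: bool,
) -> str:
    """Return the most likely stall cause, checking in the flowchart's order."""
    if health_status == "red":
        return "unassigned primary"
    if max_disk_used_pct > 95:  # assumes the default flood-stage watermark
        return "flood-stage block"
    if write_rejected_delta > 0 or breaker_tripped_delta > 0:
        return "pool saturation or breaker"
    if merges_at_max:
        return "merge storm"
    return "ingest pipeline or hot threads"
```

The function returns only the first matching branch. In a real incident several conditions can hold at once, so treat the result as a starting point for investigation.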
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Indexing rate (index_total delta) | Confirms writes are completing | Zero while ingestion sources are active |
| Write thread pool rejected | Direct signal of write path pushback | Sustained delta >0 for more than 5 minutes |
| Write thread pool queue depth | Precursor to rejection | Persistently >50% of configured max |
| Disk used percent per node | Flood stage blocks all writes on affected shards | Any node above 95% |
| Circuit breaker tripped counters | Memory protection rejecting operations | Any delta >0 on parent or request breaker |
| Cluster health status | Red means primaries are unassigned | status: red sustained for >2 minutes |
| Merge current and segment count | Merge storms compete for I/O | merges.current at max concurrency with growing segment count |
| Ingest pipeline processor time | Synchronous processing before write ack | One processor consuming disproportionate time |
| Indexing pressure memory (7.9+) | Earlier backpressure than thread pools | Current memory sustained >80% of limit |
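Several of the thresholds in this table can be evaluated mechanically. The sketch below applies them to a single per-node sample. NodeSample and warning_signs are hypothetical names, and a single sample cannot capture the "sustained" or "persistently" qualifiers, so a real alert would need to evaluate these over a time window.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSample:
    """One point-in-time reading for a node (field names are illustrative)."""
    name: str
    write_queue: int
    write_queue_max: int
    disk_used_pct: float
    breaker_tripped_delta: int
    pressure_bytes: int
    pressure_limit_bytes: int


def warning_signs(s: NodeSample) -> List[str]:
    """Return the table's warning signs that this sample triggers."""
    found = []
    if s.write_queue_max > 0 and s.write_queue / s.write_queue_max > 0.5:
        found.append("write queue above 50% of max")
    if s.disk_used_pct > 95:
        found.append("disk above 95%")
    if s.breaker_tripped_delta > 0:
        found.append("circuit breaker tripped")
    if s.pressure_limit_bytes > 0 and s.pressure_bytes / s.pressure_limit_bytes > 0.8:
        found.append("indexing pressure above 80% of limit")
    return found
```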
Fixes
Write thread pool saturation
Do not increase thread_pool.write.queue_size as a first response. A larger queue delays rejection and increases memory pressure without fixing throughput. Instead, reduce bulk batch sizes from clients to lower per-request memory and CPU spikes. Add data nodes to increase cluster-wide write capacity. If only some nodes show saturation, investigate hot-spotted shards. Rebalancing or routing adjustments may help, but adding capacity is the sustainable fix.
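One way to reduce bulk batch sizes on the client is to cap each request by payload bytes rather than document count. The following is a minimal sketch, assuming plain index actions and a hypothetical 5 MiB default cap; chunk_bulk_actions is an illustrative helper, not part of any Elasticsearch client.

```python
import json
from typing import Iterable, Iterator


def chunk_bulk_actions(
    docs: Iterable[dict], index: str, max_bytes: int = 5 * 1024 * 1024
) -> Iterator[str]:
    """Yield NDJSON bulk bodies of at most max_bytes each.

    A single document larger than max_bytes is still emitted, alone in its own body.
    """
    lines = []
    size = 0
    for doc in docs:
        # Each bulk item is an action line followed by the document source line.
        pair = json.dumps({"index": {"_index": index}}) + "\n" + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)
```

Each yielded string ends with a newline, as the bulk API requires, and can be sent as the body of a POST to /_bulk. On HTTP 429, back off and retry the same body rather than resubmitting at full speed.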
Merge storm eating I/O
Increase index.refresh_interval on heavy-write indices from the default 1s to 10s or 30s to reduce segment creation pressure. Verify that index.merge.scheduler.max_thread_count is appropriate for your storage: one thread for spinning disks, higher for SSDs. If segment counts are in the hundreds per shard, merges are falling behind. Force merge to one segment only on indices that are no longer being written to. Running force merge on a live index creates oversized segments and can stall writes further.
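As an illustration of the refresh interval change, this sketch builds the JSON body for an index settings update. refresh_interval_body is a hypothetical helper, and the body would be sent with PUT to /<index>/_settings for the heavy-write indices you choose.

```python
import json
from typing import Optional


def refresh_interval_body(interval: Optional[str] = "30s") -> str:
    """Build the settings body; passing None serializes to null, which restores the default."""
    return json.dumps({"index": {"refresh_interval": interval}})
```

Calling refresh_interval_body("10s") gives the body for a 10-second interval. Because refresh_interval is a dynamic setting, the change applies without closing the index.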
Slow ingest pipeline
Replace complex grok patterns with dissect processors where possible. Reduce enrichment lookup cardinality or move lookups to the client side. If ingest processing is CPU-bound, add dedicated ingest nodes or enable the ingest role on additional data nodes to distribute pipeline execution.
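To find which processor to replace first, rank processors by average time per document. The sketch below assumes the response shape of GET /_nodes/stats/ingest, where each pipeline lists its processors as single-key objects containing a stats block with count and time_in_millis. Verify that shape against your version before relying on it; slowest_processors is a hypothetical name.

```python
from typing import List, Tuple


def slowest_processors(stats: dict, top: int = 3) -> List[Tuple[float, str, str, str]]:
    """Return (avg_ms_per_doc, node_id, pipeline, processor) tuples, slowest first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for pipeline_name, pipeline in pipelines.items():
            for entry in pipeline.get("processors", []):
                for proc_name, proc in entry.items():
                    proc_stats = proc.get("stats", {})
                    count = proc_stats.get("count", 0)
                    if count == 0:
                        continue  # skip processors that have not run
                    avg_ms = proc_stats.get("time_in_millis", 0) / count
                    rows.append((avg_ms, node_id, pipeline_name, proc_name))
    rows.sort(reverse=True)
    return rows[:top]
```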
Primary shard unavailable
Use GET /_cluster/allocation/explain to identify the allocation block. If the reason is ALLOCATION_FAILED and the shard has exceeded index.allocation.max_retries, run POST /_cluster/reroute?retry_failed=true. If the block is a disk watermark, free space before expecting allocation to resume. Corrupt shards may require restoration from snapshot.
Circuit breaker trip
Reduce bulk request payload size to lower per-request memory overhead. Fix mappings that trigger fielddata loading by using keyword sub-fields for text aggregations. If the parent breaker trips repeatedly, identify the heap consumer: check segments.memory, fielddata cache size, and cluster state growth. Raising breaker limits without fixing the underlying memory pressure risks OOM.
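Before touching breaker limits, it helps to see how much headroom each breaker has left. This sketch computes the remaining fraction from estimated_size_in_bytes and limit_size_in_bytes in a parsed GET /_nodes/stats/breaker response. breaker_headroom is a hypothetical helper.

```python
from typing import Dict, Tuple


def breaker_headroom(stats: dict) -> Dict[Tuple[str, str], float]:
    """Map (node_id, breaker_name) to the fraction of the limit still unused."""
    result = {}
    for node_id, node in stats["nodes"].items():
        for name, breaker in node["breakers"].items():
            limit = breaker["limit_size_in_bytes"]
            if limit <= 0:
                continue  # breaker disabled or unlimited; no meaningful ratio
            used = breaker["estimated_size_in_bytes"]
            result[(node_id, name)] = 1 - used / limit
    return result
```

A parent breaker with only a few percent of headroom during normal load is a sign that the next large bulk request will be rejected.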
Flood-stage disk block
Immediately delete old indices or reduce replica count to free disk space. After space is freed, remove the read-only block explicitly:
# WARNING: Destructive. This removes the block from EVERY index.
# Target specific indices in production instead of _all.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'
In Elasticsearch 7.4 and later, including 8.x, the block is removed automatically when disk usage on the affected node drops below the high watermark. If you cannot free enough space, the block persists until you delete data.
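To target specific indices instead of _all, you can first list which indices actually carry the block. This sketch parses the JSON returned by the read-only block check in the Quick checks section. blocked_indices is a hypothetical helper, and it assumes the setting is returned as the string "true".

```python
from typing import List


def blocked_indices(settings_response: dict) -> List[str]:
    """Return the names of indices with index.blocks.read_only_allow_delete set to true."""
    blocked = []
    for index_name, body in settings_response.items():
        value = (
            body.get("settings", {})
            .get("index", {})
            .get("blocks", {})
            .get("read_only_allow_delete")
        )
        # Settings values come back as strings; accept a boolean as well.
        if str(value).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)
```

Joining the result with commas gives a path segment you can use in place of _all in the settings update.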
Prevention
Monitor per-node indexing rate asymmetry. Cluster-wide averages hide hot-spotted primaries. Alert when one node’s indexing rate deviates significantly from peers hosting a similar primary shard count.
Size bulk requests and refresh intervals for your storage. Large bulks spike memory and trigger breaker trips. Frequent refreshes create segment storms on high-throughput indices.
Set mapping guardrails. Use index.mapping.total_fields.limit and index.mapping.depth.limit to prevent mapping explosions that bloat cluster state and heap, indirectly pressuring the write path.
Plan disk capacity with merge overhead. Merges temporarily require disk space for both old and new segments. Maintain at least 30% free disk to absorb merge spikes and avoid watermark cascades.
Monitor ingest pipelines before they reach production. Track per-processor timing during load tests. A slow grok pattern will not show up in thread pool metrics until it has already stalled the write path.
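The per-node asymmetry alert can be approximated by normalizing each node's indexing rate by its primary shard count and comparing against the median of its peers. This is a minimal sketch: asymmetric_nodes is a hypothetical name, and the 50% tolerance is an arbitrary starting point, not a recommended value.

```python
from statistics import median
from typing import Dict, List


def asymmetric_nodes(
    rates: Dict[str, float], primaries: Dict[str, int], tolerance: float = 0.5
) -> List[str]:
    """Return nodes whose per-primary indexing rate deviates from the peer median."""
    per_primary = {
        node: rates[node] / primaries[node]
        for node in rates
        if primaries.get(node, 0) > 0  # ignore nodes hosting no primaries
    }
    if len(per_primary) < 2:
        return []
    mid = median(per_primary.values())
    if mid == 0:
        return []  # everything is stalled; that is not an asymmetry problem
    return sorted(
        node for node, value in per_primary.items() if abs(value - mid) / mid > tolerance
    )
```

The median is used rather than the mean so that a single stalled or overloaded node does not shift the baseline it is compared against.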
How Netdata helps
Netdata correlates indexing rate with write thread pool rejections and queue depth to isolate saturation from upstream blocks. Disk I/O wait, merge activity, and segment count trends surface merge storms before indexing throughput drops to zero. JVM heap usage alongside circuit breaker estimated sizes anticipates memory rejections. Disk watermark proximity alerts per node catch flood-stage risk before the block is applied. Per-node indexing rate charts expose hot-spotted primaries immediately.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery







