Elasticsearch indexing rate dropped to zero: where the write path stalls

Your ingestion pipeline reports healthy connections, but Elasticsearch has stopped accepting writes. The index_total counter is flat, upstream queues are building, and documents are erroring or disappearing. Because index_total increments for every indexed document, including updates and each index item inside a bulk request, a sustained rate of zero while sources are still sending means the write path is stalled. The cluster may still report green health, nodes may still respond to pings, and search may still work, but the pipeline is backed up.

The write path is a chain: the coordinating node routes the document to the correct primary shard. The primary indexes it into an in-memory buffer and appends it to the translog, then forwards the operation to its replicas before acknowledging. In the background, refreshes turn the buffer into searchable Lucene segments, flushes commit those segments to disk, and merges consolidate them. If any stage blocks, the chain stops. A cluster-wide drop to zero usually points to a global block, saturation event, or resource exhaustion. A partial drop, where some nodes index normally while others report zero, indicates hot-spotting or localized primary unavailability.

flowchart TD
    A[Indexing rate near zero] --> B{Cluster health red?}
    B -->|Yes| C[Unassigned primary]
    B -->|No| D{Disk above 95%?}
    D -->|Yes| E[Flood-stage block]
    D -->|No| F{Write rejections?}
    F -->|Yes| G[Pool saturation or breaker]
    F -->|No| H{Merges high?}
    H -->|Yes| I[Merge storm]
    H -->|No| J[Ingest pipeline or hot threads]

What this means

indices.indexing.index_total is the clearest signal that documents are completing the write path on the primary. It is a cumulative counter that increments for every successfully indexed document, including updates and each index item inside a bulk request; deletes are counted separately in delete_total. A 1000-document bulk contributes 1000 to index_total, not one. When this rate falls to zero while sources are still sending, documents are being rejected, circuit-broken, or stalled before the primary shard acknowledges them. The stall can happen at the coordinating node, the primary, or during replication. Per-node asymmetry is critical: if one node shows zero indexing while others carry load, the cause is a hot-spotted primary or node-level resource exhaustion, not a cluster-wide policy block.
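
The per-node comparison can be sketched in a few lines. This is a minimal illustration, not a monitoring implementation: it assumes two parsed JSON responses from _nodes/stats/indices/indexing taken some seconds apart, and that both samples contain only data nodes. The function names are hypothetical.

```python
from typing import Dict, List


def index_totals(stats: dict) -> Dict[str, int]:
    """Map node id -> index_total from a parsed _nodes/stats/indices/indexing response."""
    return {
        node_id: node["indices"]["indexing"]["index_total"]
        for node_id, node in stats.get("nodes", {}).items()
    }


def stalled_nodes(first: dict, second: dict) -> List[str]:
    """Nodes present in both samples whose index_total did not move."""
    before, after = index_totals(first), index_totals(second)
    return sorted(n for n in after if n in before and after[n] - before[n] == 0)


def classify(first: dict, second: dict) -> str:
    """Distinguish a cluster-wide stall from per-node asymmetry."""
    common = set(index_totals(first)) & set(index_totals(second))
    stalled = stalled_nodes(first, second)
    if not common or not stalled:
        return "healthy"
    return "full_stall" if len(stalled) == len(common) else "partial_stall"
```

A "partial_stall" result is the cue to look at shard placement on the stalled nodes rather than at cluster-wide blocks.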

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Write thread pool saturation | HTTP 429 or EsRejectedExecutionException; write queue near max | GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected |
| Merge storm eating I/O | merges.current at max; segment count growing; indexing latency rising | GET /_cat/nodes?v&h=name,merges.current,segments.count |
| Slow ingest pipeline | High CPU on ingest nodes; indexing latency high but queue low | GET /_nodes/stats/ingest?filter_path=nodes.*.ingest |
| Primary shard unavailable | Cluster health red; queries against affected indices fail | GET /_cluster/health and GET /_cluster/allocation/explain |
| Circuit breaker rejecting bulk | HTTP 429; breakers.parent.tripped increasing | GET /_nodes/stats/breaker?filter_path=nodes.*.breakers |
| Flood-stage disk block | Disk >95%; index.blocks.read_only_allow_delete set; writes blocked | GET /_cat/allocation?v and index settings |

Quick checks

Run these read-only commands to narrow the failure domain.

# Confirm cluster health and node count
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,unassigned_shards'

# Verify indexing is stalled (take two samples 30s apart)
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'

# Check write thread pool for saturation (search pool included for comparison)
curl -s 'http://localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'

# Check disk usage per node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Check circuit breaker trips and estimated sizes
curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.breakers'

# Check for active read-only blocks
curl -s 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete'

# Check merge activity and segment counts
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,merges.current,segments.count,segments.memory'

# Check ingest pipeline processor timing
curl -s 'http://localhost:9200/_nodes/stats/ingest?filter_path=nodes.*.ingest'

# Check indexing pressure memory usage (ES 7.9+)
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?filter_path=nodes.*.indexing_pressure'

How to diagnose it

  1. Confirm the stall. Take two samples of _nodes/stats/indices/indexing 30 seconds apart. If index_total delta is zero across all data nodes, the write path is fully stalled. If only some nodes show zero, note which ones; asymmetry points to hot-spotting or local primary failure.

  2. Check cluster health. If status is red, unassigned primaries are blocking writes. Run GET /_cluster/allocation/explain to identify the exact shard and reason. If health is green or yellow, the stall is not from missing primaries.

  3. Check disk watermarks. Run GET /_cat/allocation. If any data node shows disk usage above 95%, Elasticsearch has likely applied the flood-stage block. Check GET /_all/_settings for index.blocks.read_only_allow_delete. This is the most common cause of a sudden, cluster-wide indexing stop.

  4. Check write thread pool rejections. Run GET /_cat/thread_pool/write. Sustained rejected increases mean the pool is overwhelmed. Note the queue depth. If the queue is full and rejections are climbing, the cluster cannot keep up with ingest volume.

  5. Check circuit breakers. Run GET /_nodes/stats/breaker. If parent.tripped, request.tripped, or in_flight_requests.tripped are increasing, memory protection is rejecting bulk requests. Compare estimated_size_in_bytes to limit_size_in_bytes to see how thin the margin is.

  6. Check merge activity and segment counts. Run GET /_cat/nodes for merges.current and segments.count. If merges.current is persistently at the scheduler’s max thread count and segments.count is growing, a merge storm is consuming all disk I/O and throttling indexing.

  7. Check ingest pipeline stats. Run GET /_nodes/stats/ingest. Look for a specific processor with disproportionately high time_in_millis relative to its count. Grok, enrich, and script processors are the usual suspects. High pipeline time creates back-pressure even when the write thread pool is not saturated.

  8. Check indexing pressure (ES 7.9+). Run GET /_nodes/stats/indexing_pressure. If coordinating or primary stage memory usage is near limit_in_bytes, or if rejection counters are increasing, memory-based admission control is blocking writes before they reach the thread pool.
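
The order of the checks above can be expressed as a small triage function. This is a sketch only: the Snapshot fields are hypothetical names for values you would collect from the APIs in each step, and the 95% threshold assumes the default flood-stage watermark.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for values gathered during diagnosis."""
    status: str                 # from _cluster/health
    max_disk_pct: float         # highest disk.percent in _cat/allocation
    write_rejected_delta: int   # change in write pool 'rejected' between samples
    breaker_tripped_delta: int  # change in any breaker 'tripped' counter
    merges_current: int         # merges.current on the busiest node
    merge_max_threads: int      # configured merge scheduler max thread count


def triage(s: Snapshot) -> str:
    """Return the first matching cause, checked in the same order as the steps."""
    if s.status == "red":
        return "unassigned primary"
    if s.max_disk_pct > 95:  # assumes the default flood-stage watermark
        return "flood-stage block"
    if s.write_rejected_delta > 0 or s.breaker_tripped_delta > 0:
        return "pool saturation or breaker"
    if s.merges_current >= s.merge_max_threads:
        return "merge storm"
    return "ingest pipeline or hot threads"
```

The function returns only the first match; in practice several conditions can hold at once, so treat the result as a starting point.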

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Indexing rate (index_total delta) | Confirms writes are completing | Zero while ingestion sources are active |
| Write thread pool rejected | Direct signal of write path pushback | Sustained delta >0 for more than 5 minutes |
| Write thread pool queue depth | Precursor to rejection | Persistently >50% of configured max |
| Disk used percent per node | Flood stage blocks all writes on affected shards | Any node above 95% |
| Circuit breaker tripped counters | Memory protection rejecting operations | Any delta >0 on parent or request breaker |
| Cluster health status | Red means primaries are unassigned | status: red sustained for >2 minutes |
| Merge current and segment count | Merge storms compete for I/O | merges.current at max concurrency with growing segment count |
| Ingest pipeline processor time | Synchronous processing before write ack | One processor consuming disproportionate time |
| Indexing pressure memory (7.9+) | Earlier backpressure than thread pools | Current memory sustained >80% of limit |
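
As an illustration of the last signal, the following sketch flags nodes whose indexing pressure memory exceeds 80% of the limit. It assumes the response shape of _nodes/stats/indexing_pressure on recent versions, where memory.limit_in_bytes and memory.current.combined_coordinating_and_primary_in_bytes are present; verify the field names against your version before relying on it.

```python
from typing import Dict


def pressure_warnings(stats: dict, warn_at: float = 0.8) -> Dict[str, float]:
    """Return node id -> usage ratio for nodes above the warning threshold."""
    flagged = {}
    for node_id, node in stats.get("nodes", {}).items():
        memory = node.get("indexing_pressure", {}).get("memory", {})
        limit = memory.get("limit_in_bytes", 0)
        if limit <= 0:
            continue  # field missing or unusable; skip rather than guess
        current = memory.get("current", {}).get(
            "combined_coordinating_and_primary_in_bytes", 0
        )
        ratio = current / limit
        if ratio > warn_at:
            flagged[node_id] = round(ratio, 2)
    return flagged
```

A single sample above the threshold is not conclusive; the table's warning sign is sustained usage, so evaluate this over several samples.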

Fixes

Write thread pool saturation

Do not increase thread_pool.write.queue_size as a first response. A larger queue delays rejection and increases memory pressure without fixing throughput. Instead, reduce bulk batch sizes from clients to lower per-request memory and CPU spikes. Add data nodes to increase cluster-wide write capacity. If only some nodes show saturation, investigate hot-spotted shards. Rebalancing or routing adjustments may help, but adding capacity is the sustainable fix.
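
One way to reduce bulk batch sizes on the client is to cap each request by both byte size and document count. The sketch below builds NDJSON bodies for the _bulk endpoint; the function name and the 5 MiB and 500-document defaults are illustrative assumptions, not recommended values.

```python
import json
from typing import Iterable, Iterator, List


def bulk_chunks(
    docs: Iterable[dict],
    index: str,
    max_bytes: int = 5 * 1024 * 1024,  # assumed cap, tune for your cluster
    max_docs: int = 500,               # assumed cap, tune for your cluster
) -> Iterator[str]:
    """Yield NDJSON bulk bodies that stay under the byte and document caps."""
    lines: List[str] = []
    size = 0
    count = 0
    for doc in docs:
        # Each document is an action line followed by its source line.
        pair = json.dumps({"index": {"_index": index}}) + "\n" + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Close the current chunk before it would exceed either cap.
        if lines and (size + pair_size > max_bytes or count >= max_docs):
            yield "".join(lines)
            lines, size, count = [], 0, 0
        lines.append(pair)
        size += pair_size
        count += 1
    if lines:
        yield "".join(lines)
```

A single document larger than max_bytes is still sent alone in its own chunk rather than dropped.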

Merge storm eating I/O

Increase index.refresh_interval on heavy-write indices from the default 1s to 10s or 30s to reduce segment creation pressure. Verify that index.merge.scheduler.max_thread_count is appropriate for your storage: one thread for spinning disks, higher for SSDs. If segment counts are in the hundreds per shard, merges are falling behind. Force merge to one segment only on indices that are no longer being written to. Running force merge on a live index creates oversized segments and can stall writes further.

Slow ingest pipeline

Replace complex grok patterns with dissect processors where possible. Reduce enrichment lookup cardinality or move lookups to the client side. If ingest processing is CPU-bound, add dedicated ingest nodes or enable the ingest role on additional data nodes to distribute pipeline execution.
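
To decide which processor to rewrite first, it helps to rank processors by average time per document. This sketch assumes the usual shape of a _nodes/stats/ingest response, where each pipeline carries a processors list of single-key objects with a stats block; the function name is hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


def processor_cost(stats: dict) -> List[Tuple[str, str, float]]:
    """Return (pipeline, processor, ms per document) sorted slowest first."""
    totals = defaultdict(lambda: [0, 0])  # key -> [count, time_in_millis]
    for node in stats.get("nodes", {}).values():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for pipeline_name, pipeline in pipelines.items():
            for entry in pipeline.get("processors", []):
                for processor_name, processor in entry.items():
                    s = processor.get("stats", {})
                    acc = totals[(pipeline_name, processor_name)]
                    acc[0] += s.get("count", 0)
                    acc[1] += s.get("time_in_millis", 0)
    rows = [
        (pipeline, processor, millis / count)
        for (pipeline, processor), (count, millis) in totals.items()
        if count > 0
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

The counters are cumulative since node start, so comparing two samples gives a better picture of current behaviour than a single reading.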

Primary shard unavailable

Use GET /_cluster/allocation/explain to identify the allocation block. If the reason is ALLOCATION_FAILED and the shard has exceeded index.allocation.max_retries, run POST /_cluster/reroute?retry_failed=true. If the block is a disk watermark, free space before expecting allocation to resume. Corrupt shards may require restoration from snapshot.

Circuit breaker trip

Reduce bulk request payload size to lower per-request memory overhead. Fix mappings that trigger fielddata loading by using keyword sub-fields for text aggregations. If the parent breaker trips repeatedly, identify the heap consumer: check segments.memory, fielddata cache size, and cluster state growth. Raising breaker limits without fixing the underlying memory pressure risks OOM.
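
Before changing anything, it is useful to see how close each breaker is to its limit. The following sketch computes the used fraction per breaker from a parsed _nodes/stats/breaker response, using the estimated_size_in_bytes, limit_size_in_bytes, and tripped fields; the function name is hypothetical.

```python
from typing import List, Tuple


def breaker_usage(stats: dict) -> List[Tuple[str, str, float, int]]:
    """Return (node, breaker, used fraction, tripped) sorted by fraction, highest first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        for name, breaker in node.get("breakers", {}).items():
            limit = breaker.get("limit_size_in_bytes", 0)
            if limit <= 0:
                continue  # skip breakers without a usable limit
            used = breaker.get("estimated_size_in_bytes", 0) / limit
            rows.append((node_id, name, round(used, 3), breaker.get("tripped", 0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

A breaker sitting near 1.0 with a rising tripped count is the one rejecting requests; a high fraction with no trips is a warning rather than a cause.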

Flood-stage disk block

Immediately delete old indices or reduce replica count to free disk space. After space is freed, remove the read-only block explicitly:

# WARNING: Broad scope. This removes the block from EVERY index, including any block set deliberately.
# Target specific indices in production instead of _all.
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'

In Elasticsearch 7.4 and later, including 8.x, the block is removed automatically once disk usage on the affected node drops below the high watermark, so the explicit request is mainly needed on older versions. If you cannot free enough space, the block persists until you delete data.
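
To target specific indices instead of _all, you can derive the list from the settings response used in the quick checks. This is a sketch under the assumption that the parsed response maps each index name to settings.index.blocks.read_only_allow_delete with the string value "true"; the function names are hypothetical and nothing here sends a request.

```python
import json
from typing import List, Tuple


def blocked_indices(settings: dict) -> List[str]:
    """Names of indices whose read_only_allow_delete block is set."""
    names = []
    for name, body in settings.items():
        value = (
            body.get("settings", {})
            .get("index", {})
            .get("blocks", {})
            .get("read_only_allow_delete")
        )
        if str(value).lower() == "true":
            names.append(name)
    return sorted(names)


def unblock_request(indices: List[str]) -> Tuple[str, str]:
    """Build the path and JSON body for a PUT that clears the block on the given indices."""
    if not indices:
        raise ValueError("no indices to unblock")
    path = "/" + ",".join(indices) + "/_settings"
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return path, body
```

Send the resulting path and body with the HTTP client of your choice only after disk space has been freed; otherwise the block is reapplied.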

Prevention

Monitor per-node indexing rate asymmetry. Cluster-wide averages hide hot-spotted primaries. Alert when one node’s indexing rate deviates significantly from peers hosting a similar primary shard count.
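
A simple form of this alert compares each node's rate with the median of its peers. The sketch below is illustrative: the input is a hypothetical mapping of node name to documents per second, the 0.5 fraction is an assumed threshold, and it presumes the nodes host a similar number of primaries.

```python
from statistics import median
from typing import Dict, List


def lagging_nodes(rates: Dict[str, float], fraction: float = 0.5) -> List[str]:
    """Nodes whose indexing rate is below a fraction of the peer median."""
    if len(rates) < 2:
        return []  # nothing to compare against
    mid = median(rates.values())
    if mid <= 0:
        return []  # cluster-wide stall; asymmetry is not the issue
    return sorted(node for node, rate in rates.items() if rate < fraction * mid)
```

When the median itself is zero the function returns nothing, because that case is a cluster-wide stall and should be caught by the indexing rate alert instead.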

Size bulk requests and refresh intervals for your storage. Large bulks spike memory and trigger breaker trips. Frequent refreshes create segment storms on high-throughput indices.

Set mapping guardrails. Use index.mapping.total_fields.limit and index.mapping.depth.limit to prevent mapping explosions that bloat cluster state and heap, indirectly pressuring the write path.

Plan disk capacity with merge overhead. Merges temporarily require disk space for both old and new segments. Maintain at least 30% free disk to absorb merge spikes and avoid watermark cascades.

Monitor ingest pipelines before they reach production. Track per-processor timing during load tests. A slow grok pattern will not show up in thread pool metrics until it has already stalled the write path.

How Netdata helps

Netdata correlates indexing rate with write thread pool rejections and queue depth to isolate saturation from upstream blocks. Disk I/O wait, merge activity, and segment count trends surface merge storms before indexing throughput drops to zero. JVM heap usage alongside circuit breaker estimated sizes anticipates memory rejections. Disk watermark proximity alerts per node catch flood-stage risk before the block is applied. Per-node indexing rate charts expose hot-spotted primaries immediately.