Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
A bulk request can return HTTP 200 while rejecting individual documents inside it. indices.indexing.index_failed climbs, but pipelines that only check HTTP status miss the rejections and documents disappear. This guide covers three failure classes: mapper parsing and type conflicts, per-item bulk errors hidden in HTTP 200 responses, and version_conflict_engine_exception under concurrent updates. It also distinguishes these from node-level write rejections and circuit breaker trips, which produce different symptoms and require different fixes.
What this means
Document-level indexing failures are hard rejections: the document reached the primary shard, failed validation, and was rejected, incrementing indices.indexing.index_failed. This differs from a write thread pool rejection, where the node never accepted the request and the client gets HTTP 429. It also differs from a circuit breaker trip, which rejects the entire request to protect the node from OOM.
The bulk API is the most common source. It returns HTTP 200 when the request is processed, but sets errors: true with per-item status codes in the response body. Clients that only check HTTP status miss failures. index_failed aggregates hard rejections without error-type breakdown, so correlate it with bulk response items or logs for root cause. Because the metric is reported per data node, aggregate it across the cluster when calculating failure rates.
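The per-item check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference client: it assumes the bulk response body has already been parsed into a dict, and the function name `summarize_bulk_failures` and the sample payload are hypothetical.

```python
from collections import Counter


def summarize_bulk_failures(bulk_response):
    """Return (failed_items, counts_by_error_type) for a parsed _bulk response body."""
    failed = []
    # The top-level "errors" flag is the cheap first check; HTTP 200 says nothing here.
    if not bulk_response.get("errors"):
        return failed, Counter()
    for item in bulk_response.get("items", []):
        # Each item has one key naming the action: index, create, update, or delete.
        for action, result in item.items():
            error = result.get("error")
            if error:
                failed.append({
                    "action": action,
                    "_index": result.get("_index"),
                    "_id": result.get("_id"),
                    "status": result.get("status"),
                    "type": error.get("type"),
                    "reason": error.get("reason"),
                })
    return failed, Counter(f["type"] for f in failed)


# Hypothetical response: HTTP 200 overall, one accepted and one rejected document.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "logs", "_id": "1", "status": 201}},
        {"index": {"_index": "logs", "_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [count] of type [integer]"}}},
    ],
}

failed, counts = summarize_bulk_failures(sample_response)
for f in failed:
    print(f["_index"], f["_id"], f["type"], f["reason"])
```

Grouping by error type gives the breakdown that index_failed alone does not provide.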
flowchart TD
A[index_failed spike or bulk errors] --> B{Cluster block?}
B -->|Yes| C[Check disk watermarks and index.blocks.*]
B -->|No| D{Per-item error type}
D -->|mapper_parsing| E[Check mapping vs incoming document]
D -->|version_conflict| F[Check concurrency and retry logic]
D -->|HTTP 429| G[Check write thread pool and circuit breakers]
C --> H[Free disk or remove block]
E --> I[Fix schema or pipeline]
F --> J[Reduce contention or accept baseline]
G --> K[Backpressure or scale out]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Mapper parsing exception or type conflict | index_failed spikes; bulk items show mapper_parsing_exception or illegal_argument_exception; often follows a mapping change or schema drift | GET /<index>/_mapping against the rejected document’s fields |
| Version conflict under concurrent updates | Bulk items show version_conflict_engine_exception; rate correlates with concurrent updates or rapid retries on the same _id | Application concurrency model and whether updates target the same documents |
| Cluster block (disk flood stage or read-only) | Bulk items show cluster_block_exception; entire indices reject writes while read paths remain functional | GET /_cluster/health and GET /<index>/_settings?filter_path=*.settings.index.blocks |
Quick checks
Run these safe, read-only commands to classify the failure.
# Check document-level hard rejections
curl -s 'http://localhost:9200/_nodes/stats/indices?filter_path=nodes.*.indices.indexing.index_failed'
# Compare failed to total indexing operations to get a failure ratio
curl -s 'http://localhost:9200/_nodes/stats/indices?filter_path=nodes.*.indices.indexing.index_total,nodes.*.indices.indexing.index_failed'
# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards'
# Check for index-level read-only blocks
curl -s 'http://localhost:9200/<INDEX>/_settings?filter_path=*.settings.index.blocks'
# Check write thread pool rejections (node-level backpressure)
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
# Check disk allocation for flood-stage blocks
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total'
# Check pending cluster tasks for master pressure
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
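Because index_total and index_failed are reported per node, the ratio is only meaningful once the counters are summed. The sketch below assumes the JSON from the `_nodes/stats/indices` call has been parsed into a dict; the function name and the sample node IDs are hypothetical.

```python
def cluster_failure_ratio(nodes_stats):
    """Sum per-node indexing counters and return index_failed / index_total."""
    total = 0
    failed = 0
    for node in nodes_stats.get("nodes", {}).values():
        indexing = node.get("indices", {}).get("indexing", {})
        total += indexing.get("index_total", 0)
        failed += indexing.get("index_failed", 0)
    # Avoid dividing by zero on an idle or freshly started cluster.
    return failed / total if total else 0.0


# Hypothetical two-node payload shaped like the filtered _nodes/stats output.
sample_stats = {
    "nodes": {
        "node-a": {"indices": {"indexing": {"index_total": 1000, "index_failed": 1}}},
        "node-b": {"indices": {"indexing": {"index_total": 3000, "index_failed": 3}}},
    }
}

print(f"{cluster_failure_ratio(sample_stats):.4%}")
```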
How to diagnose it
- Distinguish document-level failures from node-level rejections. If index_failed is climbing while thread_pool.write.rejected is flat, the problem is document validation or conflicts, not node saturation. Check both metrics via _nodes/stats and confirm the HTTP response code: 429 indicates node-level backpressure, while HTTP 200 with bulk errors: true indicates document-level issues.
- Inspect bulk response bodies on the client side. Look for errors: true and iterate the items array. Failed items contain an error object with type and reason. Log the _id, _index, and error.reason of failed items. Do not rely on HTTP 200 alone.
- If the error type is mapper_parsing_exception, compare the rejected document against the index mapping. Check for type mismatches (string sent to an integer field), unknown fields under dynamic: strict, or date format mismatches. If dynamic mapping is enabled, verify it did not infer an incompatible type for a new field.
- If the error type is version_conflict_engine_exception, measure the rate relative to your indexing volume. A low baseline rate is normal under concurrent updates. A sustained spike suggests excessive contention on the same document IDs.
- If the error type is cluster_block_exception, check disk watermarks with _cat/allocation. If a node exceeds the flood stage (95% by default), indices with shards on that node are read-only. Also check for explicit index.blocks.write or index.blocks.read_only settings. If no node is above flood stage but the block persists, investigate whether a maintenance script or security tool applied it explicitly.
- Check for mapping explosions or recent mapping changes. A sudden increase in index_failed after a deployment often means a schema change introduced a type conflict. Review the mapping for runaway dynamic field creation, especially fields mapped as text with keyword subfields that were intended to be pure keywords, or numeric fields that received string values.
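The branching in the steps above can be condensed into a small lookup. This is only a sketch of the triage order described in this guide; the function name `next_diagnostic_step` and the wording of each hint are illustrative assumptions, not part of any Elasticsearch client.

```python
# Map the per-item error type to the diagnostic step this guide suggests.
NEXT_STEP_BY_ERROR_TYPE = {
    "mapper_parsing_exception": "compare the document with GET /<index>/_mapping",
    "version_conflict_engine_exception": "measure conflict rate against indexing volume",
    "cluster_block_exception": "check disk watermarks and index.blocks.* settings",
}


def next_diagnostic_step(status, error_type):
    """Return a short hint for one failed item, given its status and error.type."""
    # 429 means the node pushed back; the document itself was never validated.
    if status == 429:
        return "check write thread pool rejections and circuit breakers"
    return NEXT_STEP_BY_ERROR_TYPE.get(
        error_type, "inspect error.reason and the node logs"
    )


print(next_diagnostic_step(400, "mapper_parsing_exception"))
print(next_diagnostic_step(429, "es_rejected_execution_exception"))
```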
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| indices.indexing.index_failed | Hard document-level rejections | Sudden spike from near-zero, or sustained rate >0.1% of indexing rate |
| Bulk response errors flag | HTTP 200 hides per-item failures | errors: true in any bulk response |
| Write thread pool rejected | Node-level backpressure, not doc-level | Sustained delta >0 over 5 minutes |
| index.blocks.read_only_allow_delete | Flood stage blocks all writes | Block present on actively written indices |
| version_conflict_engine_exception rate | Optimistic concurrency control (OCC) collision indicator | Rate exceeding historical baseline under normal load |
| Pending cluster tasks | Master instability can block writes and allocation | >100 pending tasks or tasks >30 seconds old |
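The 0.1% warning sign is easier to evaluate over a time window than on lifetime counters, since the indexing counters are cumulative. The sketch below assumes two cluster-wide snapshots of (index_total, index_failed) taken some interval apart; the function names and the sample numbers are hypothetical, and the default threshold simply mirrors the value in the table.

```python
def window_failure_rate(prev, curr):
    """Failure rate between two (index_total, index_failed) snapshots, or None."""
    delta_total = curr[0] - prev[0]
    delta_failed = curr[1] - prev[1]
    # A negative delta suggests a counter reset (for example a node restart);
    # a zero delta means no indexing happened, so no rate can be computed.
    if delta_total <= 0 or delta_failed < 0:
        return None
    return delta_failed / delta_total


def breaches_threshold(prev, curr, threshold=0.001):
    """True when the windowed failure rate exceeds the threshold (0.1% by default)."""
    rate = window_failure_rate(prev, curr)
    return rate is not None and rate > threshold


# Hypothetical snapshots taken five minutes apart.
earlier = (100_000, 10)
later = (110_000, 30)
print(window_failure_rate(earlier, later), breaches_threshold(earlier, later))
```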
Fixes
Mapper parsing and type conflicts
Stop the pipeline from sending bad documents. Update the index mapping explicitly for legitimate new fields, or fix the producer to send the correct type. Under strict mapping, unknown fields require either an explicit mapping addition or removal from the document. If an existing mapped field has the wrong type, reindex into a new index with an explicit mapping. You cannot change the type of an existing mapped field.
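One way to stop bad documents at the producer is a pre-flight check against the expected mapping. The following is a deliberately simplified sketch: it handles flat fields and a handful of types only, it is stricter than Elasticsearch (which coerces some values by default), and the names `TYPE_CHECKS` and `find_mapping_problems` are hypothetical.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Only a few field types are covered; anything else is skipped, not rejected.
TYPE_CHECKS = {
    "integer": _is_int,
    "long": _is_int,
    "float": _is_number,
    "double": _is_number,
    "keyword": lambda v: isinstance(v, str),
    "text": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
}


def find_mapping_problems(doc, properties, strict=True):
    """Compare a flat document with a mapping 'properties' dict; return problems."""
    problems = []
    for field, value in doc.items():
        spec = properties.get(field)
        if spec is None:
            if strict:  # mirrors the effect of dynamic: strict
                problems.append(f"{field}: not in mapping")
            continue
        check = TYPE_CHECKS.get(spec.get("type"))
        if check is None:
            continue  # type not covered by this sketch
        # A field may hold a single value or an array of values.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is not None and not check(v):
                problems.append(f"{field}: expected {spec['type']}, got {type(v).__name__}")
                break
    return problems


properties = {"count": {"type": "integer"}, "name": {"type": "keyword"}}
print(find_mapping_problems({"count": "abc", "name": "x", "extra": 1}, properties))
```

Documents that fail the check can be routed to a dead-letter queue instead of being sent in the bulk request.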
Version conflicts
Accept a baseline rate of version conflicts under concurrency. Do not chase zero. If the rate is pathological, reduce concurrency on hot document IDs. Architectural fixes, such as single-writer patterns per partition or idempotent writes with version_type=external, outperform client-side retries. If you use the Update API, configure its built-in retry_on_conflict parameter.
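As an illustration of the last point, the sketch below builds (but does not send) an Update API request with retry_on_conflict set. It assumes the typeless `/<index>/_update/<id>` endpoint used by Elasticsearch 7 and later; the helper name, the index name `orders`, and the retry count of 3 are hypothetical choices.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_update_request(base_url, index, doc_id, partial_doc, retries=3):
    """Build a partial-update request that lets Elasticsearch retry on conflict."""
    query = urlencode({"retry_on_conflict": retries})
    url = f"{base_url}/{quote(index, safe='')}/_update/{quote(doc_id, safe='')}?{query}"
    body = json.dumps({"doc": partial_doc}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_update_request(
    "http://localhost:9200", "orders", "42", {"status": "shipped"}
)
print(req.get_method(), req.full_url)
# Sending it is left out of the sketch: urllib.request.urlopen(req)
```

Keeping the retry on the server side avoids a client round trip per conflict, but it does not remove the underlying contention on hot document IDs.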
Cluster blocks
If disk flood stage triggered the block, free disk space first: delete old indices, reduce replica counts temporarily, or expand storage. WARNING: Deleting indices is destructive and cannot be undone. Reducing replica counts impairs fault tolerance and may trigger relocations that temporarily increase disk usage.
In Elasticsearch 7.4 and later, the read_only_allow_delete block is released automatically once disk usage drops below the high watermark. If it persists, or on older versions, clear it manually with PUT /<index>/_settings, setting index.blocks.read_only_allow_delete to null. If the block was set explicitly during maintenance or a security incident, remove it only after understanding why it was applied. If you remove the block without fixing the underlying disk pressure, it is re-applied at the next disk check.
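Before clearing a block, it helps to confirm that no node is still above flood stage. The sketch below assumes the output of `_cat/allocation?format=json` parsed into a list of dicts, where numeric columns arrive as strings; the function name and sample rows are hypothetical, and 95 is the default flood-stage percentage mentioned earlier.

```python
import json


def nodes_over_flood_stage(allocation_rows, flood_stage_percent=95):
    """Return names of nodes whose disk.percent is at or above flood stage."""
    over = []
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # the UNASSIGNED row carries no disk figures
        if float(percent) >= flood_stage_percent:
            over.append(row.get("node"))
    return over


# Settings body that resets the block; None serializes to JSON null.
CLEAR_BLOCK_BODY = {"index.blocks.read_only_allow_delete": None}

# Hypothetical rows shaped like _cat/allocation?format=json output.
sample_rows = [
    {"node": "data-1", "disk.percent": "96"},
    {"node": "data-2", "disk.percent": "71"},
    {"node": "UNASSIGNED", "disk.percent": None},
]

still_full = nodes_over_flood_stage(sample_rows)
if still_full:
    print("free disk first on:", still_full)
else:
    print("PUT /<index>/_settings", json.dumps(CLEAR_BLOCK_BODY))
```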
Prevention
- Validate documents against the expected mapping before sending them to Elasticsearch.
- Monitor index_failed as a ratio of index_total, not just an absolute count. Alert when the failure rate exceeds 0.1% of successful indexing.
- Monitor disk watermarks and ILM execution to prevent flood-stage blocks.
- Client applications must inspect the bulk response items array, not just the HTTP status code.
- Track version conflict rates as a normal operating metric. Set thresholds based on your concurrency model, not an arbitrary zero target.
- Cap field count with index.mapping.total_fields.limit to prevent mapping explosions from dynamic mapping.
- Audit field cardinality regularly if you rely on dynamic mapping; unexpected high-cardinality fields used as object keys can explode the mapping.
How Netdata helps
- Correlate index_failed spikes with disk watermark breaches, JVM heap pressure, and write thread pool rejections on the same timeline.
- Alert when the indexing failure rate deviates from baseline, distinguishing document-level errors from node-level saturation.
- Surface cluster health transitions and per-node allocation pressure so you catch flood-stage blocks before they stop writes.
- Historical metrics for indexing.index_total and indexing.index_time_in_millis help determine whether a failure burst correlates with a traffic surge or a schema change.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch high disk watermark [90%] exceeded: shard relocation and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor







