Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts

A bulk request can return HTTP 200 while rejecting individual documents inside it. indices.indexing.index_failed climbs, but pipelines that only check HTTP status miss the rejections and documents disappear. This guide covers three failure classes: mapper parsing and type conflicts, per-item bulk errors hidden in HTTP 200 responses, and version_conflict_engine_exception under concurrent updates. It also distinguishes these from node-level write rejections and circuit breaker trips, which produce different symptoms and require different fixes.

What this means

Document-level indexing failures are hard rejections: the document reached the primary shard, failed validation, and was rejected, incrementing indices.indexing.index_failed. This differs from a write thread pool rejection, where the node never accepted the request and the client gets HTTP 429. It also differs from a circuit breaker trip, which rejects the entire request to protect the node from OOM.

The bulk API is the most common source. It returns HTTP 200 when the request is processed, but sets errors: true with per-item status codes in the response body. Clients that only check HTTP status miss failures. index_failed aggregates hard rejections without error-type breakdown, so correlate it with bulk response items or logs for root cause. Because the metric is reported per data node, aggregate it across the cluster when calculating failure rates.
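A minimal Python sketch of that client-side check, assuming the bulk response body has already been parsed from JSON into a dict. The function name `failed_bulk_items` and the sample response are illustrative and not part of any client library.

```python
def failed_bulk_items(response: dict) -> list:
    """Return one record per rejected item in a parsed _bulk response body."""
    if not response.get("errors"):
        return []  # fast path: no item reported an error
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action: index, create, update, or delete.
        for action, result in item.items():
            error = result.get("error")
            if error:
                failures.append({
                    "action": action,
                    "_index": result.get("_index"),
                    "_id": result.get("_id"),
                    "status": result.get("status"),
                    "type": error.get("type"),
                    "reason": error.get("reason"),
                })
    return failures


# Hypothetical response: the request returned HTTP 200, but the second item was rejected.
sample = {
    "took": 12,
    "errors": True,
    "items": [
        {"index": {"_index": "logs", "_id": "1", "status": 201}},
        {"index": {"_index": "logs", "_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [bytes] of type [long]"}}},
    ],
}

for failure in failed_bulk_items(sample):
    print(failure["_index"], failure["_id"], failure["type"], failure["reason"])
```

Logging the returned records, or routing them to a dead-letter queue, keeps rejected documents visible instead of letting them vanish behind the HTTP 200.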

flowchart TD
    A[index_failed spike or bulk errors] --> B{Cluster block?}
    B -->|Yes| C[Check disk watermarks and index.blocks.*]
    B -->|No| D{Per-item error type}
    D -->|mapper_parsing| E[Check mapping vs incoming document]
    D -->|version_conflict| F[Check concurrency and retry logic]
    D -->|HTTP 429| G[Check write thread pool and circuit breakers]
    C --> H[Free disk or remove block]
    E --> I[Fix schema or pipeline]
    F --> J[Reduce contention or accept baseline]
    G --> K[Backpressure or scale out]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Mapper parsing exception or type conflict | index_failed spikes; bulk items show mapper_parsing_exception or illegal_argument_exception; often follows a mapping change or schema drift | GET /<index>/_mapping against the rejected document’s fields |
| Version conflict under concurrent updates | Bulk items show version_conflict_engine_exception; rate correlates with concurrent updates or rapid retries on the same _id | Application concurrency model and whether updates target the same documents |
| Cluster block (disk flood stage or read-only) | Bulk items show cluster_block_exception; entire indices reject writes while read paths remain functional | GET /_cluster/health and GET /<index>/_settings?filter_path=*.settings.index.blocks.* |

Quick checks

Run these safe, read-only commands to classify the failure.

# Check document-level hard rejections
curl -s 'http://localhost:9200/_nodes/stats/indices?filter_path=nodes.*.indices.indexing.index_failed'

# Compare failed to total indexing operations to get a failure ratio
curl -s 'http://localhost:9200/_nodes/stats/indices?filter_path=nodes.*.indices.indexing.index_total,nodes.*.indices.indexing.index_failed'

# Check cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards'

# Check for index-level read-only blocks
curl -s 'http://localhost:9200/<INDEX>/_settings?filter_path=*.settings.index.blocks.*'

# Check write thread pool rejections (node-level backpressure)
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# Check disk allocation for flood-stage blocks
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.total'

# Check pending cluster tasks for master pressure
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
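
The failure-ratio check above returns one pair of counters per node. A minimal Python sketch of the cluster-wide aggregation, assuming the `_nodes/stats` response has been parsed into a dict; the function name and the sample node IDs and numbers are hypothetical.

```python
def cluster_failure_ratio(node_stats: dict) -> float:
    """Sum index_failed and index_total over all nodes and return failed / total."""
    failed = 0
    total = 0
    for node in node_stats.get("nodes", {}).values():
        indexing = node.get("indices", {}).get("indexing", {})
        failed += indexing.get("index_failed", 0)
        total += indexing.get("index_total", 0)
    return failed / total if total else 0.0


# Hypothetical parsed response with two data nodes.
stats = {
    "nodes": {
        "node-a": {"indices": {"indexing": {"index_total": 100000, "index_failed": 40}}},
        "node-b": {"indices": {"indexing": {"index_total": 50000, "index_failed": 110}}},
    }
}

print(f"cluster failure ratio: {cluster_failure_ratio(stats):.4%}")
```

These counters are cumulative since each node started, so this ratio describes the node lifetime. To see what is happening now, compare two samples taken a known interval apart.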

How to diagnose it

  1. Distinguish document-level failures from node-level rejections. If index_failed is climbing while thread_pool.write.rejected is flat, the problem is document validation or conflicts, not node saturation. Check both metrics via _nodes/stats and confirm the HTTP response code: 429 indicates node-level backpressure, while HTTP 200 with bulk errors: true indicates document-level issues.
  2. Inspect bulk response bodies on the client side. Look for errors: true and iterate the items array. Failed items contain an error object with type and reason. Log the _id, _index, and error.reason of failed items. Do not rely on HTTP 200 alone.
  3. If the error type is mapper_parsing_exception, compare the rejected document against the index mapping. Check for type mismatches (string sent to an integer field), unknown fields under dynamic: strict, or date format mismatches. If dynamic mapping is enabled, verify it did not infer an incompatible type for a new field.
  4. If the error type is version_conflict_engine_exception, measure the rate relative to your indexing volume. A low baseline rate is normal under concurrent updates. A sustained spike suggests excessive contention on the same document IDs.
  5. If the error type is cluster_block_exception, check disk watermarks with _cat/allocation. If a node exceeds the flood stage (95% by default), indices with shards on that node are read-only. Also check for explicit index.blocks.write or index.blocks.read_only settings. If no node is above flood stage but the block persists, investigate whether a maintenance script or security tool applied it explicitly.
  6. Check for mapping explosions or recent mapping changes. A sudden increase in index_failed after a deployment often means a schema change introduced a type conflict. Review the mapping for runaway dynamic field creation, especially fields mapped as text with keyword subfields that were intended to be pure keywords, or numeric fields that received string values.
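
The branching in the steps above can be sketched as a small triage helper. This is an illustration under assumptions, not a definitive classifier: the `triage` function and `NEXT_STEP` table are hypothetical names, and `es_rejected_execution_exception` is assumed to be the per-item error type reported for write thread pool rejections.

```python
from collections import Counter

# Error type -> where to look next, following the diagnosis steps above.
NEXT_STEP = {
    "mapper_parsing_exception": "compare the rejected document with the index mapping",
    "illegal_argument_exception": "compare the rejected document with the index mapping",
    "version_conflict_engine_exception": "measure the conflict rate and review concurrency",
    "cluster_block_exception": "check disk watermarks and index.blocks.* settings",
    "es_rejected_execution_exception": "check the write thread pool and apply backpressure",
}


def triage(http_status: int, error_types: list) -> list:
    """Return (error_type, count, next_step) tuples, most frequent type first."""
    if http_status == 429:
        # The whole request was rejected before any document was validated.
        return [("http_429", 1, "node-level backpressure: check the write thread pool and circuit breakers")]
    counts = Counter(error_types)
    return [
        (error_type, count, NEXT_STEP.get(error_type, "inspect error.reason and the node logs"))
        for error_type, count in counts.most_common()
    ]


# Hypothetical error types collected from the failed items of one bulk response.
observed = [
    "mapper_parsing_exception",
    "version_conflict_engine_exception",
    "mapper_parsing_exception",
]
for error_type, count, step in triage(200, observed):
    print(f"{count:>3}  {error_type}: {step}")
```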

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| indices.indexing.index_failed | Hard document-level rejections | Sudden spike from near-zero, or sustained rate >0.1% of indexing rate |
| Bulk response errors flag | HTTP 200 hides per-item failures | errors: true in any bulk response |
| Write thread pool rejected | Node-level backpressure, not doc-level | Sustained delta >0 over 5 minutes |
| index.blocks.read_only_allow_delete | Flood stage blocks all writes | Block present on actively written indices |
| version_conflict_engine_exception rate | OCC collision indicator | Rate exceeding historical baseline under normal load |
| Pending cluster tasks | Master instability can block writes and allocation | >100 pending tasks or tasks >30 seconds old |
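
Because index_failed and index_total are cumulative counters, a rate such as the 0.1% warning sign has to be computed from the difference between two samples. A minimal Python sketch, assuming each sample is a cluster-wide sum of both counters; the function name, the sample values, and the handling of counter resets are illustrative choices.

```python
from typing import Optional

FAILURE_RATE_THRESHOLD = 0.001  # 0.1%, the warning threshold from the table above


def interval_failure_rate(prev: dict, curr: dict) -> Optional[float]:
    """Failure rate between two samples of {"index_total": int, "index_failed": int}.

    Returns None when the interval is unusable: no indexing activity, or a
    counter went backwards (for example after a node restart).
    """
    delta_total = curr["index_total"] - prev["index_total"]
    delta_failed = curr["index_failed"] - prev["index_failed"]
    if delta_total <= 0 or delta_failed < 0:
        return None
    return delta_failed / delta_total


# Hypothetical samples taken one minute apart.
previous = {"index_total": 1_000_000, "index_failed": 100}
current = {"index_total": 1_050_000, "index_failed": 200}

rate = interval_failure_rate(previous, current)
should_alert = rate is not None and rate > FAILURE_RATE_THRESHOLD
print(rate, should_alert)
```

Requiring the rate to stay above the threshold for several consecutive intervals, rather than alerting on one sample, matches the "sustained" wording of the warning sign.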

Fixes

Mapper parsing and type conflicts

Stop the pipeline from sending bad documents. Update the index mapping explicitly for legitimate new fields, or fix the producer to send the correct type. Under strict mapping, unknown fields require either an explicit mapping addition or removal from the document. If an existing mapped field has the wrong type, reindex into a new index with an explicit mapping. You cannot change the type of an existing mapped field.
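
The reindex path can be sketched as two requests: create the destination index with an explicit mapping, then copy documents with the Reindex API. The Python below only builds the request bodies; the helper name, index names, and field definitions are hypothetical, and sending the requests is left to whatever HTTP client you already use.

```python
import json


def reindex_plan(source_index: str, dest_index: str, properties: dict) -> list:
    """Return (method, path, body) tuples for a mapping-correcting reindex."""
    create_body = {"mappings": {"properties": properties}}
    reindex_body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return [
        ("PUT", f"/{dest_index}", create_body),  # destination with the corrected mapping
        ("POST", "/_reindex", reindex_body),     # copy documents into it
    ]


# Hypothetical example: the bytes field was mapped incorrectly and should be long.
plan = reindex_plan(
    "logs-v1",
    "logs-v2",
    {"bytes": {"type": "long"}, "user": {"type": "keyword"}},
)
for method, path, body in plan:
    print(method, path, json.dumps(body))
```

Documents whose values do not fit the corrected type will fail during the reindex as well, so check the reindex response for failures just as you would a bulk response.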

Version conflicts

Accept a baseline rate of version conflicts under concurrency. Do not chase zero. If the rate is pathological, reduce concurrency on hot document IDs. Architectural fixes, such as single-writer patterns per partition or idempotent writes with version_type=external, outperform client-side retries. If you use the Update API, configure its built-in retry_on_conflict parameter.
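
With external versioning, Elasticsearch accepts a write only when the supplied version is higher than the stored one, so replaying an older write produces a version conflict instead of overwriting newer data. A minimal Python sketch of a bulk body that carries external versions; the function name and the document shape (`id`, `version`, `source`) are assumptions for illustration, with the version taken from some monotonically increasing value in the source system.

```python
import json


def bulk_body_external(index: str, docs: list) -> str:
    """Build an NDJSON _bulk body in which every document carries its own version.

    Each entry in docs is a dict with "id", "version", and "source" keys
    (a hypothetical shape chosen for this sketch).
    """
    lines = []
    for doc in docs:
        action = {
            "index": {
                "_index": index,
                "_id": doc["id"],
                "version": doc["version"],
                "version_type": "external",
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc["source"]))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


# Hypothetical documents; version comes from the upstream system of record.
body = bulk_body_external("orders", [
    {"id": "o-1", "version": 7, "source": {"status": "shipped"}},
    {"id": "o-2", "version": 3, "source": {"status": "pending"}},
])
print(body, end="")
```

Version conflicts returned for these items then mean "a newer version is already indexed", which a client can usually treat as success rather than retry.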

Cluster blocks

If disk flood stage triggered the block, free disk space first: delete old indices, reduce replica counts temporarily, or expand storage. WARNING: Deleting indices is destructive and cannot be undone. Reducing replica counts impairs fault tolerance and may trigger relocations that temporarily increase disk usage.

Recent Elasticsearch versions remove the read_only_allow_delete block automatically once disk usage on the affected node falls back below the high watermark. If the block persists, clear it manually by setting index.blocks.read_only_allow_delete to null with PUT /<index>/_settings. If the block was set explicitly during maintenance or a security incident, remove it only after understanding why it was applied. Removing the block without fixing the underlying disk pressure causes immediate re-application.
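
A minimal Python sketch of the two pieces involved: finding nodes at or above the flood stage, and the settings body that clears the block. It assumes the rows come from `_cat/allocation` requested with `format=json`, where values arrive as strings under the `node` and `disk.percent` columns; the function name and sample rows are hypothetical.

```python
import json

FLOOD_STAGE_PERCENT = 95  # default flood stage; adjust if your cluster overrides it


def nodes_at_flood_stage(rows: list, threshold: int = FLOOD_STAGE_PERCENT) -> list:
    """Return names of nodes whose disk usage is at or above the threshold."""
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # rows without disk figures, such as unassigned shards
        if int(percent) >= threshold:
            flagged.append(row.get("node"))
    return flagged


# Body for PUT /<index>/_settings: null resets the setting and removes the block.
CLEAR_BLOCK_BODY = json.dumps({"index.blocks.read_only_allow_delete": None})

# Hypothetical rows from _cat/allocation?format=json.
rows = [
    {"node": "es-data-1", "disk.percent": "96"},
    {"node": "es-data-2", "disk.percent": "71"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_at_flood_stage(rows))
print(CLEAR_BLOCK_BODY)
```

Send the clearing body only after the flagged list is empty; otherwise the block is re-applied.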

Prevention

  • Validate documents against the expected mapping before sending them to Elasticsearch.
  • Monitor index_failed as a ratio of index_total, not just an absolute count. Alert when failures exceed 0.1% of indexing operations over a sustained interval.
  • Monitor disk watermarks and ILM execution to prevent flood-stage blocks.
  • Client applications must inspect the bulk response items array, not just the HTTP status code.
  • Track version conflict rates as a normal operating metric. Set thresholds based on your concurrency model, not an arbitrary zero target.
  • Cap field count with index.mapping.total_fields.limit to prevent mapping explosions from dynamic mapping.
  • Audit field cardinality regularly if you rely on dynamic mapping; unexpected high-cardinality fields used as object keys can explode the mapping.
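
Pre-send validation, the first item in the list above, can be sketched as a type check against the mapping's properties. This is a deliberately simplified Python illustration: it covers only a handful of field types, ignores nested objects and date formats, and is stricter than Elasticsearch itself, which coerces some values (such as numeric strings) by default. The `validate` function and `PYTHON_TYPES` table are hypothetical names.

```python
# Elasticsearch field type -> acceptable Python types (partial, for illustration).
PYTHON_TYPES = {
    "keyword": (str,),
    "text": (str,),
    "integer": (int,),
    "long": (int,),
    "float": (int, float),
    "double": (int, float),
    "boolean": (bool,),
}


def validate(doc: dict, properties: dict, strict: bool = True) -> list:
    """Return a list of human-readable problems; an empty list means the doc passed."""
    problems = []
    for field, value in doc.items():
        spec = properties.get(field)
        if spec is None:
            if strict:  # mirrors dynamic: strict, where unknown fields are rejected
                problems.append(f"{field}: not in mapping")
            continue
        expected = PYTHON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # field type not covered by this sketch
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item is None:
                continue
            # bool is a subclass of int in Python, so check it explicitly.
            wrong_bool = isinstance(item, bool) and bool not in expected
            if wrong_bool or not isinstance(item, expected):
                problems.append(
                    f"{field}: expected {spec['type']}, got {type(item).__name__}"
                )
    return problems


# Hypothetical mapping properties and an incoming document.
properties = {"user": {"type": "keyword"}, "bytes": {"type": "long"}}
document = {"user": "alice", "bytes": "12kb", "extra": 1}
print(validate(document, properties))
```

Documents that fail can be routed to a dead-letter queue with the problem list attached, which keeps schema drift visible before it shows up as index_failed.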

How Netdata helps

  • Correlate index_failed spikes with disk watermark breaches, JVM heap pressure, and write thread pool rejections on the same timeline.
  • Alert when the indexing failure rate deviates from baseline, distinguishing document-level errors from node-level saturation.
  • Surface cluster health transitions and per-node allocation pressure so you catch flood-stage blocks before they stop writes.
  • Historical metrics for indexing.index_total and indexing.index_time_in_millis help determine whether a failure burst correlates with a traffic surge or a schema change.