Elasticsearch ingest pipeline bottleneck: grok, enrich, and per-processor time
Logstash or Beats instances log HTTP 429 errors. Elasticsearch indexing rate drops while upstream volume is flat. The write thread pool queue grows, but cluster health is green and there is no disk pressure or heap pressure. The culprit is often a single slow processor inside an ingest pipeline. A grok pattern backtracking on malformed logs, an enrich processor doing synchronous lookups, or a heavy Painless script throttles every document before it reaches Lucene. Because ingest processing runs on the write path, the backlog overflows the write thread pool queue and returns bulk rejections to upstream clients.
What this means
Each document routed through a pipeline passes through its processors serially on the bulk thread. If one processor averages even a few milliseconds per document, that cost is paid for every document in every bulk request, and throughput collapses at volume. The write thread pool queue absorbs transient skew, but once it saturates, Elasticsearch returns EsRejectedExecutionException and HTTP 429. Upstream retries compound the load. The failure is localized to one processor, but the impact cascades cluster-wide for that indexing workload.
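As a rough illustration of why a few milliseconds matter, the sketch below estimates an upper bound on pipeline throughput for one node. It assumes every document pays the same average pipeline cost and that a fixed number of write threads work in parallel; the thread count of 8 and the per-document timings are hypothetical values, not measurements.

```python
def max_docs_per_second(avg_ms_per_doc, write_threads):
    """Upper bound on documents per second for one node.

    Assumes each document spends avg_ms_per_doc inside the pipeline and
    that write_threads threads process documents in parallel.
    """
    if avg_ms_per_doc <= 0:
        raise ValueError("avg_ms_per_doc must be positive")
    return write_threads * 1000.0 / avg_ms_per_doc


# Hypothetical node with 8 write threads.
healthy = max_docs_per_second(0.5, write_threads=8)
degraded = max_docs_per_second(5.0, write_threads=8)
print(f"healthy ceiling:  {healthy:.0f} docs/s")
print(f"degraded ceiling: {degraded:.0f} docs/s")
```

A tenfold increase in per-document time cuts the ceiling tenfold, regardless of how healthy the rest of the cluster is.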
flowchart TD
A[Ingest pipeline processor] -->|Slow per-document execution| B[Bulk thread blocked]
B --> C[Write thread pool queue grows]
C --> D[EsRejectedExecutionException]
D --> E[HTTP 429 to upstream]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Grok backtracking | CPU spikes on ingest nodes; one pipeline dominates timing | Per-processor time_in_millis and count in _nodes/stats/ingest |
| Slow enrich lookup | Enrich processor shows high time; may pair with search latency | Pipeline stats for the enrich processor and its enrich index health |
| Expensive script processor | Disproportionate CPU on ingest nodes; script processor time high | Per-processor breakdown for script execution |
| Oversized documents | Broad pipeline slowdown without a single hot processor | Document size and mapping complexity from upstream |
Quick checks
Run these read-only requests to confirm the bottleneck is on the ingest and write path.
# Per-processor timing and failure counts. Look for one processor with high time_in_millis relative to count.
curl -s 'http://localhost:9200/_nodes/stats/ingest?filter_path=nodes.*.ingest'
# Write thread pool queue depth and cumulative rejections. Rejected is monotonic; sample twice and diff.
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
# Indexing rate and total time. Derive latency by diffing index_time_in_millis against index_total over an interval.
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'
# Node CPU and roles. Ingest-role nodes running hot while data-only nodes are calm points to pipeline CPU burn.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,node.role'
# Cluster health to exclude unassigned shards or master instability.
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards'
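Because the rejected counter is cumulative, a single reading says little. The following sketch parses two captures of the `_cat/thread_pool/write` output shown above and reports the per-node difference. The sample text, node names, and numbers are hypothetical, and the parser assumes node names contain no spaces.

```python
def parse_write_pool(cat_output):
    """Parse `_cat/thread_pool/write?v&h=node_name,active,queue,rejected` text."""
    rows = [line.split() for line in cat_output.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    result = {}
    for row in body:
        record = dict(zip(header, row))
        node = record.pop("node_name")
        result[node] = {column: int(value) for column, value in record.items()}
    return result


def rejection_deltas(first, second):
    """Rejections added between two samples, per node present in both."""
    return {
        node: second[node]["rejected"] - first[node]["rejected"]
        for node in second
        if node in first
    }


# Hypothetical captures taken some seconds apart.
SAMPLE_T0 = """
node_name active queue rejected
ingest-1       8   950    12040
data-1         2     0        0
"""
SAMPLE_T1 = """
node_name active queue rejected
ingest-1       8  1000    12310
data-1         1     0        0
"""

deltas = rejection_deltas(parse_write_pool(SAMPLE_T0), parse_write_pool(SAMPLE_T1))
print(deltas)
```

A nonzero delta on the same node across several intervals is the signal to act on; a large absolute value with a zero delta only reflects history.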
How to diagnose it
- Confirm the write path is backing up. Check the write thread pool queue and rejected counters. Sustained queue growth with nonzero rejection deltas means the node cannot drain bulk requests. Note that rejected is cumulative; calculate the delta between two samples.
- Pull pipeline stats with GET /_nodes/stats/ingest. Drill into nodes.{id}.ingest.pipelines.{name}.processors. Each element shows type, stats.count, and stats.time_in_millis.
- Compare per-processor time to count. Divide time_in_millis by count to get average milliseconds per document. The processor with the highest average is the bottleneck. For example, a processor averaging 5 ms per document adds 5 seconds to a 1000-document bulk request.
- Check node CPU asymmetry. If nodes with the ingest role show significantly higher cpu than data-only nodes, the pipeline is consuming CPU on the ingest tier.
- Correlate indexing latency with pipeline changes. If indexing time spiked immediately after a pipeline PUT, the new or updated processor is the trigger.
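The per-processor comparison above can be sketched as a small function over the parsed `_nodes/stats/ingest` response. It assumes each entry in the processors array is an object keyed by the processor type or tag, with a nested stats object; verify that shape against your version's output. The sample payload, node ID, and pipeline name are hypothetical.

```python
def processor_averages(stats):
    """Return (avg_ms, node_id, pipeline, processor, count) rows, slowest first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for pipeline_name, pipeline in pipelines.items():
            for entry in pipeline.get("processors", []):
                # Assumed shape: {"<type or tag>": {"type": ..., "stats": {...}}}
                for key, body in entry.items():
                    proc_stats = body.get("stats", {})
                    count = proc_stats.get("count", 0)
                    if count == 0:
                        continue  # avoid dividing by zero for idle processors
                    avg_ms = proc_stats.get("time_in_millis", 0) / count
                    rows.append((avg_ms, node_id, pipeline_name, key, count))
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical response fragment.
SAMPLE_STATS = {
    "nodes": {
        "node-a": {
            "ingest": {
                "pipelines": {
                    "weblogs": {
                        "processors": [
                            {"grok": {"type": "grok", "stats": {"count": 1000, "time_in_millis": 5000}}},
                            {"set": {"type": "set", "stats": {"count": 1000, "time_in_millis": 20}}},
                        ]
                    }
                }
            }
        }
    }
}

ranked = processor_averages(SAMPLE_STATS)
for avg_ms, node_id, pipeline, processor, count in ranked:
    print(f"{node_id} {pipeline} {processor}: {avg_ms:.3f} ms/doc over {count} docs")
```

These counters accumulate over time, so diffing two samples taken a few minutes apart gives a better picture of current behavior than a single lifetime average.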
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Per-processor time vs count | Isolates the exact slow step | One processor dominates total pipeline time, averaging multiple milliseconds per document while others average microseconds |
| Write thread pool queue | Precursor to rejection; measures backpressure | Sustained growth over several sampling intervals without draining |
| Write thread pool rejections | Confirms Elasticsearch is pushing back upstream | Nonzero delta over multiple intervals |
| Indexing rate | Validates throughput collapse | Drop >30% from baseline while upstream volume is flat |
| Node CPU by role | Identifies ingest-tier saturation | Ingest nodes >80% sustained while data nodes are calm |
| Indexing latency | Includes pipeline execution overhead | Sustained >2x baseline without disk or heap pressure |
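As one possible way to turn the table's rate and latency thresholds into a check, the sketch below diffs two samples of a node's indices.indexing stats (index_total and index_time_in_millis) and compares the result to a baseline. The 30% and 2x thresholds come from the table; the sample numbers and the baseline are hypothetical.

```python
def interval_metrics(first, second, seconds):
    """Indexing rate and average time per document between two samples."""
    docs = second["index_total"] - first["index_total"]
    millis = second["index_time_in_millis"] - first["index_time_in_millis"]
    return {
        "docs_per_second": docs / seconds,
        "avg_index_ms": millis / docs if docs else 0.0,
    }


def warning_flags(current, baseline):
    """Apply the table's thresholds: rate drop >30%, latency >2x baseline."""
    flags = []
    if current["docs_per_second"] < 0.7 * baseline["docs_per_second"]:
        flags.append("indexing rate dropped more than 30% from baseline")
    if current["avg_index_ms"] > 2 * baseline["avg_index_ms"]:
        flags.append("indexing latency above 2x baseline")
    return flags


# Hypothetical samples taken 60 seconds apart, plus a hypothetical baseline.
t0 = {"index_total": 1_000_000, "index_time_in_millis": 400_000}
t1 = {"index_total": 1_060_000, "index_time_in_millis": 460_000}
baseline = {"docs_per_second": 2000.0, "avg_index_ms": 0.4}

current = interval_metrics(t0, t1, seconds=60)
print(current)
print(warning_flags(current, baseline))
```

The check only makes sense when upstream volume is known to be flat; a rate drop caused by quieter producers would trip the same flag.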
Fixes
Replace grok with dissect
If the log format is predictable, dissect splits on delimiters instead of evaluating regex. This removes backtracking risk and lowers per-document CPU. Updating a pipeline is non-destructive, but parsing failures cause documents to be rejected unless on_failure is configured. Validate with representative samples before deploying.
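A minimal sketch of such a swap follows, building a request body for the simulate pipeline API (POST /_ingest/pipeline/_simulate) so the dissect version can be checked against samples before deployment. The log layout, field names, and sample line are hypothetical; the on_failure handler records the error message on the document.

```python
import json

# Hypothetical space-delimited format: client IP, method, path, status.
GROK_PROCESSOR = {
    "grok": {
        "field": "message",
        "patterns": ["%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:path} %{NUMBER:status}"],
    }
}

# Same fields extracted by position, with no regex evaluation.
DISSECT_PROCESSOR = {
    "dissect": {
        "field": "message",
        "pattern": "%{client_ip} %{method} %{path} %{status}",
    }
}


def simulate_body(processor, samples):
    """Build a _simulate request body for one processor and sample messages."""
    return {
        "pipeline": {
            "processors": [processor],
            "on_failure": [
                {"set": {"field": "error.message", "value": "{{ _ingest.on_failure_message }}"}}
            ],
        },
        "docs": [{"_source": {"message": sample}} for sample in samples],
    }


body = simulate_body(DISSECT_PROCESSOR, ["203.0.113.7 GET /index.html 200"])
print(json.dumps(body, indent=2))
```

Running the same samples through both processors and comparing the resulting documents is a cheap way to confirm the dissect pattern covers the malformed lines that triggered backtracking. Note that dissect extracts values as strings, so numeric fields may need a separate conversion step.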
Reduce enrich lookup cost
The enrich processor executes a synchronous search against its enrich index for every document. If that system index is recovering or the lookup field is not optimized, each search blocks the bulk thread. Tradeoff: move enrichment into Logstash or an external stream processor if lookup volume exceeds what the cluster can serve without impacting indexing.
Simplify or remove inline scripts
Painless scripts run synchronously on the bulk thread. String concatenation, loops, or heavy date math directly throttle throughput. Tradeoff: pre-compute values upstream, or use an ingest processor that handles the same logic natively.
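To illustrate the native-processor option, the sketch below replaces a script processor that lowercases a field with the built-in lowercase processor. The field name level and the pipeline contents are hypothetical; the helper only rewrites the pipeline definition and does not contact a cluster.

```python
import copy

# Hypothetical script doing work that a native processor already covers.
SCRIPT_VERSION = {
    "script": {
        "lang": "painless",
        "source": "ctx['level'] = ctx['level'].toLowerCase();",
    }
}

# Native equivalent for the same field.
NATIVE_VERSION = {"lowercase": {"field": "level", "ignore_missing": True}}


def replace_processor(pipeline, old, new):
    """Return a copy of the pipeline with every occurrence of old replaced by new."""
    updated = copy.deepcopy(pipeline)
    updated["processors"] = [
        copy.deepcopy(new) if processor == old else processor
        for processor in updated["processors"]
    ]
    return updated


original = {"description": "hypothetical app logs", "processors": [SCRIPT_VERSION]}
rewritten = replace_processor(original, SCRIPT_VERSION, NATIVE_VERSION)
print(rewritten)
```

After a swap like this, the per-processor stats for the new processor should be compared against the old script's average to confirm the change actually paid off.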
Scale ingest capacity
Add nodes with the ingest role to isolate pipeline execution from data and master responsibilities. Alternatively, add the ingest role to existing nodes. Warning: changing node.roles requires updating elasticsearch.yml and performing a rolling restart. This is disruptive; ensure the cluster can tolerate the restarts. More ingest nodes add cluster membership overhead; verify network and heap are sized accordingly.
Prevention
- Benchmark every pipeline change against production-like document volume before deploying.
- Monitor per-processor average time after each update; alert if any processor exceeds your throughput budget.
- Prefer dissect over grok for fixed-format logs.
- Keep inline scripts stateless and minimal. Offload complex transformations to the data producer or a stream processor.
How Netdata helps
- Charts Elasticsearch write thread pool queue depth alongside indexing rate to surface backpressure before rejections spike.
- Tracks per-node CPU utilization by role, making ingest-node saturation visible without manual _cat/nodes queries.
- Surfaces JVM heap and GC pause metrics next to indexing latency, distinguishing pipeline slowdown from heap pressure.
- Alerts on thread pool rejection deltas so you catch the cascade before upstream buffers overflow.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery







