Elasticsearch ingest pipeline bottleneck: grok, enrich, and per-processor time

Logstash or Beats instances log HTTP 429 errors. Elasticsearch indexing rate drops while upstream volume is flat. The write thread pool queue grows, but cluster health is green and there is no disk pressure or heap pressure. The culprit is often a single slow processor inside an ingest pipeline. A grok pattern backtracking on malformed logs, an enrich processor doing synchronous lookups, or a heavy Painless script throttles every document before it reaches Lucene. Because ingest processing runs on the write path, the backlog overflows the write thread pool queue and returns bulk rejections to upstream clients.

What this means

Documents targeting a pipeline execute each processor serially on a thread from the write thread pool (the bulk thread). If one processor averages even a few milliseconds per document, throughput collapses at volume. The write thread pool queue absorbs transient skew, but once it saturates, Elasticsearch rejects new bulk work with EsRejectedExecutionException, which clients see as HTTP 429. Upstream retries compound the load. The failure is localized to one processor, but the impact cascades cluster-wide for that indexing workload.

flowchart TD
    A[Ingest pipeline processor] -->|Slow per-document execution| B[Bulk thread blocked]
    B --> C[Write thread pool queue grows]
    C --> D[EsRejectedExecutionException]
    D --> E[HTTP 429 to upstream]
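
To make the "throughput collapses at volume" claim concrete, here is a rough back-of-the-envelope sketch. It assumes pipeline execution is the only cost and that every write thread is busy running processors; the function name and the thread count of 8 are illustrative assumptions, not values taken from any cluster.

```python
def max_docs_per_second(avg_ms_per_doc: float, write_threads: int) -> float:
    """Rough per-node ceiling on pipeline throughput.

    Assumes each write thread processes documents serially and that
    pipeline time is the only cost (Lucene indexing time is ignored).
    """
    if avg_ms_per_doc <= 0:
        raise ValueError("avg_ms_per_doc must be positive")
    return write_threads * 1000.0 / avg_ms_per_doc


# Hypothetical node with 8 write threads (an assumption for illustration).
for ms in (0.05, 1.0, 5.0):
    ceiling = max_docs_per_second(ms, write_threads=8)
    print(f"{ms} ms/doc -> about {ceiling:,.0f} docs/s ceiling")
```

The real ceiling is lower, because the same threads also perform the indexing work itself.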

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Grok backtracking | CPU spikes on ingest nodes; one pipeline dominates timing | Per-processor time_in_millis and count in _nodes/stats/ingest |
| Slow enrich lookup | Enrich processor shows high time; may pair with search latency | Pipeline stats for the enrich processor and its enrich index health |
| Expensive script processor | Disproportionate CPU on ingest nodes; script processor time high | Per-processor breakdown for script execution |
| Oversized documents | Broad pipeline slowdown without a single hot processor | Document size and mapping complexity from upstream |

Quick checks

Run these read-only requests to confirm the bottleneck is on the ingest and write path.

# Per-processor timing and failure counts. Look for one processor with high time_in_millis relative to count.
curl -s 'http://localhost:9200/_nodes/stats/ingest?filter_path=nodes.*.ingest'

# Write thread pool queue depth and cumulative rejections. Rejected is monotonic; sample twice and diff.
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# Indexing rate and total time. Derive latency by diffing index_time_in_millis against index_total over an interval.
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'

# Node CPU and roles. Ingest-role nodes running hot while data-only nodes are calm points to pipeline CPU burn.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,node.role'

# Cluster health to exclude unassigned shards or master instability.
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards'
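
Because rejected is cumulative, a single reading says little. The sketch below diffs two samples of the write thread pool check, assuming you append format=json to that _cat request so the output is a JSON array. The function name and the sample values are hypothetical.

```python
import json


def rejection_deltas(before_json: str, after_json: str) -> dict[str, int]:
    """Return per-node growth of the cumulative 'rejected' counter."""
    before = {row["node_name"]: int(row["rejected"]) for row in json.loads(before_json)}
    deltas = {}
    for row in json.loads(after_json):
        name = row["node_name"]
        if name in before:
            # A node restart resets the counter; clamp so a reset is not reported as negative.
            deltas[name] = max(0, int(row["rejected"]) - before[name])
    return deltas


# Hypothetical samples taken one minute apart (_cat JSON values are strings).
sample_1 = '[{"node_name": "ingest-1", "active": "8", "queue": "950", "rejected": "120"}]'
sample_2 = '[{"node_name": "ingest-1", "active": "8", "queue": "1800", "rejected": "175"}]'
print(rejection_deltas(sample_1, sample_2))
```

A nonzero delta across several consecutive intervals is the signal to act on; the absolute counter value is not.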

How to diagnose it

  1. Confirm the write path is backing up. Check the write thread pool queue and rejected counters. Sustained queue growth with nonzero rejection deltas means the node cannot drain bulk requests. Note that rejected is cumulative; calculate the delta between two samples.
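  2. Pull pipeline stats with GET /_nodes/stats/ingest. Drill into nodes.{id}.ingest.pipelines.{name}.processors. This is an array in pipeline order; each element is a single-key object named after the processor type (or type and tag) whose value holds type, stats.count, and stats.time_in_millis.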
  3. Compare per-processor time to count. Divide time_in_millis by count to get average milliseconds per document. The processor with the highest average is the bottleneck. For example, a processor averaging 5 ms per document adds 5 seconds to a 1000-document bulk request.
  4. Check node CPU asymmetry. If nodes with the ingest role show significantly higher cpu than data-only nodes, the pipeline is consuming CPU on the ingest tier.
  5. Correlate indexing latency with pipeline changes. If indexing time spiked immediately after a pipeline PUT, the new or updated processor is the trigger.
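
Steps 2 and 3 can be sketched as a small script that ranks processors by average milliseconds per document. It assumes the response of GET /_nodes/stats/ingest has been loaded into a dict; the function name and the sample numbers are hypothetical.

```python
def slowest_processors(stats: dict, top: int = 5) -> list[tuple[float, str, str, str]]:
    """Rank processors by average ms per document.

    Returns (avg_ms, node_id, pipeline_name, processor_label) tuples, slowest first.
    """
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for pipeline_name, pipeline in pipelines.items():
            for entry in pipeline.get("processors", []):
                # Each entry is a single-key object: {"<label>": {"type": ..., "stats": {...}}}
                for label, body in entry.items():
                    proc_stats = body.get("stats", {})
                    count = proc_stats.get("count", 0)
                    if count:  # skip processors that have not run yet
                        avg_ms = proc_stats.get("time_in_millis", 0) / count
                        rows.append((avg_ms, node_id, pipeline_name, label))
    return sorted(rows, reverse=True)[:top]


# Hypothetical response fragment for illustration only.
sample = {"nodes": {"abc123": {"ingest": {"pipelines": {"web-logs": {"processors": [
    {"grok": {"type": "grok", "stats": {"count": 200000, "time_in_millis": 900000, "current": 0, "failed": 12}}},
    {"date": {"type": "date", "stats": {"count": 200000, "time_in_millis": 4000, "current": 0, "failed": 0}}},
]}}}}}}

for avg_ms, node_id, pipeline_name, label in slowest_processors(sample):
    print(f"{avg_ms:8.3f} ms/doc  node={node_id}  pipeline={pipeline_name}  processor={label}")
```

These counters are cumulative since node start, so for a recent regression it is more telling to run the same calculation on the difference between two snapshots.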

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Per-processor time vs count | Isolates the exact slow step | One processor dominates total pipeline time, averaging multiple milliseconds per document while others average microseconds |
| Write thread pool queue | Precursor to rejection; measures backpressure | Sustained growth over several sampling intervals without draining |
| Write thread pool rejections | Confirms Elasticsearch is pushing back upstream | Nonzero delta over multiple intervals |
| Indexing rate | Validates throughput collapse | Drop >30% from baseline while upstream volume is flat |
| Node CPU by role | Identifies ingest-tier saturation | Ingest nodes >80% sustained while data nodes are calm |
| Indexing latency | Includes pipeline execution overhead | Sustained >2x baseline without disk or heap pressure |
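
As an illustration of the indexing rate and latency signals, the sketch below derives both from two samples of the indices.indexing stats and compares them with the thresholds in the table. The baselines, the sample counters, and the function names are assumptions; substitute values from your own monitoring history.

```python
def indexing_interval(before: dict, after: dict, seconds: float) -> tuple[float, float]:
    """Return (docs per second, average ms per indexed doc) between two samples."""
    docs = after["index_total"] - before["index_total"]
    millis = after["index_time_in_millis"] - before["index_time_in_millis"]
    rate = docs / seconds
    latency_ms = millis / docs if docs > 0 else 0.0
    return rate, latency_ms


def warning_signs(rate: float, latency_ms: float,
                  baseline_rate: float, baseline_latency_ms: float) -> list[str]:
    """Apply the >30% rate drop and >2x latency thresholds from the table."""
    signs = []
    if rate < 0.7 * baseline_rate:
        signs.append("indexing rate dropped more than 30% from baseline")
    if latency_ms > 2 * baseline_latency_ms:
        signs.append("indexing latency above 2x baseline")
    return signs


# Hypothetical counters sampled 60 seconds apart on one node.
before = {"index_total": 1_000_000, "index_time_in_millis": 400_000}
after = {"index_total": 1_060_000, "index_time_in_millis": 520_000}

rate, latency_ms = indexing_interval(before, after, seconds=60.0)
print(rate, latency_ms)
print(warning_signs(rate, latency_ms, baseline_rate=2000.0, baseline_latency_ms=0.8))
```

A single interval can be noisy; the table's warning signs refer to sustained deviations, so evaluate several consecutive intervals before alerting.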

Fixes

Replace grok with dissect

If the log format is predictable, dissect splits on delimiters instead of evaluating regex. This removes backtracking risk and lowers per-document CPU. Updating a pipeline is non-destructive, but a document that fails to parse is rejected unless on_failure is configured on the processor or the pipeline. Validate with representative samples, including malformed lines, before deploying.
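
One way to validate a dissect pattern before deploying is the simulate API, which runs a pipeline definition against sample documents without indexing them. The sketch below only builds and prints a request body for POST /_ingest/pipeline/_simulate. The field names, the access-log pattern, and the sample lines are assumptions for illustration; adapt them to your own log format.

```python
import json

# Request body for POST /_ingest/pipeline/_simulate (nothing is indexed by a simulate call).
simulate_body = {
    "pipeline": {
        "description": "Hypothetical dissect replacement for an access-log grok pattern",
        "processors": [
            {
                "dissect": {
                    "field": "message",
                    "pattern": '%{clientip} %{ident} %{auth} [%{timestamp}] '
                               '"%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}',
                    # Record the failure on the document instead of rejecting it.
                    "on_failure": [
                        {
                            "set": {
                                "field": "error.message",
                                "value": "dissect failed: {{ _ingest.on_failure_message }}",
                            }
                        }
                    ],
                }
            }
        ],
    },
    "docs": [
        # One well-formed line and one malformed line to exercise on_failure.
        {"_source": {"message": '1.2.3.4 - - [30/Apr/1998:22:00:52 +0000] "GET /index.html HTTP/1.0" 200 3171'}},
        {"_source": {"message": "truncated line"}},
    ],
}

print(json.dumps(simulate_body, indent=2))
```

Send the printed body with curl or your usual client and check that the well-formed document gains the expected fields while the malformed one carries error.message.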

Reduce enrich lookup cost

The enrich processor executes a synchronous search against its enrich index for every document. If that system index is recovering or the lookup field is not optimized, each search blocks the bulk thread. If lookup volume exceeds what the cluster can serve without impacting indexing, move enrichment into Logstash or an external stream processor. The tradeoff is that enrichment logic then lives outside the cluster and must be operated separately.

Simplify or remove inline scripts

Painless scripts run synchronously on the bulk thread. String concatenation, loops, or heavy date math directly throttle throughput. Where possible, pre-compute values upstream, or use an ingest processor that handles the same logic natively. The tradeoff is that transformation logic moves out of the pipeline and into the producer.

Scale ingest capacity

Add nodes with the ingest role to isolate pipeline execution from data and master responsibilities. Alternatively, add the ingest role to existing nodes. Warning: changing node.roles requires updating elasticsearch.yml and performing a rolling restart. This is disruptive; ensure the cluster can tolerate the restarts. More ingest nodes add cluster membership overhead; verify network and heap are sized accordingly.

Prevention

  • Benchmark every pipeline change against production-like document volume before deploying.
  • Monitor per-processor average time after each update; alert if any processor exceeds your throughput budget.
  • Prefer dissect over grok for fixed-format logs.
  • Keep inline scripts stateless and minimal. Offload complex transformations to the data producer or a stream processor.

How Netdata helps

  • Charts Elasticsearch write thread pool queue depth alongside indexing rate to surface backpressure before rejections spike.
  • Tracks per-node CPU utilization by role, making ingest-node saturation visible without manual _cat/nodes queries.
  • Surfaces JVM heap and GC pause metrics next to indexing latency, distinguishing pipeline slowdown from heap pressure.
  • Alerts on thread pool rejection deltas so you catch the cascade before upstream buffers overflow.
