Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429

HTTP 429 responses from Elasticsearch, or stack traces containing EsRejectedExecutionException, usually mean the write thread pool queue is full. The write thread pool (named bulk before Elasticsearch 6.3) executes indexing, bulk, update, and delete operations on each data node using a bounded queue. When all threads are busy and the queue fills, the node rejects the operation. This guide covers how to confirm the diagnosis, distinguish it from other rejection paths, and fix the root cause instead of masking the symptom.

What this means

Elasticsearch routes every write to the primary shard. On the target data node, the write thread pool executes the operation. This pool has a fixed number of threads, usually sized to the node’s allocated processors, and a bounded queue. By default, the write queue holds up to 10000 operations in Elasticsearch 7.9 and later (200 in earlier releases). When all threads are busy and the queue is full, the node returns EsRejectedExecutionException to the client with HTTP 429.

The rejected counter from /_cat/thread_pool and /_nodes/stats/thread_pool is cumulative and resets only on node restart. Track the delta over time, not the absolute value. A brief burst of rejections during a traffic spike may self-resolve if the client backs off; sustained rejections mean the cluster cannot keep up with ingest demand.
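Because the counter is cumulative, the useful number is the change between two samples. The following is a minimal sketch of that calculation, assuming each snapshot has already been parsed into a mapping of node name to the rejected value; the function name and the snapshot shape are illustrative, not part of any Elasticsearch client.

```python
from typing import Dict


def rejected_deltas(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return the per-node increase in the cumulative `rejected` counter.

    Nodes without a baseline in `previous` are skipped. If a counter went
    backwards, the node is assumed to have restarted (the counter resets
    on restart), so the current value is treated as the delta.
    """
    deltas: Dict[str, int] = {}
    for node, now in current.items():
        if node not in previous:
            continue  # newly seen node: no baseline to compare against
        before = previous[node]
        deltas[node] = now - before if now >= before else now
    return deltas
```

Sampling at a fixed interval (for example once a minute) and feeding consecutive snapshots into this function yields a per-node rejection rate that is safe to alert on.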

HTTP 429 can also come from circuit breakers or from indexing pressure backpressure. Thread pool rejections name the executor and its queue capacity in the error body; circuit breaker trips report a circuit_breaking_exception with memory limits; indexing pressure rejections report in-flight byte counts. Check the response details to avoid treating a memory problem as a thread capacity problem.
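One way to apply this distinction in a client is to inspect the error object attached to a 429. The sketch below is illustrative only: the function name is hypothetical, and the substrings it matches ("queue capacity", "coordinating_and_primary_bytes", "replica_bytes") are assumptions based on typical error messages that may differ between Elasticsearch versions.

```python
def classify_429(error: dict) -> str:
    """Classify the `error` object from a 429 response or bulk item.

    Returns 'circuit_breaker', 'thread_pool', 'indexing_pressure',
    or 'unknown'. The matched substrings are assumptions about the
    message format, not a documented contract.
    """
    err_type = error.get("type", "")
    reason = error.get("reason", "")
    if err_type == "circuit_breaking_exception":
        return "circuit_breaker"
    if err_type == "es_rejected_execution_exception":
        if "queue capacity" in reason:
            return "thread_pool"  # message names the executor and its queue
        if "coordinating_and_primary_bytes" in reason or "replica_bytes" in reason:
            return "indexing_pressure"  # message reports in-flight bytes
    return "unknown"
```

Logging the classification next to each rejection makes it easier to see which of the three paths is responsible before changing any settings.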

flowchart LR
    A[Client bulk/index request] --> B[Coordinating node]
    B --> C[Primary shard node]
    C --> D[Write thread pool queue]
    D -->|Capacity available| E[Worker thread executes write]
    D -->|Queue full| F[EsRejectedExecutionException]
    F --> G[Return HTTP 429 to client]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Hot-spotting or undersized cluster | Rejections concentrated on one or two nodes while others are idle | /_cat/thread_pool/write sorted by rejected; /_cat/nodes for asymmetric CPU or disk |
| Merge storm or GC pressure reducing throughput | Rejections rising alongside old GC pauses, high segment counts, or elevated merge concurrency | /_nodes/stats/indices/segments for segment count; /_nodes/stats/jvm for GC |
| Sudden burst traffic exceeding capacity | Spike in indexing rate with queue depth jumping across all nodes | /_nodes/stats/indices/indexing rate delta; application traffic metrics |
| Slow storage or I/O saturation | High indexing latency, elevated I/O wait, queues growing despite moderate CPU | /_nodes/stats/indices/indexing latency; OS iostat |
| Oversized bulk requests or too many concurrent streams | Rejections at low overall indexing rate with very large bulk payloads | Client bulk batch size and concurrency settings |

Quick checks

# Check write thread pool queue depth and cumulative rejections per node, highest rejections first
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected&s=rejected:desc'
# Check indexing rate and cumulative time per node
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'
# Check heap, CPU, and load for capacity signals
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'
# Check current merges for I/O pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,merges.current'
# Check segment counts per node
curl -s 'http://localhost:9200/_nodes/stats/indices/segments?filter_path=nodes.*.indices.segments.count'
# Check indexing pressure to distinguish memory backpressure from thread pool saturation (ES 7.9+)
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?filter_path=nodes.*.indexing_pressure'

How to diagnose it

  1. Confirm write pool rejection. Run /_cat/thread_pool/write and sort by rejected. If the delta is rising on one or more nodes, you have confirmed thread pool saturation. If search or get pools show rejections instead, the issue is read-side.
  2. Determine scope. Are rejections cluster-wide or isolated? Cluster-wide points to total ingest exceeding cluster capacity. Isolated points to hot-spotted shards or a single slow node.
  3. Correlate with indexing rate. Compute delta(index_total) / interval. If the rate jumped before rejections began, the cluster is undersized for the peak. If the rate is flat, the cluster has lost capacity elsewhere.
  4. Check for GC and heap pressure. Run /_nodes/stats/jvm and compare old GC time. Long stop-the-world pauses freeze threads and effectively reduce write throughput without reducing queue size.
  5. Inspect merges and segments. High segment count with merges.current at the scheduler limit means background merges are falling behind or consuming all I/O, leaving fewer cycles for writes.
  6. Check indexing pressure (ES 7.9+). If indexing_pressure.memory.total.coordinating_rejections or primary_rejections are rising while thread pool rejections are flat, the bottleneck is in-flight indexing memory, not thread availability.
  7. Review client behavior. Large bulk batches or high client concurrency can overwhelm even a healthy cluster. A common pattern is Logstash or a custom client sending batches that are too large or not backing off on 429.
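Steps 2 and 3 reduce to two small calculations, sketched here under stated assumptions: index_total values come from two samples of /_nodes/stats/indices/indexing, per-node rejection deltas have already been computed, and the 0.5 cut-off for "cluster-wide" is an arbitrary illustrative threshold rather than a documented rule.

```python
from typing import Dict


def indexing_rate(index_total_before: int, index_total_after: int,
                  interval_seconds: float) -> float:
    """Documents indexed per second: delta(index_total) / interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (index_total_after - index_total_before) / interval_seconds


def rejection_scope(deltas: Dict[str, int],
                    cluster_wide_fraction: float = 0.5) -> str:
    """Label rejections as 'none', 'isolated', or 'cluster-wide'.

    `deltas` maps node name to the rejection delta over the sample
    interval. The fraction threshold is a hypothetical default.
    """
    rejecting = [node for node, delta in deltas.items() if delta > 0]
    if not rejecting:
        return "none"
    if len(rejecting) / len(deltas) >= cluster_wide_fraction:
        return "cluster-wide"
    return "isolated"
```

An "isolated" result alongside a flat indexing rate points toward hot-spotting or a single slow node; "cluster-wide" with a rising rate points toward insufficient total capacity.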

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Write thread pool rejected delta | Direct measure of capacity overload | > 0 / min sustained for > 5 minutes |
| Write thread pool queue depth | Leading indicator before rejections occur | Persistently > 1000 or growing toward the queue limit |
| Indexing rate vs baseline | Tells you if traffic grew or capacity shrank | Drop > 30% from 1-hour baseline or spike > 2x |
| Old GC collection time | GC pauses steal threads and reduce throughput | Increasing trend or > 5 s |
| Segments per node / merge concurrency | Merge backlog competes for I/O and CPU | Segment count > 100 per shard or merges.current at scheduler limit |
| Disk I/O wait | Storage saturation slows every write | Sustained > 20-30% |
| Indexing pressure rejections (7.9+) | Distinguishes memory backpressure from thread exhaustion | Any delta > 0 on coordinating, primary, or replica stages |
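The first warning sign, sustained rejections, can be expressed as a check over a series of one-minute deltas. This is a rough sketch: the function name is hypothetical, and it interprets "sustained" as every one of the most recent samples being above zero, which is one reasonable reading among several.

```python
from typing import Sequence


def sustained_rejections(per_minute_deltas: Sequence[int], window: int = 5) -> bool:
    """True when each of the last `window` one-minute samples shows
    at least one rejection. Returns False until enough samples exist."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(per_minute_deltas) < window:
        return False
    return all(delta > 0 for delta in per_minute_deltas[-window:])
```

A single zero inside the window resets the condition, so short bursts that clear on their own do not trigger it.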

Fixes

Burst traffic and undersized cluster. The durable fix is to add data nodes or reduce ingest volume. Temporarily, reduce client batch sizes and concurrency to smooth the load. Do not increase thread_pool.write.queue_size; this only delays rejection and increases heap pressure from queued tasks.

Hot-spotted shards. Use /_cat/shards to identify indices with primaries concentrated on the busiest nodes. If time-series indices are skewed to recent nodes, verify ILM rollover and allocation filtering. For immediate relief, use /_cluster/reroute to move specific hot shards, though this generates recovery I/O.
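To see whether primaries are concentrated on a few nodes, the rows returned by /_cat/shards?format=json can be tallied per node. The sketch below assumes each row is a dict with prirep and node keys, as in the JSON form of the cat API; the function name is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional


def primaries_per_node(shards: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, int]:
    """Count primary shards per node from _cat/shards rows.

    Rows with no node (unassigned shards) and replica rows are ignored.
    """
    counts: Counter = Counter()
    for row in shards:
        node = row.get("node")
        if row.get("prirep") == "p" and node:
            counts[node] += 1
    return dict(counts)
```

Filtering the input to the write-heavy indices first gives a clearer picture than counting every primary in the cluster, since idle primaries do not contribute to write pool load.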

Storage bottleneck. If merges.current is pegged and I/O wait is high, increase index.refresh_interval on heavy-write indices to lower segment creation rate. For spinning disks, set index.merge.scheduler.max_thread_count to 1. If storage is consistently saturated, move to SSD or add nodes.
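The two index settings named above can be sent together in one index settings update. The following sketch only builds the JSON body; the "30s" default is an illustrative value rather than a recommendation from this guide, and the function name is hypothetical.

```python
import json


def write_heavy_settings(refresh_interval: str = "30s",
                         spinning_disk: bool = False) -> str:
    """Build a JSON body for an index settings update.

    Always sets index.refresh_interval; adds
    index.merge.scheduler.max_thread_count = 1 only for spinning disks.
    """
    index_settings: dict = {"refresh_interval": refresh_interval}
    if spinning_disk:
        index_settings["merge"] = {"scheduler": {"max_thread_count": 1}}
    return json.dumps({"index": index_settings}, sort_keys=True)
```

The resulting string could be passed as the request body of a PUT to the index's _settings endpoint, for example with curl -d and a Content-Type of application/json.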

GC and heap pressure. If the node is near heap limits and old GC is frequent, reduce shard count (close or delete old indices), eliminate fielddata usage on text fields, or add nodes. Do not increase heap beyond the compressed OOPs threshold (roughly 30.5 GB). Run more nodes instead.

Indexing pressure saturation (ES 7.9+). If rejections come from indexing_pressure rather than the thread pool, reduce bulk request payload size and lower concurrent bulk streams. The indexing memory limit defaults to 10% of heap and should not be raised without careful consideration.
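The 10% default translates into a concrete byte budget per node, which gives a rough upper bound on how many bulk payloads can be in flight at once. This is a back-of-the-envelope sketch: the function names are hypothetical, and the bytes Elasticsearch actually accounts for per request may differ from the raw payload size.

```python
def indexing_pressure_limit_bytes(heap_bytes: int, limit_percent: int = 10) -> int:
    """Byte budget for in-flight indexing, as a percentage of heap."""
    return heap_bytes * limit_percent // 100


def max_in_flight_bulks(heap_bytes: int, bulk_payload_bytes: int,
                        limit_percent: int = 10) -> int:
    """Rough count of same-sized bulk payloads that fit in the budget."""
    if bulk_payload_bytes <= 0:
        raise ValueError("bulk_payload_bytes must be positive")
    return indexing_pressure_limit_bytes(heap_bytes, limit_percent) // bulk_payload_bytes
```

For an 8 GiB heap the budget is about 819 MiB, so roughly 40 bulk requests of 20 MiB each could be in flight on one node before further requests are rejected under these assumptions.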

Prevention

Implement client-side retry with exponential backoff on HTTP 429. The bulk helpers in the official Elasticsearch clients generally support this; ensure it is enabled. Without retries, rejected documents are lost; with immediate retries and no backoff, the client amplifies the overload.
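For custom clients without a built-in helper, the retry loop can be sketched as follows. This is a generic illustration, not any client library's API: send is a hypothetical callable that performs the request and returns the HTTP status code, and the delay values are arbitrary defaults. It uses exponential backoff with full jitter.

```python
import random
import time
from typing import Callable


def send_with_backoff(send: Callable[[], int],
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a status other than 429 or the
    retries are exhausted. Returns the last status seen."""
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            return status
        # Exponential cap: base, 2*base, 4*base, ... up to max_delay.
        cap = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter spreads retries from many clients over the interval.
        sleep(random.uniform(0, cap))
        status = send()
    return status
```

For bulk requests, individual items can be rejected with status 429 inside an otherwise successful response, so a real implementation would resend only the rejected items rather than the whole batch.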

Size clusters for peak ingest plus headroom, not average load. Thread pool queues absorb seconds of burst, not minutes.

Monitor write queue depth as a leading indicator. A queue that oscillates near its limit during normal business hours means the cluster has no burst headroom.

Keep segment counts under control with ILM force-merge policies for cold time-series indices, and avoid very short refresh_interval values on high-volume indices, since each refresh can create a new segment.

How Netdata helps

  • Correlates write thread pool rejections with per-node CPU, disk I/O, and JVM heap on the same charts to distinguish capacity shortages from storage bottlenecks.
  • Alerts on write queue depth and rejection rate deltas before clients see HTTP 429.
  • Compares thread pool saturation across data nodes to expose hot-spotting hidden by cluster-wide averages.
  • Tracks indexing latency and merge activity alongside rejection metrics to identify whether the root cause is traffic growth or background I/O pressure.