Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429

HTTP 429 responses from Elasticsearch, or stack traces containing EsRejectedExecutionException, usually mean the write thread pool queue is full. The write thread pool (named bulk before Elasticsearch 6.3) executes indexing, bulk, update, and delete operations on each data node using a bounded queue. When all threads are busy and the queue fills, the node rejects the operation. This guide covers how to confirm the diagnosis, distinguish it from other rejection paths, and fix the root cause instead of masking the symptom.

What this means

Elasticsearch routes every write to the primary shard. On the target data node, the write thread pool executes the operation. This pool has a fixed number of threads, usually sized to the node’s allocated processors, and a bounded queue. The default queue size is 10000 operations in Elasticsearch 7.9 and later; earlier 6.x and 7.x releases default to 200. When all threads are busy and the queue is full, the node returns EsRejectedExecutionException to the client with HTTP 429.

The rejected counter from /_cat/thread_pool and /_nodes/stats/thread_pool is cumulative and resets only on node restart. Track the delta over time, not the absolute value. A brief burst of rejections during a traffic spike may self-resolve if the client backs off; sustained rejections mean the cluster cannot keep up with ingest demand.
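
As a minimal sketch of tracking the delta rather than the absolute value, the function below compares two snapshots of the counter. The function name and the two-column snapshot file format are assumptions made for illustration, and node names are assumed to contain no spaces.

```shell
# Print per-node growth of the cumulative `rejected` counter between two snapshots.
# Each snapshot file holds lines of "<node_name> <rejected>", for example saved from:
#   curl -s 'http://localhost:9200/_cat/thread_pool/write?h=node_name,rejected'
rejected_delta() {
  # $1 = earlier snapshot file, $2 = later snapshot file
  awk '
    NR == FNR { before[$1] = $2; next }   # first file: remember the old counters
    ($1 in before) {
      delta = $2 - before[$1]
      if (delta < 0) delta = $2           # counter went backwards: node restarted
      print $1, delta
    }
  ' "$1" "$2"
}
```

Taking a snapshot every minute and alerting on any non-zero delta gives the "rejections per minute" view that the absolute counter hides.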

HTTP 429 can also come from circuit breakers or from indexing pressure, the in-flight indexing memory limit introduced in 7.9. Thread pool rejections name the executor and its queue capacity in the error body; circuit breaker trips return circuit_breaking_exception and reference memory limits. Check the response details to avoid treating a memory problem as a thread capacity problem.
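
One rough way to separate these paths is to look for characteristic strings in the error body. The sketch below is an assumption-laden heuristic: the function name is hypothetical, and the matched substrings reflect typical messages that may differ between versions.

```shell
# Classify a 429 error body by the rejection path it most likely came from.
# The substrings are assumptions based on typical error messages, not a stable contract.
classify_rejection() {
  # $1 = raw JSON error body returned by Elasticsearch
  case "$1" in
    *circuit_breaking_exception*)      echo "circuit_breaker" ;;
    *"queue capacity"*)                echo "thread_pool" ;;
    *es_rejected_execution_exception*) echo "indexing_pressure_or_other" ;;
    *)                                 echo "unknown" ;;
  esac
}
```

Anything that lands in the third bucket is worth cross-checking against the indexing_pressure node stats before changing thread pool or client settings.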

flowchart LR
    A[Client bulk/index request] --> B[Coordinating node]
    B --> C[Primary shard node]
    C --> D[Write thread pool queue]
    D -->|Capacity available| E[Worker thread executes write]
    D -->|Queue full| F[EsRejectedExecutionException]
    F --> G[Return HTTP 429 to client]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Hot-spotting or undersized cluster | Rejections concentrated on one or two nodes while others are idle | `/_cat/thread_pool/write` sorted by `rejected`; `/_cat/nodes` for asymmetric CPU or disk |
| Merge storm or GC pressure reducing throughput | Rejections rising alongside old GC pauses, high segment counts, or elevated merge concurrency | `/_nodes/stats/indices/segments` for segment count; `/_nodes/stats/jvm` for GC |
| Sudden burst traffic exceeding capacity | Spike in indexing rate with queue depth jumping across all nodes | `/_nodes/stats/indices/indexing` rate delta; application traffic metrics |
| Slow storage or I/O saturation | High indexing latency, elevated I/O wait, queues growing despite moderate CPU | `/_nodes/stats/indices/indexing` latency; OS `iostat` |
| Oversized bulk requests or too many concurrent streams | Rejections at low overall indexing rate with very large bulk payloads | Client bulk batch size and concurrency settings |

Quick checks

# Check write thread pool queue depth and cumulative rejections per node, highest rejections first
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected&s=rejected:desc'

# Check cumulative indexing counters (index_total, index_time_in_millis) per node; sample twice to derive a rate
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'

# Check heap, CPU, and load for capacity signals
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'

# Check current merges for I/O pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,merges.current'

# Check segment counts per node
curl -s 'http://localhost:9200/_nodes/stats/indices/segments?filter_path=nodes.*.indices.segments.count'

# Check indexing pressure to distinguish memory backpressure from thread pool saturation (ES 7.9+)
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?filter_path=nodes.*.indexing_pressure'

How to diagnose it

1. Confirm write pool rejection. Run /_cat/thread_pool/write and sort by rejected. If the delta is rising on one or more nodes, you have confirmed thread pool saturation. If search or get pools show rejections instead, the issue is read-side.
2. Determine scope. Are rejections cluster-wide or isolated? Cluster-wide points to total ingest exceeding cluster capacity. Isolated points to hot-spotted shards or a single slow node.
3. Correlate with indexing rate. Compute delta(index_total) / interval. If the rate jumped before rejections began, the cluster is undersized for the peak. If the rate is flat, the cluster has lost capacity elsewhere.
4. Check for GC and heap pressure. Run /_nodes/stats/jvm and compare old GC time. Long stop-the-world pauses freeze threads and effectively reduce write throughput without reducing queue size.
5. Inspect merges and segments. High segment count with merges.current at the scheduler limit means background merges are falling behind or consuming all I/O, leaving fewer cycles for writes.
6. Check indexing pressure (ES 7.9+). If indexing_pressure.memory.total.coordinating_rejections or primary_rejections are rising while thread pool rejections are flat, the bottleneck is in-flight indexing memory, not thread availability.
7. Review client behavior. Large bulk batches or high client concurrency can overwhelm even a healthy cluster. A common pattern is Logstash or a custom client sending batches that are too large or not backing off on 429.
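
The rate calculation in the "Correlate with indexing rate" step can be sketched as follows, assuming two readings of indexing.index_total from /_nodes/stats/indices/indexing taken a known number of seconds apart. The function name is hypothetical.

```shell
# Documents indexed per second between two samples of indexing.index_total.
indexing_rate() {
  # $1 = earlier index_total, $2 = later index_total, $3 = seconds between samples
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s <= 0) exit 1                 # refuse a zero or negative interval
    printf "%.1f\n", (b - a) / s
  }'
}
```

Comparing this value for the minutes before and after rejections started shows whether traffic grew or capacity shrank.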

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Write thread pool rejected delta | Direct measure of capacity overload | > 0 / min sustained for > 5 minutes |
| Write thread pool queue depth | Leading indicator before rejections occur | Persistently > 1000 or growing toward the queue limit |
| Indexing rate vs baseline | Tells you if traffic grew or capacity shrank | Drop > 30% from 1-hour baseline or spike > 2x |
| Old GC collection time | GC pauses steal threads and reduce throughput | Increasing trend or > 5 s |
| Segments per node / merge concurrency | Merge backlog competes for I/O and CPU | Segment count > 100 per shard or merges.current at scheduler limit |
| Disk I/O wait | Storage saturation slows every write | Sustained > 20-30% |
| Indexing pressure rejections (7.9+) | Distinguishes memory backpressure from thread exhaustion | Any delta > 0 on coordinating, primary, or replica stages |

Fixes

Burst traffic and undersized cluster. The durable fix is to add data nodes or reduce ingest volume. Temporarily, reduce client batch sizes and concurrency to smooth the load. Do not increase thread_pool.write.queue_size; this only delays rejection and increases heap pressure from queued tasks.

Hot-spotted shards. Use /_cat/shards to identify indices with primaries concentrated on the busiest nodes. If the newest time-series indices keep landing on the same few nodes, verify ILM rollover and allocation filtering. For immediate relief, use /_cluster/reroute to move specific hot shards, though this generates recovery I/O.
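
To spot primaries concentrated on a few nodes, a small tally over /_cat/shards output can help. This is a sketch, not a complete balance check: the function name is hypothetical, and it assumes the four columns index, shard, prirep, node in that order.

```shell
# Count primary shards per node from `_cat/shards` output, busiest node first.
# Input lines: "<index> <shard> <prirep> <node>", for example from:
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,node'
primaries_per_node() {
  awk '$3 == "p" && $4 != "" { count[$4]++ }      # skip replicas and unassigned shards
       END { for (n in count) print count[n], n }' | sort -k1,1nr -k2,2
}
```

Piping the curl output for a single write-heavy index pattern into this function is usually more telling than the cluster-wide total, because one skewed index can hide behind an otherwise even distribution.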

Storage bottleneck. If merges.current is pegged and I/O wait is high, increase index.refresh_interval on heavy-write indices to lower segment creation rate. For spinning disks, set index.merge.scheduler.max_thread_count to 1. If storage is consistently saturated, move to SSD or add nodes.
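
A minimal sketch of applying both settings through the update index settings API is shown below. ES_URL, the function names, and any index name or interval passed in are placeholders; pick values that match your own indices and freshness requirements.

```shell
# ES_URL is a placeholder; point it at your own cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

set_refresh_interval() {
  # $1 = index name or pattern, $2 = interval such as 30s
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "{\"index.refresh_interval\":\"$2\"}"
}

set_merge_threads() {
  # $1 = index name or pattern, $2 = merge thread count (1 for spinning disks)
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "{\"index.merge.scheduler.max_thread_count\":$2}"
}
```

For example, `set_refresh_interval 'logs-app' 30s` trades search freshness for fewer, larger segments on that index.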

GC and heap pressure. If the node is near heap limits and old GC is frequent, reduce shard count (close or delete old indices), eliminate fielddata usage on text fields, or add nodes. Do not increase heap beyond the compressed OOPs threshold (roughly 30.5 GB). Run more nodes instead.

Indexing pressure saturation (ES 7.9+). If rejections come from indexing_pressure rather than the thread pool, reduce bulk request payload size and lower concurrent bulk streams. The indexing memory limit defaults to 10% of heap and should not be raised without careful consideration.

Prevention

Implement client-side retry with exponential backoff on HTTP 429. The bulk helpers in most official Elasticsearch clients support this; ensure it is enabled. Without retries, rejected data is lost; without backoff, immediate retries amplify the overload.
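
For clients that lack a built-in helper, the retry loop can be sketched as below. BULK_URL, BASE_DELAY, MAX_RETRIES, and the function name are illustrative names, not Elasticsearch settings. The sketch only inspects the HTTP status of the whole request; a bulk response that returns 200 can still carry per-item 429s, which need item-level handling. It also omits jitter and expects whole-second delays.

```shell
# Send a bulk payload, retrying on HTTP 429 with exponentially growing delays.
BULK_URL="${BULK_URL:-http://localhost:9200/_bulk}"   # placeholder endpoint
BASE_DELAY="${BASE_DELAY:-1}"                         # whole seconds before the first retry
MAX_RETRIES="${MAX_RETRIES:-5}"

bulk_with_backoff() {
  # $1 = path to an NDJSON bulk file; prints the final HTTP status
  attempt=0
  delay=$BASE_DELAY
  while :; do
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: application/x-ndjson' \
      -X POST "$BULK_URL" --data-binary "@$1")
    if [ "$status" != "429" ]; then
      echo "$status"
      if [ "$status" = "200" ]; then return 0; else return 1; fi
    fi
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "429"                  # still rejected after all retries
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))          # 1s, 2s, 4s, ... with the default base
    attempt=$((attempt + 1))
  done
}
```

Doubling the delay gives the cluster progressively more time to drain its queue, while the retry cap keeps a persistent overload from blocking the sender indefinitely.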

Size clusters for peak ingest plus headroom, not average load. Thread pool queues absorb seconds of burst, not minutes.

Monitor write queue depth as a leading indicator. A queue that oscillates near its limit during normal business hours means the cluster has no burst headroom.

Keep segment counts under control with ILM force-merge policies for cold time-series indices, and avoid very short refresh_interval values on high-volume indices, since each refresh can create a new segment.

How Netdata helps

Correlates write thread pool rejections with per-node CPU, disk I/O, and JVM heap on the same charts to distinguish capacity shortages from storage bottlenecks.
Alerts on write queue depth and rejection rate deltas before clients see HTTP 429.
Compares thread pool saturation across data nodes to expose hot-spotting hidden by cluster-wide averages.
Tracks indexing latency and merge activity alongside rejection metrics to identify whether the root cause is traffic growth or background I/O pressure.
