Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
HTTP 429 responses from Elasticsearch, or stack traces containing EsRejectedExecutionException, usually mean the write thread pool queue on at least one data node is full. The write thread pool (named bulk before Elasticsearch 6.3) executes indexing, bulk, update, and delete operations on each data node using a bounded queue. When all threads are busy and the queue fills, the node rejects the operation. This guide covers how to confirm the diagnosis, distinguish it from other rejection paths, and fix the root cause instead of masking the symptom.
What this means
Elasticsearch routes every write to the primary shard. On the target data node, the write thread pool executes the operation. This pool has a fixed number of threads, usually sized to the node’s allocated processors, and a bounded queue. By default, the write queue holds up to 10000 operations in Elasticsearch 7.9 and later (200 in earlier versions). When all threads are busy and the queue is full, the node returns EsRejectedExecutionException to the client with HTTP 429.
The rejected counter from /_cat/thread_pool and /_nodes/stats/thread_pool is cumulative and resets only on node restart. Track the delta over time, not the absolute value. A brief burst of rejections during a traffic spike may self-resolve if the client backs off; sustained rejections mean the cluster cannot keep up with ingest demand.
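Tracking the delta can be sketched as follows. This is a minimal example, assuming two samples of /_cat/thread_pool/write taken with format=json and the node_name and rejected columns; the function names and sample values are hypothetical.

```python
from typing import Dict, List


def rejected_by_node(rows: List[dict]) -> Dict[str, int]:
    """Map node_name -> cumulative rejected count from _cat/thread_pool rows."""
    # The _cat API returns numeric columns as strings in JSON output.
    return {row["node_name"]: int(row["rejected"]) for row in rows}


def rejection_deltas(previous: List[dict], current: List[dict]) -> Dict[str, int]:
    """Return the number of new rejections per node between two samples."""
    before = rejected_by_node(previous)
    deltas = {}
    for node, now in rejected_by_node(current).items():
        # Nodes missing from the earlier sample are treated as starting from 0.
        earlier = before.get(node, 0)
        # A value lower than before means the node restarted and the counter
        # reset, so everything counted since the restart is new.
        deltas[node] = now - earlier if now >= earlier else now
    return deltas


# Hypothetical samples taken one minute apart.
sample_t0 = [
    {"node_name": "data-1", "rejected": "120"},
    {"node_name": "data-2", "rejected": "0"},
    {"node_name": "data-3", "rejected": "500"},
]
sample_t1 = [
    {"node_name": "data-1", "rejected": "160"},
    {"node_name": "data-2", "rejected": "0"},
    {"node_name": "data-3", "rejected": "7"},  # restarted between samples
]

print(rejection_deltas(sample_t0, sample_t1))
```

Feeding the per-interval deltas into alerting, instead of the raw counter, keeps old incidents from triggering new alerts.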
HTTP 429 can also come from circuit breakers or indexing pressure backpressure. Thread pool rejections reference the queue in the error body; circuit breaker trips reference memory limits. Check the response details to avoid treating a memory problem as a thread capacity problem.
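A client can make this distinction programmatically. The sketch below assumes the standard JSON error envelope with error.root_cause entries carrying type and reason fields, and assumes that thread pool rejections mention "queue capacity" in the reason text. That substring match and the returned labels are illustrative choices, not an official contract, so verify them against responses from your own cluster version.

```python
def classify_429(body: dict) -> str:
    """Roughly classify an HTTP 429 error body by its likely source."""
    error = body.get("error", {})
    if not isinstance(error, dict):
        return "unknown"
    # Fall back to the top-level error when root_cause is absent.
    causes = error.get("root_cause") or [error]
    for cause in causes:
        kind = cause.get("type", "")
        reason = cause.get("reason", "")
        if kind == "circuit_breaking_exception":
            return "circuit_breaker"
        if kind == "es_rejected_execution_exception":
            # Assumption: thread pool rejections describe the executor and
            # its queue capacity in the reason string.
            if "queue capacity" in reason:
                return "thread_pool"
            return "other_rejection"
    return "unknown"


# Hypothetical, abbreviated error bodies.
thread_pool_body = {
    "error": {
        "root_cause": [
            {
                "type": "es_rejected_execution_exception",
                "reason": "rejected execution of task on "
                "EsThreadPoolExecutor[name = data-1/write, queue capacity = 10000]",
            }
        ]
    },
    "status": 429,
}
breaker_body = {
    "error": {
        "root_cause": [
            {"type": "circuit_breaking_exception", "reason": "[parent] Data too large"}
        ]
    },
    "status": 429,
}

print(classify_429(thread_pool_body))
print(classify_429(breaker_body))
```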
flowchart LR
A[Client bulk/index request] --> B[Coordinating node]
B --> C[Primary shard node]
C --> D[Write thread pool queue]
D -->|Capacity available| E[Worker thread executes write]
D -->|Queue full| F[EsRejectedExecutionException]
F --> G[Return HTTP 429 to client]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Hot-spotting or undersized cluster | Rejections concentrated on one or two nodes while others are idle | /_cat/thread_pool/write sorted by rejected; /_cat/nodes for asymmetric CPU or disk |
| Merge storm or GC pressure reducing throughput | Rejections rising alongside old GC pauses, high segment counts, or elevated merge concurrency | /_nodes/stats/indices/segments for segment count; /_nodes/stats/jvm for GC |
| Sudden burst traffic exceeding capacity | Spike in indexing rate with queue depth jumping across all nodes | /_nodes/stats/indices/indexing rate delta; application traffic metrics |
| Slow storage or I/O saturation | High indexing latency, elevated I/O wait, queues growing despite moderate CPU | /_nodes/stats/indices/indexing latency; OS iostat |
| Oversized bulk requests or too many concurrent streams | Rejections at low overall indexing rate with very large bulk payloads | Client bulk batch size and concurrency settings |
Quick checks
# Check write thread pool queue depth and cumulative rejections per node
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
# Check indexing rate and cumulative time per node
curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?filter_path=nodes.*.indices.indexing'
# Check heap, CPU, and load for capacity signals
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'
# Check current merges for I/O pressure
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,merges.current'
# Check segment counts per node
curl -s 'http://localhost:9200/_nodes/stats/indices/segments?filter_path=nodes.*.indices.segments.count'
# Check indexing pressure to distinguish memory backpressure from thread pool saturation (ES 7.9+)
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?filter_path=nodes.*.indexing_pressure'
How to diagnose it
- Confirm write pool rejection. Run /_cat/thread_pool/write and sort by rejected. If the delta is rising on one or more nodes, you have confirmed thread pool saturation. If search or get pools show rejections instead, the issue is read-side.
- Determine scope. Are rejections cluster-wide or isolated? Cluster-wide points to total ingest exceeding cluster capacity. Isolated points to hot-spotted shards or a single slow node.
- Correlate with indexing rate. Compute delta(index_total) / interval. If the rate jumped before rejections began, the cluster is undersized for the peak. If the rate is flat, the cluster has lost capacity elsewhere.
- Check for GC and heap pressure. Run /_nodes/stats/jvm and compare old GC time. Long stop-the-world pauses freeze threads and effectively reduce write throughput without reducing queue size.
- Inspect merges and segments. High segment count with merges.current at the scheduler limit means background merges are falling behind or consuming all I/O, leaving fewer cycles for writes.
- Check indexing pressure (ES 7.9+). If indexing_pressure.memory.total.coordinating_rejections or primary_rejections are rising while thread pool rejections are flat, the bottleneck is in-flight indexing memory, not thread availability.
- Review client behavior. Large bulk batches or high client concurrency can overwhelm even a healthy cluster. A common pattern is Logstash or a custom client sending batches that are too large or not backing off on 429.
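The scope and indexing-rate steps can be sketched in a few lines. This example assumes two /_nodes/stats/indices/indexing responses taken interval_seconds apart and a dict of per-node rejection deltas. The function names and the 50% cut-off for "cluster-wide" are hypothetical choices for illustration.

```python
from typing import Dict


def indexing_rate(prev_stats: dict, curr_stats: dict, interval_seconds: float) -> Dict[str, float]:
    """Documents indexed per second per node: delta(index_total) / interval."""
    rates = {}
    for node_id, node in curr_stats["nodes"].items():
        prev_node = prev_stats["nodes"].get(node_id)
        if prev_node is None:
            continue  # node joined between samples, no baseline yet
        delta = (
            node["indices"]["indexing"]["index_total"]
            - prev_node["indices"]["indexing"]["index_total"]
        )
        # Clamp at zero so a counter reset after a restart is not a negative rate.
        rates[node_id] = max(delta, 0) / interval_seconds
    return rates


def rejection_scope(deltas: Dict[str, int], cluster_wide_fraction: float = 0.5) -> str:
    """Label rejections as none, isolated, or cluster-wide (assumed cut-off)."""
    rejecting = [node for node, delta in deltas.items() if delta > 0]
    if not rejecting:
        return "none"
    if len(rejecting) / len(deltas) >= cluster_wide_fraction:
        return "cluster-wide"
    return "isolated"


# Hypothetical samples taken 60 seconds apart.
stats_t0 = {"nodes": {"abc": {"indices": {"indexing": {"index_total": 1000}}}}}
stats_t1 = {"nodes": {"abc": {"indices": {"indexing": {"index_total": 4000}}}}}

print(indexing_rate(stats_t0, stats_t1, 60))
print(rejection_scope({"data-1": 40, "data-2": 0, "data-3": 0}))
```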
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Write thread pool rejected delta | Direct measure of capacity overload | > 0 / min sustained for > 5 minutes |
| Write thread pool queue depth | Leading indicator before rejections occur | Persistently > 1000 or growing toward the queue limit |
| Indexing rate vs baseline | Tells you if traffic grew or capacity shrank | Drop > 30% from 1-hour baseline or spike > 2x |
| Old GC collection time | GC pauses steal threads and reduce throughput | Increasing trend or > 5 s |
| Segments per node / merge concurrency | Merge backlog competes for I/O and CPU | Segment count > 100 per shard or merges.current at scheduler limit |
| Disk I/O wait | Storage saturation slows every write | Sustained > 20-30% |
| Indexing pressure rejections (7.9+) | Distinguishes memory backpressure from thread exhaustion | Any delta > 0 on coordinating, primary, or replica stages |
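The "sustained" condition on the rejected delta can be evaluated with a small helper. This is a sketch that assumes one delta sample per minute, ordered oldest to newest; the function names are hypothetical.

```python
from typing import Sequence


def trailing_run(samples: Sequence[int]) -> int:
    """Count how many of the most recent samples are consecutively above zero."""
    run = 0
    for value in reversed(samples):
        if value <= 0:
            break
        run += 1
    return run


def rejections_sustained(per_minute_deltas: Sequence[int], threshold_minutes: int = 5) -> bool:
    """True when rejections have been non-zero for more than threshold_minutes."""
    return trailing_run(per_minute_deltas) > threshold_minutes


print(rejections_sustained([0, 3, 1, 2, 5, 1, 4]))  # six consecutive non-zero minutes
print(rejections_sustained([1, 1, 0, 1, 1]))        # run broken two minutes ago
```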
Fixes
Burst traffic and undersized cluster. The durable fix is to add data nodes or reduce ingest volume. Temporarily, reduce client batch sizes and concurrency to smooth the load. Do not increase thread_pool.write.queue_size; this only delays rejection and increases heap pressure from queued tasks.
Hot-spotted shards. Use /_cat/shards to identify indices with primaries concentrated on the busiest nodes. If time-series indices are skewed to recent nodes, verify ILM rollover and allocation filtering. For immediate relief, use /_cluster/reroute to move specific hot shards, though this generates recovery I/O.
Storage bottleneck. If merges.current is pegged and I/O wait is high, increase index.refresh_interval on heavy-write indices to lower segment creation rate. For spinning disks, set index.merge.scheduler.max_thread_count to 1. If storage is consistently saturated, move to SSD or add nodes.
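As an illustration, the request body for an index settings update (PUT /<index>/_settings) could be built like this. The 30s refresh interval and the helper name are assumptions for the example, not recommended values; choose an interval that fits how quickly new documents must become searchable.

```python
import json


def write_heavy_settings(refresh_interval: str = "30s", spinning_disks: bool = False) -> dict:
    """Build an index settings body for a write-heavy index (illustrative values)."""
    settings = {"index": {"refresh_interval": refresh_interval}}
    if spinning_disks:
        # One merge thread avoids concurrent merges competing for a single disk head.
        settings["index"]["merge"] = {"scheduler": {"max_thread_count": 1}}
    return settings


print(json.dumps(write_heavy_settings(spinning_disks=True), indent=2))
```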
GC and heap pressure. If the node is near heap limits and old GC is frequent, reduce shard count (close or delete old indices), eliminate fielddata usage on text fields, or add nodes. Do not increase heap beyond the compressed OOPs threshold (roughly 30.5 GB). Run more nodes instead.
Indexing pressure saturation (ES 7.9+). If rejections come from indexing_pressure rather than the thread pool, reduce bulk request payload size and lower concurrent bulk streams. The indexing memory limit defaults to 10% of heap and should not be raised without careful consideration.
Prevention
Implement client-side retry with exponential backoff on HTTP 429. Many Elasticsearch clients and bulk helpers support this natively; ensure it is enabled. Without retries, rejected documents are lost; retrying immediately without backoff amplifies the overload.
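For a custom client, the retry loop might look like the sketch below. It assumes a hypothetical send callable that performs the request and returns the HTTP status code. The delay values are illustrative, and the sleep and rng parameters exist only so the loop can be exercised without waiting.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Exponential ceiling: base, 2x base, 4x base, ... capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        # Full jitter: a random delay up to the ceiling spreads out retries
        # from many clients so they do not arrive in lockstep.
        sleep(ceiling * rng())
    return 429


# Usage with a fake sender that is rejected twice, then succeeds.
responses = iter([429, 429, 200])
print(send_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

If the final status is still 429, the caller should queue or persist the batch rather than drop it.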
Size clusters for peak ingest plus headroom, not average load. Thread pool queues absorb seconds of burst, not minutes.
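A quick back-of-the-envelope calculation shows why. The sketch below assumes steady arrival and service rates measured in queued write operations per second on one node; the numbers are hypothetical.

```python
def seconds_until_full(
    queue_capacity: int,
    arrival_rate: float,
    service_rate: float,
    current_depth: int = 0,
) -> float:
    """Time until a bounded queue fills when arrivals outpace the workers."""
    excess = arrival_rate - service_rate
    if excess <= 0:
        return float("inf")  # the workers keep up, so the queue never fills
    return (queue_capacity - current_depth) / excess


# A 10000-slot queue, 3000 ops/s arriving, 2500 ops/s completed.
print(seconds_until_full(10000, 3000, 2500))
```

With these numbers the queue fills in 20 seconds, after which every additional operation is rejected until the burst ends or capacity is added.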
Monitor write queue depth as a leading indicator. A queue that oscillates near its limit during normal business hours means the cluster has no burst headroom.
Keep segment counts under control with ILM force-merge policies for cold time-series indices, and avoid very short refresh_interval values on high-volume indices, since each refresh creates a new segment.
How Netdata helps
- Correlates write thread pool rejections with per-node CPU, disk I/O, and JVM heap on the same charts to distinguish capacity shortages from storage bottlenecks.
- Alerts on write queue depth and rejection rate deltas before clients see HTTP 429.
- Compares thread pool saturation across data nodes to expose hot-spotting hidden by cluster-wide averages.
- Tracks indexing latency and merge activity alongside rejection metrics to identify whether the root cause is traffic growth or background I/O pressure.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery
- Elasticsearch heap pressure death spiral: GC, node removal, and the cascade
- Elasticsearch high disk watermark [90%] exceeded: shard relocation and the cascade
- Elasticsearch JVM heap usage high: reading the sawtooth and the post-GC floor







