Elasticsearch IOException: Too many open files – file descriptors, segments, and ulimit
IOException: Too many open files means the Elasticsearch process has reached its per-process RLIMIT_NOFILE. Once open_file_descriptors reaches max_file_descriptors, the kernel returns EMFILE on every new open() call. The node cannot create Lucene segments, accept transport or HTTP connections, or maintain cluster membership. Each shard is a Lucene index composed of multiple segment files, and every network connection consumes a descriptor.
This usually follows one of two paths: the initial ulimit was too low for the shard count, or shard and segment growth overwhelmed a previously adequate limit. Raising the limit buys headroom; reducing segment and shard count fixes the root cause.
What this means
Every Elasticsearch index is split into shards, and every shard is a self-contained Lucene index made of immutable segments. Each segment consists of multiple physical files. A single shard with fifty segments can hold hundreds of open files. Multiply by hundreds of shards per node, add inter-node transport connections, HTTP client connections, and recovery file handles, and the total open file descriptor count climbs rapidly.
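The multiplication described above can be sketched as a small shell helper. This is only a rough ceiling, not a measurement: the files-per-segment figure and the connection count are assumptions you supply, and real usage depends on the codec, compound-file settings, and how the store opens files.

```shell
# Rough ceiling on descriptors held by one node.
# All four inputs are assumptions supplied by the caller.
estimate_fds() {
  est_shards=$1     # shards hosted on the node
  est_segments=$2   # average segments per shard
  est_files=$3      # assumed files per segment (hypothetical, e.g. 10)
  est_conns=$4      # open transport + HTTP connections
  echo $(( est_shards * est_segments * est_files + est_conns ))
}

# Example: 300 shards, 50 segments each, 10 files per segment, 2000 connections
estimate_fds 300 50 10 2000   # prints 152000
```

With these illustrative numbers the estimate already exceeds a 131,072 limit, which is why shard and segment counts matter more than the limit itself.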
When open_file_descriptors reaches max_file_descriptors, the next open() fails with EMFILE, which Elasticsearch surfaces as IOException: Too many open files. The node cannot create new segments, accept new connections, or write to the translog. It may reject searches, drop out of the cluster, or fail to restart if a low limit is enforced at startup. Elasticsearch recommends a minimum limit of 65,536, but dense clusters often need 131,072 or higher.
flowchart TD
A[High shard and segment count] --> B[Multiple open files per segment]
C[Network connections] --> D[FD consumption rises]
B --> D
D --> E{FD usage approaches max?}
E -->|Yes| F[IOException: Too many open files]
F --> G[Failed segment opens]
F --> H[Connection refusals]
F --> I[Erratic cluster membership]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many shards and segments per node | FD percentage climbing steadily; IOException during indexing or search; search latency rising from segment overhead | _cat/nodes for file_desc.percent and segments.count; _cat/indices for pri.segments.count |
| Insufficient OS or container FD limit | Node fails to start, or FD percentage sits above 80% despite moderate shard counts | _nodes/stats/process for max_file_descriptors; OS /proc/<pid>/limits for open files |
| Merge backlog or aggressive refresh interval | Segment count growing monotonically; merge activity cannot keep up; FD count rises faster than shard count | _nodes/stats/indices/merge for merges.current and merges.total_time_in_millis; index refresh_interval settings |
| Connection accumulation | FD count high but segment count normal; many client or inter-node connections left open | OS-level connection counts via ss or /proc/<pid>/fd |
Quick checks
# Check current and max FDs per node from the cluster API
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.max,file_desc.percent'
# Check detailed process stats for FD usage
curl -s 'http://localhost:9200/_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors'
# Count open FDs on the local OS
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -1)
ls /proc/$ES_PID/fd | wc -l
grep 'open files' "/proc/$ES_PID/limits"
# Check segment counts per index
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20
# Check per-node segment totals and merge activity
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory,merges.current,merges.total'
# Look for the error in recent logs
grep -i "Too many open files" /var/log/elasticsearch/*.log
# Check if recoveries or snapshots are transiently spiking FD usage
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,stage,source_host,target_host,bytes_percent'
curl -s 'http://localhost:9200/_snapshot/_status'
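The OS-level checks above can be combined into a single utilization figure. The following is a minimal sketch for Linux, assuming the standard /proc/<pid>/limits layout where the soft limit is the fourth field of the "Max open files" row; the function names are hypothetical.

```shell
# Integer percentage of the limit in use.
fd_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Count open descriptors and read the soft limit for a PID from /proc (Linux only).
fd_report() {
  rep_pid=$1
  rep_open=$(ls "/proc/$rep_pid/fd" | wc -l)
  rep_max=$(awk '/Max open files/ { print $4 }' "/proc/$rep_pid/limits")
  echo "pid=$rep_pid open=$rep_open max=$rep_max percent=$(fd_percent "$rep_open" "$rep_max")"
}

# Usage (run as the elasticsearch user or root):
#   fd_report "$ES_PID"
```

Reading /proc/<pid>/fd for another user's process normally requires root or the same user, so an empty count usually means a permissions problem rather than zero open files.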
How to diagnose it
1. Confirm FD exhaustion. Run the _cat/nodes and _nodes/stats/process checks. If file_desc.percent is above 90%, or open_file_descriptors is within a few hundred of max_file_descriptors, you have found the bottleneck. Cross-check with the OS-level /proc/<pid>/fd count.
2. Determine whether the limit is too low or usage is too high. If max_file_descriptors is below 65,536, the limit is the problem. If it is above 65,536 but usage is still near the ceiling, the cluster topology is the problem.
3. Correlate with segment and shard count. Use _cat/indices and _cat/nodes to see if segment count per shard is high. Healthy shards typically have 10-50 segments; more than 100 suggests merges are falling behind. Check total segments per node. A node with thousands of segments will consume tens of thousands of file descriptors.
4. Check for transient spikes. Active shard recoveries, snapshots, or force merges can temporarily spike FD usage. Use _cat/recovery and _snapshot/_status to see if a maintenance operation is pushing the node over the edge. If FD usage is high but stable outside of these operations, the baseline is too high.
5. Check logs for the exact failure mode. IOException: Too many open files appearing alongside failed segment opens indicates Lucene cannot create new files. If it appears with transport connection errors, the node cannot accept new cluster or client connections. Both have the same root cause but different operational impact.
6. Check for connection leaks. If segment and shard counts are moderate but FD usage is still high, inspect OS-level connection state. A misconfigured client pool that opens persistent connections without closing them can exhaust descriptors even on a small cluster.
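The first two diagnostic steps reduce to a small decision rule. The sketch below applies the 90% and 65,536 thresholds from the text to a pair of numbers taken from _nodes/stats/process; the function name and the output labels are hypothetical.

```shell
# Classify a node from open_file_descriptors and max_file_descriptors.
classify_fd_state() {
  cls_open=$1
  cls_max=$2
  # Compare open/max against 90% without losing precision to integer division.
  if [ $(( cls_open * 100 )) -le $(( cls_max * 90 )) ]; then
    echo "ok"
  elif [ "$cls_max" -lt 65536 ]; then
    echo "limit-too-low"      # raise the limit first
  else
    echo "usage-too-high"     # look at shards, segments, and connections
  fi
}

# Usage:
#   classify_fd_state 4000 4096       # limit-too-low
#   classify_fd_state 130000 131072   # usage-too-high
```

A node that reports "ok" can still have a limit below 65,536, so this rule is only meant for nodes already suspected of exhaustion.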
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| process.open_file_descriptors vs max_file_descriptors | Percentage of hard limit consumed; there is no graceful degradation beyond 100% | Sustained >80% |
| pri.segments.count per index | Each segment maps to multiple FDs; segment explosion is the leading driver of FD growth | >100 segments per shard |
| segments.count per node | Total segment metadata and open files on the node | Tens of thousands and rising |
| Shard count per node | Each shard carries translog and segment file overhead | Approaching 1,000 per node (default cluster.max_shards_per_node) |
| Merge current and total_time_in_millis | Merges consolidate segments; when they fall behind, segment count and FDs grow | merges.current persistently at max thread count |
| Thread pool rejections / connection errors | FD exhaustion prevents new connections and operations | Nonzero write or search rejections, or transport connect failures |
Fixes
Raise the file descriptor limit
If max_file_descriptors is below 65,536, or if you are running a dense cluster and the limit is under 131,072, raise it. This requires restarting the Elasticsearch process.
- systemd-managed services: create a drop-in override for the service unit. Use LimitNOFILE in the [Service] section. Use systemctl edit elasticsearch to create a persistent override rather than editing the main unit file directly.
- Docker: pass --ulimit nofile=65536:65536 (or higher) when running the container, or set the ulimits block in docker-compose.
- Tarball installations: configure the OS limits.conf or the startup script. systemd services do not read limits.conf; use the unit override instead.
After changing the limit, restart the node and verify via _nodes/stats/process.
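For the systemd case, the drop-in that systemctl edit would create can also be written from a script. This is a minimal sketch: the helper name is hypothetical, and the directory shown in the usage comment assumes the unit is named elasticsearch.service.

```shell
# Write a LimitNOFILE drop-in into the given directory.
write_nofile_override() {
  ovr_dir=$1     # e.g. /etc/systemd/system/elasticsearch.service.d
  ovr_limit=$2   # e.g. 131072
  mkdir -p "$ovr_dir" || return 1
  printf '[Service]\nLimitNOFILE=%s\n' "$ovr_limit" > "$ovr_dir/override.conf"
}

# Usage (as root), followed by a reload and a node restart:
#   write_nofile_override /etc/systemd/system/elasticsearch.service.d 131072
#   systemctl daemon-reload && systemctl restart elasticsearch
```

Taking the directory as a parameter keeps the helper testable against a scratch path before it touches /etc.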
Reduce segment count immediately
For open indices that are no longer written to, force merge to one segment. This reclaims file descriptors and improves search performance.
POST /<index>/_forcemerge?max_num_segments=1
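Warning: do not force merge an index that is still receiving writes. The merge is I/O-intensive, and it can produce very large segments that the automatic merge policy will not revisit until they consist mostly of deleted documents. The force merge API also cannot run on a closed index; if the index is already closed, it is not consuming FDs anyway.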
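To find force-merge candidates, the _cat/indices output can be filtered by average segments per primary shard. This is a sketch under stated assumptions: the function name is hypothetical, the input is expected as three whitespace-separated columns (index, pri, pri.segments.count), and it cannot tell whether an index is still being written to, so that check remains manual.

```shell
# Reads "index pri pri.segments.count" rows on stdin and prints the indices
# whose average segments per primary shard exceeds the given threshold.
high_segment_indices() {
  awk -v limit="$1" '$2 > 0 && $3 / $2 > limit + 0 { print $1 }'
}

# Usage against a live cluster (no header row because ?v is omitted):
#   curl -s 'http://localhost:9200/_cat/indices?h=index,pri,pri.segments.count' \
#     | high_segment_indices 100
```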
Reduce shard density
Close or delete unused indices. For time-series data, ensure ILM is deleting old indices rather than leaving them open. Use the shrink API to reduce shard counts on old indices. The default cluster.max_shards_per_node is 1,000 in Elasticsearch 7.x and later; staying well below that leaves headroom for FDs, heap, and management overhead.
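Shard density per node can be tallied from _cat/shards. The helper below is a minimal sketch with a hypothetical name; it assumes one node name per input line, skips blank lines (unassigned shards), and assumes node names contain no spaces.

```shell
# Reads one node name per line on stdin and prints "node count", sorted by name.
shards_per_node() {
  awk 'NF { count[$1]++ } END { for (n in count) print n, count[n] }' | sort
}

# Usage against a live cluster:
#   curl -s 'http://localhost:9200/_cat/shards?h=node' | shards_per_node
```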
Tune indexing to slow segment creation
If high segment count is driven by a very low refresh_interval, raise it on heavy-write indices. Set refresh_interval to 30s or longer during bulk indexing instead of the default 1s. Fewer refreshes means fewer segments created, which means fewer files open and less merge pressure.
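Changing refresh_interval is a dynamic index setting update. The following sketch wraps the update settings API in a shell function; the function names, host, and index name are placeholders rather than values from this guide.

```shell
# Build the JSON body for an index settings update.
refresh_interval_payload() {
  printf '{"index":{"refresh_interval":"%s"}}' "$1"
}

# Apply the setting to one index (host and index are caller-supplied).
set_refresh_interval() {
  sri_host=$1
  sri_index=$2
  sri_interval=$3
  curl -s -X PUT "$sri_host/$sri_index/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(refresh_interval_payload "$sri_interval")"
}

# Usage:
#   set_refresh_interval http://localhost:9200 my-index 30s
```

Remember to set the interval back to its normal value once bulk indexing finishes, since a long interval delays search visibility of new documents.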
Fix client connection leaks
If FD usage is high but segment count is normal, audit client-side connection pools. Long-lived HTTP keep-alive connections, unclosed transport clients, or load balancers with aggressive health checks can accumulate connections. Reduce keep-alive duration or add client-side connection limits.
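One way to see whether sockets dominate descriptor usage is to classify the symlinks under /proc/<pid>/fd, where socket descriptors point at targets of the form socket:[inode]. This is a Linux-only sketch with a hypothetical function name; it takes the directory as an argument so it can be pointed at any process.

```shell
# Count descriptors in an fd directory and how many of them are sockets.
count_socket_fds() {
  csf_dir=$1
  csf_total=0
  csf_sockets=0
  for csf_entry in "$csf_dir"/*; do
    # Skip entries that vanish or are not symlinks.
    csf_target=$(readlink "$csf_entry") || continue
    csf_total=$(( csf_total + 1 ))
    case $csf_target in
      socket:*) csf_sockets=$(( csf_sockets + 1 )) ;;
    esac
  done
  echo "total=$csf_total sockets=$csf_sockets"
}

# Usage (as root or the elasticsearch user):
#   count_socket_fds "/proc/$ES_PID/fd"
```

If sockets account for most of the total while segment counts are flat, the client and load balancer side is the more likely place to look.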
Prevention
- Set limits generously at deploy time. Configure LimitNOFILE or container ulimits to at least 131,072 on production nodes.
- Monitor FD percentage continuously. Alert when file_desc.percent exceeds 70%. This gives runway before the hard limit.
- Keep segment counts low. Force-merge read-only indices. Maintain refresh_interval appropriate to the workload. Monitor segments.count per node as a first-class resource.
- Control shard sprawl. Target fewer than 500 shards per node. Use ILM to delete or close old indices. Avoid creating indices with excessive primary shard counts.
- Audit connection patterns. Track OS-level connection counts alongside Elasticsearch metrics. If FDs grow while segments stay flat, investigate clients.
How Netdata helps
- Charts process.open_file_descriptors and process.max_file_descriptors from the Elasticsearch API, so FD utilization is visible without manual API calls.
- Correlates FD utilization with segments.count, segments.memory, and shard count per node to distinguish segment-driven exhaustion from connection leaks.
- Alerts on FD utilization thresholds before the kernel returns EMFILE, leaving time to force-merge or scale.
- Cross-references Elasticsearch FD metrics with OS-level file descriptor and socket metrics to identify whether the process or the host is the constraint.
- Surfaces Elasticsearch error logs containing Too many open files alongside the relevant cluster and node metrics.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) – flood stage recovery







