Elasticsearch IOException: Too many open files – file descriptors, segments, and ulimit

IOException: Too many open files means the Elasticsearch process has reached its per-process RLIMIT_NOFILE. Once open_file_descriptors reaches max_file_descriptors, the kernel returns EMFILE on every new open() call. The node cannot create Lucene segments, accept transport or HTTP connections, or maintain cluster membership. Each shard is a Lucene index composed of multiple segment files, and every network connection consumes a descriptor.

This usually follows one of two paths: the initial ulimit was too low for the shard count, or shard and segment growth overwhelmed a previously adequate limit. Raising the limit buys headroom; reducing segment and shard count fixes the root cause.

What this means

Every Elasticsearch index is split into shards, and every shard is a self-contained Lucene index made of immutable segments. Each segment consists of multiple physical files. A single shard with fifty segments can hold hundreds of open files. Multiply by hundreds of shards per node, add inter-node transport connections, HTTP client connections, and recovery file handles, and the total open file descriptor count climbs rapidly.

When open_file_descriptors reaches max_file_descriptors, the next open() fails with EMFILE, which Elasticsearch surfaces as IOException: Too many open files. The node cannot create new segments, accept new connections, or write to the translog. It may reject searches, drop out of the cluster, or fail to restart when the file descriptor bootstrap check rejects the configured limit. Elasticsearch recommends a minimum limit of 65,536, but dense clusters often need 131,072 or higher.
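
As a back-of-the-envelope sketch of that multiplication, the shell function below estimates descriptor demand on one node. The function name and every input figure (files per segment, connection count) are illustrative assumptions, not measured values. Treat the result as a rough upper bound, since not every segment file is necessarily held open at all times.

```shell
# Rough FD estimate: shards * segments_per_shard * files_per_segment + connections.
# All inputs are illustrative assumptions; measure real values on your own nodes.
estimate_fds() {
  shards="$1"; segments_per_shard="$2"; files_per_segment="$3"; connections="$4"
  echo $(( shards * segments_per_shard * files_per_segment + connections ))
}

# Example: 300 shards, 30 segments each, about 10 files per segment, 500 connections
estimate_fds 300 30 10 500   # prints 90500
```

With these assumed inputs the estimate already exceeds a 65,536 limit, which is why dense nodes tend to need the higher ceiling.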

flowchart TD
    A[High shard and segment count] --> B[Multiple open files per segment]
    C[Network connections] --> D[FD consumption rises]
    B --> D
    D --> E{FD usage approaches max?}
    E -->|Yes| F[IOException: Too many open files]
    F --> G[Failed segment opens]
    F --> H[Connection refusals]
    F --> I[Erratic cluster membership]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Too many shards and segments per node | FD percentage climbing steadily; IOException during indexing or search; search latency rising from segment overhead | _cat/nodes for file_desc.percent and segments.count; _cat/indices for pri.segments.count |
| Insufficient OS or container FD limit | Node fails to start, or FD percentage sits above 80% despite moderate shard counts | _nodes/stats/process for max_file_descriptors; OS /proc/<pid>/limits for open files |
| Merge backlog or aggressive refresh interval | Segment count growing monotonically; merge activity cannot keep up; FD count rises faster than shard count | _nodes/stats/indices/merges for current and total_time; index refresh_interval settings |
| Connection accumulation | FD count high but segment count normal; many client or inter-node connections left open | OS-level connection counts via ss or /proc/<pid>/fd |

Quick checks

# Check current and max FDs per node from the cluster API
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.max,file_desc.percent'

# Check detailed process stats for FD usage
curl -s 'http://localhost:9200/_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors'

# Count open FDs on the local OS
# (run as root or as the Elasticsearch user so /proc/<pid>/fd is readable)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -1)
ls "/proc/$ES_PID/fd" | wc -l
grep 'open files' "/proc/$ES_PID/limits"

# Check segment counts per index
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size,pri.segments.count&s=pri.segments.count:desc' | head -20

# Check per-node segment totals and merge activity
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,segments.count,segments.memory,merges.current,merges.total'

# Look for the error in recent logs
grep -i "Too many open files" /var/log/elasticsearch/*.log

# Check if recoveries or snapshots are transiently spiking FD usage
curl -s 'http://localhost:9200/_cat/recovery?v&active_only&h=index,shard,stage,source_host,target_host,bytes_percent'
curl -s 'http://localhost:9200/_snapshot/_status'

How to diagnose it

  1. Confirm FD exhaustion. Run the _cat/nodes and _nodes/stats/process checks. If file_desc.percent is above 90%, or open_file_descriptors is within a few hundred of max_file_descriptors, you have found the bottleneck. Cross-check with the OS-level /proc/<pid>/fd count.

  2. Determine whether the limit is too low or usage is too high. If max_file_descriptors is below 65,536, the limit is the problem. If it is 65,536 or higher but usage is still near the ceiling, the cluster topology is the problem.

  3. Correlate with segment and shard count. Use _cat/indices and _cat/nodes to see if segment count per shard is high. Healthy shards typically have 10-50 segments; more than 100 suggests merges are falling behind. Check total segments per node as well. A node with thousands of segments can consume tens of thousands of file descriptors.

  4. Check for transient spikes. Active shard recoveries, snapshots, or force merges can temporarily spike FD usage. Use _cat/recovery and _snapshot/_status to see if a maintenance operation is pushing the node over the edge. If FD usage is high but stable outside of these operations, the baseline is too high.

  5. Check logs for the exact failure mode. IOException: Too many open files appearing alongside failed segment opens indicates Lucene cannot create new files. If it appears with transport connection errors, the node cannot accept new cluster or client connections. Both have the same root cause but different operational impact.

  6. Check for connection leaks. If segment and shard counts are moderate but FD usage is still high, inspect OS-level connection state. A misconfigured client pool that opens persistent connections without closing them can exhaust descriptors even on a small cluster.
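
Steps 1 and 2 can be sketched as a small shell helper that takes the two numbers reported by _nodes/stats/process. The function name is hypothetical; the 90% and 65,536 thresholds are the ones used in the steps above, and the sketch assumes max is a positive integer.

```shell
# Classify FD pressure from open and max descriptor counts (steps 1 and 2).
# Thresholds mirror the text: 90% usage, 65,536 as the minimum sensible limit.
classify_fd_state() {
  open="$1"; max="$2"
  pct=$(( open * 100 / max ))   # integer percentage, rounded down
  if [ "$pct" -lt 90 ]; then
    echo "ok (${pct}%)"
  elif [ "$max" -lt 65536 ]; then
    echo "exhausted: limit too low (${pct}% of ${max})"
  else
    echo "exhausted: usage too high (${pct}% of ${max})"
  fi
}

# Example values as they might appear in _nodes/stats/process:
classify_fd_state 3900 4096       # low limit
classify_fd_state 128000 131072   # adequate limit, heavy usage
```

A "limit too low" result points at the ulimit fix; a "usage too high" result points at shard, segment, or connection reduction.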

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| process.open_file_descriptors vs max_file_descriptors | Percentage of the limit consumed; there is no graceful degradation once usage hits 100% | Sustained >80% |
| pri.segments.count per index | Each segment maps to multiple FDs; segment explosion is the leading driver of FD growth | >100 segments per shard |
| segments.count per node | Total segment metadata and open files on the node | Tens of thousands and rising |
| Shard count per node | Each shard carries translog and segment file overhead | Approaching 1,000 per node (default cluster.max_shards_per_node) |
| Merge current and total_time_in_millis | Merges consolidate segments; when they fall behind, segment count and FDs grow | merges.current persistently at max thread count |
| Thread pool rejections / connection errors | FD exhaustion prevents new connections and operations | Nonzero write or search rejections, or transport connect failures |
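
The ">100 segments per shard" warning sign can be checked with a short filter over _cat/indices output. This is a minimal sketch: the function name is hypothetical, and it assumes headerless input with the columns index, pri, and pri.segments.count, dividing the segment total by the primary shard count to get an average.

```shell
# Print indices whose average segments per primary shard exceed a threshold.
# Expects lines of "index pri pri.segments.count" on stdin (no header row).
# Lines with missing columns (for example closed indices) are skipped.
flag_segment_heavy() {
  limit="${1:-100}"
  awk -v limit="$limit" 'NF >= 3 && $2 > 0 && ($3 / $2) > limit {
    printf "%s %d\n", $1, $3 / $2
  }'
}

# Usage against a live cluster (host is a placeholder):
# curl -s 'http://localhost:9200/_cat/indices?h=index,pri,pri.segments.count' | flag_segment_heavy 100
```

An average hides skew between shards, so an index that lands just under the threshold may still have one shard well over it.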

Fixes

Raise the file descriptor limit

If max_file_descriptors is below 65,536, or if you are running a dense cluster and the limit is under 131,072, raise it. This requires restarting the Elasticsearch process.

  • systemd-managed services: run systemctl edit elasticsearch to create a persistent drop-in override, and set LimitNOFILE in its [Service] section. Do not edit the main unit file directly.
  • Docker: pass --ulimit nofile=65536:65536 (or higher) when running the container, or set the ulimits block in docker-compose.
  • Tarball installations: raise nofile for the Elasticsearch user in /etc/security/limits.conf, or call ulimit -n in the startup script. systemd services do not read limits.conf; use the unit override instead.

After changing the limit, restart the node and verify via _nodes/stats/process.
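
For the systemd case, the drop-in can also be written non-interactively. The sketch below assumes the unit is named elasticsearch, as in the systemctl edit command above, and uses the drop-in path that systemctl edit creates by default; the function name and the optional directory argument are illustrative.

```shell
# Write a systemd drop-in that raises LimitNOFILE for the Elasticsearch unit.
# The second argument overrides the target directory (handy for a dry run).
write_nofile_dropin() {
  limit="$1"
  dropin_dir="${2:-/etc/systemd/system/elasticsearch.service.d}"
  mkdir -p "$dropin_dir" &&
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dropin_dir/override.conf"
}

# Typical use as root, followed by a reload and restart:
# write_nofile_dropin 131072
# systemctl daemon-reload && systemctl restart elasticsearch
```

Writing to a scratch directory first makes it easy to inspect the generated file before touching /etc.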

Reduce segment count immediately

For open indices that are no longer written to, force merge to one segment. This reclaims file descriptors and improves search performance.

POST /<index>/_forcemerge?max_num_segments=1

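Warning: never force merge a live index that is still receiving writes. Force merge is I/O-intensive, and it can produce very large segments that the automatic merge policy will not pick up again until they consist mostly of deleted documents. The force merge API also cannot run on a closed index; if the index is already closed, it is not consuming FDs anyway.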

Reduce shard density

Close or delete unused indices. For time-series data, ensure ILM is deleting old indices rather than leaving them open. Use the shrink API to reduce shard counts on old indices. The cluster.max_shards_per_node setting defaults to 1,000 in Elasticsearch 7.x and later; staying well below that leaves headroom for FDs, heap, and management overhead.
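
To see how far each node is from that ceiling, shard density can be tallied from _cat/shards output. This is a minimal sketch with a hypothetical function name; it assumes headerless input containing only the node column and node names without spaces.

```shell
# Count shards per node from `_cat/shards?h=node` output on stdin.
# Unassigned shards have an empty node column and are skipped.
shards_per_node() {
  awk 'NF > 0 { count[$1]++ } END { for (n in count) print count[n], n }' | sort -rn
}

# Usage against a live cluster (host is a placeholder):
# curl -s 'http://localhost:9200/_cat/shards?h=node' | shards_per_node
```

The busiest node is listed first, which is usually the one closest to FD exhaustion.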

Tune indexing to slow segment creation

If high segment count is driven by a very low refresh_interval, raise it on heavy-write indices. Set refresh_interval to 30s or longer during bulk indexing instead of the default 1s. Fewer refreshes means fewer segments created, which means fewer files open and less merge pressure.
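
The setting can be changed on a live index through the update index settings API. In this sketch the function names, host, and index name are placeholders; only the index.refresh_interval setting itself comes from the text above.

```shell
# Build the JSON settings payload for a given refresh interval.
refresh_payload() {
  printf '{"index":{"refresh_interval":"%s"}}' "$1"
}

# Apply the interval to one index; host and index name are placeholders.
set_refresh_interval() {
  index="$1"; interval="$2"
  curl -s -X PUT "http://localhost:9200/${index}/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(refresh_payload "$interval")"
}

# Example: relax refresh during a bulk load, then restore the default afterwards.
# set_refresh_interval my-bulk-index 30s
# set_refresh_interval my-bulk-index 1s
```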

Fix client connection leaks

If FD usage is high but segment count is normal, audit client-side connection pools. Long-lived HTTP keep-alive connections, unclosed transport clients, or load balancers with aggressive health checks can accumulate connections. Reduce keep-alive duration or add client-side connection limits.
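
One way to confirm that sockets rather than segment files dominate is to classify the entries under /proc/<pid>/fd. The sketch below is Linux-only, uses a hypothetical function name, and must run as root or as the user that owns the process.

```shell
# Break down a process's open descriptors into sockets, regular paths, and other.
# Reads the symlink targets under /proc/<pid>/fd (Linux only).
fd_breakdown() {
  pid="$1"
  sockets=0; files=0; other=0
  for fd in /proc/"$pid"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      socket:*) sockets=$((sockets + 1)) ;;   # TCP, UDP, and Unix sockets
      /*)       files=$((files + 1)) ;;       # paths on disk, including segment files
      *)        other=$((other + 1)) ;;       # pipes, eventfd, anon inodes
    esac
  done
  echo "sockets=$sockets files=$files other=$other"
}

# Example: inspect the current shell; pass the Elasticsearch PID in practice.
fd_breakdown "$$"
```

A socket count that keeps growing while the file count stays flat is consistent with a connection leak rather than segment growth.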

Prevention

  • Set limits generously at deploy time. Configure LimitNOFILE or container ulimits to at least 131,072 on production nodes.
  • Monitor FD percentage continuously. Alert when file_desc.percent exceeds 70%. This gives runway before the hard limit.
  • Keep segment counts low. Force-merge read-only indices. Maintain refresh_interval appropriate to the workload. Monitor segments.count per node as a first-class resource.
  • Control shard sprawl. Target fewer than 500 shards per node. Use ILM to delete or close old indices. Avoid creating indices with excessive primary shard counts.
  • Audit connection patterns. Track OS-level connection counts alongside Elasticsearch metrics. If FDs grow while segments stay flat, investigate clients.
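
Where no monitoring system is in place yet, the 70% alert can be approximated with a scheduled check. This is a minimal sketch with a hypothetical function name; it assumes headerless _cat/nodes output with the columns name and file_desc.percent.

```shell
# Print nodes whose file_desc.percent exceeds a threshold (default 70).
# Expects lines of "name file_desc.percent" on stdin (no header row).
fd_alert() {
  threshold="${1:-70}"
  awk -v t="$threshold" 'NF >= 2 && ($2 + 0) > t { print $1, $2 "%" }'
}

# Usage from cron or a similar scheduler (host is a placeholder):
# curl -s 'http://localhost:9200/_cat/nodes?h=name,file_desc.percent' | fd_alert 70
```

Empty output means every node is under the threshold; any printed line is a node that needs attention.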

How Netdata helps

  • Charts process.open_file_descriptors and process.max_file_descriptors from the Elasticsearch API, so FD utilization is visible without manual API calls.
  • Correlates FD utilization with segments.count, segments.memory, and shard count per node to distinguish segment-driven exhaustion from connection leaks.
  • Alerts on FD utilization thresholds before the kernel returns EMFILE, leaving time to force-merge or scale.
  • Cross-references Elasticsearch FD metrics with OS-level file descriptor and socket metrics to identify whether the process or the host is the constraint.
  • Surfaces Elasticsearch error logs containing Too many open files alongside the relevant cluster and node metrics.