Kafka Too many open files: file descriptor exhaustion from segments and connections
Producers time out, consumers disconnect, and the broker log shows java.io.IOException: Too many open files. The broker may still run, but it cannot open new log segments or accept additional TCP connections. File descriptor exhaustion is a cliff-edge failure: the broker operates normally until it hits the hard limit, then the data path stops.
Kafka brokers hold a file descriptor for every log segment and every network connection. Each partition maintains an active segment and older retained segments. A broker with thousands of partitions and tens of segments each, plus hundreds of client connections, can hold tens of thousands of open file descriptors. The default Linux per-process limit of 1024 is far below production requirements.
The broker can go from healthy to rejecting produce requests within minutes if a large topic is created, retention is increased, or a client reconnect storm occurs. The fix is rarely a single knob. Determine whether segments or connections dominate FD usage, raise the appropriate limits, and validate the capacity model before scaling.
flowchart TD
A[Broker rejecting connections or logging Too many open files] --> B{Check OpenFileDescriptorCount vs limit}
B -->|Above 75%| C[Count segments and connections]
B -->|Well below limit| D[Look for a different root cause]
C --> E{Which dominates FD usage?}
E -->|Segments| F[Check partition count and log.segment.bytes]
E -->|Connections| G[Check for client leaks or reconnect storms]
F --> H[Increase ulimit or raise segment size]
G --> I[Fix clients and increase ulimit if needed]
What this means
A Kafka broker uses file descriptors for log segment data files, log segment index files, and network connections. Each segment is stored as a data file (.log) with an offset index (.index) and a time index (.timeindex), and each active client connection holds an additional FD. Total FD usage grows linearly with partition count, retention, and client connections.
The broker does not queue segment opens when the FD limit is reached. It cannot roll a new segment, accept a new connection, or read from existing segments if an operation requires a temporary file handle. The result is producer timeouts, consumer disconnections, and hard exceptions in the broker log.
A conservative estimate of the FDs a broker needs is:
(partitions * 3 * segments_per_partition) + connections + 1000
Set the process file descriptor limit to at least 100,000. If the broker is below that threshold and FD usage is climbing, it carries operational risk regardless of current stability.
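As a worked example of the estimate above, the following sketch evaluates the formula for a hypothetical broker. The function name estimate_fds and the sample figures (2,000 partitions, 10 segments per partition, 500 connections) are illustrative assumptions, not values from a real cluster.

```shell
# Conservative FD estimate from the formula above:
# (partitions * 3 * segments_per_partition) + connections + 1000
estimate_fds() {
  partitions=$1
  segments_per_partition=$2
  connections=$3
  echo $(( partitions * 3 * segments_per_partition + connections + 1000 ))
}

# Hypothetical broker: 2,000 partitions, 10 segments each, 500 connections.
estimate_fds 2000 10 500   # prints 61500
```

Against a limit of 100,000, this hypothetical broker would sit at roughly 61% of its budget before any growth in topics or clients.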
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Default ulimit too low | FD count sits near 1024, often on new or containerized brokers | cat /proc/{pid}/limits for Max open files |
| High partition count with many retained segments | FD count grows linearly with topics; segment files dominate /proc/{pid}/fd | Count *.index files under log.dirs |
| Small log.segment.bytes | Many small segments per partition, multiplying FD overhead | log.segment.bytes in server.properties |
| Connection leak or reconnect storm | FD count spikes without a matching increase in segment files | Per-listener connection-count JMX metric or ss output |
| Long retention keeping old segments open | FD count creeps upward over days as data accumulates | Retention settings and log cleaner health |
Quick checks
Run these read-only commands on the affected broker to confirm the scope.
# Check the broker's current FD limit and usage
KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
cat /proc/$KAFKA_PID/limits | grep "Max open files"
ls /proc/$KAFKA_PID/fd | wc -l
# Check FD count and ceiling via JMX
echo "get -b java.lang:type=OperatingSystem OpenFileDescriptorCount" | java -jar jmxterm.jar -l localhost:9999
echo "get -b java.lang:type=OperatingSystem MaxFileDescriptorCount" | java -jar jmxterm.jar -l localhost:9999
# Count segment index files across all log directories
grep '^log.dirs=' /etc/kafka/server.properties | cut -d'=' -f2 | tr ',' '\n' | while read -r d; do [ -n "$d" ] && find "$d" -type f -name "*.index"; done | wc -l
# Check active TCP connections held by the broker process
KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
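# Match the pid= field so port numbers equal to the PID are not counted (ss -p needs root to see other users' processes)
ss -tnp | grep -c "pid=$KAFKA_PID,"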
# Inspect broker logs for the exact exhaustion error
grep -i "too many open files" /var/log/kafka/server.log
# List per-listener connection metrics via JMX
echo "beans -d kafka.server -s type=socket-server-metrics" | java -jar jmxterm.jar -l localhost:9999
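To see directly whether sockets or segment files hold the descriptors, the link targets under /proc/{pid}/fd can be grouped by type. The sketch below is one way to do that; the function name classify_fd_targets is hypothetical, and it assumes segment files follow the usual numeric naming (for example 00000000000000000000.log) and that sockets appear as socket:[inode] targets.

```shell
# Reads one FD link target per line on stdin and prints counts by category.
classify_fd_targets() {
  awk '
    /^socket:/                           { sockets++; next }
    /\/[0-9]+\.(log|index|timeindex)$/   { segments++; next }
                                         { other++ }
    END { printf "sockets=%d segments=%d other=%d\n", sockets, segments, other }
  '
}

# Usage on a broker (read-only; needs permission to read the process's fd directory):
#   for fd in /proc/$KAFKA_PID/fd/*; do readlink "$fd"; done | classify_fd_targets
```

The "other" bucket collects pipes, epoll handles, JAR files, and application logs, so it is normally small compared with the first two.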
How to diagnose it
1. Confirm FD saturation. Compare OpenFileDescriptorCount to MaxFileDescriptorCount. Usage above 75% of the limit is risky. Usage above 95%, together with Too many open files errors in the broker log, confirms FD exhaustion.
2. Determine whether segments or connections dominate. Count *.index files in log.dirs and multiply by two to three for a conservative segment FD floor. If this floor is close to the total FD count, segments dominate. If it is far below the total, connections or other handles are the culprit.
3. Check for recent topology changes. New topics with many partitions, a reduction in log.segment.bytes, or an increase in retention.ms or retention.bytes can rapidly increase segment count. Correlate FD growth with recent changes.
4. Inspect connection metrics. Aggregate connection-count across all listeners via JMX. If connections spiked, check for client reconnect storms, producers or consumers that do not reuse connections, or services that create a new client per request.
5. Review network thread health. A high connection count can saturate network threads before FDs exhaust. Check NetworkProcessorAvgIdlePercent. If it is below 0.3 while FDs are high, the broker has combined connection and thread saturation.
6. Validate the capacity estimate. Using current partition count, average segments per partition, and connection count, evaluate whether the configured ulimit is theoretically sufficient. If the estimate exceeds 80% of the limit, the broker is under-provisioned regardless of other optimizations.
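The first two diagnostic steps reduce to simple arithmetic, sketched below. The function names are hypothetical. The 75% and 95% thresholds come from the steps above, but the rule used to decide that segments "dominate" (the segment floor reaching at least half of all open FDs) is an assumption chosen for illustration, since the steps only say "close to the total".

```shell
# Classify FD headroom: at or above 95% is critical, at or above 75% is risky.
fd_status() {
  open=$1
  max=$2
  pct=$(( open * 100 / max ))
  if [ "$pct" -ge 95 ]; then
    echo "critical ${pct}%"
  elif [ "$pct" -ge 75 ]; then
    echo "risky ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}

# Segment FD floor = index file count * 2 (the lower multiplier from step 2).
# ASSUMPTION: segments dominate when the floor is at least half of all open FDs.
dominant_source() {
  index_files=$1
  open=$2
  segment_floor=$(( index_files * 2 ))
  if [ $(( segment_floor * 2 )) -ge "$open" ]; then
    echo "segments"
  else
    echo "connections-or-other"
  fi
}

# Example with made-up numbers:
#   fd_status 800 1024          -> risky 78%
#   dominant_source 9000 20000  -> segments
```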
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| java.lang:type=OperatingSystem OpenFileDescriptorCount vs MaxFileDescriptorCount | Direct FD headroom measurement | Sustained above 75% of limit |
| kafka.server:type=socket-server-metrics connection-count | Each connection holds an FD | Sudden spike or steady growth above baseline |
| Partition count per broker | Drives segment-derived FDs | Sharp increase or broker above 4,000 partitions |
| Segment index file count in log.dirs | Proxy for segment FDs | Index count * 2-3 approaches 80% of the ulimit budget |
| NetworkProcessorAvgIdlePercent | Saturated network threads cannot accept connections even if FDs remain | Sustained below 0.3 |
Fixes
Raise the file descriptor limit
This is the definitive fix, but it may require a broker restart depending on how the process was started.
- systemd: Set LimitNOFILE=100000 (or higher) in the broker's service unit and run systemctl daemon-reload. A broker restart is required to pick up the new limit.
- Container runtimes: Verify that the container runtime and orchestrator do not cap FDs below the host limit. In Kubernetes, check that init containers or the runtime do not override the ulimit.
- limits.conf: /etc/security/limits.conf is applied by PAM to login sessions, not to systemd services. If the broker is started from a shell or through su, set the nofile limit there as well. Whichever layer actually starts the process determines the effective limit, so confirm it in /proc/{pid}/limits.
Tradeoff: Restarting a broker to apply a new ulimit triggers ISR shrink, leader migration, and page cache cold start. Schedule this during a maintenance window if the broker is not completely down.
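For the systemd case, one common way to set LimitNOFILE without editing the packaged unit file is a drop-in. The sketch below writes such a drop-in; the unit name kafka.service, the file name limits.conf, and the helper name write_nofile_dropin are assumptions, so adjust them to match the actual service.

```shell
# Writes a systemd drop-in that raises the FD limit for a unit.
# ASSUMPTION: the broker runs as a systemd unit named kafka.service.
write_nofile_dropin() {
  dropin_dir=$1      # e.g. /etc/systemd/system/kafka.service.d
  limit=$2           # e.g. 100000
  mkdir -p "$dropin_dir" || return 1
  printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dropin_dir/limits.conf"
}

# Typical sequence (run as root, in a maintenance window):
#   write_nofile_dropin /etc/systemd/system/kafka.service.d 100000
#   systemctl daemon-reload
#   systemctl restart kafka.service
#   systemctl show kafka.service -p LimitNOFILE
```

The final systemctl show call is a quick way to confirm that systemd has picked up the new value before relying on it.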
Reduce segment count
If raising the limit is blocked by platform policy, reduce the number of segments.
- Increase log.segment.bytes (default 1 GiB). Larger segments reduce files per partition, but they reduce retention granularity and can delay compaction.
- Reduce retention.ms or retention.bytes to delete older segments sooner. Only viable if the topic does not require deep historical retention.
- Delete unused topics. Kafka cannot reduce the partition count of an existing topic, so removing partitions means deleting and recreating the topic. This is destructive; verify downstream consumers before dropping data.
Tradeoff: Larger segments increase the minimum data footprint per partition and can delay the log cleaner. Do not reduce segment counts by disabling retention on compacted topics.
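To gauge how much a larger segment size helps, the segment count per partition under size-based retention can be approximated as retention size divided by segment size, rounded up. The sketch below is that rough approximation only: it ignores time-based rolling and compaction, and the function name, topic name, and bootstrap address are hypothetical.

```shell
# Approximate segments per partition under size-based retention:
# ceil(retention_bytes / segment_bytes), never less than 1 (the active segment).
segments_per_partition() {
  retention_bytes=$1
  segment_bytes=$2
  n=$(( (retention_bytes + segment_bytes - 1) / segment_bytes ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# 50 GiB retained per partition:
segments_per_partition 53687091200 268435456    # 256 MiB segments -> 200
segments_per_partition 53687091200 1073741824   # 1 GiB segments   -> 50

# Applying a per-topic override (topic name and address are placeholders):
#   kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#     --entity-type topics --entity-name my-topic \
#     --add-config segment.bytes=1073741824
```

A new segment size only affects segments rolled after the change, so the file count drops gradually as older, smaller segments age out.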
Reduce connection count
If connections dominate FD usage, fix the client layer.
- Consolidate producers and consumers so each process reuses a single client instance across requests.
- Check for connection leaks in micro-batch frameworks that instantiate a new producer per batch without closing it.
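- Review idle connection timeouts. By default the broker closes connections that have been idle for 10 minutes (connections.max.idle.ms=600000), and Java clients close theirs after 9 minutes. Lowering connections.max.idle.ms reclaims FDs from idle clients sooner, but setting it too low causes unnecessary reconnects.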
Tradeoff: Aggressive idle timeouts increase reconnect overhead and can trigger consumer rebalances.
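When connections dominate, grouping them by client address usually points at the offending service. The following sketch counts connections per peer IP from ss output; the function name top_peers is hypothetical, and it assumes the default ss -tn column layout in which the peer address is the fifth field.

```shell
# Reads `ss -tn` output on stdin and prints "count peer_ip", busiest first.
top_peers() {
  awk '$1 != "State" { peer = $5; sub(/:[0-9]+$/, "", peer); print peer }' \
    | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}

# Usage on a broker (9092 is an assumed listener port):
#   ss -tn '( sport = :9092 )' | top_peers | head -n 10
```

A single address holding hundreds of connections often indicates a client that creates a new producer or consumer per request instead of reusing one.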
Restart the broker (last resort)
If the broker is at the hard limit and cannot recover, a controlled restart is the only path. Ensure the new limit is applied before starting the process. Restart recovery time scales with partition count and can take tens of minutes.
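After the restart, it is worth confirming that the running process actually received the new limit rather than trusting the configuration. The sketch below parses the soft limit out of /proc/{pid}/limits; the function names are hypothetical, and it assumes the standard layout of that file, where the soft value is the fourth whitespace-separated field on the "Max open files" line.

```shell
# Reads /proc/<pid>/limits content on stdin and prints the soft "Max open files" value.
soft_nofile() {
  awk '/^Max open files/ { print $4 }'
}

# Succeeds only if the soft limit read from stdin is at least the expected value.
limit_applied() {
  expected=$1
  actual=$(soft_nofile)
  [ -n "$actual" ] && [ "$actual" -ge "$expected" ]
}

# Usage after the broker is back up:
#   limit_applied 100000 < /proc/$KAFKA_PID/limits && echo "limit OK"
```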
Prevention
- Set ulimit at deployment time. Start production brokers with a file descriptor limit of at least 100,000. Do not rely on the Linux default of 1024.
- Monitor FD ratio continuously. Alert when OpenFileDescriptorCount exceeds 75% of MaxFileDescriptorCount. Do not wait for the absolute limit.
- Estimate before scaling topics. Use (partitions * 3 * segments_per_partition) + connections + 1000 to validate that new topics will not push the broker over the limit.
- Tune segment size deliberately. If you lower log.segment.bytes for latency reasons, recalculate the FD budget. Smaller segments multiply file handles.
- Architect clients for connection reuse. Persistent connections and pooled producers prevent connection leaks during traffic spikes.
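The "estimate before scaling topics" check can be run as a pre-flight gate before a topic is created. The sketch below adds the segment term of the formula for the new partition replicas to the broker's current FD count and compares the result with 80% of the limit, the under-provisioning threshold used in the diagnosis steps. The function name topic_fits is hypothetical, and new_partitions is assumed to mean the partition replicas that will land on this particular broker.

```shell
# Succeeds if current open FDs plus the new topic's estimated segment FDs
# stay at or below 80% of the FD limit.
topic_fits() {
  open=$1
  max=$2
  new_partitions=$3
  segments_per_partition=$4
  projected=$(( open + new_partitions * 3 * segments_per_partition ))
  [ $(( projected * 100 )) -le $(( max * 80 )) ]
}

# Example with made-up numbers: 40,000 FDs open, limit 100,000.
#   topic_fits 40000 100000 1000 10 && echo "fits"   # projected 70,000: fits
#   topic_fits 40000 100000 2000 10 || echo "too big" # projected 100,000: too big
```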
How Netdata helps
- Tracks per-process open FDs from /proc alongside JMX-derived OpenFileDescriptorCount and MaxFileDescriptorCount.
- Surfaces connection count and NetworkProcessorAvgIdlePercent to distinguish segment pressure from connection storms.
- Correlates FD growth with broker restarts, topic creations, and partition count changes through historical metrics.
- Alerts on FD utilization rate-of-change.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka __consumer_offsets growing huge: compaction failure on the offsets topic
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
- Kafka disk space planning: retention, replication, and runway estimation







