Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
When a Kafka broker exhausts disk space on a log directory, it does not throttle. It marks that directory offline or crashes entirely. Partitions on the failed directory become unavailable immediately. If all log.dirs fail, the broker exits. You may see under-replicated partitions or producer timeouts seconds before failure, but often the first sign is a hard process exit with No space left on device.
Kafka places each new partition in the log directory that currently holds the fewest partitions, counted by partition and not by bytes, and existing partitions never move automatically. One mount can hit 100% while others sit at 50%, so host-level disk alerts are misleading. Recovery is not a simple matter of freeing a few gigabytes: after an unclean shutdown the broker must scan segments and rebuild indexes on restart, which can take hours on a broker with many partitions.
What this means
In Kafka, each partition is a directory of segment files under one of the paths listed in log.dirs. The active segment receives appends. When the underlying filesystem returns ENOSPC, the broker treats this as a fatal I/O error for that log directory and marks the directory offline. If you configured multiple directories, the broker may continue serving partitions on the surviving directories. Only when every directory fails does the broker shut down completely. Existing partitions are pinned to their original directory, so adding a new, empty directory to log.dirs does not rebalance the data already on disk.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Retention misconfiguration | Disk grows steadily while BytesInPerSec is stable | log.retention.ms and retention.bytes versus actual segment ages and sizes |
| Dead log cleaner | Disk grows on compacted topics, especially __consumer_offsets, despite stable traffic | Log cleaner max-dirty-percent and dead thread count via JMX; broker logs for cleaner errors |
| Partition reassignment | Sudden disk spike during or after a reassignment; old replicas not removed | kafka-reassign-partitions.sh --list for in-progress moves, or --verify with the original reassignment JSON file |
| Uneven log.dirs fill | One mount at 95%, others at 50%; hot topics clustered on one disk | df -h per mount point; directory sizes via kafka-log-dirs.sh --describe |
| Burst producer traffic | BytesInPerSec spike correlating with disk growth | Per-topic BytesInPerSec; producer metrics |
Quick checks
Run these read-only commands to assess the scope without making changes.
# Check free space on every log directory individually
grep '^log.dirs' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | sed 's/^ *//' | while read d; do df -h "$d"; done
# Check for offline log directories via the admin tool
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
# Check broker logs for directory shutdown or ENOSPC messages
grep -iE "no space left|shutting down|offline" /var/log/kafka/server.log
# Inspect JMX for offline directory count
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check under-replicated partitions to measure replication impact
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check for reassignment in progress, which temporarily inflates disk usage
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --list
# Estimate directory size by topic on a single mount (IO-intensive; run with care)
# LOG_DIR is a placeholder: set it to one of the paths from log.dirs first
du -sh "$LOG_DIR"/* 2>/dev/null | sort -rh | head -20
# Check disk I/O latency for the devices backing log.dirs (five one-second samples)
iostat -xz 1 5
# Check for stuck log cleaner threads via JMX
echo "get -b kafka.log:type=LogCleaner,name=DeadThreadCount Value" | java -jar jmxterm.jar -l localhost:9999
flowchart TD
A[Disk alert on broker] --> B{Check all log.dirs mounts}
B -->|One dir full| C[Check partition distribution and reassignment status]
B -->|All dirs full| D[Broker shutdown or crash]
C --> E{Is cleaner healthy?}
E -->|No| F[Dead log cleaner: compacted topics growing]
E -->|Yes| G[Retention misconfig or reassignment leak]
D --> H[Free OS-level space and restart broker]
F --> I[Restart broker to resurrect cleaner]
G --> J[Reduce retention or complete reassignment]
How to diagnose it
- Verify per-directory disk usage. Use df -h on each log.dirs mount independently. Do not trust host-level disk metrics.
- Check broker logs for the failure mode. Look for No space left on device or log directory offline messages to determine whether the failure is partial or total.
- Query OfflineLogDirectoryCount via JMX. A value greater than zero confirms Kafka has marked directories offline.
- Correlate disk growth with traffic. Compare actual per-broker disk usage against the expected figure: (BytesInPerSec * retention_seconds * replication_factor / broker_count) + compaction_overhead, where BytesInPerSec is the cluster-wide produce rate. If actual usage exceeds this, investigate leaks.
- Check for active reassignments. Reassignments temporarily inflate disk usage because both old and new replicas exist until the reassignment completes.
- Inspect compacted topics. Check the log cleaner dirty ratio and dead thread count. If the dirty ratio climbs and dead threads are nonzero, the cleaner has stalled and compacted topics are growing without bound.
- If the broker crashed, inspect logs before restarting. Review the final log lines for the fatal error and verify filesystem mount health.
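The capacity comparison in the list above can be sketched as a small shell function. This is a minimal illustration, not a sizing tool: the function name and argument order are assumptions, the produce rate is taken to be the cluster-wide sum of BytesInPerSec, and compaction overhead is passed in as a plain byte count.

```shell
# Sketch: expected steady-state disk usage per broker, in bytes.
# Argument names are illustrative, not Kafka settings.
#   $1 cluster-wide produce rate in bytes/sec (sum of BytesInPerSec)
#   $2 retention in seconds
#   $3 replication factor
#   $4 broker count
#   $5 compaction overhead in bytes (optional, defaults to 0)
expected_bytes_per_broker() {
  local bytes_in="$1" retention_s="$2" rf="$3" brokers="$4" overhead="${5:-0}"
  echo $(( bytes_in * retention_s * rf / brokers + overhead ))
}

# Example: 50 MB/s cluster-wide, 7-day retention, RF 3, 6 brokers
expected_bytes_per_broker 50000000 604800 3 6
```

The example prints 15120000000000, roughly 15 TB per broker. A broker holding well above its computed figure is a candidate for the leak checks that follow.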
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk space utilization per log.dir | Kafka hits a cliff edge at 100%; per-dir monitoring is essential because allocation is uneven | Ticket at >75%; page at >90% or <4 h runway |
| OfflineLogDirectoryCount | Binary indicator that Kafka has taken a directory offline | Page at >0 |
| Disk usage vs expected steady state | Validates capacity model; unexplained growth points to compaction or reassignment leaks | Actual usage exceeds (BytesInPerSec * retention_seconds * replication_factor / broker_count) + compaction_overhead |
| Log cleaner dirty ratio / dead thread count | Silent cleaner stalls cause unbounded disk growth on compacted topics like __consumer_offsets | Dirty ratio >50% sustained or dead thread count > 0 |
| UnderReplicatedPartitions | Partitions on an offline directory cannot replicate | Nonzero sustained outside maintenance |
| OfflinePartitionsCount | Total unavailability if the last ISR member lost its directory | Nonzero sustained |
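
The disk-related thresholds in the table can be expressed as a single decision function. The sketch below assumes integer inputs that your monitoring already provides; the function name and the ok/ticket/page labels are illustrative and do not come from any Kafka tool.

```shell
# Sketch: map disk signals to an alert level using the thresholds above.
#   $1 used percent of one log.dirs mount (integer)
#   $2 runway in hours until that mount is full (integer)
#   $3 OfflineLogDirectoryCount (optional, defaults to 0)
classify_disk_alert() {
  local used_pct="$1" runway_h="$2" offline_dirs="${3:-0}"
  if [ "$offline_dirs" -gt 0 ] || [ "$used_pct" -gt 90 ] || [ "$runway_h" -lt 4 ]; then
    echo page
  elif [ "$used_pct" -gt 75 ]; then
    echo ticket
  else
    echo ok
  fi
}

classify_disk_alert 80 48    # ticket: above 75% but with ample runway
classify_disk_alert 60 2     # page: runway under 4 hours despite low usage
```

Evaluating runway alongside percentage is the point: a mount at 60% that fills in two hours deserves a page sooner than one sitting steadily at 80%.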
Fixes
Retention misconfiguration
Lower retention.ms or retention.bytes dynamically via kafka-configs.sh. This is destructive: data older than the new threshold is deleted at the next retention check interval (default 5 minutes). Do not lower retention below your consumers’ lag horizon. For compacted topics, retention.bytes is ignored unless you also set cleanup.policy=compact,delete.
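Because the change is destructive, it can help to print the exact command before running it. The wrapper below is a sketch under stated assumptions: the function name, the BOOTSTRAP and APPLY environment variables, and the topic name clickstream are all hypothetical. Only the kafka-configs.sh flags are the tool's own.

```shell
# Sketch: lower retention.ms on one topic, printing the command unless APPLY=1.
# BOOTSTRAP and APPLY are hypothetical environment variables for this example.
set_topic_retention() {
  local topic="$1" retention_ms="$2"
  local bootstrap="${BOOTSTRAP:-localhost:9092}"
  if [ "${APPLY:-0}" = "1" ]; then
    kafka-configs.sh --bootstrap-server "$bootstrap" \
      --entity-type topics --entity-name "$topic" \
      --alter --add-config "retention.ms=$retention_ms"
  else
    # Dry run: show what would be executed
    echo "kafka-configs.sh --bootstrap-server $bootstrap" \
      "--entity-type topics --entity-name $topic" \
      "--alter --add-config retention.ms=$retention_ms"
  fi
}

# Dry run for a hypothetical topic, reducing retention to 24 hours
set_topic_retention clickstream 86400000
```

Review the printed command, confirm the new value is still above your consumers' lag horizon, then rerun with APPLY=1.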
Dead log cleaner
A dead cleaner thread does not recover on its own. The dependable fix is a controlled shutdown and restart of the broker. Tradeoff: the broker is offline for the restart duration, and on return its replicas must catch up and rejoin the ISR (expect elevated UnderReplicatedPartitions until they do). If the cleaner dies again on the same segment, identify the affected topic-partition from the cleaner error in the broker log and consider re-replicating that partition from a healthy replica.
Active reassignment consuming disk
Reassignments copy partitions to new brokers before removing them from old ones. If disk is critically low, either cancel the reassignment (kafka-reassign-partitions.sh --cancel with the original JSON file) or let it complete so that the old replicas are deleted. Do not kill the broker mid-reassignment.
Broker crashed after all log.dirs failed
Free space at the OS level by moving non-Kafka files or expanding the volume, then restart the broker. If the shutdown was unclean, startup runs log recovery: it scans segments that were not cleanly flushed and rebuilds their indexes, which can take a long time. Monitor startup logs; if the broker fails to start, check for corrupt segments.
One directory full among many
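If the broker is still running on other directories, do not manually delete segment files. Use kafka-reassign-partitions.sh to move partitions off the affected broker to others with headroom. While the full directory is still online, the same tool can also move replicas to another log directory on the same broker through the log_dirs field of the reassignment JSON. Reassignment is the only safe way to relocate existing partition data. Plan to add capacity or decommission the saturated directory.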
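The reassignment tool takes its plan as a JSON file. The helper below is a minimal sketch that emits a one-partition plan; the function name, the topic clickstream, and the broker IDs 2, 3 and 4 are hypothetical placeholders for your own values.

```shell
# Sketch: emit a single-partition reassignment plan as JSON.
#   $1 topic name
#   $2 partition number
#   $3 target replica broker IDs, comma-separated, preferred leader first
reassignment_json() {
  local topic="$1" partition="$2" replicas="$3"
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%d,"replicas":[%s]}]}\n' \
    "$topic" "$partition" "$replicas"
}

# Hypothetical example: place partition 0 of "clickstream" on brokers 2, 3 and 4
reassignment_json clickstream 0 2,3,4
```

Redirect the output to a file and pass it to kafka-reassign-partitions.sh with --reassignment-json-file and --execute, then rerun with --verify against the same file to confirm completion.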
Prevention
- Monitor each log.dirs mount point independently. Host-level disk utilization is misleading.
- Calculate expected usage: (BytesInPerSec * retention_seconds * replication_factor / broker_count) + compaction_overhead. If actual usage exceeds this, investigate leaks.
- Monitor log cleaner dirty ratio and dead thread count. A stalled cleaner is a common silent cause of unbounded growth.
- Maintain at least 15-20% free space per volume to accommodate compaction doubling, reassignment copies, and burst traffic.
- Trend disk growth and alert on runway (time-to-full), not just percentage thresholds.
- Review partition distribution after major topology changes. New partitions go to the directory with the fewest partitions, not the one with the most free bytes, so hot topics can still skew usage over time.
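Runway alerting, as recommended in the list above, needs only two usage samples and the mount capacity. The sketch below assumes linear growth between samples and integer byte counts; the function name and argument order are illustrative.

```shell
# Sketch: hours until a mount is full, from two usage samples.
#   $1 used bytes at the first sample
#   $2 used bytes at the second sample
#   $3 seconds between the samples
#   $4 total capacity of the mount in bytes
# Prints "inf" when usage is flat or shrinking.
runway_hours() {
  local used0="$1" used1="$2" interval_s="$3" capacity="$4"
  local growth=$(( used1 - used0 ))
  if [ "$growth" -le 0 ]; then
    echo inf
    return 0
  fi
  # Remaining bytes divided by bytes per second, converted to whole hours
  echo $(( (capacity - used1) * interval_s / growth / 3600 ))
}

# Example: 1 TB mount, grew from 700 GB to 710 GB over one hour
runway_hours 700000000000 710000000000 3600 1000000000000
```

The example prints 29, meaning about 29 hours remain at the current growth rate. Feed it samples taken per log.dirs mount, not per host, so that a single saturated directory is not averaged away.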
How Netdata helps
- Per-mount disk utilization and Kafka JMX metrics (BytesInPerSec, OfflineLogDirectoryCount, UnderReplicatedPartitions) are available together, so you can correlate disk growth with write rate and replication health.
- Runway alerts based on disk usage rate-of-change flag time-to-full trends before static percentage thresholds fire.
- Anomaly detection on BytesInPerSec and disk utilization catches divergent growth that static thresholds miss, such as a dead log cleaner or stuck reassignment.
- Per-broker process monitoring tracks uptime and recovery state, distinguishing controlled restarts from unclean shutdowns after a disk-full event.
- df monitoring evaluates each mount point independently, so uneven log.dirs fill is visible immediately.