Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery

When a Kafka broker exhausts disk space on a log directory, it does not throttle. It marks that directory offline or crashes entirely. Replicas on the failed directory go offline immediately, and any partition with no other in-sync replica becomes unavailable. If all log.dirs fail, the broker exits. You may see under-replicated partitions or producer timeouts seconds before failure, but often the first sign is a hard process exit with No space left on device.

Kafka places each new partition in the log directory that currently holds the fewest partitions, counting partitions rather than bytes, and existing partitions never move automatically. One mount can hit 100% while others sit at 50%, so host-level disk alerts are misleading. Recovery is not a simple matter of freeing a few gigabytes: after an unclean shutdown, the broker must scan segments and rebuild indexes on restart, which can take hours on a large broker.

What this means

In Kafka, each partition is a directory of segment files under one of the paths listed in log.dirs. The active segment receives appends. When the underlying filesystem returns ENOSPC, the broker treats this as a fatal I/O error for that log directory. If you configured multiple directories, the broker may continue serving partitions on the surviving directories. Only when every directory fails does the broker shut down completely. Existing partitions are pinned to their original directory, so adding a new, empty directory to log.dirs does not rebalance existing data.

Common causes

Cause	What it looks like	First thing to check
Retention misconfiguration	Disk grows steadily while `BytesInPerSec` is stable	Broker `log.retention.ms` / `log.retention.bytes` and topic-level `retention.ms` / `retention.bytes` overrides versus actual segment ages and sizes
Dead log cleaner	Disk grows on compacted topics, especially `__consumer_offsets`, despite stable traffic	Log cleaner `max-dirty-percent` and dead thread count via JMX; broker logs for cleaner errors
Partition reassignment	Sudden disk spike during or after a reassignment; old replicas not removed	`kafka-reassign-partitions.sh --list`, or `--verify` with the original reassignment JSON file
Uneven log.dirs fill	One mount at 95%, others at 50%; hot topics clustered on one disk	`df -h` per mount point; directory sizes via `kafka-log-dirs.sh --describe`
Burst producer traffic	`BytesInPerSec` spike correlating with disk growth	Per-topic `BytesInPerSec`; producer metrics

Quick checks

Run these read-only commands to assess the scope without making changes.

# Check free space on every log directory individually
grep '^log\.dirs' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | sed 's/^ *//' | while read -r d; do df -h "$d"; done

# Check for offline log directories via the admin tool
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe

# Check broker logs for directory shutdown or ENOSPC messages
grep -iE "no space left|shutting down|offline" /var/log/kafka/server.log

# Inspect JMX for offline directory count
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check under-replicated partitions to measure replication impact
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# List reassignments in progress, which temporarily inflate disk usage
# (--list is available in recent Kafka releases; --verify also requires --reassignment-json-file)
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --list

# Estimate directory size by topic on a single mount (IO-intensive; run with care)
# Set LOG_DIR to one of the log.dirs paths first
du -sh "$LOG_DIR"/* 2>/dev/null | sort -rh | head -20

# Check disk I/O latency for the devices backing log.dirs
iostat -xz 1

# Check for stuck log cleaner threads via JMX
echo "get -b kafka.log:type=LogCleaner,name=DeadThreadCount Value" | java -jar jmxterm.jar -l localhost:9999

flowchart TD
    A[Disk alert on broker] --> B{Check all log.dirs mounts}
    B -->|One dir full| C[Check partition distribution and reassignment status]
    B -->|All dirs full| D[Broker shutdown or crash]
    C --> E{Is cleaner healthy?}
    E -->|No| F[Dead log cleaner: compacted topics growing]
    E -->|Yes| G[Retention misconfig or reassignment leak]
    D --> H[Free OS-level space and restart broker]
    F --> I[Restart broker to resurrect cleaner]
    G --> J[Reduce retention or complete reassignment]

How to diagnose it

Verify per-directory disk usage. Use df -h on each log.dirs mount independently. Do not trust host-level disk metrics.
Check broker logs for the failure mode. Look for No space left on device or log directory offline messages to determine if the failure is partial or total.
Query OfflineLogDirectoryCount via JMX. A value greater than zero confirms Kafka has marked directories offline.
Correlate disk growth with traffic. Compare actual per-broker disk usage against the expected figure: (cluster-wide BytesInPerSec * retention_seconds * replication_factor / broker_count) + compaction_overhead. If actual usage exceeds this, investigate leaks.
Check for active reassignments. Reassignments temporarily inflate disk usage because both old and new replicas exist until the reassignment completes.
Inspect compacted topics. Check the log cleaner dirty ratio and dead thread count. If the dirty ratio climbs and dead threads are nonzero, the cleaner has stalled and compacted topics are growing without bound.
If the broker crashed, inspect logs before restarting. Review the final log lines for the fatal error and verify filesystem mount health.
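
The expected-usage formula in the steps above can be worked through numerically. The following is a minimal sketch: the function name `expected_bytes_per_broker` and the sample figures are hypothetical, and it assumes the ingress rate is the cluster-wide total and that partitions are spread evenly across brokers.

```shell
# Expected steady-state bytes per broker:
#   (cluster_bytes_in_per_sec * retention_seconds * replication_factor / broker_count)
#   + compaction_overhead_bytes
# Arguments: $1 bytes in per second, $2 retention seconds,
#            $3 replication factor, $4 broker count, $5 overhead bytes (optional)
expected_bytes_per_broker() {
  awk -v b="$1" -v r="$2" -v f="$3" -v n="$4" -v o="${5:-0}" \
    'BEGIN { printf "%.0f\n", (b * r * f / n) + o }'
}

# Hypothetical cluster: 50 MB/s ingress, 7-day retention, RF 3, 6 brokers.
expected_bytes_per_broker 50000000 604800 3 6
```

For these sample inputs the result is 15120000000000 bytes, roughly 15.1 TB per broker. A broker holding noticeably more than its computed figure is a candidate for the leak checks described above.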

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Disk space utilization per log.dir	Kafka hits a cliff edge at 100%; per-dir monitoring is essential because allocation is uneven	>75% ticket; >90% or <4h runway page
OfflineLogDirectoryCount	Binary indicator that Kafka has taken a directory offline	>0 page
BytesInPerSec vs expected steady-state	Validates capacity model; unexplained growth points to compaction or reassignment leaks	Actual exceeds `(BytesInPerSec * retention_seconds * RF / brokers) + compaction overhead`
Log cleaner dirty ratio / dead thread count	Silent cleaner stalls cause unbounded disk growth on compacted topics like `__consumer_offsets`	Dirty ratio >50% sustained or dead thread count > 0
UnderReplicatedPartitions	Partitions on an offline directory cannot replicate	Nonzero sustained outside maintenance
OfflinePartitionsCount	Total unavailability if the last ISR member lost its directory	Nonzero sustained
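
The per-directory disk thresholds from the table can be encoded in a small helper. This is only a sketch of the alerting logic, not a Kafka or Netdata feature: `used_percent` and `disk_alert_level` are hypothetical names, and the runway value is assumed to come from your own trend calculation.

```shell
# Print the used percentage (digits only) of the filesystem backing a path.
used_percent() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Classify one log directory using the thresholds from the table:
#   page   if used > 90% or runway < 4 hours
#   ticket if used > 75%
# Arguments: $1 used percent, $2 runway in hours
disk_alert_level() {
  awk -v p="$1" -v h="$2" 'BEGIN {
    if (p > 90 || h < 4) print "page"
    else if (p > 75)     print "ticket"
    else                 print "ok"
  }'
}

# Example with fixed values: 80% used and 48 hours of runway.
disk_alert_level 80 48
# With a real mount (path is hypothetical):
#   disk_alert_level "$(used_percent /var/lib/kafka/data-1)" 48
```

Run the check once per log.dirs mount rather than once per host, for the reasons given in the table.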

Fixes

Retention misconfiguration

Lower retention.ms or retention.bytes dynamically via kafka-configs.sh. This is destructive: data older than the new threshold is deleted at the next retention check interval (default 5 minutes). Do not lower retention below your consumers’ lag horizon. For compacted topics, retention.bytes is ignored unless you also set cleanup.policy=compact,delete.
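
Because the change is destructive, it can help to print the exact command before running it. The wrapper below is a sketch: the function name, the `BOOTSTRAP` and `APPLY` environment variables, and the topic name are hypothetical, while the `kafka-configs.sh` flags are the standard ones for a topic-level override.

```shell
# Print (default) or run a topic-level retention.ms override.
# Arguments: $1 topic name, $2 new retention in milliseconds
# BOOTSTRAP and APPLY are hypothetical environment variables for this sketch.
set_topic_retention_ms() {
  topic="$1"
  retention_ms="$2"
  # Rebuild the positional parameters as the command to run.
  set -- kafka-configs.sh --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --entity-type topics --entity-name "$topic" \
    --alter --add-config "retention.ms=$retention_ms"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"          # actually change the topic config
  else
    echo "$*"     # dry run: show the command only
  fi
}

# Dry run: show the command that would cut a topic to 24 hours of retention.
set_topic_retention_ms clickstream 86400000
```

Setting `APPLY=1` runs the command instead of printing it. Deletion then happens at the next retention check, so disk usage does not drop instantly.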

Dead log cleaner

The cleaner does not revive dead threads on its own; the dependable fix is a broker restart. Perform a controlled shutdown and restart. Tradeoff: the broker will be offline for the restart duration, and on return its replicas must catch up before they rejoin the ISR (expect elevated UnderReplicatedPartitions). If the cleaner crashes again on the same segment, identify the affected topic partition from the cleaner error in the broker log and consider rebuilding that replica from a healthy one.

Active reassignment consuming disk

Reassignments copy partitions to new brokers before removing them from old ones. If disk is critically low, either cancel the reassignment (kafka-reassign-partitions.sh --cancel with the original JSON file) or let it complete, after which the old replicas are deleted automatically. Do not kill the broker mid-reassignment.

Broker crashed after all log.dirs failed

Free space at the OS level by moving non-Kafka files or expanding the volume, then restart the broker. If the shutdown was unclean, startup runs log recovery: the broker scans segments and rebuilds indexes before it serves traffic, which can take a long time on a large broker. Monitor startup logs; if the broker fails to start, check for corrupt segments.
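
Before restarting, you can estimate whether recovery will be needed by looking for Kafka's clean-shutdown marker in each log directory. The sketch below assumes the marker file is named `.kafka_cleanshutdown`, which is what current Kafka versions write on a controlled shutdown; verify this against your version. The function name and example paths are hypothetical.

```shell
# Report, per log directory, whether the clean-shutdown marker is present.
# Assumption: the marker file is named .kafka_cleanshutdown.
# Arguments: one or more log directory paths
check_clean_shutdown() {
  for dir in "$@"; do
    if [ -f "$dir/.kafka_cleanshutdown" ]; then
      echo "$dir clean"
    else
      echo "$dir recovery-likely"
    fi
  done
}

# Usage (paths are hypothetical):
#   check_clean_shutdown /var/lib/kafka/data-1 /var/lib/kafka/data-2
```

Run it while the broker is stopped. A directory reported as `recovery-likely` suggests a longer startup, so plan the restart window accordingly.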

One directory full among many

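If the broker is still running on other directories, do not manually delete segment files. Use kafka-reassign-partitions.sh to move partitions off the affected broker to others with headroom. If the full directory is still online, the same tool can also move a replica to another log directory on the same broker through the log_dirs field of the reassignment JSON. The tool is the only safe way to relocate existing partition data. Plan to add capacity or decommission the saturated directory.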
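
A reassignment plan for kafka-reassign-partitions.sh is a small JSON file. The helper below is a sketch that emits a plan for a single partition; the function name, topic, and broker ids are hypothetical, and the target replica list is something you must choose from brokers with headroom.

```shell
# Emit a reassignment plan for one partition.
# Arguments: $1 topic, $2 partition number, $3 comma-separated target broker ids
reassignment_json() {
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s]}]}\n' \
    "$1" "$2" "$3"
}

# Example: place partition 0 of a hypothetical topic on brokers 2, 3 and 4.
reassignment_json orders 0 2,3,4

# Typical use (file name is hypothetical):
#   reassignment_json orders 0 2,3,4 > move-orders-0.json
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file move-orders-0.json --execute
```

Keep the JSON file after starting the move: the same file is needed to verify or cancel the reassignment later.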

Prevention

Monitor each log.dirs mount point independently. Host-level disk utilization is misleading.
Calculate expected usage: (BytesInPerSec * retention_seconds * replication_factor) / number_of_brokers + compaction overhead. If actual usage exceeds this, investigate leaks.
Monitor log cleaner dirty ratio and dead thread count. A stalled cleaner is a common silent cause of unbounded growth.
Maintain at least 15-20% free space per volume to accommodate compaction doubling, reassignment copies, and burst traffic.
Trend disk growth and alert on runway (time-to-full), not just percentage thresholds.
Review partition distribution after major topology changes. Kafka places new partitions in the directory with the fewest partitions, not the most free space, so hot topics can still skew usage over time.
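
Runway (time-to-full) can be approximated from two samples of used bytes on a mount. This is a minimal sketch using a simple linear trend; the function name `runway_hours` and the sample figures are hypothetical, and a production alert would smooth over more than two points.

```shell
# Hours until a volume is full, from two samples of used bytes.
# Arguments: $1 capacity bytes, $2 used bytes at the earlier sample,
#            $3 used bytes now, $4 seconds between the two samples
runway_hours() {
  awk -v cap="$1" -v u0="$2" -v u1="$3" -v dt="$4" 'BEGIN {
    rate = (u1 - u0) / dt                 # growth in bytes per second
    if (rate <= 0) { print "inf"; exit }  # flat or shrinking: no deadline
    printf "%.1f\n", (cap - u1) / rate / 3600
  }'
}

# Example: 1 TB volume that grew from 800 GB to 810 GB in one hour.
runway_hours 1000000000000 800000000000 810000000000 3600
```

In this example the volume has 190 GB free and grows 10 GB per hour, which gives 19.0 hours of runway: under a day, even though utilization is only 81%.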

How Netdata helps

Per-mount disk utilization and Kafka JMX metrics (BytesInPerSec, OfflineLogDirectoryCount, UnderReplicatedPartitions) are available together, so you can correlate disk growth with write rate and replication health.
Runway alerts based on disk usage rate-of-change flag time-to-full trends before static percentage thresholds fire.
Anomaly detection on BytesInPerSec and disk utilization catches divergent growth that static thresholds miss, such as a dead log cleaner or stuck reassignment.
Per-broker process monitoring tracks uptime and recovery state, distinguishing controlled restarts from unclean shutdowns after a disk-full event.
df monitoring evaluates each mount point independently, so uneven log.dirs fill is visible immediately.
