Kafka Log directory failed / OfflineLogDirectoryCount > 0: disk errors and JBOD recovery
What this means
When Kafka catches an IOException on a log.dirs path, it marks that log directory offline. The broker increments kafka.log:type=LogManager,name=OfflineLogDirectoryCount and logs the failure. Replicas hosted on the failed directory go offline. If a partition's leader was on that directory, the partition is unavailable until the controller elects a new leader from the remaining ISR; with unclean.leader.election.enable=false, it stays offline when no in-sync replica survives. Producers with acks=all see NotEnoughReplicasException when the surviving ISR drops below min.insync.replicas.
On JBOD hosts with multiple log.dirs, the failure is scoped to the bad disk. The broker stays online while any log directory remains healthy and only shuts down when all configured directories fail, or when only one directory is configured. A broker can therefore present a mix of healthy and unavailable partitions, which is easy to miss in aggregate broker-level dashboards. Per-directory and per-disk metrics are essential.
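The rule above can be expressed as a small classifier. This is a minimal sketch, not Kafka code: `broker_dir_state` is a hypothetical helper, and it assumes you already know how many directories are configured in log.dirs and the current OfflineLogDirectoryCount.

```shell
# broker_dir_state TOTAL_DIRS OFFLINE_DIRS
# Hypothetical helper: maps directory counts to the broker state described above.
broker_dir_state() {
  if [ "$2" -le 0 ]; then
    echo "healthy"
  elif [ "$2" -ge "$1" ]; then
    echo "shutdown-expected"   # no viable log directory remains
  else
    echo "degraded"            # broker up, replicas on the failed dirs offline
  fi
}
```

For example, `broker_dir_state 4 1` reports a degraded broker, while `broker_dir_state 1 1` reports that the broker is expected to shut down.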
flowchart TD
A[Disk I/O error or filesystem corruption] --> B[Kafka catches IOException]
B --> C[Broker marks log directory offline]
C --> D[OfflineLogDirectoryCount increments]
C --> E[Partitions on failed dir lose replicas]
E --> F{ISR above min.insync.replicas?}
F -->|Yes| G[Writes continue with reduced durability]
F -->|No| H[NotEnoughReplicasException or offline partitions]
D --> I[Operator investigates dmesg and disk health]
I --> J{JBOD with surviving dirs?}
J -->|Yes| K[Broker stays online partial degradation]
J -->|No| L[Full broker impact or shutdown]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Failing disk or SSD on a JBOD host | OfflineLogDirectoryCount = 1 on a broker; other dirs healthy; dmesg shows ATA/SCSI errors | dmesg and /proc/diskstats for the specific device |
| Filesystem corruption or remounted read-only | Kernel remounted filesystem read-only after errors; Kafka cannot append to segments | mount output and dmesg for remount events |
| Disk full on one JBOD volume | df shows 100% on one log.dirs path; others have space | Per-directory disk usage, not aggregate |
| RAID rebuild or heavy non-Kafka I/O | Elevated await across all disks; Kafka request latency spikes | iostat -xz 1 for queue depth and latency |
Quick checks
# Confirm offline log directories via JMX
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999 -n
# Check broker logs for log directory failures (message strings vary by Kafka version; adjust the pattern to match yours)
grep -E "Stopping serving logs in dir|Shutdown log directory|Log directory .* failed" /var/log/kafka/server.log
# Inspect kernel disk errors
dmesg | grep -i "error" | tail -n 50
# Check per-directory disk usage
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | sed 's/^ *//' | while read -r d; do df -h "$d"; done
# Check disk I/O latency for each log dir device
iostat -xz 1 5
# Verify partition availability across log directories
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
# Check if the broker process is still serving connections
PID=$(pgrep -f 'kafka\.Kafka' | head -n 1); test -n "$PID" && ss -tnp | grep "pid=${PID}" | wc -l
How to diagnose it
1. Confirm the signal. Read OfflineLogDirectoryCount via JMX or your metrics platform. A value of 1 means one directory is offline; values above 1 mean multiple directories have failed. Check broker logs for "Log directory ... failed" and "Shutdown log directory" to identify which path failed and when.
2. Determine the broker scope. Check whether the broker process is still running and serving requests. If the entire broker is down, verify whether all configured log directories failed, or whether only one path was configured. The broker shuts down when no viable log directories remain.
3. Isolate the hardware failure. Run dmesg for kernel-level disk errors on the device backing the failed directory. Cross-reference the mount point from df or /proc/mounts with the device name. Check iostat -xz 1 for sustained high await or queue depth on that specific device. Healthy JBOD siblings should show normal latency.
4. Assess partition impact. Run kafka-log-dirs.sh --describe to see which partitions were hosted on the offline directory. Cross-reference with kafka-topics.sh --describe --under-replicated-partitions and kafka-topics.sh --describe --unavailable-partitions. If the offline directory held leaders for topics with replication.factor=1, those partitions are fully unavailable.
5. Check for cascading effects. Look at IsrShrinksPerSec, UnderReplicatedPartitions, and UnderMinIsrPartitionCount on this broker and across the cluster. A single bad disk can trigger ISR shrinks that push replication below min.insync.replicas, blocking acks=all producers even though some partitions remain online.
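Cross-referencing log directories with their backing mounts can be scripted. The sketch below is one possible approach, not a Kafka tool: the function names are hypothetical, and it assumes log.dirs is set in a properties file and that the mounts file follows the /proc/mounts layout (device, mount point, filesystem type, options). It flags directories whose backing filesystem is mounted read-only, a common sign that the kernel remounted it after I/O errors.

```shell
# list_log_dirs PROPS_FILE: print one log.dirs entry per line.
list_log_dirs() {
  grep -E '^log\.dirs=' "$1" | cut -d= -f2- | tr ',' '\n' \
    | sed -e 's/^ *//' -e 's/ *$//' | grep -v '^$'
}

# mount_for_dir DIR MOUNTS_FILE: print "device mountpoint options" for the
# longest mount point that is a path prefix of DIR.
mount_for_dir() {
  awk -v dir="$1" '
    {
      mp = $2
      if (mp == "/" || dir == mp || index(dir, mp "/") == 1) {
        if (length(mp) >= best_len) { best_len = length(mp); best = $1 " " mp " " $4 }
      }
    }
    END { if (best != "") print best }
  ' "$2"
}

# report_log_dir_mounts PROPS_FILE MOUNTS_FILE: flag log dirs on read-only mounts.
report_log_dir_mounts() {
  rl_mounts="$2"
  list_log_dirs "$1" | while read -r rl_dir; do
    rl_info=$(mount_for_dir "$rl_dir" "$rl_mounts")
    # The options field is the last space-separated token of rl_info.
    case ",${rl_info##* }," in
      *,ro,*) echo "$rl_dir READ-ONLY $rl_info" ;;
      *)      echo "$rl_dir ok $rl_info" ;;
    esac
  done
}
```

On a broker host this would typically be invoked as `report_log_dir_mounts /etc/kafka/server.properties /proc/mounts`. The device name in the output is the one to look for in dmesg and iostat.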
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| OfflineLogDirectoryCount | Count of offline log directories on the broker | Any nonzero value |
| UnderReplicatedPartitions | Replicas on the failed dir are not being kept in sync | Rising count on brokers that led partitions on the failed disk |
| UnderMinIsrPartitionCount | Confirms writes are being rejected due to insufficient replicas | Nonzero value means acks=all producers are failing |
| OfflinePartitionsCount | Partitions with no available leader | Nonzero means complete unavailability for those partitions |
| IsrShrinksPerSec | Velocity of replicas leaving ISR | Sustained >0 indicates the failure is spreading or persisting |
| Disk I/O await | Root-cause indicator for disk-level degradation | Sustained >20 ms for SSDs or >50 ms for HDDs |
| RequestHandlerAvgIdlePercent | Broker processing capacity | Drop below 0.3 suggests the broker is under pressure from recovery or replication catch-up |
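The await thresholds in the table can be checked against a series of samples. This is a rough sketch under stated assumptions: `sustained_await` is a hypothetical helper, the samples are await values in milliseconds (one per line, however you collect them), and "sustained" is interpreted here as every sample in the window exceeding the threshold.

```shell
# sustained_await MEDIA: read await samples (ms, one per line) from stdin and
# print "warn" only if every sample exceeds the threshold for MEDIA (ssd or hdd).
sustained_await() {
  case "$1" in
    ssd) sa_limit=20 ;;
    hdd) sa_limit=50 ;;
    *)   echo "unknown media type: $1" >&2; return 2 ;;
  esac
  # awk handles fractional values such as 31.5.
  awk -v limit="$sa_limit" '
    NF { n++; if ($1 + 0 > limit) over++ }
    END { if (n > 0 && over == n) print "warn"; else print "ok" }
  '
}
```

A single spike among otherwise normal samples yields `ok`, which keeps short bursts from being mistaken for a degrading disk.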
Fixes
JBOD disk failure with surviving directories
If the broker is online and other log directories are healthy, evacuate the broker rather than attempting hot recovery. An offline log directory cannot be brought back into service without a broker restart.
- Move partition leadership off the affected broker before stopping it. This reduces client impact during the recovery window.
- Stop the Kafka process gracefully. A controlled shutdown gives leaders time to migrate cleanly.
- Replace or repair the failed disk, recreate the filesystem, and remount the log directory path.
- Restart the broker. On startup, it recreates the directory structure. Partitions assigned to this broker re-fetch from their leaders. Expect high UnderReplicatedPartitions and disk I/O as replicas catch up.
- Run preferred replica election to restore the original leadership balance once the broker is fully caught up and back in ISR.
During the rebuild, the broker carries no replicas for the affected directories, so the cluster operates with reduced redundancy. Postpone other maintenance during this window, because a second broker failure could take partitions below min.insync.replicas or offline.
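The stop, restart, and leader-election steps could be wrapped as follows. Treat this as a sketch rather than a runbook: the systemd unit name `kafka`, the `BOOTSTRAP` address, and all function names are assumptions, and the disk replacement itself is a manual step that is not scripted here. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
# Assumed defaults; override through the environment for your deployment.
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"
KAFKA_UNIT="${KAFKA_UNIT:-kafka}"

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Controlled shutdown and restart through the (assumed) systemd unit.
stop_broker()  { run systemctl stop "$KAFKA_UNIT"; }
start_broker() { run systemctl start "$KAFKA_UNIT"; }

# under_replicated_count: number of partitions currently reported as under-replicated.
under_replicated_count() {
  kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe \
    --under-replicated-partitions | grep -c 'Partition:'
}

# restore_preferred_leaders: trigger preferred replica election for all partitions.
restore_preferred_leaders() {
  run kafka-leader-election.sh --bootstrap-server "$BOOTSTRAP" \
    --election-type preferred --all-topic-partitions
}
```

A cautious sequence is `stop_broker`, repair the disk, `start_broker`, wait until `under_replicated_count` returns 0, and only then call `restore_preferred_leaders`.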
Full broker shutdown from log directory failure
If the broker shut down entirely, check whether all log.dirs failed or whether only one directory was configured. If the disk is unrecoverable, provision a replacement host and assign it the same broker ID. The replacement then re-replicates its assigned partitions from the current leaders, so expect a long catch-up period.
Disk full on one JBOD volume
If the directory went offline due to 100% disk utilization rather than hardware failure:
- Verify whether retention or compaction should have reclaimed space. Check log.retention.check.interval.ms and whether the log cleaner thread is alive. Grep logs for cleaner errors and review min.cleanable.dirty.ratio.
- If retention is misconfigured, adjust retention.ms or retention.bytes and restart the broker after freeing space. Changing topic retention affects every partition of the topic, not just those on the full disk.
- If one disk is disproportionately full because of partition placement skew, run kafka-reassign-partitions.sh to move heavy partitions to other disks or brokers.
Expanding a JBOD volume online is OS-dependent. Kafka does not rebalance existing segments across directories automatically.
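Moving a replica to a different directory uses the log_dirs field of the reassignment JSON. The helper below is a hypothetical sketch that builds such a file for a single partition; the topic name, broker IDs, and paths in the usage example are placeholders.

```shell
# reassignment_json TOPIC PARTITION REPLICAS LOG_DIRS
#   REPLICAS: comma-separated broker IDs, e.g. "1,2,3"
#   LOG_DIRS: comma-separated target directories, one per replica in the same
#             order; "any" leaves the choice to that broker
reassignment_json() {
  # Turn "a,b,c" into a","b","c so it can be wrapped in quotes below.
  rj_dirs=$(printf '%s' "$4" | sed -e 's/,/","/g')
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s],"log_dirs":["%s"]}]}\n' \
    "$1" "$2" "$3" "$rj_dirs"
}
```

For instance, `reassignment_json orders 0 1,2,3 /data/2,any,any > move.json` describes moving broker 1's replica of partition 0 to /data/2. The file can then be passed to `kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file move.json --execute`.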
Preventing unclean leader elections during recovery
While a broker is recovering from an offline log directory, never enable unclean.leader.election.enable=true to force availability. Doing so risks data loss by promoting an out-of-sync follower to leader. If partitions are offline because all ISR members are on failed directories, accept the outage and fix the hardware rather than sacrificing durability.
Prevention
- Monitor per-directory disk space and I/O latency, not just aggregate broker metrics. JBOD means one disk can fail silently in cluster-level dashboards.
- Set unclean.leader.election.enable=false and keep it false. Temporary unavailability during disk failure is preferable to silent data loss.
- Keep min.insync.replicas=2 for topics with replication.factor=3 and acks=all. This ensures a single disk failure does not immediately block the write path.
- Avoid placing Kafka under mixed I/O workloads that share JBOD disks with other services. RAID rebuilds, backup jobs, or co-located databases can spike await and trigger false offline events.
- Confirm disk health before major operational changes. A single failing disk during maintenance can halt a broker and block cluster operations.
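The min.insync.replicas recommendation comes down to simple arithmetic: a partition can lose replication.factor minus min.insync.replicas replicas before acks=all writes are rejected. The sketch below uses a hypothetical helper name and assumes all replicas start in sync.

```shell
# tolerated_replica_losses RF MIN_ISR
# Prints how many replicas a partition can lose before acks=all writes fail.
tolerated_replica_losses() {
  if [ "$2" -gt "$1" ]; then
    echo "min.insync.replicas ($2) exceeds replication.factor ($1)" >&2
    return 1
  fi
  echo $(( $1 - $2 ))
}
```

With replication.factor=3 and min.insync.replicas=2 the result is 1, so one failed disk is tolerated. With min.insync.replicas=3 the result is 0, and any single offline log directory blocks acks=all producers for the affected partitions.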
How Netdata helps
- Netdata surfaces OfflineLogDirectoryCount alongside per-disk await and utilization from /proc/diskstats. Correlate these to confirm whether a Kafka-reported failure matches kernel-level disk errors in the same interval.
- The disk latency heatmap and pgmajfault rate help distinguish hardware failure from page cache pressure caused by backfill consumers. Both raise Kafka request latency but have different root causes.
- Service discovery alerts on broker process uptime alongside JMX health metrics make it easier to spot the difference between partial JBOD failure (broker up) and full shutdown (broker down).
- Custom alerts on UnderReplicatedPartitions and IsrShrinksPerSec per broker catch the cascading impact of a single bad disk before OfflinePartitionsCount rises.