Kafka Log directory failed / OfflineLogDirectoryCount > 0: disk errors and JBOD recovery

What this means

When Kafka catches an IOException on a log.dirs path, it marks that log directory offline. The broker increments kafka.log:type=LogManager,name=OfflineLogDirectoryCount and logs the failure. Replicas stored on the failed directory go offline. If a partition's leader replica was on that directory, the partition is unavailable until the controller elects a new leader from the remaining ISR; if no in-sync replica remains and unclean.leader.election.enable=false, it stays offline. Producers with acks=all see NotEnoughReplicasException when the surviving ISR drops below min.insync.replicas.
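
The write-path rule in the last sentence can be sketched as a one-line check. This is an illustration only: acks_all_write_ok is a hypothetical helper name, and the ISR size and min.insync.replicas values are passed in by hand rather than read from the cluster.

```shell
# Hypothetical helper: succeeds when an acks=all produce request can be accepted,
# i.e. the current ISR size is at least min.insync.replicas.
acks_all_write_ok() {
    isr_size="$1"
    min_isr="$2"
    [ "$isr_size" -ge "$min_isr" ]
}

# Worked example for replication.factor=3 and min.insync.replicas=2:
acks_all_write_ok 3 2 && echo "3 in sync: writes accepted"
acks_all_write_ok 2 2 && echo "2 in sync (one replica lost): writes accepted"
acks_all_write_ok 1 2 || echo "1 in sync (two replicas lost): NotEnoughReplicasException"
```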

On JBOD hosts with multiple log.dirs, the failure is scoped to the bad disk. The broker stays online while any log directory remains healthy and only shuts down when all configured directories fail, or when only one directory is configured. A broker can therefore present a mix of healthy and unavailable partitions, which is easy to miss in aggregate broker-level dashboards. Per-directory and per-disk metrics are essential.
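
Because the failure scope follows the log.dirs entries, it helps to have the configured paths as a list before checking anything per disk. The following is a minimal sketch, assuming a plain key=value server.properties file; list_log_dirs is a hypothetical helper name, and the fallback to log.dir reflects the fact that Kafka uses log.dir when log.dirs is unset.

```shell
# list_log_dirs FILE: print each configured Kafka log directory on its own line.
# Runs in a subshell so its variables do not leak into the caller.
list_log_dirs() (
    props="$1"
    # Prefer log.dirs; the last occurrence in the file wins.
    value=$(sed -n 's/^[[:space:]]*log\.dirs[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
    if [ -z "$value" ]; then
        # Fall back to the single-directory setting.
        value=$(sed -n 's/^[[:space:]]*log\.dir[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
    fi
    # Split on commas, trim whitespace around each entry, drop empty entries.
    printf '%s\n' "$value" | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | sed '/^$/d'
)
```

For example, list_log_dirs /etc/kafka/server.properties | wc -l shows whether the broker is running JBOD (more than one line) or a single directory, which decides whether a disk failure degrades the broker or stops it.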

flowchart TD
    A[Disk I/O error or filesystem corruption] --> B[Kafka catches IOException]
    B --> C[Broker marks log directory offline]
    C --> D[OfflineLogDirectoryCount increments]
    C --> E[Partitions on failed dir lose replicas]
    E --> F{ISR at or above min.insync.replicas?}
    F -->|Yes| G[Writes continue with reduced durability]
    F -->|No| H[NotEnoughReplicasException or offline partitions]
    D --> I[Operator investigates dmesg and disk health]
    I --> J{JBOD with surviving dirs?}
    J -->|Yes| K[Broker stays online with partial degradation]
    J -->|No| L[Full broker impact or shutdown]

Common causes

Cause	What it looks like	First thing to check
Failing disk or SSD on a JBOD host	`OfflineLogDirectoryCount` = 1 on a broker; other dirs healthy; dmesg shows ATA/SCSI errors	`dmesg` and `/proc/diskstats` for the specific device
Filesystem corruption or remounted read-only	Kernel remounted filesystem read-only after errors; Kafka cannot append to segments	`mount` output and dmesg for remount events
Disk full on one JBOD volume	`df` shows 100% on one `log.dirs` path; others have space	Per-directory disk usage, not aggregate
RAID rebuild or heavy non-Kafka I/O	Elevated `await` across all disks; Kafka request latency spikes	`iostat -xz 1` for queue depth and latency
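
For the read-only remount case in the table, the mount options can be checked directly. This is a small sketch, assuming a Linux host where /proc/mounts lists device, mount point, filesystem type, and options in that order; readonly_mounts is a hypothetical helper name.

```shell
# readonly_mounts [FILE]: print mount points whose options include "ro".
# FILE defaults to /proc/mounts; a saved copy can be passed for offline analysis.
readonly_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            # Exact match only, so "errors=remount-ro" is not mistaken for "ro".
            if (opts[i] == "ro") { print $2; break }
        }
    }' "${1:-/proc/mounts}"
}
```

Any log.dirs path that appears in the output of readonly_mounts is a likely explanation for an offline directory even when the disk still answers reads.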

Quick checks

# Confirm offline log directories via JMX
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check broker logs for the failure messages (exact wording varies by Kafka version, so adjust the patterns to match your broker)
grep -E "Shutdown log directory|Log directory .* failed|Stopping serving (logs|replicas) in dir" /var/log/kafka/server.log

# Inspect kernel disk errors
dmesg | grep -i "error" | tail -n 50

# Check per-directory disk usage
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | sed 's/^ *//' | while read -r d; do df -h "$d"; done

# Check disk I/O latency for each log dir device
iostat -xz 1 5

# Verify partition availability across log directories
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe

# Check if the broker process is still serving connections
PID=$(pgrep -f 'kafka\.Kafka' | head -n 1); test -n "$PID" && ss -tnp | grep -c "pid=${PID},"

How to diagnose it

Confirm the signal. Read OfflineLogDirectoryCount via JMX or your metrics platform. A value of 1 means one directory is offline; values above 1 mean multiple directories have failed. Check broker logs for Log directory ... failed and Shutdown log directory to identify which path failed and when.
Determine the broker scope. Check whether the broker process is still running and serving requests. If the entire broker is down, verify whether all configured log directories failed, or whether only one path was configured. The broker shuts down when no viable log directories remain.
Isolate the hardware failure. Run dmesg for kernel-level disk errors on the device backing the failed directory. Cross-reference the mount point from df or /proc/mounts with the device name. Check iostat -xz 1 for sustained high await or queue depth on that specific device. Healthy JBOD siblings should show normal latency.
Assess partition impact. Run kafka-log-dirs.sh --describe to see which partitions were hosted on the offline directory. Cross-reference with kafka-topics.sh --describe --under-replicated-partitions and kafka-topics.sh --describe --unavailable-partitions. If the offline directory held leaders for topics with replication.factor=1, those partitions are fully unavailable.
Check for cascading effects. Look at IsrShrinksPerSec, UnderReplicatedPartitions, and UnderMinIsrPartitionCount on this broker and across the cluster. A single bad disk can trigger ISR shrinks that push replication below min.insync.replicas, blocking acks=all producers even though some partitions remain online.
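
The cross-referencing in the hardware step can be sketched as a pair of helpers that map each log directory to the device backing it. The names backing_device and log_dir_devices are hypothetical, and the sketch assumes df can still stat the path; on a badly failed disk df may itself return an error, which is a useful signal too.

```shell
# backing_device DIR: print the device (first df column) that backs DIR.
# -P forces the POSIX single-line format so long device names do not wrap.
backing_device() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# log_dir_devices: read directories from stdin, print "dir<TAB>device" per line.
log_dir_devices() {
    while IFS= read -r dir; do
        printf '%s\t%s\n' "$dir" "$(backing_device "$dir")"
    done
}
```

Feeding it the split log.dirs list from the quick checks gives the device names to look for in dmesg and iostat output.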

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`OfflineLogDirectoryCount`	Number of log directories the broker has marked offline	Any nonzero value, for any duration
`UnderReplicatedPartitions`	Replicas on the failed dir are not being kept in sync	Rising count on brokers that led partitions on the failed disk
`UnderMinIsrPartitionCount`	Confirms writes are being rejected due to insufficient replicas	Nonzero value means `acks=all` producers are failing
`OfflinePartitionsCount`	Partitions with no available leader	Nonzero means complete unavailability for those partitions
`IsrShrinksPerSec`	Velocity of replicas leaving ISR	Sustained >0 indicates the failure is spreading or persisting
Disk I/O `await`	Root-cause indicator for disk-level degradation	Sustained >20 ms for SSDs or >50 ms for HDDs
`RequestHandlerAvgIdlePercent`	Broker processing capacity	Drop below 0.3 suggests the broker is under pressure from recovery or replication catch-up
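
The await thresholds in the table can be turned into a simple classifier for scripted checks. This is a sketch under the table's own numbers (20 ms for SSDs, 50 ms for HDDs); await_status is a hypothetical helper name, and it judges a single sample, so the "sustained" part still has to come from repeating the check over time.

```shell
# await_status TYPE AWAIT_MS: print "warn" or "ok" for one await sample.
# TYPE is "ssd" or "hdd"; AWAIT_MS may be fractional, as printed by iostat.
await_status() {
    case "$1" in
        ssd) limit=20 ;;
        hdd) limit=50 ;;
        *) echo "unknown device type: $1" >&2; return 2 ;;
    esac
    # awk handles the floating-point comparison that plain test cannot.
    awk -v v="$2" -v l="$limit" 'BEGIN { print ((v + 0 > l + 0) ? "warn" : "ok") }'
}
```

For example, await_status ssd 35.2 reports a warning, while the same latency on a spinning disk does not.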

Fixes

JBOD disk failure with surviving directories

If the broker is online and other log directories are healthy, evacuate the broker rather than attempting hot recovery. An offline log directory cannot be brought back into service without a broker restart.

Move partition leadership off the affected broker before stopping it. This reduces client impact during the recovery window.
Stop the Kafka process gracefully. A controlled shutdown gives leaders time to migrate cleanly.
Replace or repair the failed disk, recreate the filesystem, and remount the log directory path.
Restart the broker. On startup, it recreates the directory structure. Partitions assigned to this broker re-fetch from their leaders. Expect high UnderReplicatedPartitions and disk I/O as replicas catch up.
Run preferred replica election to restore the original leadership balance once the broker is fully caught up and back in ISR.
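
The last two steps can be sketched as a wait-then-elect helper. This is an assumption-laden outline, not a tested runbook: urp_count and restore_preferred_leaders are hypothetical names, the script paths are taken from PATH unless overridden, and it waits for the cluster-wide under-replicated count to reach zero, which is a stricter condition than "this broker is back in ISR".

```shell
# Command names are overridable so the sketch can point at another install path.
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"
KAFKA_LEADER_ELECTION="${KAFKA_LEADER_ELECTION:-kafka-leader-election.sh}"

# urp_count BOOTSTRAP: print the number of under-replicated partitions.
# Fails if the describe command itself fails, so an outage is not read as zero.
urp_count() {
    out=$("$KAFKA_TOPICS" --bootstrap-server "$1" --describe --under-replicated-partitions) || return 1
    printf '%s\n' "$out" | grep -c 'Partition:' || true
}

# restore_preferred_leaders BOOTSTRAP: poll until no partition is
# under-replicated, then trigger a preferred leader election for all partitions.
restore_preferred_leaders() {
    while :; do
        n=$(urp_count "$1") || return 1
        [ "$n" -eq 0 ] && break
        sleep "${POLL_SECONDS:-30}"
    done
    "$KAFKA_LEADER_ELECTION" --bootstrap-server "$1" \
        --election-type preferred --all-topic-partitions
}
```

Running restore_preferred_leaders localhost:9092 after the restart blocks until catch-up finishes, so leadership is not moved back onto a broker that is still re-fetching data.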

During the rebuild, the broker carries no replicas for the affected directories, so the cluster operates with reduced replica capacity. Ensure no other broker fails during this window.

Full broker shutdown from log directory failure

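If the broker shut down entirely, check whether all log.dirs failed or whether only one directory was configured. If the disk is unrecoverable, provision a replacement host and assign it the same broker ID. Because partition assignments are tied to the broker ID, the replacement broker re-replicates its partitions from the current leaders without a manual reassignment.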

Disk full on one JBOD volume

If the directory went offline due to 100% disk utilization rather than hardware failure:

Verify whether retention or compaction should have reclaimed space. Check log.retention.check.interval.ms and whether the log cleaner thread is alive. Grep logs for cleaner errors and review min.cleanable.dirty.ratio.
If retention is misconfigured, free space first, then adjust retention.ms or retention.bytes and restart the broker. A topic-level retention change applies to every partition of that topic, not just the ones on the full disk.
If one disk is disproportionately full because of partition placement skew, run kafka-reassign-partitions.sh to move heavy partitions to other disks or brokers.
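
Moving a replica to a different disk uses the log_dirs field of the reassignment JSON, with one entry per replica and "any" for replicas that should stay where they are. The sketch below only generates that file; reassignment_json is a hypothetical helper, and the topic name, partition number, replica list, and target path in the usage note are placeholders.

```shell
# reassignment_json TOPIC PARTITION REPLICAS LOG_DIRS
# REPLICAS and LOG_DIRS are comma-separated and must have the same length.
reassignment_json() {
    # Wrap each comma-separated log dir entry in double quotes.
    dirs=$(printf '%s' "$4" | sed 's/[^,][^,]*/"&"/g')
    printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s],"log_dirs":[%s]}]}\n' \
        "$1" "$2" "$3" "$dirs"
}
```

For example, reassignment_json orders 3 1,2,3 /data/d2,any,any > move.json writes a plan that keeps the replica set and relocates the replica on broker 1 to /data/d2; the file is then passed to kafka-reassign-partitions.sh with --reassignment-json-file move.json --execute.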

Expanding a JBOD volume online is OS-dependent. Kafka does not rebalance existing segments across directories automatically.

Preventing unclean leader elections during recovery

While a broker is recovering from an offline log directory, do not set unclean.leader.election.enable=true to force availability. Doing so risks data loss by promoting an out-of-sync follower to leader. If partitions are offline because all ISR members are on failed directories, accept the outage and fix the hardware rather than sacrificing durability.
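
A quick guard against this setting creeping in during an incident is to check the static broker config before restarting. This sketch only inspects a server.properties file; unclean_election_enabled is a hypothetical helper, and it does not see topic-level overrides or dynamically applied broker configs, which would need to be checked separately.

```shell
# unclean_election_enabled FILE: succeed (exit 0) only when the file sets
# unclean.leader.election.enable to true. A missing key counts as disabled.
unclean_election_enabled() {
    sed -n 's/^[[:space:]]*unclean\.leader\.election\.enable[[:space:]]*=[[:space:]]*//p' "$1" \
        | tail -n 1 | grep -qi '^true[[:space:]]*$'
}
```

A restart script can then refuse to continue with something like: unclean_election_enabled /etc/kafka/server.properties && { echo "unclean election is enabled" >&2; exit 1; }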

Prevention

Monitor per-directory disk space and I/O latency, not just aggregate broker metrics. JBOD means one disk can fail silently in cluster-level dashboards.
Set unclean.leader.election.enable=false and keep it false. Temporary unavailability during disk failure is preferable to silent data loss.
Keep min.insync.replicas=2 for topics with replication.factor=3 and acks=all. This ensures a single disk failure does not immediately block the write path.
Avoid placing Kafka under mixed I/O workloads that share JBOD disks with other services. RAID rebuilds, backup jobs, or co-located databases can spike await and trigger false offline events.
Confirm disk health before major operational changes. A single failing disk during maintenance can halt a broker and block cluster operations.
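
Per-directory space monitoring, the first item above, can be approximated with df when no metrics agent is available. This is a minimal sketch: dir_usage_pct and usage_over are hypothetical helper names, and the 85 percent limit in the usage note is an arbitrary example rather than a recommended value.

```shell
# dir_usage_pct DIR: print the used-space percentage (integer, no % sign)
# of the filesystem that holds DIR.
dir_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# usage_over PCT LIMIT: succeed when PCT is at or above LIMIT.
usage_over() {
    [ "$1" -ge "$2" ]
}
```

Looping over each log.dirs path and calling usage_over "$(dir_usage_pct "$d")" 85 flags the one volume that is filling up even when the broker's total free space looks healthy.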

How Netdata helps

Netdata surfaces OfflineLogDirectoryCount alongside per-disk await and utilization from /proc/diskstats. Correlate these to confirm whether a Kafka-reported failure matches kernel-level disk errors in the same interval.
The disk latency heatmap and pgmajfault rate help distinguish hardware failure from page cache pressure caused by backfill consumers. Both raise Kafka request latency but have different root causes.
Service discovery alerts on broker process uptime alongside JMX health metrics make it easier to spot the difference between partial JBOD failure (broker up) and full shutdown (broker down).
Custom alerts on UnderReplicatedPartitions and IsrShrinksPerSec per broker catch the cascading impact of a single bad disk before OfflinePartitionsCount rises.
