Kafka Log directory failed / OfflineLogDirectoryCount > 0: disk errors and JBOD recovery

What this means

When Kafka catches an IOException on a log.dirs path, it marks that log directory offline. The broker increments kafka.log:type=LogManager,name=OfflineLogDirectoryCount and logs the failure. Partitions with replicas on the failed directory lose those replicas. If the partition leader was on that directory and unclean.leader.election.enable=false, the partition becomes unavailable until the controller elects a new leader from the remaining ISR. Producers with acks=all see NotEnoughReplicasException when the surviving ISR drops below min.insync.replicas.

On JBOD hosts with multiple log.dirs, the failure is scoped to the bad disk. The broker stays online while any log directory remains healthy and only shuts down when all configured directories fail, or when only one directory is configured. A broker can therefore present a mix of healthy and unavailable partitions, which is easy to miss in aggregate broker-level dashboards. Per-directory and per-disk metrics are essential.

flowchart TD
    A[Disk I/O error or filesystem corruption] --> B[Kafka catches IOException]
    B --> C[Broker marks log directory offline]
    C --> D[OfflineLogDirectoryCount increments]
    C --> E[Partitions on failed dir lose replicas]
    E --> F{ISR at or above min.insync.replicas?}
    F -->|Yes| G[Writes continue with reduced durability]
    F -->|No| H[NotEnoughReplicasException or offline partitions]
    D --> I[Operator investigates dmesg and disk health]
    I --> J{JBOD with surviving dirs?}
    J -->|Yes| K[Broker stays online, partial degradation]
    J -->|No| L[Full broker impact or shutdown]
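
The two decision points in the flowchart reduce to simple comparisons. The sketch below is illustrative only: the function names `write_path_status` and `broker_scope` are hypothetical, and the inputs are assumed to come from your own metrics (ISR size for a partition, `OfflineLogDirectoryCount`, and the number of entries in `log.dirs`).

```shell
#!/bin/sh
# Hypothetical helpers that mirror the flowchart's two decisions.

# write_path_status ISR_SIZE MIN_ISR
# acks=all writes are accepted while the ISR has at least
# min.insync.replicas members; an empty ISR means no leader is available.
write_path_status() {
    if [ "$1" -eq 0 ]; then
        echo "partition-offline"
    elif [ "$1" -ge "$2" ]; then
        echo "writes-accepted"
    else
        echo "not-enough-replicas"
    fi
}

# broker_scope OFFLINE_DIRS TOTAL_DIRS
# The broker keeps running only while at least one log directory is healthy.
broker_scope() {
    if [ "$1" -eq 0 ]; then
        echo "healthy"
    elif [ "$1" -lt "$2" ]; then
        echo "partial-degradation"
    else
        echo "broker-shutdown"
    fi
}
```

For example, `write_path_status 1 2` reports `not-enough-replicas`, and `broker_scope 1 1` reports `broker-shutdown` because a single-directory broker has nothing left to serve from.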

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Failing disk or SSD on a JBOD host | OfflineLogDirectoryCount = 1 on a broker; other dirs healthy; dmesg shows ATA/SCSI errors | dmesg and /proc/diskstats for the specific device |
| Filesystem corruption or remounted read-only | Kernel remounted filesystem read-only after errors; Kafka cannot append to segments | mount output and dmesg for remount events |
| Disk full on one JBOD volume | df shows 100% on one log.dirs path; others have space | Per-directory disk usage, not aggregate |
| RAID rebuild or heavy non-Kafka I/O | Elevated await across all disks; Kafka request latency spikes | iostat -xz 1 for queue depth and latency |

Quick checks

# Confirm offline log directories via JMX
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check broker logs for log directory failures
# (message wording varies by Kafka version; confirm these strings against your broker's logs)
grep -E "Shutdown log directory|Log directory .* failed" /var/log/kafka/server.log

# Inspect kernel disk errors
dmesg | grep -i "error" | tail -n 50

# Check per-directory disk usage
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | sed 's/^ *//' | while read -r d; do df -h "$d"; done

# Check disk I/O latency for each log dir device
iostat -xz 1 5

# Verify partition availability across log directories
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe

# Check if the broker process is still serving connections
PID=$(pgrep -f 'kafka\.Kafka' | head -n 1); test -n "$PID" && ss -tnp | grep "pid=${PID}" | wc -l

How to diagnose it

  1. Confirm the signal. Read OfflineLogDirectoryCount via JMX or your metrics platform. A value of 1 means one directory is offline; values above 1 mean multiple directories have failed. Check broker logs for Log directory ... failed and Shutdown log directory to identify which path failed and when.

  2. Determine the broker scope. Check whether the broker process is still running and serving requests. If the entire broker is down, verify whether all configured log directories failed, or whether only one path was configured. The broker shuts down when no viable log directories remain.

  3. Isolate the hardware failure. Run dmesg for kernel-level disk errors on the device backing the failed directory. Cross-reference the mount point from df or /proc/mounts with the device name. Check iostat -xz 1 for sustained high await or queue depth on that specific device. Healthy JBOD siblings should show normal latency.

  4. Assess partition impact. Run kafka-log-dirs.sh --describe to see which partitions were hosted on the offline directory. Cross-reference with kafka-topics.sh --describe --under-replicated-partitions and kafka-topics.sh --describe --unavailable-partitions. If the offline directory held leaders for topics with replication.factor=1, those partitions are fully unavailable.

  5. Check for cascading effects. Look at IsrShrinksPerSec, UnderReplicatedPartitions, and UnderMinIsrPartitionCount on this broker and across the cluster. A single bad disk can trigger ISR shrinks that push replication below min.insync.replicas, blocking acks=all producers even though some partitions remain online.
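
Step 3 depends on knowing which block device backs each log directory. The following is a minimal sketch of that mapping, assuming the broker config lives in a properties file such as the `/etc/kafka/server.properties` path used in the quick checks. The function names are hypothetical, and the parsing assumes device names and mount points without spaces.

```shell
#!/bin/sh
# list_log_dirs FILE: print one log.dirs entry per line.
list_log_dirs() {
    grep '^log\.dirs=' "$1" | cut -d= -f2- | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//' | grep -v '^$'
}

# dir_to_device DIR: print "DIR DEVICE MOUNTPOINT" from POSIX-format df.
dir_to_device() {
    df -P "$1" 2>/dev/null | awk -v d="$1" 'NR == 2 { print d, $1, $NF }'
}

# map_log_dirs FILE: report the device for each directory and flag
# paths that df cannot read, which is itself a hint of a failed mount.
map_log_dirs() {
    list_log_dirs "$1" | while read -r dir; do
        dir_to_device "$dir" | grep . || echo "$dir UNREADABLE -"
    done
}
```

Running `map_log_dirs /etc/kafka/server.properties` gives one line per directory; the device column is what to look for in `dmesg` and `iostat -xz 1`.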

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| OfflineLogDirectoryCount | Number of log directories the broker has taken offline | Any nonzero value |
| UnderReplicatedPartitions | Replicas on the failed dir are not being kept in sync | Rising count on brokers that led partitions on the failed disk |
| UnderMinIsrPartitionCount | Confirms writes are being rejected due to insufficient replicas | Nonzero value means acks=all producers are failing |
| OfflinePartitionsCount | Partitions with no available leader | Nonzero means complete unavailability for those partitions |
| IsrShrinksPerSec | Rate at which replicas leave the ISR | Sustained >0 indicates the failure is spreading or persisting |
| Disk I/O await | Root-cause indicator for disk-level degradation | Sustained >20 ms for SSDs or >50 ms for HDDs |
| RequestHandlerAvgIdlePercent | Broker processing capacity | Drop below 0.3 suggests the broker is under pressure from recovery or replication catch-up |
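
The await thresholds in the table only matter when they are sustained, so a single sample should not fire a warning. One way to express that is sketched below, under the assumption that you already collect await samples in milliseconds (for example from successive `iostat` intervals). The function names and the "every sample exceeds the limit" rule are illustrative choices, not a Kafka or Netdata convention.

```shell
#!/bin/sh
# await_limit MEDIA: print the warning threshold in ms from the table.
await_limit() {
    case "$1" in
        ssd) echo 20 ;;
        hdd) echo 50 ;;
        *)   return 2 ;;
    esac
}

# await_sustained MEDIA SAMPLE_MS...
# Prints "warn" only if every sample exceeds the threshold, "ok" otherwise.
await_sustained() {
    limit=$(await_limit "$1") || { echo "unknown-media"; return 2; }
    shift
    [ "$#" -gt 0 ] || { echo "no-samples"; return 2; }
    # awk compares fractional milliseconds, which [ ] cannot do.
    printf '%s\n' "$@" | awk -v l="$limit" '
        $1 + 0 <= l { below = 1 }
        END { print (below ? "ok" : "warn") }'
}
```

With samples of 25, 31.5, and 40 ms, `await_sustained ssd 25 31.5 40` warns, while the same samples on an HDD stay under the 50 ms limit.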

Fixes

JBOD disk failure with surviving directories

If the broker is online and other log directories are healthy, evacuate the broker rather than attempting hot recovery. An offline log directory cannot be brought back into service without a broker restart.

  1. Move partition leadership off the affected broker. This reduces client impact during the recovery window.
  2. Stop the Kafka process gracefully. A controlled shutdown gives leaders time to migrate cleanly.
  3. Replace or repair the failed disk, recreate the filesystem, and remount the log directory path.
  4. Restart the broker. On startup, it recreates the directory structure. Partitions assigned to this broker re-fetch from their leaders. Expect high UnderReplicatedPartitions and disk I/O as replicas catch up.
  5. Run preferred replica election to restore the original leadership balance once the broker is fully caught up and back in ISR.

During the rebuild, the broker carries no replicas for the affected directories, so the cluster operates with reduced replica capacity. Ensure no other broker fails during this window.
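
Steps 2, 4, and 5 map onto scripts that ship with Kafka. The sketch below wraps them in a dry-run helper so the sequence can be reviewed before anything is executed. It assumes the scripts are on `PATH`, the broker listens on `localhost:9092`, and the config is at `/etc/kafka/server.properties`; the wrapper function names are hypothetical. Note that `kafka-server-stop.sh` signals every Kafka broker process on the host, and that `kafka-leader-election.sh` requires Kafka 2.4 or later.

```shell
#!/bin/sh
# Prints each command unless DRY_RUN=0 is set in the environment.
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"
CONFIG="${CONFIG:-/etc/kafka/server.properties}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 2: controlled shutdown (SIGTERM), which lets leaders migrate.
stop_broker() { run kafka-server-stop.sh; }

# Step 4: restart after the disk is replaced and remounted.
start_broker() { run kafka-server-start.sh -daemon "$CONFIG"; }

# Wait until this returns no partitions before rebalancing leaders.
list_under_replicated() {
    run kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
        --describe --under-replicated-partitions
}

# Step 5: restore preferred leaders across the cluster.
elect_preferred_leaders() {
    run kafka-leader-election.sh --bootstrap-server "$BOOTSTRAP" \
        --election-type preferred --all-topic-partitions
}
```

Calling `elect_preferred_leaders` with the default `DRY_RUN=1` only prints the command, which makes the helper safe to source on a production host.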

Full broker shutdown from log directory failure

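If the broker shut down entirely, check whether all log.dirs failed or whether only one directory was configured. If the disk is unrecoverable, provision a replacement host and assign it the same broker ID. Existing replica assignments still reference that ID, so the replacement broker re-replicates its partitions from the current leaders once it joins the cluster.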

Disk full on one JBOD volume

If the directory went offline due to 100% disk utilization rather than hardware failure:

  1. Verify whether retention or compaction should have reclaimed space. Check log.retention.check.interval.ms and whether the log cleaner thread is alive. Grep logs for cleaner errors and review min.cleanable.dirty.ratio.
  2. If retention is misconfigured, free space first, then adjust retention.ms or retention.bytes and restart the broker. A topic-level retention change applies to every partition of that topic on every broker, not just the replicas on the full disk.
  3. If one disk is disproportionately full because of partition placement skew, run kafka-reassign-partitions.sh to move heavy partitions to other disks or brokers.

Expanding a JBOD volume online is OS-dependent. Kafka does not rebalance existing segments across directories automatically.
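
Moving a replica to another directory on the same broker uses the `log_dirs` field of the reassignment JSON, where `any` leaves a replica's directory unchanged. A minimal sketch of generating that file follows. The function name, the topic `orders`, and the path `/data/kafka-2` are hypothetical; the target must be one of the broker's configured, online `log.dirs`.

```shell
#!/bin/sh
# move_replica_json TOPIC PARTITION REPLICAS BROKER_ID TARGET_DIR
# REPLICAS is the partition's current replica list, e.g. "1,2,3".
# Only BROKER_ID's replica is pinned to TARGET_DIR; the others get "any".
move_replica_json() {
    awk -v topic="$1" -v part="$2" -v replicas="$3" \
        -v broker="$4" -v dir="$5" 'BEGIN {
        n = split(replicas, r, ",")
        for (i = 1; i <= n; i++) {
            dirs = dirs (i > 1 ? "," : "") "\"" (r[i] == broker ? dir : "any") "\""
        }
        printf "{\"version\":1,\"partitions\":[{\"topic\":\"%s\",\"partition\":%d,\"replicas\":[%s],\"log_dirs\":[%s]}]}\n", topic, part, replicas, dirs
    }'
}

# Example (hypothetical topic and path):
#   move_replica_json orders 3 1,2,3 1 /data/kafka-2 > move.json
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#       --reassignment-json-file move.json --execute
```

Keeping the replica list unchanged and varying only `log_dirs` makes this an intra-broker move, so no data crosses the network.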

Preventing unclean leader elections during recovery

While a broker is recovering from an offline log directory, do not set unclean.leader.election.enable=true to force availability. Doing so risks data loss by promoting an out-of-sync follower to leader. If partitions are offline because all ISR members are on failed directories, accept the outage and fix the hardware rather than sacrificing durability.
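
Before and after recovery, it is worth confirming that nobody flipped the setting in the broker's static config. The following sketch reads it from a properties file; the function name is hypothetical. It does not see topic-level overrides or dynamic broker configs, which need to be checked separately. The fallback reflects the broker default of `false` in Kafka 0.11.0 and later.

```shell
#!/bin/sh
# unclean_election_setting FILE
# Prints the last value set in FILE, or the broker default if absent.
unclean_election_setting() {
    value=$(sed -n 's/^[[:space:]]*unclean\.leader\.election\.enable[[:space:]]*[=:][[:space:]]*//p' "$1" \
        | tail -n 1 | tr -d '[:space:]')
    echo "${value:-false (broker default)}"
}
```

For example, `unclean_election_setting /etc/kafka/server.properties` should never print `true` on a cluster that prioritizes durability.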

Prevention

  • Monitor per-directory disk space and I/O latency, not just aggregate broker metrics. JBOD means one disk can fail silently in cluster-level dashboards.
  • Set unclean.leader.election.enable=false and keep it false. Temporary unavailability during disk failure is preferable to silent data loss.
  • Keep min.insync.replicas=2 for topics with replication.factor=3 and acks=all. This ensures a single disk failure does not immediately block the write path.
  • Avoid mixed I/O workloads that share JBOD disks with other services. RAID rebuilds, backup jobs, or co-located databases can spike await, slow Kafka requests, and make a healthy disk look like a failing one.
  • Confirm disk health before major operational changes. A single failing disk during maintenance can halt a broker and block cluster operations.
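
Per-directory space monitoring can start as a small check run from cron or a monitoring agent. The sketch below is one possible shape, with a hypothetical function name and a threshold you choose yourself. It parses POSIX-format `df` output and assumes device names without spaces.

```shell
#!/bin/sh
# dirs_over_threshold PERCENT DIR...
# Prints "DIR USED%" for each directory whose filesystem usage is at or
# above PERCENT. Directories that df cannot read are skipped silently.
dirs_over_threshold() {
    limit=$1; shift
    for dir in "$@"; do
        df -P "$dir" 2>/dev/null | awk -v d="$dir" -v l="$limit" '
            NR == 2 {
                sub(/%/, "", $5)
                if ($5 + 0 >= l + 0) print d, $5 "%"
            }'
    done
}
```

Feeding it every entry from `log.dirs` with a limit such as 85 gives an early warning for the single-volume case that aggregate disk dashboards hide.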

How Netdata helps

  • Netdata surfaces OfflineLogDirectoryCount alongside per-disk await and utilization from /proc/diskstats. Correlate these to confirm whether a Kafka-reported failure matches kernel-level disk errors in the same interval.
  • The disk latency heatmap and pgmajfault rate help distinguish hardware failure from page cache pressure caused by backfill consumers. Both raise Kafka request latency but have different root causes.
  • Service discovery alerts on broker process uptime alongside JMX health metrics make it easier to spot the difference between partial JBOD failure (broker up) and full shutdown (broker down).
  • Custom alerts on UnderReplicatedPartitions and IsrShrinksPerSec per broker catch the cascading impact of a single bad disk before OfflinePartitionsCount rises.