Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker

iostat shows await climbing, and maybe a disk alert has fired. On its own, a high await is not a pageable event, and %util is even less reliable: on SSDs and RAID arrays it can reach 100% under modest load because it measures the share of time the device was busy, not how close it is to saturation. The more useful number is await, the average time an I/O request takes to be served, read in context. At the Kafka layer, its counterpart is LocalTimeMs in the request latency breakdown. Both spike during normal operations: a broker restart with a cold page cache, log compaction, and partition reassignment all drive up disk latency without indicating a hardware fault. This guide shows how to distinguish a transient spike from a slow disk that will shrink your ISR and block produce requests.

What this means

await is the request-weighted average of r_await and w_await. It captures queue time plus service time. Newer sysstat releases may print only the split r_await and w_await columns, which are the ones to use anyway. For Kafka, write latency (w_await) reflects the durability path for produce requests and log flushes. Read latency (r_await) reflects the consumer fetch path. Because Kafka relies on the OS page cache for reads, healthy tail consumers should generate almost no disk reads, and r_await should stay near zero. When r_await jumps, the data being fetched has left the cache.
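
To see why the split matters, here is a minimal sketch of the weighted average, assuming the weights are the read and write request rates (the r/s and w/s columns). The function name and the sample numbers are hypothetical.

```shell
# weighted_await R_PER_S W_PER_S R_AWAIT_MS W_AWAIT_MS
# Prints the request-weighted average latency in milliseconds.
weighted_await() {
    awk -v r="$1" -v w="$2" -v ra="$3" -v wa="$4" 'BEGIN {
        total = r + w
        if (total == 0) { print "0.00"; exit }   # idle device: nothing to average
        printf "%.2f\n", (r * ra + w * wa) / total
    }'
}

# 50 reads/s at 8 ms mixed with 950 writes/s at 1 ms
weighted_await 50 950 8 1
```

With these made-up numbers the combined figure is 1.35 ms, which looks healthy even though every read waits 8 ms. A write-heavy broker can hide a read problem in the average.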

LocalTimeMs measures how long the broker spends processing a request locally, including appending to the log or reading from it. When LocalTimeMs rises alongside OS await, you are looking at the same bottleneck from two perspectives. LocalTimeMs can also rise for non-disk reasons, such as message format conversion. Correlate with RequestQueueTimeMs and RequestHandlerAvgIdlePercent. If LocalTimeMs is high but idle percent is healthy and the request queue is empty, the disk is not the problem.

flowchart TD
    A[OS await elevated] -->|check| B{Broker impact?}
    B -->|idle% low / URP rising| C[LocalTimeMs high]
    B -->|idle% normal| D[Transient]
    C -->|w_await up| E[Disk degradation]
    C -->|r_await up| F[Page cache miss]
    D -->|cause| G[Cold start / compaction / reassignment]
    E -->|confirm| H[LogFlushRateAndTimeMs]
    F -->|confirm| I[pgmajfault / consumer lag]
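
The same decision path can be sketched as a small shell function. This is an illustration under assumptions, not a tuned detector: the function name, the 0.3 idle cutoff, and the idea of passing a single per-host baseline in milliseconds are all hypothetical choices.

```shell
# classify_disk_latency IDLE_PCT URP_COUNT R_AWAIT_MS W_AWAIT_MS BASELINE_MS
# Mirrors the flowchart: broker impact first, then read vs write.
classify_disk_latency() {
    awk -v idle="$1" -v urp="$2" -v ra="$3" -v wa="$4" -v base="$5" 'BEGIN {
        # No broker impact: treat as cold start, compaction, or reassignment.
        if (idle >= 0.3 && urp == 0) { print "transient"; exit }
        # Broker is affected: decide which side of the disk is slow.
        if (wa > base && wa >= ra)   { print "disk-degradation"; exit }
        if (ra > base)               { print "page-cache-miss"; exit }
        print "inconclusive"
    }'
}

# Usage: idle 0.15, 4 under-replicated partitions, r_await 1 ms,
# w_await 40 ms, baseline 5 ms
#   classify_disk_latency 0.15 4 1 40 5
```

An "inconclusive" result means the broker is struggling but neither latency is above baseline, which points away from the disk.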

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk degradation or hardware wear | w_await grows steadily; LocalTimeMs for Produce spikes; LogFlushRateAndTimeMs p99 exceeds 500 ms; RequestHandlerAvgIdlePercent drops below 0.2 | iostat -xz 1 and JMX LogFlushRateAndTimeMs |
| Page cache eviction from backfill consumer | r_await jumps; FetchConsumer LocalTimeMs spikes; BytesOutPerSec rises without BytesInPerSec increase; pgmajfault rate doubles | Consumer group lag and /proc/vmstat pgmajfault |
| Log compaction burst | Transient await spikes; max-dirty-percent climbing; no growth in RequestQueueTimeMs or URP | JMX kafka.log:type=LogCleanerManager,name=max-dirty-percent |
| Cold start or partition reassignment | await high after broker restart or during reassignment; RequestHandlerAvgIdlePercent normal; URP transient | Broker uptime and ReassigningPartitions status |
| Swap pressure from JVM heap | await elevated with swap activity; long GC pauses; si and so visible in vmstat | vmstat 1 and GC logs |

Quick checks

# True disk saturation indicator: await, not %util
iostat -xz 1

# Page cache pressure: take two samples 10 seconds apart and compare
grep pgmajfault /proc/vmstat; sleep 10; grep pgmajfault /proc/vmstat

# Kafka broker impact: request handler idle and queue time
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Local disk processing time
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Cluster durability status
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Disk space on log directories
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | while read -r d; do df -h "$d"; done

# Log cleaner health (compacted topics)
echo "get -b kafka.log:type=LogCleaner,name=DeadThreadCount Value" | java -jar jmxterm.jar -l localhost:9999
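
The pgmajfault counter is cumulative since boot, so only the difference between two samples is meaningful. A minimal sketch for turning two samples into a rate follows; the function names are hypothetical, and the optional file argument exists only so the parser can be tried against a saved copy of /proc/vmstat.

```shell
# pgmajfault_count [FILE]
# Prints the cumulative major page fault counter (default: /proc/vmstat).
pgmajfault_count() {
    awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# faults_per_sec BEFORE AFTER SECONDS
# Converts two counter samples into major faults per second.
faults_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Usage on a Linux host:
#   before=$(pgmajfault_count); sleep 10; after=$(pgmajfault_count)
#   faults_per_sec "$before" "$after" 10
```

Record the rate during a quiet period first. The "doubles" and "2x baseline" signals in this guide only make sense against that number.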

How to diagnose it

  1. Establish the hardware baseline. SSD await should normally be under 5 ms; HDD under 10 ms. Sustained await above 20 ms for SSDs or 50 ms for HDDs is abnormal.
  2. Split reads and writes. Use r_await and w_await. Write spikes point to disk degradation. Read spikes point to page cache misses.
  3. Map to Kafka request type. If w_await is high, check LocalTimeMs for Produce. If r_await is high, check LocalTimeMs for FetchConsumer.
  4. Confirm broker impact. Check RequestHandlerAvgIdlePercent. If it is below 0.2 and falling, or RequestQueueTimeMs is growing, the disk problem is backing up the broker. If idle percent is above 0.5, the spike may be transient background I/O.
  5. Check for transient explanations. Look at broker uptime (under 600 s suggests a cold start), reassignment status, and the compaction dirty ratio. If any of these match, the latency is expected and self-healing.
  6. Check for disk failure signals. OfflineLogDirectoryCount above 0 is a binary failure. LogFlushRateAndTimeMs p99 above 500 ms confirms the write path is struggling.
  7. Identify the consumer culprit for read spikes. Look for consumer groups with lag that is both large and actively shrinking, indicating a backfill. Corroborate with BytesOutPerSec rising without BytesInPerSec.
  8. Check for swap. Run vmstat 1. If si and so are nonzero, JVM heap pages may be swapping to disk, compounding I/O latency. Confirm vm.swappiness is set to 1.
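
Step 1 can be sketched as a threshold check, using the SSD and HDD numbers from the list above. The function name and the "elevated" label for the band between the two thresholds are hypothetical, and the check looks at one sample, so "sustained" still has to be judged across several readings.

```shell
# await_status DEVICE_TYPE AWAIT_MS   (DEVICE_TYPE is "ssd" or "hdd")
# Prints normal, elevated, or abnormal for a single await sample.
await_status() {
    case "$1" in
        ssd) normal=5;  abnormal=20 ;;
        hdd) normal=10; abnormal=50 ;;
        *)   echo "unknown-device-type"; return 1 ;;
    esac
    awk -v v="$2" -v n="$normal" -v a="$abnormal" 'BEGIN {
        if (v < n)      print "normal"
        else if (v > a) print "abnormal"
        else            print "elevated"
    }'
}

# Usage:
#   await_status ssd 12
#   await_status hdd 60
```

Replace the built-in numbers with your own hardware baseline once you have measured it.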

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| iostat await | True disk saturation indicator; %util misleads on SSD and RAID | SSD above 20 ms sustained; HDD above 50 ms sustained |
| r_await vs w_await | Distinguishes read cache misses from write path degradation | w_await up signals disk wear; r_await up signals cache miss |
| LocalTimeMs (Produce / Fetch) | Broker-level view of time spent in local I/O | p99 above 2-3x baseline |
| RequestHandlerAvgIdlePercent | Threads blocking on I/O vs compute | Below 0.3 sustained; below 0.2 critical |
| RequestQueueTimeMs | Queue between network and I/O threads | Growing while idle percent falls |
| UnderReplicatedPartitions | Durability degradation from slow followers | Nonzero above 5 minutes outside maintenance |
| pgmajfault rate | Page cache effectiveness | 2x baseline or higher |
| LogFlushRateAndTimeMs | Fsync latency on log segments | p99 above 500 ms |
| OfflineLogDirectoryCount | Binary disk failure signal | Any nonzero value |

Fixes

Disk degradation

If w_await stays elevated and RequestHandlerAvgIdlePercent is below 0.2, the disk is the bottleneck. Do not bounce the broker as a first fix: it comes back on the same slow disk, has to catch up through a wave of follower fetches, and, if the host itself was rebooted, starts with a cold page cache. Instead, trigger a controlled shutdown and keep the broker out of the data path. This moves leadership away cleanly and lets replicas catch up on healthier brokers. Tradeoff: you will see transient URP during migration. If the disk is JBOD and only one directory is slow, you may be able to move partitions off that specific log directory, but this requires reassignment planning.
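
For the JBOD case, a reassignment plan can name the target log directory per replica. The sketch below only writes and prints such a plan. The topic name, partition number, broker IDs, and directory path are hypothetical placeholders, and the submit command is left commented out so nothing is sent to a cluster by accident.

```shell
# Hypothetical values: topic "orders", partition 3, replicas on brokers 1 and 2,
# and /data/kafka2 as the healthy directory on broker 1.
plan=$(mktemp)
cat > "$plan" <<'EOF'
{
  "version": 1,
  "partitions": [
    {
      "topic": "orders",
      "partition": 3,
      "replicas": [1, 2],
      "log_dirs": ["/data/kafka2", "any"]
    }
  ]
}
EOF

# Review the plan, then submit it (commented out on purpose):
#   kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#       --reassignment-json-file "$plan" --execute
cat "$plan"
```

The log_dirs list lines up position by position with replicas, and "any" leaves the choice of directory to that broker. The copy itself adds write load, so schedule it when the healthy directory has headroom.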

Page cache thrashing

If a backfill consumer is driving r_await and evicting hot data, throttle it with a Kafka client quota on consumer_byte_rate. This caps its read bandwidth without stopping the job. Alternatively, on Kafka 2.4+, enable follower fetching (replica.selector.class on the brokers plus client.rack on the consumer) so backfill reads hit follower replicas instead of the leader. Tradeoff: the backfill takes longer, but tail consumer latency recovers once hot segments are back in the page cache.
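
A minimal sketch of applying the quota follows, assuming the backfill job sets a distinct client.id and that the broker listens on localhost:9092. The function names and the example client ID are hypothetical. The quota value is in bytes per second and is enforced per broker.

```shell
# mib_to_bytes N
# Converts MiB/s to the bytes/s value that consumer_byte_rate expects.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# throttle_consumer CLIENT_ID MIB_PER_SEC
# Sets a fetch bandwidth quota for one client.id.
throttle_consumer() {
    kafka-configs.sh --bootstrap-server localhost:9092 --alter \
        --add-config "consumer_byte_rate=$(mib_to_bytes "$2")" \
        --entity-type clients --entity-name "$1"
}

# Usage (hypothetical client.id):
#   throttle_consumer backfill-job 10
```

Start with a value well below what the disk can serve from cold storage and raise it while watching r_await. The quota takes effect without restarting the consumer.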

Compaction I/O saturation

If log cleaner threads are driving spikes but DeadThreadCount is zero, the disk itself is too slow for the compaction workload. Raising log.cleaner.threads will not help an I/O-bound cleaner and only adds concurrent load. The fix is faster storage or reducing compacted topic throughput. Tradeoff: infrastructure cost.

Silent cleaner failure

If DeadThreadCount is above 0, compaction has stopped. A broker restart resurrects the cleaner thread. Before restarting, grep logs for the root cause. Tradeoff: brief URP while the broker rejoins.

Disk space pressure

If await is high because the disk is above 90% full, emergency retention reduction or volume expansion is required. Be aware that compacted topics free space less predictably: with cleanup.policy=compact alone, retention.bytes and retention.ms do not delete segments, and space is reclaimed only as the cleaner compacts them. Tradeoff: reducing retention risks data loss for consumers that have not caught up.
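
To find which log directory is over the 90% line, a rough sketch like the following can help. It assumes log.dirs is written as a single `log.dirs=...` line with no spaces around the equals sign. The function names are hypothetical.

```shell
# log_dirs PROPERTIES_FILE
# Prints one configured log directory per line.
log_dirs() {
    sed -n 's/^log\.dirs=//p' "$1" | tr ',' '\n'
}

# disk_used_pct PATH
# Prints the used percentage of the filesystem holding PATH, without the % sign.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# check_log_dirs PROPERTIES_FILE
# Flags every log directory whose filesystem is at least 90% full.
check_log_dirs() {
    log_dirs "$1" | while read -r d; do
        pct=$(disk_used_pct "$d")
        [ -n "$pct" ] || { echo "missing $d"; continue; }
        if [ "$pct" -ge 90 ]; then echo "FULL $d ${pct}%"; else echo "ok $d ${pct}%"; fi
    done
}

# Usage:
#   check_log_dirs /etc/kafka/server.properties
```

Directories that share a filesystem report the same percentage, so a single full volume can show up several times.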

Prevention

  • Alert on await, not %util. Set thresholds relative to your hardware baseline.
  • Monitor the Kafka request latency breakdown, not just TotalTimeMs.
  • Set vm.swappiness = 1 to prevent the OS from swapping JVM heap pages.
  • Monitor DeadThreadCount and max-dirty-percent to catch silent compaction failure before disk fills.
  • Maintain at least 15-20% free space on each log.dirs volume to account for compaction doubling and reassignment copies.
  • Run game-day tests: shut down a broker intentionally and measure ISR recovery time and page cache warmup duration. This establishes your real baselines for await during failure.
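
The swappiness setting is easy to verify in a host check. A minimal sketch follows. The function name is hypothetical, and the optional file argument exists only so the check can be exercised without touching the live kernel setting.

```shell
# swappiness_ok [FILE]
# Succeeds when vm.swappiness is 1 or lower (default: the live kernel value).
swappiness_ok() {
    value=$(cat "${1:-/proc/sys/vm/swappiness}")
    [ "$value" -le 1 ]
}

# Usage on a broker host:
#   swappiness_ok || echo "vm.swappiness is too high for a Kafka broker"
#
# To change it at runtime (as root):
#   sysctl -w vm.swappiness=1
# To persist it, add "vm.swappiness = 1" to a file under /etc/sysctl.d/.
```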

How Netdata helps

  • Correlates OS disk await with Kafka LocalTimeMs to show whether a broker-level latency spike matches the disk or a different layer.
  • Shows RequestHandlerAvgIdlePercent and RequestQueueTimeMs per broker to confirm impact before paging.
  • Collects UnderReplicatedPartitions, OfflineLogDirectoryCount, and LogFlushRateAndTimeMs without JMX scripting.
  • Tracks page cache pressure through major page fault metrics, highlighting backfill consumers before they degrade tail latency.
  • Baselines disk latency per broker and flags deviations from historical norms, catching disk wear early.