Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
iostat shows await climbing, and maybe a disk alert has fired. But elevated await alone is not a pageable event. On SSDs and RAID arrays, %util can reach 100% under modest load because it measures the share of time the device had at least one request in flight, not how close the device is to saturation. The more useful signal is await, the average time an I/O request takes to be served. At the Kafka layer, its counterpart is LocalTimeMs in the request latency breakdown. Both spike during normal operations: a broker restart with a cold page cache, log compaction, and partition reassignment all drive up disk latency without indicating a hardware fault. This guide shows how to distinguish a transient spike from a slow disk that will shrink your ISR and block produce requests.
What this means
await is the weighted average of r_await and w_await. It captures queue time plus service time. For Kafka, write latency (w_await) reflects the durability path for produce requests and log flushes. Read latency (r_await) reflects the consumer fetch path. Because Kafka relies on the OS page cache for reads, healthy tail consumers should show r_await near zero. When r_await jumps, data has left the cache.
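Some sysstat releases may print only r_await and w_await and omit the combined await column. In that case the combined figure can be recomputed by hand. The sketch below assumes the weights are the read and write request rates (the r/s and w/s columns of iostat -x); weighted_await is a hypothetical helper name, not part of sysstat.

```shell
# weighted_await R_PER_S W_PER_S R_AWAIT_MS W_AWAIT_MS
# Prints the request-weighted average latency in ms (0.00 when the device is idle).
weighted_await() {
  awk -v rs="$1" -v ws="$2" -v ra="$3" -v wa="$4" 'BEGIN {
    total = rs + ws
    if (total == 0) { printf "%.2f\n", 0; exit }   # no requests, nothing to average
    printf "%.2f\n", (rs * ra + ws * wa) / total
  }'
}
```

For example, 100 reads/s at 2 ms and 300 writes/s at 10 ms give weighted_await 100 300 2 10, which prints 8.00: the write-heavy mix pulls the average toward w_await.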
LocalTimeMs measures how long the broker spends processing a request locally, including appending to the log or reading from it. When LocalTimeMs rises alongside OS await, you are looking at the same bottleneck from two perspectives. LocalTimeMs can also rise for non-disk reasons, such as message format conversion. Correlate with RequestQueueTimeMs and RequestHandlerAvgIdlePercent. If LocalTimeMs is high but idle percent is healthy and the request queue is empty, the disk is not the problem.
flowchart TD
A[OS await elevated] -->|check| B{Broker impact?}
B -->|idle% low / URP rising| C[LocalTimeMs high]
B -->|idle% normal| D[Transient]
C -->|w_await up| E[Disk degradation]
C -->|r_await up| F[Page cache miss]
D -->|cause| G[Cold start / compaction / reassignment]
E -->|confirm| H[LogFlushRateAndTimeMs]
F -->|confirm| I[pgmajfault / consumer lag]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk degradation or hardware wear | w_await grows steadily; LocalTimeMs for Produce spikes; LogFlushRateAndTimeMs p99 exceeds 500 ms; RequestHandlerAvgIdlePercent drops below 0.2 | iostat -xz 1 and JMX LogFlushRateAndTimeMs |
| Page cache eviction from backfill consumer | r_await jumps; FetchConsumer LocalTimeMs spikes; BytesOutPerSec rises without BytesInPerSec increase; pgmajfault rate doubles | Consumer group lag and /proc/vmstat pgmajfault |
| Log compaction burst | Transient await spikes; max-dirty-percent climbing; no growth in RequestQueueTimeMs or URP | JMX kafka.log:type=LogCleanerManager,name=max-dirty-percent |
| Cold start or partition reassignment | await high after broker restart or during reassignment; RequestHandlerAvgIdlePercent normal; URP transient | Broker uptime and ReassigningPartitions status |
| Swap pressure from JVM heap | await elevated with swap activity; long GC pauses; si and so visible in vmstat | vmstat 1 and GC logs |
Quick checks
# True disk saturation indicator: await, not %util
iostat -xz 1
# Page cache pressure: compare two samples 10 seconds apart
grep pgmajfault /proc/vmstat
# Kafka broker impact: request handler idle and queue time
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Local disk processing time
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Cluster durability status
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Disk space on log directories
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | while read -r d; do df -h "$d"; done
# Log cleaner health (compacted topics)
echo "get -b kafka.log:type=LogCleaner,name=DeadThreadCount Value" | java -jar jmxterm.jar -l localhost:9999
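The pgmajfault comparison in the checks above can be turned into a rate, which is easier to hold against a baseline than two raw counters. This is a minimal sketch: read_pgmajfault and fault_rate are hypothetical helper names, and the 10-second interval simply mirrors the comment in the check.

```shell
# read_pgmajfault [FILE]: print the cumulative major-fault counter.
# FILE defaults to /proc/vmstat; pass a saved snapshot to analyze it offline.
read_pgmajfault() {
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# fault_rate FIRST SECOND SECONDS: print major faults per second between two samples.
fault_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Typical use on a broker host:
#   first=$(read_pgmajfault); sleep 10; second=$(read_pgmajfault)
#   fault_rate "$first" "$second" 10
```

Record the rate during a quiet period first; a reading of roughly twice that baseline is the warning level this guide uses.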
How to diagnose it
- Establish the hardware baseline. SSD await should normally be under 5 ms; HDD under 10 ms. Sustained await above 20 ms for SSDs or 50 ms for HDDs is abnormal.
- Split reads and writes. Use r_await and w_await. Write spikes point to disk degradation. Read spikes point to page cache misses.
- Map to Kafka request type. If w_await is high, check LocalTimeMs for Produce. If r_await is high, check LocalTimeMs for FetchConsumer.
- Confirm broker impact. Check RequestHandlerAvgIdlePercent. If it is below 0.2 and falling, or RequestQueueTimeMs is growing, the disk problem is backing up the broker. If idle percent is above 0.5, the spike may be transient background I/O.
- Check for transient explanations. Look at broker uptime (cold start under 600 s), reassignment status, and compaction dirty ratio. If any of these match, the latency is expected and self-healing.
- Check for disk failure signals. OfflineLogDirectoryCount above 0 is a binary failure. LogFlushRateAndTimeMs p99 above 500 ms confirms the write path is struggling.
- Identify the consumer culprit for read spikes. Look for consumer groups with lag that is both large and actively shrinking, indicating a backfill. Corroborate with BytesOutPerSec rising without BytesInPerSec.
- Check for swap. Run vmstat 1. If si and so are nonzero, JVM heap pages may be swapping to disk, compounding I/O latency. Confirm vm.swappiness is set to 1.
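The first four steps reduce to a small decision function. The sketch below is one possible encoding, not a definitive rule set: triage is a hypothetical name, the 0.2 and 0.5 idle thresholds and the 20 ms default come from the steps above, and the "watch" band between them, along with the label strings, is an assumption added for illustration.

```shell
# triage IDLE W_AWAIT_MS R_AWAIT_MS [LIMIT_MS]
# IDLE is RequestHandlerAvgIdlePercent as a 0..1 fraction.
# LIMIT_MS defaults to 20 (the SSD threshold); pass 50 for HDDs.
triage() {
  awk -v idle="$1" -v w="$2" -v r="$3" -v limit="${4:-20}" 'BEGIN {
    if (idle + 0 > 0.5)             print "transient"        # handlers mostly idle
    else if (idle + 0 >= 0.2)       print "watch"            # assumed middle band
    else if (w + 0 > limit + 0)     print "disk-degradation" # write path is slow
    else if (r + 0 > limit + 0)     print "page-cache-miss"  # reads are hitting disk
    else                            print "not-disk"         # handlers busy, disk fine
  }'
}
```

For instance, triage 0.1 35 1 prints disk-degradation, while the same latencies with idle at 0.7 print transient. The remaining steps (uptime, reassignment, LogFlushRateAndTimeMs, swap) still need to be checked by hand to confirm whichever label comes out.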
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| iostat await | True disk saturation indicator; %util misleads on SSD and RAID | SSD above 20 ms sustained; HDD above 50 ms sustained |
| r_await vs w_await | Distinguishes read cache misses from write path degradation | w_await up signals disk wear; r_await up signals cache miss |
| LocalTimeMs (Produce / Fetch) | Broker-level view of time spent in local I/O | p99 above 2-3x baseline |
| RequestHandlerAvgIdlePercent | Threads blocking on I/O vs compute | Below 0.3 sustained; below 0.2 critical |
| RequestQueueTimeMs | Queue between network and I/O threads | Growing while idle percent falls |
| UnderReplicatedPartitions | Durability degradation from slow followers | Nonzero above 5 minutes outside maintenance |
| pgmajfault rate | Page cache effectiveness | 2x baseline or higher |
| LogFlushRateAndTimeMs | Fsync latency on log segments | p99 above 500 ms |
| OfflineLogDirectoryCount | Binary disk failure signal | Any nonzero value |
Fixes
Disk degradation
If w_await is sustained and RequestHandlerAvgIdlePercent is below 0.2, the disk is the bottleneck. Do not restart the broker as a first fix; restarting loses the page cache and generates a wave of follower fetches. Instead, trigger a controlled shutdown to take the broker out of the data path. This moves leadership away cleanly and lets replicas catch up on healthier brokers. Tradeoff: you will see transient URP during migration. If the disk is JBOD and only one directory is slow, you may be able to move partitions off that specific log directory, but this requires reassignment planning.
Page cache thrashing
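If a backfill consumer is driving r_await and evicting hot data, throttle it with a Kafka quota on consumer_byte_rate. Quotas are keyed by client ID or user rather than by consumer group, and each broker enforces them independently, so the backfill job needs a distinct client.id. This caps read bandwidth without stopping the job. Alternatively, if running Kafka 2.4+, enable follower fetching (a rack-aware replica.selector.class and broker.rack on the brokers, client.rack on the consumer) so backfill reads hit follower replicas instead of the leader. Tradeoff: backfill takes longer, but tail consumer latency should recover once hot segments are back in the page cache.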
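As a rough sketch of the quota approach, the helper below only prints a kafka-configs.sh command so it can be reviewed before anyone runs it. quota_command is a hypothetical name, and the client ID, the 10 MB/s figure, and the bootstrap address in the example are placeholders. It assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas.

```shell
# quota_command CLIENT_ID MEGABYTES_PER_SEC [BOOTSTRAP]
# Prints (does not execute) a command that caps fetch bandwidth for one client ID.
quota_command() {
  bytes=$(( $2 * 1024 * 1024 ))   # consumer_byte_rate is expressed in bytes per second
  printf "kafka-configs.sh --bootstrap-server %s --alter --add-config 'consumer_byte_rate=%s' --entity-type clients --entity-name %s\n" \
    "${3:-localhost:9092}" "$bytes" "$1"
}
```

Calling quota_command backfill-job 10 prints a command with consumer_byte_rate=10485760. Because the limit applies per broker, the effective cluster-wide cap depends on how many brokers the client fetches from.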
Compaction I/O saturation
If log cleaner threads are driving spikes but DeadThreadCount is zero, the disk itself is too slow for the compaction workload. Adding log.cleaner.threads will not help an I/O-bound cleaner. The fix is faster storage or reducing compacted topic throughput. Tradeoff: infrastructure cost.
Silent cleaner failure
If DeadThreadCount is above 0, compaction has stopped. A broker restart resurrects the cleaner thread. Before restarting, grep logs for the root cause. Tradeoff: brief URP while the broker rejoins.
Disk space pressure
If await is high because the disk is above 90% full, emergency retention reduction or volume expansion is required. Be aware that compacted topics free space less predictably under retention.bytes because segments must be compacted before deletion. Tradeoff: reducing retention risks data loss for consumers that have not caught up.
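To spot volumes past the 90% mark quickly, the POSIX-format output of df can be filtered. This is a minimal sketch: full_volumes is a hypothetical helper, it assumes df -P column order (capacity in column 5, mount point in column 6), and it assumes mount points without spaces.

```shell
# full_volumes LIMIT_PERCENT
# Reads `df -P` output on stdin and prints "mountpoint percent" for every
# filesystem at or above the limit.
full_volumes() {
  awk -v limit="$1" 'NR > 1 {          # skip the header row
    used = $5
    sub(/%/, "", used)                 # "91%" -> "91"
    if (used + 0 >= limit + 0) print $6, used
  }'
}

# Typical use, with placeholder log directories:
#   df -P /data/kafka-1 /data/kafka-2 | full_volumes 90
```

Running it with a lower limit, such as 80, gives earlier warning and lines up with the free-space headroom recommended under Prevention.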
Prevention
- Alert on await, not %util. Set thresholds relative to your hardware baseline.
- Monitor the Kafka request latency breakdown, not just TotalTimeMs.
- Set vm.swappiness = 1 to prevent the OS from swapping JVM heap pages.
- Monitor DeadThreadCount and max-dirty-percent to catch silent compaction failure before disk fills.
- Maintain at least 15-20% free space on each log.dirs volume to account for compaction doubling and reassignment copies.
- Run game-day tests: shut down a broker intentionally and measure ISR recovery time and page cache warmup duration. This establishes your real baselines for await during failure.
How Netdata helps
- Correlates OS disk await with Kafka LocalTimeMs to show whether a broker-level latency spike matches the disk or a different layer.
- Shows RequestHandlerAvgIdlePercent and RequestQueueTimeMs per broker to confirm impact before paging.
- Collects UnderReplicatedPartitions, OfflineLogDirectoryCount, and LogFlushRateAndTimeMs without JMX scripting.
- Tracks page cache pressure through major page fault metrics, highlighting backfill consumers before they degrade tail latency.
- Baselines disk latency per broker and flags deviations from historical norms, catching disk wear early.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind







