Kafka LogFlushRateAndTimeMs high: fsync latency and a failing disk
When kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs p99 stays above 200 ms, the broker’s fsync path is slow. On SSD-backed clusters, p99 should stay below 50 ms; sustained values above 2 s point to disk degradation. Because most deployments leave log.flush.interval.messages and log.flush.interval.ms unset and rely on replication plus OS lazy flush, this metric reflects kernel-driven flushes or explicit fsyncs. A slow flush raises produce LocalTimeMs, blocks request handler threads, and can push followers out of ISR. This guide shows how to tell whether the cause is a failing disk, a bad flush policy, or transient I/O contention.
What this means
LogFlushRateAndTimeMs measures wall-clock time for log segment fsyncs, exposed as JMX MBean kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. Watch 99thPercentile; Count tells you how often flushes happen.
With application-level flush unset, Kafka appends to the OS page cache and lets kernel flusher threads write to disk. In this mode, the metric is a direct signal of disk write performance. On SSDs, expect p99 below 50 ms; above 200 ms is abnormal, and above 2 s likely indicates hardware trouble. On HDDs, typical p99 is below 100 ms; sustained values above 500 ms warrant investigation.
The metric is broker-wide. On JBOD, one slow log.dirs disk can raise broker-level flush latency while others remain healthy. Correlate it with per-device OS metrics, not cluster aggregates.
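The thresholds above can be sketched as a small shell helper. This is only an illustration: classify_flush_p99 is a hypothetical function name, not part of Kafka or any monitoring tool, and the "elevated" band between the typical and abnormal values is an assumption added for readability.

```shell
# Classify a LogFlushRateAndTimeMs p99 reading (milliseconds) using the
# thresholds quoted in this guide. Hypothetical helper, not a Kafka tool.
# Usage: classify_flush_p99 <p99_ms> [ssd|hdd]
classify_flush_p99() {
  local p99_ms="$1" media="${2:-ssd}"
  # SSD: typical below 50 ms, abnormal above 200 ms.
  local typical=50 warn=200
  # HDD: typical below 100 ms, investigate above 500 ms.
  if [ "$media" = "hdd" ]; then
    typical=100
    warn=500
  fi
  awk -v v="$p99_ms" -v typical="$typical" -v warn="$warn" 'BEGIN {
    if (v > 2000)         print "critical"  # likely hardware trouble
    else if (v > warn)    print "warning"   # abnormal for this media type
    else if (v > typical) print "elevated"  # above typical, below abnormal
    else                  print "ok"
  }'
}
```

For example, `classify_flush_p99 350 ssd` prints `warning`, while the same reading on an HDD prints `elevated`.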
flowchart TD
A[LogFlushRateAndTimeMs p99 elevated] --> B{Explicit flush configured?}
B -->|Yes| C[Unset log.flush.interval.ms and log.flush.interval.messages]
B -->|No| D[iostat await elevated?]
D -->|Yes| E{SMART errors or RAID event?}
E -->|Yes| F[Evacuate broker and replace or rebuild disk]
E -->|No| G[Check co-tenancy, filesystem, RAID cache]
D -->|No| H[Check compaction spikes and page cache pressure]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Failing or degraded disk/SSD | p99 >2 s sustained; iostat await high; SMART reallocated or pending sectors rising | sudo smartctl -a /dev/sdX and sudo dmesg for I/O errors |
| Explicit application flush configured | Count is high and p99 tracks flush interval; settings force frequent fsyncs | server.properties for log.flush.interval.messages or log.flush.interval.ms |
| RAID rebuild or controller cache failure | Sudden latency jump across all devices on the host; no SMART errors; rebuild logged | RAID controller logs and sudo dmesg |
| Filesystem or OS-level contention | Spiky write latency on ext4 or shared disks; OS journal or other workloads co-located with log.dirs | Per-device iostat and mount options |
| Log compaction or segment roll spikes | Periodic flush latency spikes aligned with cleaner runs; max-dirty-percent climbing | Log cleaner metrics and compaction schedule |
Read-only checks
These checks are safe to run during an incident. Some require root to access kernel logs or SMART data.
# Check log flush p99 and flush rate from JMX
echo "get -b kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs Count" | java -jar jmxterm.jar -l localhost:9999
# Check disk latency for devices backing log.dirs
iostat -xz 1
# Check for offline log directories
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check disk space per Kafka log directory
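# Strip the "log.dirs=" key before splitting, so the first path is not passed to df with the key attached
grep '^log.dirs=' /etc/kafka/server.properties | cut -d= -f2- | tr ',' '\n' | while read -r d; do df -h "$d"; done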
# Check kernel disk and controller errors
sudo dmesg -T | grep -iE 'error|fail|sector|i/o' | tail -n 20
# Check SMART health for the suspect device
sudo smartctl -a /dev/sdX
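Full SMART output is long, so it can help to filter it down to the counters this guide cares about. The sketch below assumes the ATA-style attribute table that smartctl prints with -A, where the attribute name is the second column and the raw value is the tenth; NVMe devices report health in a different layout and are not covered. smart_nonzero_counters is a hypothetical helper name.

```shell
# Print failure-related SMART attributes whose raw value is above zero.
# Reads the ATA attribute table (smartctl -A) on stdin.
# Hypothetical helper; assumes name in column 2 and raw value in column 10.
# Usage: sudo smartctl -A /dev/sdX | smart_nonzero_counters
smart_nonzero_counters() {
  awk '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Spin_Retry_Count)$/ && $10 + 0 > 0 {
      print $2 "=" $10
    }
  '
}
```

Empty output means none of the four counters is above zero; it does not prove the disk is healthy, so still check dmesg and iostat.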
How to diagnose it
- Confirm the signal is sustained. Look at 99thPercentile over several minutes. Brief spikes during log rolling or retention are normal. Sustained elevation above 200 ms is the concern.
- Correlate with produce request latency. Check kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce. If it rises with LogFlushRateAndTimeMs, the broker’s local write path is bottlenecked on disk.
- Check OS disk latency. Run iostat -xz 1 for at least 30 seconds on devices backing log.dirs. For SSDs, await above 20 ms is a warning; above 100 ms is critical. For HDDs, use 50 ms and 100 ms respectively.
- Look for explicit flush settings. In server.properties, if log.flush.interval.messages or log.flush.interval.ms is set, plan a rolling restart to unset it. Most deployments should delegate durability to replication and the OS flusher.
- Isolate the device in JBOD. If the broker has multiple log.dirs on separate devices, compare per-device iostat. One device with high await while others are normal points to a single failing disk rather than host-wide saturation.
- Inspect disk health. Use smartctl -a to look for reallocated sectors, spin retries, pending sectors, or offline uncorrectable errors. Any nonzero value, especially one that keeps rising, suggests the disk is failing.
- Check whether Kafka has already given up. OfflineLogDirectoryCount greater than zero means a log.dirs entry is offline due to I/O errors. That is a binary failure signal.
- Rule out RAID events and controller cache issues. If await spiked suddenly across all data devices and SMART is clean, check RAID controller logs for rebuild, battery failure, or write-back cache disablement.
- Consider compaction spikes. If LogFlushRateAndTimeMs only spikes during cleaner runs and max-dirty-percent is climbing, the disk may be healthy but struggling with random I/O from compaction. Increase log.cleaner.threads or reduce compaction backlog.
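The per-device comparison in the steps above can be sketched as an awk filter over iostat output. It assumes the extended report has a header row starting with "Device" and a write-latency column named w_await (newer sysstat) or await (older sysstat); column names vary between sysstat versions, so treat this as a starting point. slow_devices is a hypothetical helper name.

```shell
# Print devices whose write latency exceeds a threshold (ms).
# Reads "iostat -x" output on stdin. Hypothetical helper; it locates the
# latency column from the header, preferring w_await over await.
# Usage: LC_ALL=C iostat -xz 1 30 | slow_devices 20
slow_devices() {
  local threshold_ms="${1:-20}"
  awk -v limit="$threshold_ms" '
    $1 ~ /^Device/ {
      # New device table: find the latency column in this header row.
      col = 0
      for (i = 2; i <= NF; i++) {
        if ($i == "w_await") { col = i; break }
        if ($i == "await" && col == 0) col = i
      }
      next
    }
    NF == 0 { col = 0; next }   # a blank line ends the device table
    col > 0 && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

The first report iostat prints is an average since boot, so a device that shows up only once at the top is less meaningful than one that repeats in every interval. LC_ALL=C keeps the decimal separator as a dot so the numeric comparison works.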
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs p99 | Direct fsync latency | SSD >200 ms; HDD >500 ms; any sustained >2 s |
| kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce p99 | Broker write path latency including flush | Correlated spike with log flush metric |
| iostat await | OS-level disk latency | SSD sustained >20 ms; HDD sustained >50 ms |
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | I/O threads blocking on slow disk | Sustained <0.3 |
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Followers falling behind due to leader flush delay | Non-zero sustained outside maintenance |
| kafka.log:type=LogManager,name=OfflineLogDirectoryCount | Kafka has taken a log directory offline | Any nonzero value |
| kafka.log:type=LogCleanerManager,name=max-dirty-percent | Compaction backlog causing random I/O | Sustained above 50% |
Fixes
Warning: The fixes below can be disruptive. Verify replica placement and cluster capacity before stopping a broker or changing configuration.
Remove explicit flush configuration
If log.flush.interval.messages or log.flush.interval.ms is configured, comment it out and roll-restart the broker. Rolling restarts trigger leader elections and ISR changes; schedule them during low throughput. Replication with acks=all and adequate min.insync.replicas is the standard durability mechanism, not frequent fsync.
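A quick way to see whether either setting is active is to grep for uncommented entries. This is a minimal sketch: explicit_flush_settings is a hypothetical helper name, and the properties path in the usage line is the one used earlier in this guide, which may differ on your hosts.

```shell
# Print active (uncommented) explicit flush settings from a broker
# properties file. Hypothetical helper; exits non-zero when none are set.
# Usage: explicit_flush_settings /etc/kafka/server.properties
explicit_flush_settings() {
  local props="$1"
  # Commented lines start with "#", so anchoring on optional whitespace
  # followed by the key skips them.
  grep -E '^[[:space:]]*log\.flush\.interval\.(messages|ms)[[:space:]]*[=:]' "$props"
}
```

No output and a non-zero exit status mean the broker file leaves both settings unset.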
Replace or isolate a failing disk
If smartctl shows reallocated or pending sectors, or if flush p99 stays above 2 seconds with high await and correlated LocalTimeMs spikes, treat the disk as failing. Before shutting down the broker, confirm that every partition it hosts has in-sync replicas on other brokers and that the remaining brokers can absorb the load. Stopping the broker shrinks the ISR; if it hosts the only in-sync replica of a partition, writes to that partition will stall. After shutdown, replace the disk, format it as XFS, update log.dirs, and restart the broker.
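The pre-shutdown check can be sketched by scanning kafka-topics.sh --describe output for partitions whose ISR contains only the broker you are about to stop. The sketch assumes each partition line carries "Topic:", "Partition:", and "Isr:" labels followed by their values, which may vary across Kafka versions. sole_isr_partitions is a hypothetical helper name, and the bootstrap address and broker ID in the usage line are placeholders.

```shell
# Print topic-partition pairs whose ISR is exactly the given broker ID.
# Reads "kafka-topics.sh --describe" output on stdin. Hypothetical helper.
# Usage: kafka-topics.sh --bootstrap-server localhost:9092 --describe | sole_isr_partitions 3
sole_isr_partitions() {
  local broker_id="$1"
  awk -v id="$broker_id" '
    {
      topic = part = isr = ""
      # Each labelled field is followed by its value in the next column.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part  = $(i + 1)
        if ($i == "Isr:")       isr   = $(i + 1)
      }
      # Topic summary lines have no "Partition:" label and are skipped.
      if (part != "" && isr == id) print topic "-" part
    }
  '
}
```

Any partition this prints would lose its last in-sync replica when the broker stops, so let its followers catch up or reassign it first.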
Mitigate RAID and filesystem issues
If a RAID rebuild is active, expect degraded performance. Throttle the rebuild if the controller supports it, or schedule replacement during a maintenance window. If the RAID write-back cache is disabled due to battery failure, restore cache protection or plan a move to local SSDs. Use XFS for log.dirs; it is the filesystem most operators use for Kafka.
Reduce I/O contention
Keep Kafka log.dirs on dedicated disks that do not share spindles with the OS root filesystem, swap, or other workloads. In virtualized environments, switch to provisioned IOPS or local SSDs if noisy-neighbor effects are suspected.
Address compaction-related spikes
If latency spikes align with log cleaner runs, increase log.cleaner.threads, verify log.cleaner.dedupe.buffer.size is sufficient, and grep broker logs for cleaner thread crashes. A dead cleaner requires a broker restart after you resolve the underlying corrupt segment or OOM cause.
Prevention
- Leave log.flush.interval.messages and log.flush.interval.ms unset. Rely on replication and OS lazy flush.
- Use XFS on dedicated volumes for each log.dirs entry.
- Monitor LogFlushRateAndTimeMs p99 with warning threshold 200 ms and critical threshold 2 s.
- Track per-device await trends, not just instantaneous values.
- Watch LogCleanerManager max-dirty-percent and LogCleanerDeadThreadCount to prevent compaction backlog.
- In JBOD deployments, monitor per-disk latency separately; broker-wide aggregates hide single-disk failures.
- Maintain at least 15-20% free space on each data volume to absorb compaction and reassignment spikes.
- Correlate flush latency with RequestHandlerAvgIdlePercent and LocalTimeMs to catch degradation before ISR shrinks begin.
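The free-space guideline in the list above can be checked with a small filter over POSIX-format df output. This is a sketch under assumptions: low_free_space is a hypothetical helper name, the mount points in the usage line are placeholders, and mount paths are assumed to contain no spaces.

```shell
# Print mount points whose free space is below a minimum percentage.
# Reads "df -P" output on stdin. Hypothetical helper; free space is
# derived as 100 minus the Capacity column.
# Usage: df -P /data/kafka1 /data/kafka2 | low_free_space 20
low_free_space() {
  local min_free_pct="${1:-20}"
  awk -v min="$min_free_pct" '
    NR > 1 {                      # skip the df header row
      used = $5
      sub(/%$/, "", used)         # "90%" -> "90"
      free = 100 - used
      if (free < min) print $6, free "%"
    }
  '
}
```

Each output line is a mount point followed by its remaining free percentage; no output means every volume checked meets the minimum.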
How Netdata helps
- Collects LogFlushRateAndTimeMs p99 and LocalTimeMs per broker so you can correlate fsync latency with write-path delay on the same chart.
- Surfaces OS disk metrics such as await and utilization alongside Kafka JMX data on the same node view.
- Tracks RequestHandlerAvgIdlePercent, IsrShrinksPerSec, and OfflineLogDirectoryCount to show when disk latency is becoming a cluster-level problem.
- Supports composite alerts that combine high flush latency with high disk await, reducing noise from normal OS flush spikes.
- Retains high-resolution metrics to help distinguish transient compaction bursts from sustained disk degradation.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka enable.auto.commit data loss: committed offsets that outrun processing: /guides/kafka/kafka-auto-commit-silent-data-loss/
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery: /guides/kafka/kafka-broker-out-of-disk/
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
- Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
- Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses: /guides/kafka/kafka-fetch-request-latency-high/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap: /guides/kafka/kafka-jvm-heap-full-gc-pauses/







