Kafka LogFlushRateAndTimeMs high: fsync latency and a failing disk

When kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs p99 stays above 200 ms, the broker’s fsync path is slow. On SSD-backed clusters, p99 should stay below 50 ms; sustained values above 2 s point to disk degradation. Most deployments leave log.flush.interval.messages and log.flush.interval.ms unset and rely on replication plus OS lazy flush. In that setup the metric times the fsyncs Kafka still issues itself, such as when a segment rolls, so a slow value reflects the disk rather than the flush policy. A slow flush raises produce LocalTimeMs, blocks request handler threads, and can push followers out of ISR. This guide shows how to tell whether the cause is a failing disk, a bad flush policy, or transient I/O contention.

What this means

LogFlushRateAndTimeMs measures wall-clock time for log segment fsyncs, exposed as JMX MBean kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs. Watch 99thPercentile; Count tells you how often flushes happen.

With application-level flush unset, Kafka appends to the OS page cache and lets kernel flusher threads write to disk. The fsyncs Kafka still performs in this mode are infrequent, so the time they take is a direct signal of disk write performance. On SSDs, expect p99 below 50 ms; above 200 ms is abnormal, and above 2 s usually means hardware trouble. On HDDs, typical p99 is below 100 ms; sustained values above 500 ms warrant investigation.
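
The thresholds above can be turned into a small helper for scripts or alert hooks. This is a minimal sketch: the function name classify_flush_p99 and the ok/warning/critical labels are illustrative choices, not part of Kafka or any monitoring tool, and the cutoffs are simply the ones quoted in this guide.

```shell
# classify_flush_p99 <p99_ms> [ssd|hdd]
# Prints ok, warning, or critical using the thresholds quoted in this guide.
# The function name and labels are hypothetical.
classify_flush_p99() {
  awk -v p99="$1" -v media="${2:-ssd}" 'BEGIN {
    warn = (media == "hdd") ? 500 : 200   # ms: worth investigating above this
    crit = 2000                           # ms: likely hardware trouble
    if (p99 + 0 > crit)      print "critical"
    else if (p99 + 0 > warn) print "warning"
    else                     print "ok"
  }'
}
```

Pass it the 99thPercentile value however you collect it, for example classify_flush_p99 350 ssd. A single reading is not enough to act on; the classification only matters when it holds for several minutes.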

The metric is broker-wide. On JBOD, one slow log.dirs disk can raise broker-level flush latency while others remain healthy. Correlate it with per-device OS metrics, not cluster aggregates.
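
To correlate per device, you first need to know which filesystem backs each log.dirs entry. The sketch below assumes a plain server.properties file with a single-line log.dirs entry; the function name log_dir_devices is hypothetical.

```shell
# log_dir_devices <server.properties>
# Prints "<log dir> <filesystem source>" for every log.dirs entry.
# Prints "unknown" as the source when df cannot resolve the directory.
log_dir_devices() {
  grep -E '^[[:space:]]*log\.dirs[[:space:]]*=' "$1" |
    tail -n 1 |
    cut -d= -f2- |
    tr ',' '\n' |
    while IFS= read -r dir; do
      # Trim whitespace around each comma-separated entry.
      dir=$(printf '%s' "$dir" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
      [ -n "$dir" ] || continue
      # df -P keeps each filesystem on one line; field 1 of line 2 is the source.
      dev=$(df -P "$dir" 2>/dev/null | awk 'NR == 2 { print $1 }')
      printf '%s %s\n' "$dir" "${dev:-unknown}"
    done
}
```

The source that df reports may be a partition, an LVM volume, or an md device rather than the physical disk name shown by iostat, so an extra mapping step may be needed on such hosts.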

flowchart TD
    A[LogFlushRateAndTimeMs p99 elevated] --> B{Explicit flush configured?}
    B -->|Yes| C[Unset log.flush.interval.ms and log.flush.interval.messages]
    B -->|No| D[iostat await elevated?]
    D -->|Yes| E{SMART errors or RAID event?}
    E -->|Yes| F[Evacuate broker and replace or rebuild disk]
    E -->|No| G[Check co-tenancy, filesystem, RAID cache]
    D -->|No| H[Check compaction spikes and page cache pressure]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Failing or degraded disk/SSD | p99 >2 s sustained; iostat await high; SMART reallocated or pending sectors rising | sudo smartctl -a /dev/sdX and sudo dmesg for I/O errors |
| Explicit application flush configured | Count is high and p99 tracks flush interval; settings force frequent fsyncs | server.properties for log.flush.interval.messages or log.flush.interval.ms |
| RAID rebuild or controller cache failure | Sudden latency jump across all devices on the host; no SMART errors; rebuild logged | RAID controller logs and sudo dmesg |
| Filesystem or OS-level contention | Spiky write latency on ext4 or shared disks; OS journal or other workloads co-located with log.dirs | Per-device iostat and mount options |
| Log compaction or segment roll spikes | Periodic flush latency spikes aligned with cleaner runs; max-dirty-percent climbing | Log cleaner metrics and compaction schedule |

Read-only checks

These checks are safe to run during an incident. Some require root to access kernel logs or SMART data.

# Check log flush p99 and flush rate from JMX
# (-n runs jmxterm non-interactively so it reads the piped command)
echo "get -b kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999 -n
echo "get -b kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs Count" | java -jar jmxterm.jar -l localhost:9999 -n
# Check disk latency for devices backing log.dirs (Ctrl-C to stop)
iostat -xz 1
# Check for offline log directories
echo "get -b kafka.log:type=LogManager,name=OfflineLogDirectoryCount Value" | java -jar jmxterm.jar -l localhost:9999 -n
# Check disk space per Kafka log directory
# (strip the "log.dirs=" key before splitting the value on commas)
grep '^log\.dirs=' /etc/kafka/server.properties | cut -d= -f2- | tr ',' '\n' | while read -r d; do df -h "$d"; done
# Check kernel disk and controller errors
sudo dmesg -T | grep -iE 'error|fail|sector|i/o' | tail -n 20
# Check SMART health for the suspect device
sudo smartctl -a /dev/sdX

How to diagnose it

  1. Confirm the signal is sustained. Look at 99thPercentile over several minutes. Brief spikes during log rolling or retention are normal. Sustained elevation above 200 ms is the concern.
  2. Correlate with produce request latency. Check kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce. If it rises with LogFlushRateAndTimeMs, the broker’s local write path is bottlenecked on disk.
  3. Check OS disk latency. Run iostat -xz 1 for at least 30 seconds on devices backing log.dirs. For SSDs, await above 20 ms is a warning; above 100 ms is critical. For HDDs, use 50 ms and 100 ms respectively.
  4. Look for explicit flush settings. If log.flush.interval.messages or log.flush.interval.ms is set in server.properties, plan a rolling restart to remove it. Most deployments should delegate durability to replication and the OS flusher.
  5. Isolate the device in JBOD. If the broker has multiple log.dirs on separate devices, compare per-device iostat. One device with high await while others are normal points to a single failing disk rather than host-wide saturation.
  6. Inspect disk health. Use smartctl -a to look for reallocated sectors, spin retries, pending sectors, or offline uncorrectable errors. Treat any nonzero count as suspect and a rising count as a failing disk.
  7. Check whether Kafka has already given up. OfflineLogDirectoryCount greater than zero means a log.dirs entry is offline due to I/O errors. That is a binary failure signal.
  8. Rule out RAID events and controller cache issues. If await spiked suddenly across all data devices and SMART is clean, check RAID controller logs for rebuild, battery failure, or write-back cache disablement.
  9. Consider compaction spikes. If LogFlushRateAndTimeMs only spikes during cleaner runs and max-dirty-percent is climbing, the disk may be healthy but struggling with random I/O from compaction. Increase log.cleaner.threads or reduce compaction backlog.
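
The per-device await comparison in steps 3 and 5 can be scripted. The sketch below reads iostat -x output on stdin and prints devices whose write latency exceeds a threshold. It assumes a sysstat version that reports a w_await column and locates that column by header name, because column order differs between versions. The function name slow_write_devices is hypothetical.

```shell
# slow_write_devices <threshold_ms>
# Reads `iostat -x` output on stdin and prints "<device> <w_await>" for each
# device line whose write latency is above the threshold.
slow_write_devices() {
  awk -v limit="$1" '
    NF == 0 { col = 0; next }        # a blank line ends the current report
    $1 ~ /^Device/ {                 # header row: find the w_await column
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "w_await") col = i
      next
    }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}
```

A possible invocation is LC_ALL=C iostat -xz 1 30 | slow_write_devices 20 for SSDs, or a threshold of 50 for HDDs. The first report iostat prints is an average since boot, so hits from that report say little about the current incident. A device that appears in nearly every interval while its neighbors never do is the single-disk pattern step 5 describes.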

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs p99 | Direct fsync latency | SSD >200 ms; HDD >500 ms; any sustained >2 s |
| kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce p99 | Broker write path latency including flush | Correlated spike with log flush metric |
| iostat await | OS-level disk latency | SSD sustained >20 ms; HDD sustained >50 ms |
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | I/O threads blocking on slow disk | Sustained <0.3 |
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Followers falling behind due to leader flush delay | Non-zero sustained outside maintenance |
| kafka.log:type=LogManager,name=OfflineLogDirectoryCount | Kafka has taken a log directory offline | Any nonzero value |
| kafka.log:type=LogCleanerManager,name=max-dirty-percent | Compaction backlog causing random I/O | Sustained above 50% |

Fixes

Warning: The fixes below can be disruptive. Verify replica placement and cluster capacity before stopping a broker or changing configuration.

Remove explicit flush configuration

If log.flush.interval.messages or log.flush.interval.ms is configured, comment it out and restart the brokers one at a time. Rolling restarts trigger leader elections and ISR changes; schedule them during low throughput. Replication with acks=all and adequate min.insync.replicas is the standard durability mechanism, not frequent fsync.
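
Before planning the restart, it helps to confirm which flush settings are actually active in the file. The following is a minimal sketch; the function name explicit_flush_settings is hypothetical, and it only inspects the properties file you point it at.

```shell
# explicit_flush_settings <server.properties>
# Prints uncommented log.flush.interval.messages / log.flush.interval.ms lines.
# Exits non-zero when neither setting is present, which is the desired state.
explicit_flush_settings() {
  grep -E '^[[:space:]]*log\.flush\.interval\.(messages|ms)[[:space:]]*=' "$1"
}
```

For example, explicit_flush_settings /etc/kafka/server.properties || echo "no explicit flush settings". A clean result here does not cover topic-level overrides (flush.messages and flush.ms) or settings applied dynamically, so check those separately if flush counts remain high.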

Replace or isolate a failing disk

If smartctl shows reallocated or pending sectors, or if LogFlushRateAndTimeMs p99 stays above 2 s with elevated await and correlated LocalTimeMs spikes, treat the disk as failing. Before shutting down the broker, confirm that every partition it hosts has in-sync replicas on other brokers and that the remaining brokers can absorb the load. Stopping the broker shrinks ISR; if it hosts the only in-sync replica of a partition, writes to that partition will stall. After shutdown, replace the disk, format it with XFS, mount it at the path listed in log.dirs (or update log.dirs if the path changes), and restart the broker.
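
The SMART counters mentioned above can be filtered out of the attribute table automatically. This sketch assumes the ATA attribute layout printed by smartctl -A, where the attribute name is the second column and the raw value is the tenth. Attribute names vary by vendor, and NVMe devices report health in a different format, so treat the name list as an assumption. The function name failing_smart_attrs is hypothetical.

```shell
# failing_smart_attrs
# Reads `smartctl -A` ATA attribute output on stdin and prints
# "<attribute> <raw value>" for failure-related counters that are non-zero.
failing_smart_attrs() {
  awk '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Spin_Retry_Count)$/ {
      if ($10 + 0 > 0) print $2, $10
    }
  '
}
```

A possible invocation is sudo smartctl -A /dev/sdX | failing_smart_attrs. Empty output means none of the listed counters is above zero; it does not prove the disk is healthy, so keep correlating with await and kernel logs.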

Mitigate RAID and filesystem issues

If a RAID rebuild is active, expect degraded performance. Throttle the rebuild if the controller supports it, or schedule replacement during a maintenance window. If the RAID write-back cache is disabled due to battery failure, restore cache protection or plan a move to local SSDs. Use XFS for log.dirs; it is the filesystem most operators use for Kafka.

Reduce I/O contention

Keep Kafka log.dirs on dedicated disks that do not share spindles with the OS root filesystem or other workloads. In virtualized environments, switch to provisioned IOPS or local SSDs if noisy-neighbor effects are suspected.

If latency spikes align with log cleaner runs, increase log.cleaner.threads, verify log.cleaner.dedupe.buffer.size is sufficient, and grep broker logs for cleaner thread crashes. A dead cleaner requires a broker restart after you resolve the underlying corrupt segment or OOM cause.

Prevention

  • Leave log.flush.interval.messages and log.flush.interval.ms unset. Rely on replication and OS lazy flush.
  • Use XFS on dedicated volumes for each log.dirs entry.
  • Monitor LogFlushRateAndTimeMs p99 with warning threshold 200 ms and critical threshold 2 s.
  • Track per-device await trends, not just instantaneous values.
  • Watch LogCleanerManager max-dirty-percent and LogCleaner DeadThreadCount to catch compaction backlog and dead cleaner threads early.
  • In JBOD deployments, monitor per-disk latency separately; broker-wide aggregates hide single-disk failures.
  • Maintain at least 15-20% free space on each data volume to absorb compaction and reassignment spikes.
  • Correlate flush latency with RequestHandlerAvgIdlePercent and LocalTimeMs to catch degradation before ISR shrinks begin.
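
The free-space guideline in the list above is easy to check from a cron job or a pre-flight script. This is a rough sketch using df -P; the function names free_pct and has_headroom are hypothetical, and the percentage is whatever df reports for the filesystem holding the directory.

```shell
# free_pct <dir>
# Prints the percentage of free space on the filesystem that holds <dir>.
free_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# has_headroom <dir> <min_free_pct>
# Succeeds when free space is at least the requested minimum.
has_headroom() {
  [ "$(free_pct "$1")" -ge "$2" ]
}
```

For example, has_headroom /var/lib/kafka/data 20 || echo "low headroom", where the path is a placeholder for one of your log.dirs entries. Run it once per entry so that a single full volume is not hidden behind healthy ones.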

How Netdata helps

  • Collects LogFlushRateAndTimeMs p99 and LocalTimeMs per broker so you can correlate fsync latency with write-path delay on the same chart.
  • Surfaces OS disk metrics such as await and utilization alongside Kafka JMX data on the same node view.
  • Tracks RequestHandlerAvgIdlePercent, IsrShrinksPerSec, and OfflineLogDirectoryCount to show when disk latency is becoming a cluster-level problem.
  • Supports composite alerts that combine high flush latency with high disk await, reducing noise from normal OS flush spikes.
  • Retains high-resolution metrics to help distinguish transient compaction bursts from sustained disk degradation.