Kafka RequestHandlerAvgIdlePercent low: I/O thread saturation and overload

You get paged because RequestHandlerAvgIdlePercent is below 0.3 and falling. The metric is the fraction of time the broker’s I/O handler threads spend idle, from 0 (always busy) to 1 (always idle). It is reported as an exponentially weighted moving average, not a spike metric. A low value means the I/O handler threads have been saturated long enough that the request queue is backing up and clients are timing out.

Treat above 0.5 as healthy, below 0.3 as critical, and below 0.1 as active overload. Even a stable 0.45 is dangerous: one broker failure can shift enough load to collapse the survivors.
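The bands above can be encoded in a small helper for runbook scripts. This is a minimal sketch: the function name classify_idle and the label low-headroom for the 0.3 to 0.5 range are assumptions made for illustration, not Kafka or Netdata terms.

```shell
# Map an idle ratio (0 to 1) to the bands described above.
# The "low-headroom" label for 0.3..0.5 is an illustrative assumption.
classify_idle() {
  awk -v v="$1" 'BEGIN {
    if (v < 0.1)       print "active-overload"
    else if (v < 0.3)  print "critical"
    else if (v <= 0.5) print "low-headroom"
    else               print "healthy"
  }'
}

classify_idle 0.45
```

With a reading of 0.45 the helper prints low-headroom, which matches the warning that a stable 0.45 leaves no room for a broker failure.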

Confirm the signal, find the bottleneck layer, and relieve pressure without making it worse.

What this means

Kafka brokers use a two-stage thread model. Network threads (num.network.threads, default 3) accept connections and read requests onto a bounded request queue (queued.max.requests, default 500). I/O handler threads (num.io.threads, default 8) dequeue and process those requests, handling produce appends, fetch reads, metadata requests, and anything else that touches the log.

RequestHandlerAvgIdlePercent measures only the I/O thread pool. When it drops, the pool is the bottleneck: requests sit in the queue longer than they should, and eventually they time out. Producers with retries enabled then resend, adding load and creating a positive feedback loop.

flowchart LR
    Client --> Network[Network threads num.network.threads]
    Network --> Queue[Request queue queued.max.requests]
    Queue --> IO[I/O threads num.io.threads]
    IO --> Disk[(Log dirs page cache)]
    IO --> Purgatory[Delayed op purgatory]

Because the metric is an EWMA, brief bursts do not push it to 0.1. A value below 0.3 reflects sustained pressure. This makes it a reliable alert signal, but the problem has been building for some time by the time the page fires.
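To see why a short burst barely moves the reading, here is a simplified model. It assumes the one-minute rate behaves like a standard EWMA updated every 5 seconds with a 60-second time constant. That smoothing constant and the function name ewma_after_burst are assumptions for illustration, not values taken from the broker.

```shell
# Simplified model: idle ratio after a burst of fully saturated (0 idle) time.
# Assumes 5-second ticks and a 60-second time constant (illustrative only).
ewma_after_burst() {
  # $1 = steady-state idle ratio before the burst
  # $2 = burst length in seconds during which the threads are never idle
  awk -v idle="$1" -v secs="$2" 'BEGIN {
    alpha = 1 - exp(-5 / 60)
    ticks = int(secs / 5)
    v = idle
    for (i = 0; i < ticks; i++) v += alpha * (0 - v)
    printf "%.3f\n", v
  }'
}

ewma_after_burst 0.6 15
ewma_after_burst 0.6 120
```

Under these assumptions, a 15-second stall from a 0.6 baseline leaves the average near 0.47, while two minutes of full saturation drives it to about 0.08.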

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Disk I/O saturation | LocalTimeMs for Produce is high; iostat shows elevated await; RequestHandlerAvgIdlePercent low on one or a few brokers | Disk await per log dir device |
| Concurrency-bound CPU or thread starvation | Disk latency is fine but CPU is high; RequestQueueTimeMs is rising; many partitions or message down-conversion on the broker | CPU, GC, and FetchMessageConversionsPerSec / ProduceMessageConversionsPerSec |
| Producer retry cascade | BytesInPerSec rises while useful throughput does not; RequestQueueSize grows; producer-side retry rate climbs | Producer record-retry-rate and FailedProduceRequestsPerSec |
| Large produce batches with compression | High BytesInPerSec but normal MessagesInPerSec; CPU elevated from decompression | Ratio of bytes to messages and compression codecs |
| Slow replication backing up purgatory | RemoteTimeMs high; Produce purgatory growing; UnderReplicatedPartitions nonzero | UnderReplicatedPartitions and follower disk or network metrics |

Quick checks

Run these safe, read-only checks to triage before making changes.

# Read the saturation metric directly from JMX
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check pressure between network and I/O threads
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Distinguish queue wait from disk wait from replication wait
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Rule out network-thread saturation on the same broker (this metric is a gauge, so read Value)
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999

# Real disk saturation signal
iostat -xz 1

# JVM pressure that steals thread time
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

# Produce purgatory for acks=all backup
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999

# Find partitions whose followers are struggling
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

How to diagnose it

  1. Confirm the signal is sustained. Because RequestHandlerAvgIdlePercent is an EWMA, one bad minute is not enough. Look for values below 0.3 across several sampling intervals or a steady downward trend.
  2. Locate the affected broker or brokers. Compare the metric across the cluster. If only one broker is low, that broker is the problem. If many are low, the cluster is genuinely over capacity or a retry cascade is spreading load.
  3. Check the produce latency breakdown. High RequestQueueTimeMs means I/O threads cannot keep up. High LocalTimeMs means disk writes are slow. High RemoteTimeMs means followers are slow to acknowledge acks=all requests.
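  4. Validate disk health. Use iostat -xz 1 and focus on await for the devices backing log.dirs. Recent sysstat versions report r_await and w_await instead of a single await column; for produce load, w_await is the one to watch. For SSDs, sustained await above 20 ms is concerning. For HDDs, above 50 ms is concerning. %util is misleading for SSDs and arrays; prefer await.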
  5. Check CPU and GC. If disk is fine but CPU is high or GC pauses are long, the broker may be CPU-bound from compression, TLS, message down-conversion, or simply too many partitions. High GC pause times show up as correlated spikes in request latency.
  6. Inspect replication health. If RemoteTimeMs is high, look at UnderReplicatedPartitions, IsrShrinksPerSec, and the follower brokers’ disk and network metrics. Slow followers back up the produce purgatory and starve I/O threads.
  7. Look for a producer retry cascade. Check producer metrics for elevated record-retry-rate. On the broker, BytesInPerSec may climb while useful throughput stays flat because the same messages are being retried.
  8. Rule out network thread saturation. If NetworkProcessorAvgIdlePercent is also low, the bottleneck is higher up the stack. Raising num.io.threads will not help and can make contention worse.
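Step 3 can be turned into a quick triage helper. The sketch below takes the three Produce p99 values already read from JMX and names the largest component. The function name dominant_wait and the "largest wins" rule are assumptions for illustration; comparing each component against its own baseline is more robust when RemoteTimeMs is normally high under acks=all.

```shell
# Name the dominant component of Produce latency.
# Heuristic only: picks the largest of the three p99 values.
dominant_wait() {
  # $1 = RequestQueueTimeMs p99, $2 = LocalTimeMs p99, $3 = RemoteTimeMs p99
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    if (q >= l && q >= r) print "queue: I/O threads cannot keep up"
    else if (l >= r)      print "local: disk writes are slow"
    else                  print "remote: followers are slow to acknowledge"
  }'
}

# Example with made-up readings in milliseconds
dominant_wait 120 15 8
```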

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| RequestHandlerAvgIdlePercent | Best single indicator of broker I/O capacity | Sustained below 0.3; below 0.1 is active overload |
| RequestQueueSize | Pressure gauge between network and I/O threads | Sustained above 250 with default queued.max.requests=500 |
| RequestQueueTimeMs (Produce) | Time requests wait for an I/O thread | p99 above 2-3x baseline for more than 5 minutes |
| LocalTimeMs (Produce) | Disk write path latency | p99 above 20 ms on SSD, 50 ms on HDD |
| RemoteTimeMs (Produce) | Time waiting for follower replication | p99 above 50% of replica.lag.time.max.ms |
| NetworkProcessorAvgIdlePercent | Distinguishes network threads from I/O threads | Sustained below 0.3 |
| Produce purgatory size | acks=all requests blocked on replication | Sustained above 2x baseline |
| UnderReplicatedPartitions | Follower health and replication pressure | Nonzero outside maintenance windows |
| BytesInPerSec / MessagesInPerSec ratio | Detects retry floods or giant batches | Sharp ratio shift without traffic change |
| Disk await | Real disk saturation signal | Sustained above 20 ms SSD / 50 ms HDD |

Fixes

If disk I/O is the bottleneck

Do not increase num.io.threads. More threads hitting a slow disk adds contention, not throughput.

  • Identify whether one log directory is degraded. With JBOD, a single slow disk can degrade the whole broker while other directories are fine.
  • If the disk is failing, plan replacement or migration. A controlled broker shutdown triggers leader election. This is disruptive; do it during a maintenance window or when the cluster can absorb the shift.
  • Reduce random I/O pressure by checking for runaway compaction, excessive partition count on the broker, or a backfill consumer thrashing page cache.
  • If the cluster is uniformly disk-bound, add brokers and rebalance partitions, or move to faster storage.
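To spot a single degraded device among several log directories, the iostat output can be filtered for await values over a limit. This is a sketch, not a hardened parser: the function name flag_slow_devices is hypothetical, and it assumes the extended report has a header line starting with Device and columns named await, r_await, or w_await, which varies by sysstat version.

```shell
# Print devices whose await (or r_await / w_await) exceeds a limit in ms.
# Column positions are read from the "Device" header, so layout may vary.
flag_slow_devices() {
  # $1 = await limit in ms, e.g. 20 for SSD or 50 for HDD
  awk -v limit="$1" '
    /^Device/ {
      split("", col)                      # reset columns for each report
      for (i = 1; i <= NF; i++)
        if ($i == "await" || $i == "r_await" || $i == "w_await") col[i] = $i
      in_table = 1
      next
    }
    NF == 0 { in_table = 0; next }        # blank line ends the device table
    in_table {
      for (i in col)
        if ($i + 0 > limit + 0) printf "%s %s=%s\n", $1, col[i], $i
    }
  '
}

# Usage (not run here): take two reports, one second apart
#   iostat -xz 1 2 | flag_slow_devices 20
```

Map any flagged device back to its mount point to see which entry in log.dirs it serves.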

If concurrency or CPU is the bottleneck

  • Increase num.io.threads only if you have idle CPU cores. The default is 8, but the right value depends on core count and workload. This helps when threads are genuinely the limit, not when they are blocked on disk.
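  • Reduce broker-side message conversion. Clients on an older message format force the broker to down-convert batches on heap, and a topic compression.type that differs from the producer’s codec forces the broker to decompress and recompress. Upgrade clients and align compression codecs so zero-copy paths stay active.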
  • Offload TLS termination if network threads are also under pressure, or move TLS handshake work off the network threads if your Kafka version supports it.
  • Rebalance leadership if one broker is leading disproportionately many partitions. High LeaderCount relative to the cluster mean drives I/O thread load. Leadership movement is disruptive; plan it carefully.

If a producer retry cascade is feeding the overload

  • Find the slow broker using RequestHandlerAvgIdlePercent and LocalTimeMs across the cluster.
  • Apply producer quotas temporarily to reduce inbound retry traffic. This breaks the positive feedback loop.
  • Fix the root cause, whether it is disk, GC, or a slow follower, before removing quotas.
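A temporary produce quota can be set and later removed with kafka-configs.sh. The helper below only prints the command so it can be reviewed before anyone runs it. The function name quota_cmd, the client id backfill-loader, and the 10 MiB/s rate are placeholders, and it assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas.

```shell
# Print (do not run) the kafka-configs.sh command for a temporary produce quota.
# Client id and byte rate are operator-chosen placeholders.
quota_cmd() {
  action="$1"; client_id="$2"; rate="$3"
  base="kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name $client_id"
  case "$action" in
    add)    echo "$base --add-config producer_byte_rate=$rate" ;;
    remove) echo "$base --delete-config producer_byte_rate" ;;
    *)      echo "usage: quota_cmd add|remove <client-id> [bytes-per-sec]" >&2
            return 1 ;;
  esac
}

quota_cmd add backfill-loader 10485760
quota_cmd remove backfill-loader
```

Printing the remove command alongside the add command makes it harder to forget the quota once the root cause is fixed.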

If slow replication is backing up the purgatory

  • Fix the follower. The root cause is usually disk, network, or GC on the follower broker, not the leader.
  • If a follower is consistently behind, consider a controlled shutdown to remove it from the ISR cleanly rather than letting it flap. This is disruptive and reduces replica count until recovery finishes.

Prevention

  • Keep RequestHandlerAvgIdlePercent above 0.5 at peak load. A stable 0.45 means the cluster has no headroom for a broker failure.
  • Monitor the trend, not just the threshold. Plot idle percent over weeks to catch gradual erosion before it becomes an incident.
  • Size partitions per broker conservatively. Each partition adds file descriptors, memory, and request processing overhead. Imbalanced leadership is often the real culprit behind low idle percent.
  • Run game-day failures. Shut down one broker intentionally and measure how low the surviving brokers’ idle percent drops. If they fall below 0.3 during the test, your cluster is already overcommitted.
  • Enforce quotas for bursty or backfill workloads so one misbehaving client cannot starve I/O threads.
  • Align producer and broker compression codecs to avoid on-heap recompression.
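Before running a game day, a rough estimate shows why 0.45 is too thin. The sketch below assumes that a failed broker's load spreads evenly over the survivors and that busy time scales linearly with load. Both are simplifications, and the function name projected_idle is hypothetical.

```shell
# Estimate the idle ratio on surviving brokers after one broker fails.
# Assumes even redistribution and linear scaling of busy time.
projected_idle() {
  # $1 = current idle ratio per broker, $2 = broker count before the failure
  awk -v idle="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "need at least 2 brokers" > "/dev/stderr"; exit 1 }
    busy = (1 - idle) * n / (n - 1)   # survivors absorb the lost share
    p = 1 - busy
    if (p < 0) p = 0
    printf "%.3f\n", p
  }'
}

projected_idle 0.45 3
```

For a three-broker cluster idling at 0.45, the estimate is 0.175, already below the 0.3 critical line. A real test will differ because leadership and partition sizes are rarely even.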

How Netdata helps

  • Charts RequestHandlerAvgIdlePercent, RequestQueueSize, and produce latency breakdown components so you can correlate them without switching tools.
  • Correlates broker JMX saturation with OS-level signals such as disk await, CPU utilization, and GC pause time to distinguish disk-bound from concurrency-bound overload.
  • Per-broker views let you spot a single hot broker before cluster-wide retries amplify the problem.
  • Alerting on sustained thresholds matches the EWMA nature of the metric better than spike-based triggers.
  • Tracks consumer lag alongside broker fetch latency to separate read-path saturation from write-path saturation.