Kafka RequestHandlerAvgIdlePercent low: I/O thread saturation and overload
You get paged because RequestHandlerAvgIdlePercent is below 0.3 and falling. This is an exponentially weighted moving average, not a spike metric. A low value means the broker’s I/O handler threads have been saturated long enough that the request queue is backing up and clients are timing out.
Treat above 0.5 as healthy, below 0.3 as critical, and below 0.1 as active overload. Even a stable 0.45 is dangerous: one broker failure can shift enough load to collapse the survivors.
Confirm the signal, find the bottleneck layer, and relieve pressure without making it worse.
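The thresholds above can be turned into a small triage helper. This is a minimal sketch, not part of Kafka: the function name `classify_idle` and the band labels are illustrative, and treating exactly 0.5 as healthy is an assumption. Only the 0.5, 0.3, and 0.1 cut-offs come from this guide.

```shell
# Map a RequestHandlerAvgIdlePercent reading (0.0-1.0) to a severity band.
# Band names are illustrative; the cut-offs are the ones used in this guide.
classify_idle() {
  awk -v v="$1" 'BEGIN {
    if      (v < 0.1) print "overload"          # active overload
    else if (v < 0.3) print "critical"          # page-worthy
    else if (v < 0.5) print "limited-headroom"  # stable but fragile
    else              print "healthy"
  }'
}
```

For example, `classify_idle 0.45` prints `limited-headroom`, which matches the warning that a stable 0.45 leaves little room for a broker failure.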
What this means
Kafka brokers use a two-stage thread model. Network threads (num.network.threads, default 3) accept connections and read requests onto a bounded request queue (queued.max.requests, default 500). I/O handler threads (num.io.threads, default 8) dequeue and process those requests, handling produce appends, fetch reads, metadata requests, and anything else that touches the log.
RequestHandlerAvgIdlePercent measures only the I/O thread pool. When it drops, the pool is the bottleneck: requests sit in the queue longer than they should, and eventually they time out. Producers with retries enabled then resend, adding load and creating a positive feedback loop.
flowchart LR
Client --> Network[Network threads num.network.threads]
Network --> Queue[Request queue queued.max.requests]
Queue --> IO[I/O threads num.io.threads]
IO --> Disk[(Log dirs page cache)]
IO --> Purgatory[Delayed op purgatory]
Because the metric is an EWMA, brief bursts do not push it to 0.1. A value below 0.3 reflects sustained pressure. This makes it a reliable alert signal, but the problem has been building for some time by the time the page fires.
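A short arithmetic sketch shows why a brief burst barely moves the metric. It assumes the one-minute rate is a standard EWMA updated every 5 seconds with alpha = 1 - exp(-5/60), which is how Yammer Metrics meters are generally understood to compute OneMinuteRate. Treat that as an assumption and verify it against your Kafka version. The function name `ewma_after` is hypothetical.

```shell
# Simulate an EWMA after a burst.
# $1 = starting idle fraction, $2 = idle fraction during the burst,
# $3 = burst length in 5-second ticks (assumed tick interval).
ewma_after() {
  awk -v start="$1" -v sample="$2" -v ticks="$3" 'BEGIN {
    alpha = 1 - exp(-5 / 60)          # assumed one-minute decay factor
    v = start
    for (i = 0; i < ticks; i++) v += alpha * (sample - v)
    printf "%.3f\n", v
  }'
}
```

Under these assumptions, `ewma_after 0.8 0 2` (about 10 seconds of fully busy threads from a 0.8 baseline) prints `0.677`, still well above the 0.3 alert line.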
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O saturation | LocalTimeMs for Produce is high; iostat shows elevated await; RequestHandlerAvgIdlePercent low on one or a few brokers | Disk await per log dir device |
| Concurrency-bound CPU or thread starvation | Disk latency is fine but CPU is high; RequestQueueTimeMs is rising; many partitions or message down-conversion on the broker | CPU, GC, and MessageConversionsPerSec |
| Producer retry cascade | BytesInPerSec rises while useful throughput does not; RequestQueueSize grows; producer-side retry rate climbs | Producer record-retry-rate and FailedProduceRequestsPerSec |
| Large produce batches with compression | High BytesInPerSec but normal MessagesInPerSec; CPU elevated from decompression | Ratio of bytes to messages and compression codecs |
| Slow replication backing up purgatory | RemoteTimeMs high; Produce purgatory growing; UnderReplicatedPartitions nonzero | UnderReplicatedPartitions and follower disk or network metrics |
Quick checks
Run these safe, read-only checks to triage before making changes.
# Read the saturation metric directly from JMX
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check pressure between network and I/O threads
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Distinguish queue wait from disk wait from replication wait
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Rule out network-thread saturation on the same broker
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999
# Real disk saturation signal
iostat -xz 1
# JVM pressure that steals thread time
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
# Produce purgatory for acks=all backup
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999
# Find partitions whose followers are struggling
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
How to diagnose it
- Confirm the signal is sustained. Because RequestHandlerAvgIdlePercent is an EWMA, one bad minute is not enough. Look for values below 0.3 across several sampling intervals or a steady downward trend.
- Locate the affected broker or brokers. Compare the metric across the cluster. If only one broker is low, that broker is the problem. If many are low, the cluster is genuinely over capacity or a retry cascade is spreading load.
- Check the produce latency breakdown. High RequestQueueTimeMs means I/O threads cannot keep up. High LocalTimeMs means disk writes are slow. High RemoteTimeMs means followers are slow to acknowledge acks=all requests.
- Validate disk health. Use iostat -xz 1 and focus on await for the devices backing log.dirs. For SSDs, sustained await above 20 ms is concerning. For HDDs, above 50 ms is concerning. %util is misleading for SSDs and arrays; prefer await.
- Check CPU and GC. If disk is fine but CPU is high or GC pauses are long, the broker may be CPU-bound from compression, TLS, message down-conversion, or simply too many partitions. High GC pause times show up as correlated spikes in request latency.
- Inspect replication health. If RemoteTimeMs is high, look at UnderReplicatedPartitions, IsrShrinksPerSec, and the follower brokers’ disk and network metrics. Slow followers back up the produce purgatory and starve I/O threads.
- Look for a producer retry cascade. Check producer metrics for elevated record-retry-rate. On the broker, BytesInPerSec may climb while useful throughput stays flat because the same messages are being retried.
- Rule out network thread saturation. If NetworkProcessorAvgIdlePercent is also low, the bottleneck is higher up the stack. Adding num.io.threads will not help and can make contention worse.
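The latency breakdown step can be sketched as a small helper that takes the three Produce p99 values and names the largest component. This is a rough heuristic under stated assumptions: the function name `dominant_stage` is hypothetical, and picking the largest raw value ignores each metric's normal baseline, so compare against your own history before acting.

```shell
# Name the largest component of Produce latency.
# $1 = RequestQueueTimeMs p99, $2 = LocalTimeMs p99, $3 = RemoteTimeMs p99
dominant_stage() {
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    if (q >= l && q >= r) print "queue: I/O threads cannot keep up"
    else if (l >= r)      print "local: disk write path is slow"
    else                  print "remote: followers are slow to acknowledge"
  }'
}
```

Feed it the values returned by the three RequestMetrics reads from the quick checks, for example `dominant_stage 120 15 30`.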
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| RequestHandlerAvgIdlePercent | Best single indicator of broker I/O capacity | Sustained below 0.3; below 0.1 is active overload |
| RequestQueueSize | Pressure gauge between network and I/O threads | Sustained above 250 with default queued.max.requests=500 |
| RequestQueueTimeMs Produce | Time requests wait for an I/O thread | p99 above 2-3x baseline for more than 5 minutes |
| LocalTimeMs Produce | Disk write path latency | p99 above 20 ms on SSD, 50 ms on HDD |
| RemoteTimeMs Produce | Time waiting for follower replication | p99 above 50% of replica.lag.time.max.ms |
| NetworkProcessorAvgIdlePercent | Distinguishes network threads from I/O threads | Sustained below 0.3 |
| Produce purgatory size | acks=all requests blocked on replication | Sustained above 2x baseline |
| UnderReplicatedPartitions | Follower health and replication pressure | Nonzero outside maintenance windows |
| BytesInPerSec / MessagesInPerSec ratio | Detects retry floods or giant batches | Sharp ratio shift without traffic change |
| Disk await | Real disk saturation signal | Sustained above 20 ms SSD / 50 ms HDD |
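The RequestQueueSize warning sign in the table can be expressed relative to the configured queue bound. The sketch below assumes the "250 of 500" rule generalizes to "more than half of queued.max.requests", which is an extrapolation from the table, not a Kafka-defined threshold. The function name `queue_pressure` is hypothetical.

```shell
# Report how full the request queue is.
# $1 = current RequestQueueSize, $2 = queued.max.requests (defaults to 500)
queue_pressure() {
  local size="$1" max="${2:-500}"
  local pct=$(( size * 100 / max ))   # integer percentage
  if [ "$pct" -gt 50 ]; then
    echo "${pct}% full: warning"
  else
    echo "${pct}% full: ok"
  fi
}
```

A single reading is only a snapshot; the table's warning sign applies when the value stays high across several samples.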
Fixes
If disk I/O is the bottleneck
Do not increase num.io.threads. More threads hitting a slow disk adds contention, not throughput.
- Identify whether one log directory is degraded. With JBOD, a single slow disk can degrade the whole broker while other directories are fine.
- If the disk is failing, plan replacement or migration. A controlled broker shutdown triggers leader election. This is disruptive; do it during a maintenance window or when the cluster can absorb the shift.
- Reduce random I/O pressure by checking for runaway compaction, excessive partition count on the broker, or a backfill consumer thrashing page cache.
- If the cluster is uniformly disk-bound, add brokers and rebalance partitions, or move to faster storage.
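To check whether one log directory is degraded, it helps to list the configured directories first and then map each one to its backing device. The helper below is a minimal sketch: the name `list_log_dirs` is hypothetical, and the path to server.properties is whatever your installation uses.

```shell
# Print each configured Kafka log directory on its own line.
# $1 = path to server.properties (location varies by install)
list_log_dirs() {
  local file="$1" dirs
  dirs=$(grep -E '^log\.dirs=' "$file" | tail -n 1 | cut -d= -f2-)
  # Kafka falls back to log.dir when log.dirs is not set
  [ -n "$dirs" ] || dirs=$(grep -E '^log\.dir=' "$file" | tail -n 1 | cut -d= -f2-)
  printf '%s\n' "$dirs" | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage (read-only): show the filesystem behind each directory, then watch
# those devices in iostat.
#   list_log_dirs /path/to/server.properties | while read -r d; do df -P "$d" | tail -n 1; done
```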
If concurrency or CPU is the bottleneck
- Increase num.io.threads only if you have idle CPU cores. The default is 8, but the right value depends on core count and workload. This helps when threads are genuinely the limit, not when they are blocked on disk.
- Reduce broker-side message down-conversion. Old clients force the broker to decompress and recompress on heap. Upgrade clients or align compression codecs so zero-copy paths stay active.
- Offload TLS termination if network threads are also under pressure, or move TLS handshake work off the network threads if your Kafka version supports it.
- Rebalance leadership if one broker is leading disproportionately many partitions. High LeaderCount relative to the cluster mean drives I/O thread load. Leadership movement is disruptive; plan it carefully.
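If you do decide to raise num.io.threads, a dry-run helper can guard against oversizing the pool. The sketch below only prints the command and does not run it. The function name `io_threads_cmd` is hypothetical, capping the pool at the core count is a conservative assumption and not a Kafka rule, and the printed command assumes a Kafka version that supports dynamic broker configs and a broker reachable at localhost:9092.

```shell
# Print (do not run) the command to change num.io.threads cluster-wide.
# $1 = proposed num.io.threads, $2 = CPU cores per broker (defaults to nproc)
io_threads_cmd() {
  local proposed="$1" cores="${2:-$(nproc)}"
  if [ "$proposed" -gt "$cores" ]; then
    # Assumed guard: more handler threads than cores rarely helps
    echo "refusing: $proposed threads exceeds $cores cores" >&2
    return 1
  fi
  echo "kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-default --alter --add-config num.io.threads=$proposed"
}
```

Review the printed command, confirm disk await is healthy first, and watch RequestHandlerAvgIdlePercent after applying it.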
If a producer retry cascade is feeding the overload
- Find the slow broker using RequestHandlerAvgIdlePercent and LocalTimeMs across the cluster.
- Apply producer quotas temporarily to reduce inbound retry traffic. This breaks the positive feedback loop.
- Fix the root cause, whether it is disk, GC, or a slow follower, before removing quotas.
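The temporary quota step might look like the sketch below, which prints the kafka-configs.sh commands for adding and later removing a producer byte-rate quota on one client.id. The function names are hypothetical, the bootstrap address is an assumption, and the client name and rate in the usage note are placeholders. Quotas of this kind are enforced per broker, so size the cap accordingly.

```shell
# Print (do not run) the command that caps a producer's byte rate.
# $1 = client.id to throttle, $2 = bytes per second allowed per broker
producer_quota_cmd() {
  echo "kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name $1 --add-config producer_byte_rate=$2"
}

# Print (do not run) the command that removes the cap again.
# $1 = client.id
producer_quota_remove_cmd() {
  echo "kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name $1 --delete-config producer_byte_rate"
}
```

For example, `producer_quota_cmd ingest-app 1048576` prints a command that would limit the placeholder client `ingest-app` to 1 MiB/s per broker. Keep the removal command at hand for when the root cause is fixed.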
If slow replication is backing up the purgatory
- Fix the follower. The root cause is usually disk, network, or GC on the follower broker, not the leader.
- If a follower is consistently behind, consider a controlled shutdown to remove it from the ISR cleanly rather than letting it flap. This is disruptive and reduces replica count until recovery finishes.
Prevention
- Keep RequestHandlerAvgIdlePercent above 0.5 at peak load. A stable 0.45 means the cluster has no headroom for a broker failure.
- Monitor the trend, not just the threshold. Plot idle percent over weeks to catch gradual erosion before it becomes an incident.
- Size partitions per broker conservatively. Each partition adds file descriptors, memory, and request processing overhead. Imbalanced leadership is often the real culprit behind low idle percent.
- Run game-day failures. Shut down one broker intentionally and measure how low the surviving brokers’ idle percent drops. If they fall below 0.3 during the test, your cluster is already overcommitted.
- Enforce quotas for bursty or backfill workloads so one misbehaving client cannot starve I/O threads.
- Align producer and broker compression codecs to avoid on-heap decompression.
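A back-of-the-envelope estimate shows why 0.5 is the headroom target. The sketch assumes that a failed broker's load spreads evenly across the survivors and that busy time scales linearly with load. Both are simplifications, so treat the result as a rough lower bound on trouble, not a prediction. The function name `idle_after_broker_loss` is hypothetical.

```shell
# Estimate per-broker idle fraction after losing one broker.
# $1 = current idle fraction per broker, $2 = broker count before the failure
idle_after_broker_loss() {
  awk -v idle="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "need at least 2 brokers"; exit 1 }
    after = 1 - (1 - idle) * n / (n - 1)   # busy share grows by n/(n-1)
    if (after < 0) after = 0               # clamp: fully saturated
    printf "%.3f\n", after
  }'
}
```

With three brokers each idling at 0.45, `idle_after_broker_loss 0.45 3` prints `0.175`, already below the 0.3 critical line. A game-day test is the way to measure the real number.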
How Netdata helps
- Charts RequestHandlerAvgIdlePercent, RequestQueueSize, and produce latency breakdown components so you can correlate them without switching tools.
- Correlates broker JMX saturation with OS-level signals such as disk await, CPU utilization, and GC pause time to distinguish disk-bound from concurrency-bound overload.
- Per-broker views let you spot a single hot broker before cluster-wide retries amplify the problem.
- Alerting on sustained thresholds matches the EWMA nature of the metric better than spike-based triggers.
- Tracks consumer lag alongside broker fetch latency to separate read-path saturation from write-path saturation.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation







