Kafka RequestHandlerAvgIdlePercent low: I/O thread saturation and overload

You get paged because RequestHandlerAvgIdlePercent is below 0.3 and falling. The metric is an exponentially weighted moving average of the fraction of time the broker's I/O handler threads spend idle, not a spike metric. A low value means those threads have been saturated long enough that the request queue is backing up and clients may be timing out.

Treat above 0.5 as healthy, below 0.3 as critical, and below 0.1 as active overload. Values between 0.3 and 0.5 leave little headroom: even a stable 0.45 is dangerous, because one broker failure can shift enough load to collapse the survivors.
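As a rough illustration, the thresholds above can be written as a small shell helper. The function name `classify_idle` and the `low-headroom` label for the band between 0.3 and 0.5 are assumptions made for this sketch, not Kafka terminology.

```shell
# Sketch: map an idle fraction (0 to 1) onto the bands described above.
# "low-headroom" is a label invented here for the 0.3 to 0.5 band.
classify_idle() {
  awk -v v="$1" 'BEGIN {
    v += 0                                   # force numeric comparison
    if (v < 0.1)       print "overload"
    else if (v < 0.3)  print "critical"
    else if (v <= 0.5) print "low-headroom"
    else               print "healthy"
  }'
}

classify_idle 0.45   # the "stable but dangerous" case from the text
```

Feeding it the value read from JMX gives a quick label for dashboards or scripts; the cut-offs are the ones stated in this guide and may need tuning for your cluster.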

Confirm the signal, find the bottleneck layer, and relieve pressure without making it worse.

What this means

Kafka brokers use a two-stage thread model. Network threads (num.network.threads, default 3) accept connections and read requests onto a bounded request queue (queued.max.requests, default 500). I/O handler threads (num.io.threads, default 8) dequeue and process those requests: produce appends, fetch reads, metadata requests, and everything else the broker serves.

RequestHandlerAvgIdlePercent measures only the I/O thread pool. When it drops, the pool is the bottleneck: requests sit in the queue longer than they should, and eventually they time out. Producers with retries enabled then resend, adding load and creating a positive feedback loop.

flowchart LR
    Client --> Network[Network threads num.network.threads]
    Network --> Queue[Request queue queued.max.requests]
    Queue --> IO[I/O threads num.io.threads]
    IO --> Disk[(Log dirs page cache)]
    IO --> Purgatory[Delayed op purgatory]

Because the metric is an EWMA, brief bursts do not push it to 0.1. A value below 0.3 reflects sustained pressure. This makes it a reliable alert signal, but the problem has been building for some time by the time the page fires.
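To see why, here is a minimal sketch of an EWMA updated once per tick. It assumes a 5-second tick and a one-minute decay constant, which is how one-minute meter rates are commonly computed. The helper name `ewma_after` and the sample values are illustrative, and the broker's exact arithmetic may differ.

```shell
# Sketch: apply N identical samples to an EWMA and print the result.
# Assumptions: 5-second ticks, one-minute decay (alpha = 1 - e^(-5/60)).
ewma_after() {
  # $1 = starting average, $2 = idle fraction observed each tick, $3 = tick count
  awk -v start="$1" -v sample="$2" -v ticks="$3" 'BEGIN {
    alpha = 1 - exp(-5 / 60)
    v = start + 0
    for (i = 0; i < ticks; i++) v += alpha * (sample - v)
    printf "%.4f\n", v
  }'
}

ewma_after 0.6 0.0 2     # 10-second burst of full saturation
ewma_after 0.6 0.05 36   # three minutes at 0.05 idle
```

Under these assumptions the short burst leaves the average at roughly 0.51, while three minutes of near-saturation pull it below 0.1.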

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Disk I/O saturation | `LocalTimeMs` for Produce is high; `iostat` shows elevated `await`; `RequestHandlerAvgIdlePercent` low on one or a few brokers | Disk `await` per log dir device |
| Concurrency-bound CPU or thread starvation | Disk latency is fine but CPU is high; `RequestQueueTimeMs` is rising; many partitions or message down-conversion on the broker | CPU, GC, and `MessageConversionsPerSec` |
| Producer retry cascade | `BytesInPerSec` rises while useful throughput does not; `RequestQueueSize` grows; producer-side retry rate climbs | Producer `record-retry-rate` and `FailedProduceRequestsPerSec` |
| Large produce batches with compression | High `BytesInPerSec` but normal `MessagesInPerSec`; CPU elevated from decompression | Ratio of bytes to messages and compression codecs |
| Slow replication backing up purgatory | `RemoteTimeMs` high; Produce purgatory growing; `UnderReplicatedPartitions` nonzero | `UnderReplicatedPartitions` and follower disk or network metrics |

Quick checks

Run these safe, read-only checks to triage before making changes.

# Read the saturation metric directly from JMX
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check pressure between network and I/O threads
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Distinguish queue wait from disk wait from replication wait
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Rule out network-thread saturation on the same broker
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999

# Real disk saturation signal
iostat -xz 1

# JVM pressure that steals thread time
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

# Produce purgatory for acks=all backup
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999

# Find partitions whose followers are struggling
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

How to diagnose it

Confirm the signal is sustained. Because RequestHandlerAvgIdlePercent is an EWMA, one bad minute is not enough. Look for values below 0.3 across several sampling intervals or a steady downward trend.
Locate the affected broker or brokers. Compare the metric across the cluster. If only one broker is low, that broker is the problem. If many are low, the cluster is genuinely over capacity or a retry cascade is spreading load.
Check the produce latency breakdown. High RequestQueueTimeMs means I/O threads cannot keep up. High LocalTimeMs means disk writes are slow. High RemoteTimeMs means followers are slow to acknowledge acks=all requests.
Validate disk health. Use iostat -xz 1 and focus on await for the devices backing log.dirs. For SSDs, sustained await above 20 ms is concerning. For HDDs, above 50 ms is concerning. %util is misleading for SSDs and arrays; prefer await.
Check CPU and GC. If disk is fine but CPU is high or GC pauses are long, the broker may be CPU-bound from compression, TLS, message down-conversion, or simply too many partitions. High GC pause times show up as correlated spikes in request latency.
Inspect replication health. If RemoteTimeMs is high, look at UnderReplicatedPartitions, IsrShrinksPerSec, and the follower brokers’ disk and network metrics. Slow followers back up the produce purgatory and starve I/O threads.
Look for a producer retry cascade. Check producer metrics for elevated record-retry-rate. On the broker, BytesInPerSec may climb while useful throughput stays flat because the same messages are being retried.
Rule out network thread saturation. If NetworkProcessorAvgIdlePercent is also low, the bottleneck sits upstream of the I/O pool, in the network threads. Raising num.io.threads will not help and can make contention worse.
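The latency breakdown from the steps above can be reduced to a first guess about where to look. The sketch below simply reports which of the three Produce p99 components is largest. The function name `produce_bottleneck` is hypothetical, and picking the maximum is a deliberate simplification: slow disks also inflate queue time, so treat the answer as a starting point only.

```shell
# Sketch: name the largest component of Produce latency.
# Inputs are p99 values in milliseconds, read from the RequestMetrics MBeans.
produce_bottleneck() {
  # $1 = RequestQueueTimeMs, $2 = LocalTimeMs, $3 = RemoteTimeMs
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    q += 0; l += 0; r += 0                   # force numeric comparison
    if (q >= l && q >= r) print "queue: I/O threads cannot keep up"
    else if (l >= r)      print "local: disk writes are slow"
    else                  print "remote: followers are slow to acknowledge"
  }'
}

produce_bottleneck 5 120 10   # LocalTimeMs dominates, so check the disks first
```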

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `RequestHandlerAvgIdlePercent` | Best single indicator of broker I/O capacity | Sustained below 0.3; below 0.1 is active overload |
| `RequestQueueSize` | Pressure gauge between network and I/O threads | Sustained above 250 with default `queued.max.requests=500` |
| `RequestQueueTimeMs` Produce | Time requests wait for an I/O thread | p99 above 2-3x baseline for more than 5 minutes |
| `LocalTimeMs` Produce | Disk write path latency | p99 above 20 ms on SSD, 50 ms on HDD |
| `RemoteTimeMs` Produce | Time waiting for follower replication | p99 above 50% of `replica.lag.time.max.ms` |
| `NetworkProcessorAvgIdlePercent` | Distinguishes network threads from I/O threads | Sustained below 0.3 |
| Produce purgatory size | `acks=all` requests blocked on replication | Sustained above 2x baseline |
| `UnderReplicatedPartitions` | Follower health and replication pressure | Nonzero outside maintenance windows |
| `BytesInPerSec` / `MessagesInPerSec` ratio | Detects retry floods or giant batches | Sharp ratio shift without traffic change |
| Disk `await` | Real disk saturation signal | Sustained above 20 ms SSD / 50 ms HDD |
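The bytes-to-messages ratio is easy to compute by hand from two rate samples. As a small sketch, the hypothetical helper `ratio_shift` below compares the current bytes-per-message figure with a baseline and prints the multiple; what counts as a "sharp" shift is left to you.

```shell
# Sketch: how many times larger is the current bytes-per-message ratio
# than the baseline? All four inputs are per-second rates.
ratio_shift() {
  # $1 = baseline BytesInPerSec, $2 = baseline MessagesInPerSec
  # $3 = current BytesInPerSec,  $4 = current MessagesInPerSec
  awk -v bb="$1" -v bm="$2" -v cb="$3" -v cm="$4" 'BEGIN {
    if (bm + 0 <= 0 || cm + 0 <= 0 || bb + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", (cb / cm) / (bb / bm)
  }'
}

# Example values are made up: bytes tripled while the message rate stayed flat.
ratio_shift 50000000 50000 150000000 50000
```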

Fixes

If disk I/O is the bottleneck

Do not increase num.io.threads. More threads hitting a slow disk adds contention, not throughput.

Identify whether one log directory is degraded. With JBOD, a single slow disk can degrade the whole broker while other directories are fine.
If the disk is failing, plan replacement or migration. A controlled broker shutdown triggers leader election. This is disruptive; do it during a maintenance window or when the cluster can absorb the shift.
Reduce random I/O pressure by checking for runaway compaction, excessive partition count on the broker, or a backfill consumer thrashing page cache.
If the cluster is uniformly disk-bound, add brokers and rebalance partitions, or move to faster storage.
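To spot a single degraded device among several log directories, the extended `iostat` output can be filtered by latency. The sketch below is one possible filter, not a standard tool: the function name `slow_devices` is hypothetical, and because column names differ between sysstat versions (newer releases report `r_await` and `w_await` instead of a single `await`), the column name is a parameter that defaults to `w_await`.

```shell
# Sketch: print devices whose latency column exceeds a limit (milliseconds).
# Reads `iostat -x` style text on stdin and locates the column by header name.
slow_devices() {
  # $1 = limit in ms, $2 = column name (assumed default: w_await)
  awk -v limit="$1" -v col="${2:-w_await}" '
    /^avg-cpu/ { idx = 0; next }             # ignore the CPU summary block
    $1 ~ /^Device/ {                         # header row: find the column
      idx = 0
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      next
    }
    idx && NF >= idx && $idx + 0 > limit + 0 { print $1, $idx }
  '
}

# Intended use on a broker host (not run here):
#   iostat -xz 1 2 | slow_devices 50
```

Map any device it prints back to the mount points listed in log.dirs before drawing conclusions.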

If concurrency or CPU is the bottleneck

Increase num.io.threads only if you have idle CPU cores. The default is 8, but the right value depends on core count and workload. This helps when threads are genuinely the limit, not when they are blocked on disk.
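Reduce broker-side message down-conversion. Consumers that require an older message format force the broker to convert batches on heap, which bypasses the zero-copy send path and costs CPU. Upgrade those clients, and align compression codecs between producers and the broker so batches do not need to be recompressed.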
Offload TLS termination if network threads are also under pressure, or move TLS handshake work off the network threads if your Kafka version supports it.
Rebalance leadership if one broker is leading disproportionately many partitions. High LeaderCount relative to the cluster mean drives I/O thread load. Leadership movement is disruptive; plan it carefully.
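Leadership skew is simple to check once you have a leader count per broker. The sketch below reads `broker_id leader_count` pairs and prints brokers well above the cluster mean. The function name `hot_leaders` and the 1.5x default are assumptions for illustration; the guide does not prescribe a threshold.

```shell
# Sketch: flag brokers whose LeaderCount exceeds a multiple of the mean.
# Input on stdin: one "broker_id leader_count" pair per line.
hot_leaders() {
  # $1 = multiple of the mean that counts as hot (assumed default 1.5)
  awk -v factor="${1:-1.5}" '
    NF >= 2 { id[++n] = $1; count[n] = $2 + 0; total += $2 }
    END {
      if (n == 0) exit
      mean = total / n
      for (i = 1; i <= n; i++)
        if (count[i] > factor * mean)
          printf "%s %d (mean %.1f)\n", id[i], count[i], mean
    }
  '
}

# Made-up counts for three brokers: broker 3 leads far more than its share.
printf '%s\n' '1 100' '2 110' '3 400' | hot_leaders
```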

If a producer retry cascade is feeding the overload

Find the slow broker using RequestHandlerAvgIdlePercent and LocalTimeMs across the cluster.
Apply producer quotas temporarily to reduce inbound retry traffic. This breaks the positive feedback loop.
Fix the root cause, whether it is disk, GC, or a slow follower, before removing quotas.
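A temporary producer quota can be set per client.id with `kafka-configs.sh`. The sketch below only prints the command so it can be reviewed before anyone runs it. The helper name `producer_quota_cmd`, the client id `backfill-job`, and the 10 MiB/s rate are hypothetical values chosen for the example.

```shell
# Sketch: build (but do not execute) a kafka-configs.sh quota command.
producer_quota_cmd() {
  # $1 = client.id, $2 = producer_byte_rate in bytes/sec, $3 = bootstrap server
  printf "kafka-configs.sh --bootstrap-server %s --alter --add-config 'producer_byte_rate=%s' --entity-type clients --entity-name %s\n" \
    "${3:-localhost:9092}" "$2" "$1"
}

producer_quota_cmd backfill-job 10485760
```

Once the root cause is fixed, the same tool removes the override with `--delete-config producer_byte_rate` in place of `--add-config`.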

If slow replication is backing up the purgatory

Fix the follower. The root cause is usually disk, network, or GC on the follower broker, not the leader.
If a follower is consistently behind, consider a controlled shutdown to remove it from the ISR cleanly rather than letting it flap. This is disruptive and reduces replica count until recovery finishes.

Prevention

Keep RequestHandlerAvgIdlePercent above 0.5 even at peak load. A stable 0.45 means the cluster has no headroom for a broker failure.
Monitor the trend, not just the threshold. Plot idle percent over weeks to catch gradual erosion before it becomes an incident.
Size partitions per broker conservatively. Each partition adds file descriptors, memory, and request processing overhead. Imbalanced leadership is often the real culprit behind low idle percent.
Run game-day failures. Shut down one broker intentionally and measure how low the surviving brokers’ idle percent drops. If they fall below 0.3 during the test, your cluster is already overcommitted.
Enforce quotas for bursty or backfill workloads so one misbehaving client cannot starve I/O threads.
Align producer and broker compression codecs to avoid broker-side recompression.
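Before running a game day, a back-of-the-envelope projection shows why 0.45 leaves no headroom. The sketch below assumes the failed broker's load spreads evenly across the survivors and that busy time scales linearly with load. Both are simplifications, and the function name `idle_after_loss` is hypothetical.

```shell
# Sketch: projected idle fraction on each survivor after losing one broker.
# Assumes even redistribution and linear scaling of busy time.
idle_after_loss() {
  # $1 = current idle fraction, $2 = broker count before the failure
  awk -v idle="$1" -v n="$2" 'BEGIN {
    if (n + 0 < 2) { print "n/a"; exit }
    projected = 1 - (1 - idle) * n / (n - 1)
    if (projected < 0) projected = 0         # cannot be less than fully busy
    printf "%.3f\n", projected
  }'
}

idle_after_loss 0.45 3   # three brokers, one fails
idle_after_loss 0.45 6   # six brokers, one fails
```

With three brokers at 0.45, the survivors land at about 0.175, well inside the critical band; with six brokers the same starting point projects to about 0.34. A real failure test remains the only reliable measurement.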

How Netdata helps

Charts RequestHandlerAvgIdlePercent, RequestQueueSize, and produce latency breakdown components so you can correlate them without switching tools.
Correlates broker JMX saturation with OS-level signals such as disk await, CPU utilization, and GC pause time to distinguish disk-bound from concurrency-bound overload.
Per-broker views let you spot a single hot broker before cluster-wide retries amplify the problem.
Alerting on sustained thresholds matches the EWMA nature of the metric better than spike-based triggers.
Tracks consumer lag alongside broker fetch latency to separate read-path saturation from write-path saturation.
