Kafka request queue filling up: RequestQueueSize, queued.max.requests, and backpressure
Your Kafka producers are timing out. Broker logs show no errors, but client-side metrics reveal growing latency and retries. On the broker, RequestQueueSize climbs toward queued.max.requests (default 500). Once the queue fills, network threads block on enqueue and stop reading from their sockets. Clients see TCP backpressure, time out, and retry. Retries add load, deepening the queue. This feedback loop correlates with falling RequestHandlerAvgIdlePercent.
What this means
Kafka brokers use a reactor pattern. Network threads (num.network.threads, default 3) read requests into a bounded queue. I/O handler threads (num.io.threads, default 8) dequeue and process them. The queue capacity is queued.max.requests (default 500).
When I/O threads process slower than network threads read, the queue grows. Sustained elevation above 50% of capacity signals saturation. At the cap, network threads block on enqueue and stop draining their selectors. TCP backpressure propagates to clients.
Producers hit request.timeout.ms (default 30s) and retry if configured. With retries > 0, duplicate batches increase request volume while successful throughput falls.
RequestQueueSize (kafka.network:type=RequestChannel,name=RequestQueueSize) is an instantaneous gauge. Brief spikes are normal; sustained elevation is the signal. RequestHandlerAvgIdlePercent (kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent) measures the same pressure from the I/O thread side: the fraction of time threads are idle. Below 0.3 is severe; below 0.1 is active overload.
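The thresholds above can be folded into a small triage helper. This is a minimal sketch, not an official tool: the function name `classify_pressure` and its state labels are hypothetical, and it judges a single sample, so it is only meaningful when fed values you already know to be sustained.

```shell
#!/usr/bin/env bash
# classify_pressure QUEUE_SIZE QUEUE_CAP IDLE_FRACTION
# Hypothetical helper: maps one sample of RequestQueueSize and
# RequestHandlerAvgIdlePercent onto the thresholds described in the text.
classify_pressure() {
  awk -v q="$1" -v cap="$2" -v idle="$3" 'BEGIN {
    if (idle < 0.1)          print "overload"        # active overload
    else if (idle < 0.3)     print "severe"          # I/O threads saturating
    else if (q > cap * 0.5)  print "queue-elevated"  # above 50% of capacity
    else                     print "ok"
  }'
}
```

For example, `classify_pressure 300 500 0.6` reports an elevated queue with healthy I/O threads, which is the bursty-but-healthy case. Idle percent is checked first because it is the stronger overload signal.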
If control.plane.listener.name is not configured, controller requests share this queue. Under heavy load, LeaderAndISR and similar controller requests are delayed, which compounds metadata staleness.
flowchart TD
A[I/O threads slow] --> B[RequestQueueSize grows]
B --> C[Queue approaches queued.max.requests]
C --> D[Network threads block on enqueue]
D --> E[Client requests time out]
E --> F[Producers retry]
F --> G[More requests hit network threads]
G --> A
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| I/O thread saturation from disk latency | RequestQueueTimeMs and LocalTimeMs spike; RequestHandlerAvgIdlePercent drops below 0.3 | Disk await via iostat -xz 1 on the broker |
| Producer timeout cascade | BytesInPerSec rises while MessagesInPerSec stays flat; queue grows on one broker | Per-broker RequestHandlerAvgIdlePercent and producer retry metrics |
| Insufficient I/O threads for peak load | Queue grows predictably during traffic peaks; idle percent hovers near 0.3 | num.io.threads versus per-broker partition count |
| Control-plane requests competing for queue space | Queue elevates during controller events without matching traffic spikes | Whether control.plane.listener.name is configured |
| Slow replication holding acks=all produce requests in purgatory | Produce purgatory size grows; RemoteTimeMs high for acks=all; UnderReplicatedPartitions elevated | Follower FetchFollower latency and ISR health |
Quick checks
# Current request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# I/O thread idle percent (1.0 = fully idle)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999
# Produce request queue wait time (p99)
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Disk latency for log directories
iostat -xz 1
# Produce purgatory size
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999
# Approximate connection count to the broker (run as root so ss can report the owning process)
ss -tnp | grep -c "pid=$(pgrep -o -f kafka.Kafka),"
How to diagnose it
- Confirm sustained queue growth. Check RequestQueueSize over 5 minutes. A flat line at 400 is different from a momentary spike to 300.
- Correlate with RequestHandlerAvgIdlePercent. Above 0.5 suggests a transient burst; below 0.3 confirms I/O thread saturation.
- Decompose request latency. RequestQueueTimeMs high means the queue is the bottleneck. LocalTimeMs high means disk I/O is slow. RemoteTimeMs high means replication lag (for acks=all). ResponseQueueTimeMs high means network threads cannot send responses fast enough.
- Check disk I/O and page cache. Run iostat -xz 1. await above 20ms on SSDs or 50ms on HDDs indicates blocking disk I/O. Check /proc/vmstat for pgmajfault rate to detect page cache thrashing.
- Detect retry cascades. Compare BytesInPerSec to MessagesInPerSec. If bytes rise while messages stay flat, producers are retrying large batches. Check client-side record-retry-rate if available.
- Rule out control-plane interference. If control.plane.listener.name is absent, check whether controller events (ISR changes, leader elections) coincide with queue growth. Controller requests in the shared queue are blocked by data-plane backlog.
- Inspect purgatory and replication. Growing produce purgatory with high RemoteTimeMs points to slow followers. Check UnderReplicatedPartitions and FetchFollower latency.
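The first step above, telling sustained growth from a momentary spike, can be sketched as a filter over a series of samples. The function name `sustained_above` is hypothetical, and the rule it applies (every sample in the window must exceed the threshold) is an assumption you may want to loosen. How the samples are collected, for example by polling the jmxterm command from Quick checks every 10 seconds for 5 minutes, is left out because its output parsing depends on your setup.

```shell
#!/usr/bin/env bash
# sustained_above THRESHOLD
# Reads one numeric sample per line on stdin and prints:
#   sustained  every sample exceeds THRESHOLD
#   spike      only some samples exceed THRESHOLD
#   normal     no sample exceeds THRESHOLD
#   no-data    stdin was empty
sustained_above() {
  awk -v t="$1" '
    NF { n++; if ($1 + 0 > t) hi++ }   # count samples and those above threshold
    END {
      if (n == 0)       print "no-data"
      else if (hi == n) print "sustained"
      else if (hi > 0)  print "spike"
      else              print "normal"
    }'
}
```

With the default cap of 500, `sustained_above 250` applies the 50% rule: a window of samples near 400 is reported as sustained, while a single reading of 300 among low values is reported as a spike.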
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| RequestQueueSize | Instantaneous pressure between network and I/O threads | Sustained >50% of queued.max.requests (default 500) |
| RequestHandlerAvgIdlePercent | I/O thread saturation; inverse view of queue pressure | Sustained <0.3; active overload <0.1 |
| RequestQueueTimeMs (Produce) | Time spent waiting for an I/O thread | p99 exceeds 2-3x baseline for >5 minutes |
| NetworkProcessorAvgIdlePercent | Network thread saturation; can mimic queue symptoms | Sustained <0.3 |
| BytesInPerSec vs MessagesInPerSec | Detects producer retry cascades | Bytes rise while messages flat or fall |
| Produce purgatory size | acks=all requests stuck waiting for replication | Sustained >2x baseline with growing queue |
| LocalTimeMs (Produce) | Disk write latency inside request handling | p99 spikes correlate with queue growth |
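The BytesInPerSec versus MessagesInPerSec comparison in the table can be reduced to a single ratio, bytes per message, tracked against a baseline. The sketch below assumes you already have both rates as numbers; the function names and the default factor of 1.5 are hypothetical choices, not Kafka conventions. A rising ratio is only a hint, since larger messages or bigger batches raise it too.

```shell
#!/usr/bin/env bash
# bytes_per_message BYTES_IN_PER_SEC MESSAGES_IN_PER_SEC
# Prints the average bytes per message, or "n/a" when no messages arrive.
bytes_per_message() {
  awk -v b="$1" -v m="$2" 'BEGIN {
    if (m <= 0) { print "n/a"; exit }
    printf "%.0f\n", b / m
  }'
}

# retry_suspected BASELINE_BPM CURRENT_BPM [FACTOR]
# Exit status 0 when the current ratio exceeds FACTOR times the baseline.
# FACTOR defaults to 1.5, an arbitrary starting point to tune.
retry_suspected() {
  awk -v base="$1" -v cur="$2" -v f="${3:-1.5}" 'BEGIN { exit !(cur > base * f) }'
}
```

A typical use is `retry_suspected "$baseline" "$(bytes_per_message "$bytes" "$msgs")" && echo "check producer retries"`, followed by confirming with client-side record-retry-rate.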
Fixes
Disk I/O is the bottleneck
If LocalTimeMs and disk await are elevated, more I/O threads will not help. Additional threads contending on a slow disk worsen latency. Identify whether slowness is hardware degradation, page cache thrashing, or a competing workload. For JBOD, one degraded log directory slows the whole broker because I/O threads are shared. Consider a controlled shutdown to trigger leader migration if a specific disk is degraded.
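To separate page cache thrashing from the other disk causes, the pgmajfault counter in /proc/vmstat can be turned into a rate. This is a minimal Linux-only sketch; `read_pgmajfault` and `rate_per_sec` are hypothetical helper names, and what counts as a high rate depends on your baseline.

```shell
#!/usr/bin/env bash
# read_pgmajfault: print the cumulative major page fault count (Linux only).
read_pgmajfault() {
  awk '$1 == "pgmajfault" { print $2 }' /proc/vmstat
}

# rate_per_sec BEFORE AFTER SECONDS
# Converts two readings of a cumulative counter into a per-second rate.
rate_per_sec() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Example (commented out so that sourcing this file has no side effects):
#   before=$(read_pgmajfault); sleep 10; after=$(read_pgmajfault)
#   rate_per_sec "$before" "$after" 10
```

A rate that climbs together with LocalTimeMs suggests reads are being served from disk rather than page cache, which points at memory pressure or lagging consumers rather than failing hardware.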
I/O thread pool is too small
If RequestHandlerAvgIdlePercent is low and disk I/O is healthy, increase num.io.threads in server.properties and perform a rolling restart.
Queue capacity is too small for burst traffic
If the queue hits 500 only during predictable bursts and I/O threads recover quickly, raise queued.max.requests in server.properties to absorb bursts without blocking network threads. Do not raise the cap if RequestHandlerAvgIdlePercent is already low; you will only delay overload and increase memory pressure.
Producer retry cascade is feeding the loop
Break the loop by throttling producers via quotas. You can temporarily increase request.timeout.ms to reduce premature retries, but fix the root broker bottleneck immediately or you increase memory pressure from in-flight requests.
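A producer byte-rate quota can be applied with kafka-configs.sh. The sketch below assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas, that the script is on your PATH, and that the offending producer sets a client.id; the BOOTSTRAP variable, the function names, and the example client id are hypothetical.

```shell
#!/usr/bin/env bash
# mib_to_bytes MIB: convert MiB/s into the bytes/s value that
# producer_byte_rate expects.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# set_producer_quota CLIENT_ID MIB_PER_SEC
# Applies a producer_byte_rate quota to one client id.
# BOOTSTRAP is an assumed environment variable; adjust for your cluster.
set_producer_quota() {
  kafka-configs.sh --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --alter --entity-type clients --entity-name "$1" \
    --add-config "producer_byte_rate=$(mib_to_bytes "$2")"
}

# Example (not executed here): set_producer_quota batch-loader 10
```

Quotas are enforced per broker, so the effective cluster-wide ceiling for a client is roughly the quota multiplied by the number of brokers it produces to.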
Control-plane requests are competing
Configure control.plane.listener.name (KIP-291) to isolate controller traffic. Without this, heavy data-plane traffic blocks LeaderAndISR and UpdateMetadata, causing metadata propagation stalls and additional elections.
Prevention
- Monitor queue depth and idle percent together. RequestQueueSize and RequestHandlerAvgIdlePercent are two views of the same bottleneck. Alerting on only one misses bursty-but-healthy traffic or saturated threads with a temporarily short queue.
- Maintain I/O thread headroom above 50% idle during peak. Surviving brokers must absorb partitions from a failed peer. If they normally run below 0.5 idle, the shifted load drops them into overload.
- Isolate controller traffic with a dedicated listener. Configure control.plane.listener.name so controller requests bypass the data-plane queue.
- Set producer byte-rate quotas as circuit breakers. Quotas throttle runaway producers before retries fill the queue.
- Validate broker version before thresholding idle percent. On Kafka versions older than 2.1, RequestHandlerAvgIdlePercent can exceed 1.0 due to KAFKA-7295, making threshold-based alerting unreliable.
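The headroom rule in the list above can be checked with back-of-the-envelope arithmetic. The sketch assumes load is spread evenly across brokers and that handler busy time scales linearly with load, neither of which holds exactly in practice; `idle_after_failure` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# idle_after_failure IDLE_FRACTION BROKER_COUNT
# Estimates RequestHandlerAvgIdlePercent on each surviving broker after one
# broker fails, assuming even load and linear scaling (both simplifications).
idle_after_failure() {
  awk -v idle="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "n/a"; exit }      # no survivors to absorb load
    busy = (1 - idle) * n / (n - 1)       # busy share grows by n/(n-1)
    printf "%.2f\n", 1 - busy
  }'
}
```

On a three-broker cluster running at 0.5 idle, the estimate for the two survivors is 0.25, already below the 0.3 severity line. Larger clusters lose less headroom per failure because the shifted load is split more ways.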
How Netdata helps
- Correlates RequestQueueSize with RequestHandlerAvgIdlePercent on one chart.
- Surfaces per-broker RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs to separate queue, disk, and replication delays.
- Tracks OS-level disk await and page fault rates alongside Kafka metrics.
- Alerts on sustained queue depth relative to queued.max.requests without firing on transient bursts.
- Maps Produce purgatory size and UnderReplicatedPartitions to expose replication lag root causes.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation







