Kafka request queue filling up: RequestQueueSize, queued.max.requests, and backpressure

Your Kafka producers are timing out. Broker logs show no errors, but client-side metrics reveal growing latency and retries. On the broker, RequestQueueSize climbs toward queued.max.requests (default 500). Once the queue fills, network threads block on enqueue and stop reading from their sockets. Clients see TCP backpressure, time out, and retry. Retries add load, deepening the queue. This feedback loop correlates with falling RequestHandlerAvgIdlePercent.

What this means

Kafka brokers use a reactor pattern. Network threads (num.network.threads, default 3) read requests into a bounded queue. I/O handler threads (num.io.threads, default 8) dequeue and process them. The queue capacity is queued.max.requests (default 500).

When I/O threads process slower than network threads read, the queue grows. Sustained elevation above 50% of capacity signals saturation. At the cap, network threads block on enqueue and stop draining their selectors. TCP backpressure propagates to clients.
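As a rough illustration of the 50% and at-the-cap thresholds above, the following sketch classifies one RequestQueueSize reading. The function names are hypothetical and not part of Kafka, and a single reading cannot show that elevation is sustained, so a real check would evaluate several samples over time.

```shell
# Hypothetical helpers; the names are illustrative and not part of Kafka.

# Print queue fill as an integer percentage.
# $1 = RequestQueueSize reading, $2 = queued.max.requests (default 500).
queue_fill_percent() {
  echo $(( $1 * 100 / ${2:-500} ))
}

# Map a single reading onto the thresholds described above.
queue_state() {
  pct=$(queue_fill_percent "$1" "${2:-500}")
  if [ "$pct" -ge 100 ]; then
    echo "full"        # network threads block on enqueue
  elif [ "$pct" -gt 50 ]; then
    echo "elevated"    # saturation signal if it persists
  else
    echo "ok"
  fi
}
```

For example, `queue_state 400` prints `elevated` with the default cap of 500.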

Producers hit request.timeout.ms (default 30s) and retry whenever retries > 0, which is the default since Kafka 2.1 (retries are then bounded by delivery.timeout.ms). The resent batches increase request volume while successful throughput falls.

RequestQueueSize (kafka.network:type=RequestChannel,name=RequestQueueSize) is an instantaneous gauge. Brief spikes are normal; sustained elevation is the signal. RequestHandlerAvgIdlePercent (kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent) measures the same pressure from the I/O thread side: the fraction of time threads are idle. Below 0.3 is severe; below 0.1 is active overload.
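A minimal sketch of how the idle-percent thresholds above might be applied to one reading. The function name is hypothetical, and the input is assumed to be a fraction between 0.0 and 1.0 that has already been read from JMX.

```shell
# Hypothetical helper: classify a RequestHandlerAvgIdlePercent reading.
# $1 = idle fraction between 0.0 and 1.0.
idle_state() {
  awk -v idle="$1" 'BEGIN {
    idle = idle + 0                      # force numeric comparison
    if (idle < 0.1)      print "overload"
    else if (idle < 0.3) print "severe"
    else                 print "ok"
  }'
}
```

Because the metric is an average, a single low reading is weaker evidence than a run of low readings, so treat the output as one input to an alert rather than the alert itself.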

If control.plane.listener.name is not configured, controller requests share this queue. Under heavy load, LeaderAndISR and similar requests are delayed behind data-plane traffic, which compounds metadata staleness.
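One way to check whether a broker has a dedicated control-plane listener is to look for the setting in its configuration file. This is a sketch: the function name is hypothetical and the file path in the usage comment is an assumption about your installation.

```shell
# Hypothetical helper: succeed if the given server.properties sets
# control.plane.listener.name to a non-empty value. Commented-out
# lines do not match because the pattern is anchored at line start.
has_control_plane_listener() {
  grep -Eq '^[[:space:]]*control\.plane\.listener\.name[[:space:]]*=[[:space:]]*[^[:space:]]' "$1"
}

# Usage (the path is an assumption; adjust to your installation):
# if has_control_plane_listener /etc/kafka/server.properties; then
#   echo "dedicated control-plane listener configured"
# else
#   echo "controller requests share the data-plane queue"
# fi
```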

flowchart TD
    A[I/O threads slow] --> B[RequestQueueSize grows]
    B --> C[Queue approaches queued.max.requests]
    C --> D[Network threads block on enqueue]
    D --> E[Client requests time out]
    E --> F[Producers retry]
    F --> G[More requests hit network threads]
    G --> A

Common causes

Cause	What it looks like	First thing to check
I/O thread saturation from disk latency	`RequestQueueTimeMs` and `LocalTimeMs` spike; `RequestHandlerAvgIdlePercent` drops below 0.3	Disk `await` via `iostat -xz 1` on the broker
Producer timeout cascade	`BytesInPerSec` rises while `MessagesInPerSec` stays flat; queue grows on one broker	Per-broker `RequestHandlerAvgIdlePercent` and producer retry metrics
Insufficient I/O threads for peak load	Queue grows predictably during traffic peaks; idle percent hovers near 0.3	`num.io.threads` versus per-broker partition count
Control-plane requests competing for queue space	Queue elevates during controller events without matching traffic spikes	Whether `control.plane.listener.name` is configured
Slow replication holding `acks=all` requests in purgatory	Produce purgatory size grows; `RemoteTimeMs` high for `acks=all`; `UnderReplicatedPartitions` elevated	Follower `FetchFollower` latency and ISR health

Quick checks

# Current request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# I/O thread idle percent (1.0 = fully idle); this MBean is a meter, so read OneMinuteRate
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Produce request queue wait time (p99)
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Disk latency for log directories
iostat -xz 1

# Produce purgatory size
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999

# Approximate connection count to the broker (run as root so ss can show owning PIDs)
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka | head -n 1),"

How to diagnose it

Confirm sustained queue growth. Check RequestQueueSize over 5 minutes. A flat line at 400 is different from a momentary spike to 300.
Correlate with RequestHandlerAvgIdlePercent. Above 0.5 suggests a transient burst; below 0.3 confirms I/O thread saturation.
Decompose request latency. RequestQueueTimeMs high means the queue is the bottleneck. LocalTimeMs high means disk I/O is slow. RemoteTimeMs high means replication lag (for acks=all). ResponseQueueTimeMs high means network threads cannot send responses fast enough.
Check disk I/O and page cache. Run iostat -xz 1. An await above 20ms on SSDs or 50ms on HDDs indicates blocking disk I/O (recent sysstat releases report it as separate r_await and w_await columns). Check /proc/vmstat for pgmajfault rate to detect page cache thrashing.
Detect retry cascades. Compare BytesInPerSec to MessagesInPerSec. If bytes rise while messages stay flat, producers are retrying large batches. Check client-side record-retry-rate if available.
Rule out control-plane interference. If control.plane.listener.name is absent, check whether controller events (ISR changes, leader elections) coincide with queue growth. Controller requests in the shared queue are blocked by data-plane backlog.
Inspect purgatory and replication. Growing produce purgatory with high RemoteTimeMs points to slow followers. Check UnderReplicatedPartitions and FetchFollower latency.
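The latency decomposition step can be sketched as a small helper that names the largest of the four components. The function name is hypothetical, and the four values are assumed to be readings of the same percentile (for example p99) that you have already collected for Produce requests.

```shell
# Hypothetical helper: print the name of the largest latency component.
# Arguments, all in milliseconds:
#   $1 RequestQueueTimeMs  $2 LocalTimeMs  $3 RemoteTimeMs  $4 ResponseQueueTimeMs
dominant_latency() {
  awk -v q="$1" -v l="$2" -v r="$3" -v s="$4" 'BEGIN {
    name = "RequestQueueTimeMs"; max = q + 0
    if (l + 0 > max) { name = "LocalTimeMs";         max = l + 0 }
    if (r + 0 > max) { name = "RemoteTimeMs";        max = r + 0 }
    if (s + 0 > max) { name = "ResponseQueueTimeMs"; max = s + 0 }
    print name
  }'
}
```

For example, `dominant_latency 120 15 8 2` prints `RequestQueueTimeMs`, which points at the request queue rather than disk or replication. Ties resolve in argument order, so treat close values as inconclusive.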

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`RequestQueueSize`	Instantaneous pressure between network and I/O threads	Sustained >50% of `queued.max.requests` (default 500)
`RequestHandlerAvgIdlePercent`	I/O thread saturation; inverse view of queue pressure	Sustained <0.3; active overload <0.1
`RequestQueueTimeMs` (Produce)	Time spent waiting for an I/O thread	p99 exceeds 2-3x baseline for >5 minutes
`NetworkProcessorAvgIdlePercent`	Network thread saturation; can mimic queue symptoms	Sustained <0.3
`BytesInPerSec` vs `MessagesInPerSec`	Detects producer retry cascades	Bytes rise while messages flat or fall
Produce purgatory size	`acks=all` requests stuck waiting for replication	Sustained >2x baseline with growing queue
`LocalTimeMs` (Produce)	Disk write latency inside request handling	p99 spikes correlate with queue growth

Fixes

Disk I/O is the bottleneck

If LocalTimeMs and disk await are elevated, more I/O threads will not help. Additional threads contending on a slow disk worsen latency. Identify whether slowness is hardware degradation, page cache thrashing, or a competing workload. For JBOD, one degraded log directory slows the whole broker because I/O threads are shared. Consider a controlled shutdown to trigger leader migration if a specific disk is degraded.

I/O thread pool is too small

If RequestHandlerAvgIdlePercent is low and disk I/O is healthy, increase num.io.threads in server.properties and perform a rolling restart.
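On Kafka 1.1 and later, num.io.threads is also a dynamically updatable broker config (KIP-226), so the pool can be resized without a restart. The sketch below wraps the kafka-configs.sh call. The function name is hypothetical, KAFKA_HOME and BOOTSTRAP are assumed environment variables with placeholder defaults, and 16 is only an illustrative value.

```shell
# Sketch: resize the I/O handler pool cluster-wide without a restart.
# KAFKA_HOME and BOOTSTRAP are assumptions about your environment.
# $1 = new value for num.io.threads.
set_io_threads() {
  "${KAFKA_HOME:-/opt/kafka}/bin/kafka-configs.sh" \
    --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --entity-type brokers --entity-default \
    --alter --add-config "num.io.threads=$1"
}

# Usage (illustrative value): set_io_threads 16
```

The Kafka documentation describes dynamic thread pool updates as limited to between half and double the current size, so a large increase may need more than one step. Replace `--entity-default` with `--entity-name <broker-id>` to change a single broker.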

Queue capacity is too small for burst traffic

If the queue hits the cap (500 by default) only during predictable bursts and I/O threads recover quickly, raise queued.max.requests in server.properties to absorb bursts without blocking network threads. This is a static setting, so it takes effect after a rolling restart. Do not raise the cap if RequestHandlerAvgIdlePercent is already low; you will only delay overload and increase memory pressure.

Producer retry cascade is feeding the loop

Break the loop by throttling producers via quotas. You can temporarily increase request.timeout.ms to reduce premature retries, but fix the root broker bottleneck immediately or you increase memory pressure from in-flight requests.
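A producer byte-rate quota can be applied per client id with kafka-configs.sh. This is a sketch that assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas. The function name, KAFKA_HOME, BOOTSTRAP, the client id, and the rate in the usage comment are all illustrative.

```shell
# Sketch: cap produce throughput for one client id.
# KAFKA_HOME and BOOTSTRAP are assumptions about your environment.
# $1 = client id, $2 = producer_byte_rate in bytes per second.
throttle_producer() {
  "${KAFKA_HOME:-/opt/kafka}/bin/kafka-configs.sh" \
    --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --entity-type clients --entity-name "$1" \
    --alter --add-config "producer_byte_rate=$2"
}

# Usage (illustrative client id, 10 MiB/s): throttle_producer ingest-service 10485760
```

Once the broker has recovered, the quota can be removed with the same command using `--delete-config producer_byte_rate` in place of `--add-config`.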

Control-plane requests are competing

Configure control.plane.listener.name (KIP-291) to isolate controller traffic. Without this, heavy data-plane traffic blocks LeaderAndISR and UpdateMetadata, causing metadata propagation stalls and additional elections.

Prevention

Monitor queue depth and idle percent together. RequestQueueSize and RequestHandlerAvgIdlePercent are two views of the same bottleneck. Alerting on only one misses bursty-but-healthy traffic or saturated threads with a temporarily short queue.
Maintain I/O thread headroom above 50% idle during peak. Surviving brokers must absorb partitions from a failed peer. If they normally run below 0.5 idle, the shifted load drops them into overload.
Isolate controller traffic with a dedicated listener. Configure control.plane.listener.name so controller requests bypass the data-plane queue.
Set producer byte-rate quotas as circuit breakers. Quotas throttle runaway producers before retries fill the queue.
Validate broker version before thresholding idle percent. On Kafka versions older than 2.1, RequestHandlerAvgIdlePercent can exceed 1.0 due to KAFKA-7295, making threshold-based alerting unreliable.

How Netdata helps

Correlates RequestQueueSize with RequestHandlerAvgIdlePercent on one chart.
Surfaces per-broker RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs to separate queue, disk, and replication delays.
Tracks OS-level disk await and page fault rates alongside Kafka metrics.
Alerts on sustained queue depth relative to queued.max.requests without firing on transient bursts.
Maps Produce purgatory size and UnderReplicatedPartitions to expose replication lag root causes.
