Kafka request queue filling up: RequestQueueSize, queued.max.requests, and backpressure

Your Kafka producers are timing out. Broker logs show no errors, but client-side metrics reveal growing latency and retries. On the broker, RequestQueueSize climbs toward queued.max.requests (default 500). Once the queue fills, network threads block on enqueue and stop reading from their sockets. Clients see TCP backpressure, time out, and retry. Retries add load, deepening the queue. This feedback loop correlates with falling RequestHandlerAvgIdlePercent.

What this means

Kafka brokers use a reactor pattern. Network threads (num.network.threads, default 3) read requests into a bounded queue. I/O handler threads (num.io.threads, default 8) dequeue and process them. The queue capacity is queued.max.requests (default 500).

When I/O threads process slower than network threads read, the queue grows. Sustained elevation above 50% of capacity signals saturation. At the cap, network threads block on enqueue and stop draining their selectors. TCP backpressure propagates to clients.

Producers hit request.timeout.ms (default 30s) and retry if configured. With retries > 0, duplicate batches increase request volume while successful throughput falls.

RequestQueueSize (kafka.network:type=RequestChannel,name=RequestQueueSize) is an instantaneous gauge. Brief spikes are normal; sustained elevation is the signal. RequestHandlerAvgIdlePercent (kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent) measures the same pressure from the I/O thread side: the fraction of time threads are idle. Below 0.3 is severe; below 0.1 is active overload.
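
The thresholds above can be combined into one coarse label per sample. This is a minimal sketch, not part of Kafka or any monitoring tool: the function name `classify_pressure` and the label strings are hypothetical, and the cut-offs are the ones quoted in this article. Because a single sample says little, apply it over a window before acting on the result.

```python
# Thresholds quoted in the article; tune them for your own cluster.
QUEUE_WARN_FRACTION = 0.5   # queue depth above 50% of queued.max.requests
IDLE_SEVERE = 0.3           # idle fraction below this is severe
IDLE_OVERLOAD = 0.1         # idle fraction below this is active overload


def classify_pressure(queue_size: int, idle_fraction: float,
                      queue_capacity: int = 500) -> str:
    """Map one sample of RequestQueueSize and idle fraction to a label.

    The labels are illustrative (hypothetical), not Kafka terminology.
    """
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    queue_fraction = queue_size / queue_capacity
    # Idle fraction is checked first: saturated I/O threads matter more
    # than a queue that is merely elevated.
    if idle_fraction < IDLE_OVERLOAD:
        return "overload"
    if idle_fraction < IDLE_SEVERE:
        return "severe"
    if queue_fraction > QUEUE_WARN_FRACTION:
        return "queue-elevated"
    return "ok"


if __name__ == "__main__":
    print(classify_pressure(queue_size=450, idle_fraction=0.2))
```

A deep queue with plenty of idle time ("queue-elevated") usually points to a burst, while "severe" or "overload" means the I/O threads themselves are the constraint.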

If control.plane.listener.name is not configured, controller requests share this queue. Under heavy load, LeaderAndIsr and similar requests are delayed, compounding metadata staleness.

flowchart TD
    A[I/O threads slow] --> B[RequestQueueSize grows]
    B --> C[Queue approaches queued.max.requests]
    C --> D[Network threads block on enqueue]
    D --> E[Client requests time out]
    E --> F[Producers retry]
    F --> G[More requests hit network threads]
    G --> A
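
To see why the loop feeds itself, here is a deliberately crude toy model. It is an assumption-heavy sketch, not a model of Kafka internals: time advances in discrete ticks, every request that cannot enter the bounded queue is treated as a client timeout, and each timeout comes back as one retry on the next tick. The function name `simulate_retry_loop` and all parameters are hypothetical.

```python
def simulate_retry_loop(base_arrivals: int, service_rate: int,
                        capacity: int = 500, ticks: int = 20):
    """Return a list of (queue_depth, offered_load) tuples, one per tick.

    Toy assumptions: requests that find the queue full are counted as
    client-side timeouts and are all retried on the next tick.
    """
    queue = 0
    retries = 0
    history = []
    for _ in range(ticks):
        offered = base_arrivals + retries      # new work plus retried work
        admitted = min(offered, capacity - queue)
        rejected = offered - admitted          # stand-in for timed-out requests
        queue += admitted
        queue -= min(queue, service_rate)      # I/O threads drain the queue
        retries = rejected                     # timeouts return as retries
        history.append((queue, offered))
    return history


if __name__ == "__main__":
    for depth, offered in simulate_retry_loop(base_arrivals=150, service_rate=100):
        print(f"queue={depth:4d} offered={offered:4d}")
```

When the service rate exceeds arrivals, the queue drains every tick and offered load stays at the base rate. When arrivals exceed the service rate, the queue pins near capacity and offered load keeps climbing even though the base arrival rate never changed, which is the cycle the flowchart describes.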

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| I/O thread saturation from disk latency | RequestQueueTimeMs and LocalTimeMs spike; RequestHandlerAvgIdlePercent drops below 0.3 | Disk await via iostat -xz 1 on the broker |
| Producer timeout cascade | BytesInPerSec rises while MessagesInPerSec stays flat; queue grows on one broker | Per-broker RequestHandlerAvgIdlePercent and producer retry metrics |
| Insufficient I/O threads for peak load | Queue grows predictably during traffic peaks; idle percent hovers near 0.3 | num.io.threads versus per-broker partition count |
| Control-plane requests competing for queue space | Queue elevates during controller events without matching traffic spikes | Whether control.plane.listener.name is configured |
| Slow replication blocking I/O threads in purgatory | Produce purgatory size grows; RemoteTimeMs high for acks=all; UnderReplicatedPartitions elevated | Follower FetchFollower latency and ISR health |

Quick checks

# Current request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# I/O thread idle fraction (1.0 = fully idle); this MBean is a meter, so read OneMinuteRate
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Produce request queue wait time (p99)
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Disk latency for log directories
iostat -xz 1

# Produce purgatory size
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999

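# Approximate connection count to the broker (run as root so ss can show process owners)
# Matching "pid=<PID>," avoids counting unrelated lines that merely contain the PID digits
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka | head -n 1),"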

How to diagnose it

  1. Confirm sustained queue growth. Check RequestQueueSize over 5 minutes. A flat line at 400 is different from a momentary spike to 300.
  2. Correlate with RequestHandlerAvgIdlePercent. Above 0.5 suggests a transient burst; below 0.3 confirms I/O thread saturation.
  3. Decompose request latency. RequestQueueTimeMs high means the queue is the bottleneck. LocalTimeMs high means disk I/O is slow. RemoteTimeMs high means replication lag (for acks=all). ResponseQueueTimeMs high means network threads cannot send responses fast enough.
  4. Check disk I/O and page cache. Run iostat -xz 1. await (reported as r_await and w_await by current sysstat releases) above 20ms on SSDs or 50ms on HDDs indicates blocking disk I/O. Watch how fast pgmajfault in /proc/vmstat increases to detect page cache thrashing.
  5. Detect retry cascades. Compare BytesInPerSec to MessagesInPerSec. If bytes rise while messages stay flat, producers are retrying large batches. Check client-side record-retry-rate if available.
  6. Rule out control-plane interference. If control.plane.listener.name is absent, check whether controller events (ISR changes, leader elections) coincide with queue growth. Controller requests in the shared queue are blocked by data-plane backlog.
  7. Inspect purgatory and replication. Growing produce purgatory with high RemoteTimeMs points to slow followers. Check UnderReplicatedPartitions and FetchFollower latency.
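
Step 3 can be sketched as a small helper that names the largest latency component and the reading this article attaches to it. The function name `dominant_latency` is hypothetical, and picking the single largest value is a simplification: treat the result as a starting point, since a slow disk (LocalTimeMs) often shows up as queue time as well.

```python
def dominant_latency(queue_ms: float, local_ms: float,
                     remote_ms: float, response_queue_ms: float):
    """Return (metric_name, interpretation) for the largest component.

    Inputs are the same percentile (for example p99) of each RequestMetrics
    component for one request type, such as Produce.
    """
    components = {
        "RequestQueueTimeMs": (queue_ms, "requests wait for a free I/O thread"),
        "LocalTimeMs": (local_ms, "local processing, typically disk I/O, is slow"),
        "RemoteTimeMs": (remote_ms, "waiting on followers (acks=all replication)"),
        "ResponseQueueTimeMs": (response_queue_ms,
                                "network threads are slow to send responses"),
    }
    # Pick the component with the highest latency value.
    name = max(components, key=lambda key: components[key][0])
    return name, components[name][1]


if __name__ == "__main__":
    metric, meaning = dominant_latency(120.0, 15.0, 8.0, 2.0)
    print(f"{metric}: {meaning}")
```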

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| RequestQueueSize | Instantaneous pressure between network and I/O threads | Sustained >50% of queued.max.requests (default 500) |
| RequestHandlerAvgIdlePercent | I/O thread saturation; inverse view of queue pressure | Sustained <0.3; active overload <0.1 |
| RequestQueueTimeMs (Produce) | Time spent waiting for an I/O thread | p99 exceeds 2-3x baseline for >5 minutes |
| NetworkProcessorAvgIdlePercent | Network thread saturation; can mimic queue symptoms | Sustained <0.3 |
| BytesInPerSec vs MessagesInPerSec | Detects producer retry cascades | Bytes rise while messages flat or fall |
| Produce purgatory size | acks=all requests stuck waiting for replication | Sustained >2x baseline with growing queue |
| LocalTimeMs (Produce) | Disk write latency inside request handling | p99 spikes correlate with queue growth |
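
Most of these warning signs hinge on the word "sustained". One simple way to express that, sketched here with a hypothetical helper named `sustained_breach`, is to require a run of consecutive samples above the threshold before alerting. The sampling interval is an assumption you supply through `min_consecutive`.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True if at least min_consecutive samples in a row exceed threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Queue depth sampled every 30 seconds; 250 is 50% of the default cap.
    depths = [120, 260, 310, 330, 340, 360, 380, 400, 410, 430, 440]
    print(sustained_breach(depths, threshold=250, min_consecutive=10))
```

With one sample every 30 seconds, `min_consecutive=10` approximates a five-minute window. For metrics where low values are the problem, such as idle fraction, negate both the samples and the threshold or write the mirrored comparison.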

Fixes

Disk I/O is the bottleneck

If LocalTimeMs and disk await are elevated, more I/O threads will not help. Additional threads contending on a slow disk worsen latency. Identify whether slowness is hardware degradation, page cache thrashing, or a competing workload. For JBOD, one degraded log directory slows the whole broker because I/O threads are shared. Consider a controlled shutdown to trigger leader migration if a specific disk is degraded.

I/O thread pool is too small

If RequestHandlerAvgIdlePercent is low and disk I/O is healthy, increase num.io.threads in server.properties and perform a rolling restart.

Queue capacity is too small for burst traffic

If the queue hits 500 only during predictable bursts and I/O threads recover quickly, raise queued.max.requests in server.properties to absorb bursts without blocking network threads. Do not raise the cap if RequestHandlerAvgIdlePercent is already low; you will only delay overload and increase memory pressure.
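
A back-of-the-envelope calculation shows why a larger cap only buys time. The sketch below assumes constant arrival and service rates in requests per second, which real traffic never has; the function name `seconds_until_full` is hypothetical.

```python
from typing import Optional


def seconds_until_full(capacity: int, current_depth: int,
                       arrival_rate: float, service_rate: float) -> Optional[float]:
    """Seconds until the queue reaches capacity, or None if it never fills.

    Rates are requests per second and are assumed constant for the estimate.
    """
    if current_depth >= capacity:
        return 0.0
    net_growth = arrival_rate - service_rate
    if net_growth <= 0:
        return None  # I/O threads keep up, so the queue does not grow
    return (capacity - current_depth) / net_growth


if __name__ == "__main__":
    for cap in (500, 1000):
        print(cap, seconds_until_full(cap, 0, arrival_rate=1200, service_rate=1000))
```

With arrivals 200 requests per second above the service rate, the default cap of 500 fills in 2.5 seconds and a cap of 1000 fills in 5 seconds. That helps if the burst is shorter than that window and does nothing if the overload persists.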

Producer retry cascade is feeding the loop

Break the loop by throttling producers with client quotas. You can temporarily raise request.timeout.ms on the producers to reduce premature retries, but fix the underlying broker bottleneck promptly: longer timeouts keep more requests in flight and add memory pressure.

Control-plane requests are competing

Configure control.plane.listener.name (KIP-291) to isolate controller traffic. Without this, heavy data-plane traffic blocks LeaderAndIsr and UpdateMetadata requests, causing metadata propagation stalls and additional elections.

Prevention

  • Monitor queue depth and idle percent together. RequestQueueSize and RequestHandlerAvgIdlePercent are two views of the same bottleneck. Alerting on only one misses bursty-but-healthy traffic or saturated threads with a temporarily short queue.
  • Maintain I/O thread headroom above 50% idle during peak. Surviving brokers must absorb partitions from a failed peer. If they normally run below 0.5 idle, the shifted load drops them into overload.
  • Isolate controller traffic with a dedicated listener. Configure control.plane.listener.name so controller requests bypass the data-plane queue.
  • Set producer byte-rate quotas as circuit breakers. Quotas throttle runaway producers before retries fill the queue.
  • Validate broker version before thresholding idle percent. On Kafka versions older than 2.1, RequestHandlerAvgIdlePercent can exceed 1.0 due to KAFKA-7295, making threshold-based alerting unreliable.
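
The headroom rule can be checked with simple arithmetic. This sketch assumes load is spread evenly across brokers and that busy time scales linearly with load, neither of which holds exactly in practice; the function name `idle_after_failover` is hypothetical.

```python
def idle_after_failover(idle_fraction: float, brokers: int, failed: int = 1) -> float:
    """Estimate the idle fraction on each surviving broker after a failure.

    Assumes even load distribution and busy time proportional to load.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    if failed < 0 or failed >= brokers:
        raise ValueError("failed must be >= 0 and smaller than brokers")
    busy = 1.0 - idle_fraction
    # Survivors split the failed brokers' share of the work.
    busy_after = busy * brokers / (brokers - failed)
    return max(0.0, 1.0 - busy_after)


if __name__ == "__main__":
    print(idle_after_failover(idle_fraction=0.5, brokers=3))
```

Under these assumptions, a three-broker cluster running at 0.5 idle drops to 0.25 idle on each survivor when one broker fails, which is already below the 0.3 threshold this article treats as severe.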

How Netdata helps

  • Correlates RequestQueueSize with RequestHandlerAvgIdlePercent on one chart.
  • Surfaces per-broker RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs to separate queue, disk, and replication delays.
  • Tracks OS-level disk await and page fault rates alongside Kafka metrics.
  • Alerts on sustained queue depth relative to queued.max.requests without firing on transient bursts.
  • Maps Produce purgatory size and UnderReplicatedPartitions to expose replication lag root causes.