Kafka REQUEST_TIMED_OUT: produce requests that expire before replication completes

Producers using acks=all throw TimeoutException (error code REQUEST_TIMED_OUT) when the broker accepts a produce request but cannot complete replication before request.timeout.ms expires. The leader appends the record to its local log and waits in purgatory for acknowledgments from all in-sync replicas. If the ISR ack does not arrive before the client deadline, the producer disconnects and surfaces the error. The broker may still finish the write, but the producer has already moved on, creating a hidden replication backlog and a potential retry storm.

This timeout is distinct from delivery.timeout.ms exhaustion, which bounds the whole client-side lifetime of a record after send() returns: time queued in the accumulator, time in flight, and retries. (Blocking on metadata fetches inside send() is governed by max.block.ms instead.) REQUEST_TIMED_OUT is a per-request broker timeout. The fix depends on which phase consumed the window: request queueing, local log append, or remote follower acknowledgment.
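To make the relationship between the two timeouts concrete, here is a minimal Python sketch. It relies on the producer documentation's rule that delivery.timeout.ms should be greater than or equal to linger.ms plus request.timeout.ms. The function names are hypothetical, the fixed retry backoff of 100 ms is an assumption, and the attempt count is only a rough upper bound for the worst case where every attempt runs to a full request timeout.

```python
def validate_timeouts(delivery_timeout_ms, request_timeout_ms, linger_ms=0):
    """True when delivery.timeout.ms >= linger.ms + request.timeout.ms."""
    return delivery_timeout_ms >= linger_ms + request_timeout_ms


def max_full_timeouts(delivery_timeout_ms, request_timeout_ms,
                      linger_ms=0, retry_backoff_ms=100):
    """Approximate how many attempts can each run to a full request timeout
    before the delivery window closes.

    Assumptions (not from the source): every attempt times out, the backoff
    between attempts is fixed, and batching/queue time is ignored.
    """
    budget = delivery_timeout_ms - linger_ms
    attempts = 0
    elapsed = 0
    while elapsed + request_timeout_ms <= budget:
        attempts += 1
        elapsed += request_timeout_ms + retry_backoff_ms
    return attempts
```

With a 120 s delivery window and a 30 s request timeout, this gives three full-length attempts, which is why a single slow broker can hold a record for minutes before delivery.timeout.ms finally fails it.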

What this means

With acks=all, the leader appends to its local log, places the request in the produce purgatory timer wheel, and holds the response until every ISR member has replicated the record. Broker produce latency breaks down into five JMX components: RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs.

If the total exceeds request.timeout.ms (default 30 s), the client closes the connection. The broker may still complete the write later, but the producer has already retried or errored. That added load can deepen the cascade.

flowchart TD
    A[Producer REQUEST_TIMED_OUT] --> B{Break down produce latency}
    B -->|RemoteTimeMs high| C[Followers slow to ack]
    B -->|LocalTimeMs high| D[Leader disk slow]
    B -->|RequestQueueTimeMs high| E[Request queue backup]
    C --> F[Check UnderReplicatedPartitions and follower disk or network]
    D --> G[Check disk await and page cache pressure]
    E --> H[Check RequestHandlerAvgIdlePercent and RequestQueueSize]
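The flowchart above can be sketched as a small Python helper. This is an illustration under stated assumptions, not a monitoring tool: the `triage` function and the dictionary layout are hypothetical, and the input values are whatever percentile you read from JMX. Summing per-component p99 values overstates the true total, so treat `exceeds_timeout` as a conservative hint.

```python
# Next check for each latency component, mirroring the flowchart branches.
NEXT_CHECK = {
    "RemoteTimeMs": "UnderReplicatedPartitions and follower disk or network",
    "LocalTimeMs": "disk await and page cache pressure on the leader",
    "RequestQueueTimeMs": "RequestHandlerAvgIdlePercent and RequestQueueSize",
}


def triage(breakdown_ms, request_timeout_ms=30_000):
    """Pick the dominant produce-latency component and the next thing to check.

    breakdown_ms maps JMX component names to milliseconds.
    """
    total = sum(breakdown_ms.values())
    # Only the three components the flowchart branches on are candidates.
    candidates = {k: v for k, v in breakdown_ms.items() if k in NEXT_CHECK}
    dominant = max(candidates, key=candidates.get) if candidates else None
    return {
        "total_ms": total,
        "exceeds_timeout": total > request_timeout_ms,
        "dominant": dominant,
        "next_check": NEXT_CHECK.get(dominant),
    }
```

For example, a breakdown where RemoteTimeMs is 31000 ms and every other component is under 50 ms returns RemoteTimeMs as dominant and points at follower health.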

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Slow follower replication (RemoteTimeMs) | UnderReplicatedPartitions grows; ISR shrinks; one follower shows elevated disk await or network retransmits | RemoteTimeMs p99 and UnderReplicatedPartitions on leaders |
| Slow leader disk (LocalTimeMs) | LocalTimeMs p99 spikes on the leader; OS disk await elevated above baseline | LocalTimeMs p99 and iostat -x on the leader |
| Request queue backup (RequestQueueTimeMs) | RequestQueueTimeMs grows before local or remote time; RequestHandlerAvgIdlePercent drops below 0.3 | RequestQueueSize and RequestHandlerAvgIdlePercent |
| Producer timeout cascade | Producer error rate climbs; broker FailedProduceRequestsPerSec rises; RequestQueueSize grows while useful throughput flatlines | Producer retry metrics and FailedProduceRequestsPerSec |
| Broker GC pause or overload | ISR shrinks correlate with GC events; JVM heap usage high; request threads block | GC logs and java.lang:type=GarbageCollector metrics |

Quick checks

# Produce latency: time waiting for ISR follower acks
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Produce latency: time the leader spends processing the request (local log append; Kafka does not fsync per request by default)
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Produce latency: time stalled waiting for an I/O thread
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# I/O thread idle percent
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Disk latency on the broker
iostat -xz 1

# GC behavior (substitute broker PID)
jstat -gcutil <BROKER_PID> 1000
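If you script the jmxterm checks above, you will need to pull the numeric value out of the tool's text output. The following Python sketch assumes jmxterm prints attribute lines in the form `99thPercentile = 1250.0;`, which matches the output I have seen but should be verified against your jmxterm version. The function name `parse_jmx_value` is hypothetical.

```python
import re

# Assumed jmxterm line format: "<Attribute> = <number>;"
_ATTR_LINE = re.compile(
    r"^\s*(\w+)\s*=\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*;?\s*$"
)


def parse_jmx_value(output, attribute):
    """Return the attribute's value as a float, or None if it is absent."""
    for line in output.splitlines():
        match = _ATTR_LINE.match(line)
        if match and match.group(1) == attribute:
            return float(match.group(2))
    return None
```

Lines that do not look like a numeric attribute assignment, such as the `#mbean = ...` banner, are skipped rather than treated as errors.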

How to diagnose it

  1. Confirm the error pattern is broker-side timeout. Look for TimeoutException / REQUEST_TIMED_OUT in producer logs. Verify the client is hitting request.timeout.ms, not delivery.timeout.ms.
  2. Decompose broker produce latency. Query JMX for RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs on the Produce request type. Identify which component dominates p99.
  3. If RemoteTimeMs is elevated, list under-replicated partitions with kafka-topics.sh --under-replicated-partitions. Cross-reference to find the common lagging follower broker. Inspect that follower’s disk I/O (iostat) and network health (retransmits, packet loss).
  4. If LocalTimeMs is elevated, inspect the leader broker’s disk metrics. Check iostat for high await or %util. Look for page cache thrashing via /proc/vmstat pgmajfault.
  5. If RequestQueueTimeMs is elevated, check RequestHandlerAvgIdlePercent. Sustained values below 0.3 indicate I/O thread saturation. Check RequestQueueSize; sustained values above half of queued.max.requests confirm queue backup.
  6. Check for GC pauses aligned with latency spikes. Use jstat or GC logs. Full GC pauses over several seconds stall follower fetch threads and directly expand RemoteTimeMs.
  7. Detect timeout cascades. If FailedProduceRequestsPerSec rises while BytesInPerSec climbs without a matching rise in MessagesInPerSec, producers are retrying with larger batches and amplifying load.
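Step 7 can be expressed as a comparison between two metric snapshots. This is a minimal Python sketch: the dictionary keys, the function names, and the 10% tolerance are all assumptions chosen for illustration, and a real check should compare against your own baseline.

```python
def pct_change(before, after):
    """Relative change from before to after; infinite when growing from zero."""
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before


def looks_like_cascade(before, after, tolerance=0.10):
    """True when failed produce requests rise while bytes-in grows clearly
    faster than messages-in (the retry amplification pattern)."""
    failed_up = after["failed_produce_per_sec"] > before["failed_produce_per_sec"]
    bytes_growth = pct_change(before["bytes_in_per_sec"], after["bytes_in_per_sec"])
    msgs_growth = pct_change(before["messages_in_per_sec"], after["messages_in_per_sec"])
    return (failed_up
            and bytes_growth > tolerance
            and bytes_growth - msgs_growth > tolerance)
```

Healthy traffic growth raises bytes and messages together and leaves the failure rate flat, so it does not trip the check.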

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| RemoteTimeMs (Produce) | Time spent waiting for follower acks; dominates acks=all latency | p99 approaching replica.lag.time.max.ms (default 30 s) |
| LocalTimeMs (Produce) | Time to write to leader log; proxy for disk health | p99 sustained above 2-3x baseline |
| RequestQueueTimeMs (Produce) | Time waiting for an I/O thread before processing | Sustained growth above baseline; spikes precede saturation |
| UnderReplicatedPartitions | Followers are not keeping up; widens the replication window | Nonzero sustained > 2 min outside of maintenance |
| RequestHandlerAvgIdlePercent | I/O thread saturation headroom; low values mean queue growth | Sustained below 0.3 |
| RequestQueueSize | Pressure between network and I/O threads | Consistently above 250 (half the default queued.max.requests) |
| Produce purgatory size | acks=all requests blocked in the timer wheel | > 2x baseline sustained > 5 min |
| FailedProduceRequestsPerSec | Broker-side count of failed writes; early signal of cascades | Nonzero sustained rate |
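Several of the warning signs above depend on a condition being sustained rather than momentary. A minimal Python sketch of that idea follows; the `sustained` helper and the `(timestamp_seconds, value)` sample layout are assumptions for illustration, not part of any Kafka or Netdata API.

```python
def sustained(samples, predicate, min_duration_s):
    """True when the most recent samples all satisfy predicate and that
    unbroken run spans at least min_duration_s.

    samples: list of (timestamp_seconds, value) tuples sorted by time.
    """
    if not samples:
        return False
    breach_start = None
    # Walk backwards from the newest sample until the condition stops holding.
    for timestamp, value in reversed(samples):
        if not predicate(value):
            break
        breach_start = timestamp
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s
```

For example, `sustained(urp_samples, lambda v: v > 0, 120)` approximates the under-replicated partitions rule, and `sustained(idle_samples, lambda v: v < 0.3, 300)` does the same for handler idle ratio with a five-minute window of your choosing.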

Fixes

Slow follower replication

When RemoteTimeMs is high, correlate UnderReplicatedPartitions across brokers to find the lagging follower. If that follower shows disk degradation, perform a controlled shutdown to trigger clean leader elections and isolate the sick broker. If the issue is a network partition, fix connectivity first. Do not restart additional brokers during recovery; extra controller events worsen queue backup.

Tradeoff: Removing a broker reduces available replicas. Ensure min.insync.replicas can still be met before shutting down.
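Before shutting a broker down, it helps to list which partitions would drop below min.insync.replicas without it. The Python sketch below assumes you have already collected each partition's current ISR (for example from the Isr column of kafka-topics.sh --describe) into a dictionary; the function name and data layout are hypothetical.

```python
def partitions_at_risk(isr_by_partition, broker_id, min_insync_replicas):
    """Return partitions whose ISR would fall below min.insync.replicas
    if broker_id were removed.

    isr_by_partition maps a partition name to the list of broker IDs
    currently in its ISR.
    """
    at_risk = []
    for partition, isr in sorted(isr_by_partition.items()):
        remaining = [b for b in isr if b != broker_id]
        if len(remaining) < min_insync_replicas:
            at_risk.append(partition)
    return at_risk
```

An empty result means acks=all writes should keep succeeding during the shutdown; a non-empty one means those partitions would start rejecting produce requests until the ISR recovers.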

Slow leader disk

When LocalTimeMs is high, inspect the leader’s disk await. If a single log directory on a JBOD volume is degraded, reassign partitions away from that path. If page cache thrashing is the cause, isolate backfill consumers or add RAM. For SSDs, sustained await above 20 ms is a ticket; above 100 ms with visible broker impact warrants immediate attention.

Tradeoff: Restarting a broker to move log directories loses the page cache and causes a cold-start latency spike.

Request queue saturation

If RequestQueueTimeMs is growing and RequestHandlerAvgIdlePercent is below 0.3, the broker cannot process requests fast enough. Increase num.io.threads only if CPU cores are available and the bottleneck is not disk I/O. If producers are overwhelming the broker, apply quotas to throttle ingress. If the root cause is disk or GC, fix that first; adding threads to a slow disk increases contention.

Tradeoff: More threads increase context-switch overhead and memory pressure.
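The ordering of these decisions matters, so here is one way to write it down as a Python sketch. The function and its boolean inputs are hypothetical, and the priority order (root cause first, then threads, then quotas) is my reading of the guidance above rather than a fixed rule.

```python
def next_action(idle_ratio, queue_time_growing, disk_bound, gc_bound, cpu_headroom):
    """Suggest the next step for a suspected request-queue backup."""
    saturated = queue_time_growing and idle_ratio < 0.3
    if not saturated:
        return "no request-queue saturation detected"
    if disk_bound or gc_bound:
        # More threads on a slow disk or a pausing JVM only add contention.
        return "fix disk or GC first"
    if cpu_headroom:
        return "consider raising num.io.threads"
    return "apply producer quotas"
```

The early return for disk or GC problems encodes the main point: thread count is only worth tuning once the underlying resource is healthy.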

Timeout cascade

If producers are retrying in a positive feedback loop, temporarily throttle them via producer quotas to break the spiral. Identify and remediate the original slow broker that triggered the timeouts. Monitor FailedProduceRequestsPerSec to confirm the rate drops after throttling.

Tradeoff: Quotas are enforced per user or client ID, so they throttle every producer that shares the throttled user or client ID, not just the noisy ones.
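As an illustration of the throttling step, the Python sketch below builds a kafka-configs.sh command that sets producer_byte_rate for one client ID. The helper name, the client ID `ingest-batch`, and the bootstrap address are placeholders; review the command before running it against a real cluster.

```python
import shlex


def producer_quota_command(client_id, bytes_per_sec, bootstrap="localhost:9092"):
    """Build the argv for setting a producer byte-rate quota on a client ID."""
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap,
        "--alter",
        "--add-config", f"producer_byte_rate={int(bytes_per_sec)}",
        "--entity-type", "clients",
        "--entity-name", client_id,
    ]


def as_shell(argv):
    """Render argv as a copy-pasteable shell command line."""
    return shlex.join(argv)
```

Calling `as_shell(producer_quota_command("ingest-batch", 5 * 1024 * 1024))` yields a command that caps that client ID at 5 MiB/s. Returning an argv list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.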

GC pauses

If GC pauses correlate with latency spikes, right-size the JVM heap toward 4-8 GB. Oversized heaps cause longer pauses and steal page cache from the OS. Reduce message format down-conversion, which materializes large buffers on-heap. Use G1GC and monitor G1 Old Generation collection time.

Tradeoff: A smaller heap reduces GC pause time but limits metadata caching.
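To line GC activity up with latency spikes, you can turn jstat -gcutil samples into GC time spent per sampling interval. The Python sketch below locates the YGCT and FGCT columns by header name, since the column set varies between JDK versions. The function name is hypothetical and the sample figures in the usage note are made up.

```python
def gc_seconds_per_interval(jstat_output):
    """Return the increase in young + full GC time (seconds) between
    consecutive rows of `jstat -gcutil` output.

    Expects one header row followed by one row per sample.
    """
    rows = [line.split() for line in jstat_output.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    ygct = header.index("YGCT")  # cumulative young-generation GC time
    fgct = header.index("FGCT")  # cumulative full GC time
    totals = [float(row[ygct]) + float(row[fgct]) for row in samples]
    return [round(later - earlier, 3) for earlier, later in zip(totals, totals[1:])]
```

If consecutive samples show cumulative totals of 4.65 s, 4.68 s, and 7.18 s, the result is `[0.03, 2.5]`: the second interval contains a multi-second collection worth matching against RemoteTimeMs and ISR shrink events at the same timestamp.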

Prevention

Monitor the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) rather than only TotalTimeMs. TotalTimeMs can mask whether the bottleneck is disk, replication, or thread saturation. Set ticket-level alerts on RequestHandlerAvgIdlePercent before it drops below 0.3. Correlate UnderReplicatedPartitions with UnderMinIsrPartitionCount to confirm write path impact. Run failure tests to measure how long broker death and leader election take in your cluster; if recovery exceeds your request.timeout.ms budget, reduce partition count per broker or improve disk throughput. Avoid oversized heaps; longer GC pauses directly expand RemoteTimeMs by stalling followers.

How Netdata helps

Netdata collects Kafka JMX metrics (RemoteTimeMs, LocalTimeMs, RequestQueueTimeMs) alongside OS disk latency and page cache pressure. Unified timelines show whether latency is disk, queue, or replication-bound. Track UnderReplicatedPartitions, FailedProduceRequestsPerSec, and RequestHandlerAvgIdlePercent against baselines. Overlay JVM GC pause charts with produce latency to spot heap-related delays without parsing GC logs manually.