Kafka REQUEST_TIMED_OUT: produce requests that expire before replication completes
Producers using acks=all throw TimeoutException (error code REQUEST_TIMED_OUT) when the broker accepts a produce request but cannot complete replication before request.timeout.ms expires. The leader appends the record to its local log and waits in purgatory for acknowledgments from all in-sync replicas. If the ISR ack does not arrive before the client deadline, the producer disconnects and surfaces the error. The broker may still finish the write, but the producer has already moved on, creating a hidden replication backlog and a potential retry storm.
This timeout is distinct from delivery.timeout.ms exhaustion, which bounds the whole client-side lifetime of a record after send() returns: time queued in the accumulator, time awaiting acknowledgment, and retries. REQUEST_TIMED_OUT is a per-request broker timeout. The fix depends on which phase consumed the window: request queueing, local log append, or remote follower acknowledgment.
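The two timeouts are linked on the client: when delivery.timeout.ms is set explicitly, the Java producer expects it to be at least linger.ms + request.timeout.ms. The following is a minimal sketch for sanity-checking a planned configuration before deploying it. The function name check_producer_timeouts is hypothetical, and the attempt count it prints is a rough estimate that ignores retry.backoff.ms and batching time.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check producer timeout settings (all in ms).
# Rule checked: delivery.timeout.ms >= linger.ms + request.timeout.ms.
check_producer_timeouts() {
  delivery_ms=$1
  request_ms=$2
  linger_ms=$3
  floor_ms=$((linger_ms + request_ms))
  if [ "$delivery_ms" -lt "$floor_ms" ]; then
    echo "invalid: delivery.timeout.ms=$delivery_ms is below linger.ms + request.timeout.ms=$floor_ms"
    return 1
  fi
  # Rough upper bound on full-length request attempts inside the delivery
  # window (ignores retry.backoff.ms and time spent batching).
  echo "ok: up to $((delivery_ms / request_ms)) request timeouts fit in the delivery window"
}
```

For example, `check_producer_timeouts 120000 30000 0` checks a 120 s delivery window against a 30 s request timeout with no linger. Substitute the values from your own producer configuration.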
What this means
With acks=all, the leader appends to its local log, places the request in the produce purgatory timer wheel, and blocks until every ISR member acknowledges. Broker produce latency breaks down into five JMX components: RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs.
If the total exceeds request.timeout.ms (default 30 s), the client closes the connection. The broker may still complete the write later, but the producer has already retried or errored. That added load can deepen the cascade.
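To make the budget concrete, the sketch below adds the five component times and reports how much of request.timeout.ms they consume. The function name produce_budget_used is hypothetical. It assumes all six arguments are in milliseconds. Note that adding per-component p99 values overstates the true total, because percentiles are not additive; treat the result as a rough upper bound.

```shell
#!/bin/sh
# Hypothetical helper: how much of request.timeout.ms a produce request used.
# Arguments: RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs,
#            ResponseQueueTimeMs, ResponseSendTimeMs, request.timeout.ms.
produce_budget_used() {
  awk -v q="$1" -v l="$2" -v r="$3" -v rq="$4" -v s="$5" -v t="$6" 'BEGIN {
    total = q + l + r + rq + s
    # %d truncates fractional milliseconds and percentages.
    printf "total=%dms budget=%dms used=%d%%\n", total, t, (total * 100) / t
  }'
}
```

A call such as `produce_budget_used 5 12 21000 1 2 30000` shows a request dominated by follower acknowledgment that has already used most of a 30 s window.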
flowchart TD
A[Producer REQUEST_TIMED_OUT] --> B{Break down produce latency}
B -->|RemoteTimeMs high| C[Followers slow to ack]
B -->|LocalTimeMs high| D[Leader disk slow]
B -->|RequestQueueTimeMs high| E[Request queue backup]
C --> F[Check UnderReplicatedPartitions and follower disk or network]
D --> G[Check disk await and page cache pressure]
E --> H[Check RequestHandlerAvgIdlePercent and RequestQueueSize]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow follower replication (RemoteTimeMs) | UnderReplicatedPartitions grows; ISR shrinks; one follower shows elevated disk await or network retransmits | RemoteTimeMs p99 and UnderReplicatedPartitions on leaders |
| Slow leader disk (LocalTimeMs) | LocalTimeMs p99 spikes on the leader; OS disk await elevated above baseline | LocalTimeMs p99 and iostat -x on the leader |
| Request queue backup (RequestQueueTimeMs) | RequestQueueTimeMs grows before local or remote time; RequestHandlerAvgIdlePercent drops below 0.3 | RequestQueueSize and RequestHandlerAvgIdlePercent |
| Producer timeout cascade | Producer error rate climbs; broker FailedProduceRequestsPerSec rises; RequestQueueSize grows while useful throughput flatlines | Producer retry metrics and FailedProduceRequestsPerSec |
| Broker GC pause or overload | ISR shrinks correlate with GC events; JVM heap usage high; request threads block | GC logs and java.lang:type=GarbageCollector metrics |
Quick checks
# Produce latency: time waiting for ISR follower acks
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Produce latency: time for the leader to append to its local log
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Produce latency: time stalled waiting for an I/O thread
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# List under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# I/O thread idle percent
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Disk latency on the broker
iostat -xz 1
# GC behavior (substitute broker PID)
jstat -gcutil <BROKER_PID> 1000
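When scripting these checks, it helps to reduce the jmxterm output to a bare number. The sketch below assumes jmxterm prints attribute lines shaped like `99thPercentile = 123.4;` and skips everything else, such as the `#mbean = ...` header. The function name jmx_value is hypothetical, and the pattern does not handle values in scientific notation.

```shell
#!/bin/sh
# Hypothetical helper: pull the numeric value out of a jmxterm "get" line.
# Assumes attribute lines shaped like "99thPercentile = 123.4;".
# Lines that do not match (for example "#mbean = ...") are dropped.
jmx_value() {
  sed -n 's/^[[:space:]]*[A-Za-z0-9]*[[:space:]]*=[[:space:]]*\([0-9.][0-9.]*\);*[[:space:]]*$/\1/p'
}
```

Append `| jmx_value` to any of the jmxterm pipelines above to get a value that can be compared against a threshold.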
How to diagnose it
- Confirm the error pattern is a broker-side timeout. Look for TimeoutException / REQUEST_TIMED_OUT in producer logs. Verify the client is hitting request.timeout.ms, not delivery.timeout.ms.
- Decompose broker produce latency. Query JMX for RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs on the Produce request type. Identify which component dominates p99.
- If RemoteTimeMs is elevated, list under-replicated partitions with kafka-topics.sh --under-replicated-partitions. Cross-reference to find the common lagging follower broker. Inspect that follower’s disk I/O (iostat) and network health (retransmits, packet loss).
- If LocalTimeMs is elevated, inspect the leader broker’s disk metrics. Check iostat for high await or %util. Look for page cache thrashing via /proc/vmstat pgmajfault.
- If RequestQueueTimeMs is elevated, check RequestHandlerAvgIdlePercent. Sustained values below 0.3 indicate I/O thread saturation. Check RequestQueueSize; sustained values above half of queued.max.requests confirm queue backup.
- Check for GC pauses aligned with latency spikes. Use jstat or GC logs. Full GC pauses over several seconds stall follower fetch threads and directly expand RemoteTimeMs.
- Detect timeout cascades. If FailedProduceRequestsPerSec rises while BytesInPerSec climbs without a matching rise in MessagesInPerSec, producers are retrying with larger batches and amplifying load.
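The decomposition step above can be sketched as a small triage helper that names the dominant component and the next check. The function name dominant_component is hypothetical. It assumes the three p99 values have already been collected in milliseconds, and it breaks ties in favor of replication, then disk.

```shell
#!/bin/sh
# Hypothetical helper: name the dominant produce latency component and the
# next check, following the triage order described above.
# Arguments: RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs (p99, in ms).
dominant_component() {
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    if (r >= l && r >= q)
      print "RemoteTimeMs: check under-replicated partitions and the lagging follower"
    else if (l >= q)
      print "LocalTimeMs: check leader disk await and page cache pressure"
    else
      print "RequestQueueTimeMs: check RequestHandlerAvgIdlePercent and RequestQueueSize"
  }'
}
```

For instance, `dominant_component 4 9 850` points at follower replication, while `dominant_component 300 20 15` points at the request queue.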
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| RemoteTimeMs (Produce) | Time spent waiting for follower acks; dominates acks=all latency | p99 approaching replica.lag.time.max.ms (default 30 s) |
| LocalTimeMs (Produce) | Time to write to leader log; proxy for disk health | p99 sustained above 2-3x baseline |
| RequestQueueTimeMs (Produce) | Time waiting for an I/O thread before processing | Sustained growth above baseline; spikes precede saturation |
| UnderReplicatedPartitions | Followers are not keeping up; widens the replication window | Nonzero sustained > 2 min outside of maintenance |
| RequestHandlerAvgIdlePercent | I/O thread saturation headroom; low values mean queue growth | Sustained below 0.3 |
| RequestQueueSize | Pressure between network and I/O threads | Consistently above 250 (half the default queued.max.requests) |
| Produce purgatory size | acks=all requests blocked in the timer wheel | > 2x baseline sustained > 5 min |
| FailedProduceRequestsPerSec | Broker-side count of failed writes; early signal of cascades | Nonzero sustained rate |
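Two of the thresholds in the table above combine naturally into a single saturation check. The following is a sketch only: the function name saturation_check is hypothetical, the 0.3 and half-of-queue cutoffs come from the table, and the third argument defaults to 500 on the assumption that queued.max.requests has not been changed. It evaluates a single sample, so "sustained" still has to be judged over time.

```shell
#!/bin/sh
# Hypothetical helper: flag request-handler saturation from one sample.
# Arguments: RequestHandlerAvgIdlePercent (0..1), RequestQueueSize,
#            queued.max.requests (optional, assumed default 500).
saturation_check() {
  awk -v idle="$1" -v depth="$2" -v qmax="${3:-500}" 'BEGIN {
    low_idle = (idle < 0.3)           # I/O threads have little headroom
    deep_queue = (depth > qmax / 2)   # queue is more than half full
    if (low_idle && deep_queue) print "saturated"
    else if (low_idle || deep_queue) print "warning"
    else print "ok"
  }'
}
```

Feeding it the values from the JMX queries, for example `saturation_check 0.12 310`, gives a quick read on whether both signals agree.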
Fixes
Slow follower replication
When RemoteTimeMs is high, correlate UnderReplicatedPartitions across brokers to find the lagging follower. If that follower shows disk degradation, perform a controlled shutdown to trigger clean leader elections and isolate the sick broker. If the issue is a network partition, fix connectivity first. Do not restart additional brokers during recovery; extra controller events worsen queue backup.
Tradeoff: Removing a broker reduces available replicas. Ensure min.insync.replicas can still be met before shutting down.
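The tradeoff check reduces to simple arithmetic per partition. The sketch below assumes the broker being stopped is currently counted in the ISR; if it has already dropped out, shutting it down does not shrink the ISR further. The function name isr_survives_shutdown is hypothetical, and the ISR size would come from the kafka-topics.sh --describe output.

```shell
#!/bin/sh
# Hypothetical helper: would a partition still satisfy min.insync.replicas
# after one in-sync broker is shut down?
# Arguments: current ISR size, min.insync.replicas.
isr_survives_shutdown() {
  isr_size=$1
  min_isr=$2
  if [ $((isr_size - 1)) -ge "$min_isr" ]; then
    echo "safe"
  else
    echo "unsafe"
    return 1
  fi
}
```

With three in-sync replicas and min.insync.replicas=2, `isr_survives_shutdown 3 2` reports safe; with only two in sync, acks=all writes would start failing after the shutdown.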
Slow leader disk
When LocalTimeMs is high, inspect the leader’s disk await. If a single log directory on a JBOD volume is degraded, reassign partitions away from that path. If page cache thrashing is the cause, isolate backfill consumers or add RAM. For SSDs, sustained await above 20 ms is a ticket; above 100 ms with visible broker impact warrants immediate attention.
Tradeoff: Restarting a broker to move log directories loses the page cache and causes a cold-start latency spike.
Request queue saturation
If RequestQueueTimeMs is growing and RequestHandlerAvgIdlePercent is below 0.3, the broker cannot process requests fast enough. Increase num.io.threads only if CPU cores are available and the bottleneck is not disk I/O. If producers are overwhelming the broker, apply quotas to throttle ingress. If the root cause is disk or GC, fix that first; adding threads to a slow disk increases contention.
Tradeoff: More threads increase context-switch overhead and memory pressure.
Timeout cascade
If producers are retrying in a positive feedback loop, temporarily throttle them via producer quotas to break the spiral. Identify and remediate the original slow broker that triggered the timeouts. Monitor FailedProduceRequestsPerSec to confirm the rate drops after throttling.
Tradeoff: Quotas are enforced per user or client ID, so throttling reduces throughput for every producer that shares the throttled identity, not just the noisy ones.
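As an illustration of applying a producer quota, the sketch below prints a kafka-configs.sh command for review rather than running it. The function name producer_quota_cmd, the client ID, and the byte rate are hypothetical. It assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas and that producers set a distinct client.id.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a kafka-configs.sh command that caps
# produce throughput for one client ID. Review it, then run it yourself.
# Arguments: client ID, bytes per second, bootstrap server (optional).
producer_quota_cmd() {
  client_id=$1
  byte_rate=$2
  bootstrap=${3:-localhost:9092}
  printf "kafka-configs.sh --bootstrap-server %s --alter --add-config 'producer_byte_rate=%s' --entity-type clients --entity-name %s\n" \
    "$bootstrap" "$byte_rate" "$client_id"
}
```

Calling `producer_quota_cmd ingest-app 1048576` prints a command that would limit the hypothetical client ingest-app to 1 MiB/s per broker. Remove the quota once the slow broker has been remediated.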
GC pauses
If GC pauses correlate with latency spikes, right-size the JVM heap toward 4-8 GB. Oversized heaps cause longer pauses and steal page cache from the OS. Reduce message format down-conversion, which materializes large buffers on-heap. Use G1GC and monitor G1 Old Generation collection time.
Tradeoff: A smaller heap reduces GC pause time but limits metadata caching.
Prevention
Monitor the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) rather than only TotalTimeMs. TotalTimeMs can mask whether the bottleneck is disk, replication, or thread saturation. Set ticket-level alerts on RequestHandlerAvgIdlePercent before it drops below 0.3. Correlate UnderReplicatedPartitions with UnderMinIsrPartitionCount to confirm write path impact. Run failure tests to measure how long broker death and leader election take in your cluster; if recovery exceeds your request.timeout.ms budget, reduce partition count per broker or improve disk throughput. Avoid oversized heaps; longer GC pauses directly expand RemoteTimeMs by stalling followers.
How Netdata helps
Netdata collects Kafka JMX metrics (RemoteTimeMs, LocalTimeMs, RequestQueueTimeMs) alongside OS disk latency and page cache pressure. Unified timelines show whether latency is disk, queue, or replication-bound. Track UnderReplicatedPartitions, FailedProduceRequestsPerSec, and RequestHandlerAvgIdlePercent against baselines. Overlay JVM GC pause charts with produce latency to spot heap-related delays without parsing GC logs manually.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind







