Kafka REQUEST_TIMED_OUT: produce requests that expire before replication completes

Producers using acks=all throw TimeoutException (error code REQUEST_TIMED_OUT) when the broker accepts a produce request but cannot complete replication before request.timeout.ms expires. The leader appends the record to its local log and waits in purgatory for acknowledgments from all in-sync replicas. If the ISR ack does not arrive before the client deadline, the producer disconnects and surfaces the error. The broker may still finish the write, but the producer has already moved on, creating a hidden replication backlog and a potential retry storm.

This timeout is distinct from delivery.timeout.ms exhaustion, which covers the entire client-side retry loop including metadata fetches and queuing. REQUEST_TIMED_OUT is a per-request broker timeout. The fix depends on which phase consumed the window: request queueing, local log append, or remote follower acknowledgment.
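The relationship between the two timeouts can be sanity-checked with a short sketch. It assumes the producer rule that delivery.timeout.ms must be at least linger.ms + request.timeout.ms. The helper name check_timeout_budget is hypothetical, and the values passed in are illustrative rather than read from a live client.

```shell
# Hypothetical helper (not part of Kafka): sanity-checks producer timeout settings.
# $1 = request.timeout.ms, $2 = linger.ms, $3 = delivery.timeout.ms (all in ms)
check_timeout_budget() {
  if [ "$3" -lt "$(( $1 + $2 ))" ]; then
    echo "invalid: delivery.timeout.ms=$3 is below linger.ms + request.timeout.ms=$(( $1 + $2 ))"
  else
    # Rough count of back-to-back request windows that fit in the delivery budget.
    # Ignores retry backoff and time spent queued inside the producer.
    echo "ok: up to $(( $3 / $1 )) request windows fit in the delivery budget"
  fi
}

check_timeout_budget 30000 0 120000
```

A single REQUEST_TIMED_OUT consumes one request window. The send only fails for good once the whole delivery budget is spent.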

What this means

With acks=all, the leader appends to its local log, places the request in the produce purgatory timer wheel, and blocks until every ISR member acknowledges. Broker produce latency breaks down into five JMX components: RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs.

If the total exceeds request.timeout.ms (default 30 s), the client closes the connection. The broker may still complete the write later, but by then the producer has already retried or failed the send. Those retries add load and can start or deepen a timeout cascade.
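As a rough illustration of that budget, the sketch below adds the five components and compares the sum with request.timeout.ms. The function name produce_latency_budget is hypothetical and the sample numbers are made up. The arithmetic is exact for a single request but only approximate for percentiles, because p99 values of the components do not sum to the p99 of the total.

```shell
# Hypothetical helper: adds the five produce latency components (ms) and
# compares the sum with request.timeout.ms.
# $1 = request.timeout.ms, $2 = RequestQueueTimeMs, $3 = LocalTimeMs,
# $4 = RemoteTimeMs, $5 = ResponseQueueTimeMs, $6 = ResponseSendTimeMs
produce_latency_budget() {
  awk -v t="$1" -v q="$2" -v l="$3" -v r="$4" -v rq="$5" -v rs="$6" 'BEGIN {
    total = q + l + r + rq + rs
    verdict = (total > t) ? "exceeds" : "within"
    printf "total=%.0fms %s request.timeout.ms=%.0fms\n", total, verdict, t
  }'
}

produce_latency_budget 30000 120 40 31000 5 3
```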

flowchart TD
    A[Producer REQUEST_TIMED_OUT] --> B{Break down produce latency}
    B -->|RemoteTimeMs high| C[Followers slow to ack]
    B -->|LocalTimeMs high| D[Leader disk slow]
    B -->|RequestQueueTimeMs high| E[Request queue backup]
    C --> F[Check UnderReplicatedPartitions and follower disk or network]
    D --> G[Check disk await and page cache pressure]
    E --> H[Check RequestHandlerAvgIdlePercent and RequestQueueSize]
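
The branching in the flowchart can be expressed as a small triage function. This is a minimal sketch, assuming the three values were collected over the same window and in the same unit. The function name triage_produce_latency is hypothetical, and the rule that ties go to RemoteTimeMs is an arbitrary choice.

```shell
# Hypothetical helper mirroring the flowchart: picks the largest of the three
# components and prints the matching next check.
# $1 = RequestQueueTimeMs, $2 = LocalTimeMs, $3 = RemoteTimeMs
triage_produce_latency() {
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    if (r >= l && r >= q)
      print "RemoteTimeMs: check UnderReplicatedPartitions and follower disk or network"
    else if (l >= q)
      print "LocalTimeMs: check disk await and page cache pressure"
    else
      print "RequestQueueTimeMs: check RequestHandlerAvgIdlePercent and RequestQueueSize"
  }'
}

triage_produce_latency 5 10 200
```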

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Slow follower replication (`RemoteTimeMs`) | `UnderReplicatedPartitions` grows; ISR shrinks; one follower shows elevated disk `await` or network retransmits | `RemoteTimeMs` p99 and `UnderReplicatedPartitions` on leaders |
| Slow leader disk (`LocalTimeMs`) | `LocalTimeMs` p99 spikes on the leader; OS disk `await` elevated above baseline | `LocalTimeMs` p99 and `iostat -x` on the leader |
| Request queue backup (`RequestQueueTimeMs`) | `RequestQueueTimeMs` grows before local or remote time; `RequestHandlerAvgIdlePercent` drops below 0.3 | `RequestQueueSize` and `RequestHandlerAvgIdlePercent` |
| Producer timeout cascade | Producer error rate climbs; broker `FailedProduceRequestsPerSec` rises; `RequestQueueSize` grows while useful throughput flatlines | Producer retry metrics and `FailedProduceRequestsPerSec` |
| Broker GC pause or overload | ISR shrinks correlate with GC events; JVM heap usage high; request threads block | GC logs and `java.lang:type=GarbageCollector` metrics |

Quick checks

# Produce latency: time waiting for ISR follower acks
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Produce latency: time for the leader to append to its local log
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Produce latency: time stalled waiting for an I/O thread
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# I/O thread idle percent
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Disk latency on the broker
iostat -xz 1

# GC behavior (substitute broker PID)
jstat -gcutil <BROKER_PID> 1000
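
The three latency queries above differ only in the component name, so they can be generated by one small function. This is a sketch, not a jmxterm feature: produce_metric_query is a hypothetical name, and the function only prints the query text without contacting a broker.

```shell
# Hypothetical wrapper around the jmxterm "get" lines shown above.
# $1 = component name (for example RemoteTimeMs)
# $2 = MBean attribute (defaults to 99thPercentile)
produce_metric_query() {
  echo "get -b kafka.network:type=RequestMetrics,name=$1,request=Produce ${2:-99thPercentile}"
}

# Print the three queries used when decomposing produce latency.
for component in RequestQueueTimeMs LocalTimeMs RemoteTimeMs; do
  produce_metric_query "$component"
done
```

To run one query, pipe the output into jmxterm as in the commands above, for example `produce_metric_query RemoteTimeMs | java -jar jmxterm.jar -l localhost:9999`.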

How to diagnose it

Confirm the error pattern is a broker-side timeout. Look for TimeoutException / REQUEST_TIMED_OUT in producer logs. Verify the client is hitting request.timeout.ms, not delivery.timeout.ms.
Decompose broker produce latency. Query JMX for RequestQueueTimeMs, LocalTimeMs, and RemoteTimeMs on the Produce request type. Identify which component dominates p99.
If RemoteTimeMs is elevated, list under-replicated partitions with kafka-topics.sh --under-replicated-partitions. Cross-reference to find the common lagging follower broker. Inspect that follower’s disk I/O (iostat) and network health (retransmits, packet loss).
If LocalTimeMs is elevated, inspect the leader broker’s disk metrics. Check iostat for high await or %util. Look for page cache thrashing via /proc/vmstat pgmajfault.
If RequestQueueTimeMs is elevated, check RequestHandlerAvgIdlePercent. Sustained values below 0.3 indicate I/O thread saturation. Check RequestQueueSize; sustained values above half of queued.max.requests confirm queue backup.
Check for GC pauses aligned with latency spikes. Use jstat or GC logs. Full GC pauses over several seconds stall follower fetch threads and directly expand RemoteTimeMs.
Detect timeout cascades. If FailedProduceRequestsPerSec rises while BytesInPerSec climbs without a matching rise in MessagesInPerSec, producers are retrying with larger batches and amplifying load.
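
The cascade check in the last step can be sketched as a comparison of two samples of each rate. The function name detect_timeout_cascade and the 20% tolerance are assumptions made for illustration. Tune the tolerance against your own baseline before alerting on it.

```shell
# Hypothetical helper: compares "before" and "after" samples of three broker rates.
# $1 $2 = FailedProduceRequestsPerSec before/after
# $3 $4 = BytesInPerSec before/after
# $5 $6 = MessagesInPerSec before/after
detect_timeout_cascade() {
  awk -v f0="$1" -v f1="$2" -v b0="$3" -v b1="$4" -v m0="$5" -v m1="$6" 'BEGIN {
    if (b0 <= 0 || m0 <= 0) { print "insufficient-data"; exit }
    bytes_growth = b1 / b0
    msgs_growth  = m1 / m0
    # Assumed tolerance: bytes growing 20% faster than messages counts as a mismatch.
    if (f1 > f0 && bytes_growth > 1 && bytes_growth > msgs_growth * 1.2)
      print "possible-cascade"
    else
      print "no-cascade-signal"
  }'
}

detect_timeout_cascade 0 50 1000000 1800000 2000 2100
```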

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `RemoteTimeMs` (Produce) | Time spent waiting for follower acks; dominates `acks=all` latency | p99 approaching `replica.lag.time.max.ms` (default 30 s) |
| `LocalTimeMs` (Produce) | Time to write to leader log; proxy for disk health | p99 sustained above 2-3x baseline |
| `RequestQueueTimeMs` (Produce) | Time waiting for an I/O thread before processing | Sustained growth above baseline; spikes precede saturation |
| `UnderReplicatedPartitions` | Followers are not keeping up; widens the replication window | Nonzero sustained > 2 min outside of maintenance |
| `RequestHandlerAvgIdlePercent` | I/O thread saturation headroom; low values mean queue growth | Sustained below 0.3 |
| `RequestQueueSize` | Pressure between network and I/O threads | Consistently above 250 (half the default `queued.max.requests`) |
| Produce purgatory size | `acks=all` requests blocked in the timer wheel | > 2x baseline sustained > 5 min |
| `FailedProduceRequestsPerSec` | Broker-side count of failed writes; early signal of cascades | Nonzero sustained rate |
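
Two of the warning signs above are plain threshold checks, which the sketch below evaluates for a single sample. The function name check_request_saturation is hypothetical. The word "sustained" in the table still has to be enforced by whatever samples the values over time.

```shell
# Hypothetical helper: evaluates the idle-percent and queue-size warning signs.
# $1 = RequestHandlerAvgIdlePercent (0-1), $2 = RequestQueueSize,
# $3 = queued.max.requests (defaults to 500)
check_request_saturation() {
  awk -v idle="$1" -v qsize="$2" -v qmax="${3:-500}" 'BEGIN {
    warn = 0
    if (idle < 0.3)         { print "WARN idle " idle " below 0.3"; warn = 1 }
    if (qsize > (qmax / 2)) { print "WARN queue " qsize " above " (qmax / 2); warn = 1 }
    if (!warn) print "OK"
  }'
}

check_request_saturation 0.22 310
```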

Fixes

Slow follower replication

When RemoteTimeMs is high, correlate UnderReplicatedPartitions across brokers to find the lagging follower. If that follower shows disk degradation, perform a controlled shutdown to trigger clean leader elections and isolate the sick broker. If the issue is a network partition, fix connectivity first. Do not restart additional brokers during recovery; extra controller events worsen queue backup.

Tradeoff: Removing a broker reduces available replicas. Ensure min.insync.replicas can still be met before shutting down.
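
The min.insync.replicas check is simple arithmetic per partition. The sketch below assumes the broker being shut down is still counted in the ISR. If it has already dropped out, removing it does not shrink the ISR further. The function name can_remove_replica is hypothetical.

```shell
# Hypothetical helper: checks whether one in-sync replica can be taken away.
# $1 = current ISR size for the partition, $2 = min.insync.replicas
can_remove_replica() {
  if [ "$(( $1 - 1 ))" -ge "$2" ]; then
    echo "safe: $(( $1 - 1 )) replicas would remain in sync (min.insync.replicas=$2)"
  else
    echo "unsafe: only $(( $1 - 1 )) replicas would remain in sync, below min.insync.replicas=$2"
  fi
}

can_remove_replica 3 2
can_remove_replica 2 2
```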

Slow leader disk

When LocalTimeMs is high, inspect the leader’s disk await. If a single log directory on a JBOD volume is degraded, reassign partitions away from that path. If page cache thrashing is the cause, isolate backfill consumers or add RAM. For SSDs, sustained await above 20 ms is a ticket; above 100 ms with visible broker impact warrants immediate attention.

Tradeoff: Restarting a broker to move log directories loses the page cache and causes a cold-start latency spike.

Request queue saturation

If RequestQueueTimeMs is growing and RequestHandlerAvgIdlePercent is below 0.3, the broker cannot process requests fast enough. Increase num.io.threads only if CPU cores are available and the bottleneck is not disk I/O. If producers are overwhelming the broker, apply quotas to throttle ingress. If the root cause is disk or GC, fix that first; adding threads to a slow disk increases contention.

Tradeoff: More threads increase context-switch overhead and memory pressure.

Timeout cascade

If producers are retrying in a positive feedback loop, temporarily throttle them via producer quotas to break the spiral. Identify and remediate the original slow broker that triggered the timeouts. Monitor FailedProduceRequestsPerSec to confirm the rate drops after throttling.

Tradeoff: Quotas throttle every producer that shares the throttled user or client ID, not just the noisy ones.
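
A producer byte-rate quota is typically set with kafka-configs.sh. The sketch below only prints the command so it can be reviewed before anyone runs it. It assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas. The function name, the client ID orders-producer, and the 1 MiB/s rate are all hypothetical.

```shell
# Hypothetical dry-run helper: prints a kafka-configs.sh command that sets a
# producer byte-rate quota for one client ID. Nothing is executed.
# $1 = client ID, $2 = bytes per second, $3 = bootstrap server (optional)
print_producer_quota_cmd() {
  echo "kafka-configs.sh --bootstrap-server ${3:-localhost:9092} --alter" \
       "--add-config producer_byte_rate=$2" \
       "--entity-type clients --entity-name $1"
}

print_producer_quota_cmd orders-producer 1048576
```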

GC pauses

If GC pauses correlate with latency spikes, right-size the JVM heap toward 4-8 GB. Oversized heaps cause longer pauses and steal page cache from the OS. Reduce message format down-conversion, which materializes large buffers on-heap. Use G1GC and monitor G1 Old Generation collection time.

Tradeoff: A smaller heap reduces GC pause time but limits metadata caching.
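
To put numbers on full-GC activity during an incident, the jstat output can be reduced to a delta. This is a minimal sketch, assuming `jstat -gcutil` prints a single header row containing FGC and FGCT columns. The function name full_gc_delta is hypothetical.

```shell
# Reads `jstat -gcutil` output on stdin and reports how much the full-GC
# count (FGC) and full-GC time (FGCT, seconds) grew between the first and
# last sample. Columns are located by header name because layout varies by JDK.
full_gc_delta() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    !seen   { fgc0 = $(col["FGC"]); fgct0 = $(col["FGCT"]); seen = 1 }
            { fgc1 = $(col["FGC"]); fgct1 = $(col["FGCT"]) }
    END     { if (seen) printf "full_gcs=%d full_gc_seconds=%.3f\n", fgc1 - fgc0, fgct1 - fgct0 }
  '
}
```

For example, `jstat -gcutil <BROKER_PID> 1000 60 | full_gc_delta` summarizes one minute of samples. Compare the result with the RemoteTimeMs spikes from the same minute.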

Prevention

Monitor the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) rather than only TotalTimeMs. TotalTimeMs can mask whether the bottleneck is disk, replication, or thread saturation. Set ticket-level alerts on RequestHandlerAvgIdlePercent before it drops below 0.3. Correlate UnderReplicatedPartitions with UnderMinIsrPartitionCount to confirm write path impact. Run failure tests to measure how long broker death and leader election take in your cluster; if recovery exceeds your request.timeout.ms budget, reduce partition count per broker or improve disk throughput. Avoid oversized heaps; longer GC pauses directly expand RemoteTimeMs by stalling followers.

How Netdata helps

Netdata collects Kafka JMX metrics (RemoteTimeMs, LocalTimeMs, RequestQueueTimeMs) alongside OS disk latency and page cache pressure. Unified timelines show whether latency is disk, queue, or replication-bound. Track UnderReplicatedPartitions, FailedProduceRequestsPerSec, and RequestHandlerAvgIdlePercent against baselines. Overlay JVM GC pause charts with produce latency to spot heap-related delays without parsing GC logs manually.
