Kafka producer timeout cascade: when retries pile load onto a slow broker
Your producers are timing out. P99 produce latency is climbing. You see more requests hitting the brokers, yet actual throughput of new messages is flat or falling. This is not a traffic spike. It is a producer timeout cascade: one slow broker causes clients to retry, and those retries add load to the same overloaded broker, closing the loop until the cluster is pinned.
The cascade is dangerous because it inverts normal load signals. BytesInPerSec can rise while MessagesInPerSec stalls, making the problem look like a burst of large messages instead of a retry storm. If you scale brokers or restart producers without breaking the feedback loop, you prolong the outage.
What this means
Kafka producers retry failed or timed-out sends by default. When a broker slows from disk degradation, a long GC pause, or replication backlog, produce requests exceed request.timeout.ms. The client resends the same batches. Those retries burn network threads, I/O threads, and request queue slots on the already-stressed broker. Throughput drops while request volume rises. With acks=all, slow followers inflate RemoteTimeMs, giving the producer even more reason to retry.
The loop continues until the root cause is removed or the producers exhaust delivery.timeout.ms. In the worst case, retry traffic prevents the broker from recovering even after the original issue subsides.
flowchart TD
A[Broker slows: disk/GC/overload] --> B[Produce latency spikes]
B --> C[Clients hit request.timeout.ms]
C --> D[Producers retry by default]
D --> E[More requests hit the slow broker]
E --> F[Request queue fills]
F --> G[Throughput drops while retries rise]
G --> A
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O degradation on the broker | LocalTimeMs for Produce spikes; RequestHandlerAvgIdlePercent drops; OS disk await is high | iostat -xz 1 on the broker host |
| Slow follower with acks=all | RemoteTimeMs for Produce spikes; UnderReplicatedPartitions rises; produce purgatory grows | kafka-topics.sh --describe --under-replicated-partitions |
| GC pause on the broker | Latency jumps across all request types simultaneously; ISR shrinks may follow | jstat -gcutil or GC logs for Full GC events |
| Network partition or packet loss | Timeouts without high broker latency; TCP retransmits rise; follower fetches fail | ss -i and OS-level RetransSegs from /proc/net/snmp |
Quick checks
Run these read-only checks to confirm the cascade and locate the slow broker.
# Under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Broker request handler idle percentage (sustained low = overload)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Produce request latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Failed produce requests
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Compare bytes in versus messages in to spot retries
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Disk I/O latency on the broker host
iostat -xz 1
# GC behavior
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
# Request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
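The BytesInPerSec and MessagesInPerSec readings above are easier to compare as a single bytes-per-message figure. The following is a minimal sketch, assuming you have captured both OneMinuteRate values and know a baseline bytes-per-message figure from normal traffic. The function name `retry_signal` and the 1.5x factor are illustrative assumptions, not Kafka defaults.

```shell
# Hypothetical helper: flags a possible retry cascade when the average
# bytes per message rises well above a known-good baseline.
# Usage: retry_signal <bytes_in_per_sec> <messages_in_per_sec> <baseline_bytes_per_msg> [factor]
retry_signal() {
  awk -v b="$1" -v m="$2" -v base="$3" -v f="${4:-1.5}" 'BEGIN {
    b += 0; m += 0; base += 0; f += 0              # force numeric comparison
    if (m <= 0) { print "no-messages"; exit }      # avoid division by zero
    ratio = b / m                                  # current bytes per message
    if (ratio > base * f) print "suspect-retries"
    else                  print "normal"
  }'
}

# Example with made-up readings: 50 MiB/s in, 20000 msg/s, 1 KiB baseline.
retry_signal 52428800 20000 1024
```

A single reading proves little; the signal is only meaningful when the ratio stays elevated across several samples and the producer-side retry metrics agree.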
How to diagnose it
Work through these steps in order. Do not restart brokers until you know whether the bottleneck is local, remote, or a network path issue.
1. Confirm the cascade signature. Compare BytesInPerSec to MessagesInPerSec on the broker. If bytes are rising while the message rate is flat or falling, the same batches are being retried. The message rate usually drops first; bytes may climb for one to two delivery.timeout.ms windows as retries pile up. On the producer, check record-retry-rate and record-error-rate if producer JMX is available. Both rising together confirm a retry loop, not a traffic increase.
2. Find the slow broker. Pull RequestHandlerAvgIdlePercent across brokers. The slow broker has the lowest idle percentage, often paired with a high RequestQueueSize. If multiple brokers are affected, find the one whose LocalTimeMs or RemoteTimeMs rose first.
3. Isolate the bottleneck type.
   - LocalTimeMs high and OS disk await high indicates disk saturation.
   - RemoteTimeMs high and UnderReplicatedPartitions rising indicates a slow follower.
   - RequestQueueTimeMs high with moderate LocalTimeMs indicates I/O thread exhaustion from request volume alone.
4. Correlate with producer timeouts. Producer logs should show TimeoutException or NotEnoughReplicasException aligned with the broker latency spike. Producer errors should lag slightly behind the broker slowdown. If producer errors started first, look for a network path or DNS resolution problem rather than broker overload.
5. Check for GC or network events. Overlay GC pause times and network retransmit metrics with the latency spike. A Full GC or NIC saturation within seconds of the first timeout strongly suggests the root cause.
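The "Isolate the bottleneck type" step can be sketched as a small helper that names the dominant component of Produce latency. This is a simplification under stated assumptions: it only picks the largest of the three p99 values, and the function name and labels are hypothetical. Confirm the result against OS disk await and UnderReplicatedPartitions as described in that step.

```shell
# Hypothetical helper: names the largest component of Produce latency.
# Arguments are p99 values in milliseconds:
#   classify_bottleneck <LocalTimeMs> <RemoteTimeMs> <RequestQueueTimeMs>
classify_bottleneck() {
  awk -v l="$1" -v r="$2" -v q="$3" 'BEGIN {
    l += 0; r += 0; q += 0                        # force numeric comparison
    if (l >= r && l >= q)      print "disk"       # local append dominates
    else if (r >= l && r >= q) print "follower"   # waiting on replication dominates
    else                       print "threads"    # time in the request queue dominates
  }'
}

# Example with made-up readings where replication wait dominates.
classify_bottleneck 12 850 40
```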
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| RequestHandlerAvgIdlePercent | Best single indicator of broker processing saturation | Sustained below 0.3; approaching 0.1 |
| Produce LocalTimeMs / RemoteTimeMs / RequestQueueTimeMs | Locates the bottleneck: disk, replication, or thread pool | p99 approaching or exceeding request.timeout.ms |
| BytesInPerSec vs MessagesInPerSec | Retries inflate bytes without increasing new message count | Bytes rise while messages stall or drop |
| RequestQueueSize | Pressure between network and I/O threads | Sustained above 250 (half of default 500) |
| FailedProduceRequestsPerSec | Direct measure of producer-visible broker errors | Sustained nonzero outside maintenance |
| record-retry-rate (producer side) | Client-side confirmation of the retry loop | Sustained nonzero |
| Disk I/O await | Underlying disk health | Above 20 ms for SSDs, 50 ms for HDDs sustained |
Fixes
Break the retry loop with producer quotas
The fastest way to stop the feedback cycle is to throttle the retrying producers. Apply a temporary broker-side produce byte-rate quota to the affected client IDs or users. This throttles the producer without changing client configuration. The tradeoff is that the producer will back up or drop records depending on its delivery.timeout.ms, so coordinate with application owners before enforcing a low quota.
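A temporary quota like the one described above is set with kafka-configs.sh. The sketch below only prints the commands so they can be reviewed before anyone runs them during an incident. The function name, the client ID `orders-producer`, and the 1 MiB/s rate are assumptions for illustration.

```shell
# Prints (does not run) the kafka-configs.sh commands that set and later
# remove a produce byte-rate quota for one client ID.
# Usage: quota_commands <bootstrap_server> <client_id> <bytes_per_sec>
quota_commands() {
  echo "kafka-configs.sh --bootstrap-server $1 --alter" \
       "--add-config producer_byte_rate=$3" \
       "--entity-type clients --entity-name $2"
  echo "kafka-configs.sh --bootstrap-server $1 --alter" \
       "--delete-config producer_byte_rate" \
       "--entity-type clients --entity-name $2"
}

# Example: cap the hypothetical client "orders-producer" at 1 MiB/s.
quota_commands localhost:9092 orders-producer 1048576
```

The first printed command applies the quota and the second removes it once the broker has recovered. Byte-rate quotas are enforced per broker, so the effective cluster-wide cap depends on how many brokers the client writes to.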
Gracefully remove the slow broker
If one broker is clearly degraded, initiate a controlled shutdown. This triggers clean leader elections and moves partition leadership away from the sick node. Confirm that controlled.shutdown.enable=true (the default) is in effect; without it the broker stops without first handing off leadership, and its partitions are unavailable until the controller detects the failure and elects new leaders. The tradeoff is a temporary increase in UnderReplicatedPartitions and a brief latency spike during leadership migration. Do not restart the broker until the root cause is fixed, and do not shut down additional brokers while the cluster is already under stress.
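Before and after the shutdown, it helps to know how many partitions the degraded broker still leads. The following is a rough sketch that counts them from `kafka-topics.sh --describe` output; the function name is hypothetical, and it assumes each partition line contains a `Leader: <id>` pair, which may differ across Kafka versions.

```shell
# Counts partitions led by a given broker ID, reading
# `kafka-topics.sh --describe` output on stdin.
# Usage: kafka-topics.sh --bootstrap-server localhost:9092 --describe | leaders_on 3
leaders_on() {
  awk -v id="$1" '{
    for (i = 1; i < NF; i++)
      if ($i == "Leader:" && $(i + 1) == id) n++   # field after "Leader:" is the broker ID
  } END { print n + 0 }'                           # prints 0 when nothing matched
}

# Illustrative input in the shape of --describe output (not real cluster data).
printf 'Topic: orders\tPartition: 0\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1,2\nTopic: orders\tPartition: 1\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n' | leaders_on 3
```

A count of zero after the controlled shutdown indicates that leadership has moved off the broker.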
Fix disk or GC pressure
For disk I/O saturation, stop non-Kafka workloads on the volume, check for RAID degradation, and verify that log compaction is not running behind. If the disk is failing, replace it. For GC-related pauses, review heap sizing (4-8 GB is typical for brokers) and check for message format down-conversion that can inflate on-heap buffers. A broker restart clears transient memory pressure but costs a cold page cache and can trigger further ISR changes.
Address slow followers for acks=all
When RemoteTimeMs is the primary contributor, identify the lagging follower via UnderReplicatedPartitions and follower-side disk metrics. If the follower is transiently slow due to a restart, wait for it to catch up. If it is persistently degraded, remove it from the cluster. Do not lower min.insync.replicas to mask the problem; that trades durability for throughput and can lead to unclean elections.
Prevention
- Monitor the ratio of BytesInPerSec to MessagesInPerSec.
- Alert on RequestHandlerAvgIdlePercent dropping below 0.5 before it reaches critical saturation.
- Set producer retries and delivery.timeout.ms to bounded windows that match your latency SLA.
- Use broker-side quotas proactively for high-volume producers to cap blast radius.
- Keep disk I/O headroom and monitor LocalTimeMs trends; a sustained 2x baseline increase is an early warning.
- Monitor follower health and ISR balance so that acks=all traffic does not pile onto one slow replica.
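To reason about what a "bounded window" means for retry volume, the sketch below estimates how many times one batch can be sent to a broker that never answers. It is a rough upper bound under stated assumptions: every attempt takes the full request.timeout.ms, the backoff is a fixed retry.backoff.ms, and linger time and the `retries` cap are ignored. The function name is hypothetical.

```shell
# Rough upper bound on sends per batch against an unresponsive broker:
# ceiling of delivery.timeout.ms / (request.timeout.ms + retry.backoff.ms).
# Usage: max_send_attempts <delivery.timeout.ms> <request.timeout.ms> <retry.backoff.ms>
max_send_attempts() {
  echo $(( ($1 + $2 + $3 - 1) / ($2 + $3) ))   # integer ceiling division
}

max_send_attempts 120000 30000 100   # the documented producer defaults
max_send_attempts 30000 10000 100    # a tighter, hypothetical window
```

With these inputs the sketch prints 4 and 3. Each extra attempt is another copy of the batch landing on the slow broker, so shortening delivery.timeout.ms relative to request.timeout.ms directly limits how much a single producer can amplify the load.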
How Netdata helps
- Correlates broker RequestHandlerAvgIdlePercent with OS disk latency (disk.await) on the same node to pinpoint I/O saturation.
- Surfaces the Produce request latency breakdown (LocalTimeMs, RemoteTimeMs, RequestQueueTimeMs) in one view without manual JMXterm queries.
- Highlights divergence between BytesInPerSec and MessagesInPerSec to detect retry cascades early.
- Tracks RequestQueueSize and produce purgatory size so you can distinguish queue saturation from replication delays.
- Maps producer-side retry rates against broker saturation metrics to confirm the direction of the feedback loop.