Kafka producer timeout cascade: when retries pile load onto a slow broker

Your producers are timing out. P99 produce latency is climbing. You see more requests hitting the brokers, yet actual throughput of new messages is flat or falling. This is not a traffic spike. It is a producer timeout cascade: one slow broker causes clients to retry, and those retries add load to the same overloaded broker, closing the loop until the cluster is pinned.

The cascade is dangerous because it inverts normal load signals. BytesInPerSec can rise while MessagesInPerSec stalls, making the problem look like a burst of large messages instead of a retry storm. If you scale brokers or restart producers without breaking the feedback loop, you prolong the outage.

What this means

Kafka producers retry failed or timed-out sends by default. When a broker slows from disk degradation, a long GC pause, or replication backlog, produce requests exceed request.timeout.ms. The client resends the same batches. Those retries burn network threads, I/O threads, and request queue slots on the already-stressed broker. Throughput drops while request volume rises. With acks=all, slow followers inflate RemoteTimeMs, giving the producer even more reason to retry.

The loop continues until the root cause is removed or the producers exhaust delivery.timeout.ms. In the worst case, retry traffic prevents the broker from recovering even after the original issue subsides.

flowchart TD
    A[Broker slows: disk/GC/overload] --> B[Produce latency spikes]
    B --> C[Clients hit request.timeout.ms]
    C --> D[Producers retry by default]
    D --> E[More requests hit the slow broker]
    E --> F[Request queue fills]
    F --> G[Throughput drops while retries rise]
    G --> A
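
To get a feel for how much load a single stalled batch can add, the rough arithmetic below estimates how many times a producer may send the same batch before it gives up. This is a sketch under simplifying assumptions: every attempt runs for the full request.timeout.ms and then waits a fixed retry.backoff.ms. It ignores linger.ms, time spent queued in the client, and the exponential backoff used by newer clients. The function name max_attempts is hypothetical, and the example values (120000, 30000, 100) are the commonly documented producer defaults, so verify them against your client version.

```shell
# Rough upper bound on send attempts per batch before delivery.timeout.ms expires.
# Assumption: each attempt times out after request.timeout.ms, then waits retry.backoff.ms.
# Usage: max_attempts DELIVERY_TIMEOUT_MS REQUEST_TIMEOUT_MS RETRY_BACKOFF_MS
max_attempts() {
  local delivery_ms="$1"
  local request_ms="$2"
  local backoff_ms="$3"
  local cycle_ms=$((request_ms + backoff_ms))
  # Attempt k starts at (k-1)*cycle_ms; count the attempts that start inside the window.
  echo $(( (delivery_ms + cycle_ms - 1) / cycle_ms ))
}

max_attempts 120000 30000 100
```

Under these assumptions a slow broker can receive each affected batch several times over, which is why request volume climbs while new-message throughput does not.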

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O degradation on the broker | LocalTimeMs for Produce spikes; RequestHandlerAvgIdlePercent drops; OS disk await is high | iostat -xz 1 on the broker host |
| Slow follower with acks=all | RemoteTimeMs for Produce spikes; UnderReplicatedPartitions rises; produce purgatory grows | kafka-topics.sh --describe --under-replicated-partitions |
| GC pause on the broker | Latency jumps across all request types simultaneously; ISR shrinks may follow | jstat -gcutil or GC logs for Full GC events |
| Network partition or packet loss | Timeouts without high broker latency; TCP retransmits rise; follower fetches fail | ss -i and OS-level RetransSegs from /proc/net/snmp |

Quick checks

Run these read-only checks to confirm the cascade and locate the slow broker.

# Under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Broker request handler idle percentage (sustained low = overload)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Produce request latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Failed produce requests
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Compare bytes in versus messages in to spot retries
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Disk I/O latency on the broker host
iostat -xz 1

# GC behavior
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

# Request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
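
The bytes-versus-messages comparison can be turned into a quick check once you have two samples of each rate. The helper below is a minimal sketch, not a tuned detector: the function name retry_signature and the 10% threshold are illustrative assumptions, and the inputs are values you read yourself from the BytesInPerSec and MessagesInPerSec queries.

```shell
# Flag a possible retry storm from two samples of BytesInPerSec and MessagesInPerSec.
# Usage: retry_signature BYTES_BEFORE MSGS_BEFORE BYTES_NOW MSGS_NOW
# Assumption: the 10% rise and "no growth" thresholds are illustrative, not tuned values.
retry_signature() {
  awk -v b0="$1" -v m0="$2" -v b1="$3" -v m1="$4" 'BEGIN {
    if (b0 <= 0 || m0 <= 0) { print "insufficient-data"; exit }
    bytes_up  = (b1 > b0 * 1.10)   # bytes rose by more than 10%
    msgs_flat = (m1 <= m0)         # message rate did not grow
    if (bytes_up && msgs_flat) print "possible-retry-cascade"
    else print "no-divergence"
  }'
}

# Example with made-up numbers: bytes up 30%, messages slightly down.
retry_signature 50000000 40000 65000000 38000
```

Take the two samples a minute or more apart so that the one-minute rates have time to move. A "possible-retry-cascade" result is a prompt to continue with the diagnosis steps, not proof on its own, because a shift toward larger messages produces the same pattern.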

How to diagnose it

Work through these steps in order. Do not restart brokers until you know whether the bottleneck is local, remote, or a network path issue.

  1. Confirm the cascade signature. Compare BytesInPerSec to MessagesInPerSec on the broker. If bytes are rising while the message rate is flat or falling, the same batches are being retried. The message rate usually drops first; bytes may climb for one to two delivery.timeout.ms windows as retries pile up. On the producer, check record-retry-rate and record-error-rate if producer JMX is available. Both rising together confirm a retry loop, not a traffic increase.

  2. Find the slow broker. Pull RequestHandlerAvgIdlePercent across brokers. The slow broker has the lowest idle percentage, often paired with a high RequestQueueSize. If multiple brokers are affected, find the one whose LocalTimeMs or RemoteTimeMs rose first.

  3. Isolate the bottleneck type.

    • LocalTimeMs high and OS disk await high indicates disk saturation.
    • RemoteTimeMs high and UnderReplicatedPartitions rising indicates a slow follower.
    • RequestQueueTimeMs high with moderate LocalTimeMs indicates I/O thread exhaustion from request volume alone.

  4. Correlate with producer timeouts. Producer logs should show TimeoutException or NotEnoughReplicasException aligned with the broker latency spike. Producer errors should lag slightly behind the broker slowdown. If producer errors started first, look for a network path or DNS resolution problem rather than broker overload.

  5. Check for GC or network events. Overlay GC pause times and network retransmit metrics with the latency spike. A Full GC or NIC saturation within seconds of the first timeout strongly suggests the root cause.
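
The bottleneck isolation in step 3 can be sketched as a small helper that names the largest of the three Produce latency components. This is a simplification: it assumes the largest p99 value is the dominant contributor and does not look at disk await or UnderReplicatedPartitions, which you still need to confirm by hand. The function name dominant_component is hypothetical.

```shell
# Name the dominant component of Produce latency from three p99 values (ms).
# Usage: dominant_component LOCAL_MS REMOTE_MS QUEUE_MS
# Assumption: the largest of the three values is treated as the main contributor.
dominant_component() {
  awk -v l="$1" -v r="$2" -v q="$3" 'BEGIN {
    if (l >= r && l >= q)      print "local: check disk await on the leader"
    else if (r >= l && r >= q) print "remote: check followers and UnderReplicatedPartitions"
    else                       print "queue: check request volume and request handler idle percentage"
  }'
}

# Example with made-up numbers: replication wait dominates.
dominant_component 20 900 15
```

Feed it the 99thPercentile values of LocalTimeMs, RemoteTimeMs, and RequestQueueTimeMs from the slow broker, then verify the suggested follow-up before acting.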

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| RequestHandlerAvgIdlePercent | Best single indicator of broker processing saturation | Sustained below 0.3; approaching 0.1 |
| Produce LocalTimeMs / RemoteTimeMs / RequestQueueTimeMs | Locates the bottleneck: disk, replication, or thread pool | p99 approaching or exceeding request.timeout.ms |
| BytesInPerSec vs MessagesInPerSec | Retries inflate bytes without increasing new message count | Bytes rise while messages stall or drop |
| RequestQueueSize | Pressure between network and I/O threads | Sustained above 250 (half of default 500) |
| FailedProduceRequestsPerSec | Direct measure of producer-visible broker errors | Sustained nonzero outside maintenance |
| record-retry-rate (producer side) | Client-side confirmation of the retry loop | Sustained nonzero |
| Disk I/O await | Underlying disk health | Above 20 ms for SSDs, 50 ms for HDDs sustained |

Fixes

Break the retry loop with producer quotas

The fastest way to stop the feedback cycle is to throttle the retrying producers. Apply a temporary broker-side produce byte-rate quota to the affected client IDs or users. This throttles the producer without changing client configuration. The tradeoff is that the producer will back up or drop records depending on its delivery.timeout.ms, so coordinate with application owners before enforcing a low quota.
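
A temporary quota of this kind is set with kafka-configs.sh using the producer_byte_rate setting. The sketch below only prints the commands so they can be reviewed before anyone runs them. It assumes a Kafka version whose kafka-configs.sh accepts --bootstrap-server for client quotas; the helper names, the client ID orders-service, and the 1 MiB/s rate are hypothetical examples.

```shell
# Print (do not run) the command that sets a produce byte-rate quota for one client ID.
# Usage: quota_cmd CLIENT_ID BYTES_PER_SEC [BOOTSTRAP_SERVER]
quota_cmd() {
  local client_id="$1"
  local rate="$2"
  local bootstrap="${3:-localhost:9092}"
  printf "kafka-configs.sh --bootstrap-server %s --alter --add-config 'producer_byte_rate=%s' --entity-type clients --entity-name %s\n" \
    "$bootstrap" "$rate" "$client_id"
}

# Print the command that removes the quota once the broker has recovered.
# Usage: quota_remove_cmd CLIENT_ID [BOOTSTRAP_SERVER]
quota_remove_cmd() {
  local client_id="$1"
  local bootstrap="${2:-localhost:9092}"
  printf "kafka-configs.sh --bootstrap-server %s --alter --delete-config producer_byte_rate --entity-type clients --entity-name %s\n" \
    "$bootstrap" "$client_id"
}

# Hypothetical client ID, throttled to 1 MiB/s.
quota_cmd orders-service 1048576
quota_remove_cmd orders-service
```

Printing first keeps the change reviewable during an incident. Record which quotas you applied so the matching removal is not forgotten after recovery.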

Gracefully remove the slow broker

If one broker is clearly degraded, initiate a controlled shutdown. The broker hands partition leadership to in-sync replicas on other brokers before it stops, so clients move away from the sick node with minimal disruption. Confirm that controlled.shutdown.enable=true (the default); without it, leadership moves only after the controller detects that the broker is gone, which lengthens the window in which its partitions are unavailable. The tradeoff is a temporary increase in UnderReplicatedPartitions and a brief latency spike during leadership migration. Do not restart the broker until the root cause is fixed, and do not shut down additional brokers while the cluster is already under stress.

Fix disk or GC pressure

For disk I/O saturation, stop non-Kafka workloads on the volume, check for RAID degradation, and verify that log compaction is not running behind. If the disk is failing, replace it. For GC-related pauses, review heap sizing (4-8 GB is typical for brokers) and check for message format down-conversion that can inflate on-heap buffers. A broker restart clears transient memory pressure but costs a cold page cache and can trigger further ISR changes.

Address slow followers for acks=all

When RemoteTimeMs is the primary contributor, identify the lagging follower via UnderReplicatedPartitions and follower-side disk metrics. If the follower is transiently slow due to a restart, wait for it to catch up. If it is persistently degraded, remove it from the cluster. Do not lower min.insync.replicas to mask the problem; that trades durability for availability. With fewer in-sync copies required, losing the leader can mean data loss or, if unclean.leader.election.enable is on, an unclean election.
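
One way to spot the lagging follower is to count, per broker, how many partitions list it as a replica but not in the ISR. The filter below is a sketch that reads kafka-topics.sh --describe output on standard input. It assumes partition lines carry "Replicas: 1,2,3" and "Isr: 1,3" style fields, which can vary between Kafka versions; the function name out_of_sync_followers is hypothetical.

```shell
# Count, per broker ID, the partitions where it is a replica but missing from the ISR.
# Reads `kafka-topics.sh --describe` output on stdin; prints "BROKER_ID COUNT", worst first.
# Assumption: partition lines contain "Replicas: 1,2,3" and "Isr: 1,3" fields.
out_of_sync_followers() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:" && $(i + 1) ~ /^[0-9,]+$/) replicas = $(i + 1)
        if ($i == "Isr:"      && $(i + 1) ~ /^[0-9,]+$/) isr = $(i + 1)
      }
      if (replicas == "") next            # skip topic summary lines
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++)
        if (index("," isr ",", "," r[j] ",") == 0) missing[r[j]]++
    }
    END { for (b in missing) print b, missing[b] }
  ' | sort -k2,2nr -k1,1n
}

# Example with made-up describe output: broker 2 is out of sync on every partition.
printf 'Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3\nTopic: orders\tPartition: 1\tLeader: 3\tReplicas: 3,2,1\tIsr: 3,1\n' | out_of_sync_followers
```

In practice, pipe the output of kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions into the function. A broker that tops the list across many topics is the first candidate for follower-side disk and network checks.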

Prevention

  • Monitor the ratio of BytesInPerSec to MessagesInPerSec; a rising ratio without a change in message size points to retries.
  • Alert on RequestHandlerAvgIdlePercent dropping below 0.5 before it reaches critical saturation.
  • Set producer retries and delivery.timeout.ms to bounded windows that match your latency SLA.
  • Use broker-side quotas proactively for high-volume producers to cap blast radius.
  • Keep disk I/O headroom and monitor LocalTimeMs trends; a sustained 2x baseline increase is an early warning.
  • Monitor follower health and ISR balance so that acks=all traffic does not pile onto one slow replica.

How Netdata helps

  • Correlates broker RequestHandlerAvgIdlePercent with OS disk latency (disk.await) on the same node to pinpoint I/O saturation.
  • Surfaces the Produce request latency breakdown (LocalTimeMs, RemoteTimeMs, RequestQueueTimeMs) in one view without manual JMXterm queries.
  • Highlights divergence between BytesInPerSec and MessagesInPerSec to detect retry cascades early.
  • Tracks RequestQueueSize and produce purgatory size so you can distinguish queue saturation from replication delays.
  • Maps producer-side retry rates against broker saturation metrics to confirm the direction of the feedback loop.