Kafka produce request latency high: reading the TotalTimeMs breakdown
Producers are timing out or retrying. Client-side request-latency-avg is elevated and broker TotalTimeMs p99 is spiking. The total is unactionable by itself. Kafka breaks it into five sub-components, each implicating a different subsystem. You need all five via JMX or an equivalent metrics collector to know whether to fix disk, scale a thread pool, or replace a follower.
What this means
TotalTimeMs is the wall-clock time from when the broker’s network thread receives a produce request until the response is fully sent. It is the arithmetic sum of five phases:
- RequestQueueTimeMs - time queued before an I/O handler thread picks it up.
- LocalTimeMs - time the leader spends appending to the local log (disk write path, index updates).
- RemoteTimeMs - time waiting for follower acknowledgments. Non-zero only when producers use acks=all (or -1).
- ResponseQueueTimeMs - time the completed response waits for a network processor thread.
- ResponseSendTimeMs - time to serialize and write the response bytes onto the socket.
The MBean path for each is kafka.network:type=RequestMetrics,name=<Metric>,request=Produce. All expose 50thPercentile, 95thPercentile, 99thPercentile, and Mean.
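Means add across phases, but percentiles do not: the p99 of the total is generally not the sum of the per-phase p99 values. For a quick share-of-total view, the Mean attribute is therefore the safer input. The sketch below assumes the five Mean values have already been read in milliseconds; component_shares is a hypothetical helper name and the numbers are illustrative, not measurements.

```shell
# component_shares: hypothetical helper, not part of Kafka or jmxterm.
# Arguments: the five Mean values in ms, in the order listed above.
component_shares() {
    awk -v rq="$1" -v lo="$2" -v re="$3" -v rsq="$4" -v rss="$5" 'BEGIN {
        total = rq + lo + re + rsq + rss
        if (total <= 0) { print "no data"; exit }
        printf "RequestQueueTimeMs %.1f%%\n",  100 * rq  / total
        printf "LocalTimeMs %.1f%%\n",         100 * lo  / total
        printf "RemoteTimeMs %.1f%%\n",        100 * re  / total
        printf "ResponseQueueTimeMs %.1f%%\n", 100 * rsq / total
        printf "ResponseSendTimeMs %.1f%%\n",  100 * rss / total
    }'
}

# Illustrative values only: 2 + 5 + 40 + 1 + 2 = 50 ms.
component_shares 2 5 40 1 2
```

With these inputs RemoteTimeMs accounts for 80 percent of the summed means, which would point at follower acknowledgments rather than the local write path.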
When TotalTimeMs p99 approaches the producer’s request.timeout.ms (default 30 seconds), clients begin emitting TimeoutException and retrying. Those retries increase broker load and push latency higher. Escalate when p99 exceeds roughly 75 percent of request.timeout.ms.
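The 75 percent rule is simple enough to script. A minimal sketch, assuming the p99 has already been sampled in milliseconds; should_escalate is a hypothetical helper, and the second argument defaults to the producer's 30000 ms request.timeout.ms.

```shell
# should_escalate: hypothetical helper.
# Exit status 0 when p99 exceeds 75 percent of request.timeout.ms.
should_escalate() {
    p99_ms="$1"
    timeout_ms="${2:-30000}"   # producer default request.timeout.ms
    awk -v p="$p99_ms" -v t="$timeout_ms" 'BEGIN { exit !(p > 0.75 * t) }'
}

# Illustrative sample: 24000 ms against the 22500 ms threshold.
if should_escalate 24000; then
    echo "escalate: p99 is past 75% of request.timeout.ms"
else
    echo "ok"
fi
```

If your producers override request.timeout.ms, pass that value as the second argument instead of relying on the default.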
flowchart TD
A[TotalTimeMs high] --> B{Which component?}
B -->|RequestQueueTimeMs| C[I/O thread saturation]
B -->|LocalTimeMs| D{Disk slow or CPU?}
D --> E[Check disk await]
B -->|RemoteTimeMs| F[Slow followers]
B -->|ResponseQueueTimeMs| G[Network thread backlog]
B -->|ResponseSendTimeMs| H[Network congestion or slow client]
C --> I[Check RequestHandlerAvgIdlePercent]
F --> J[Check UnderReplicatedPartitions]
G --> K[Check NetworkProcessorAvgIdlePercent]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| I/O thread saturation | RequestQueueTimeMs rises while LocalTimeMs stays flat | RequestHandlerAvgIdlePercent and RequestQueueSize |
| Disk I/O pressure | LocalTimeMs spikes, often with high await | iostat -xz 1 and pgmajfault rate |
| Slow followers (acks=all) | RemoteTimeMs is the dominant component | UnderReplicatedPartitions and IsrShrinksPerSec on the leader |
| Network thread backlog | ResponseQueueTimeMs grows; ResponseSendTimeMs normal | NetworkProcessorAvgIdlePercent and ResponseQueueSize |
| Network congestion or slow client | ResponseSendTimeMs high in isolation | NIC utilization and producer-side record-retry-rate |
Quick checks
Run these read-only checks from a broker host or via JMX. None are destructive.
# Check total produce latency p99
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -n -l localhost:9999
# Check the five sub-component p99 values
for metric in RequestQueueTimeMs LocalTimeMs RemoteTimeMs ResponseQueueTimeMs ResponseSendTimeMs; do
    echo "$metric:"
    echo "get -b kafka.network:type=RequestMetrics,name=$metric,request=Produce 99thPercentile" | java -jar jmxterm.jar -n -l localhost:9999
done
# Check I/O thread saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -n -l localhost:9999
# Check network thread saturation
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -n -l localhost:9999
# Check request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -n -l localhost:9999
# List under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check disk latency
iostat -xz 1
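With the five p99 values in hand, the component that moved the most against its baseline is usually the one to chase first. A minimal sketch, assuming you keep a baseline p99 per component; dominant_component is a hypothetical helper and the figures are made up for illustration.

```shell
# dominant_component: hypothetical helper.
# Reads lines of "<component> <baseline_p99_ms> <current_p99_ms>" on stdin
# and prints the component with the largest increase over its baseline.
dominant_component() {
    awk 'NF == 3 {
        delta = $3 - $2
        if (!seen || delta > max) { max = delta; name = $1; seen = 1 }
    }
    END { if (seen) printf "%s delta_ms=%.1f\n", name, max }'
}

# Illustrative numbers only, not measurements.
dominant_component <<'EOF'
RequestQueueTimeMs 1 3
LocalTimeMs 4 6
RemoteTimeMs 10 180
ResponseQueueTimeMs 1 1
ResponseSendTimeMs 1 2
EOF
```

Comparing deltas rather than absolute values avoids blaming a component that is always the largest but has not changed.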
How to diagnose it
1. Establish the total and the baseline. Compare TotalTimeMs p99 against the producer’s request.timeout.ms. If the total is normal but clients still time out, investigate the network or client side.
2. Read the breakdown. Identify which sub-component dominates the increase. One component is usually responsible for most of the delta.
3. If RequestQueueTimeMs is high: The I/O threads cannot keep up. Check RequestHandlerAvgIdlePercent. Sustained values below 0.3 indicate severe saturation. Check RequestQueueSize; sustained values above 250 (half the default queued.max.requests of 500) confirm the bottleneck is between the network and I/O layers.
4. If LocalTimeMs is high: The leader is slow to append to its local log. Check disk I/O latency with iostat -xz 1. await above 20 ms for SSDs or 50 ms for HDDs is abnormal. Also check for page cache pressure via pgmajfault in /proc/vmstat. If disk metrics are healthy but CPU is saturated, the bottleneck is CPU, not disk.
5. If RemoteTimeMs is high: This applies only when producers use acks=all. The leader is waiting for followers to acknowledge. Check UnderReplicatedPartitions on this broker. Cross-reference with IsrShrinksPerSec to see if followers are actively falling out of sync. Check FetchFollower latency on the leader to confirm followers are being served slowly, then inspect the follower brokers’ disk and network metrics.
6. If ResponseQueueTimeMs is high: The network threads are slow to drain completed responses. Check NetworkProcessorAvgIdlePercent; sustained values below 0.3 indicate network thread saturation. Check ResponseQueueSize for confirmation.
7. If ResponseSendTimeMs is high: Produce responses are small; a spike here usually means the TCP send buffer is backing up. Check /proc/net/dev for NIC saturation. If the NIC is not saturated, the client is slow to read responses.
8. Check for retry cascades. If BytesInPerSec is rising while MessagesInPerSec is flat, producers are likely retrying the same messages. This creates a positive feedback loop that inflates all components.
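The retry-cascade heuristic can be sketched as a comparison of two samples taken some minutes apart. The 20 percent and 5 percent cut-offs below are assumptions chosen for illustration, not values from Kafka documentation, and retry_cascade_suspected is a hypothetical helper.

```shell
# retry_cascade_suspected: hypothetical helper.
# Arguments: bytes_in_before bytes_in_now msgs_in_before msgs_in_now
# Exit 0 when bytes-in grew by more than 20% (assumed cut-off) while
# messages-in moved by less than 5% (assumed cut-off) in either direction.
# Exit 2 when a "before" value is not positive, so no ratio can be computed.
retry_cascade_suspected() {
    awk -v b0="$1" -v b1="$2" -v m0="$3" -v m1="$4" 'BEGIN {
        if (b0 <= 0 || m0 <= 0) exit 2
        bytes_growth = (b1 - b0) / b0
        msgs_growth  = (m1 - m0) / m0
        if (msgs_growth < 0) msgs_growth = -msgs_growth
        exit !(bytes_growth > 0.20 && msgs_growth < 0.05)
    }'
}

# Illustrative samples: bytes-in +50%, messages-in +1%.
if retry_cascade_suspected 1000000 1500000 2000 2020; then
    echo "possible retry cascade: bytes-in rising, messages-in flat"
fi
```

Treat a positive result as a prompt to look at producer-side retry metrics, not as proof by itself.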
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| TotalTimeMs p99 (Produce) | End-to-end broker-side latency | Sustained >2-3x baseline or approaching request.timeout.ms |
| RequestQueueTimeMs | Pressure between network and I/O threads | Sustained elevation above baseline |
| LocalTimeMs | Leader write-path health | Spike correlating with disk await or page faults |
| RemoteTimeMs | Replication wait time for acks=all | Nonzero and spiking for acks=all topics |
| ResponseQueueTimeMs | Network thread backlog | Growing while ResponseSendTimeMs remains flat |
| ResponseSendTimeMs | Wire-time health | High in isolation |
| RequestHandlerAvgIdlePercent | I/O thread pool headroom | Sustained below 0.3 |
| NetworkProcessorAvgIdlePercent | Network thread pool headroom | Sustained below 0.3 |
| RequestQueueSize | Queue depth between thread pools | Consistently above 250 |
| UnderReplicatedPartitions | Follower replication health | Nonzero sustained outside maintenance |
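Several warning signs above are qualified as sustained. One way to make that concrete is to require a run of consecutive samples over the threshold. The sketch assumes one sample per line on stdin; sustained_above and the run length of 5 are hypothetical choices, not Kafka defaults.

```shell
# sustained_above: hypothetical helper.
# Usage: <samples on stdin> | sustained_above THRESHOLD RUN_LENGTH
# Exit 0 if at least RUN_LENGTH consecutive samples exceed THRESHOLD.
sustained_above() {
    threshold="$1"
    runs="$2"
    awk -v t="$threshold" -v n="$runs" '
        { streak = ($1 > t) ? streak + 1 : 0; if (streak >= n) found = 1 }
        END { exit !found }'
}

# Illustrative RequestQueueSize samples: five in a row above 250.
if printf '%s\n' 120 260 270 300 310 290 | sustained_above 250 5; then
    echo "RequestQueueSize sustained above 250"
fi
```

A single sample below the threshold resets the run, so brief spikes do not trigger the condition.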
Fixes
I/O thread saturation (RequestQueueTimeMs high)
Increase num.io.threads only if the broker has available CPU and the disk is not the bottleneck. Adding threads to a disk-bound broker increases contention. If the broker is already at capacity, shed load by reassigning partitions or adding brokers. Check for abnormally large produce batches that inflate per-request CPU.
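Before changing a thread pool size it helps to confirm what the broker is currently configured with. A minimal sketch, assuming a properties file in key=value form; broker_prop is a hypothetical helper, and the demo writes a temporary file because the real path to server.properties is deployment-specific. Overrides applied dynamically through kafka-configs.sh would not appear in the file.

```shell
# broker_prop: hypothetical helper.
# Prints the value of KEY from a key=value FILE, or DEFAULT if KEY is unset.
broker_prop() {
    file="$1"; key="$2"; default="$3"
    value="$(awk -F= -v k="$key" '
        $1 == k { sub(/^[^=]*=/, ""); v = $0 }
        END { print v }' "$file")"
    printf '%s\n' "${value:-$default}"
}

# Demo against a throwaway file; point it at your own server.properties.
props="$(mktemp)"
printf 'num.network.threads=6\n' > "$props"
echo "num.io.threads=$(broker_prop "$props" num.io.threads 8)"           # falls back to default 8
echo "num.network.threads=$(broker_prop "$props" num.network.threads 3)" # reads 6 from the file
rm -f "$props"
```

The fallback values passed in the demo are the broker defaults for num.io.threads (8) and num.network.threads (3).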
Disk pressure (LocalTimeMs high)
Identify whether the issue is a single slow disk (one log.dirs entry degrading while others are healthy) or global saturation. For a failing disk, remove the broker via controlled shutdown and replace the hardware. For global saturation, reduce retention, add brokers, or move high-volume topics.
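The single-disk versus global distinction can be sketched as a count of devices over a latency threshold. The helper below assumes you have already reduced iostat output to "device latency_ms" pairs; column names differ across sysstat versions, so that extraction step is deliberately left out. classify_disks is a hypothetical name.

```shell
# classify_disks: hypothetical helper.
# Usage: <"device latency_ms" lines on stdin> | classify_disks THRESHOLD_MS
# Prints "healthy", "partial: <slow devices>" or "global: <slow devices>".
classify_disks() {
    threshold_ms="$1"
    awk -v t="$threshold_ms" '
        NF == 2 {
            total++
            if ($2 > t) { slow++; names = names (names ? "," : "") $1 }
        }
        END {
            if (total == 0)        print "no data"
            else if (slow == 0)    print "healthy"
            else if (slow < total) print "partial: " names
            else                   print "global: " names
        }'
}

# Illustrative values only, using the 20 ms SSD threshold.
classify_disks 20 <<'EOF'
sda 3.1
sdb 85.4
sdc 2.7
EOF
```

A "partial" result suggests a failing device, while "global" suggests saturation. On a broker with a single log directory the two cases cannot be told apart this way.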
Slow followers (RemoteTimeMs high)
The follower is the problem, not the leader. Identify the lagging follower via UnderReplicatedPartitions correlation. Check that broker’s disk I/O and network. If the follower is terminally slow, a controlled shutdown will trigger clean leader elections and remove it from the replication path. If you cannot fix the follower immediately and durability requirements allow, consider whether the topic truly needs acks=all.
Network thread saturation (ResponseQueueTimeMs high)
Increase num.network.threads. The default of 3 is often too low for TLS-terminated traffic or high connection counts. If TLS handshakes dominate, consider offloading TLS or increasing threads significantly. Also check for slow clients that do not read responses promptly, causing backpressure.
Network congestion (ResponseSendTimeMs high)
Check for NIC saturation, packet loss, or cross-AZ bandwidth limits. If the client is slow, fix it on the client side. When large fetch responses share the same NIC, shrink them by lowering max.partition.fetch.bytes on the consumers.
Prevention
- Collect and baseline all five sub-components. Operators often alert only on TotalTimeMs. Baseline each component separately so you can detect shifts before the total breaches threshold.
- Maintain thread pool headroom. Keep RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent above 0.5 during peak. Below 0.3, you have no buffer for failure-induced load shifts.
- Correlate broker latency with OS signals. Disk await, pgmajfault, and NIC utilization explain component-level spikes that Kafka metrics only summarize.
- Test recovery before incidents. Gracefully shut down one broker in staging and measure how LocalTimeMs and RemoteTimeMs shift across the remaining brokers. Know your headroom.
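The two headroom thresholds above translate into a three-way classification. A minimal sketch, assuming the idle value is a fraction between 0 and 1 as reported by the broker; idle_headroom and its labels are hypothetical.

```shell
# idle_headroom: hypothetical helper.
# Classifies an idle fraction (0..1) against the 0.3 and 0.5 thresholds.
idle_headroom() {
    awk -v v="$1" 'BEGIN {
        if (v < 0.3)      print "saturated"
        else if (v < 0.5) print "low-headroom"
        else              print "ok"
    }'
}

# Illustrative peak-hour sample.
idle_headroom 0.42
```

The same function applies to both RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent, since the guidance uses identical thresholds for each pool.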
How Netdata helps
- Correlates the five TotalTimeMs components per broker with per-disk I/O latency, NIC utilization, and CPU context switches in the same time slice.
- Surfaces RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent alongside request queue sizes so you can distinguish I/O thread saturation from network thread saturation.
- Tracks UnderReplicatedPartitions and IsrShrinksPerSec on the same dashboard as produce latency, linking RemoteTimeMs spikes to specific follower brokers.
- Alerts when produce latency p99 approaches request.timeout.ms without requiring manual JMX sampling.