Kafka produce request latency high: reading the TotalTimeMs breakdown

Producers are timing out or retrying. Client-side request-latency-avg is elevated and broker TotalTimeMs p99 is spiking. The total alone is not actionable: Kafka breaks it into five sub-components, each implicating a different subsystem. You need all five, via JMX or an equivalent metrics collector, to know whether to fix a disk, scale a thread pool, or replace a follower.

What this means

TotalTimeMs is the wall-clock time from when the broker’s network thread receives a produce request until the response is fully sent. For each request it is the sum of five phases (plus ThrottleTimeMs, which stays at zero unless a quota is throttling the client):

  1. RequestQueueTimeMs - time queued before an I/O handler thread picks it up.
  2. LocalTimeMs - time the leader spends appending to the local log (disk write path, index updates).
  3. RemoteTimeMs - time waiting for follower acknowledgments. Non-zero only when producers use acks=all (or -1).
  4. ResponseQueueTimeMs - time the completed response waits for a network processor thread.
  5. ResponseSendTimeMs - time to serialize and write the response bytes onto the socket.

The MBean path for each is kafka.network:type=RequestMetrics,name=<Metric>,request=Produce. All expose 50thPercentile, 95thPercentile, 99thPercentile, and Mean.

When TotalTimeMs p99 approaches the producer’s request.timeout.ms (default 30 seconds), clients begin emitting TimeoutException and retrying. Those retries increase broker load and push latency higher. Escalate when p99 exceeds roughly 75 percent of request.timeout.ms.
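The 75 percent rule above can be sketched as a small shell check. This is a minimal illustration, not part of Kafka: `should_escalate` is a hypothetical helper name, and the p99 value is assumed to have been read already (for example from JMX). The timeout defaults to the producer default of 30000 ms mentioned above.

```shell
# Hypothetical helper: prints "escalate" when p99 exceeds 75 percent of
# request.timeout.ms, otherwise "ok".
# Args: <p99_ms> [request_timeout_ms, default 30000]
should_escalate() {
  local p99_ms="$1"
  local timeout_ms="${2:-30000}"
  # awk handles fractional p99 values that shell integer arithmetic cannot.
  awk -v p99="$p99_ms" -v t="$timeout_ms" 'BEGIN {
    if (p99 > 0.75 * t) print "escalate"; else print "ok"
  }'
}

# Sample values are made up for illustration.
should_escalate 24100.5 30000   # above the 22500 ms threshold
should_escalate 1800            # uses the 30000 ms default
```

If producers override request.timeout.ms, pass their value as the second argument rather than relying on the default.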

flowchart TD
    A[TotalTimeMs high] --> B{Which component?}
    B -->|RequestQueueTimeMs| C[I/O thread saturation]
    B -->|LocalTimeMs| D{Disk slow or CPU?}
    D --> E[Check disk await]
    B -->|RemoteTimeMs| F[Slow followers]
    B -->|ResponseQueueTimeMs| G[Network thread backlog]
    B -->|ResponseSendTimeMs| H[Network congestion or slow client]
    C --> I[Check RequestHandlerAvgIdlePercent]
    F --> J[Check UnderReplicatedPartitions]
    G --> K[Check NetworkProcessorAvgIdlePercent]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| I/O thread saturation | RequestQueueTimeMs rises while LocalTimeMs stays flat | RequestHandlerAvgIdlePercent and RequestQueueSize |
| Disk I/O pressure | LocalTimeMs spikes, often with high await | iostat -xz 1 and pgmajfault rate |
| Slow followers (acks=all) | RemoteTimeMs is the dominant component | UnderReplicatedPartitions and IsrShrinksPerSec on the leader |
| Network thread backlog | ResponseQueueTimeMs grows; ResponseSendTimeMs normal | NetworkProcessorAvgIdlePercent and ResponseQueueSize |
| Network congestion or slow client | ResponseSendTimeMs high in isolation | NIC utilization and producer-side record-retry-rate |

Quick checks

Run these read-only checks from a broker host or via JMX. None are destructive.

# Check total produce latency p99
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -n -l localhost:9999

# Check the five sub-component p99 values
for metric in RequestQueueTimeMs LocalTimeMs RemoteTimeMs ResponseQueueTimeMs ResponseSendTimeMs; do
  echo "$metric:"
  echo "get -b kafka.network:type=RequestMetrics,name=$metric,request=Produce 99thPercentile" | java -jar jmxterm.jar -n -l localhost:9999
done

# Check I/O thread saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -n -l localhost:9999

# Check network thread saturation
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -n -l localhost:9999

# Check request queue depth
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -n -l localhost:9999

# List under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check disk latency
iostat -xz 1

How to diagnose it

  1. Establish the total and the baseline. Compare TotalTimeMs p99 against the producer’s request.timeout.ms. If the total is normal but clients still time out, investigate the network or client side.

  2. Read the breakdown. Identify which sub-component dominates the increase. One component is usually responsible for most of the delta.

  3. If RequestQueueTimeMs is high: The I/O threads cannot keep up. Check RequestHandlerAvgIdlePercent. Sustained values below 0.3 indicate severe saturation. Check RequestQueueSize; sustained values above 250 (half the default queued.max.requests of 500) confirm the bottleneck is between the network and I/O layers.

  4. If LocalTimeMs is high: The leader is slow to append to its local log. Check disk I/O latency with iostat -xz 1. await above 20 ms for SSDs or 50 ms for HDDs is abnormal. Also check for page cache pressure via pgmajfault in /proc/vmstat. If disk metrics are healthy but CPU is saturated, the bottleneck is CPU, not disk.

  5. If RemoteTimeMs is high: This applies only when producers use acks=all. The leader is waiting for followers to acknowledge. Check UnderReplicatedPartitions on this broker. Cross-reference with IsrShrinksPerSec to see if followers are actively falling out of sync. Check FetchFollower latency on the leader to confirm followers are being served slowly, then inspect the follower brokers’ disk and network metrics.

  6. If ResponseQueueTimeMs is high: The network threads are slow to drain completed responses. Check NetworkProcessorAvgIdlePercent; sustained values below 0.3 indicate network thread saturation. Check ResponseQueueSize for confirmation.

  7. If ResponseSendTimeMs is high: Produce responses are small; a spike here usually means the TCP send buffer is backing up. Check /proc/net/dev for NIC saturation. If the NIC is not saturated, the client is slow to read responses.

  8. Check for retry cascades. If BytesInPerSec is rising while MessagesInPerSec is flat, producers are likely retrying the same messages. This creates a positive feedback loop that inflates all components.
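
Step 2 above (finding the component responsible for most of the delta) can be sketched with awk. This is a rough illustration under stated assumptions: `dominant_component` is a hypothetical helper, the input format of one `<metric> <current_p99_ms> <baseline_p99_ms>` line per component is invented for this example, and the sample numbers are made up. Because percentiles are not additive, the share it prints is a ranking aid rather than an exact decomposition of TotalTimeMs p99.

```shell
# Hypothetical helper: reads "<metric> <current_p99_ms> <baseline_p99_ms>"
# lines on stdin and prints the metric with the largest increase, plus its
# share of the summed increases across all components.
dominant_component() {
  awk '
    NF == 3 {
      delta = $2 - $3
      if (delta > 0) total += delta
      if (best == "" || delta > max) { max = delta; best = $1 }
    }
    END {
      if (best == "" || max <= 0) { print "no component above baseline"; exit }
      printf "%s %.0f%%\n", best, 100 * max / total
    }'
}

# Sample values are made up for illustration.
dominant_component <<'EOF'
RequestQueueTimeMs 6 4
LocalTimeMs 15 10
RemoteTimeMs 900 40
ResponseQueueTimeMs 3 2
ResponseSendTimeMs 2 2
EOF
```

With these sample values RemoteTimeMs accounts for nearly all of the increase, which would point to step 5.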

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| TotalTimeMs p99 (Produce) | End-to-end broker-side latency | Sustained >2-3x baseline or approaching request.timeout.ms |
| RequestQueueTimeMs | Pressure between network and I/O threads | Sustained elevation above baseline |
| LocalTimeMs | Leader write-path health | Spike correlating with disk await or page faults |
| RemoteTimeMs | Replication wait time for acks=all | Nonzero and spiking for acks=all topics |
| ResponseQueueTimeMs | Network thread backlog | Growing while ResponseSendTimeMs remains flat |
| ResponseSendTimeMs | Wire-time health | High in isolation |
| RequestHandlerAvgIdlePercent | I/O thread pool headroom | Sustained below 0.3 |
| NetworkProcessorAvgIdlePercent | Network thread pool headroom | Sustained below 0.3 |
| RequestQueueSize | Queue depth between thread pools | Consistently above 250 |
| UnderReplicatedPartitions | Follower replication health | Nonzero sustained outside maintenance |

Fixes

I/O thread saturation (RequestQueueTimeMs high)

Increase num.io.threads only if the broker has available CPU and the disk is not the bottleneck. Adding threads to a disk-bound broker increases contention. If the broker is already at capacity, shed load by reassigning partitions or adding brokers. Check for abnormally large produce batches that inflate per-request CPU.
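
One way to apply a thread pool change without a restart is a dynamic broker config update through kafka-configs.sh. The sketch below assumes a broker version that supports dynamic thread pool updates and a cluster-wide default; `print_thread_resize_cmd` is a hypothetical helper, and the value 16 and the bootstrap address are placeholders. It only prints the command (a dry run) so it can be reviewed before anyone runs it.

```shell
# Hypothetical helper: prints (does not run) a kafka-configs.sh command that
# sets a thread pool size as a cluster-wide dynamic default.
# Args: <config> <value> [bootstrap, default localhost:9092]
print_thread_resize_cmd() {
  local config="$1" value="$2" bootstrap="${3:-localhost:9092}"
  case "$config" in
    num.io.threads|num.network.threads) ;;
    *) echo "unsupported config: $config" >&2; return 1 ;;
  esac
  echo "kafka-configs.sh --bootstrap-server $bootstrap --entity-type brokers --entity-default --alter --add-config $config=$value"
}

# Placeholder value; size the pool against available CPU first.
print_thread_resize_cmd num.io.threads 16
```

Printing instead of executing keeps the change reviewable; re-check RequestHandlerAvgIdlePercent after applying it to confirm the extra threads actually helped.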

Disk pressure (LocalTimeMs high)

Identify whether the issue is a single slow disk (one log.dirs entry degrading while others are healthy) or global saturation. For a failing disk, remove the broker via controlled shutdown and replace the hardware. For global saturation, reduce retention, add brokers, or move high-volume topics.
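
The single-disk versus global distinction can be sketched as a comparison of per-device await values. This is an illustration only: `classify_disks` is a hypothetical helper, the `<device> <await_ms>` input format is an assumption (you would extract those columns from iostat yourself), and mapping devices back to log.dirs entries is left to the operator. The 20 ms default matches the SSD guideline in the diagnosis steps.

```shell
# Hypothetical helper: reads "<device> <await_ms>" lines on stdin.
# Args: [await threshold in ms, default 20]
classify_disks() {
  awk -v limit="${1:-20}" '
    NF == 2 {
      n++
      if ($2 > limit) { slow++; names = names (names ? "," : "") $1 }
    }
    END {
      if (n == 0)                  print "no data"
      else if (slow == 0)          print "healthy"
      else if (slow == n && n > 1) print "global saturation"
      else                         print "slow disk(s): " names
    }'
}

# Sample values are made up for illustration.
classify_disks 20 <<'EOF'
sda 3.1
sdb 48.7
sdc 2.9
EOF
```

For HDD-backed brokers, pass 50 as the threshold instead.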

Slow followers (RemoteTimeMs high)

The follower is the problem, not the leader. Identify the lagging follower by checking which broker is missing from, or repeatedly dropping out of, the ISR of the under-replicated partitions. Check that broker’s disk I/O and network. If the follower cannot keep up, a controlled shutdown will trigger clean leader elections and remove it from the replication path. If you cannot fix the follower immediately and durability requirements allow, consider whether the topic truly needs acks=all.

Network thread saturation (ResponseQueueTimeMs high)

Increase num.network.threads. The default of 3 is often too low for TLS-terminated traffic or high connection counts. If TLS handshakes dominate, consider offloading TLS or increasing threads significantly. Also check for slow clients that do not read responses promptly, causing backpressure.

Network congestion (ResponseSendTimeMs high)

Check for NIC saturation, packet loss, or cross-AZ bandwidth limits. If the client is slow, fix it on the client side. When large fetch responses share the same NIC, reduce the pressure from consumers by reviewing max.partition.fetch.bytes.

Prevention

  • Collect and baseline all five sub-components. Operators often alert only on TotalTimeMs. Baseline each component separately so you can detect shifts before the total breaches threshold.
  • Maintain thread pool headroom. Keep RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent above 0.5 during peak. Below 0.3, you have no buffer for failure-induced load shifts.
  • Correlate broker latency with OS signals. Disk await, pgmajfault, and NIC utilization explain component-level spikes that Kafka metrics only summarize.
  • Test recovery before incidents. Gracefully shut down one broker in staging and measure how LocalTimeMs and RemoteTimeMs shift across the remaining brokers. Know your headroom.
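
The headroom thresholds above (0.5 and 0.3) can be turned into a simple classifier for alerting or capacity reviews. This is a minimal sketch: `headroom_status` is a hypothetical helper, and it assumes the idle ratio has already been read as a fraction between 0 and 1.

```shell
# Hypothetical helper: classify an idle ratio (0..1) from
# RequestHandlerAvgIdlePercent or NetworkProcessorAvgIdlePercent.
headroom_status() {
  awk -v idle="$1" 'BEGIN {
    if (idle < 0.3)      print "saturated"
    else if (idle < 0.5) print "low headroom"
    else                 print "ok"
  }'
}

# Sample values are made up for illustration.
headroom_status 0.72
headroom_status 0.41
headroom_status 0.18
```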

How Netdata helps

  • Correlates the five TotalTimeMs components per broker with per-disk I/O latency, NIC utilization, and CPU context switches in the same time slice.
  • Surfaces RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent alongside request queue sizes so you can distinguish I/O thread saturation from network thread saturation.
  • Tracks UnderReplicatedPartitions and IsrShrinksPerSec on the same dashboard as produce latency, linking RemoteTimeMs spikes to specific follower brokers.
  • Alerts when produce latency p99 approaches request.timeout.ms without requiring manual JMX sampling.

Related guides

  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka enable.auto.commit data loss: committed offsets that outrun processing: /guides/kafka/kafka-auto-commit-silent-data-loss/
  • Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
  • Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
  • Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
  • Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
  • Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
  • Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
  • Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
  • Kafka KRaft metadata log lag: standby controllers and brokers falling behind: /guides/kafka/kafka-kraft-metadata-log-lag/
  • Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata: /guides/kafka/kafka-kraft-quorum-no-leader/
  • Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections: /guides/kafka/kafka-leader-election-rate-high/