Kafka purgatory size growing: delayed produce and fetch operations

JMX shows kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize climbing on one or more brokers. Produce purgatory holds acks=all requests waiting for ISR completion. Fetch purgatory holds consumer and follower requests waiting for fetch.min.bytes. A growing queue means requests spend more time inside the broker than clients expected. Producers time out and retry. Consumers sit idle. The cluster is not dead, but it is backing up at a precise choke point. This guide distinguishes normal long-polling from replication crisis.

What this means

Kafka parks delayed operations in a structure called the purgatory, which uses a timer wheel to expire them. Two purgatories matter on the data plane. The produce purgatory holds produce requests that specified acks=all. Each request stays until the partition leader receives acknowledgment from every in-sync replica, or until the request times out. The fetch purgatory holds consumer and follower fetch requests that cannot be satisfied immediately because fetch.min.bytes of data is not yet available (followers use replica.fetch.min.bytes). These wait up to fetch.max.wait.ms (replica.fetch.wait.max.ms for followers) before the broker returns whatever data it has, which may be nothing.

High fetch purgatory is usually normal. Consumers on low-volume topics with fetch.max.wait.ms at its default of 500ms keep a request in purgatory by design. Growing produce purgatory, by contrast, is a leading indicator of replication trouble. If producers use acks=1 or acks=0, produce purgatory should stay near zero; a high value usually means the metric is being misread, that some producers use acks=all after all (the producer default since Kafka 3.0), or, less likely, that broker state is corrupted.

flowchart TD
    A[Purgatory size growing] --> B{Which purgatory?}
    B -->|Produce| C[Are producers using acks=all?]
    B -->|Fetch| D[Is topic volume low?]
    C -->|No| E[Metric should be near zero]
    C -->|Yes| F[Check UnderReplicatedPartitions]
    F -->|Rising| G[Follower replication lag]
    F -->|Zero| H[Check RemoteTimeMs and request queues]
    D -->|Yes| I[Normal long-poll behavior]
    D -->|No| J[Check FetchConsumer LocalTimeMs]
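
The same decision tree can be written as a small shell function for use in a runbook script. This is only a sketch: the function name triage_purgatory and its argument words (produce, fetch, yes, no, rising, zero) are hypothetical, and the printed strings simply restate the leaf nodes of the flowchart.

```shell
#!/usr/bin/env bash
# triage_purgatory KIND ANSWER [URP_TREND]   (hypothetical helper)
#   KIND       produce | fetch
#   ANSWER     yes | no  (produce: are producers using acks=all? fetch: is topic volume low?)
#   URP_TREND  rising | zero  (only consulted on the produce/yes branch)
triage_purgatory() {
  local kind="$1" answer="$2" urp="${3:-zero}"
  case "$kind:$answer" in
    produce:no)  echo "Metric should be near zero" ;;
    produce:yes)
      if [ "$urp" = "rising" ]; then
        echo "Follower replication lag"
      else
        echo "Check RemoteTimeMs and request queues"
      fi ;;
    fetch:yes)   echo "Normal long-poll behavior" ;;
    fetch:no)    echo "Check FetchConsumer LocalTimeMs" ;;
    *)           echo "usage: triage_purgatory produce|fetch yes|no [rising|zero]" >&2
                 return 1 ;;
  esac
}
```

For example, triage_purgatory produce yes rising prints the replication-lag branch.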

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Follower replication lag | Produce purgatory rising; UnderReplicatedPartitions nonzero; produce RemoteTimeMs elevated | UnderReplicatedPartitions aggregated across all brokers |
| Normal consumer long-polling | Fetch purgatory stable and proportional to consumer count; low-volume topics; no fetch errors | Topic throughput and consumer fetch.max.wait.ms |
| Producer timeout cascade | BytesInPerSec rises while MessagesInPerSec does not; FailedProduceRequestsPerSec climbing; produce purgatory spiking | RequestHandlerAvgIdlePercent and producer retry rate |
| ISR flapping | IsrShrinksPerSec and IsrExpandsPerSec both nonzero; produce purgatory oscillating | Follower GC pause duration and disk latency |

Quick checks

# Produce and fetch purgatory sizes via JMX
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch Value" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Produce request latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Broker processing capacity
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Request queue depth between network and I/O threads
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Failed produce rate for visible producer impact
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
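
The checks above repeat the same jmxterm invocation, so a thin wrapper can cut down on copy-paste errors. The sketch below assumes the same jmxterm.jar path and localhost:9999 JMX address used in the commands above; the function names and the JMXTERM_JAR and JMX_ADDR variables are hypothetical. Loading the block only defines functions, it does not contact a broker.

```shell
#!/usr/bin/env bash
# Assumed defaults, taken from the commands above; override through the environment.
JMXTERM_JAR="${JMXTERM_JAR:-jmxterm.jar}"
JMX_ADDR="${JMX_ADDR:-localhost:9999}"

# Build the jmxterm "get" command for one MBean attribute.
jmx_get_cmd() {
  printf 'get -b %s %s\n' "$1" "$2"
}

# Send the command to the broker. Needs java and jmxterm.jar; not invoked here.
jmx_get() {
  jmx_get_cmd "$1" "$2" | java -jar "$JMXTERM_JAR" -l "$JMX_ADDR"
}

# MBean name of the purgatory size for one delayed operation type (Produce or Fetch).
purgatory_bean() {
  printf 'kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=%s\n' "$1"
}
```

With these loaded, jmx_get "$(purgatory_bean Produce)" Value is equivalent to the first command in this section.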

How to diagnose it

  1. Identify which purgatory is growing. The JMX MBeans are distinct: delayedOperation=Produce and delayedOperation=Fetch. Do not aggregate them.
  2. For produce purgatory, confirm producers use acks=all. If they use acks=1 or acks=0, the metric should be near zero.
  3. Correlate produce purgatory with UnderReplicatedPartitions. If URP is rising, followers are slow. Cross-reference URP across all brokers to find the common lagging follower.
  4. On the leader broker reporting URP, check RemoteTimeMs in the produce latency breakdown. This measures how long the leader waits for follower acks. If RemoteTimeMs dominates, the problem is replication, not local disk.
  5. Inspect the lagging follower for disk I/O latency (await from iostat) and JVM GC pauses. A follower with await above 20ms on SSD, or Full GC pauses above 200ms, will fall behind and keep produce requests in purgatory.
  6. For fetch purgatory, check whether the affected topics are low-volume. If consumers have fetch.max.wait.ms set to 500ms and there is little data, each consumer connection will hold a request in purgatory for the wait duration. This is expected.
  7. If fetch purgatory grows on high-volume topics, check FetchConsumer LocalTimeMs. A spike here means reads are hitting disk instead of page cache, which slows fetch responses and keeps requests in purgatory longer.
  8. Check FailedFetchRequestsPerSec and consumer group lag. If consumers are erroring out or lag is growing while fetch purgatory is high, the broker read path is the bottleneck. If consumers are healthy, the purgatory size is likely benign.
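
Step 4 comes down to asking which component of the produce latency breakdown is largest. As a minimal sketch, assuming you have already collected the p99 values into "MetricName value" lines (the collection step and the function name dominant_phase are not from this guide), the comparison can be done with awk:

```shell
#!/usr/bin/env bash
# Read "MetricName value" pairs on stdin and print the name with the largest value.
# TotalTimeMs is skipped because it is the sum of the other components.
dominant_phase() {
  awk '
    $1 == "TotalTimeMs" { next }
    !seen || $2 + 0 > max { max = $2 + 0; name = $1; seen = 1 }
    END { if (seen) print name }
  '
}
```

If the function prints RemoteTimeMs, the leader is mostly waiting on followers; if it prints LocalTimeMs, look at the leader's own disk first.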

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| PurgatorySize (Produce) | Count of acks=all produce requests waiting for ISR acks | Sustained growth above 2x baseline for more than 5 minutes |
| PurgatorySize (Fetch) | Count of fetch requests in long-poll wait | Unbounded growth on high-volume topics; sudden spikes above consumer count |
| UnderReplicatedPartitions | The leading indicator for follower lag | Nonzero outside rolling restarts or reassignment |
| RemoteTimeMs (Produce) | Time leader spends waiting for followers | p99 above baseline or approaching producer request.timeout.ms |
| RequestHandlerAvgIdlePercent | Broker I/O thread headroom | Sustained below 0.3 |
| IsrShrinksPerSec / IsrExpandsPerSec | Velocity of ISR changes | Sustained nonzero shrink rate, or simultaneous shrinks and expands |
| FailedProduceRequestsPerSec | Direct measure of producer-visible failures | Sustained nonzero rate |
| FetchConsumer LocalTimeMs | Read path latency from broker to consumer | Spike from near-zero to disk latency levels |
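
The produce purgatory warning sign (above 2x baseline for more than 5 minutes) can be evaluated mechanically over a series of samples. The sketch below assumes one PurgatorySize sample per line on stdin, taken once per minute, and a baseline you have measured yourself; the function name sustained_growth and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# sustained_growth BASELINE FACTOR NEEDED
# Exit 0 if the most recent NEEDED consecutive samples all exceed BASELINE * FACTOR,
# otherwise exit 1. One numeric sample per line on stdin, oldest first.
sustained_growth() {
  local baseline="$1" factor="$2" needed="$3"
  awk -v b="$baseline" -v f="$factor" -v needed="$needed" '
    { if ($1 + 0 > b * f) run++; else run = 0 }   # a single dip resets the streak
    END { if (run >= needed) exit 0; exit 1 }
  '
}
```

With one-minute samples, sustained_growth 100 2 5 corresponds to "above 2x a baseline of 100 for 5 minutes". Requiring consecutive samples keeps a single spike from firing the check.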

Fixes

Follower replication lag

The root cause is usually a degraded follower. Use kafka-topics.sh --describe --under-replicated-partitions to identify which partitions are affected, then aggregate by follower broker to find the common target. On that follower, check disk await with iostat -xz 1. If await is elevated above 20ms for SSDs or 50ms for HDDs, the disk is the bottleneck. If the broker is containerized, check for CPU throttling or memory pressure causing GC storms.
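
Aggregating by follower can be done by comparing the Replicas and Isr fields on each line of the describe output. The sketch below assumes the usual "Replicas: 1,2,3" and "Isr: 1,3" field layout printed by kafka-topics.sh; the function name lagging_followers is hypothetical. It prints one "broker_id count" pair per line, with the most frequently missing broker first.

```shell
#!/usr/bin/env bash
# Count, per broker id, how many partitions list it in Replicas but not in Isr.
# Reads kafka-topics.sh --describe output on stdin.
lagging_followers() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      if (replicas == "") next            # not a partition line
      split("", in_isr)                   # portable way to clear the array
      n = split(isr, members, ",")
      for (j = 1; j <= n; j++) in_isr[members[j]] = 1
      m = split(replicas, assigned, ",")
      for (j = 1; j <= m; j++)
        if (!(assigned[j] in in_isr)) missing[assigned[j]]++
    }
    END { for (b in missing) print b, missing[b] }
  ' | sort -k2,2nr -k1,1n
}
```

A typical invocation would pipe the command above into it: kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions | lagging_followers. A single broker at the top of the list with a much higher count than the rest is the likely common lagging follower.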

Warning: If the broker is clearly degraded, initiate a controlled shutdown to let the controller elect clean leaders and allow replicas to catch up on healthy brokers. This reduces availability for the affected partitions until replication completes. Do not restart additional brokers during an active replication lag event; this generates more controller work and can expand the blast radius.

Normal long-polling behavior

If fetch purgatory is high because consumers are waiting on low-volume topics, no broker fix is required. If the behavior causes client-side timeouts, reduce fetch.min.bytes or fetch.max.wait.ms so the broker responds sooner. Raising fetch.max.wait.ms has the opposite effect: fewer fetch round-trips at the cost of higher apparent latency.

Producer timeout cascade

When produce purgatory growth triggers producer timeouts, retrying producers can overload the broker further. Break the loop by throttling the affected producers with Kafka client quotas (producer_byte_rate). Then identify the slow broker and consider removing it from the cluster. Warning: Removing a broker reduces cluster capacity and triggers partition reassignment. Only do this after confirming the broker is unrecoverable. Once the retry storm subsides and RequestHandlerAvgIdlePercent recovers, remove the quotas to restore normal throughput.
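
Producer quotas are applied with kafka-configs.sh using the producer_byte_rate setting. The sketch below only prints the command so it can be reviewed before anyone runs it during an incident. The function name quota_cmd, the BOOTSTRAP variable, and the client id in the example are hypothetical, and the byte rate is a placeholder to be chosen for your workload.

```shell
#!/usr/bin/env bash
# Assumed bootstrap address, matching the commands elsewhere in this guide.
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# quota_cmd set CLIENT_ID BYTES_PER_SEC   print the command that throttles one client id
# quota_cmd clear CLIENT_ID               print the command that removes the throttle
quota_cmd() {
  local action="$1" client_id="$2" rate="${3:-}"
  local base="kafka-configs.sh --bootstrap-server $BOOTSTRAP --alter --entity-type clients --entity-name $client_id"
  case "$action" in
    set)
      [ -n "$rate" ] || { echo "quota_cmd set needs BYTES_PER_SEC" >&2; return 1; }
      echo "$base --add-config producer_byte_rate=$rate" ;;
    clear)
      echo "$base --delete-config producer_byte_rate" ;;
    *)
      echo "usage: quota_cmd set|clear CLIENT_ID [BYTES_PER_SEC]" >&2
      return 1 ;;
  esac
}
```

For example, quota_cmd set batch-loader 1048576 prints a command that limits the hypothetical client id batch-loader to 1 MiB/s, and quota_cmd clear batch-loader prints the command that removes the limit afterwards.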

ISR flapping

If IsrShrinksPerSec and IsrExpandsPerSec are both elevated, a follower is intermittently falling behind and catching up. This is often caused by periodic GC pauses or bursty traffic. Check the follower’s GC logs for Young GC pauses exceeding 200ms or any Full GC events. If GC is clean, check for network packet loss or intermittent disk latency. Fix the underlying intermittent issue; do not simply increase replica.lag.time.max.ms to mask it, as this delays detection of real replication problems.

Prevention

  • Set min.insync.replicas=2 on topics with replication.factor=3 so that acks=all provides real durability. With the default min.insync.replicas=1, the leader can acknowledge with zero followers in the ISR, making the produce purgatory effectively useless as a durability signal.
  • Monitor UnderReplicatedPartitions and IsrShrinksPerSec as leading indicators. Do not wait for purgatory size to grow before acting.
  • Maintain RequestHandlerAvgIdlePercent above 0.5 during peak load. This leaves headroom for follower fetch storms and partition reassignment.
  • For compacted topics, review the log.cleaner.min.cleanable.dirty.ratio setting and monitor log cleaner thread health.
  • Do not disable alerts during rolling restarts. Transient ISR shrinks are expected, but if purgatory size does not recover within one to two times replica.lag.time.max.ms after the restart completes, the broker is not catching up.

How Netdata helps

  • Surfaces PurgatorySize for Produce and Fetch alongside UnderReplicatedPartitions and produce RemoteTimeMs on the same charts, so you can see replication lag and purgatory growth in one view.
  • Correlates purgatory spikes with per-broker RequestHandlerAvgIdlePercent and OS disk latency to distinguish between a slow follower and local I/O saturation.
  • Tracks FetchConsumer latency breakdown and consumer lag against fetch purgatory size, helping you separate normal long-polling from read-path degradation.
  • Supports composite alerts that fire only when produce purgatory grows while UnderReplicatedPartitions is nonzero, reducing noise from low-volume test topics that naturally keep fetch requests in purgatory.

Related guides

  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka enable.auto.commit data loss: committed offsets that outrun processing: /guides/kafka/kafka-auto-commit-silent-data-loss/
  • Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
  • Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
  • Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
  • Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
  • Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
  • Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
  • Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses: /guides/kafka/kafka-fetch-request-latency-high/
  • Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
  • Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap: /guides/kafka/kafka-jvm-heap-full-gc-pauses/
  • Kafka KRaft metadata log lag: standby controllers and brokers falling behind: /guides/kafka/kafka-kraft-metadata-log-lag/