Kafka purgatory size growing: delayed produce and fetch operations
JMX shows kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize climbing on one or more brokers. Produce purgatory holds acks=all requests waiting for ISR completion. Fetch purgatory holds consumer and follower requests waiting for fetch.min.bytes. A growing queue means requests spend more time inside the broker than clients expected. Producers time out and retry. Consumers sit idle. The cluster is not dead, but it is backing up at a precise choke point. This guide distinguishes normal long-polling from replication crisis.
What this means
Kafka parks delayed operations in a structure called the purgatory, which tracks their timeouts with a hierarchical timing wheel. Two purgatories matter on the data plane. The produce purgatory holds produce requests that specified acks=all. Each request stays until every in-sync replica has replicated the batch, or until the request times out. The fetch purgatory holds consumer and follower fetch requests that cannot be satisfied immediately because fetch.min.bytes of data is not yet available. These wait up to fetch.max.wait.ms before the broker responds with whatever data it has.
High fetch purgatory is usually normal. Consumers on low-volume topics with fetch.max.wait.ms set to the default 500ms keep a request in purgatory by design. Growing produce purgatory, by contrast, is a leading indicator of replication trouble. If producers use acks=1 or acks=0, produce purgatory should stay near zero; a high value suggests metric misinterpretation or broker state corruption.
flowchart TD
A[Purgatory size growing] --> B{Which purgatory?}
B -->|Produce| C[Are producers using acks=all?]
B -->|Fetch| D[Is topic volume low?]
C -->|No| E[Metric should be near zero]
C -->|Yes| F[Check UnderReplicatedPartitions]
F -->|Rising| G[Follower replication lag]
F -->|Zero| H[Check RemoteTimeMs and request queues]
D -->|Yes| I[Normal long-poll behavior]
D -->|No| J[Check FetchConsumer LocalTimeMs]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Follower replication lag | Produce purgatory rising; UnderReplicatedPartitions nonzero; produce RemoteTimeMs elevated | UnderReplicatedPartitions aggregated across all brokers |
| Normal consumer long-polling | Fetch purgatory stable and proportional to consumer count; low-volume topics; no fetch errors | Topic throughput and consumer fetch.max.wait.ms |
| Producer timeout cascade | BytesInPerSec rises while MessagesInPerSec does not; FailedProduceRequestsPerSec climbing; produce purgatory spiking | RequestHandlerAvgIdlePercent and producer retry rate |
| ISR flapping | IsrShrinksPerSec and IsrExpandsPerSec both nonzero; produce purgatory oscillating | Follower GC pause duration and disk latency |
Quick checks
# Produce and fetch purgatory sizes via JMX
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce Value" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch Value" | java -jar jmxterm.jar -l localhost:9999
# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Produce request latency breakdown
echo "get -b kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Broker processing capacity
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Request queue depth between network and I/O threads
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Failed produce rate for visible producer impact
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
How to diagnose it
- Identify which purgatory is growing. The JMX MBeans are distinct: delayedOperation=Produce and delayedOperation=Fetch. Do not aggregate them.
- For produce purgatory, confirm producers use acks=all. If they use acks=1 or acks=0, the metric should be near zero.
- Correlate produce purgatory with UnderReplicatedPartitions. If URP is rising, followers are slow. Cross-reference URP across all brokers to find the common lagging follower.
- On the leader broker reporting URP, check RemoteTimeMs in the produce latency breakdown. This measures how long the leader waits for follower acks. If RemoteTimeMs dominates, the problem is replication, not local disk.
- Inspect the lagging follower for disk I/O latency (await from iostat) and JVM GC pauses. A follower with await above 20ms on SSD, or Full GC pauses above 200ms, will fall behind and keep produce requests in purgatory.
- For fetch purgatory, check whether the affected topics are low-volume. If consumers have fetch.max.wait.ms set to 500ms and there is little data, each consumer connection will hold a request in purgatory for the wait duration. This is expected.
- If fetch purgatory grows on high-volume topics, check FetchConsumer LocalTimeMs. A spike here means reads are hitting disk instead of page cache, which slows fetch responses and keeps requests in purgatory longer.
- Check FailedFetchRequestsPerSec and consumer group lag. If consumers are erroring out or lag is growing while fetch purgatory is high, the broker read path is the bottleneck. If consumers are healthy, the purgatory size is likely benign.
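The first decisions in the steps above can be sketched as a small shell function. This is an illustrative helper, not part of Kafka: the name classify_purgatory, its arguments, and the labels it prints are all assumptions, and the caller is expected to supply values it has already read from JMX and client configuration.

```shell
# Hypothetical triage helper mirroring the diagnosis steps above.
# Arguments (all supplied by the caller):
#   $1 purgatory type: produce | fetch
#   $2 producer acks setting: all | -1 | 1 | 0   (ignored for fetch)
#   $3 UnderReplicatedPartitions count            (ignored for fetch)
#   $4 topic volume: low | high                   (ignored for produce)
classify_purgatory() {
  case "$1" in
    produce)
      # acks=-1 is the numeric form of acks=all
      if [ "$2" != "all" ] && [ "$2" != "-1" ]; then
        echo "metric-unexpected"        # should be near zero without acks=all
      elif [ "${3:-0}" -gt 0 ]; then
        echo "replication-lag"          # inspect the lagging follower
      else
        echo "check-remote-time"        # look at RemoteTimeMs and request queues
      fi
      ;;
    fetch)
      if [ "$4" = "low" ]; then
        echo "normal-long-poll"         # expected on low-volume topics
      else
        echo "check-fetch-local-time"   # look at FetchConsumer LocalTimeMs
      fi
      ;;
    *)
      echo "usage: classify_purgatory produce|fetch ACKS URP VOLUME" >&2
      return 2
      ;;
  esac
}
```

For example, `classify_purgatory produce all 3 high` prints replication-lag, which points at the follower checks rather than the leader's local disk.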
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| PurgatorySize (Produce) | Count of acks=all produce requests waiting for ISR acks | Sustained growth above 2x baseline for more than 5 minutes |
| PurgatorySize (Fetch) | Count of fetch requests in long-poll wait | Unbounded growth on high-volume topics; sudden spikes above consumer count |
| UnderReplicatedPartitions | The leading indicator for follower lag | Nonzero outside rolling restarts or reassignment |
| RemoteTimeMs (Produce) | Time leader spends waiting for followers | p99 above baseline or approaching producer request.timeout.ms |
| RequestHandlerAvgIdlePercent | Broker I/O thread headroom | Sustained below 0.3 |
| IsrShrinksPerSec / IsrExpandsPerSec | Velocity of ISR changes | Sustained nonzero shrink rate, or simultaneous shrinks and expands |
| FailedProduceRequestsPerSec | Direct measure of producer-visible failures | Sustained nonzero rate |
| FetchConsumer LocalTimeMs | Read path latency from broker to consumer | Spike from near-zero to disk latency levels |
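The produce purgatory warning sign in the table above ("sustained above 2x baseline") can be approximated with a short check over a series of samples. This is a sketch under assumptions: the function name sustained_above is hypothetical, samples arrive one per line on stdin at whatever interval you collect them, and you choose how many consecutive samples count as "sustained" (for example, enough samples to cover a little over five minutes at your interval).

```shell
# Hypothetical check: print ALERT if at least MIN_CONSECUTIVE samples in a row
# exceed twice the baseline, otherwise print OK.
# Usage: sustained_above BASELINE MIN_CONSECUTIVE < samples.txt
sustained_above() {
  awk -v base="$1" -v need="$2" '
    {
      # Extend the current run while the sample is above 2x baseline,
      # and reset it as soon as one sample drops back.
      if ($1 > 2 * base) run++; else run = 0
      if (run >= need) hit = 1
    }
    END { print (hit ? "ALERT" : "OK") }
  '
}
```

Resetting the run on any sample at or below the threshold keeps a single spike from firing the alert, which matches the intent of alerting on sustained growth rather than momentary bursts.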
Fixes
Follower replication lag
The root cause is usually a degraded follower. Use kafka-topics.sh --describe --under-replicated-partitions to identify which partitions are affected, then aggregate by follower broker to find the common target. On that follower, check disk await with iostat -xz 1. If await is elevated above 20ms for SSDs or 50ms for HDDs, the disk is the bottleneck. If the broker is containerized, check for CPU throttling or memory pressure causing GC storms.
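Aggregating by follower can be done by comparing the Replicas and Isr columns of the describe output. The sketch below assumes the usual per-partition line layout with "Replicas:" and "Isr:" labels followed by comma-separated broker ids; the function name count_lagging_followers is hypothetical, and the parsing may need adjusting if your Kafka version prints the fields differently.

```shell
# Hypothetical helper: count under-replicated partitions per out-of-sync broker.
# Reads `kafka-topics.sh --describe --under-replicated-partitions` output on
# stdin and prints "BROKER_ID COUNT", most affected broker first.
count_lagging_followers() {
  awk '
    {
      replicas = ""; isr = ""
      # Locate fields by label so extra columns do not break the parse.
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      if (replicas == "") next          # skip lines without partition detail
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      for (j = 1; j <= n; j++) {
        in_isr = 0
        for (k = 1; k <= m; k++) if (r[j] == s[k]) in_isr = 1
        if (!in_isr) lagging[r[j]]++    # assigned replica missing from the ISR
      }
    }
    END { for (b in lagging) print b, lagging[b] }
  ' | sort -k2,2nr -k1,1n
}
```

A typical invocation pipes the command from this section into it: `kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions | count_lagging_followers`. A single broker id dominating the output is the follower to inspect with iostat.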
Warning: If the broker is clearly degraded, initiate a controlled shutdown to let the controller elect clean leaders and allow replicas to catch up on healthy brokers. This reduces availability for the affected partitions until replication completes. Do not restart additional brokers during an active replication lag event; this generates more controller work and can expand the blast radius.
Normal long-polling behavior
If fetch purgatory is high because consumers are waiting on low-volume topics, no broker fix is required. If the behavior causes client-side timeouts, reduce fetch.min.bytes or fetch.max.wait.ms so the broker responds sooner. Raising fetch.max.wait.ms reduces fetch round-trips at the cost of higher apparent latency.
Producer timeout cascade
When produce purgatory growth triggers producer timeouts, retrying producers can overload the broker further. Break the loop by throttling the affected producers with Kafka quotas. Then identify the slow broker and consider removing it from the cluster. Warning: Removing a broker reduces cluster capacity and triggers partition reassignment. Only do this after confirming the broker is unrecoverable. Once the retry storm subsides and RequestHandlerAvgIdlePercent recovers, re-enable normal throughput.
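Throttling a producer with quotas is done through kafka-configs.sh. The sketch below only prints the commands so they can be reviewed before running; the function name quota_commands, the client id, and the byte rate are illustrative assumptions, and the quota applies per broker to the given client id.

```shell
# Hypothetical dry-run helper: print the kafka-configs.sh commands that set,
# and later remove, a produce byte-rate quota for one client id.
# Usage: quota_commands CLIENT_ID BYTES_PER_SEC [BOOTSTRAP_SERVER]
quota_commands() (
  client_id="$1"
  rate="$2"
  bootstrap="${3:-localhost:9092}"
  base="kafka-configs.sh --bootstrap-server $bootstrap --alter --entity-type clients --entity-name $client_id"
  # Apply the throttle while the retry storm is active
  echo "$base --add-config producer_byte_rate=$rate"
  # Remove it once RequestHandlerAvgIdlePercent has recovered
  echo "$base --delete-config producer_byte_rate"
)
```

For example, `quota_commands order-service 1048576` prints one command that caps the hypothetical order-service client at 1 MiB/s per broker and a second that lifts the cap afterwards.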
ISR flapping
If IsrShrinksPerSec and IsrExpandsPerSec are both elevated, a follower is intermittently falling behind and catching up. This is often caused by periodic GC pauses or bursty traffic. Check the follower’s GC logs for Young GC pauses exceeding 200ms or any Full GC events. If GC is clean, check for network packet loss or intermittent disk latency. Fix the underlying intermittent issue; do not simply increase replica.lag.time.max.ms to mask it, as this delays detection of real replication problems.
Prevention
- Set min.insync.replicas=2 on topics with replication.factor=3 so that acks=all provides real durability. Without this, the leader can acknowledge with zero followers in ISR, making the produce purgatory effectively useless as a durability signal.
- Monitor UnderReplicatedPartitions and IsrShrinksPerSec as leading indicators. Do not wait for purgatory size to grow before acting.
- Maintain RequestHandlerAvgIdlePercent above 0.5 during peak load. This leaves headroom for follower fetch storms and partition reassignment.
- For compacted topics, monitor log.cleaner.min.cleanable.dirty.ratio and log cleaner thread health.
- Do not disable alerts during rolling restarts. Transient ISR shrinks are expected, but if purgatory size does not recover within one to two times replica.lag.time.max.ms after the restart completes, the broker is not catching up.
How Netdata helps
- Surfaces PurgatorySize for Produce and Fetch alongside UnderReplicatedPartitions and produce RemoteTimeMs on the same charts, so you can see replication lag and purgatory growth in one view.
- Correlates purgatory spikes with per-broker RequestHandlerAvgIdlePercent and OS disk latency to distinguish between a slow follower and local I/O saturation.
- Tracks FetchConsumer latency breakdown and consumer lag against fetch purgatory size, helping you separate normal long-polling from read-path degradation.
- Supports composite alerts that fire only when produce purgatory grows while UnderReplicatedPartitions is nonzero, reducing noise from low-volume test topics that naturally keep fetch requests in purgatory.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka enable.auto.commit data loss: committed offsets that outrun processing: /guides/kafka/kafka-auto-commit-silent-data-loss/
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
- Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
- Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses: /guides/kafka/kafka-fetch-request-latency-high/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap: /guides/kafka/kafka-jvm-heap-full-gc-pauses/
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind: /guides/kafka/kafka-kraft-metadata-log-lag/







