Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline

IsrShrinksPerSec is climbing on your leaders and UnderReplicatedPartitions is no longer zero. If the ISR is flapping (shrinks followed by expands every few minutes) and nothing intervenes, the path can end with OfflinePartitionsCount rising and acks=all producers failing with NotEnoughReplicasException. This guide covers that path: how a lagging follower becomes a cluster-wide problem, how to separate flapping from one-way degradation, and how to stop the cascade before partitions go offline.

What this means

Kafka leaders maintain an In-Sync Replica set (ISR): the leader plus the followers that have caught up to the leader's log end within the last replica.lag.time.max.ms (default 30 seconds since Kafka 2.5.0; 10 seconds before 2.5.0). When a follower stops fetching or stays behind for longer than that window, the leader removes it from the ISR. IsrShrinksPerSec is the rate of these removals. IsrExpandsPerSec is the rate at which replicas catch up and rejoin.

A shrink without a matching expand means a replica fell out and stayed out. Sustained one-way shrinks degrade durability: if the leader dies, fewer copies exist, and with unclean.leader.election.enable=false (default since 0.11.0.0) the partition can become unavailable. When shrinks and expands happen together in tight loops, you have ISR flapping: an intermittent follower that keeps crossing the lag threshold.
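The distinction between flapping and one-way shrinking can be sketched as a small classifier over the two OneMinuteRate values. This is a minimal sketch: the function name `classify_isr` and the 0.01 noise threshold are assumptions for illustration, not Kafka defaults.

```shell
# classify_isr SHRINK_RATE EXPAND_RATE [THRESHOLD]
# Hypothetical helper: labels one leader broker from its two OneMinuteRate values.
# THRESHOLD (default 0.01 events/sec) is an assumed noise floor; tune it for your cluster.
classify_isr() {
  awk -v s="$1" -v e="$2" -v t="${3:-0.01}" 'BEGIN {
    if (s + 0 > t && e + 0 > t) print "flapping"        # replicas leave and rejoin
    else if (s + 0 > t)         print "one-way-shrink"  # replicas leave and stay out
    else if (e + 0 > t)         print "recovering"      # replicas are rejoining
    else                        print "stable"
  }'
}

classify_isr 0.40 0.35   # prints: flapping
classify_isr 0.40 0.00   # prints: one-way-shrink
```

Feed it the values read from IsrShrinksPerSec and IsrExpandsPerSec on each leader; a broker that keeps reporting "flapping" points to an intermittent follower rather than a dead one.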

The worst case is the cascade: a broker’s disk degrades or a long GC pause hits, the follower falls behind, ISR shrinks, and under-replication spreads. If a second event hits while the cluster is already degraded, partitions can lose their last viable replicas and go offline. In the KAFKA-12241 pattern, an ISR shrink down to only the leader followed immediately by a LogDirFailure can offline the partition and require manual recovery.

flowchart TD
    A[Follower disk slows or GC pauses] --> B[Follower fetch lags]
    B --> C[Leader removes follower from ISR]
    C --> D[IsrShrinksPerSec spikes]
    D --> E[UnderReplicatedPartitions rises]
    E --> F{Second failure?}
    F -->|No| I[Partition stays under-replicated until the follower catches up]
    F -->|Yes| G[ISR below min.insync.replicas]
    G --> H[acks=all writes rejected]
    H --> J[No viable replicas remain]
    J --> K[OfflinePartitionsCount > 0]

Common causes

Cause	What it looks like	First thing to check
Follower disk I/O degradation	Shrinks clustered on partitions whose follower is on the same broker; disk `await` elevated	`iostat -xz 1` on the follower broker
GC pause on follower broker	ISR flapping at regular intervals matching GC frequency; `CollectionTime` spikes	`jstat -gcutil $(pgrep -f kafka.Kafka) 1000`
Network partition or saturation between leader and follower	Cross-broker `FetchFollower` latency jumps; connection errors in logs	`ss -s` and per-broker network metrics
Broker overload (too many partitions)	`RequestHandlerAvgIdlePercent` drops on the follower; high request queue	Per-broker `RequestHandlerAvgIdlePercent`
`replica.lag.time.max.ms` too tight for workload	Flapping during traffic bursts; shrinks recover quickly	Broker config vs Kafka version default

Quick checks

These are read-only checks you can run without risk.

# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check unavailable partitions (offline)
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check ISR shrink rate on a leader broker via JMX (-n runs jmxterm non-interactively)
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Check ISR expand rate (flapping indicator)
echo "get -b kafka.server:type=ReplicaManager,name=IsrExpandsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Check follower disk latency
iostat -xz 1

# Check GC behavior on the follower
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

# Count TCP connections owned by the broker process (run as root so ss can show process info)
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka),"

How to diagnose it

1. Confirm the scope. Run kafka-topics.sh --describe --under-replicated-partitions. If the list is long and spans many leaders, look for a follower broker that the affected partitions have in common. If it is short and localized, the problem is likely a single leader-follower pair.
2. Correlate shrinks with expands. Pull IsrShrinksPerSec and IsrExpandsPerSec OneMinuteRate from every broker. If both are elevated on the same leaders, you have flapping. If shrinks are high but expands are flat, replicas are falling out and staying out.
3. Find the sick follower. Cross-reference UnderReplicatedPartitions across brokers. The broker that is most often assigned as a replica but missing from the ISR in the URP list is the common denominator. Check its disk I/O (await via iostat), RequestHandlerAvgIdlePercent, and GC pauses (G1 Old Generation CollectionTime via JMX).
4. Check leader-side replication latency. On the leaders showing shrinks, inspect kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. If p99 approaches replica.lag.time.max.ms, the follower is on the edge of eviction.
5. Assess cascade risk. Check OfflinePartitionsCount on the controller. If it is rising, the cascade is active. Check UnderMinIsrPartitionCount to see if acks=all writes are already being rejected.
6. Check for controller queue backup. On the active controller, check ControllerEventQueueSize. If it is growing while offline partitions rise, leader elections are queued and the cluster cannot self-heal quickly.
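Finding the common follower can be sketched as a filter over the describe output. This sketch assumes each partition line contains `Replicas: <ids>` and `Isr: <ids>` fields, as current kafka-topics.sh versions print; the function name `missing_from_isr` and the sample topic names are hypothetical.

```shell
# missing_from_isr: reads `kafka-topics.sh --describe` output on stdin and prints
# "<count> <broker-id>" for each broker that is an assigned replica but absent
# from the ISR, most frequent first.
missing_from_isr() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        # An empty ISR is followed by the next "Name:" label (or nothing), so skip labels.
        if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (replicas == "") next            # not a partition line
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++)
        if (index("," isr ",", "," r[j] ",") == 0) missing[r[j]]++
    }
    END { for (b in missing) print missing[b], b }
  ' | sort -k1,1nr -k2,2n
}

# Sample input standing in for real describe output (topic names are made up).
printf 'Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3\nTopic: orders\tPartition: 1\tLeader: 3\tReplicas: 3,2,1\tIsr: 3,1\nTopic: pay\tPartition: 0\tLeader: 1\tReplicas: 1,3,4\tIsr: 1,3\n' | missing_from_isr
# prints:
# 2 2
# 1 4
```

Against a live cluster the same function would be fed with `kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions | missing_from_isr`. The broker at the top of the list is the first place to check disk, request handler idle, and GC.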

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`IsrShrinksPerSec` OneMinuteRate	Velocity of replicas leaving ISR	Sustained > 0 outside maintenance
`IsrExpandsPerSec` OneMinuteRate	Recovery velocity; paired with shrinks indicates flapping	Non-zero while shrinks are also non-zero
`UnderReplicatedPartitions`	Cumulative effect of ISR loss; durability window is open	Nonzero on any broker in steady state
`UnderMinIsrPartitionCount`	Direct measure of write rejection for `acks=all`	Nonzero for > 2 minutes
`OfflinePartitionsCount`	Partitions with no leader; complete unavailability	Nonzero sustained > 60 seconds
`FetchFollower` TotalTimeMs p99	Leader-side view of replication latency	Exceeds 50% of `replica.lag.time.max.ms`
Disk `await` (follower)	Root cause of many slow followers	> 20 ms for SSDs, > 50 ms for HDDs sustained
`RequestHandlerAvgIdlePercent`	Follower broker processing saturation	Sustained < 0.3
GC `CollectionTime` (Old Gen)	Long pauses cause ISR eviction	Pauses > 200 ms regularly
`ControllerEventQueueSize`	Metadata plane backlog during failures	Growing continuously > 1000 events
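
As a worked example of the FetchFollower row, the 50% warning line for the default replica.lag.time.max.ms of 30000 ms sits at 15000 ms. A minimal sketch follows; the function name `fetch_latency_status` and the "critical" label for values at or above the full window are assumptions for illustration.

```shell
# fetch_latency_status P99_MS [LAG_MAX_MS]
# Hypothetical helper: compares FetchFollower TotalTimeMs p99 with replica.lag.time.max.ms.
# LAG_MAX_MS defaults to 30000, the broker default since Kafka 2.5.0.
fetch_latency_status() {
  awk -v p="$1" -v m="${2:-30000}" 'BEGIN {
    if (p + 0 >= m + 0)          print "critical"  # at or past the eviction window
    else if (p + 0 > 0.5 * m)    print "warning"   # past the 50% line from the table
    else                         print "ok"
  }'
}

fetch_latency_status 12000 30000   # prints: ok
fetch_latency_status 18000 30000   # prints: warning
```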

Fixes

Slow follower disk or saturated I/O

If one broker’s disk is degraded, shut it down with a controlled shutdown. This moves leadership for any partitions it led to other replicas and removes the slow follower from the replication path. Do not kill the process directly; use controlled shutdown so the controller reassigns leadership gracefully. If the broker hosts leaders, confirm that the remaining in-sync replicas still satisfy min.insync.replicas before shutting it down.

If the disk is merely saturated (not failing), check for competing workloads on the same volume or unexpected page cache thrashing. Reducing follower fetch load by reassigning some partitions away may help, but that is a longer operation.

GC pauses causing flapping

If jstat or JMX shows Old Generation collections exceeding 200 ms, the JVM heap is likely mis-sized or misconfigured. Kafka brokers should run with a modest heap (4-8 GB is typical). Oversized heaps cause longer pauses. Reduce heap size if it is too large, or investigate memory leaks and message down-conversion that materializes large on-heap buffers.
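
A rough way to compare jstat output against the 200 ms figure is to divide cumulative full-GC time by the full-GC count. The sketch below assumes the `FGC` and `FGCT` columns that `jstat -gcutil` prints and looks them up by header name, since column positions differ between JDK versions. The function name `avg_full_gc_ms` is hypothetical, and the result is an average, not a pause distribution.

```shell
# avg_full_gc_ms: reads `jstat -gcutil` output on stdin and prints the average
# full-GC time in milliseconds (FGCT / FGC) taken from the last sample row.
avg_full_gc_ms() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to columns
    NF > 0  { fgc = $(col["FGC"]); fgct = $(col["FGCT"]) }    # keep the latest sample
    END     { if (fgc + 0 > 0) printf "%.0f\n", fgct / fgc * 1000; else print 0 }
  '
}

# Sample row standing in for real jstat output (values are made up).
printf '  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT\n  0.00 100.00  35.71  48.12  97.52  93.10    412    6.184     3    1.260    7.444\n' | avg_full_gc_ms
# prints: 420
```

On a broker, a bounded sample such as `jstat -gcutil $(pgrep -f kafka.Kafka) 1000 5 | avg_full_gc_ms` gives the same figure. Young and mixed collections are counted separately under YGC, so treat this as a coarse signal and confirm with GC logs.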

Network issues between brokers

Check for packet loss or interface saturation. If replication traffic competes with consumer traffic on the same interface, consider dedicated inter-broker listeners or reducing consumer fetch size temporarily. Verify that num.network.threads is adequate for the connection count, especially with TLS.

ISR flapping from tight `replica.lag.time.max.ms`

If shrinks and expands cycle continuously and follower resources look healthy, the lag threshold may be too aggressive for your workload. Raising replica.lag.time.max.ms reduces spurious evictions caused by bursty traffic or transient latency. The tradeoff is that a real failure takes longer to detect: when a follower actually dies, acks=all produce requests stall until the leader removes it from the ISR, which can take up to replica.lag.time.max.ms. Do not raise it above your application’s tolerance for that delay.

Clusters upgraded from pre-2.5 versions that still set replica.lag.time.max.ms=10000 explicitly (the old default) are especially prone to this. The current default is 30 seconds.
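
To see which value a broker is actually running with, its properties file can be inspected. A minimal sketch, assuming the setting lives in a plain server.properties file and that an unset key means the 30000 ms default of Kafka 2.5.0 and later; the function name `effective_lag_ms` and the file location are assumptions.

```shell
# effective_lag_ms FILE
# Hypothetical helper: prints replica.lag.time.max.ms as set in a broker properties
# file, or 30000 (the default since Kafka 2.5.0) when the key is absent.
# Commented-out lines are ignored; the last uncommented occurrence wins.
effective_lag_ms() (
  v=$(sed -n 's/^[[:space:]]*replica\.lag\.time\.max\.ms[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
  echo "${v:-30000}"
)

# usage (the path is an assumption; it varies by installation):
#   effective_lag_ms "$KAFKA_HOME/config/server.properties"
```

A result of 10000 on an upgraded cluster means the old default was pinned explicitly and is a candidate for review.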

Offline partitions and KAFKA-12241

If OfflinePartitionsCount is nonzero and the partition has no ISR members, you may need to elect a leader manually. With unclean.leader.election.enable=false (the safe default), recovery requires bringing a broker that was in the ISR back online or using partition reassignment. Enabling unclean election recovers availability but can lose acknowledged data, because an out-of-sync replica becomes leader and the other replicas truncate to its log. Treat this as a last resort.
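
If you do reach the last resort, an unclean election can be scoped to a single partition instead of enabling it broker-wide. The sketch below only prints the command so that nothing runs by accident. It assumes the kafka-leader-election.sh tool shipped with Kafka 2.4 and later and its `--election-type`, `--topic`, and `--partition` options; verify them with `--help` on your version. The function name and the topic `orders` are hypothetical.

```shell
# unclean_election_cmd TOPIC PARTITION [BOOTSTRAP_SERVER]
# Hypothetical dry-run helper: prints (does not execute) an unclean leader election
# command for one partition. Review the output, then run it by hand if appropriate.
unclean_election_cmd() {
  printf 'kafka-leader-election.sh --bootstrap-server %s --election-type unclean --topic %s --partition %s\n' \
    "${3:-localhost:9092}" "$1" "$2"
}

unclean_election_cmd orders 0
# prints: kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type unclean --topic orders --partition 0
```

Printing instead of executing is a deliberate choice here: the operation can discard acknowledged data, so a human should confirm the topic and partition first.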

Prevention

Set min.insync.replicas=2 when replication.factor=3 so that acks=all provides real durability. With the default min.insync.replicas=1, a leader can acknowledge an acks=all write when it is the only member of the ISR, that is, with zero in-sync followers.
Ensure replica.lag.time.max.ms matches your infrastructure. Virtualized environments, slow links, or heavy GC workloads need the default 30 seconds or higher.
Monitor IsrShrinksPerSec and IsrExpandsPerSec together as a pair. A shrink-expand loop is an early warning.
Maintain per-broker dashboards that correlate disk await, RequestHandlerAvgIdlePercent, and GC pauses. The follower is often the bottleneck, not the leader.
Test broker failure recovery in staging. Measure how long ISR recovery takes after a restart or shutdown so you know your real thresholds.
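
The min.insync.replicas arithmetic from the first item can be sketched as follows. Both function names are hypothetical, and the ISR size counts the leader.

```shell
# isr_headroom REPLICATION_FACTOR MIN_ISR
# How many replicas can leave the ISR before acks=all writes are rejected.
isr_headroom() { echo $(( $1 - $2 )); }

# acks_all_status ISR_SIZE MIN_ISR
# Whether an acks=all write is accepted for the current ISR size (leader included).
acks_all_status() {
  if [ "$1" -ge "$2" ]; then echo accepted; else echo rejected; fi
}

isr_headroom 3 2      # prints: 1
acks_all_status 1 2   # prints: rejected
acks_all_status 1 1   # prints: accepted
```

The last call shows the problem with min.insync.replicas=1: a write is accepted with the leader alone, so a single broker loss can lose acknowledged data.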

How Netdata helps

Netdata collects Kafka JMX metrics alongside OS metrics on every broker. Correlate IsrShrinksPerSec on the leader with disk await, CPU, and GC pauses on the follower in the same time window to identify the common denominator. It also tracks UnderReplicatedPartitions, OfflinePartitionsCount, and request handler idle percent, so the cascade from replication lag to saturation to unavailability is visible on one timeline.
