Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
IsrShrinksPerSec is climbing on your leaders and UnderReplicatedPartitions is no longer zero. If the ISR is flapping (shrinks followed by expands every few minutes), the path can end with OfflinePartitionsCount rising and acks=all producers throwing NotEnoughReplicasException. This guide covers that path: how a lagging follower becomes a cluster-wide problem, how to separate flapping from one-way degradation, and how to stop the cascade before partitions go offline.
What this means
Kafka leaders maintain an In-Sync Replica set (ISR): followers that have fetched within replica.lag.time.max.ms (default 30 seconds since Kafka 2.5.0; 10 seconds before 2.5.0). When a follower stops fetching or falls behind beyond that window, the leader removes it. IsrShrinksPerSec measures the velocity of these removals. IsrExpandsPerSec measures replicas catching up and rejoining.
A shrink without a matching expand means a replica fell out and stayed out. Sustained one-way shrinks degrade durability: if the leader dies, fewer copies exist, and with unclean.leader.election.enable=false (default since 0.11.0.0) the partition can become unavailable. When shrinks and expands happen together in tight loops, you have ISR flapping: an intermittent follower that keeps crossing the lag threshold.
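As a rough sketch of the distinction above, the two rates can be compared per broker. The helper name classify_isr and the labels it prints are illustrative and not part of Kafka, and treating any nonzero rate as "active" is an assumption you would likely replace with a noise floor of your own.

```shell
# classify_isr SHRINK_RATE EXPAND_RATE
# Hypothetical helper: labels a broker from its IsrShrinksPerSec and
# IsrExpandsPerSec OneMinuteRate values (both passed as plain numbers).
classify_isr() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    s += 0; e += 0                    # force numeric comparison
    if (s > 0 && e > 0) print "flapping"        # replicas leave and rejoin
    else if (s > 0)     print "one-way-shrink"  # replicas leave and stay out
    else if (e > 0)     print "recovering"      # replicas only rejoining
    else                print "stable"
  }'
}
```

For example, classify_isr 0.5 0.4 prints flapping, while classify_isr 0.5 0 prints one-way-shrink.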
The worst case is the cascade: a broker’s disk degrades or a long GC pause hits, the follower falls behind, ISR shrinks, and under-replication spreads. If a second event hits while the cluster is already degraded, partitions can lose their last viable replicas and go offline. In the KAFKA-12241 pattern, an ISR shrink to the leader followed immediately by a LogDirFailure can offline the partition and require manual recovery.
flowchart TD
A[Follower disk slows or GC pauses] --> B[Follower fetch lags]
B --> C[Leader removes follower from ISR]
C --> D[IsrShrinksPerSec spikes]
D --> E[UnderReplicatedPartitions rises]
E --> F{Second failure?}
F -->|Yes| G[ISR below min.insync.replicas]
G --> H[acks=all writes rejected]
F -->|No| I[ISR keeps shrinking]
I --> J[No viable replicas remain]
J --> K[OfflinePartitionsCount > 0]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Follower disk I/O degradation | Shrinks clustered on partitions whose follower is on the same broker; disk await elevated | iostat -xz 1 on the follower broker |
| GC pause on follower broker | ISR flapping at regular intervals matching GC frequency; CollectionTime spikes | jstat -gcutil $(pgrep -f kafka.Kafka) 1000 |
| Network partition or saturation between leader and follower | Cross-broker FetchFollower latency jumps; connection errors in logs | ss -s and per-broker network metrics |
| Broker overload (too many partitions) | RequestHandlerAvgIdlePercent drops on the follower; high request queue | Per-broker RequestHandlerAvgIdlePercent |
| replica.lag.time.max.ms too tight for workload | Flapping during traffic bursts; shrinks recover quickly | Broker config vs Kafka version default |
Quick checks
These are read-only checks you can run without risk.
# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Check unavailable partitions (offline)
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# Check ISR shrink rate on a leader broker via JMX
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check ISR expand rate (flapping indicator)
echo "get -b kafka.server:type=ReplicaManager,name=IsrExpandsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check follower disk latency
iostat -xz 1
# Check GC behavior on the follower
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
# Check active TCP connections to the broker
ss -tnp | grep "pid=$(pgrep -f kafka.Kafka | head -n 1)," | wc -l
How to diagnose it
1. Confirm the scope. Run kafka-topics.sh --describe --under-replicated-partitions. If the list is long and spans many leaders, look for the common follower broker. If it is short and localized, the problem is likely a single leader-follower pair.
2. Correlate shrinks with expands. Pull the IsrShrinksPerSec and IsrExpandsPerSec OneMinuteRate from every broker. If both are elevated on the same leaders, you have flapping. If shrinks are high but expands are flat, replicas are falling out and staying out.
3. Find the sick follower. Cross-reference UnderReplicatedPartitions across brokers. The broker that appears most often as a follower in the URP list is the common denominator. Check its disk I/O (await via iostat), RequestHandlerAvgIdlePercent, and GC pauses (G1 Old Generation CollectionTime via JMX).
4. Check leader-side replication latency. On the leaders showing shrinks, inspect kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. If p99 approaches replica.lag.time.max.ms, the follower is on the edge of eviction.
5. Assess cascade risk. Check OfflinePartitionsCount on the controller. If it is rising, the cascade is active. Check UnderMinIsrPartitionCount to see if acks=all writes are already being rejected.
6. Check for controller queue backup. On the active controller, check ControllerEventQueueSize. If it is growing while offline partitions rise, leader elections are queued and the cluster cannot self-heal quickly.
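The "find the sick follower" step can be approximated with a small filter over the kafka-topics.sh output. This is a sketch under assumptions: it expects each partition line to carry Replicas: and Isr: labels followed by comma-separated broker IDs, which is how recent releases print them but may differ in yours, and missing_followers is a hypothetical name.

```shell
# missing_followers: reads `kafka-topics.sh --describe` partition lines on
# stdin and prints "<count> <broker-id>" for every broker that is assigned
# as a replica but absent from the ISR, most frequent first.
missing_followers() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        # An empty ISR is followed by another label (or nothing), not by IDs.
        if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (replicas == "") next          # not a partition line
      split("", present)                # portable way to clear the array
      n = split(isr, ids, ",")
      for (j = 1; j <= n; j++) present[ids[j]] = 1
      m = split(replicas, all, ",")
      for (j = 1; j <= m; j++)
        if (!(all[j] in present)) missing[all[j]]++
    }
    END { for (b in missing) print missing[b], b }
  ' | sort -k1,1nr -k2,2n
}
```

Typical use would be to pipe the under-replicated listing into it: kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions | missing_followers. The broker ID on the first output line is the most likely common denominator.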
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| IsrShrinksPerSec OneMinuteRate | Velocity of replicas leaving ISR | Sustained > 0 outside maintenance |
| IsrExpandsPerSec OneMinuteRate | Recovery velocity; paired with shrinks indicates flapping | Non-zero while shrinks are also non-zero |
| UnderReplicatedPartitions | Cumulative effect of ISR loss; durability window is open | Nonzero on any broker in steady state |
| UnderMinIsrPartitionCount | Direct measure of write rejection for acks=all | Nonzero for > 2 minutes |
| OfflinePartitionsCount | Partitions with no leader; complete unavailability | Nonzero sustained > 60 seconds |
| FetchFollower TotalTimeMs p99 | Leader-side view of replication latency | Exceeds 50% of replica.lag.time.max.ms |
| Disk await (follower) | Root cause of many slow followers | > 20 ms for SSDs, > 50 ms for HDDs sustained |
| RequestHandlerAvgIdlePercent | Follower broker processing saturation | Sustained < 0.3 |
| GC CollectionTime (Old Gen) | Long pauses cause ISR eviction | Pauses > 200 ms regularly |
| ControllerEventQueueSize | Metadata plane backlog during failures | Growing continuously > 1000 events |
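The FetchFollower warning sign is simple arithmetic, sketched here for illustration. The function name fetch_headroom and its two-word output are assumptions, and both arguments are expected in milliseconds.

```shell
# fetch_headroom P99_MS LAG_MAX_MS
# Hypothetical helper: prints how much of replica.lag.time.max.ms the
# FetchFollower p99 consumes (whole percent) and whether it exceeds 50%.
fetch_headroom() {
  awk -v p="$1" -v l="$2" 'BEGIN {
    p += 0; l += 0                      # force numeric values
    pct = int(p * 100 / l)              # truncate to a whole percent
    status = (p * 2 > l) ? "warning" : "ok"
    print pct, status
  }'
}
```

With the 30-second default, fetch_headroom 18000 30000 prints 60 warning and fetch_headroom 9000 30000 prints 30 ok.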
Fixes
Slow follower disk or saturated I/O
If one broker’s disk is degraded, take the broker down with a controlled shutdown (SIGTERM or kafka-server-stop.sh, with controlled.shutdown.enable left at its default of true). Leadership for the partitions it leads moves to other replicas before the process exits, and the slow follower leaves the replication path. Do not use kill -9: a hard kill skips the leadership handoff, so the controller must detect the failure before it can elect new leaders. Before shutting down a broker that hosts leaders, confirm that each of those partitions has at least one other in-sync replica.
If the disk is merely saturated (not failing), check for competing workloads on the same volume or unexpected page cache thrashing. Reducing follower fetch load by reassigning some partitions away may help, but that is a longer operation.
GC pauses causing flapping
If jstat or JMX shows Old Generation collections exceeding 200 ms, the JVM heap is likely undersized or misconfigured. Kafka brokers should run with modest heap (4-8 GB is typical). Oversized heaps cause longer pauses. Reduce heap size if it is too large, or investigate memory leaks and message down-conversion that materializes large on-heap buffers.
Network issues between brokers
Check for packet loss or interface saturation. If replication traffic competes with consumer traffic on the same interface, consider dedicated inter-broker listeners or reducing consumer fetch size temporarily. Verify that num.network.threads is adequate for the connection count, especially with TLS.
ISR flapping from tight replica.lag.time.max.ms
If shrinks and expands cycle continuously and follower resources look healthy, the lag threshold may be too aggressive for your workload. Raising replica.lag.time.max.ms reduces spurious evictions caused by bursty traffic or transient latency. The tradeoff is that a real failure takes longer to detect: writes with acks=all will wait longer before the leader acts. Do not raise it above your application’s tolerance for failover time.
Clusters that pinned replica.lag.time.max.ms=10000 in server.properties before 2.5 keep that value after an upgrade and are especially prone to this. A broker with the setting left unset picks up the current default of 30 seconds.
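One way to check whether a broker carries an explicit value is to scan its properties file. This is a minimal sketch: explicit_lag_ms is a hypothetical helper, the file path is whatever your deployment uses, and it assumes the key=value form (a properties file can also use : as the separator, which this does not handle).

```shell
# explicit_lag_ms FILE
# Prints the last uncommented replica.lag.time.max.ms value found in a
# broker properties file, or "unset" when the broker relies on the default.
explicit_lag_ms() {
  awk -F= '
    /^[ \t]*replica\.lag\.time\.max\.ms[ \t]*=/ { v = $2 }  # last one wins
    END {
      gsub(/[ \t]/, "", v)              # strip padding around the value
      print (v == "" ? "unset" : v)
    }
  ' "$1"
}
```

Running explicit_lag_ms against each broker's server.properties and seeing 10000 suggests the old value was carried across the upgrade.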
Offline partitions and KAFKA-12241
If OfflinePartitionsCount is nonzero and the partition has no ISR members, you may need to elect a leader manually. If unclean.leader.election.enable=false (the safe default), this requires bringing a broker back online or using partition reassignment. Enabling unclean election restores availability but can discard acknowledged writes that the newly elected replica never received. Treat this as a last resort.
Prevention
- Set min.insync.replicas=2 when replication.factor=3 so that acks=all provides real durability. Without it (the default is 1), a leader can acknowledge with zero followers in ISR.
- Ensure replica.lag.time.max.ms matches your infrastructure. Virtualized environments, slow links, or heavy GC workloads need the default 30 seconds or higher.
- Monitor IsrShrinksPerSec and IsrExpandsPerSec together as a pair. A shrink-expand loop is an early warning.
- Maintain per-broker dashboards that correlate disk await, RequestHandlerAvgIdlePercent, and GC pauses. The follower is often the bottleneck, not the leader.
- Test broker failure recovery in staging. Measure how long ISR recovery takes after a restart or shutdown so you know your real thresholds.
How Netdata helps
Netdata collects Kafka JMX metrics alongside OS metrics on every broker. Correlate IsrShrinksPerSec on the leader with disk await, CPU, and GC pauses on the follower in the same time window to identify the common denominator. It also tracks UnderReplicatedPartitions, OfflinePartitionsCount, and request handler idle percent, so the cascade from replication lag to saturation to unavailability is visible on one timeline.







