Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline

IsrShrinksPerSec is climbing on your leaders and UnderReplicatedPartitions is no longer zero. If it is flapping – shrinks followed by expands every few minutes – the path ends with OfflinePartitionsCount rising and acks=all producers throwing NotEnoughReplicasException. This guide covers that path: how a lagging follower becomes a cluster-wide problem, how to separate flapping from one-way degradation, and how to stop the cascade before partitions go offline.

What this means

Kafka leaders maintain an In-Sync Replica set (ISR): followers that have fetched within replica.lag.time.max.ms (default 30 seconds since Kafka 2.5.0; 10 seconds before 2.5.0). When a follower stops fetching or falls behind beyond that window, the leader removes it. IsrShrinksPerSec measures the velocity of these removals. IsrExpandsPerSec measures replicas catching up and rejoining.

A shrink without a matching expand means a replica fell out and stayed out. Sustained one-way shrinks degrade durability: if the leader dies, fewer copies exist, and with unclean.leader.election.enable=false (default since 0.11.0.0) the partition can become unavailable. When shrinks and expands happen together in tight loops, you have ISR flapping: an intermittent follower that keeps crossing the lag threshold.
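The difference between one-way shrinking and flapping can be sketched as a small shell helper. This is a minimal illustration, not part of Kafka: the function name classify_isr and the 0.01 events/sec noise floor are assumptions, and the two arguments are the OneMinuteRate values of IsrShrinksPerSec and IsrExpandsPerSec read from the same leader broker.

```shell
# classify_isr SHRINK_RATE EXPAND_RATE [NOISE_FLOOR]
# Hypothetical helper: labels one pair of OneMinuteRate readings.
# NOISE_FLOOR (default 0.01 events/sec) is an assumed cutoff, not a Kafka setting.
classify_isr() {
  awk -v s="$1" -v e="$2" -v floor="${3:-0.01}" 'BEGIN {
    s += 0; e += 0; floor += 0            # force numeric comparison
    if (s <= floor && e <= floor) print "stable"
    else if (s > floor && e > floor)  print "flapping"        # shrinks and expands together
    else if (s > floor)               print "one-way-shrink"  # replicas fall out and stay out
    else                              print "recovering"      # only expands: replicas rejoining
  }'
}
```

For example, classify_isr 0.5 0.4 prints flapping, while classify_isr 0.5 0 prints one-way-shrink. A single reading is only a snapshot, so treat the label as a prompt to look at the trend rather than a verdict.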

The worst case is the cascade: a broker’s disk degrades or a long GC pause hits, the follower falls behind, ISR shrinks, and under-replication spreads. If a second event hits while the cluster is already degraded, partitions can lose their last viable replicas and go offline. In the KAFKA-12241 pattern, an ISR shrink down to only the leader, followed immediately by a LogDirFailure on that leader, can take the partition offline and require manual recovery.

flowchart TD
    A[Follower disk slows or GC pauses] --> B[Follower fetch lags]
    B --> C[Leader removes follower from ISR]
    C --> D[IsrShrinksPerSec spikes]
    D --> E[UnderReplicatedPartitions rises]
    E --> F{Second failure?}
    F -->|Yes| G[ISR below min.insync.replicas]
    G --> H[acks=all writes rejected]
    F -->|No| I[ISR keeps shrinking]
    I --> J[No viable replicas remain]
    J --> K[OfflinePartitionsCount > 0]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Follower disk I/O degradation | Shrinks clustered on partitions whose follower is on the same broker; disk await elevated | iostat -xz 1 on the follower broker |
| GC pause on follower broker | ISR flapping at regular intervals matching GC frequency; CollectionTime spikes | jstat -gcutil $(pgrep -f kafka.Kafka) 1000 |
| Network partition or saturation between leader and follower | Cross-broker FetchFollower latency jumps; connection errors in logs | ss -s and per-broker network metrics |
| Broker overload (too many partitions) | RequestHandlerAvgIdlePercent drops on the follower; high request queue | Per-broker RequestHandlerAvgIdlePercent |
| replica.lag.time.max.ms too tight for workload | Flapping during traffic bursts; shrinks recover quickly | Broker config vs Kafka version default |

Quick checks

These are read-only checks you can run without risk.

# Check under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Check unavailable partitions (offline)
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check ISR shrink rate on a leader broker via JMX (-n runs jmxterm non-interactively for piped input)
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Check ISR expand rate (flapping indicator)
echo "get -b kafka.server:type=ReplicaManager,name=IsrExpandsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999 -n

# Check follower disk latency
iostat -xz 1

# Check GC behavior on the follower
jstat -gcutil $(pgrep -f kafka.Kafka) 1000

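# Count active TCP connections owned by the broker process
# (match on "pid=<PID>," so the PID cannot collide with a port number; -p needs root or the Kafka user)
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka | head -n 1),"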

How to diagnose it

  1. Confirm the scope. Run kafka-topics.sh --describe --under-replicated-partitions. If the list is long and spans many leaders, look for the common follower broker. If it is short and localized, the problem is likely a single leader-follower pair.

  2. Correlate shrinks with expands. Pull IsrShrinksPerSec and IsrExpandsPerSec OneMinuteRate from every broker. If both are elevated on the same leaders, you have flapping. If shrinks are high but expands are flat, replicas are falling out and staying out.

  3. Find the sick follower. Cross-reference the under-replicated partition list across brokers. The broker that most often appears in a partition's Replicas list but is missing from its Isr list is the common denominator. Check its disk I/O (await via iostat), RequestHandlerAvgIdlePercent, and GC pauses (G1 Old Generation CollectionTime via JMX).

  4. Check leader-side replication latency. On the leaders showing shrinks, inspect kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower. If p99 approaches replica.lag.time.max.ms, the follower is on the edge of eviction.

  5. Assess cascade risk. Check OfflinePartitionsCount on the controller. If it is rising, the cascade is active. Check UnderMinIsrPartitionCount to see if acks=all writes are already being rejected.

  6. Check for controller queue backup. On the active controller, check ControllerEventQueueSize. If it is growing while offline partitions rise, leader elections are queued and the cluster cannot self-heal quickly.
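Steps 1 and 3 can be partly automated. The sketch below counts, per broker ID, how many partitions list that broker under Replicas but not under Isr. It assumes the describe output uses the "Replicas:" and "Isr:" labels followed by comma-separated broker IDs; the function name count_out_of_sync is hypothetical.

```shell
# count_out_of_sync: reads `kafka-topics.sh --describe` output on stdin and
# prints "<partition count> <broker id>" for every broker that is assigned
# to a partition but absent from its ISR, most affected broker first.
count_out_of_sync() {
  awk '
    {
      replicas = ""; isr = ""
      # Locate the values that follow the Replicas: and Isr: labels.
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr      = $(i + 1)
      }
      if (replicas == "") next          # header or unrelated line

      split("", present)                # reset the per-line ISR set
      n = split(isr, in_sync, ",")
      for (j = 1; j <= n; j++) present[in_sync[j]] = 1

      m = split(replicas, assigned, ",")
      for (j = 1; j <= m; j++)
        if (!(assigned[j] in present)) missing[assigned[j]]++
    }
    END { for (b in missing) print missing[b], b }
  ' | sort -rn
}
```

A possible invocation, assuming the same bootstrap address used in the quick checks: kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions | count_out_of_sync. A broker that dominates the top of the output is the first candidate for the disk, GC, and request handler checks in step 3.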

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| IsrShrinksPerSec OneMinuteRate | Velocity of replicas leaving ISR | Sustained > 0 outside maintenance |
| IsrExpandsPerSec OneMinuteRate | Recovery velocity; paired with shrinks indicates flapping | Non-zero while shrinks are also non-zero |
| UnderReplicatedPartitions | Cumulative effect of ISR loss; durability window is open | Nonzero on any broker in steady state |
| UnderMinIsrPartitionCount | Direct measure of write rejection for acks=all | Nonzero for > 2 minutes |
| OfflinePartitionsCount | Partitions with no leader; complete unavailability | Nonzero sustained > 60 seconds |
| FetchFollower TotalTimeMs p99 | Leader-side view of replication latency | Exceeds 50% of replica.lag.time.max.ms |
| Disk await (follower) | Root cause of many slow followers | > 20 ms for SSDs, > 50 ms for HDDs sustained |
| RequestHandlerAvgIdlePercent | Follower broker processing saturation | Sustained < 0.3 |
| GC CollectionTime (Old Gen) | Long pauses cause ISR eviction | Pauses > 200 ms regularly |
| ControllerEventQueueSize | Metadata plane backlog during failures | Growing continuously > 1000 events |
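As a worked example of the FetchFollower threshold, the helper below compares a p99 reading with replica.lag.time.max.ms. It is a sketch under stated assumptions: the function name fetch_latency_status is hypothetical, the 50% warning level comes from the table, and the "critical" level at 100% is an added assumption.

```shell
# fetch_latency_status P99_MS LAG_MAX_MS
# Hypothetical helper: compares FetchFollower TotalTimeMs p99 (ms) against
# replica.lag.time.max.ms (ms). "warning" above 50% follows the table;
# "critical" at or above 100% is an assumed extra level.
fetch_latency_status() {
  awk -v p99="$1" -v lag="$2" 'BEGIN {
    p99 += 0; lag += 0
    if (lag <= 0) { print "invalid"; exit }
    ratio = p99 / lag
    if (ratio >= 1)        print "critical"
    else if (ratio > 0.5)  print "warning"
    else                   print "ok"
  }'
}
```

With the 30-second default, fetch_latency_status 16000 30000 prints warning, while the same 16000 ms reading against a 10-second threshold is already critical.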

Fixes

Slow follower disk or saturated I/O

If one broker’s disk is degraded, shut it down with a controlled shutdown. This triggers leader election for any partitions it hosted and removes the slow follower from the replication path. Do not kill the process directly; use controlled shutdown so the controller reassigns leadership gracefully. If the broker hosts leaders, ensure you have enough remaining replicas before shutting it down.

If the disk is merely saturated (not failing), check for competing workloads on the same volume or unexpected page cache thrashing. Reducing follower fetch load by reassigning some partitions away may help, but that is a longer operation.

GC pauses causing flapping

If jstat or JMX shows Old Generation collections exceeding 200 ms, the JVM heap is likely missized or misconfigured. Kafka brokers should run with a modest heap (4-8 GB is typical), because oversized heaps cause longer pauses. Reduce the heap if it is too large, or investigate memory leaks and message down-conversion, which materializes large on-heap buffers.

Network issues between brokers

Check for packet loss or interface saturation. If replication traffic competes with consumer traffic on the same interface, consider dedicated inter-broker listeners or reducing consumer fetch size temporarily. Verify that num.network.threads is adequate for the connection count, especially with TLS.

ISR flapping from tight replica.lag.time.max.ms

If shrinks and expands cycle continuously and follower resources look healthy, the lag threshold may be too aggressive for your workload. Raising replica.lag.time.max.ms reduces spurious evictions caused by bursty traffic or transient latency. The tradeoff is slower detection of a real failure: acks=all produce requests stall until the leader evicts the dead follower from the ISR, so a higher value lengthens that stall. Do not raise it above your application’s tolerance for failover time.

Clusters upgraded from pre-2.5 versions that still pin the old 10-second value (replica.lag.time.max.ms=10000) in their broker configuration are especially prone to this. The current default is 30 seconds.
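One way to spot a pinned value is to read it from the broker's properties file. The sketch below assumes the setting lives in a static properties file rather than being supplied some other way; the function name lag_time_setting and the example path are hypothetical.

```shell
# lag_time_setting FILE
# Prints the explicit replica.lag.time.max.ms value (ms) from a broker
# properties file, or "unset" when no uncommented entry exists and the
# broker therefore falls back to its version default.
lag_time_setting() {
  # Commented lines start with "#" and do not match; if the key appears more
  # than once, the last entry is taken, as a properties loader would.
  value=$(sed -n 's/^[[:space:]]*replica\.lag\.time\.max\.ms[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1)
  echo "${value:-unset}"
}
```

For example, lag_time_setting /opt/kafka/config/server.properties (path assumed) printing 10000 on a 2.5+ cluster suggests the old value was carried over explicitly.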

Offline partitions and KAFKA-12241

If OfflinePartitionsCount is nonzero and the partition has no ISR members, you may need to elect a leader manually. With unclean.leader.election.enable=false (the safe default), recovery requires bringing a replica broker back online or using partition reassignment. Enabling unclean election restores availability but can discard acknowledged data, because an out-of-sync replica becomes the leader. Treat it as a last resort.
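If you do reach the last resort, an unclean election can be scoped to a single partition instead of being enabled broker-wide. The sketch below only prints the command so it can be reviewed before anyone runs it. It assumes a Kafka version that ships kafka-leader-election.sh; the function name unclean_election_cmd and the example topic name are hypothetical.

```shell
# unclean_election_cmd BOOTSTRAP TOPIC PARTITION
# Prints (does not execute) a kafka-leader-election.sh invocation that would
# trigger an unclean election for one partition. Running the printed command
# can discard acknowledged data.
unclean_election_cmd() {
  printf 'kafka-leader-election.sh --bootstrap-server %s --election-type UNCLEAN --topic %s --partition %s\n' \
    "$1" "$2" "$3"
}
```

For example, unclean_election_cmd localhost:9092 orders 3 prints the command for partition 3 of a topic named orders. Printing first is a deliberate design choice: it forces a review step before an irreversible action.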

Prevention

  • Set min.insync.replicas=2 when replication.factor=3 so that acks=all provides real durability. Without it, a leader can acknowledge with zero followers in ISR.
  • Ensure replica.lag.time.max.ms matches your infrastructure. Virtualized environments, slow links, or heavy GC workloads need the default 30 seconds or higher.
  • Monitor IsrShrinksPerSec and IsrExpandsPerSec together as a pair. A shrink-expand loop is an early warning.
  • Maintain per-broker dashboards that correlate disk await, RequestHandlerAvgIdlePercent, and GC pauses. The follower is often the bottleneck, not the leader.
  • Test broker failure recovery in staging. Measure how long ISR recovery takes after a restart or shutdown so you know your real thresholds.
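The min.insync.replicas recommendation comes down to simple arithmetic: the number of replicas that can drop out of the ISR before acks=all writes are rejected is the replication factor minus min.insync.replicas. A minimal sketch, with the hypothetical function name acks_all_headroom:

```shell
# acks_all_headroom REPLICATION_FACTOR MIN_INSYNC_REPLICAS
# Prints how many replicas can leave the ISR before acks=all produce
# requests start failing with NotEnoughReplicasException.
acks_all_headroom() {
  if [ "$2" -gt "$1" ]; then
    echo "invalid"          # min.insync.replicas above RF can never be satisfied
    return 1
  fi
  echo $(( $1 - $2 ))
}
```

With replication.factor=3 and min.insync.replicas=2, the headroom is 1: one follower can flap without rejecting writes, and every acknowledged write still exists on at least two brokers. With min.insync.replicas=1 the headroom is 2, which is exactly the case where the leader acknowledges alone.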

How Netdata helps

Netdata collects Kafka JMX metrics alongside OS metrics on every broker. Correlate IsrShrinksPerSec on the leader with disk await, CPU, and GC pauses on the follower in the same time window to identify the common denominator. It also tracks UnderReplicatedPartitions, OfflinePartitionsCount, and request handler idle percent, so the cascade from replication lag to saturation to unavailability is visible on one timeline.