Kafka ZooKeeper session expired: GC pauses, ISR drops, and controller loss
In ZooKeeper mode, a session expiry means ZK declared a broker dead. The Java process may still be running, but the cluster treats it as gone. If the broker was the controller, the cluster has to elect a new one. If it was a leader, every partition it led starts a new leader election. If it was a follower, leaders remove it from the ISR. The default zookeeper.session.timeout.ms is 18000 ms (the default since Kafka 2.5; earlier releases used 6000 ms), and the most common trigger is a Full GC pause longer than that window. This guide applies to Kafka clusters running in ZooKeeper mode; KRaft mode does not use ZK sessions.
What this means
A Kafka broker in ZK mode maintains a session with ZooKeeper through regular heartbeats. The timeout is governed by zookeeper.session.timeout.ms. When the broker fails to heartbeat within that window, ZK tears down the session and deletes the broker’s ephemeral nodes. The controller removes the broker from the live set. For every partition the broker led, the controller enqueues a leader election. For every partition where this broker was a follower, the leaders remove it from the ISR. The result is a burst of LeaderElectionRateAndTimeMs, a spike in IsrShrinksPerSec, and a rise in UnderReplicatedPartitions. If the expired session belonged to the active controller, the cluster also loses its metadata plane until a new controller is elected.
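To see which timeout a given broker is configured with, you can read the key from its properties file. This is a minimal sketch: the function name is hypothetical, the file path is whatever your deployment uses (for example config/server.properties), and the fallback is the 18000 ms default cited in this guide.

```shell
# Print the zookeeper.session.timeout.ms configured in a broker properties file.
# Falls back to 18000 (the default cited in this guide) when the key is unset.
zk_session_timeout_ms() {
    props_file="$1"
    value=$(awk -F= '
        /^[ \t]*#/ { next }                               # skip comment lines
        { key = $1; gsub(/[ \t\r]/, "", key) }
        key == "zookeeper.session.timeout.ms" {
            v = $2; gsub(/[ \t\r]/, "", v); last = v      # last assignment wins
        }
        END { print last }
    ' "$props_file")
    echo "${value:-18000}"
}

# Example call (path is an assumption):
# zk_session_timeout_ms /opt/kafka/config/server.properties
```

The value read here is only what the file says; a broker started before the file was edited still runs with the old setting until it restarts.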
flowchart TD
A[Full GC pause or ZK latency] --> B[Broker misses ZK heartbeat window]
B --> C[ZK session expires]
C --> D{Broker role}
D -->|Controller| E[Controller election starts]
D -->|Leader| F[Leader elections enqueued]
D -->|Follower| G[ISR shrinks on affected partitions]
E --> H[ActiveControllerCount sum != 1]
F --> I[UnderReplicatedPartitions rises]
G --> I
I --> J[OfflinePartitionsCount may rise]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Full GC pause longer than session timeout | ZooKeeperExpiresPerSec correlates with GC log timestamps showing pauses over 18 s | jstat -gcutil <pid> 1000 5 or GC logs for Old Gen collection time |
| Network partition or latency to ZK | One broker affected, ZooKeeperRequestLatencyMs elevated, TCP retransmits high | Network path latency and ss -tnp output |
| ZK ensemble overload or slow transaction log | Multiple brokers expiring simultaneously, ZK request latency p99 over 1 s | Disk latency on ZK nodes via iostat -x |
| Broker CPU starvation or container throttling | High CPU throttled time, broker process runnable but slow to respond | /sys/fs/cgroup/cpu.stat or OS CPU metrics |
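For the container throttling row, the counters in cpu.stat can be reduced to a single percentage. The sketch below assumes a cgroup v2 host, where the file exposes nr_periods, nr_throttled, and throttled_usec; cgroup v1 uses a different path and field names. The function name is hypothetical.

```shell
# Report what share of CFS periods were throttled, from a cgroup v2 cpu.stat file.
throttle_report() {
    stat_file="${1:-/sys/fs/cgroup/cpu.stat}"
    awk '
        $1 == "nr_periods"     { periods = $2 }
        $1 == "nr_throttled"   { throttled = $2 }
        $1 == "throttled_usec" { usec = $2 }
        END {
            if (periods > 0)
                printf "throttled_periods_pct=%.1f throttled_ms=%d\n", 100 * throttled / periods, usec / 1000
            else
                print "no CPU quota accounting found"
        }
    ' "$stat_file"
}
```

These counters are cumulative for the lifetime of the cgroup, so two readings taken around the time of the expiry say more than a single snapshot.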
Quick checks
# Check ZK session expiration rate on the broker
# JMX port depends on your broker configuration; 9999 is used here as an example
echo "get -b kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check active controller count on this broker (run on every broker and sum the values)
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check under-replicated partitions on this broker (the metric is reported per broker)
echo "get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value" | java -jar jmxterm.jar -l localhost:9999
# Check GC behavior (5 samples, 1 s apart)
jstat -gcutil $(pgrep -f kafka.Kafka) 1000 5
# Check ZK request latency p99
echo "get -b kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# Check ISR shrink rate
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check broker process uptime
ps -o pid,comm,etime -p $(pgrep -f kafka.Kafka)
# Check disk latency on the broker (5 samples)
iostat -xz 1 5
# Check controller event queue depth if a broker was controller
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
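ActiveControllerCount is only meaningful as a cluster-wide sum, so the per-broker query has to be repeated and added up. The helper below is a sketch that sums values from jmxterm output piped into it. It assumes jmxterm prints the attribute as a line of the form "Value = 1;", which you should confirm against your jmxterm version. The function name, the BROKERS variable, and port 9999 are placeholders.

```shell
# Sum ActiveControllerCount from jmxterm output collected from every broker.
# Assumes jmxterm prints attribute values as lines like "Value = 1;".
sum_active_controllers() {
    awk '
        /^Value = / { gsub(/;/, "", $3); sum += $3 }
        END { printf "%d\n", sum }
    '
}

# Hypothetical usage; BROKERS is a space-separated list of broker hosts:
# for host in $BROKERS; do
#     echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" \
#         | java -jar jmxterm.jar -l "$host:9999"
# done | sum_active_controllers
```

A result of 0 means no broker currently claims the controller role; a result above 1 means more than one does.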
How to diagnose it
- Confirm the session expiry. Query ZooKeeperExpiresPerSec on the affected broker. Any nonzero OneMinuteRate outside a rolling restart is abnormal.
- Determine scope. Is one broker affected or many? Was the broker the active controller? Check ActiveControllerCount on all brokers and sum the values.
- Correlate with GC. Overlay the expiry timestamp with GC logs. If a Full GC pause exceeds zookeeper.session.timeout.ms, you have the root cause. Also check broker logs for OutOfMemoryError or memory pressure warnings preceding the pause.
- Check ZK health. If GC is clean, look at ZooKeeperRequestLatencyMs p99. If it is elevated across multiple brokers, inspect the ZK ensemble disk and CPU.
- Verify the network path. Check for packet loss or latency spikes between the broker and ZK nodes. Firewall rule changes, DNS resolution delays, or routing asymmetry can stall heartbeats.
- Measure impact. Check UnderReplicatedPartitions, OfflinePartitionsCount, and LeaderElectionRateAndTimeMs. Confirm whether producers with acks=all are hitting NotEnoughReplicasException via UnderMinIsrPartitionCount.
- Identify if ISR shrinks are continuing. Sustained IsrShrinksPerSec means replicas are still falling behind. A transient spike followed by IsrExpandsPerSec means recovery is in progress.
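The GC correlation step can be partly automated by filtering the GC log for Full GC pauses longer than the session timeout. This sketch assumes JDK 9+ unified GC logging, where each pause line contains "Pause Full" and ends with the duration as "<millis>ms"; JDK 8 logs use a different layout and would need another pattern. The function name is hypothetical.

```shell
# Print "Pause Full" events whose duration exceeds a threshold in milliseconds.
# Assumes JDK 9+ unified GC logging, where each pause line ends in "<millis>ms".
long_full_gc_pauses() {
    gc_log="$1"
    threshold_ms="${2:-18000}"
    awk -v limit="$threshold_ms" '
        /Pause Full/ {
            ms = $NF                  # last field, e.g. "19234.567ms"
            sub(/ms$/, "", ms)
            if (ms + 0 > limit + 0) print
        }
    ' "$gc_log"
}

# Example call (log path is an assumption):
# long_full_gc_pauses /var/log/kafka/kafkaServer-gc.log 18000
```

Pass your own zookeeper.session.timeout.ms as the second argument if it differs from 18000, then compare the timestamps of the matching lines with the time of the expiry.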
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| ZooKeeperExpiresPerSec | Direct signal of ZK declaring the broker dead | OneMinuteRate > 0 outside maintenance |
| ActiveControllerCount | Exactly one broker must report 1 | Cluster-wide sum != 1 for more than 30 s |
| UnderReplicatedPartitions | Durability window is open | Nonzero and growing |
| UnderMinIsrPartitionCount | Confirms writes are being rejected | Nonzero for more than 2 min |
| IsrShrinksPerSec | Velocity of replica loss from ISR | Sustained > 0 for more than 5 min |
| OfflinePartitionsCount | Partitions with no leader | Nonzero for more than 60 s |
| ZooKeeperRequestLatencyMs | Early indicator of ZK pressure | p99 > 100 ms sustained |
| GC Old Gen CollectionTime | GC pauses directly cause expiry | Full GC pause approaching 18 s |
Fixes
GC pauses exceeding the session timeout
The default zookeeper.session.timeout.ms is 18 s. If your GC logs show Full GC pauses approaching or exceeding this, you have three levers. First, reduce heap pressure. Avoid oversizing the heap; large heaps increase pause times and steal from the OS page cache. Look for message down-conversion or memory leaks. Second, switch to a low-pause collector such as G1GC, ZGC, or Shenandoah. Third, you may increase zookeeper.session.timeout.ms, but a longer timeout slows failure detection: a broker that has genuinely died keeps its leaderships longer before ZK expires the session, which lengthens the outage window. Do not increase the timeout without also fixing the GC.
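As an illustration of the second lever, the stock Kafka start scripts read JVM options from the KAFKA_HEAP_OPTS and KAFKA_JVM_PERFORMANCE_OPTS environment variables. The values below are illustrative only, not a recommendation for your workload; the heap size in particular is an assumption and should come from your own GC logs.

```shell
# Example JVM settings exported before starting the broker.
# All values are illustrative; size the heap from observed GC behavior.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent"

# Then start the broker as usual, for example:
# bin/kafka-server-start.sh config/server.properties
```

Setting KAFKA_JVM_PERFORMANCE_OPTS replaces the defaults the start script would otherwise apply, so check the script in your distribution before overriding it.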
Network partition between broker and ZK
If only one broker is affected and GC is clean, inspect the network path. Check for TCP retransmits and latency spikes from the broker to ZK nodes. Firewall rule changes, DNS resolution delays, or routing asymmetry can all stall ZK heartbeats. Fix the network path rather than tuning Kafka.
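One coarse way to spot retransmits is the kernel's TCP counters. The sketch below reads /proc/net/snmp on a Linux host and prints retransmitted segments as a percentage of segments sent. It is host-wide and cumulative since boot, so it cannot isolate the ZK connection by itself; treat it as a first filter before looking at individual sockets. The function name is hypothetical.

```shell
# Print TCP retransmitted segments as a percentage of segments sent (host-wide).
tcp_retrans_pct() {
    snmp_file="${1:-/proc/net/snmp}"
    awk '
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i   # map counter name -> column
            have_header = 1
            next
        }
        $1 == "Tcp:" {
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * re / out
        }
    ' "$snmp_file"
}
```

Comparing the result on the affected broker with a healthy broker in the same rack helps show whether the problem is local to one host.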
ZK ensemble overload
If multiple brokers expire simultaneously, the ZK ensemble itself is likely the bottleneck. ZK is sensitive to transaction log disk latency. Ensure ZK runs on dedicated nodes with SSD-backed transaction logs. Do not share the ZK ensemble with other systems like Hadoop or Solr. Check disk latency on ZK nodes via iostat -x.
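ZooKeeper reports its own latency and queue depth through the mntr four-letter command, which is a quick complement to iostat. The helper below is a sketch that summarizes three of its fields. It assumes mntr is enabled on the ensemble (ZooKeeper 3.5 and later require it to be listed in 4lw.commands.whitelist) and that nc is installed; the host name is a placeholder and the function name is hypothetical.

```shell
# Summarize latency and queue depth from the output of ZooKeeper's "mntr" command.
zk_pressure_summary() {
    awk '
        $1 == "zk_avg_latency"          { avg = $2 }
        $1 == "zk_max_latency"          { max = $2 }
        $1 == "zk_outstanding_requests" { queued = $2 }
        END { printf "avg_latency=%s max_latency=%s outstanding=%s\n", avg, max, queued }
    '
}

# Hypothetical usage; zk1.example.internal is a placeholder host:
# echo mntr | nc zk1.example.internal 2181 | zk_pressure_summary
```

Run it against every ensemble member; a persistently nonzero outstanding request count on the leader points at the ensemble rather than at any single broker.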
Controller session expiry
When the controller loses its ZK session, the cluster loses the ability to process metadata changes until a new controller is elected. Warning: Do not restart additional brokers during this window; each restart generates more controller events and prolongs recovery. Monitor the controller event queue on the new controller to ensure it drains. Once the cluster stabilizes, investigate the root cause on the former controller just as you would for any other broker.
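To find the new controller so you can watch its event queue, you can read the /controller znode, which holds a small JSON document with a brokerid field. The sketch below extracts that id from whatever is piped into it. It assumes the zookeeper-shell.sh script that ships with Kafka; the ZK address is a placeholder and the function name is hypothetical.

```shell
# Pull the broker id out of the JSON stored in the /controller znode.
controller_broker_id() {
    sed -n 's/.*"brokerid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical usage; the ZK address is a placeholder:
# bin/zookeeper-shell.sh zk1.example.internal:2181 get /controller | controller_broker_id
```

If the command prints nothing, the znode is absent and no controller is currently registered.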
Prevention
- Alert on Full GC pause duration with a threshold well below zookeeper.session.timeout.ms. A 5-second pause warrants investigation; a 10-second pause is an incident in waiting.
- Keep ZK infrastructure isolated and monitored independently. Track ZK transaction log disk latency.
- Do not set zookeeper.session.timeout.ms aggressively low to “fail fast” if your JVM cannot guarantee sub-second GC pauses.
- Monitor IsrShrinksPerSec as an early warning. ISR shrinks often precede session expiry when the root cause is gradual degradation.
- During rolling restarts, expect brief ZK session transitions. If a broker does not rejoin within 1-2 times replica.lag.time.max.ms, investigate before proceeding to the next broker.
- Test failure recovery time by gracefully shutting down one broker and measuring how long ISR recovery takes. If recovery approaches the session timeout, you have insufficient headroom.
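The GC alerting thresholds from the first item can be written down as a small classifier, which is handy when wiring pause durations into an alert script. This is a sketch under stated assumptions: the 5 s and 10 s bands come from this guide, the labels and function name are hypothetical, and durations are whole milliseconds.

```shell
# Classify a GC pause (whole milliseconds) using the 5 s and 10 s bands from
# this guide, plus the session timeout itself as the top band.
classify_gc_pause() {
    pause_ms="$1"
    timeout_ms="${2:-18000}"
    if   [ "$pause_ms" -ge "$timeout_ms" ]; then echo "session-expiry-risk"
    elif [ "$pause_ms" -ge 10000 ];         then echo "incident-in-waiting"
    elif [ "$pause_ms" -ge 5000 ];          then echo "investigate"
    else                                         echo "ok"
    fi
}
```

For example, `classify_gc_pause 12000` prints `incident-in-waiting`, while `classify_gc_pause 7000 6000` prints `session-expiry-risk` because the pause exceeds a 6000 ms timeout.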
How Netdata helps
- Correlates ZooKeeperExpiresPerSec with JVM GC metrics and OS-level CPU and disk latency.
- Collects ActiveControllerCount, UnderReplicatedPartitions, and IsrShrinksPerSec from JMX.
- Surfaces request latency breakdowns to distinguish GC-induced LocalTimeMs spikes from network-induced delays.
- Tracks ZooKeeperRequestLatencyMs to detect ZK ensemble pressure before it triggers session timeouts.
- Provides page fault and disk latency metrics from the OS to catch hardware degradation that indirectly causes GC or network stalls.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka leadership imbalance: LeaderCount skew and preferred replica election
- Kafka min.insync.replicas and acks: configuring durability you actually have
- Kafka monitoring checklist: the signals every production cluster needs
- Kafka monitoring maturity model: from survival to expert
- Kafka ActiveControllerCount not equal to 1: no controller or split brain
- Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas
- Kafka NOT_LEADER_FOR_PARTITION: stale metadata, controller lag, and client retries
- Kafka OfflinePartitionsCount > 0: partitions with no leader and how to recover







