Kafka ZooKeeper session expired: GC pauses, ISR drops, and controller loss

In ZooKeeper mode, a session expiry means ZK has declared a broker dead. The Java process may still be running, but the cluster treats it as gone. If the broker was the controller, the cluster must elect a new controller. If it was a leader, every partition it led needs a new leader. If it was a follower, leaders remove it from the ISR. The default zookeeper.session.timeout.ms is 18000 ms (since Kafka 2.5; earlier releases defaulted to 6000 ms), and the most common trigger is a Full GC pause longer than that window. This guide applies to Kafka clusters running in ZooKeeper mode; KRaft mode does not use ZK sessions.

What this means

A Kafka broker in ZK mode maintains a session with ZooKeeper through regular heartbeats. The timeout is governed by zookeeper.session.timeout.ms. When the broker fails to heartbeat within that window, ZK tears down the session and deletes the broker’s ephemeral nodes. The controller removes the broker from the live set. For every partition the broker led, the controller enqueues a leader election. For every partition where this broker was a follower, the leaders remove it from the ISR. The result is a burst of LeaderElectionRateAndTimeMs, a spike in IsrShrinksPerSec, and a rise in UnderReplicatedPartitions. If the expired session belonged to the active controller, the cluster also loses its metadata plane until a new controller is elected.

flowchart TD
    A[Full GC pause or ZK latency] --> B[Broker misses ZK heartbeat window]
    B --> C[ZK session expires]
    C --> D{Broker role}
    D -->|Controller| E[Controller election starts]
    D -->|Leader| F[Leader elections enqueued]
    D -->|Follower| G[ISR shrinks on affected partitions]
    E --> H[ActiveControllerCount sum != 1]
    F --> I[UnderReplicatedPartitions rises]
    G --> I
    I --> J[OfflinePartitionsCount may rise]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Full GC pause longer than session timeout | ZooKeeperExpiresPerSec correlates with GC log timestamps showing pauses over 18 s | jstat -gcutil <pid> 1000 5 or GC logs for Old Gen collection time |
| Network partition or latency to ZK | One broker affected, ZooKeeperRequestLatencyMs elevated, TCP retransmits high | Network path latency and ss -tnp output |
| ZK ensemble overload or slow transaction log | Multiple brokers expiring simultaneously, ZK request latency p99 over 1 s | Disk latency on ZK nodes via iostat -x |
| Broker CPU starvation or container throttling | High CPU throttled time, broker process runnable but slow to respond | /sys/fs/cgroup/cpu.stat or OS CPU metrics |

Quick checks

# Check ZK session expiration rate on the broker
# JMX port depends on your broker configuration; 9999 is used here as an example
echo "get -b kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check active controller count (run against every broker and sum the values; the sum should be 1)
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check under-replicated partitions on this broker (sum across brokers for the cluster-wide count)
echo "get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value" | java -jar jmxterm.jar -l localhost:9999

# Check GC behavior (5 samples, 1 s apart)
jstat -gcutil $(pgrep -f kafka.Kafka) 1000 5

# Check ZK request latency p99
echo "get -b kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Check ISR shrink rate
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check broker process uptime (a long uptime means the process survived the expiry and was not restarted)
ps -o pid,comm,etime -p $(pgrep -f kafka.Kafka)

# Check disk latency on the broker (5 samples)
iostat -xz 1 5

# Check controller event queue depth (run against the current controller)
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

How to diagnose it

  1. Confirm the session expiry. Query ZooKeeperExpiresPerSec on the affected broker. Any nonzero OneMinuteRate outside a rolling restart is abnormal.
  2. Determine scope. Is one broker affected or many? Was the broker the active controller? Check ActiveControllerCount on all brokers and sum the values.
  3. Correlate with GC. Overlay the expiry timestamp with GC logs. If a Full GC pause exceeds zookeeper.session.timeout.ms, you have the root cause. Also check broker logs for OutOfMemoryError or memory pressure warnings preceding the pause.
  4. Check ZK health. If GC is clean, look at ZooKeeperRequestLatencyMs p99. If it is elevated across multiple brokers, inspect the ZK ensemble disk and CPU.
  5. Verify the network path. Check for packet loss or latency spikes between the broker and ZK nodes. Firewall rule changes, DNS resolution delays, or routing asymmetry can stall heartbeats.
  6. Measure impact. Check UnderReplicatedPartitions, OfflinePartitionsCount, and LeaderElectionRateAndTimeMs. A nonzero UnderMinIsrPartitionCount indicates that producers using acks=all are being rejected with NotEnoughReplicasException.
  7. Identify if ISR shrinks are continuing. Sustained IsrShrinksPerSec means replicas are still falling behind. A transient spike followed by IsrExpandsPerSec means recovery is in progress.
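
The GC correlation in step 3 can be partly automated. What follows is a minimal sketch, assuming the broker writes JDK 9+ unified GC logs in which each pause line contains the word Pause and ends with a duration such as 19234.567ms. The function name long_gc_pauses and the log path in the usage comment are hypothetical, and JDK 8 style logs would need a different pattern.

```shell
# Print GC pause lines whose duration exceeds a threshold in milliseconds.
# Assumes JDK 9+ unified logging, where a pause line looks like:
#   [..][gc] GC(42) Pause Full (Allocation Failure) 5000M->4000M(6000M) 19234.567ms
# Usage (path is an example): long_gc_pauses 18000 < /path/to/kafkaServer-gc.log
long_gc_pauses() {
  awk -v limit="$1" '
    / Pause / && $NF ~ /^[0-9.]+ms$/ {
      ms = $NF
      sub(/ms$/, "", ms)                 # strip the unit, keep the number
      if (ms + 0 > limit + 0) print      # numeric comparison against the threshold
    }'
}
```

Passing the value of zookeeper.session.timeout.ms as the threshold lists only the pauses long enough to expire a session on their own; a lower threshold shows how close the broker has been getting.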

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ZooKeeperExpiresPerSec | Direct signal of ZK declaring the broker dead | OneMinuteRate > 0 outside maintenance |
| ActiveControllerCount | Exactly one broker must report 1 | Cluster-wide sum != 1 for more than 30 s |
| UnderReplicatedPartitions | Durability window is open | Nonzero and growing |
| UnderMinIsrPartitionCount | Confirms writes are being rejected | Nonzero for more than 2 min |
| IsrShrinksPerSec | Velocity of replica loss from ISR | Sustained > 0 for more than 5 min |
| OfflinePartitionsCount | Partitions with no leader | Nonzero for more than 60 s |
| ZooKeeperRequestLatencyMs | Early indicator of ZK pressure | p99 > 100 ms sustained |
| GC Old Gen CollectionTime | GC pauses directly cause expiry | Full GC pause approaching 18 s |
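
The ActiveControllerCount check is a cluster-wide sum, so per-broker values have to be combined before they mean anything. A minimal sketch of that aggregation, assuming you have already collected one "broker value" pair per line by whatever means you use (jmxterm, a metrics API, or similar); the function name check_controller_sum is hypothetical.

```shell
# Sum per-broker ActiveControllerCount values read from stdin.
# Input format (an assumption of this sketch): "<broker> <value>" per line.
check_controller_sum() {
  awk '
    NF >= 2 {
      sum += $2
      if ($2 == 1) holders = holders " " $1   # remember who claims the role
    }
    END {
      if (sum == 1)      printf "OK controller=%s\n", substr(holders, 2)
      else if (sum == 0) print  "ALERT no active controller"
      else               printf "ALERT %d controllers:%s\n", sum, holders
    }'
}
```

For example, `printf 'b1 0\nb2 1\nb3 0\n' | check_controller_sum` reports b2 as the single controller. A sum of 0 or more than 1 only becomes a warning sign once it persists past the 30 s window in the table, so an alerting wrapper would need to track duration as well.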

Fixes

GC pauses exceeding the session timeout

The default zookeeper.session.timeout.ms is 18 s. If your GC logs show Full GC pauses approaching or exceeding this, you have three levers. First, reduce heap pressure. Avoid oversizing the heap; large heaps increase pause times and take memory away from the OS page cache. Look for message down-conversion or memory leaks. Second, use a low-pause collector such as G1GC, ZGC, or Shenandoah. Kafka’s startup scripts select G1 by default, so check whether custom JVM options have overridden it. Third, you may increase zookeeper.session.timeout.ms, but this buys tolerance for pauses at the cost of slower failure detection: a broker that is genuinely dead stays in the live set longer, which lengthens the outage window. Do not increase the timeout without also fixing the GC.
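
One caveat if you do raise the timeout: the ZooKeeper server negotiates the session timeout and clamps the client's request between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times the server's tickTime. A minimal sketch of that arithmetic, assuming the ensemble has not overridden those two settings; the function name effective_session_timeout is hypothetical.

```shell
# Compute the session timeout ZooKeeper would actually grant.
# $1 = requested zookeeper.session.timeout.ms
# $2 = ZK server tickTime in ms (defaults to 2000, the value in the sample zoo.cfg)
# Assumes minSessionTimeout/maxSessionTimeout are left at 2x and 20x tickTime.
effective_session_timeout() (
  requested=$1
  tick=${2:-2000}
  min=$((tick * 2))
  max=$((tick * 20))
  if   [ "$requested" -lt "$min" ]; then echo "$min"
  elif [ "$requested" -gt "$max" ]; then echo "$max"
  else echo "$requested"
  fi
)
```

With tickTime=2000, `effective_session_timeout 60000` yields 40000, so a broker configured for 60 s would still expire after 40 s unless the ensemble's maxSessionTimeout is raised too.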

Network partition between broker and ZK

If only one broker is affected and GC is clean, inspect the network path. Check for TCP retransmits and latency spikes from the broker to ZK nodes. Firewall rule changes, DNS resolution delays, or routing asymmetry can all stall ZK heartbeats. Fix the network path rather than tuning Kafka.

ZK ensemble overload

If multiple brokers expire simultaneously, the ZK ensemble itself is likely the bottleneck. ZK is sensitive to transaction log disk latency. Ensure ZK runs on dedicated nodes with SSD-backed transaction logs. Do not share the ZK ensemble with other systems like Hadoop or Solr. Check disk latency on ZK nodes via iostat -x.
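
Alongside disk latency, the ensemble's own counters can show whether requests are queuing. A minimal sketch, assuming the mntr four-letter command is enabled on the ZK servers (ZooKeeper 3.5 and later require it to be listed in 4lw.commands.whitelist) and that nc is installed; the host name in the usage comment and the function name zk_mntr_summary are hypothetical.

```shell
# Reduce ZooKeeper "mntr" output (tab-separated key/value lines) to the
# fields most relevant to session expiry.
# Usage (host is an example): echo mntr | nc zk1.example.internal 2181 | zk_mntr_summary
zk_mntr_summary() {
  awk -F'\t' '
    $1 == "zk_avg_latency" || $1 == "zk_max_latency" ||
    $1 == "zk_outstanding_requests" || $1 == "zk_server_state" {
      printf "%s=%s\n", $1, $2
    }'
}
```

Running this against each ensemble member gives a quick side-by-side view; a growing zk_outstanding_requests on the leader is consistent with the slow transaction log scenario described above.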

Controller session expiry

When the controller loses its ZK session, the cluster loses the ability to process metadata changes until a new controller is elected. Warning: Do not restart additional brokers during this window; each restart generates more controller events and prolongs recovery. Monitor the controller event queue on the new controller to ensure it drains. Once the cluster stabilizes, investigate the root cause on the former controller just as you would for any other broker.
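
To find which broker to watch after a controller change, you can read the /controller znode, which in ZK mode holds a small JSON document containing a brokerid field. A minimal sketch, assuming the zookeeper-shell.sh script shipped with Kafka is available; the ZK address in the usage comment and the function name controller_broker_id are hypothetical.

```shell
# Extract the broker id from the JSON stored in the /controller znode.
# Non-JSON lines printed by the shell (connection banners, watcher notices)
# are ignored because they do not contain a "brokerid" field.
# Usage (address is an example):
#   bin/zookeeper-shell.sh zk1.example.internal:2181 get /controller | controller_broker_id
controller_broker_id() {
  sed -n 's/.*"brokerid":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

An empty result means no controller znode was found, which is expected briefly during the election window and a problem if it persists.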

Prevention

  • Alert on Full GC pause duration with a threshold well below zookeeper.session.timeout.ms. A 5-second pause warrants investigation; a 10-second pause is an incident waiting to happen.
  • Keep ZK infrastructure isolated and monitored independently. Track ZK transaction log disk latency.
  • Do not set zookeeper.session.timeout.ms aggressively low to “fail fast” if your JVM cannot guarantee sub-second GC pauses.
  • Monitor IsrShrinksPerSec as an early warning. ISR shrinks often precede session expiry when the root cause is gradual degradation.
  • During rolling restarts, expect brief ZK session transitions. If a broker does not rejoin the ISR within one to two times replica.lag.time.max.ms, investigate before proceeding to the next broker.
  • Test failure recovery time by gracefully shutting down one broker and measuring how long ISR recovery takes. If recovery approaches the session timeout, you have insufficient headroom.
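
The GC alerting thresholds in the first bullet can be expressed as a small classifier. A minimal sketch, assuming pause durations are available as whole milliseconds; the function name classify_gc_pause and the level names are hypothetical, while the 5 s and 10 s cut-offs are the ones suggested above.

```shell
# Classify a GC pause against the alerting thresholds from this guide.
# $1 = pause duration in whole milliseconds
# $2 = zookeeper.session.timeout.ms (defaults to 18000)
classify_gc_pause() {
  if   [ "$1" -ge "${2:-18000}" ]; then echo "exceeds-timeout"
  elif [ "$1" -ge 10000 ];         then echo "critical"
  elif [ "$1" -ge 5000 ];          then echo "warning"
  else                                  echo "ok"
  fi
}
```

The session timeout is checked first so that a cluster still running with a shorter timeout, such as the older 6000 ms default, reports a 7-second pause as exceeding it rather than as a mere warning.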

How Netdata helps

  • Correlates ZooKeeperExpiresPerSec with JVM GC metrics and OS-level CPU and disk latency.
  • Collects ActiveControllerCount, UnderReplicatedPartitions, and IsrShrinksPerSec from JMX.
  • Surfaces request latency breakdowns to distinguish GC-induced LocalTimeMs spikes from network-induced delays.
  • Tracks ZooKeeperRequestLatencyMs to detect ZK ensemble pressure before it triggers session timeouts.
  • Provides page fault and disk latency metrics from the OS to catch hardware degradation that indirectly causes GC or network stalls.