Kafka ZooKeeper request latency high: metadata slowdowns in ZK mode

If you run Kafka in ZooKeeper mode and kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs is climbing, a sustained p99 above 100 ms means the controller event queue is already backing up. Above 1 s, you approach the zookeeper.session.timeout.ms boundary and brokers risk session expiry. In ZK mode, every leader election, ISR change, and topic update flows through the ensemble. When ZK stalls, the controller stalls, metadata propagation freezes, and the cluster degrades in a way that looks like a broker problem but originates in the metadata plane.

This guide shows how to distinguish ZK-side slowness from local broker issues, which thresholds are safe, and how to stop the cascade before sessions expire. If you run Kafka 4.0 or later, it does not apply; ZK mode was removed in 4.0.

What this means

ZooKeeperRequestLatencyMs is the broker-side measurement of how long each request to the ZooKeeper ensemble takes. The controller commits partition state, updates ISR membership, registers brokers, and handles leader elections through these requests. When p99 rises from a healthy sub-10 ms to 100 ms, each ZK-bound controller event takes roughly an order of magnitude longer. Because the controller processes events sequentially, slow ZK operations back up the controller event queue, and partitions needing new leaders or ISR updates wait in line.

If p99 exceeds 1 s, the margin before session timeout narrows dangerously. The default zookeeper.session.timeout.ms is 18,000 ms. A broker whose session is not kept alive within that window is expelled: ZooKeeper expires the session and removes the broker’s ephemeral registration. Session expiry triggers immediate leader elections for every partition led by that broker, ISRs shrink, and OfflinePartitionsCount can grow. In extreme cases, multiple brokers expire in succession, generating a metadata storm that the controller cannot clear.

flowchart TD
    A[High ZK request latency] --> B[Controller event queue backup]
    A --> C[Slow ISR and metadata writes]
    B --> D[Leader election delays]
    C --> E[UnderReplicatedPartitions]
    D --> F[OfflinePartitionsCount]
    E --> F
    A --> G{Approaches session timeout}
    G --> H[Broker session expiry]
    H --> I[Controller failover]
    I --> B

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Shared ZK ensemble overload | Latency spikes correlate with traffic from other services or clusters on the same ensemble. | Confirm whether other applications share the zookeeper.connect string. |
| ZK server disk I/O saturation | ZooKeeperRequestLatencyMs elevated uniformly on all brokers; ZK server disk latency is high. | Inspect disk await on ZK node storage volumes. |
| Network path degradation | Latency elevated only for brokers in specific racks or AZs. | Measure RTT and packet loss from affected brokers to ZK nodes. |
| Metadata volume pressure | Latency climbs with topic and partition count; controller queue depth trends upward in steady state. | Check ControllerEventQueueSize and total partition count. |

Quick checks

# ZK request latency p99
echo "get -b kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# ZK session expiration rate
echo "get -b kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Controller event queue depth
echo "get -b kafka.controller:type=ControllerEventManager,name=EventQueueSize Value" | java -jar jmxterm.jar -l localhost:9999

# Active controller count
echo "get -b kafka.controller:type=KafkaController,name=ActiveControllerCount Value" | java -jar jmxterm.jar -l localhost:9999

# Leader election latency p99
echo "get -b kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Local broker disk latency to rule out I/O stall
iostat -xz 1

# Current ZK session timeout (no output means the 18000 ms default applies)
grep -F 'zookeeper.session.timeout.ms' /etc/kafka/server.properties

# Network reachability to a ZK node
ping -c 5 <zk-host>
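
The JMX reads above differ only in the bean and attribute, so they can be folded into a small wrapper. This is a minimal sketch, assuming jmxterm.jar sits in the current directory and the broker exposes JMX on localhost:9999 as in the commands above. The function names jmx_query and jmx_get and the variables JMXTERM_JAR and JMX_ADDR are hypothetical, not part of Kafka or jmxterm.

```shell
# Hypothetical helpers. JMXTERM_JAR and JMX_ADDR are assumed names whose
# defaults mirror the one-liners above.
JMXTERM_JAR="${JMXTERM_JAR:-jmxterm.jar}"
JMX_ADDR="${JMX_ADDR:-localhost:9999}"

# Build the jmxterm "get" command for one bean attribute.
jmx_query() {
    printf 'get -b %s %s\n' "$1" "$2"
}

# Send the query to the broker through the same pipeline used above.
jmx_get() {
    jmx_query "$1" "$2" | java -jar "$JMXTERM_JAR" -l "$JMX_ADDR"
}

# Usage (needs a reachable broker JMX port):
# jmx_get kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs 99thPercentile
# jmx_get kafka.controller:type=ControllerEventManager,name=EventQueueSize Value
```

Keeping the command construction in its own function makes it possible to check the generated query text without a running broker.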

How to diagnose it

  1. Map the scope. Compare ZooKeeperRequestLatencyMs across all brokers. If every broker shows elevated p99, the ensemble or the network path to it is the problem. If only one broker is elevated, inspect that broker’s network path and local resource contention first. Asymmetric latency often points to a routing issue rather than ensemble-wide overload.

  2. Check controller queue depth. On the active controller (the broker reporting ActiveControllerCount=1), read ControllerEventQueueSize, which is the EventQueueSize attribute of kafka.controller:type=ControllerEventManager. If it is growing while ZK latency is high, the controller is bottlenecked on ZK commits. Do not restart brokers or initiate reassignments; new events will lengthen the queue.

  3. Look for session expiry. Read ZooKeeperExpiresPerSec on all brokers. Any nonzero rate means brokers are already being expelled. Even a single expiring broker can shift hundreds or thousands of partitions. Check broker logs for session timeout messages to confirm.

  4. Identify ensemble consumers. Determine whether the zookeeper.connect ensemble is shared with other applications or other Kafka clusters. Other Kafka clusters using the same ensemble are easy to overlook and are a common silent cause of degraded metadata throughput.

  5. Verify leader election health. Read LeaderElectionRateAndTimeMs p99. If elections are taking seconds instead of milliseconds, the controller is starved by ZK latency. Election latency is a trailing indicator; it rises after ZK latency but confirms operational impact.

  6. Rule out local broker saturation. Check RequestHandlerAvgIdlePercent and disk await on the brokers. If local disk await is high, fix the broker before blaming ZK. Healthy local metrics confirm the bottleneck is external to the broker’s data plane.

  7. Assess metadata volume. If ControllerEventQueueSize trends upward during normal operations and total partition count is high, the cluster may be approaching the controller’s ZK throughput limit. There is no universal hard limit, but clusters beyond several hundred thousand partitions often see nonlinear ZK latency growth.
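
Step 1 above (mapping the scope) can be sketched as a small filter. It assumes you have already collected one "broker_id p99_ms" pair per line, for example from the JMX reads in Quick checks. The function name classify_scope, the variable THRESHOLD_MS, and the output labels are hypothetical; the 100 ms default is the warning level used throughout this guide.

```shell
# Reads "broker_id p99_ms" lines on stdin and reports how widespread the
# elevated ZK latency is. THRESHOLD_MS is an assumed override; default 100.
classify_scope() {
    awk -v limit="${THRESHOLD_MS:-100}" '
        NF >= 2 {
            total++
            if ($2 + 0 > limit) { high++; ids = ids " " $1 }
        }
        END {
            if (total == 0)         print "no-data"
            else if (high == 0)     print "healthy"
            else if (high == total) print "ensemble-wide"
            else                    print "partial:" ids
        }'
}

# Usage:
# printf '1 12\n2 340\n3 15\n' | classify_scope
```

An "ensemble-wide" result points at the ensemble or the shared network path; a "partial" result lists the brokers whose own network path and local contention deserve a look first.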

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| ZooKeeperRequestLatencyMs p99 | Broker-observed ZK slowness that throttles every metadata operation. | p99 > 100 ms sustained |
| ZooKeeperExpiresPerSec | Brokers being expelled because their ZK sessions expired. | Any nonzero rate outside maintenance |
| ControllerEventQueueSize | Queue backup means metadata changes are delayed. | Sustained above 100, or growing past 1000 |
| ActiveControllerCount | Loss of controller halts all metadata recovery. | Cluster-wide sum != 1 |
| LeaderElectionRateAndTimeMs p99 | Slow elections mean the controller is starved by ZK latency. | p99 > 1 s sustained |
| OfflinePartitionsCount | Result when the controller cannot elect leaders in time. | Nonzero for more than 60 s |
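
The warning signs in the table can be expressed as a single-sample check. This is a rough sketch only: the function name check_signal and the short signal keys are hypothetical, and the "sustained" and "for more than 60 s" qualifiers need several consecutive samples, which this function does not model. The "crit" level is used only where the guide gives a second threshold (1 s for ZK p99, 1000 for queue depth).

```shell
# check_signal NAME VALUE -> prints "ok", "warn", "crit", or "unknown".
# Signal keys are made up for this sketch; thresholds come from the table.
check_signal() {
    awk -v name="$1" -v v="$2" 'BEGIN {
        v += 0
        r = "ok"
        if (name == "zk_p99_ms") {
            if (v > 1000) r = "crit"
            else if (v > 100) r = "warn"
        } else if (name == "event_queue_size") {
            if (v > 1000) r = "crit"
            else if (v > 100) r = "warn"
        } else if (name == "zk_expires_per_sec" || name == "offline_partitions") {
            if (v > 0) r = "warn"
        } else if (name == "active_controllers") {
            # Cluster-wide sum must be exactly one.
            if (v != 1) r = "warn"
        } else if (name == "election_p99_ms") {
            if (v > 1000) r = "warn"
        } else {
            r = "unknown"
        }
        print r
    }'
}

# Usage:
# check_signal zk_p99_ms 150
# check_signal active_controllers 1
```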

Fixes

Shared ZK ensemble overload

If other services or Kafka clusters share the ensemble, their load directly competes with controller operations. The durable fix is migrating Kafka to a dedicated ZK ensemble. As a temporary mitigation, stop non-critical clients of the ensemble. Tradeoff: ensemble migration requires planned downtime and client reconfiguration.

ZK server disk contention

ZK is sensitive to write latency on its underlying storage. If ZK nodes run on saturated or shared disks, every write operation slows. Ensure ZK nodes use dedicated, low-latency local storage and are not co-located with I/O-intensive neighbors. Tradeoff: this requires restarting ZK nodes and verifying quorum health.

Network path degradation

If latency is elevated only for brokers in specific racks or availability zones, inspect network routing and packet loss between those brokers and ZK nodes. Pointing the affected brokers at ZK nodes with lower round-trip time, or fixing asymmetric routes, resolves the issue. Tradeoff: zookeeper.connect is read from server.properties at startup, so connection string changes need a rolling broker restart to take effect.

Controller queue backup

When the queue is already backed up, do not generate new metadata events. Avoid topic creation, partition reassignment, and broker restarts until ControllerEventQueueSize drains and ZK latency recovers. Do not attempt to force a controller failover to clear the queue; the new controller will inherit the same ZK latency and the failover itself adds events. If partition count is the root cause, plan a reduction or cluster expansion. Tradeoff: no immediate fix except stopping new work and waiting.
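
The "stop new work" rule can be enforced with a gate in front of any script that creates topics, reassigns partitions, or restarts brokers. This is a sketch under stated assumptions: the function name metadata_change_allowed and the variables QUEUE_MAX and ZK_P99_MAX_MS are hypothetical, and the default limits of 100 simply reuse the warning thresholds from this guide rather than any Kafka-defined value.

```shell
# metadata_change_allowed QUEUE_SIZE ZK_P99_MS
# Succeeds (exit 0) only when both readings are at or below their limits.
# Missing readings fail closed, so an unreadable metric blocks the change.
metadata_change_allowed() {
    [ -n "$1" ] && [ -n "$2" ] || return 1
    awk -v q="$1" -v l="$2" \
        -v qmax="${QUEUE_MAX:-100}" -v lmax="${ZK_P99_MAX_MS:-100}" \
        'BEGIN { exit !((q + 0) <= (qmax + 0) && (l + 0) <= (lmax + 0)) }'
}

# Usage (run_change.sh is a placeholder for your own change script):
# if metadata_change_allowed "$queue_size" "$zk_p99_ms"; then
#     ./run_change.sh
# else
#     echo "controller queue or ZK latency still elevated; holding change" >&2
# fi
```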

Imminent session timeout risk

If p99 is approaching 1 s and a broker is likely to expire before the root cause is fixed, a controlled shutdown triggers cleaner leader elections than an abrupt session timeout. Time this only when the controller queue is draining, because the shutdown itself generates controller events. If multiple brokers are near expiry, stagger shutdowns; a mass shutdown can overload the surviving controller with leader elections. Tradeoff: you trade an orderly transition against the risk of adding queue pressure at the wrong moment.
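
Whether the queue is draining can be judged from a short series of queue depth samples taken before each controlled shutdown. This is a minimal sketch; the function name queue_draining is hypothetical, and the rule it applies (never grows between samples and ends lower than it started) is one reasonable reading of "draining", not a Kafka definition.

```shell
# Reads controller queue depth samples on stdin, oldest first, one per line.
# Succeeds only if there are at least two samples, the queue never grew
# between consecutive samples, and the last sample is below the first.
queue_draining() {
    awk '
        NR == 1 { first = $1 + 0 }
        NR > 1 && $1 + 0 > prev { grew = 1 }
        { prev = $1 + 0; last = prev }
        END { exit !(NR >= 2 && !grew && last < first) }'
}

# Usage:
# printf '900\n700\n400\n' | queue_draining && echo "ok to shut down next broker"
```

When several brokers are near expiry, re-running this check between shutdowns gives a simple way to stagger them: shut down one broker, wait until the samples show the queue draining again, then move on to the next.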

Prevention

  • Dedicated ensembles. Never share a ZK ensemble between Kafka and other services, or between multiple Kafka clusters. Shared ZK is a silent cluster killer.
  • Monitor the p99. Treat ZooKeeperRequestLatencyMs p99 > 100 ms as a ticket threshold and > 1 s as an imminent session-timeout risk.
  • Size ZK storage properly. Ensure ZK nodes use fast local disk for write-heavy workloads and are not co-located with I/O-intensive neighbors.
  • Bound partition count. Keep partitions per broker within tested limits. High partition counts increase the volume of ZK writes and watcher traffic.
  • Game-day broker failure. Test controlled shutdown and recovery while watching ZK latency. If a single broker failure pushes p99 above 100 ms, the ensemble is undersized.

How Netdata helps

  • Correlates ZooKeeperRequestLatencyMs p99 with ControllerEventQueueSize and ActiveControllerCount to show when metadata plane slowness backs up the controller.
  • Surfaces ZooKeeperExpiresPerSec alongside broker JVM and OS disk metrics to distinguish external ZK latency from local broker pressure.
  • Tracks per-broker disk await and RequestHandlerAvgIdlePercent to rule out local I/O saturation before attributing latency to ZK.
  • Supports composite alerting on ZK latency plus controller queue depth, reducing false positives from transient spikes.
  • Maps LeaderElectionRateAndTimeMs increases to the exact windows of elevated ZK latency for post-incident correlation.