ClickHouse Keeper latency high: the early warning before sessions expire

INSERTs into replicated tables slow down, ON CLUSTER DDL hangs, and the replication queue grows on lagging replicas. SELECT 1 and HTTP /ping stay healthy, and non-replicated tables are fine. The culprit is usually the coordination service, not ClickHouse itself.

Rising ZooKeeper or ClickHouse Keeper operation latency is a leading indicator of session expiry. Replicated inserts, replication log updates, and distributed DDL all round-trip through Keeper. Because Keeper writes its transaction log synchronously, disk I/O on the Keeper node is the most common bottleneck. A degraded-but-connected coordination service is worse than a hard partition: it silently slows every replicated operation until sessions start expiring and replicas flip to readonly. This article explains how to read the early signals, isolate the cause, and fix it before sessions expire.

flowchart TD
    A[Keeper/ZK transaction log fsync slow] --> B[Operation latency rises]
    B --> C[Replicated insert and DDL round-trips slow]
    C --> D[Replication queue grows]
    C --> E[Insert latency rises]
    B --> F[Session timeout risk]
    F --> G[Replica session expires]
    G --> H[Replicas become readonly]
    H --> I[Replicated writes fail]

What this means

“Keeper latency high” means the round-trip time for operations against ZooKeeper or ClickHouse Keeper is elevated from ClickHouse’s perspective. Baseline latency varies by network, but a local ensemble typically responds in single-digit milliseconds. When operation latency nears the negotiated session timeout, heartbeats can time out. Once a session expires, the replica drops its ephemeral nodes, flips to readonly, and must re-register before accepting writes.

The damage happens in stages. First, replicated inserts and DDL slow down because they wait on Keeper. Then lagging replicas fall further behind because replication queue entries cannot be acknowledged quickly. If the session is lost, the replica becomes readonly until it reconnects and re-establishes its ephemeral nodes. During this window, writes to affected replicated tables can fail or require retries, even though the ClickHouse process appears healthy.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Keeper transaction log disk bottleneck | Latency spikes correlate with high disk await on the Keeper node; mntr shows elevated average latency. | `echo mntr |
| Too many replicated tables or watches | zk_znode_count and zk_watch_count are high or growing fast; latency rises with table count. | Count replicated tables and compare with historical znode and watch counts from mntr. |
| DDL or metadata storm | ON CLUSTER operations hang; system.distributed_ddl_queue shows unfinished entries; latency spikes during schema changes. | system.distributed_ddl_queue status and the rate of new DDL in system.query_log. |
| Network degradation short of partition | Keeper is reachable but RTT is high or retransmits are present; ClickHouse-side wait grows faster than server-side latency. | ss -i or netstat -s for retransmits; compare RTT between ClickHouse and Keeper hosts. |
| JVM GC pauses on external ZooKeeper | Regular latency spikes on ZooKeeper but not on ClickHouse Keeper; GC logs show long pauses. | ZooKeeper JVM GC logs and heap usage. This does not apply to ClickHouse Keeper. |

Quick checks

Run these read-only checks to confirm the symptom and narrow the cause.

# Test basic Keeper/ZK responsiveness
echo ruok | nc -w 2 <keeper-host> 2181

# Check Keeper server-side metrics; look for avg latency, znode count, watch count
echo mntr | nc -w 2 <keeper-host> 2181

# For built-in ClickHouse Keeper on port 9181
echo mntr | nc -w 2 localhost 9181

-- Check ClickHouse session health and negotiated timeout
SELECT name, host, port, is_expired, session_uptime_elapsed_seconds, session_timeout_ms
FROM system.zookeeper_connection;

-- Check replica session state
SELECT database, table, is_readonly, is_session_expired, total_replicas, active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';

-- Check cumulative ZooKeeper wait time from ClickHouse perspective
SELECT event, value
FROM system.events
WHERE event LIKE 'ZooKeeper%';

-- Check replication queue growth
SELECT database, table, queue_size, absolute_delay
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;

-- Check for DDL adding metadata pressure
SELECT entry, query, status, exception_text
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time DESC;

# Check disk I/O on the Keeper host; high await points to a transaction log bottleneck
iostat -xz 1 5
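
The mntr output is long, and only a handful of fields matter for this symptom. As a minimal sketch, the filter below keeps the latency, metadata, and backlog fields. The function name mntr_summary is illustrative, and the field list assumes your ZooKeeper or Keeper build reports these keys in the usual one-pair-per-line mntr layout.

```shell
# Keep only the latency, metadata, and backlog fields from mntr output read on stdin.
# Assumes one "key<whitespace>value" pair per line, as mntr prints it.
mntr_summary() {
  awk '$1 ~ /^zk_(avg_latency|max_latency|znode_count|watch_count|outstanding_requests)$/ {
    print $1 "=" $2
  }'
}

# Usage (host and port are placeholders):
#   echo mntr | nc -w 2 <keeper-host> 2181 | mntr_summary
```

Running it on a schedule and keeping the output gives you the historical znode and watch counts that the later checks compare against.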

How to diagnose it

  1. Confirm latency from both sides. ZooKeeperWaitMicroseconds in system.events is a cumulative counter, so compare its growth over an interval (ideally divided by the growth of ZooKeeperTransactions over the same interval) with the average latency reported by Keeper’s mntr command. If the server-side latency is low but ClickHouse wait is high, suspect the network path. If both are high, the bottleneck is on the Keeper node.
  2. Inspect session health. Query system.zookeeper_connection. If is_expired = 1, sessions are already dropping. A session_uptime_elapsed_seconds value that keeps resetting to a small number means the session is being re-established repeatedly.
  3. Correlate with replica state. Query system.replicas for is_readonly and is_session_expired. If these appear on multiple tables simultaneously, the coordination service is the common factor.
  4. Examine Keeper server metrics. From mntr, watch zk_avg_latency, zk_znode_count, zk_watch_count, and zk_outstanding_requests. Rising znode or watch counts with rising latency point to metadata overload.
  5. Check the disk under the transaction log. On the Keeper host, use iostat to measure await on the volume that holds the transaction log. Sustained high await on the log volume is the smoking gun. Also check that the log directory is not filling.
  6. Look for metadata churn. Check system.distributed_ddl_queue for stuck entries and system.query_log for recent DDL. A large number of replicated tables, rapid schema changes, or a thundering herd of reconnecting nodes can all raise load.
  7. Rule out network issues. Measure RTT and retransmits between ClickHouse and Keeper. If latency is high only from certain ClickHouse nodes, check their network paths.
  8. Choose the fix based on the dominant cause: disk bottleneck, metadata overload, network degradation, or session timeout mismatch.
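
The decision in step 1 can be sketched as a small helper. This is an illustration, not a calibrated rule: the function name keeper_bottleneck is hypothetical, the 10 ms default threshold is an assumption loosely based on the single-digit-millisecond baseline mentioned earlier, and dividing the growth of ZooKeeperWaitMicroseconds by the growth of ZooKeeperTransactions is only an approximation of per-operation wait.

```shell
# Classify where Keeper latency comes from.
#   $1 = server-side average latency in ms (zk_avg_latency from mntr)
#   $2 = growth of ZooKeeperWaitMicroseconds over the sampling interval
#   $3 = growth of ZooKeeperTransactions over the same interval
#   $4 = threshold in ms (optional, assumed default: 10)
keeper_bottleneck() {
  awk -v server_ms="$1" -v wait_us="$2" -v txns="$3" -v limit_ms="${4:-10}" 'BEGIN {
    if (txns <= 0) { print "no-data"; exit }
    client_ms = wait_us / txns / 1000        # approximate wait per operation, in ms
    if (client_ms < limit_ms)       verdict = "ok"
    else if (server_ms >= limit_ms) verdict = "keeper-node"   # both sides slow
    else                            verdict = "network-path"  # only the client side is slow
    printf "client_avg_ms=%.1f server_avg_ms=%s verdict=%s\n", client_ms, server_ms, verdict
  }'
}

# Example: the server reports 2 ms, but ClickHouse waited 6 s across 100 operations.
#   keeper_bottleneck 2 6000000 100
```

Pick the threshold from your own baseline. A cross-zone ensemble may sit well above 10 ms when healthy.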

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Keeper average operation latency | Direct measure of coordination health. | Sustained upward trend, or approaching session_timeout_ms. |
| ZooKeeperWaitMicroseconds rate | ClickHouse-side time spent waiting on Keeper. | Sustained upward trend or step change. |
| is_expired from system.zookeeper_connection / is_session_expired from system.replicas | Indicates session flapping before hard failures. | is_expired = 1 or is_session_expired = 1. |
| Replication queue depth | Replicas fall behind when coordination slows. | queue_size growing for > 15 minutes. |
| Insert latency P99 on replicated tables | Replicated inserts include Keeper round-trips. | P99 > 2x baseline without insert rate change. |
| system.distributed_ddl_queue status | DDL stalls when Keeper is slow. | Entries stuck in non-Finished state. |
| ZK znode and watch count | Metadata overhead drives load. | Rapid growth or unusually high absolute values. |
| Keeper transaction log disk I/O wait | Synchronous tx log makes disk the usual bottleneck. | await elevated on the transaction log volume. |
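
ZooKeeperWaitMicroseconds is cumulative, so the useful signal is its rate between two samples. A minimal sketch of that calculation follows. The helper name counter_rate is hypothetical, and it treats a counter that went backwards as a restart rather than guessing a rate.

```shell
# Per-second rate of a cumulative counter between two samples.
#   $1 = previous value, $2 = current value, $3 = seconds between the samples
# Prints "n/a" if the interval is invalid or the counter went backwards
# (system.events counters start again from zero when the server restarts).
counter_rate() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    if (secs <= 0 || curr < prev) { print "n/a"; exit }
    printf "%.0f\n", (curr - prev) / secs
  }'
}

# One way to take a sample:
#   clickhouse-client -q "SELECT value FROM system.events WHERE event = 'ZooKeeperWaitMicroseconds'"
# Example: 3,000,000 us of extra wait over 60 s
#   counter_rate 1000000 4000000 60
```

The result is microseconds of wait per second of wall time. The counter sums across threads, so a value near 1,000,000 means roughly one thread was blocked on Keeper the whole time.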

Fixes

Keeper transaction log disk bottleneck

Move the ZooKeeper transaction log (dataLogDir) to a dedicated, low-latency disk, preferably NVMe, separate from both ClickHouse data and application logs. ZooKeeper fsyncs every write before responding, so spinning disks, shared volumes, or exhausted SSDs directly raise operation latency. For built-in ClickHouse Keeper, place the Keeper log directory on fast storage and avoid sharing it with the ClickHouse data volume.

Tradeoffs: Changing dataLogDir for external ZooKeeper or the Keeper log path for built-in Keeper requires a restart of the coordination node. Schedule this after stabilizing the cluster, and never restart multiple Keeper nodes at once if it risks quorum loss.
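
Before and after moving the log, it helps to read write latency for the specific device rather than scanning the full iostat table. The sketch below finds the w_await column by header name, on the assumption that your sysstat version labels it that way. The function name iostat_w_await is illustrative.

```shell
# Print "<device> <w_await>" for each device row of `iostat -x` output on stdin.
# The column is located by header name because its position differs between sysstat versions.
iostat_w_await() {
  awk '
    NF == 0        { col = 0; next }   # a blank line ends a device table
    $1 ~ /^Device/ {                   # header row: find where w_await sits
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "w_await") col = i
      next
    }
    col && NF >= col { print $1, $col }
  '
}

# Usage: the first report covers the time since boot, so judge by the later ones.
#   iostat -xz 1 5 | iostat_w_await
```

Match the device name against the volume that holds the transaction log. High w_await on an unrelated disk is not evidence of a Keeper bottleneck.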

Metadata overload from tables, watches, or DDL

Pause non-essential DDL, especially ON CLUSTER operations, until latency recovers. Reduce the number of replicated tables where possible by consolidating tables or using non-replicated engines for transient data. If you use a large replicated_deduplication_window, review whether it is causing excessive znode growth.

Tradeoffs: Pausing DDL delays schema changes. Converting tables to non-replicated engines gives up replication for them, so their data lives on a single node.

Network path degradation

Fix routing, MTU mismatches, or packet loss between ClickHouse and Keeper. Keep the Keeper ensemble in the same low-latency network as the ClickHouse cluster. High RTT or retransmits amplify the effect of every synchronous Keeper operation.

Tradeoffs: Network changes carry their own risk and may require coordination with network or cloud infrastructure teams.
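
To check the path from a ClickHouse host, the per-socket details printed by ss can be reduced to the two numbers that matter here. This sketch assumes the iproute2 layout in which details appear as rtt:<srtt>/<rttvar> and retrans:<in-flight>/<total>. Verify that against your own ss output, since fields vary by kernel and version. The function name ss_rtt_summary is hypothetical.

```shell
# Summarise smoothed RTT (ms) and total retransmits per TCP socket from `ss -ti` output on stdin.
# Assumed field formats: "rtt:<srtt>/<rttvar>" and "retrans:<in-flight>/<total>".
ss_rtt_summary() {
  awk '{
    rtt = ""; retrans = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^rtt:/)     { split(substr($i, 5), a, "/"); rtt = a[1] }
      if ($i ~ /^retrans:/) { split(substr($i, 9), b, "/"); retrans = b[2] }
    }
    if (rtt != "") print "rtt_ms=" rtt, "retrans_total=" retrans
  }'
}

# Usage on a ClickHouse host (2181 is a placeholder for your Keeper port):
#   ss -ti '( dport = :2181 )' | ss_rtt_summary
```

Run it on several ClickHouse nodes. An RTT that is high from only some of them points at their network path rather than at Keeper.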

Session timeout tuning as a temporary buffer

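If you need immediate relief while fixing the root cause, increase the session timeout (session_timeout_ms in the zookeeper section of the ClickHouse server configuration). The coordination service can cap the value it negotiates, so confirm the effective timeout in system.zookeeper_connection. This reduces session flapping but does not fix the underlying latency problem.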

Tradeoffs: Longer timeouts mask coordination degradation and prolong stale reads during true partitions. Treat this as a temporary bridge, not a fix.

Prevention

  • Monitor Keeper operation latency as a leading indicator, not just process liveness. Liveness checks miss the degraded-but-connected state.
  • Keep the Keeper transaction log on dedicated fast storage with enough headroom. Watch the log directory for growth and the underlying disk for latency.
  • Limit replicated table sprawl. Each replicated table adds znodes and watches; excessive table counts are a common root cause of Keeper saturation.
  • Gate DDL during incidents. A DDL storm on an already slow Keeper can push it over the edge.
  • Set alerts on the rate of ZooKeeperWaitMicroseconds, state changes in system.zookeeper_connection, and the derivative of replication queue depth.
  • Establish baselines during low-load windows using mntr and system.zookeeper_connection so you can spot deviations early.
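
The replication queue alert above can be prototyped before it goes into a monitoring system. The sketch below reads timestamped queue depth samples and reports how long the depth has been rising without a pause. The helper name queue_growth_check is hypothetical, and the 900-second default is an assumption that mirrors the 15-minute warning sign in the signals table.

```shell
# Read "<epoch_seconds> <queue_size>" samples (oldest first) on stdin and report
# how long the queue has been growing without interruption.
#   $1 = minimum streak in seconds to flag (optional, assumed default: 900)
queue_growth_check() {
  awk -v min_secs="${1:-900}" '
    NR == 1 || $2 + 0 <= prev { start = $1 + 0 }   # first sample, or growth paused: restart the streak
    { prev = $2 + 0; last = $1 + 0 }
    END {
      if (NR == 0) { print "no-data"; exit }
      streak = last - start
      verdict = (streak >= min_secs) ? "sustained-growth" : "ok"
      print verdict, "streak_s=" streak
    }'
}

# One way to collect a sample line:
#   clickhouse-client -q "SELECT toUnixTimestamp(now()), sum(queue_size) FROM system.replicas"
```

A strict "every sample is larger than the last" rule is deliberately simple. Noisy queues may need smoothing before a streak means anything.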

How Netdata helps

  • Correlate ClickHouse ZooKeeperWaitMicroseconds with host disk await on Keeper nodes to isolate transaction log disk saturation.
  • Track is_expired from system.zookeeper_connection against is_readonly and is_session_expired from system.replicas.
  • Alert on insert latency P99 and DelayedInserts before RejectedInserts appear.
  • Plot replication queue depth derivatives and distributed DDL queue status alongside query error rates to separate coordination issues from query issues.