ClickHouse ZooKeeper session has expired: causes, recovery, and tuning

When system.replicas shows is_session_expired = 1, the replica has lost its ZooKeeper session and stopped participating in replication. It rejects inserts, coordinated merges, and distributed DDL. Depending on quorum and load-balancer configuration, writes may shift silently to other replicas or halt for entire shards.

is_session_expired often appears alongside is_readonly = 1, but the two are distinct. Session expiration means the coordination session is dead. Read-only means the replica refuses writes. Session expiration is the more severe of the two because it breaks the replica’s contract with the cluster.

Brief session flapping during ZooKeeper leader elections is normal and usually resolves within seconds. Sustained expiration is not. If is_session_expired persists for more than a minute, treat it as an active coordination failure that will cascade into replication lag and stale reads.

What this means

ReplicatedMergeTree tables rely on ZooKeeper or ClickHouse Keeper to agree on leaders, queue merges, and propagate DDL. Each ClickHouse connection holds a negotiated session. When the session expires, ephemeral nodes vanish. The replica loses leadership claims and visibility into the shared replication log.

The replica still answers SELECT from local data. This creates a false sense of health because results may be stale: the replica receives no new parts while disconnected. If the session is not re-established within seconds, the underlying cause is likely still active, and the replica will not recover until that cause is removed.

Common causes

Cause | What it looks like | First thing to check
Aggressive session timeout | is_session_expired flaps during minor ZK latency spikes | session_timeout_ms in system.zookeeper_connection vs ZooKeeperWaitMicroseconds
ZK disk I/O bottleneck | All replicas affected simultaneously; ZK latency high | echo mntr | nc zk-host 2181 for latency and outstanding requests
Network partition | Subset of replicas affected; correlates with network errors | Connectivity from replica to ZK nodes
JVM GC pause (ZooKeeper only) | Periodic session drops matching GC cycles | GC logs on ZK JVM nodes
ZK overload / too many tables | High znode count; thundering herd after any restart | ZK mntr znode count; number of replicated tables
ClickHouse node overload | Node unresponsive; heartbeats fail to ZK | Host CPU, memory, and system.processes

Quick checks

-- Check replica session and readonly state
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    total_replicas,
    active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';

-- Check ZK connection from ClickHouse perspective
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms
FROM system.zookeeper_connection;

-- Live ZK connectivity test
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;

# External ZooKeeper health check
echo ruok | nc <zookeeper-host> 2181
echo mntr | nc <zookeeper-host> 2181

# ClickHouse Keeper health check
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181

-- ZooKeeper client-side latency and errors
SELECT event, value
FROM system.events
WHERE event IN ('ZooKeeperWaitMicroseconds', 'ZooKeeperExceptions');

-- Replication queue stuck entries
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 10;
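
The mntr command prints one tab-separated key/value pair per line, which makes it easy to turn into structured data. The following is a minimal sketch of that parsing, not an official client. The function names and the sample text are hypothetical, and the sample values are made up for illustration.

```python
# Illustrative input only: real mntr output has more keys and different values.
SAMPLE_MNTR = (
    "zk_avg_latency\t12\n"
    "zk_outstanding_requests\t3\n"
    "zk_server_state\tfollower\n"
    "zk_znode_count\t48211\n"
)


def parse_mntr(text):
    """Split each 'key<TAB>value' line of mntr output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # ignore blank or malformed lines that have no tab
            stats[key.strip()] = value.strip()
    return stats


def summarize(stats):
    """Pick out the latency, backlog, and role fields from parsed mntr stats."""
    return {
        # Parsed as float because some server versions print fractional latency.
        "avg_latency": float(stats.get("zk_avg_latency", "nan")),
        "outstanding_requests": int(stats.get("zk_outstanding_requests", "0")),
        "server_state": stats.get("zk_server_state", "unknown"),
    }


if __name__ == "__main__":
    print(summarize(parse_mntr(SAMPLE_MNTR)))
```

In practice the text would come from the nc commands above, captured once per ensemble member, so the three fields can be compared across nodes.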

How to diagnose it

  1. Confirm the blast radius. Query system.replicas. If all replicas across all tables show is_session_expired = 1, suspect a ZK ensemble issue. If only one replica or shard is affected, suspect a network or node-level problem.
  2. Determine if the loss is sustained. Brief transitions during ZK leader election resolve within seconds. If is_session_expired has been 1 for more than a minute, treat it as a sustained failure that requires intervention.
  3. Inspect the negotiated timeout. Query system.zookeeper_connection. ZK clients send heartbeats at one-third of session_timeout_ms. ZooKeeperWaitMicroseconds is a cumulative counter, so compare its growth between samples rather than its absolute value. If the wait per request repeatedly approaches or exceeds the heartbeat interval, heartbeats stall and expiration follows.
  4. Test ZK connectivity from the affected node. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1. If this hangs or fails, the TCP session or ZK operation path is broken.
  5. Check ZK server health. Use echo mntr | nc <zk-host> 2181 (or port 9181 for Keeper). Look for high zk_avg_latency, large zk_outstanding_requests, or a missing leader (zk_server_state should show leader on one node and follower on the others).
  6. Check client-side error counters. ZooKeeperExceptions in system.events tracks failed operations. A rising count confirms degradation even before sessions expire.
  7. Verify network paths. Use ping, mtr, or similar from the ClickHouse host to ZK nodes. Packet loss, asymmetric routes, or MTU issues cause silent timeouts that do not always surface as socket errors.
  8. Review replication backlog. On affected replicas, queue_size and absolute_delay in system.replicas grow. Check system.replication_queue for entries with high num_tries and a non-empty last_exception; these can block automatic recovery.
  9. Watch for thundering herd. If multiple nodes lost their sessions at the same time during an earlier ZK incident, they may reconnect in a burst, spike zk_outstanding_requests, and overload ZK further.
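
The blast-radius reasoning in step 1 can be sketched as a small classifier. This is an illustration under stated assumptions, not a definitive rule: it assumes you have already gathered (host, is_session_expired) pairs from system.replicas on every node, and the function name and labels are hypothetical.

```python
from collections import defaultdict


def classify_blast_radius(rows):
    """Classify an incident from (host, is_session_expired) pairs.

    rows holds one pair per replicated table per host, collected from
    system.replicas on each node (the collection step is assumed).
    Returns a (label, affected_hosts) tuple.
    """
    flags_by_host = defaultdict(list)
    for host, expired in rows:
        flags_by_host[host].append(bool(expired))

    affected = sorted(h for h, flags in flags_by_host.items() if any(flags))
    if not affected:
        return "healthy", affected
    # Every host affected points at the ensemble. With a single host the
    # evidence is ambiguous, so it falls through to the node-level label.
    if len(flags_by_host) > 1 and len(affected) == len(flags_by_host):
        return "suspect-ensemble", affected
    return "suspect-node-or-network", affected


if __name__ == "__main__":
    sample = [("ch-1", 0), ("ch-1", 0), ("ch-2", 1), ("ch-3", 0)]
    print(classify_blast_radius(sample))
```

The labels only narrow down where to look first; steps 4 through 7 are still needed to confirm the cause.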

Metrics and signals to monitor

Signal | Why it matters | Warning sign
system.replicas.is_session_expired | Direct indicator of replica isolation from coordination | Sustained 1 for more than 30 seconds
system.zookeeper_connection.is_expired | Early warning before replica state transitions | Any 1 outside of planned maintenance
system.zookeeper_connection.session_timeout_ms | Determines how much latency headroom exists | Client wait time approaching one-third of this value
ZooKeeperWaitMicroseconds | Client-side coordination latency | Trending upward or elevated above baseline
ZooKeeperExceptions | Failed ZK operations | Any sustained non-zero rate
system.replicas.queue_size | Backlog after disconnection | Growing while sessions are expired
ZK mntr zk_avg_latency | Server-side coordination health | Sustained elevation above baseline
system.replicas.is_readonly | Write availability on the replica | 1 for more than 5 minutes
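
The "sustained" warning signs in the table depend on how long a flag has stayed at 1, not on a single reading. A minimal sketch of that check follows. It assumes chronologically ordered (timestamp, flag) samples from whatever collector you use; the function name and the sample data are hypothetical, while the 30-second threshold comes from the table.

```python
def sustained_seconds(samples):
    """Return how long the flag has been continuously 1 up to the latest sample.

    samples is a chronologically ordered list of (unix_ts, flag) pairs,
    for example periodic readings of is_session_expired.
    """
    run_start = None
    last_ts = None
    for ts, flag in samples:
        if flag:
            if run_start is None:
                run_start = ts  # a new run of 1s begins here
        else:
            run_start = None  # any 0 resets the run
        last_ts = ts
    if run_start is None:
        return 0.0
    return float(last_ts - run_start)


SESSION_EXPIRED_WARN_SECONDS = 30  # threshold taken from the table above

if __name__ == "__main__":
    readings = [(0, 0), (10, 1), (20, 1), (45, 1)]
    duration = sustained_seconds(readings)
    print(duration, duration > SESSION_EXPIRED_WARN_SECONDS)
```

The same function can be applied to is_readonly readings with a 300-second threshold.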

Fixes

Increase the session timeout

If session_timeout_ms is too aggressive for your network or ZK latency, increase it. This trades faster failure detection for stability under transient spikes. The actual timeout is negotiated at connection time and may differ from the configured value. Verify the effective value in system.zookeeper_connection.
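
To judge whether the timeout is too tight, it helps to put numbers on the headroom. The sketch below derives the heartbeat interval from session_timeout_ms and estimates average wait per operation from two samples of the cumulative counters. Using the ZooKeeperTransactions event as the denominator is an assumption of this sketch, and all function names are hypothetical.

```python
def heartbeat_interval_ms(session_timeout_ms):
    """Heartbeats are sent at one-third of the negotiated session timeout."""
    return session_timeout_ms / 3


def avg_wait_ms(prev, curr):
    """Estimate average wait per ZK operation between two counter samples.

    prev and curr are (ZooKeeperWaitMicroseconds, ZooKeeperTransactions)
    tuples read from system.events at two points in time. Both counters
    are cumulative, so only their deltas are meaningful.
    Returns None when no operations happened in the interval.
    """
    wait_delta_us = curr[0] - prev[0]
    txn_delta = curr[1] - prev[1]
    if txn_delta <= 0:
        return None
    return wait_delta_us / txn_delta / 1000.0


def headroom_ratio(avg_ms, session_timeout_ms):
    """Fraction of the heartbeat interval consumed by the average wait."""
    return avg_ms / heartbeat_interval_ms(session_timeout_ms)


if __name__ == "__main__":
    timeout_ms = 30000  # example value; read the real one from system.zookeeper_connection
    wait = avg_wait_ms((1_000_000, 100), (4_000_000, 400))
    print(heartbeat_interval_ms(timeout_ms), wait, headroom_ratio(wait, timeout_ms))
```

An average hides tail latency, so a low ratio does not rule out occasional stalls. Treat a ratio that climbs over time as the signal, rather than any single value.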

Relieve ZK disk I/O pressure

Move ZooKeeper transaction logs to dedicated NVMe storage. Ensure ZK data directories are not co-located with ClickHouse data or logs. If ZK latency is already elevated, do not restart ClickHouse nodes to force reconnection. That creates a thundering herd and worsens the overload.

Resolve network partitions

Restore connectivity between the affected ClickHouse node and the ZK ensemble. Once the path is stable, the replica reconnects automatically. Verify is_expired returns to 0 and remains stable before considering the incident resolved.

Address JVM GC pauses

If running external ZooKeeper, review GC logs for stop-the-world pauses. Increase JVM heap or migrate to ClickHouse Keeper, which is not JVM-based and avoids GC pauses.

Reduce ZK metadata load

Pause non-essential DDL. Reduce the number of replicated tables where possible; each table creates znodes and watches that multiply coordination overhead. If the ensemble is chronically saturated, scale the ZK cluster or shard tables across separate clusters.

Recover from thundering herd

After a ZK outage, stagger any necessary ClickHouse restarts. If multiple nodes are already reconnecting, wait for the storm to subside before taking further action. Monitor ZooKeeperWaitMicroseconds for a clear downward trend before declaring recovery.
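
The two ideas above, staggering restarts and waiting for a downward trend, can be sketched as follows. This is only an illustration: the host names, the fixed gap, and the window-average comparison are assumptions, not recommended values.

```python
def stagger_schedule(hosts, gap_seconds):
    """Assign each host a restart offset so that no two restart together."""
    return [(host, index * gap_seconds) for index, host in enumerate(hosts)]


def is_trending_down(deltas, window=3):
    """Compare the mean of the latest window against the window before it.

    deltas holds per-interval increases of ZooKeeperWaitMicroseconds,
    oldest first. Returns False when there is not enough data to compare.
    """
    if len(deltas) < 2 * window:
        return False
    older = sum(deltas[-2 * window:-window]) / window
    recent = sum(deltas[-window:]) / window
    return recent < older


if __name__ == "__main__":
    print(stagger_schedule(["ch-1", "ch-2", "ch-3"], 120))
    print(is_trending_down([900, 800, 700, 300, 200, 100]))
```

A scheduler built on this would restart the next host only when its offset has elapsed and is_trending_down still returns True.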

Prevention

  • Monitor ZK latency from both the ClickHouse client side and the ZK server side. Uptime alone is insufficient.
  • Maintain headroom between client-side coordination latency and the negotiated session_timeout_ms. ZK heartbeats are sent at one-third of the timeout.
  • Run ZooKeeper transaction logs on dedicated, low-latency disks isolated from ClickHouse data.
  • Avoid creating thousands of replicated tables on a single ZK ensemble.
  • Do not batch-restart ClickHouse nodes; rolling restarts prevent coordination storms.
  • Consider ClickHouse Keeper over external ZooKeeper to eliminate JVM GC risk.
  • Alert on ZooKeeperExceptions and latency trends, not just hard session expiry.

How Netdata helps

Netdata collects system.replicas.is_session_expired, system.zookeeper_connection.is_expired, ZooKeeperWaitMicroseconds, ZooKeeperExceptions, replication queue depth, and absolute_delay at per-second resolution. This exposes coordination stress before sessions drop and lets you correlate network latency or ZK disk stalls with session timeouts.