ClickHouse ZooKeeper session has expired: causes, recovery, and tuning
When system.replicas shows is_session_expired = 1, the replica has lost its ZooKeeper session and stopped participating in replication. It rejects inserts and can no longer take part in coordinated merges or distributed DDL. Depending on quorum and load-balancer configuration, writes may shift silently to other replicas or halt for entire shards.
is_session_expired often appears alongside is_readonly = 1, but the two are distinct. Session expiration means the coordination session is dead. Read-only means the replica refuses writes. Session expiration is the more severe of the two because it breaks the replica’s contract with the cluster.
Brief session flapping during ZooKeeper leader elections is normal and usually resolves within seconds. Sustained expiration is not. If is_session_expired persists for more than a minute, treat it as an active coordination failure that will cascade into replication lag and stale reads.
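One way to make the one-minute rule concrete is to measure the longest unbroken run of expired samples. The sketch below assumes you already record is_session_expired together with a Unix timestamp on each poll. The function name longest_expired_run and the two-column input format are illustrative and not part of ClickHouse.

```shell
# Reads "epoch_seconds is_session_expired" pairs on stdin, oldest first.
# Prints the longest continuous run of expired samples, in seconds.
# (Hypothetical helper; the input format is an assumption.)
longest_expired_run() {
  awk '
    $2 == 1 {
      if (start == "") start = $1   # first expired sample of this run
      run = $1 - start
      if (run > max) max = run
      next
    }
    { start = "" }                  # a healthy sample ends the run
    END { print max + 0 }
  '
}
```

A result above 60 would match the "more than a minute" condition described above; anything shorter is more likely flapping.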
What this means
ReplicatedMergeTree tables rely on ZooKeeper or ClickHouse Keeper to agree on leaders, queue merges, and propagate DDL. Each ClickHouse connection holds a negotiated session. When the session expires, ephemeral nodes vanish. The replica loses leadership claims and visibility into the shared replication log.
The replica still answers SELECT from local data. This creates a false sense of health because results may be stale: the replica receives no new parts while disconnected. If the session does not reconnect within seconds, the underlying cause is still active and the replica will not recover until that cause is removed.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Aggressive session timeout | is_session_expired flaps during minor ZK latency spikes | session_timeout_ms in system.zookeeper_connection vs ZooKeeperWaitMicroseconds |
| ZK disk I/O bottleneck | All replicas affected simultaneously; ZK latency high | echo mntr \| nc zk-host 2181 for latency and outstanding requests |
| Network partition | Subset of replicas affected; correlates with network errors | Connectivity from replica to ZK nodes |
| JVM GC pause (ZooKeeper only) | Periodic session drops matching GC cycles | GC logs on ZK JVM nodes |
| ZK overload / too many tables | High znode count; thundering herd after any restart | ZK mntr znode count; number of replicated tables |
| ClickHouse node overload | Node unresponsive; heartbeats fail to ZK | Host CPU, memory, and system.processes |
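The "all replicas" versus "subset of replicas" distinction in the table can be turned into a rough first-pass classifier. This is a minimal sketch: it assumes you have already gathered one line per replica in the form "name is_session_expired" from every node, and the function name and output labels are made up for illustration.

```shell
# Reads "replica_name is_session_expired" pairs on stdin, one per replica.
# Prints a coarse hint about where to look first.
# (Hypothetical helper; labels are illustrative, not ClickHouse terms.)
classify_blast_radius() {
  awk '
    { total++; if ($2 == 1) expired++ }
    END {
      if (total == 0)            print "no-data"
      else if (expired == 0)     print "healthy 0/" total
      else if (expired == total) print "ensemble-suspect " expired "/" total
      else                       print "node-or-network-suspect " expired "/" total
    }
  '
}
```

The output is only a starting hint. A full outage of one shard can still look like "node-or-network-suspect" when other shards are healthy.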
Quick checks
-- Check replica session and readonly state
SELECT
database,
table,
is_readonly,
is_session_expired,
total_replicas,
active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';
-- Check ZK connection from ClickHouse perspective
SELECT
name,
host,
port,
is_expired,
session_uptime_elapsed_seconds,
session_timeout_ms
FROM system.zookeeper_connection;
-- Live ZK connectivity test
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;
# External ZooKeeper health check
echo "ruok" | nc <zookeeper-host> 2181
echo "mntr" | nc <zookeeper-host> 2181
# ClickHouse Keeper health check
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181
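-- ZooKeeper client-side latency and errors
-- (many ClickHouse versions split exception counters by type, e.g. ZooKeeperHardwareExceptions,
-- so match on the pattern rather than a single name)
SELECT event, value
FROM system.events
WHERE event = 'ZooKeeperWaitMicroseconds'
OR event LIKE 'ZooKeeper%Exceptions';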
-- Replication queue stuck entries
SELECT
database,
table,
type,
num_tries,
last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 10;
How to diagnose it
- Confirm the blast radius. Query system.replicas. If all replicas across all tables show is_session_expired = 1, suspect a ZK ensemble issue. If only one replica or shard is affected, suspect a network or node-level problem.
- Determine if the loss is sustained. Brief transitions during ZK leader election resolve within seconds. If is_session_expired has been 1 for more than a minute, treat it as a sustained failure that requires intervention.
- Inspect the negotiated timeout. Query system.zookeeper_connection. ZK clients send heartbeats at one-third of session_timeout_ms. If ZooKeeperWaitMicroseconds repeatedly approaches or exceeds that threshold, heartbeats stall and expiration follows.
- Test ZK connectivity from the affected node. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1. If this hangs or fails, the TCP session or ZK operation path is broken.
- Check ZK server health. Use echo mntr | nc <zk-host> 2181 (or port 9181 for Keeper). Look for high zk_avg_latency, large zk_outstanding_requests, or a missing leader (zk_server_state should show leader on one node and follower on the others).
- Check client-side error counters. ZooKeeperExceptions in system.events tracks failed operations. A rising count confirms degradation even before sessions expire.
- Verify network paths. Use ping, mtr, or similar from the ClickHouse host to ZK nodes. Packet loss, asymmetric routes, or MTU issues cause silent timeouts that do not always surface as socket errors.
- Review replication backlog. On affected replicas, queue_size and absolute_delay in system.replicas grow. Check system.replication_queue for entries with high num_tries and non-null last_exception that will block automatic recovery.
- Watch for thundering herd. If multiple nodes expired sessions simultaneously after a previous ZK incident, they may reconnect in a burst, spike zk_outstanding_requests, and overload ZK further.
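The server health step can be reduced to a one-line summary per ZK node. The sketch below parses mntr output from stdin and flags the three fields named above. The function name mntr_summary and the default thresholds (50 ms average latency, 100 outstanding requests) are assumptions chosen for illustration, not recommended values.

```shell
# Reads `mntr` output on stdin and prints one summary line.
# Usage: echo mntr | nc <zk-host> 2181 | mntr_summary [MAX_LATENCY_MS] [MAX_OUTSTANDING]
# Thresholds are illustrative assumptions; tune them against your own baseline.
mntr_summary() {
  max_latency_ms=${1:-50}
  max_outstanding=${2:-100}
  awk -v max_lat="$max_latency_ms" -v max_out="$max_outstanding" '
    $1 == "zk_avg_latency"          { lat = $2 }
    $1 == "zk_outstanding_requests" { out = $2 }
    $1 == "zk_server_state"         { state = $2 }
    END {
      status = "ok"
      if (state == "")
        status = "no-state"          # node did not report a role at all
      else if (lat + 0 > max_lat + 0 || out + 0 > max_out + 0)
        status = "degraded"
      printf "%s state=%s avg_latency=%s outstanding=%s\n", status, state, lat, out
    }
  '
}
```

Run it against every ensemble member and compare the state= values: exactly one node should report leader. A no-state result usually means the node is not serving requests or the mntr command is not enabled on it.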
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| system.replicas.is_session_expired | Direct indicator of replica isolation from coordination | Sustained 1 for more than 30 seconds |
| system.zookeeper_connection.is_expired | Early warning before replica state transitions | Any 1 outside of planned maintenance |
| system.zookeeper_connection.session_timeout_ms | Determines how much latency headroom exists | Client wait time approaching one-third of this value |
| ZooKeeperWaitMicroseconds | Client-side coordination latency | Trending upward or elevated above baseline |
| ZooKeeperExceptions | Failed ZK operations | Any sustained non-zero rate |
| system.replicas.queue_size | Backlog after disconnection | Growing while sessions are expired |
| ZK mntr zk_avg_latency | Server-side coordination health | Sustained elevation above baseline |
| system.replicas.is_readonly | Write availability on the replica | 1 for more than 5 minutes |
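Values in system.events are cumulative counters, so ZooKeeperWaitMicroseconds has to be turned into a per-operation figure before it can be compared with one-third of session_timeout_ms. The sketch below takes two samples of the wait counter and of an operation counter and reports the headroom. It assumes the ZooKeeperTransactions event is used as the operation count, which gives only an approximate average. The function name and the "warn at half the heartbeat interval" rule are illustrative assumptions.

```shell
# Usage: zk_headroom WAIT_US_T0 WAIT_US_T1 TXN_T0 TXN_T1 SESSION_TIMEOUT_MS
# WAIT_US_*: two samples of ZooKeeperWaitMicroseconds (cumulative)
# TXN_*:     two samples of an operation counter (assumed: ZooKeeperTransactions)
# Prints a status plus the average wait per operation and the heartbeat interval.
zk_headroom() {
  wait_delta_us=$(( $2 - $1 ))
  txn_delta=$(( $4 - $3 ))
  heartbeat_ms=$(( $5 / 3 ))          # heartbeats go out at one-third of the timeout
  if [ "$txn_delta" -le 0 ]; then
    echo "no-data heartbeat_ms=$heartbeat_ms"
    return 0
  fi
  avg_wait_ms=$(( wait_delta_us / txn_delta / 1000 ))
  if [ "$avg_wait_ms" -ge "$heartbeat_ms" ]; then
    status=critical
  elif [ "$avg_wait_ms" -ge $(( heartbeat_ms / 2 )) ]; then
    status=warn                       # assumed threshold: half the heartbeat interval
  else
    status=ok
  fi
  echo "$status avg_wait_ms=$avg_wait_ms heartbeat_ms=$heartbeat_ms"
}
```

An average hides tail latency, so treat an ok result as necessary but not sufficient; a single long stall can still expire a session.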
Fixes
Increase the session timeout
If session_timeout_ms is too aggressive for your network or ZK latency, increase it. This trades faster failure detection for stability under transient spikes. The actual timeout is negotiated at connection time and may differ from the configured value. Verify the effective value in system.zookeeper_connection.
Relieve ZK disk I/O pressure
Move ZooKeeper transaction logs to dedicated NVMe storage. Ensure ZK data directories are not co-located with ClickHouse data or logs. If ZK latency is already elevated, do not restart ClickHouse nodes to force reconnection. That creates a thundering herd and worsens the overload.
Resolve network partitions
Restore connectivity between the affected ClickHouse node and the ZK ensemble. Once the path is stable, the replica reconnects automatically. Verify is_expired returns to 0 and remains stable before considering the incident resolved.
Address JVM GC pauses
If running external ZooKeeper, review GC logs for stop-the-world pauses. Increase JVM heap or migrate to ClickHouse Keeper, which is not JVM-based and avoids GC pauses.
Reduce ZK metadata load
Pause non-essential DDL. Reduce the number of replicated tables where possible; each table creates znodes and watches that multiply coordination overhead. If the ensemble is chronically saturated, scale the ZK cluster or shard tables across separate clusters.
Recover from thundering herd
After a ZK outage, stagger any necessary ClickHouse restarts. If multiple nodes are already reconnecting, wait for the storm to subside before taking further action. Monitor ZooKeeperWaitMicroseconds for a clear downward trend before declaring recovery.
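A staggered restart can be sketched as a loop that restarts one node, waits for its session to stabilise, and then pauses before moving on. Everything below is hypothetical scaffolding: restart_node and session_is_healthy are placeholder hooks that default to a dry run, and the commands shown in the comments are only examples of what you might substitute.

```shell
# Placeholder hooks (dry run by default). Replace them for real use, for example:
#   restart_node:       ssh "$1" sudo systemctl restart clickhouse-server
#   session_is_healthy: [ "$(clickhouse-client --host "$1" \
#                           -q 'SELECT max(is_expired) FROM system.zookeeper_connection')" = "0" ]
restart_node() { echo "would restart $1"; }
session_is_healthy() { return 0; }

# Usage: rolling_restart GAP_SECONDS host1 host2 ...
# Restarts hosts one at a time and never moves on while a session is unhealthy.
rolling_restart() {
  gap_seconds=$1
  shift
  for host in "$@"; do
    restart_node "$host" || return 1
    until session_is_healthy "$host"; do
      sleep 5                         # poll until the node has a live session again
    done
    sleep "$gap_seconds"              # let ZK settle before the next reconnect burst
  done
}
```

Calling rolling_restart 120 ch1 ch2 ch3 with the default hooks only prints the plan. The gap between nodes is the part that prevents a reconnect storm, so size it from how long ZooKeeperWaitMicroseconds takes to settle after a single restart.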
Prevention
- Monitor ZK latency from both the ClickHouse client side and the ZK server side. Uptime alone is insufficient.
- Maintain headroom between client-side coordination latency and the negotiated session_timeout_ms. ZK heartbeats are sent at one-third of the timeout.
- Run ZooKeeper transaction logs on dedicated, low-latency disks isolated from ClickHouse data.
- Avoid creating thousands of replicated tables on a single ZK ensemble.
- Do not batch-restart ClickHouse nodes; rolling restarts prevent coordination storms.
- Consider ClickHouse Keeper over external ZooKeeper to eliminate JVM GC risk.
- Alert on ZooKeeperExceptions and latency trends, not just hard session expiry.
How Netdata helps
Netdata collects system.replicas.is_session_expired, system.zookeeper_connection.is_expired, ZooKeeperWaitMicroseconds, ZooKeeperExceptions, replication queue depth, and absolute_delay per second. This exposes coordination stress before sessions drop and correlates network latency or ZK disk stalls with session timeouts.