ClickHouse ZooKeeper session has expired: causes, recovery, and tuning

When system.replicas shows is_session_expired = 1, the replica has lost its ZooKeeper session and stopped participating in replication. It rejects inserts and takes no part in coordinated merges or distributed DDL. Depending on quorum and load-balancer configuration, writes may silently shift to other replicas or halt for entire shards.

is_session_expired often appears alongside is_readonly = 1, but the two are distinct. Session expiration means the coordination session is dead. Read-only means the replica refuses writes. Session expiration is the more severe of the two because it breaks the replica’s contract with the cluster.

Brief session flapping during ZooKeeper leader elections is normal and usually resolves within seconds. Sustained expiration is not. If is_session_expired persists for more than a minute, treat it as an active coordination failure that will cascade into replication lag and stale reads.

What this means

ReplicatedMergeTree tables rely on ZooKeeper or ClickHouse Keeper to agree on leaders, queue merges, and propagate DDL. Each ClickHouse connection holds a negotiated session. When the session expires, ephemeral nodes vanish. The replica loses leadership claims and visibility into the shared replication log.

The replica still answers SELECT from local data. This creates a false sense of health because results may be stale: the replica receives no new parts while disconnected. If the replica does not re-establish a session within seconds, the underlying cause is still active, and the replica will not recover until that cause is removed.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Aggressive session timeout | `is_session_expired` flaps during minor ZK latency spikes | `session_timeout_ms` in `system.zookeeper_connection` vs `ZooKeeperWaitMicroseconds` |
| ZK disk I/O bottleneck | All replicas affected simultaneously; ZK latency high | `echo mntr \| nc zk-host 2181` for latency and outstanding requests |
| Network partition | Subset of replicas affected; correlates with network errors | Connectivity from replica to ZK nodes |
| JVM GC pause (ZooKeeper only) | Periodic session drops matching GC cycles | GC logs on ZK JVM nodes |
| ZK overload / too many tables | High znode count; thundering herd after any restart | ZK `mntr` znode count; number of replicated tables |
| ClickHouse node overload | Node unresponsive; heartbeats fail to ZK | Host CPU, memory, and `system.processes` |

Quick checks

-- Check replica session and readonly state
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    total_replicas,
    active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';

-- Check ZK connection from ClickHouse perspective
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms
FROM system.zookeeper_connection;

-- Live ZK connectivity test
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;

# External ZooKeeper health check
echo "ruok" | nc <zookeeper-host> 2181
echo "mntr" | nc <zookeeper-host> 2181

# ClickHouse Keeper health check
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181

-- ZooKeeper client-side latency and errors
SELECT event, value
FROM system.events
WHERE event IN ('ZooKeeperWaitMicroseconds', 'ZooKeeperExceptions');

-- Replication queue stuck entries
SELECT
    database,
    table,
    type,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 10;

How to diagnose it

Confirm the blast radius. Query system.replicas. If all replicas across all tables show is_session_expired = 1, suspect a ZK ensemble issue. If only one replica or shard is affected, suspect a network or node-level problem.
Determine if the loss is sustained. Brief transitions during ZK leader election resolve within seconds. If is_session_expired has been 1 for more than a minute, treat it as a sustained failure that requires intervention.
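Inspect the negotiated timeout. Query system.zookeeper_connection. ZK clients send heartbeats at one-third of session_timeout_ms. ZooKeeperWaitMicroseconds is a cumulative counter, so judge it by how fast it grows rather than by its raw value: if the average wait per request repeatedly approaches or exceeds the heartbeat interval, heartbeats stall and expiration follows.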
Test ZK connectivity from the affected node. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1. If this hangs or fails, the TCP session or ZK operation path is broken.
Check ZK server health. Use echo mntr | nc <zk-host> 2181 (or port 9181 for Keeper). Look for high zk_avg_latency, large zk_outstanding_requests, or a missing leader (zk_server_state should show leader on one node and follower on the others).
Check client-side error counters. ZooKeeperExceptions in system.events tracks failed operations. A rising count confirms degradation even before sessions expire.
Verify network paths. Use ping, mtr, or similar from the ClickHouse host to ZK nodes. Packet loss, asymmetric routes, or MTU issues cause silent timeouts that do not always surface as socket errors.
Review replication backlog. On affected replicas, queue_size and absolute_delay in system.replicas grow. Check system.replication_queue for entries with a high num_tries and a non-empty last_exception; these can block automatic recovery.
Watch for thundering herd. If sessions on multiple nodes expired simultaneously after a previous ZK incident, those nodes may reconnect in a burst, spike zk_outstanding_requests, and overload ZK further.
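
The server-health step above can be scripted so that every ZK node is judged the same way. The following is a minimal sketch, not a definitive check: check_mntr is a hypothetical helper name, and the defaults of 50 ms and 100 outstanding requests are illustrative assumptions, not recommended values. It reads only the zk_avg_latency, zk_outstanding_requests, and zk_server_state keys mentioned in the steps.

```shell
# check_mntr: hypothetical helper that reads `mntr` output on stdin.
# Arguments (both optional, both assumed defaults; tune against your baseline):
#   $1 = latency threshold in ms, $2 = outstanding-request threshold.
check_mntr() {
    max_latency_ms="${1:-50}"
    max_outstanding="${2:-100}"
    awk -v max_lat="$max_latency_ms" -v max_out="$max_outstanding" '
        # mntr prints one "key<TAB>value" pair per line.
        $1 == "zk_avg_latency"          { lat = $2 }
        $1 == "zk_outstanding_requests" { out = $2 }
        $1 == "zk_server_state"         { state = $2 }
        END {
            printf "state=%s latency=%s outstanding=%s\n", state, lat, out
            if (state == "")           print "WARN no zk_server_state reported"
            if (lat + 0 > max_lat + 0) print "WARN zk_avg_latency above threshold"
            if (out + 0 > max_out + 0) print "WARN zk_outstanding_requests above threshold"
        }'
}
```

A typical invocation would be echo mntr | nc <zk-host> 2181 | check_mntr 50 100, repeated for each node in the ensemble. The helper only summarizes one node; confirming that exactly one node reports leader still means comparing the state= values across nodes.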

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `system.replicas.is_session_expired` | Direct indicator of replica isolation from coordination | Sustained `1` for more than 30 seconds |
| `system.zookeeper_connection.is_expired` | Early warning before replica state transitions | Any `1` outside of planned maintenance |
| `system.zookeeper_connection.session_timeout_ms` | Determines how much latency headroom exists | Client wait time approaching one-third of this value |
| `ZooKeeperWaitMicroseconds` | Client-side coordination latency | Trending upward or elevated above baseline |
| `ZooKeeperExceptions` | Failed ZK operations | Any sustained non-zero rate |
| `system.replicas.queue_size` | Backlog after disconnection | Growing while sessions are expired |
| ZK `mntr` `zk_avg_latency` | Server-side coordination health | Sustained elevation above baseline |
| `system.replicas.is_readonly` | Write availability on the replica | `1` for more than 5 minutes |
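
The "sustained" warning signs above are easier to apply to a measured duration than to a single sample. As a rough sketch, assume a poller writes one line per sample in the form epoch_seconds flag, where flag is the sampled is_session_expired value. The helper below, with the hypothetical name longest_expired_run, reports the longest unbroken stretch of expired samples. The input format is an assumption of this example, not something ClickHouse produces by itself.

```shell
# longest_expired_run: hypothetical helper.
# stdin: lines of "epoch_seconds flag" in time order, flag is 0 or 1.
# stdout: longest span, in seconds, between the first and last sample
#         of any unbroken run of flag=1 (a lower bound on the real outage).
longest_expired_run() {
    awk '
        $2 == 1 {
            if (!in_run) { start = $1; in_run = 1 }   # a new run begins
            span = $1 - start
            if (span > longest) longest = span
            next
        }
        { in_run = 0 }                                # a 0 sample ends the run
        END { print longest + 0 }'
}
```

Samples could come from a loop around something like clickhouse-client --query "SELECT toUnixTimestamp(now()), max(is_session_expired) FROM system.replicas", appended to a file and piped into the helper. The result can then be compared against the 30-second threshold in the table.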

Fixes

Increase the session timeout

If session_timeout_ms is too aggressive for your network or ZK latency, increase it. This trades faster failure detection for stability under transient spikes. The actual timeout is negotiated at connection time and may differ from the configured value. Verify the effective value in system.zookeeper_connection.
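
To judge whether a timeout is too aggressive, it helps to make the one-third heartbeat rule concrete. The sketch below is only arithmetic: heartbeat_interval_ms and has_headroom are hypothetical helper names, and the observed wait is assumed to be a per-request figure in milliseconds that you have already derived from your own metrics.

```shell
# heartbeat_interval_ms: heartbeat interval implied by a session timeout,
# using the one-third rule described in this guide.
#   $1 = negotiated session_timeout_ms
heartbeat_interval_ms() {
    echo $(( $1 / 3 ))
}

# has_headroom: succeeds (exit 0) when the observed wait stays below the
# heartbeat interval, fails otherwise.
#   $1 = negotiated session_timeout_ms, $2 = observed per-request wait in ms
has_headroom() {
    [ "$2" -lt "$(heartbeat_interval_ms "$1")" ]
}
```

For example, a negotiated timeout of 30000 ms implies a heartbeat roughly every 10000 ms, so has_headroom 30000 2500 succeeds while has_headroom 30000 12000 fails. Read the timeout from system.zookeeper_connection, not from the configuration file, because the negotiated value is the one that applies.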

Relieve ZK disk I/O pressure

Move ZooKeeper transaction logs to dedicated NVMe storage. Ensure ZK data directories are not co-located with ClickHouse data or logs. If ZK latency is already elevated, do not restart ClickHouse nodes to force reconnection. That creates a thundering herd and worsens the overload.

Resolve network partitions

Restore connectivity between the affected ClickHouse node and the ZK ensemble. Once the path is stable, the replica reconnects automatically. Verify is_expired returns to 0 and remains stable before considering the incident resolved.

Address JVM GC pauses

If running external ZooKeeper, review GC logs for stop-the-world pauses. Increase JVM heap or migrate to ClickHouse Keeper, which is not JVM-based and avoids GC pauses.

Reduce ZK metadata load

Pause non-essential DDL. Reduce the number of replicated tables where possible; each table creates znodes and watches that multiply coordination overhead. If the ensemble is chronically saturated, scale the ZK cluster or shard tables across separate clusters.

Recover from thundering herd

After a ZK outage, stagger any necessary ClickHouse restarts. If multiple nodes are already reconnecting, wait for the storm to subside before taking further action. Monitor ZooKeeperWaitMicroseconds for a clear downward trend before declaring recovery.
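
Staggering is easier to stick to during an incident when the order and spacing are written down first. The sketch below only prints a plan and restarts nothing. stagger_plan is a hypothetical helper, the host names in the example are placeholders, and the gap is an assumption to be chosen from how quickly your ensemble settles after each reconnect.

```shell
# stagger_plan: hypothetical helper that prints "offset_seconds<TAB>host"
# for each host, spacing restarts by a fixed gap. It performs no restarts.
#   $1 = gap between restarts in seconds, remaining args = hosts in order
stagger_plan() {
    gap_seconds="$1"
    shift
    offset=0
    for host in "$@"; do
        printf '%s\t%s\n' "$offset" "$host"
        offset=$(( offset + gap_seconds ))
    done
}
```

Running stagger_plan 120 ch-1 ch-2 ch-3 lists ch-1 at offset 0, ch-2 at 120, and ch-3 at 240. Treat the offsets as a minimum spacing: if ZooKeeperWaitMicroseconds is still climbing when a slot arrives, hold the next restart until it settles.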

Prevention

Monitor ZK latency from both the ClickHouse client side and the ZK server side. Uptime alone is insufficient.
Maintain headroom between client-side coordination latency and the negotiated session_timeout_ms. ZK heartbeats are sent at one-third of the timeout.
Run ZooKeeper transaction logs on dedicated, low-latency disks isolated from ClickHouse data.
Avoid creating thousands of replicated tables on a single ZK ensemble.
Do not batch-restart ClickHouse nodes; rolling restarts prevent coordination storms.
Consider ClickHouse Keeper over external ZooKeeper to eliminate JVM GC risk.
Alert on ZooKeeperExceptions and latency trends, not just hard session expiry.

How Netdata helps

Netdata collects system.replicas.is_session_expired, system.zookeeper_connection.is_expired, ZooKeeperWaitMicroseconds, ZooKeeperExceptions, replication queue depth, and absolute_delay per second. This exposes coordination stress before sessions drop and correlates network latency or ZK disk stalls with session timeouts.
