ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer

If SELECT * FROM system.zookeeper WHERE path = '/' fails, replicated tables flip to readonly, or ON CLUSTER DDL hangs, the coordination layer is broken. The server itself stays up and GET /ping still returns "Ok.", so liveness checks miss the problem. What starts as partial degradation can escalate to a write outage for replicated tables if they cannot re-establish their sessions.

ClickHouse uses the coordination service for ReplicatedMergeTree leader election, replication log queues, insert deduplication, and distributed DDL. ClickHouse Keeper typically listens on port 9181; external ZooKeeper ensembles listen on 2181. The diagnostic path differs slightly, but the symptom is the same: ClickHouse cannot reliably complete coordination operations.

What this means

A broken coordination connection does not crash ClickHouse. Non-replicated tables continue to accept writes and serve queries. Replicated tables depend on continuous ZooKeeper/Keeper access. When the connection is lost or sessions expire, replicas transition to readonly, replication queues stop processing, and distributed DDL tasks stall in system.distributed_ddl_queue. The server can look healthy while the control plane is frozen.

The failure can originate on either side. ZooKeeper/Keeper may have lost quorum, saturated its transaction log disk, or be overwhelmed by watches and znodes. The network path may be blocked by a firewall, routing change, or DNS failure. ClickHouse may use a session timeout that is too aggressive for observed coordination latency.

flowchart TD
    A[Query to system.zookeeper fails or times out] --> B{Using ClickHouse Keeper or external ZooKeeper?}
    B -->|port 9181 / localhost| C[Run 4lw probes on localhost:9181]
    B -->|port 2181 / remote hosts| D[Run 4lw probes on ZK hosts:2181]
    C --> E{ruok returns imok and mntr shows healthy leader?}
    D --> E
    E -->|No| F[Fix ensemble first: quorum, disk latency, leader]
    E -->|Yes| G[Check DNS and TCP connectivity from CH host]
    G --> H{Resolves and connects?}
    H -->|No| I[Fix DNS, firewall, or route]
    H -->|Yes| J[Inspect CH session timeouts and replicated table count]
    J --> K[Reduce DDL pressure or tune session timeout]
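
The decision flow above can also be written as a small triage function. This is a minimal Python sketch, not a definitive implementation: the ProbeResults class, its field names, and next_action are hypothetical names introduced for illustration, and the returned strings are the flowchart's own labels.

```python
from dataclasses import dataclass


@dataclass
class ProbeResults:
    """Inputs to the decision flow; all field names are illustrative."""
    ruok_is_imok: bool        # ruok answered "imok"
    has_healthy_leader: bool  # mntr shows a leader and acceptable latency
    dns_resolves: bool        # coordination hostname resolves from the CH host
    tcp_connects: bool        # TCP connect to the coordination port succeeds


def next_action(p: ProbeResults) -> str:
    """Return the next step, walking the flowchart from top to bottom."""
    # The ensemble comes first: nothing on the ClickHouse side helps without it.
    if not (p.ruok_is_imok and p.has_healthy_leader):
        return "Fix ensemble first: quorum, disk latency, leader"
    # With a healthy ensemble, rule out the network path from the CH host.
    if not (p.dns_resolves and p.tcp_connects):
        return "Fix DNS, firewall, or route"
    # Ensemble and network are fine, so look at ClickHouse itself.
    return "Inspect CH session timeouts and replicated table count"
```

Ordering the checks this way mirrors the flowchart's priority: an ensemble fault masks everything downstream, so it is ruled out before the network and before ClickHouse settings.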

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| ZooKeeper/Keeper ensemble failure or quorum loss | system.zookeeper queries fail across multiple CH nodes; ruok may return imok on individual nodes but mntr shows no leader or high latency | mntr output (via nc <host> <port>) for zk_server_state and zk_avg_latency |
| Network partition, firewall change, or DNS failure | CH logs contain connection timeouts or unknown host errors for ZK hosts; only some CH nodes affected | dig +short <zk-host> and nc -zv <zk-host> <port> from the CH host |
| ZK saturation from too many replicated tables or DDL | zk_avg_latency climbing, znode count high; CH sessions flap repeatedly; issue affects all replicated tables at once | Number of replicated tables and rate of DDL operations |
| Session timeout too aggressive for current latency | is_expired toggles frequently but reconnects quickly; sessions drop during brief ZK latency spikes | session_timeout_ms in system.zookeeper_connection versus observed ZK latency |
| ClickHouse Keeper process issue | Only nodes using embedded or dedicated Keeper affected; port 9181 does not respond locally | Keeper process liveness and logs on the affected host |

Quick checks

Run these safe, read-only checks to confirm the failure and locate its source.

-- Confirm coordination connectivity from ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;

-- Inspect session state and configured timeout
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms
FROM system.zookeeper_connection;

-- Check replica impact
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    total_replicas,
    active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';

# ClickHouse Keeper 4lw health on the ZK protocol port
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181

# External ZooKeeper 4lw health
echo ruok | nc <zk-host> 2181
echo mntr | nc <zk-host> 2181

# DNS resolution and basic TCP reachability from CH host
dig +short <zk-host>
nslookup <zk-host>
nc -zv <zk-host> 2181

-- Stalled distributed DDL
-- (the host column is named host in recent releases; some older releases used host_name)
SELECT entry, host, status, exception_text
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY entry DESC
LIMIT 20;
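
The nc probes above can also be scripted, which helps when the output needs to be compared across several hosts. The following is a minimal Python sketch, assuming the 4lw command is enabled on the target and that mntr replies with one whitespace-separated key/value pair per line. four_letter_word and parse_mntr are hypothetical helper names, not part of any ClickHouse or ZooKeeper client library.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 3.0) -> str:
    """Send a 4lw command such as "ruok" or "mntr" and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(raw: str) -> dict:
    """Turn mntr output into {key: value}; values are kept as strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


# Example use against a local Keeper (requires a reachable service):
#   stats = parse_mntr(four_letter_word("localhost", 9181, "mntr"))
#   print(stats.get("zk_server_state"), stats.get("zk_avg_latency"))
```

Values are left as strings because mntr mixes numeric counters with text fields such as zk_version and zk_server_state; convert only the fields you compare numerically.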

How to diagnose it

  1. Prove the coordination failure. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1. If it throws or hangs, coordination is broken. Non-replicated workloads may still behave normally.

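  2. Identify Keeper versus external ZooKeeper. Look at host and port in system.zookeeper_connection. Localhost with port 9181 usually points to ClickHouse Keeper. Remote hosts with port 2181 usually point to an external ZooKeeper ensemble. Both ports are conventions, so confirm against the <zookeeper> section of the server configuration if the values are ambiguous.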

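  3. Probe the coordination service directly. Use ruok and mntr against the appropriate port. ruok returning nothing or a connection error means the process is not reachable. A reply saying the command is not in the whitelist means the four-letter-word command is disabled in the service configuration, which is a probe problem rather than an outage. mntr shows zk_server_state, zk_avg_latency, and zk_znode_count. No leader, or average latency well above baseline, indicates ensemble trouble.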

  4. Rule out network and DNS. From the affected ClickHouse host, resolve each ZK hostname with dig or nslookup, then confirm TCP connectivity with nc. DNS resolution failures are a common root cause that looks like a ClickHouse issue.

  5. Check ClickHouse session behavior. In system.zookeeper_connection, look for is_expired = 1 and compare session_timeout_ms to zk_avg_latency from mntr. As a rule of thumb, if latency stays above ~30% of the timeout, sessions are likely to flap.

  6. Assess replica impact. Query system.replicas for is_readonly, is_session_expired, and active_replicas < total_replicas. Any sustained readonly state means writes to replicated tables are already failing.

  7. Look for saturation load. Count replicated tables. Check system.distributed_ddl_queue for unfinished entries. If the ensemble is healthy but slow, and you have many tables or recent DDL bursts, the service is likely saturated.
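
The comparison in step 5 is simple arithmetic, and writing it out makes the threshold explicit. This is a minimal Python sketch, assuming zk_avg_latency and session_timeout_ms are both in milliseconds and using the ~30% rule of thumb from step 5. The function names and the sample numbers are illustrative, not values taken from any real cluster.

```python
def latency_ratio(zk_avg_latency_ms: float, session_timeout_ms: float) -> float:
    """Fraction of the session timeout consumed by average coordination latency."""
    if session_timeout_ms <= 0:
        raise ValueError("session_timeout_ms must be positive")
    return zk_avg_latency_ms / session_timeout_ms


def session_flap_risk(zk_avg_latency_ms: float, session_timeout_ms: float,
                      threshold: float = 0.30) -> bool:
    """True when latency exceeds the given fraction of the session timeout."""
    return latency_ratio(zk_avg_latency_ms, session_timeout_ms) > threshold


# Illustrative numbers: with a 30000 ms timeout, the 30% mark is 9000 ms.
# An average latency of 12000 ms is a ratio of 0.4, so the check flags it.
print(session_flap_risk(12000, 30000))
```

A single reading proves little. The check is more useful when applied to latency sampled over several minutes, since step 5 is concerned with latency that stays above the threshold.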

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| system.zookeeper query success and latency | Direct probe of coordination connectivity | Query fails, times out, or latency exceeds ~1 second |
| system.zookeeper_connection.is_expired | Indicates a lost session; replicas will go readonly | is_expired = 1 sustained for more than a few seconds |
| system.replicas.is_readonly | Shows write availability loss for replicated tables | is_readonly = 1 sustained for more than 5 minutes |
| system.replicas.is_session_expired | Replica-level view of coordination partition | Any sustained non-zero value |
| ZK/Keeper average latency from mntr | Leading indicator of saturation before sessions drop | zk_avg_latency above baseline or approaching 30% of session_timeout_ms |
| Replication queue depth | Backlog caused by stalled coordination | queue_size growing for more than 15 minutes |
| RejectedInserts event counter | Hard failure when replicated writes are blocked | Any sustained increment while insert traffic is active |
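
Several warning signs in the table depend on a condition being sustained rather than momentary, for example is_readonly = 1 for more than 5 minutes. The following Python sketch shows one way to evaluate that over polled samples. It assumes you already collect (timestamp, flag) pairs in time order; sustained_for is a hypothetical helper name, and the polling itself is out of scope.

```python
from datetime import datetime, timedelta


def sustained_for(samples, min_duration: timedelta) -> bool:
    """Return True if the condition has held continuously up to the latest
    sample for at least min_duration.

    samples: list of (timestamp, bool) tuples ordered oldest to newest.
    """
    if not samples or not samples[-1][1]:
        return False  # no data, or the condition is not true right now
    latest = samples[-1][0]
    run_start = latest
    # Walk backwards through the trailing run of True samples.
    for ts, flag in reversed(samples):
        if not flag:
            break
        run_start = ts
    return latest - run_start >= min_duration


# Illustrative use: one sample per minute, readonly the whole time.
t0 = datetime(2024, 1, 1, 12, 0)
readonly = [(t0 + timedelta(minutes=i), True) for i in range(7)]
print(sustained_for(readonly, timedelta(minutes=5)))
```

Only the most recent unbroken run counts, so a replica that briefly recovered and then went readonly again restarts the clock. That keeps short flaps from triggering the same alert as a sustained outage.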

Fixes

ZooKeeper/Keeper ensemble failure

Restore the ensemble before touching ClickHouse. Verify quorum, elect a leader, and ensure the transaction log disk is not saturated. For JVM-based ZooKeeper, check garbage collection logs and heap pressure. Avoid restarting ClickHouse nodes while the ensemble is unstable; a reconnection storm can worsen saturation.

Network partition, firewall, or DNS failure

Restore name resolution and TCP reachability. If ZK hosts changed IPs, update ClickHouse configuration and restart the server to pick up new endpoints. Validate with dig and nc from every ClickHouse node before declaring the incident resolved.
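
Validating every endpoint from every ClickHouse node is easier with a script than with repeated dig and nc calls. This is a minimal Python sketch using only the standard library. check_endpoint and the shape of its result dictionary are hypothetical, and the host names in the usage comment are placeholders.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve host, then attempt a TCP connection to host:port.

    Returns a dict with the resolved addresses, a tcp_ok flag, and an
    error message that says which stage failed (None on success).
    """
    result = {"host": host, "port": port, "addresses": [], "tcp_ok": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result
    # Each entry's sockaddr starts with the IP address string.
    result["addresses"] = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:  # refused, unreachable, or timed out
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example use with placeholder host names:
#   for zk_host in ["zk1.example.internal", "zk2.example.internal"]:
#       print(check_endpoint(zk_host, 2181))
```

Reporting the DNS and TCP stages separately matters here: a resolution failure points at name servers or stale records, while a connect failure after successful resolution points at a firewall, route, or the service itself.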

Coordination saturation

Pause non-essential DDL immediately. Reduce the number of replicated tables where possible, or split heavy metadata workloads across separate clusters. For external ZooKeeper, move the transaction log to a dedicated fast disk and consider adding ensemble nodes. For ClickHouse Keeper, review CPU and I/O on the Keeper hosts.

Session timeout flapping

If zk_avg_latency is stable but repeatedly approaches session_timeout_ms, increase the timeout cautiously. Edit session_timeout_ms in the <zookeeper> configuration and restart ClickHouse. Longer timeouts trade failure detection speed for stability under latency spikes.

ClickHouse Keeper process down

If embedded or dedicated Keeper is not responding on port 9181, inspect the process and logs. Restart the Keeper service or the hosting node. Once Keeper is healthy, ClickHouse should reconnect automatically.

Prevention

  • Monitor coordination latency, not just process liveness. A ZK process that answers ruok can still be too slow for ClickHouse sessions.
  • Keep the transaction log on fast, dedicated storage. ZooKeeper writes the transaction log synchronously; disk latency on this path directly affects coordination latency.
  • Limit replicated table count and DDL frequency. Each replicated table creates znodes and watches; excessive DDL amplifies metadata pressure.
  • Use stable DNS or static IPs for ZK endpoints. DNS resolution failures are a routine cause of apparent coordination loss.
  • Match session timeouts to observed latency. Set timeouts well above the baseline P99 coordination latency, with headroom for spikes.
  • Watch replication queue depth and part count. Growing queues can be an early sign of coordination stress before sessions expire.

How Netdata helps

  • Correlates is_readonly and is_session_expired with host network and disk metrics.
  • Surfaces insert latency and write rejection trends to expose backpressure before writes hard-fail.
  • Tracks TCP connectivity and DNS resolution metrics to detect network-side root causes.
  • Alerts on system.zookeeper_connection health and replica availability without requiring manual polling.