ClickHouse cannot connect to ZooKeeper/Keeper: diagnosing the coordination layer
If SELECT * FROM system.zookeeper WHERE path = '/' fails, replicated tables flip to readonly, or ON CLUSTER DDL hangs, the coordination layer is broken. The server stays up and GET /ping returns Ok., so liveness checks miss the problem. Partial degradation can escalate to a write outage if replicated tables cannot re-establish sessions.
ClickHouse uses the coordination service for ReplicatedMergeTree leader election, replication log queues, insert deduplication, and distributed DDL. ClickHouse Keeper typically listens on port 9181; external ZooKeeper ensembles listen on 2181. The diagnostic path differs slightly, but the symptom is the same: ClickHouse cannot reliably complete coordination operations.
What this means
A broken coordination connection does not crash ClickHouse. Non-replicated tables continue to accept writes and serve queries. Replicated tables depend on continuous ZooKeeper/Keeper access. When the connection is lost or sessions expire, replicas transition to readonly, replication queues stop processing, and distributed DDL tasks stall in system.distributed_ddl_queue. The server can look healthy while the control plane is frozen.
The failure can originate on either side. ZooKeeper/Keeper may have lost quorum, saturated its transaction log disk, or be overwhelmed by watches and znodes. The network path may be blocked by a firewall, routing change, or DNS failure. ClickHouse may use a session timeout that is too aggressive for observed coordination latency.
flowchart TD
A[Query to system.zookeeper fails or times out] --> B{Using ClickHouse Keeper or external ZooKeeper?}
B -->|port 9181 / localhost| C[Run 4lw probes on localhost:9181]
B -->|port 2181 / remote hosts| D[Run 4lw probes on ZK hosts:2181]
C --> E{ruok returns imok and mntr shows healthy leader?}
D --> E
E -->|No| F[Fix ensemble first: quorum, disk latency, leader]
E -->|Yes| G[Check DNS and TCP connectivity from CH host]
G --> H{Resolves and connects?}
H -->|No| I[Fix DNS, firewall, or route]
H -->|Yes| J[Inspect CH session timeouts and replicated table count]
J --> K[Reduce DDL pressure or tune session timeout]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ZooKeeper/Keeper ensemble failure or quorum loss | system.zookeeper queries fail across multiple CH nodes; ruok may return imok on individual nodes but mntr shows no leader or high latency | Run echo mntr \| nc <host> <port> and check zk_server_state and zk_avg_latency |
| Network partition, firewall change, or DNS failure | CH logs contain connection timeouts or unknown host errors for ZK hosts; only some CH nodes affected | dig +short <zk-host> and nc -zv <zk-host> <port> from the CH host |
| ZK saturation from too many replicated tables or DDL | zk_avg_latency climbing, znode count high; CH sessions flap repeatedly; issue affects all replicated tables at once | Number of replicated tables and rate of DDL operations |
| Session timeout too aggressive for current latency | is_expired toggles frequently but reconnects quickly; sessions drop during brief ZK latency spikes | session_timeout_ms in system.zookeeper_connection versus observed ZK latency |
| ClickHouse Keeper process issue | Only nodes using embedded or dedicated Keeper affected; port 9181 does not respond locally | Keeper process liveness and logs on the affected host |
Quick checks
Run these safe, read-only checks to confirm the failure and locate its source.
-- Confirm coordination connectivity from ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1;
-- Inspect session state and configured timeout
SELECT
name,
host,
port,
is_expired,
session_uptime_elapsed_seconds,
session_timeout_ms
FROM system.zookeeper_connection;
-- Check replica impact
SELECT
database,
table,
is_readonly,
is_session_expired,
total_replicas,
active_replicas
FROM system.replicas
WHERE engine LIKE '%Replicated%';
# ClickHouse Keeper 4lw health on the ZK protocol port
echo ruok | nc localhost 9181
echo mntr | nc localhost 9181
# External ZooKeeper 4lw health
echo ruok | nc <zk-host> 2181
echo mntr | nc <zk-host> 2181
# DNS resolution and basic TCP reachability from CH host
dig +short <zk-host>
nslookup <zk-host>
nc -zv <zk-host> 2181
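-- Stalled distributed DDL
-- Column names can differ between ClickHouse versions (for example host vs. host_name);
-- run DESCRIBE TABLE system.distributed_ddl_queue if this query reports an unknown column.
SELECT entry, host_name, status, exception_text
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY entry DESC
LIMIT 20;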
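The ruok probe above returns a bare string, which is easy to misread in a script. The following is a minimal sketch of a helper that interprets the reply; the function name `classify_ruok` and its three labels are hypothetical and not part of ClickHouse or ZooKeeper. A reply of imok only shows that the process answers, not that the ensemble has quorum.

```shell
#!/bin/sh
# classify_ruok REPLY
# Interprets the text returned by: echo ruok | nc <host> <port>
# The labels printed here are illustrative, not an official vocabulary.
classify_ruok() {
  case "$1" in
    imok) echo "responding" ;;   # process answers; quorum still needs mntr
    "")   echo "unreachable" ;;  # no reply: process down, port blocked, or timeout
    *)    echo "unexpected" ;;   # some other text, e.g. an error message
  esac
}

# Usage (needs network access, so it is left commented out):
#   reply=$(echo ruok | nc -w 3 localhost 9181)
#   classify_ruok "$reply"
```

The `-w 3` timeout in the usage comment is an assumption about the installed netcat variant; it keeps the probe from hanging when the port silently drops packets.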
How to diagnose it
1. Prove the coordination failure. Run SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 1. If it throws or hangs, coordination is broken. Non-replicated workloads may still behave normally.
2. Identify Keeper versus external ZooKeeper. Look at host and port in system.zookeeper_connection. Localhost with port 9181 points to ClickHouse Keeper. Remote hosts with port 2181 point to an external ensemble.
3. Probe the coordination service directly. Use ruok and mntr against the appropriate port. ruok returning nothing or an error means the process is not reachable. mntr shows zk_server_state, zk_avg_latency, and zk_znode_count. No leader, or average latency well above baseline, indicates ensemble trouble.
4. Rule out network and DNS. From the affected ClickHouse host, resolve each ZK hostname with dig or nslookup, then confirm TCP connectivity with nc. DNS resolution failures are a common root cause that looks like a ClickHouse issue.
5. Check ClickHouse session behavior. In system.zookeeper_connection, look for is_expired = 1 and compare session_timeout_ms to zk_avg_latency from mntr. If latency stays above ~30% of the timeout, sessions will flap.
6. Assess replica impact. Query system.replicas for is_readonly, is_session_expired, and active_replicas < total_replicas. Any sustained readonly state means writes to replicated tables are already failing.
7. Look for saturation load. Count replicated tables. Check system.distributed_ddl_queue for unfinished entries. If the ensemble is healthy but slow, and you have many tables or recent DDL bursts, the service is likely saturated.
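When probing several ensemble members, it helps to reduce each mntr reply to the three fields named in the steps above. This is a minimal sketch, assuming the usual mntr layout of one whitespace-separated key/value pair per line; the function names `mntr_field` and `mntr_summary` are hypothetical helpers, not ZooKeeper or Keeper tools.

```shell
#!/bin/sh
# mntr_field KEY
# Reads mntr output on stdin and prints the value stored under KEY.
mntr_field() {
  awk -v key="$1" '$1 == key { print $2 }'
}

# mntr_summary
# Reads mntr output on stdin and prints a one-line summary.
# Missing fields are reported as "unknown".
mntr_summary() {
  mntr_out=$(cat)
  mntr_state=$(printf '%s\n' "$mntr_out" | mntr_field zk_server_state)
  mntr_latency=$(printf '%s\n' "$mntr_out" | mntr_field zk_avg_latency)
  mntr_znodes=$(printf '%s\n' "$mntr_out" | mntr_field zk_znode_count)
  echo "state=${mntr_state:-unknown} avg_latency=${mntr_latency:-unknown} znodes=${mntr_znodes:-unknown}"
}

# Usage (needs network access, so it is left commented out):
#   echo mntr | nc -w 3 <zk-host> 2181 | mntr_summary
```

Running this against every ensemble member and comparing the state values is one way to spot an ensemble in which no node reports itself as leader.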
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| system.zookeeper query success and latency | Direct probe of coordination connectivity | Query fails, times out, or latency exceeds ~1 second |
| system.zookeeper_connection.is_expired | Indicates a lost session; replicas will go readonly | is_expired = 1 sustained for more than a few seconds |
| system.replicas.is_readonly | Shows write availability loss for replicated tables | is_readonly = 1 sustained for more than 5 minutes |
| system.replicas.is_session_expired | Replica-level view of coordination partition | Any sustained non-zero value |
| ZK/Keeper average latency from mntr | Leading indicator of saturation before sessions drop | zk_avg_latency above baseline or approaching 30% of session_timeout_ms |
| Replication queue depth | Backlog caused by stalled coordination | queue_size growing for more than 15 minutes |
| RejectedInserts event counter | Hard failure when replicated writes are blocked | Any sustained increment while insert traffic is active |
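The latency warning sign in the table is a ratio, so it can be computed rather than eyeballed. The sketch below expresses average latency as a percentage of the session timeout and flags it at the rough 30% mark used in this guide. The function name `latency_budget` is hypothetical, both arguments are assumed to be in milliseconds, and the threshold is a rule of thumb rather than a ClickHouse or ZooKeeper constant.

```shell
#!/bin/sh
# latency_budget AVG_LATENCY_MS SESSION_TIMEOUT_MS
# Prints "ok" or "warn" followed by latency as a percentage of the timeout.
# Warns at 30% or above, matching the rough threshold used in this guide.
latency_budget() {
  awk -v lat="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "invalid"; exit 1 }
    pct = 100 * lat / t
    printf "%s %.1f%%\n", (pct >= 30 ? "warn" : "ok"), pct
  }'
}

# Usage: feed it zk_avg_latency from mntr and session_timeout_ms
# from system.zookeeper_connection, for example:
#   latency_budget 3000 30000
```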
Fixes
ZooKeeper/Keeper ensemble failure
Restore the ensemble before touching ClickHouse. Verify quorum, elect a leader, and ensure the transaction log disk is not saturated. For JVM-based ZooKeeper, check garbage collection logs and heap pressure. Avoid restarting ClickHouse nodes while the ensemble is unstable; a reconnection storm can worsen saturation.
Network partition, firewall, or DNS failure
Restore name resolution and TCP reachability. If ZK hosts changed IPs, update ClickHouse configuration and restart the server to pick up new endpoints. Validate with dig and nc from every ClickHouse node before declaring the incident resolved.
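Validating every endpoint by hand is error-prone during an incident. The following is a minimal sketch of a per-endpoint check that separates DNS failures from TCP failures. It is an illustration, not a standard tool: `check_endpoint`, `RESOLVE_CMD`, and `CONNECT_CMD` are hypothetical names, and `getent hosts` is swapped in for dig because it returns a non-zero exit status when a name does not resolve. The `nc -z -w 3` default assumes a netcat variant that supports both flags.

```shell
#!/bin/sh
# check_endpoint HOST PORT
# Prints "<host>:<port> ok", "dns_fail", or "tcp_fail".
# RESOLVE_CMD and CONNECT_CMD are hypothetical override hooks so the
# resolver and connect commands can be replaced (or stubbed in tests).
check_endpoint() {
  ep_host=$1
  ep_port=$2
  # Unquoted on purpose: the command string is split into words.
  if ! ${RESOLVE_CMD:-getent hosts} "$ep_host" >/dev/null 2>&1; then
    echo "$ep_host:$ep_port dns_fail"
    return 1
  fi
  if ! ${CONNECT_CMD:-nc -z -w 3} "$ep_host" "$ep_port" >/dev/null 2>&1; then
    echo "$ep_host:$ep_port tcp_fail"
    return 2
  fi
  echo "$ep_host:$ep_port ok"
}

# Usage (run on every ClickHouse node; host names are placeholders):
#   for h in <zk-host-1> <zk-host-2> <zk-host-3>; do check_endpoint "$h" 2181; done
```

Distinguishing the two failure modes matters because a dns_fail result points at resolvers or records, while tcp_fail points at firewalls, routes, or a stopped process.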
Coordination saturation
Pause non-essential DDL immediately. Reduce the number of replicated tables where possible, or split heavy metadata workloads across separate clusters. For external ZooKeeper, move the transaction log to a dedicated fast disk and consider adding ensemble nodes. For ClickHouse Keeper, review CPU and I/O on the Keeper hosts.
Session timeout flapping
If zk_avg_latency is stable but repeatedly approaches session_timeout_ms, increase the timeout cautiously. Edit session_timeout_ms in the <zookeeper> configuration and restart ClickHouse. Longer timeouts trade failure detection speed for stability under latency spikes.
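One way to apply the change without editing the main configuration file is a small override file that ClickHouse merges into the existing zookeeper section. The sketch below only renders the XML; the function name `render_timeout_override` is hypothetical, and the `<clickhouse>` root element assumes a reasonably recent server version (older releases used a different root element name).

```shell
#!/bin/sh
# render_timeout_override TIMEOUT_MS
# Prints a config override that sets session_timeout_ms in the
# <zookeeper> section. Rejects anything that is not a plain integer.
render_timeout_override() {
  ov_ms=$1
  case "$ov_ms" in
    ''|*[!0-9]*)
      echo "timeout must be an integer number of milliseconds" >&2
      return 1
      ;;
  esac
  cat <<EOF
<clickhouse>
    <zookeeper>
        <session_timeout_ms>${ov_ms}</session_timeout_ms>
    </zookeeper>
</clickhouse>
EOF
}

# Usage (the target path is an assumption about a default package layout):
#   render_timeout_override 60000 > /etc/clickhouse-server/config.d/zookeeper_session_timeout.xml
```

Keeping the value in its own file makes the change easy to review and revert, and the restart described above is still required for it to take effect.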
ClickHouse Keeper process down
If embedded or dedicated Keeper is not responding on port 9181, inspect the process and logs. Restart the Keeper service or the hosting node. Once Keeper is healthy, ClickHouse should reconnect automatically.
Prevention
- Monitor coordination latency, not just process liveness. A ZK process that answers ruok can still be too slow for ClickHouse sessions.
- Keep the transaction log on fast, dedicated storage. ZooKeeper writes the transaction log synchronously; disk latency on this path directly affects coordination latency.
- Limit replicated table count and DDL frequency. Each replicated table creates znodes and watches; excessive DDL amplifies metadata pressure.
- Use stable DNS or static IPs for ZK endpoints. DNS resolution failures are a routine cause of apparent coordination loss.
- Match session timeouts to observed latency. Set timeouts well above the baseline P99 coordination latency, with headroom for spikes.
- Watch replication queue depth and part count. Growing queues can be an early sign of coordination stress before sessions expire.
How Netdata helps
- Correlates is_readonly and is_session_expired with host network and disk metrics.
- Surfaces insert latency and write rejection trends to expose backpressure before writes hard-fail.
- Tracks TCP connectivity and DNS resolution metrics to detect network-side root causes.
- Alerts on system.zookeeper_connection health and replica availability without requiring manual polling.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool







