ClickHouse replicas out of sync: when SELECT count() differs across replicas
You run the same SELECT count() against two replicas and get different numbers. The replica with the lower count still responds instantly, its HTTP ping returns Ok., its replication queue may even be empty, and its CPU and memory look fine. Standard liveness checks give false confidence while the replica serves stale or incomplete data.
ClickHouse replication is asynchronous and coordinated through ZooKeeper or ClickHouse Keeper. A replica can lose its session, miss fetches, or silently drop data, yet continue serving reads from local disk. Distributed queries hide the divergence unless you compare row counts or partition metadata across nodes.
What this means
An out-of-sync replica does not stop serving traffic. A replica with is_session_expired = 1 has lost its ZooKeeper session and cannot receive new replication log entries, but retains all local parts up to the point of disconnection. Queries return fast, stale results. In more severe cases, data loss or accidental partition drops on a single replica cause permanent divergence. The replica reports as active, its queue may be empty, and SELECT 1 succeeds. Detect the problem by comparing data directly across replicas or inspecting replication metadata.
This is a data integrity incident, not a performance issue. Distributed queries that fan out to all replicas return inconsistent results depending on which node serves the request.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ZooKeeper session expired | is_session_expired = 1 on the diverged replica; SELECTs are fast but return stale counts; absolute_delay is frozen or climbing | system.replicas on the affected node |
| Silent data loss or corruption | ReplicatedDataLoss event counter is nonzero; replica queue is empty; counts differ with no obvious lag | system.events and cross-replica SELECT count() |
Accidental DROP PARTITION on one replica | Queue appears healthy; counts differ by an exact partition boundary; replica is not readonly | Compare system.parts grouped by partition_id across replicas |
| Replication queue entry stuck permanently | num_tries climbing in system.replication_queue; last_exception shows repeated fetch or checksum failures; counts diverge and never converge | system.replication_queue filtered by num_tries > 0 |
Quick checks
Run these read-only checks on each replica.
SELECT
database,
table,
is_leader,
is_readonly,
is_session_expired,
total_replicas,
active_replicas,
queue_size,
log_max_index - log_pointer AS replication_lag_entries,
absolute_delay
FROM system.replicas
WHERE engine LIKE '%Replicated%';
SELECT name, host, port, is_expired, session_uptime_elapsed_seconds
FROM system.zookeeper_connection;
SELECT
database,
table,
type,
create_time,
last_attempt_time,
num_tries,
last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
SELECT event, value
FROM system.events
WHERE event = 'ReplicatedDataLoss';
SELECT
partition_id,
count() AS parts,
sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
GROUP BY partition_id
ORDER BY parts DESC;
-- Run on each replica and diff the output
SELECT count() FROM your_database.your_table;
How to diagnose it
Confirm the mismatch. Run
SELECT count() FROM db.tableon every replica. If you have many tables, start with the largest discrepancy. Query each replica directly; do not trust distributed DDL or cluster-level views.Check replica session state. Inspect
system.replicasforis_session_expiredandis_readonly. Ifis_session_expired = 1, the replica is partitioned from the coordination service and cannot receive new replication events. Reads return stale data.Verify coordination service health. Query
system.zookeeper_connectionforis_expired. If the session is expired, check ZooKeeper or ClickHouse Keeper independently withecho ruok | nc <host> 2181(or port 9181 for Keeper). High coordination latency often precedes session expiry.Inspect the replication queue. Query
system.replication_queuefor entries withnum_tries > 0. If the queue is empty but counts differ, you are in the silent divergence pattern. A non-empty queue with highnum_triesindicates a stuck fetch or merge blocking convergence.Compare per-partition metadata. Run the
system.partsquery from Quick checks on each replica. Look for missingpartition_idvalues or partitions wheresum(rows)differs. This identifies exactly which data ranges are out of sync.Check for data loss signals. A nonzero
ReplicatedDataLosscounter insystem.eventsmeans ClickHouse has detected that a part is irretrievably lost. This is a hard integrity failure.Review server logs. Search the ClickHouse server log for
Broken part,Checksum,Cannot read all data, or replication fetch failures around the time divergence began.
flowchart TD
A[Count differs across replicas] --> B{Check system.replicas}
B -->|is_session_expired=1| C[ZK session lost
stale reads]
B -->|queue empty
counts differ| D[Silent divergence]
D --> E[Check ReplicatedDataLoss events]
E -->|Event count > 0| F[Data loss or corruption]
E -->|No events| G[Compare system.parts
per partition]
G -->|Missing partition| H[Partial DDL or failed fetch]
B -->|num_tries high| I[Stuck queue entry
check last_exception]Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
absolute_delay | Seconds since the oldest unprocessed insert; measures data freshness | Sustained value > 120 s; requires NTP sync for accuracy |
is_session_expired | An expired session means the replica is not receiving replication events, yet may still serve reads | Any non-zero value sustained beyond brief ZK leader elections |
active_replicas vs total_replicas | Shows whether the cluster sees all expected replicas | active_replicas < total_replicas indicates missing or partitioned replicas |
system.replication_queue.num_tries | Stuck entries block subsequent work | Any entry with num_tries > 5 and non-empty last_exception |
ReplicatedDataLoss | The only automatic signal that parts have been permanently lost | Any nonzero value is a critical integrity event |
Per-partition sum(rows) in system.parts | Reveals silent divergence that queue depth metrics hide | Differs across replicas for the same partition |
system.detached_parts | Detached parts from corruption or failed fetches reduce visible row counts | Unexpected entries with reason indicating corruption or broken fetch |
Fixes
Expired ZooKeeper session
If is_session_expired = 1, the replica has lost coordination. Run SYSTEM RESTART REPLICA db.table to force the replica to recreate its session and re-verify state against ZooKeeper. If the session repeatedly expires, fix the root cause: network partition, DNS failure, or ZooKeeper/Keeper overload. Do not restart the ClickHouse server as a first response; this creates additional coordination load.
Stuck replication queue entry
If system.replication_queue shows an entry with high num_tries and persistent last_exception, read the exception text. Transient network errors usually resume once connectivity returns. Permanent errors such as missing source parts or checksum mismatches may resolve by fetching a merged alternative. <!-- TODO: verify ReplicatedPartFetchesOfMerged behavior in your ClickHouse version --> If the entry is genuinely wedged, SYSTEM RESTART REPLICA db.table forces the replica to re-evaluate its queue from the coordination log.
Silent data loss or severe divergence
When counts differ and parts are missing or corrupt, the safest recovery is SYSTEM RESTORE REPLICA db.table. <!-- TODO: verify SYSTEM RESTORE REPLICA syntax and availability in your ClickHouse version --> This is destructive: it re-initializes the replica’s local metadata from ZooKeeper and re-fetches all parts from peers. Only use this when you have confirmed that at least one other replica is fully healthy. Ensure the replica has sufficient disk space and network bandwidth; it will pull all parts from scratch.
Accidental partition drop on one replica
If a partition exists on other replicas but is missing on one, and the queue is empty, the replica may have executed a local drop or lost the part silently. Reattach the partition if it exists in the detached/ directory, or force a re-fetch by dropping the local partition reference and allowing replication to heal it. If metadata is inconsistent, prefer SYSTEM RESTART REPLICA or SYSTEM RESTORE REPLICA over manual filesystem surgery.
Prevention
Run periodic cross-replica consistency checks. Automate SELECT count() on critical tables across all replicas and alert on mismatch. Do not rely on replication queue depth alone; queue depth can be zero while data is wrong.
Monitor is_session_expired and active_replicas as binary health signals, not lag indicators. Alert on ReplicatedDataLoss with zero tolerance.
Keep NTP synchronized across all ClickHouse nodes and ZooKeeper/Keeper hosts. Without accurate clocks, absolute_delay becomes unreliable.
Before running DDL such as DROP PARTITION or ALTER TABLE on replicated tables, confirm you are using ON CLUSTER syntax. Local DDL on one replica is a common source of silent divergence.
How Netdata helps
- Correlate
absolute_delaywith ZooKeeper/Keeper latency to distinguish coordination slowdown from replication backlog. - Alert on
is_session_expiredoris_readonlyalongside query success rates to catch stale-read scenarios. - Track
ReplicatedDataLossandRejectedInsertsin context with per-replica insert latency to surface integrity issues before they cascade. - Compare part count and merge activity across replicas to identify nodes that are falling behind while still serving traffic.
Related guides
- ClickHouse active part count growing: reading MaxPartCountForPartition before it pages
- ClickHouse ALTER UPDATE/DELETE overuse: why mutations are not row updates
- ClickHouse async inserts: when async_insert fixes too-many-parts and when it hides it
- ClickHouse DelayedInserts climbing: the warning before too-many-parts
- ClickHouse insert latency rising: the leading indicator of write-pipeline trouble
- ClickHouse Memory limit (for query) exceeded: per-query limits and GROUP BY/JOIN blowups
- ClickHouse Memory limit (total) exceeded - server-wide memory pressure and fixes
- ClickHouse memory pressure death spiral: runaway queries, retries, and OOM
- ClickHouse MemoryTracking vs MemoryResident: reading the memory gap correctly
- ClickHouse merge death spiral: when parts accumulate faster than merges consolidate
- ClickHouse merge duration climbing: the leading indicator of part explosion
- ClickHouse merges not keeping up: diagnosing a stalled or starved merge pool







