ClickHouse replicas out of sync: when SELECT count() differs across replicas

You run the same SELECT count() against two replicas and get different numbers. The replica with the lower count still responds instantly, its HTTP ping returns Ok., its replication queue may even be empty, and its CPU and memory look fine. Standard liveness checks give false confidence while the replica serves stale or incomplete data.

ClickHouse replication is asynchronous and coordinated through ZooKeeper or ClickHouse Keeper. A replica can lose its session, miss fetches, or silently drop data, yet continue serving reads from local disk. Distributed queries hide the divergence unless you compare row counts or partition metadata across nodes.

What this means

An out-of-sync replica does not stop serving traffic. A replica with is_session_expired = 1 has lost its ZooKeeper session and cannot receive new replication log entries, but retains all local parts up to the point of disconnection. Queries return fast, stale results. In more severe cases, data loss or accidental partition drops on a single replica cause permanent divergence. The replica reports as active, its queue may be empty, and SELECT 1 succeeds. Detect the problem by comparing data directly across replicas or inspecting replication metadata.

This is a data integrity incident, not a performance issue. Distributed queries that fan out to all replicas return inconsistent results depending on which node serves the request.

Common causes

CauseWhat it looks likeFirst thing to check
ZooKeeper session expiredis_session_expired = 1 on the diverged replica; SELECTs are fast but return stale counts; absolute_delay is frozen or climbingsystem.replicas on the affected node
Silent data loss or corruptionReplicatedDataLoss event counter is nonzero; replica queue is empty; counts differ with no obvious lagsystem.events and cross-replica SELECT count()
Accidental DROP PARTITION on one replicaQueue appears healthy; counts differ by an exact partition boundary; replica is not readonlyCompare system.parts grouped by partition_id across replicas
Replication queue entry stuck permanentlynum_tries climbing in system.replication_queue; last_exception shows repeated fetch or checksum failures; counts diverge and never convergesystem.replication_queue filtered by num_tries > 0

Quick checks

Run these read-only checks on each replica.

SELECT
    database,
    table,
    is_leader,
    is_readonly,
    is_session_expired,
    total_replicas,
    active_replicas,
    queue_size,
    log_max_index - log_pointer AS replication_lag_entries,
    absolute_delay
FROM system.replicas
WHERE engine LIKE '%Replicated%';
SELECT name, host, port, is_expired, session_uptime_elapsed_seconds
FROM system.zookeeper_connection;
SELECT
    database,
    table,
    type,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;
SELECT event, value
FROM system.events
WHERE event = 'ReplicatedDataLoss';
SELECT
    partition_id,
    count() AS parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
GROUP BY partition_id
ORDER BY parts DESC;
-- Run on each replica and diff the output
SELECT count() FROM your_database.your_table;

How to diagnose it

  1. Confirm the mismatch. Run SELECT count() FROM db.table on every replica. If you have many tables, start with the largest discrepancy. Query each replica directly; do not trust distributed DDL or cluster-level views.

  2. Check replica session state. Inspect system.replicas for is_session_expired and is_readonly. If is_session_expired = 1, the replica is partitioned from the coordination service and cannot receive new replication events. Reads return stale data.

  3. Verify coordination service health. Query system.zookeeper_connection for is_expired. If the session is expired, check ZooKeeper or ClickHouse Keeper independently with echo ruok | nc <host> 2181 (or port 9181 for Keeper). High coordination latency often precedes session expiry.

  4. Inspect the replication queue. Query system.replication_queue for entries with num_tries > 0. If the queue is empty but counts differ, you are in the silent divergence pattern. A non-empty queue with high num_tries indicates a stuck fetch or merge blocking convergence.

  5. Compare per-partition metadata. Run the system.parts query from Quick checks on each replica. Look for missing partition_id values or partitions where sum(rows) differs. This identifies exactly which data ranges are out of sync.

  6. Check for data loss signals. A nonzero ReplicatedDataLoss counter in system.events means ClickHouse has detected that a part is irretrievably lost. This is a hard integrity failure.

  7. Review server logs. Search the ClickHouse server log for Broken part, Checksum, Cannot read all data, or replication fetch failures around the time divergence began.

flowchart TD
    A[Count differs across replicas] --> B{Check system.replicas}
    B -->|is_session_expired=1| C[ZK session lost
stale reads] B -->|queue empty
counts differ| D[Silent divergence] D --> E[Check ReplicatedDataLoss events] E -->|Event count > 0| F[Data loss or corruption] E -->|No events| G[Compare system.parts
per partition] G -->|Missing partition| H[Partial DDL or failed fetch] B -->|num_tries high| I[Stuck queue entry
check last_exception]

Metrics and signals to monitor

SignalWhy it mattersWarning sign
absolute_delaySeconds since the oldest unprocessed insert; measures data freshnessSustained value > 120 s; requires NTP sync for accuracy
is_session_expiredAn expired session means the replica is not receiving replication events, yet may still serve readsAny non-zero value sustained beyond brief ZK leader elections
active_replicas vs total_replicasShows whether the cluster sees all expected replicasactive_replicas < total_replicas indicates missing or partitioned replicas
system.replication_queue.num_triesStuck entries block subsequent workAny entry with num_tries > 5 and non-empty last_exception
ReplicatedDataLossThe only automatic signal that parts have been permanently lostAny nonzero value is a critical integrity event
Per-partition sum(rows) in system.partsReveals silent divergence that queue depth metrics hideDiffers across replicas for the same partition
system.detached_partsDetached parts from corruption or failed fetches reduce visible row countsUnexpected entries with reason indicating corruption or broken fetch

Fixes

Expired ZooKeeper session

If is_session_expired = 1, the replica has lost coordination. Run SYSTEM RESTART REPLICA db.table to force the replica to recreate its session and re-verify state against ZooKeeper. If the session repeatedly expires, fix the root cause: network partition, DNS failure, or ZooKeeper/Keeper overload. Do not restart the ClickHouse server as a first response; this creates additional coordination load.

Stuck replication queue entry

If system.replication_queue shows an entry with high num_tries and persistent last_exception, read the exception text. Transient network errors usually resume once connectivity returns. Permanent errors such as missing source parts or checksum mismatches may resolve by fetching a merged alternative. <!-- TODO: verify ReplicatedPartFetchesOfMerged behavior in your ClickHouse version --> If the entry is genuinely wedged, SYSTEM RESTART REPLICA db.table forces the replica to re-evaluate its queue from the coordination log.

Silent data loss or severe divergence

When counts differ and parts are missing or corrupt, the safest recovery is SYSTEM RESTORE REPLICA db.table. <!-- TODO: verify SYSTEM RESTORE REPLICA syntax and availability in your ClickHouse version --> This is destructive: it re-initializes the replica’s local metadata from ZooKeeper and re-fetches all parts from peers. Only use this when you have confirmed that at least one other replica is fully healthy. Ensure the replica has sufficient disk space and network bandwidth; it will pull all parts from scratch.

Accidental partition drop on one replica

If a partition exists on other replicas but is missing on one, and the queue is empty, the replica may have executed a local drop or lost the part silently. Reattach the partition if it exists in the detached/ directory, or force a re-fetch by dropping the local partition reference and allowing replication to heal it. If metadata is inconsistent, prefer SYSTEM RESTART REPLICA or SYSTEM RESTORE REPLICA over manual filesystem surgery.

Prevention

Run periodic cross-replica consistency checks. Automate SELECT count() on critical tables across all replicas and alert on mismatch. Do not rely on replication queue depth alone; queue depth can be zero while data is wrong.

Monitor is_session_expired and active_replicas as binary health signals, not lag indicators. Alert on ReplicatedDataLoss with zero tolerance.

Keep NTP synchronized across all ClickHouse nodes and ZooKeeper/Keeper hosts. Without accurate clocks, absolute_delay becomes unreliable.

Before running DDL such as DROP PARTITION or ALTER TABLE on replicated tables, confirm you are using ON CLUSTER syntax. Local DDL on one replica is a common source of silent divergence.

How Netdata helps

  • Correlate absolute_delay with ZooKeeper/Keeper latency to distinguish coordination slowdown from replication backlog.
  • Alert on is_session_expired or is_readonly alongside query success rates to catch stale-read scenarios.
  • Track ReplicatedDataLoss and RejectedInserts in context with per-replica insert latency to surface integrity issues before they cascade.
  • Compare part count and merge activity across replicas to identify nodes that are falling behind while still serving traffic.