$ guides / clickhouse / clickhouse-replicas-out-of-sync-counts-differ ▌

Operations Guides

ClickHouse replicas out of sync: when SELECT count() differs across replicas

You run the same SELECT count() against two replicas and get different numbers. The replica with the lower count still responds instantly, its HTTP ping returns Ok., its replication queue may even be empty, and its CPU and memory look fine. Standard liveness checks give false confidence while the replica serves stale or incomplete data.

ClickHouse replication is asynchronous and coordinated through ZooKeeper or ClickHouse Keeper. A replica can lose its session, miss fetches, or silently drop data, yet continue serving reads from local disk. Distributed queries hide the divergence unless you compare row counts or partition metadata across nodes.

What this means

An out-of-sync replica does not stop serving traffic. A replica with is_session_expired = 1 has lost its ZooKeeper session and cannot receive new replication log entries, but retains all local parts up to the point of disconnection. Queries return fast, stale results. In more severe cases, data loss or accidental partition drops on a single replica cause permanent divergence. The replica reports as active, its queue may be empty, and SELECT 1 succeeds. Detect the problem by comparing data directly across replicas or inspecting replication metadata.

This is a data integrity incident, not a performance issue. Distributed queries that fan out to all replicas return inconsistent results depending on which node serves the request.

Common causes

Cause	What it looks like	First thing to check
ZooKeeper session expired	`is_session_expired = 1` on the diverged replica; SELECTs are fast but return stale counts; `absolute_delay` is frozen or climbing	`system.replicas` on the affected node
Silent data loss or corruption	`ReplicatedDataLoss` event counter is nonzero; replica queue is empty; counts differ with no obvious lag	`system.events` and cross-replica `SELECT count()`
Accidental `DROP PARTITION` on one replica	Queue appears healthy; counts differ by an exact partition boundary; replica is not readonly	Compare `system.parts` grouped by `partition_id` across replicas
Replication queue entry stuck permanently	`num_tries` climbing in `system.replication_queue`; `last_exception` shows repeated fetch or checksum failures; counts diverge and never converge	`system.replication_queue` filtered by `num_tries > 0`

Quick checks

Run these read-only checks on each replica.

SELECT
    database,
    table,
    is_leader,
    is_readonly,
    is_session_expired,
    total_replicas,
    active_replicas,
    queue_size,
    log_max_index - log_pointer AS replication_lag_entries,
    absolute_delay
FROM system.replicas
WHERE engine LIKE '%Replicated%';

SELECT name, host, port, is_expired, session_uptime_elapsed_seconds
FROM system.zookeeper_connection;

SELECT
    database,
    table,
    type,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception
FROM system.replication_queue
WHERE num_tries > 0
ORDER BY num_tries DESC
LIMIT 20;

SELECT event, value
FROM system.events
WHERE event = 'ReplicatedDataLoss';

SELECT
    partition_id,
    count() AS parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
GROUP BY partition_id
ORDER BY parts DESC;

-- Run on each replica and diff the output
SELECT count() FROM your_database.your_table;

How to diagnose it

Confirm the mismatch. Run SELECT count() FROM db.table on every replica. If you have many tables, start with the largest discrepancy. Query each replica directly; do not trust distributed DDL or cluster-level views.
Check replica session state. Inspect system.replicas for is_session_expired and is_readonly. If is_session_expired = 1, the replica is partitioned from the coordination service and cannot receive new replication events. Reads return stale data.
Verify coordination service health. Query system.zookeeper_connection for is_expired. If the session is expired, check ZooKeeper or ClickHouse Keeper independently with echo ruok | nc <host> 2181 (or port 9181 for Keeper). High coordination latency often precedes session expiry.
Inspect the replication queue. Query system.replication_queue for entries with num_tries > 0. If the queue is empty but counts differ, you are in the silent divergence pattern. A non-empty queue with high num_tries indicates a stuck fetch or merge blocking convergence.
Compare per-partition metadata. Run the system.parts query from Quick checks on each replica. Look for missing partition_id values or partitions where sum(rows) differs. This identifies exactly which data ranges are out of sync.
Check for data loss signals. A nonzero ReplicatedDataLoss counter in system.events means ClickHouse has detected that a part is irretrievably lost. This is a hard integrity failure.
Review server logs. Search the ClickHouse server log for Broken part, Checksum, Cannot read all data, or replication fetch failures around the time divergence began.

flowchart TD
    A[Count differs across replicas] --> B{Check system.replicas}
    B -->|is_session_expired=1| C[ZK session lost
stale reads]
    B -->|queue empty
counts differ| D[Silent divergence]
    D --> E[Check ReplicatedDataLoss events]
    E -->|Event count > 0| F[Data loss or corruption]
    E -->|No events| G[Compare system.parts
per partition]
    G -->|Missing partition| H[Partial DDL or failed fetch]
    B -->|num_tries high| I[Stuck queue entry
check last_exception]

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`absolute_delay`	Seconds since the oldest unprocessed insert; measures data freshness	Sustained value > 120 s; requires NTP sync for accuracy
`is_session_expired`	An expired session means the replica is not receiving replication events, yet may still serve reads	Any non-zero value sustained beyond brief ZK leader elections
`active_replicas` vs `total_replicas`	Shows whether the cluster sees all expected replicas	`active_replicas < total_replicas` indicates missing or partitioned replicas
`system.replication_queue.num_tries`	Stuck entries block subsequent work	Any entry with `num_tries > 5` and non-empty `last_exception`
`ReplicatedDataLoss`	The only automatic signal that parts have been permanently lost	Any nonzero value is a critical integrity event
Per-partition `sum(rows)` in `system.parts`	Reveals silent divergence that queue depth metrics hide	Differs across replicas for the same partition
`system.detached_parts`	Detached parts from corruption or failed fetches reduce visible row counts	Unexpected entries with `reason` indicating corruption or broken fetch

Fixes

Expired ZooKeeper session

If is_session_expired = 1, the replica has lost coordination. Run SYSTEM RESTART REPLICA db.table to force the replica to recreate its session and re-verify state against ZooKeeper. If the session repeatedly expires, fix the root cause: network partition, DNS failure, or ZooKeeper/Keeper overload. Do not restart the ClickHouse server as a first response; this creates additional coordination load.

Stuck replication queue entry

If system.replication_queue shows an entry with high num_tries and persistent last_exception, read the exception text. Transient network errors usually resume once connectivity returns. Permanent errors such as missing source parts or checksum mismatches may resolve by fetching a merged alternative.  If the entry is genuinely wedged, SYSTEM RESTART REPLICA db.table forces the replica to re-evaluate its queue from the coordination log.

Silent data loss or severe divergence

When counts differ and parts are missing or corrupt, the safest recovery is SYSTEM RESTORE REPLICA db.table.  This is destructive: it re-initializes the replica’s local metadata from ZooKeeper and re-fetches all parts from peers. Only use this when you have confirmed that at least one other replica is fully healthy. Ensure the replica has sufficient disk space and network bandwidth; it will pull all parts from scratch.

Accidental partition drop on one replica

If a partition exists on other replicas but is missing on one, and the queue is empty, the replica may have executed a local drop or lost the part silently. Reattach the partition if it exists in the detached/ directory, or force a re-fetch by dropping the local partition reference and allowing replication to heal it. If metadata is inconsistent, prefer SYSTEM RESTART REPLICA or SYSTEM RESTORE REPLICA over manual filesystem surgery.

Prevention

Run periodic cross-replica consistency checks. Automate SELECT count() on critical tables across all replicas and alert on mismatch. Do not rely on replication queue depth alone; queue depth can be zero while data is wrong.

Monitor is_session_expired and active_replicas as binary health signals, not lag indicators. Alert on ReplicatedDataLoss with zero tolerance.

Keep NTP synchronized across all ClickHouse nodes and ZooKeeper/Keeper hosts. Without accurate clocks, absolute_delay becomes unreliable.

Before running DDL such as DROP PARTITION or ALTER TABLE on replicated tables, confirm you are using ON CLUSTER syntax. Local DDL on one replica is a common source of silent divergence.

How Netdata helps

Correlate absolute_delay with ZooKeeper/Keeper latency to distinguish coordination slowdown from replication backlog.
Alert on is_session_expired or is_readonly alongside query success rates to catch stale-read scenarios.
Track ReplicatedDataLoss and RejectedInserts in context with per-replica insert latency to surface integrity issues before they cascade.
Compare part count and merge activity across replicas to identify nodes that are falling behind while still serving traffic.

The Netdata solution

ClickHouse monitoring with Netdata

Netdata monitors ClickHouse with per-second metrics and ML anomaly detection. Track merge debt, memory usage, replication lag, Keeper/ZooKeeper saturation, and disk headroom against the host signals that drive them.

See ClickHouse monitoring → Start monitoring free

ClickHouse replicas out of sync: when SELECT count() differs across replicas

ClickHouse replicas out of sync: when SELECT count() differs across replicas

What this means

Common causes

Quick checks

How to diagnose it

Metrics and signals to monitor

Fixes

Expired ZooKeeper session

Stuck replication queue entry

Silent data loss or severe divergence

Accidental partition drop on one replica

Prevention

How Netdata helps

Related guides

ClickHouse monitoring with Netdata