Elasticsearch replica lag: sequence-number gaps and in-sync set removal

A replica shard is falling behind its primary. The local checkpoint on the replica is lower than the global checkpoint on the primary, and the gap is growing across consecutive samples. If the replica cannot acknowledge writes fast enough, Elasticsearch removes it from the in-sync set. At that point, wait_for_active_shards semantics change for the replication group and data redundancy is reduced. The replica must undergo peer recovery to rejoin. Read sequence-number checkpoints, find the bottleneck, and stop the lag before the master drops the replica.

What this means

Since Elasticsearch 6.0, every indexing operation receives a monotonically increasing sequence number (seq_no). The primary shard tracks three key values per replication group:

  • max_seq_no: the highest seq_no the primary has processed.
  • global_checkpoint (GCP): the highest seq_no acknowledged by all in-sync copies.
  • local_checkpoint (LCP): the highest seq_no successfully applied on a specific copy.

When local_checkpoint == global_checkpoint == max_seq_no, the group is fully in sync. When a replica’s LCP falls below the primary’s GCP, the replica is lagging. A shrinking gap means the replica is catching up, usually during peer recovery. A growing gap means the replica is falling further behind.

The primary advances the global checkpoint only when every in-sync replica acknowledges the write. If a replica stops acknowledging within the primary’s timeout, the primary asks the master to remove that replica’s allocation ID from the in-sync set. Once removed, the replica is stale. It will not be promoted to primary if the current primary fails, and it must replay the translog from the global checkpoint forward to rejoin.

flowchart TD
    Write[Client write]
    Primary[Primary shard]
    Replica[Replica shard]
    Master[Master node]

    Write --> Primary
    Primary --"replicate seq_no"--> Replica
    Replica --"ack up to LCP"--> Primary
    Primary --"report lag"--> Master
    Master --"drop from in-sync set"--> Replica
    Replica --"peer recovery required"--> Primary

Common causes

CauseWhat it looks likeFirst thing to check
Slow replica node (I/O or GC pressure)LCP on one replica stalls while the primary’s max_seq_no grows; other replicas keep up_cat/nodes for heap, CPU, and load on the lagging host
Network congestion between primary and replicaGap grows on only one replica; replication traffic traverses a congested linkNetwork throughput and error rates between the two nodes
Primary overwhelmed by indexing throughputAll replicas show lag; primary write thread pool queues are high_cat/thread_pool/write for queue depth and rejections on the primary
Peer recovery replaying translogGap exists but shrinks steadily; recovery is active_cat/recovery for translog replay progress

Quick checks

# Sequence-number checkpoints per shard
curl -s 'http://localhost:9200/<index>/_stats?level=shards&filter_path=indices.*.shards.*.seq_no'

# Shard states
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state'

# JVM and GC pressure on replica and primary nodes
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'

# Write thread pool saturation on the primary
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# Active recoveries and translog replay progress
curl -s 'http://localhost:9200/_cat/recovery?active_only=true&v&h=index,shard,stage,source_host,target_host,translog_ops_percent'

# Cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,active_shards_percent_as_number'

How to diagnose it

  1. Quantify the gap. Run the _stats call and compare seq_no.max_seq_no on the primary against seq_no.local_checkpoint on each replica. A gap greater than 10,000 operations that persists across multiple samples is abnormal.
  2. Determine direction. Sample the gap every 30 seconds. If it is shrinking, the replica is recovering. If it is growing or flat while the primary’s max_seq_no rises, the replica is not keeping up.
  3. Check the replica node for resource stalls. Query _nodes/stats/jvm for the replica’s host. Look for heap_used_percent above 85% or old GC pauses longer than 5 seconds. Check OS-level disk I/O wait; merges, translog fsyncs, and segment writes can saturate slow disk and stall replication.
  4. Check the primary for overload. A primary that is itself bottlenecked cannot replicate efficiently. Check the primary node’s write thread pool queue depth and indexing rate. If the write queue is above 50% of max or rejections are rising, the primary is overwhelmed.
  5. Check for active recovery. If the replica recently restarted, it may be in peer recovery. Check _cat/recovery for the shard. A translog_ops_percent value that is advancing is normal; a stuck percentage indicates a blocked recovery.
  6. Check the network path between primary and replica nodes. Saturation or packet loss on the inter-node link delays replication acknowledgments. Compare network throughput on both hosts against baseline.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Sequence-number gap (global_checkpoint minus local_checkpoint on replica)Direct measure of replication lagSustained gap above 10,000 ops and growing
Indexing rate on primary nodeHigh ingest can overwhelm replication throughputSustained spike above baseline with lag
JVM heap usage and GC activity on replicaGC pauses stall segment writes and acksHeap above 85% or old GC pauses above 5 seconds
Write thread pool queue depth on primaryPrimary cannot fan out replication fast enoughQueue depth sustained above 50% of configured maximum, or any rejections
Shard recovery progressTranslog replay is normal catch-upRecovery stuck or translog ops not advancing
Disk I/O pressure on replica nodeSlow storage blocks fsyncs and segment materializationI/O wait above 20% on spinning disk or SSD latency saturation

Fixes

Slow replica node

If the replica node shows high heap usage, old GC pauses, or disk I/O wait:

  • Reduce non-essential load on the node. Cancel expensive searches or rebalancing operations targeting it.
  • If the node is permanently undersized, add data nodes and let the allocator redistribute shards.
  • For I/O-bound nodes on spinning disks, reduce index.merge.scheduler.max_thread_count to 1 to free disk bandwidth for replication.

Primary overload

If the primary node’s write thread pool is saturated:

  • Reduce ingest rate at the client side with smaller bulk batches or explicit backpressure.
  • Scale out the cluster so primaries are distributed across more nodes.
  • Temporarily increase refresh_interval on the index to reduce segment creation overhead, which lowers primary CPU and I/O pressure.

Network congestion

  • Ensure recovery and snapshot traffic is not saturating the inter-node network. Lower indices.recovery.max_bytes_per_sec. If snapshot traffic shares the path, throttle the repository setting max_snapshot_bytes_per_sec.
  • If the cluster spans availability zones, verify cross-zone bandwidth is sufficient for replication traffic.

Recover a replica dropped from the in-sync set

If the replica has already been removed:

  • Let peer recovery run. Do not restart the replica node unless it is completely hung; restarting interrupts recovery and starts over.
  • If recovery is stuck, check target node disk watermarks and cluster allocation explain with GET /_cluster/allocation/explain.
  • If all in-sync copies are lost and the shard is unassigned, see Elasticsearch cluster health red: unassigned primaries and how to recover.

Prevention

  • Monitor sequence-number gaps. A gap that grows over three consecutive samples should trigger investigation.
  • Keep replica hardware equivalent to primaries. A replica on slower disk will consistently lag under burst indexing.
  • Maintain network headroom below 50% of link capacity. Replication traffic must not compete with recovery or snapshot throughput.
  • Track indexing pressure. Use _nodes/stats/indexing_pressure to catch primary-side overload before write thread pool queues fill.

How Netdata helps

  • Correlate per-node OS metrics (disk I/O wait, network throughput, CPU) with Elasticsearch _stats to determine whether lag is caused by the replica node, the network, or the primary.
  • Alert on JVM heap floor rising on replica nodes before GC pauses stall replication.
  • Visualize indexing rate per node alongside write thread pool queue depth to spot primary-side saturation.
  • Track shard recovery progress and translog size to distinguish normal catch-up from stuck recovery.