Elasticsearch replica lag: sequence-number gaps and in-sync set removal
A replica shard is falling behind its primary. The local checkpoint on the replica is lower than the global checkpoint on the primary, and the gap is growing across consecutive samples. If the replica cannot acknowledge writes fast enough, Elasticsearch removes it from the in-sync set. At that point, wait_for_active_shards semantics change for the replication group and data redundancy is reduced. The replica must undergo peer recovery to rejoin. Read sequence-number checkpoints, find the bottleneck, and stop the lag before the master drops the replica.
What this means
Since Elasticsearch 6.0, every indexing operation receives a monotonically increasing sequence number (seq_no). The primary shard tracks three key values per replication group:
max_seq_no: the highest seq_no the primary has processed.global_checkpoint(GCP): the highest seq_no acknowledged by all in-sync copies.local_checkpoint(LCP): the highest seq_no successfully applied on a specific copy.
When local_checkpoint == global_checkpoint == max_seq_no, the group is fully in sync. When a replica’s LCP falls below the primary’s GCP, the replica is lagging. A shrinking gap means the replica is catching up, usually during peer recovery. A growing gap means the replica is falling further behind.
The primary advances the global checkpoint only when every in-sync replica acknowledges the write. If a replica stops acknowledging within the primary’s timeout, the primary asks the master to remove that replica’s allocation ID from the in-sync set. Once removed, the replica is stale. It will not be promoted to primary if the current primary fails, and it must replay the translog from the global checkpoint forward to rejoin.
flowchart TD
Write[Client write]
Primary[Primary shard]
Replica[Replica shard]
Master[Master node]
Write --> Primary
Primary --"replicate seq_no"--> Replica
Replica --"ack up to LCP"--> Primary
Primary --"report lag"--> Master
Master --"drop from in-sync set"--> Replica
Replica --"peer recovery required"--> PrimaryCommon causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow replica node (I/O or GC pressure) | LCP on one replica stalls while the primary’s max_seq_no grows; other replicas keep up | _cat/nodes for heap, CPU, and load on the lagging host |
| Network congestion between primary and replica | Gap grows on only one replica; replication traffic traverses a congested link | Network throughput and error rates between the two nodes |
| Primary overwhelmed by indexing throughput | All replicas show lag; primary write thread pool queues are high | _cat/thread_pool/write for queue depth and rejections on the primary |
| Peer recovery replaying translog | Gap exists but shrinks steadily; recovery is active | _cat/recovery for translog replay progress |
Quick checks
# Sequence-number checkpoints per shard
curl -s 'http://localhost:9200/<index>/_stats?level=shards&filter_path=indices.*.shards.*.seq_no'
# Shard states
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state'
# JVM and GC pressure on replica and primary nodes
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc'
# Write thread pool saturation on the primary
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
# Active recoveries and translog replay progress
curl -s 'http://localhost:9200/_cat/recovery?active_only=true&v&h=index,shard,stage,source_host,target_host,translog_ops_percent'
# Cluster health and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,active_shards_percent_as_number'
How to diagnose it
- Quantify the gap. Run the
_statscall and compareseq_no.max_seq_noon the primary againstseq_no.local_checkpointon each replica. A gap greater than 10,000 operations that persists across multiple samples is abnormal. - Determine direction. Sample the gap every 30 seconds. If it is shrinking, the replica is recovering. If it is growing or flat while the primary’s
max_seq_norises, the replica is not keeping up. - Check the replica node for resource stalls. Query
_nodes/stats/jvmfor the replica’s host. Look forheap_used_percentabove 85% or old GC pauses longer than 5 seconds. Check OS-level disk I/O wait; merges, translog fsyncs, and segment writes can saturate slow disk and stall replication. - Check the primary for overload. A primary that is itself bottlenecked cannot replicate efficiently. Check the primary node’s write thread pool queue depth and indexing rate. If the
writequeue is above 50% of max or rejections are rising, the primary is overwhelmed. - Check for active recovery. If the replica recently restarted, it may be in peer recovery. Check
_cat/recoveryfor the shard. Atranslog_ops_percentvalue that is advancing is normal; a stuck percentage indicates a blocked recovery. - Check the network path between primary and replica nodes. Saturation or packet loss on the inter-node link delays replication acknowledgments. Compare network throughput on both hosts against baseline.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Sequence-number gap (global_checkpoint minus local_checkpoint on replica) | Direct measure of replication lag | Sustained gap above 10,000 ops and growing |
| Indexing rate on primary node | High ingest can overwhelm replication throughput | Sustained spike above baseline with lag |
| JVM heap usage and GC activity on replica | GC pauses stall segment writes and acks | Heap above 85% or old GC pauses above 5 seconds |
| Write thread pool queue depth on primary | Primary cannot fan out replication fast enough | Queue depth sustained above 50% of configured maximum, or any rejections |
| Shard recovery progress | Translog replay is normal catch-up | Recovery stuck or translog ops not advancing |
| Disk I/O pressure on replica node | Slow storage blocks fsyncs and segment materialization | I/O wait above 20% on spinning disk or SSD latency saturation |
Fixes
Slow replica node
If the replica node shows high heap usage, old GC pauses, or disk I/O wait:
- Reduce non-essential load on the node. Cancel expensive searches or rebalancing operations targeting it.
- If the node is permanently undersized, add data nodes and let the allocator redistribute shards.
- For I/O-bound nodes on spinning disks, reduce
index.merge.scheduler.max_thread_countto 1 to free disk bandwidth for replication.
Primary overload
If the primary node’s write thread pool is saturated:
- Reduce ingest rate at the client side with smaller bulk batches or explicit backpressure.
- Scale out the cluster so primaries are distributed across more nodes.
- Temporarily increase
refresh_intervalon the index to reduce segment creation overhead, which lowers primary CPU and I/O pressure.
Network congestion
- Ensure recovery and snapshot traffic is not saturating the inter-node network. Lower
indices.recovery.max_bytes_per_sec. If snapshot traffic shares the path, throttle the repository settingmax_snapshot_bytes_per_sec. - If the cluster spans availability zones, verify cross-zone bandwidth is sufficient for replication traffic.
Recover a replica dropped from the in-sync set
If the replica has already been removed:
- Let peer recovery run. Do not restart the replica node unless it is completely hung; restarting interrupts recovery and starts over.
- If recovery is stuck, check target node disk watermarks and cluster allocation explain with
GET /_cluster/allocation/explain. - If all in-sync copies are lost and the shard is unassigned, see Elasticsearch cluster health red: unassigned primaries and how to recover.
Prevention
- Monitor sequence-number gaps. A gap that grows over three consecutive samples should trigger investigation.
- Keep replica hardware equivalent to primaries. A replica on slower disk will consistently lag under burst indexing.
- Maintain network headroom below 50% of link capacity. Replication traffic must not compete with recovery or snapshot throughput.
- Track indexing pressure. Use
_nodes/stats/indexing_pressureto catch primary-side overload before write thread pool queues fill.
How Netdata helps
- Correlate per-node OS metrics (disk I/O wait, network throughput, CPU) with Elasticsearch
_statsto determine whether lag is caused by the replica node, the network, or the primary. - Alert on JVM heap floor rising on replica nodes before GC pauses stall replication.
- Visualize indexing rate per node alongside write thread pool queue depth to spot primary-side saturation.
- Track shard recovery progress and translog size to distinguish normal catch-up from stuck recovery.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery







