Elasticsearch translog growing: flush problems, durability, and slow recovery

Shard recovery estimates climb from minutes to hours. A node restart stalls during translog replay. Disk usage creeps up on data nodes, and uncommitted translog size per shard is past the flush threshold. The translog is growing faster than Elasticsearch can flush it to Lucene commit points.

Every indexing operation is appended to the per-shard write-ahead log (the translog) before it is acknowledged. A flush commits those operations to Lucene and truncates the log. When flush cannot keep pace with the write path, uncommitted operations accumulate. During recovery, Elasticsearch replays every uncommitted operation sequentially, so a multi-gigabyte translog can turn into a multi-hour recovery window, extending time-to-recovery and increasing cascade risk if another node fails during replay.
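As a rough illustration of why replay time scales with the number of uncommitted operations, the sketch below divides an operation count by a replay rate. The function name estimate_replay_seconds and the 5,000 ops/s figure are hypothetical; the real rate depends on mappings, document size, and hardware, and should be measured during an actual recovery.

```shell
# Estimate translog replay time in seconds.
# $1 = uncommitted operations
# $2 = assumed replay rate in ops/second (hypothetical, measure your own)
estimate_replay_seconds() {
  ops=$1
  rate=$2
  if [ "$rate" -le 0 ]; then
    echo "rate must be positive" >&2
    return 1
  fi
  # Integer ceiling division so a partial second still counts.
  echo $(( (ops + rate - 1) / rate ))
}

# 36 million operations at an assumed 5,000 ops/s
estimate_replay_seconds 36000000 5000   # prints 7200 (two hours)
```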

The bottleneck is usually indexing throughput, disk I/O capacity, or durability configuration.

flowchart TD
    A[High indexing or I/O stall] --> B[Flush falls behind]
    B --> C[Translog grows past threshold]
    C --> D[Sequential translog replay]
    D --> E[Hours-long shard recovery]
    C --> F[More fsync latency]
    F --> B

What this means

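By default, a flush is triggered and the translog truncated when uncommitted operations reach index.translog.flush_threshold_size. This guide assumes the 512MB default; confirm the effective value for your version in the index settings. Between flushes, every index, update, and delete operation is appended for durability.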

The index.translog.durability setting controls when translog writes are fsynced to disk. The default is request: the translog is fsynced before each index, delete, update, or bulk request is acknowledged. This is the safest setting, but it adds latency to every write request. The alternative is async, which fsyncs only every index.translog.sync_interval (default 5 seconds). This reduces I/O pressure but creates a data-loss window for acknowledged writes not yet fsynced.
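To put a number on that data-loss window, a rough estimate multiplies the acknowledged write rate by the sync interval. The helper name ops_at_risk and the 20,000 ops/s rate are hypothetical, and the result is only an approximate upper bound for a single shard.

```shell
# Rough upper bound on acknowledged operations not yet fsynced under async.
# $1 = acknowledged write rate in ops/second (hypothetical figure)
# $2 = index.translog.sync_interval in seconds (default 5)
ops_at_risk() {
  echo $(( $1 * $2 ))
}

ops_at_risk 20000 5   # prints 100000
```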

In Elasticsearch 8.x, translog retention settings were removed and replaced by soft deletes for peer recovery, but the translog still retains all unflushed operations. Once the translog grows large, a recovering shard cannot come online until those operations have been replayed one by one. If a shard copy fails with a TranslogCorruptedException, the options are limited: let a replica rebuild from a healthy primary, restore from a snapshot, or, as a last resort, discard the corrupted translog (the elasticsearch-shard remove-corrupted-data tool exists for this). Discarding the translog loses any operations on that copy that were acknowledged but not yet flushed to a Lucene commit.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O saturation | Flush time increasing, high I/O wait, translog growing across multiple nodes | iostat -xz 1 and _nodes/stats/indices/flush |
| Indexing rate exceeds flush speed | Write queue building, indexing latency rising, translog ops climbing steadily | _cat/nodes?v&h=name,indexing.index_total rate |
| durability: request on slow storage | High fsync latency, single-node translog bloat, I/O wait spiking | Index settings for translog.durability |
| Replica catching up after outage | Translog large on one replica, recovery active, primary looks normal | _cat/recovery?v&active_only=true |
| Flush suppressed or blocked | Translog ops growing unbounded, flush count stagnant, possible disk watermark blocking allocation | _cat/allocation?v and flush stats |

Quick checks

# Translog size and uncommitted operations per node
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'

# Flush performance and total time
curl -s 'http://localhost:9200/_nodes/stats/indices/flush?filter_path=nodes.*.indices.flush'

# Indexing rate to spot ingest spikes
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,indexing.index_total,indexing.index_time'

# Write and flush thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/write,flush?v&h=node_name,name,active,queue,rejected'

# Active recoveries that may be inflating translogs
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_host,target_host,bytes_percent'

# Disk usage and watermark pressure
curl -s 'http://localhost:9200/_cat/allocation?v'

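# Index-level translog and durability settings
# (include_defaults shows values left at default; flat_settings and pretty
# put each setting on its own line so grep can match it)
curl -s "http://localhost:9200/<index>/_settings?include_defaults=true&flat_settings=true&pretty" | grep -E 'translog|flush_threshold'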

How to diagnose it

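  1. Confirm translog growth. Sample _nodes/stats/indices/translog twice, 30 seconds apart. Node stats aggregate every shard on the node, so use <index>/_stats/translog?level=shards when you need per-shard values. If a shard's uncommitted_size_in_bytes keeps climbing past the configured flush threshold (512MB by default), flush is not keeping up.
  2. Check flush health. Divide flush.total_time_in_millis by flush.total on the affected node. Both counters are cumulative since node start, so divide the deltas between two samples to get a current average. A sustained average above 30 seconds indicates I/O trouble or very large translogs.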
  3. Inspect disk I/O. Run iostat -xz 1 on the affected node. Sustained await above 50ms or utilization above 90% indicates storage saturation.
  4. Correlate with indexing rate. Compute the delta of indexing.index_total between two samples. A sudden spike without a corresponding rise in flush throughput means the write path is outpacing commit speed.
  5. Review durability settings. Use GET /<index>/_settings to check index.translog.durability. If it is request and disk latency is high, every operation pays an fsync tax.
  6. Check for active recoveries. A replica replaying a large translog from its primary during recovery is expected. Use _cat/recovery to distinguish normal catch-up from unbounded growth on primaries.
  7. Review node logs. Look for TranslogCorruptedException, FlushNotAllowedException, or disk-full errors that explain why flushes are failing.
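The arithmetic in steps 1, 2, and 4 can be sketched as two small shell helpers. The function names are hypothetical and the sample numbers are invented for illustration; in practice the inputs come from two consecutive samples of the stats endpoints listed under Quick checks.

```shell
# Average flush duration in milliseconds (step 2).
# $1 = flush.total_time_in_millis (or its delta), $2 = flush.total (or its delta)
avg_flush_millis() {
  if [ "$2" -eq 0 ]; then
    echo 0            # no flushes in the window
  else
    echo $(( $1 / $2 ))
  fi
}

# Growth per second between two samples of a counter (steps 1 and 4).
# $1 = first sample, $2 = second sample, $3 = seconds between samples
rate_per_second() {
  [ "$3" -gt 0 ] || return 1
  echo $(( ($2 - $1) / $3 ))
}

avg_flush_millis 900000 25               # prints 36000, i.e. 36 s per flush
rate_per_second 400000000 700000000 30   # prints 10000000, i.e. about 10 MB/s
```

An average of 36 seconds would cross the 30-second warning level above, and uncommitted bytes growing at roughly 10 MB/s would pass a 512MB threshold in under a minute if no flush completed.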

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| translog.uncommitted_size_in_bytes | Directly measures how much data must be replayed on recovery | >5GB per shard, or sustained growth >512MB past the threshold |
| translog.uncommitted_operations | Operation count drives replay time independently of byte size | >10,000 sustained |
| indices.flush.total_time_in_millis | Indicates flush is slowing down or competing for I/O | Average flush time >30 seconds sustained |
| indexing.index_total rate | High ingest overwhelms flush without matching commit throughput | >2x baseline without proportional flush rate increase |
| Disk I/O wait | Translog fsyncs and flushes are I/O bound | >30% sustained I/O wait |
| thread_pool.write.rejected | Backpressure from the write path | Sustained >0 rejections per minute |
| Shard recovery stage and time | Large translogs stall recovery at the translog replay phase | Recovery stuck in translog replay for >30 minutes |

Fixes

Relieve disk I/O pressure

If merges are competing with flushes, reduce index.merge.scheduler.max_thread_count to 1 on spinning disks. If the storage layer is fundamentally saturated, add data nodes or move to faster disks. Do not trigger a force merge while translogs are already large; it compounds the I/O debt and can temporarily double disk usage.
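A minimal sketch of the merge-thread change, assuming an unsecured cluster on localhost:9200 as in the quick checks; <index> is a placeholder for the affected index.

```shell
# Limit concurrent merge threads for one index (intended for spinning disks).
curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.merge.scheduler.max_thread_count": 1}'
```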

Reduce indexing pressure

Temporarily lower inbound bulk batch sizes or add data nodes to spread primary shards. Raising the write thread pool queue size only delays rejection and increases memory pressure without fixing the throughput mismatch.

Tune flush threshold for recovery SLA

Lower index.translog.flush_threshold_size from the default 512MB to 256MB or 128MB. This increases flush frequency, shrinks translogs, and shortens recovery time at the cost of higher steady-state I/O. Only make this change if your disk subsystem can absorb the extra commit load.
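One way to apply this is through the update index settings API. The sketch assumes the same localhost:9200 endpoint used in the quick checks; <index> is a placeholder and 256mb is just one of the values suggested above.

```shell
# Flush more often by lowering the translog size that triggers a flush.
curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.translog.flush_threshold_size": "256mb"}'
```

Watch flush time and disk I/O wait for a while after the change to confirm the storage can absorb the extra commits.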

Adjust durability when safe

Switch index.translog.durability from request to async. This batches fsyncs to the sync_interval (default 5 seconds), reducing per-operation latency and IOPS consumption. The tradeoff is that acknowledged writes within that window are lost on an unclean shutdown. Do not use this for indices where every acknowledged document must survive a crash.
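For indices where that tradeoff has been accepted, the switch might look like the following. The endpoint and the <index> placeholder are assumptions carried over from the quick checks.

```shell
# Fsync the translog on an interval instead of on every request.
curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.translog.durability": "async"}'

# Revert to the safe default once the I/O pressure is resolved.
curl -s -X PUT 'http://localhost:9200/<index>/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.translog.durability": "request"}'
```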

Handle a stuck replica recovery

If a single replica is inflating the translog while catching up, you can wait for it to complete. If the lag is extreme and you need to free the primary’s resources, cancel the recovery and let the allocator retry. Reducing replica count to zero is destructive to redundancy and should only be done with explicit acceptance of the risk.
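Cancelling a replica recovery can be done with the cancel command of the cluster reroute API. This is a sketch only: <index>, the shard number 0, and <node-name> are placeholders to be filled in from the _cat/recovery output, and the endpoint is assumed to be the same localhost:9200 used above.

```shell
# Cancel an in-flight replica recovery so the allocator can retry it.
# By default the cancel command only applies to replica shards.
curl -s -X POST 'http://localhost:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' \
  -d '{
    "commands": [
      { "cancel": { "index": "<index>", "shard": 0, "node": "<node-name>" } }
    ]
  }'
```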

Recover from translog corruption

Warning: This procedure deletes unflushed data.

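If node logs show TranslogCorruptedException for a replica whose primary is healthy, stop the node, delete the shard’s translog.ckp and translog-*.tlog files, and restart. The copy is then rebuilt in full from the primary. If the corrupted copy is the only one, deleting files by hand will not bring it back; restore from a snapshot or run the elasticsearch-shard remove-corrupted-data tool with the node stopped. In either case, operations in the discarded translog that were acknowledged but not yet flushed to a Lucene commit are lost from that copy.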

Prevention

  • Monitor translog.uncommitted_size_in_bytes per shard to catch flush lag before recovery stalls.
  • Provision disk IOPS headroom; merges and translog fsyncs compete for the same storage.
  • Measure recovery time under production load so your RTO accounts for realistic translog replay duration.
  • Set index.refresh_interval explicitly during bulk ingestion to reduce segment churn that steals I/O bandwidth from flush.
  • Keep index.translog.durability: request unless the workload has explicitly accepted a measurable data-loss window.

How Netdata helps

Correlate translog growth with per-node disk I/O wait and indexing rates on the same timeline to distinguish ingest spikes from storage saturation. Alert on sustained disk I/O saturation that precedes flush stalls. Track JVM heap and write thread pool rejections alongside translog metrics to spot write-path bottlenecks before shard recovery is impacted.