Elasticsearch translog growing: flush problems, durability, and slow recovery
Shard recovery estimates climb from minutes to hours. A node restart stalls during translog replay. Disk usage creeps up on data nodes, and uncommitted translog size per shard is past the flush threshold. The translog is growing faster than Elasticsearch can flush it to Lucene commit points.
Every indexing operation appends to the per-shard write-ahead log before the document is searchable. A flush commits those operations to a Lucene segment and truncates the log. When flush cannot keep pace with the write path, uncommitted operations accumulate. During recovery, Elasticsearch replays every operation sequentially. A multi-gigabyte translog translates directly into a multi-hour recovery window, extending time-to-recovery and increasing cascade risk if another node fails during replay.
The bottleneck is usually indexing throughput, disk I/O capacity, or durability configuration.
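To make the link between translog size and recovery time concrete, here is a rough back-of-the-envelope sketch. The replay rate is a hypothetical figure chosen for illustration; Elasticsearch does not report it as a single number, so measure it on your own cluster before relying on the estimate.

```python
def estimate_replay_seconds(uncommitted_operations: int, replay_ops_per_second: float) -> float:
    """Rough replay time: operations to replay divided by an assumed replay rate."""
    if replay_ops_per_second <= 0:
        raise ValueError("replay_ops_per_second must be positive")
    return uncommitted_operations / replay_ops_per_second


# Hypothetical numbers: 36 million unflushed operations replayed at 5,000 ops/s.
seconds = estimate_replay_seconds(36_000_000, 5_000)
print(f"{seconds / 3600:.1f} hours")  # prints "2.0 hours"
```

The estimate is linear in the operation count, which is why keeping the uncommitted translog small is the most direct lever on recovery time.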
flowchart TD
A[High indexing or I/O stall] --> B[Flush falls behind]
B --> C[Translog grows past threshold]
C --> D[Sequential translog replay]
D --> E[Hours-long shard recovery]
C --> F[More fsync latency]
F --> B
What this means
By default, the translog flushes and truncates when it reaches index.translog.flush_threshold_size, which defaults to 512MB. Between flushes, every index, update, and delete operation is appended for durability.
The index.translog.durability setting controls when the translog is fsynced to disk. The default is request: the translog is fsynced and committed after every index, delete, update, or bulk request, on the primary and on every allocated replica, before the request is acknowledged. This is the safest setting, but it adds fsync latency to each request. The alternative is async, which fsyncs in the background every index.translog.sync_interval (default 5 seconds). This reduces I/O pressure but creates a data-loss window for acknowledged writes not yet fsynced.
In Elasticsearch 8.x, translog retention settings were removed and replaced by soft deletes for peer recovery, but the translog still retains all unflushed operations. Once the translog grows large, the only way to bring a recovering shard online is to replay those operations one by one. If a node encounters a TranslogCorruptedException, that shard copy cannot replay its translog. When a healthy copy exists on another node, the damaged copy can be rebuilt from it. When none exists, the corrupted portion has to be discarded with the elasticsearch-shard tool, and any operations that were acknowledged but not yet flushed to a Lucene commit will be lost.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O saturation | Flush time increasing, high I/O wait, translog growing across multiple nodes | iostat -xz 1 and _nodes/stats/indices/flush |
| Indexing rate exceeds flush speed | Write queue building, indexing latency rising, translog ops climbing steadily | _cat/nodes?v&h=name,indexing.index_total rate |
| durability: request on slow storage | High fsync latency, single-node translog bloat, I/O wait spiking | Index settings for translog.durability |
| Replica catching up after outage | Translog large on one replica, recovery active, primary looks normal | _cat/recovery?v&active_only=true |
| Flush suppressed or blocked | Translog ops growing unbounded, flush count stagnant, possible disk watermark blocking allocation | _cat/allocation?v and flush stats |
Quick checks
# Translog size and uncommitted operations per node
curl -s 'http://localhost:9200/_nodes/stats/indices/translog?filter_path=nodes.*.indices.translog'
# Flush performance and total time
curl -s 'http://localhost:9200/_nodes/stats/indices/flush?filter_path=nodes.*.indices.flush'
# Indexing rate to spot ingest spikes
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,indexing.index_total,indexing.index_time'
# Write and flush thread pool saturation
curl -s 'http://localhost:9200/_cat/thread_pool/write,flush?v&h=node_name,name,active,queue,rejected'
# Active recoveries that may be inflating translogs
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_host,target_host,bytes_percent'
# Disk usage and watermark pressure
curl -s 'http://localhost:9200/_cat/allocation?v'
# Index-level translog and durability settings
curl -s 'http://localhost:9200/<index>/_settings?include_defaults=true&flat_settings=true&pretty' | grep translog
How to diagnose it
- Confirm translog growth. Sample _nodes/stats/indices/translog twice over a 30-second window. If uncommitted_size_in_bytes keeps climbing, flush is not keeping up. The node-level figure is a sum across all shards on the node, so use /<index>/_stats/translog?level=shards to check whether an individual shard is past 512MB.
- Check flush health. Divide flush.total_time_in_millis by flush.total on the affected node. A sustained average above 30 seconds indicates I/O trouble or very large translogs.
- Inspect disk I/O. Run iostat -xz 1 on the affected node. Sustained await above 50ms or utilization above 90% indicates storage saturation.
- Correlate with indexing rate. Compute the delta of indexing.index_total between two samples. A sudden spike without a corresponding rise in flush throughput means the write path is outpacing commit speed.
- Review durability settings. Use GET /<index>/_settings to check index.translog.durability. If it is request and disk latency is high, every write request pays an fsync tax.
- Check for active recoveries. A replica replaying a large translog from its primary during recovery is expected. Use _cat/recovery to distinguish normal catch-up from unbounded growth on primaries.
- Review node logs. Look for TranslogCorruptedException, FlushNotAllowedException, or disk-full errors that explain why flushes are failing.
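The first two checks can be sketched in a few lines of Python. This is a minimal illustration, assuming the two samples are the parsed JSON responses of the _nodes/stats/indices/translog call from the quick checks; the function names and the sample values are hypothetical.

```python
def uncommitted_growth(first: dict, second: dict, interval_seconds: float) -> dict:
    """Bytes per second of uncommitted translog growth for each node present in both samples."""
    rates = {}
    for node_id, node in second["nodes"].items():
        if node_id not in first["nodes"]:
            continue  # node joined between the two samples
        before = first["nodes"][node_id]["indices"]["translog"]["uncommitted_size_in_bytes"]
        after = node["indices"]["translog"]["uncommitted_size_in_bytes"]
        rates[node_id] = (after - before) / interval_seconds
    return rates


def average_flush_ms(flush: dict) -> float:
    """Mean flush duration computed from the cumulative flush counters."""
    return flush["total_time_in_millis"] / flush["total"] if flush["total"] else 0.0


# Hypothetical samples taken 30 seconds apart on a node with id "n1".
first = {"nodes": {"n1": {"indices": {"translog": {"uncommitted_size_in_bytes": 600_000_000}}}}}
second = {"nodes": {"n1": {"indices": {"translog": {"uncommitted_size_in_bytes": 660_000_000}}}}}
print(uncommitted_growth(first, second, 30))  # prints {'n1': 2000000.0}
print(average_flush_ms({"total": 40, "total_time_in_millis": 1_400_000}))  # prints 35000.0
```

Both flush counters are cumulative since node start, so the average reflects the node's lifetime. To see current behaviour, apply the same division to the difference between two samples.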
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| translog.uncommitted_size_in_bytes | Directly measures how much data must be replayed on recovery | >5GB per shard, or sustained growth >512MB past the threshold |
| translog.uncommitted_operations | Operation count drives replay time independently of byte size | >10,000 sustained |
| indices.flush.total_time_in_millis | Indicates flush is slowing down or competing for I/O | Average flush time >30 seconds sustained |
| indexing.index_total rate | High ingest overwhelms flush without matching commit throughput | >2x baseline without proportional flush rate increase |
| Disk I/O wait | Translog fsyncs and flushes are I/O bound | >30% sustained I/O wait |
| thread_pool.write.rejected | Backpressure from the write path | Sustained >0 rejections per minute |
| Shard recovery stage and time | Large translogs stall recovery at the translog replay phase | Recovery stuck in translog replay for >30 minutes |
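As a rough illustration, several of the warning levels in the table can be encoded as a single check. This is a sketch only: the function name and argument names are hypothetical, the inputs are assumed to be values you have already collected and averaged, and it does not capture the "sustained" qualifier, which needs evaluation over time in your alerting system.

```python
GIB = 1024 ** 3


def warning_signs(uncommitted_bytes: int, uncommitted_ops: int, avg_flush_ms: float,
                  io_wait_percent: float, write_rejections_per_minute: float) -> list[str]:
    """Return the names of signals that exceed the warning levels in the table."""
    checks = [
        ("translog.uncommitted_size_in_bytes", uncommitted_bytes > 5 * GIB),
        ("translog.uncommitted_operations", uncommitted_ops > 10_000),
        ("average flush time", avg_flush_ms > 30_000),
        ("disk I/O wait", io_wait_percent > 30),
        ("thread_pool.write.rejected", write_rejections_per_minute > 0),
    ]
    return [name for name, breached in checks if breached]


# Hypothetical readings for one shard and its node.
print(warning_signs(6 * GIB, 8_000, 42_000, 12.5, 0))
# prints ['translog.uncommitted_size_in_bytes', 'average flush time']
```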
Fixes
Relieve disk I/O pressure
If merges are competing with flushes, reduce index.merge.scheduler.max_thread_count to 1 on spinning disks. If the storage layer is fundamentally saturated, add data nodes or move to faster disks. Do not trigger a force merge while translogs are already large; it compounds the I/O debt and can temporarily double disk usage.
Reduce indexing pressure
Temporarily lower inbound bulk batch sizes or add data nodes to spread primary shards. Raising the write thread pool queue size only delays rejection and increases memory pressure without fixing the throughput mismatch.
Tune flush threshold for recovery SLA
Lower index.translog.flush_threshold_size from the default 512MB to 256MB or 128MB. This increases flush frequency, shrinks translogs, and shortens recovery time at the cost of higher steady-state I/O. Only make this change if your disk subsystem can absorb the extra commit load.
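The cost side of this tradeoff is simple arithmetic: at a steady translog write rate, halving the threshold doubles the number of size-triggered flushes. The sketch below assumes a hypothetical per-shard write rate of 2 MiB/s and also builds the JSON body you would send with PUT /<index>/_settings. The helper names are illustrative, and the size parser only handles the kb, mb, and gb suffixes.

```python
import json

SIZE_UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a size such as '256mb' into bytes (kb, mb, and gb only)."""
    number, unit = value[:-2], value[-2:].lower()
    return int(number) * SIZE_UNITS[unit]


def flushes_per_hour(translog_bytes_per_second: float, threshold: str) -> float:
    """Size-triggered flushes per hour for one shard at a steady translog write rate."""
    return translog_bytes_per_second * 3600 / parse_size(threshold)


def flush_threshold_body(threshold: str) -> str:
    """JSON body for PUT /<index>/_settings that changes the flush threshold."""
    return json.dumps({"index": {"translog": {"flush_threshold_size": threshold}}})


# Hypothetical steady write rate of 2 MiB/s into one shard's translog.
for threshold in ("512mb", "256mb", "128mb"):
    print(threshold, flushes_per_hour(2 * 1024 ** 2, threshold))

print(flush_threshold_body("256mb"))
```

Comparing the flush counts against the average flush duration you measured earlier shows whether the disk can absorb the extra commits.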
Adjust durability when safe
Switch index.translog.durability from request to async. This batches fsyncs to the sync_interval (default 5 seconds), reducing per-operation latency and IOPS consumption. The tradeoff is that acknowledged writes within that window are lost on an unclean shutdown. Do not use this for indices where every acknowledged document must survive a crash.
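To put a number on the data-loss window, multiply the acknowledged write rate by the sync interval. The sketch below uses a hypothetical rate of 2,000 acknowledged operations per second and the default 5-second interval. It gives a rough upper bound for a single shard copy that crashes just before its next background fsync, and the function names are illustrative.

```python
import json


def writes_at_risk(acked_ops_per_second: float, sync_interval_seconds: float = 5.0) -> float:
    """Upper bound on acknowledged operations not yet fsynced on one shard copy."""
    return acked_ops_per_second * sync_interval_seconds


def async_durability_body() -> str:
    """JSON body for PUT /<index>/_settings that switches durability to async."""
    return json.dumps({"index": {"translog": {"durability": "async"}}})


# Hypothetical workload: 2,000 acknowledged operations per second.
print(writes_at_risk(2_000))  # prints 10000.0
print(async_durability_body())
```

If losing that many acknowledged operations on a crash is unacceptable, keep request durability and address the storage latency instead.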
Handle a stuck replica recovery
If a single replica is inflating the translog while catching up, you can wait for it to complete. If the lag is extreme and you need to free the primary’s resources, cancel the recovery and let the allocator retry. Reducing replica count to zero is destructive to redundancy and should only be done with explicit acceptance of the risk.
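Before cancelling anything, it helps to list which recoveries are actually in the translog replay stage and how far along they are. The sketch below assumes the rows come from _cat/recovery with format=json and the columns index, shard, stage, target_host, and translog_ops_percent; the sample rows and the function name are hypothetical.

```python
def replaying_translog(rows: list[dict]) -> list[tuple]:
    """Return (index, shard, target_host, translog_ops_percent) for recoveries in the translog stage."""
    return [
        (row["index"], row["shard"], row["target_host"], row["translog_ops_percent"])
        for row in rows
        if row["stage"] == "translog"
    ]


# Hypothetical rows; the cat API returns every value as a string.
rows = [
    {"index": "logs-1", "shard": "0", "stage": "translog",
     "target_host": "10.0.0.5", "translog_ops_percent": "41.7%"},
    {"index": "logs-1", "shard": "1", "stage": "index",
     "target_host": "10.0.0.6", "translog_ops_percent": "0.0%"},
]
print(replaying_translog(rows))
# prints [('logs-1', '0', '10.0.0.5', '41.7%')]
```

A percentage that keeps advancing between samples suggests normal catch-up. One that does not move is a candidate for cancelling and retrying.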
Recover from translog corruption
Warning: This procedure deletes unflushed data.
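If node logs show TranslogCorruptedException and a healthy copy of the shard exists on another node, let Elasticsearch rebuild the damaged copy from it. If no healthy copy exists, stop the node, run bin/elasticsearch-shard remove-corrupted-data against the affected index and shard, then restart the node and follow the tool's instructions to reallocate the shard. Deleting the shard's translog.ckp and translog-*.tlog files by hand is not a supported procedure and can leave the shard unable to start. Either way, any operations that were acknowledged but not yet flushed to a Lucene commit, and that exist only in the corrupted translog, will be lost.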
Prevention
- Monitor translog.uncommitted_size_in_bytes per shard to catch flush lag before recovery stalls.
- Provision disk IOPS headroom; merges and translog fsyncs compete for the same storage.
- Measure recovery time under production load so your RTO accounts for realistic translog replay duration.
- Set index.refresh_interval explicitly during bulk ingestion to reduce segment churn that steals I/O bandwidth from flush.
- Keep translog.durability: request unless the workload has explicitly accepted a measurable data-loss window.
How Netdata helps
Correlate translog growth with per-node disk I/O wait and indexing rates on the same timeline to distinguish ingest spikes from storage saturation. Alert on sustained disk I/O saturation that precedes flush stalls. Track JVM heap and write thread pool rejections alongside translog metrics to spot write-path bottlenecks before shard recovery is impacted.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch authentication failures: audit logs, brute force, and credential drift
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch exposed without authentication: open clusters and snapshot exfiltration







