Elasticsearch snapshot failed or partial: backups that silently stop working
You check backups and the last successful snapshot is three days old. Or you see PARTIAL snapshots where only some indices were captured. Cluster health is green, indexing and search are fine, but your recovery point is slipping. Elasticsearch does not fail the cluster when snapshots break, so the problem surfaces only when someone asks, “When did we last test a restore?”
A FAILED snapshot means the cluster could not write anything useful to the repository. A PARTIAL snapshot means the global cluster state was stored, but at least one primary shard was skipped, usually because it was relocating or unassigned at the time. Snapshots are incremental at the segment level, so even a PARTIAL snapshot may contain valid data for some indices, but it is not a complete backup. A snapshot that reports SUCCESS is not a guarantee that you can restore from it. Until you test a restore, you do not know if your backup is real.
What this means
Elasticsearch snapshots move segment files from primary shards to a registered repository. The master coordinates the operation, but data transfer happens at the shard level. When you initiate a snapshot, the cluster marks the shards involved and copies new segments since the last snapshot. If a primary shard is initializing, relocating, or unassigned, the snapshot skips it. If every shard is available, the snapshot finishes as SUCCESS. If some shards are skipped, the state is PARTIAL. If the repository is unreachable or the master cannot coordinate the operation, the state is FAILED.
PARTIAL and FAILED states do not affect cluster health. A green cluster with a broken backup pipeline looks healthy in every dashboard that only tracks /_cluster/health. SLM policies can continue to trigger, fail quietly, and leave you with an ever-growing recovery point objective.
flowchart TD
A[Stale or failed snapshot] --> B{State?}
B -->|FAILED| C[Repository or SLM issue]
B -->|PARTIAL| D[Shard allocation problem]
C --> E[Verify repo connectivity and capacity]
C --> F[Review SLM execution history]
D --> G[Check relocating and unassigned shards]
D --> H[Check allocation explain]
E --> I[Repair and rerun snapshot]
F --> I
G --> I
H --> I
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Repository unreachable or permission denied | Snapshots fail quickly with no progress; cloud repos show auth or endpoint errors | GET /_snapshot/<repo> settings and network path |
| Repository full or quota exhausted | Snapshots start but fail after writing partial data; duration grows before failure | Repository storage capacity and growth rate |
| SLM misconfigured or disabled | No recent snapshots despite a policy; schedule does not match expected interval | Snapshot recency via _cat/snapshots and SLM logs |
| Shards relocating or unassigned | Repeated PARTIAL snapshots for the same indices; skipped shards in snapshot details | GET /_cat/shards?v for relocating or unassigned primaries |
| Concurrent snapshot limit reached | New snapshot operations hang or queue while others run | GET /_snapshot/_status for active operations |
| I/O contention from backup traffic | Snapshot duration increases sharply; indexing or search latency rises during backup windows | Disk I/O metrics and max_snapshot_bytes_per_sec throttling |
Quick checks
Run these read-only commands to assess state without affecting the cluster.
# List recent snapshots and their states
curl -s 'http://localhost:9200/_cat/snapshots/<repo>?v&s=end_epoch:desc' | head -10
# Inspect the latest snapshot for per-shard failures
curl -s 'http://localhost:9200/_snapshot/<repo>/<snapshot>'
# Check for currently running snapshots
curl -s 'http://localhost:9200/_snapshot/_status'
# Check for relocating or unassigned shards that could cause PARTIAL snapshots
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep -E 'RELOCATING|UNASSIGNED'
# Check cluster health for allocation issues
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,relocating_shards'
# Check disk watermark proximity that can block allocation and stress nodes
curl -s 'http://localhost:9200/_cat/allocation?v&s=disk.percent:desc'
# Check for snapshot-specific thread pool pressure
curl -s 'http://localhost:9200/_cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected'
How to diagnose it
1. Find the boundary. List snapshots with _cat/snapshots and identify the first failure in the sequence. If the last good snapshot was days ago, the problem is persistent. If only the latest failed, look for a recent change: rolling restart, ILM rollover, or repository update.
2. Classify the failure mode. FAILED means the cluster could not write to the repository. PARTIAL means at least one shard was skipped. Look at the failures array in the snapshot details. Each entry names the index, shard, and reason. Reasons like “shard is unassigned” or “primary shard is not active” point to allocation issues.
3. Correlate with shard movement. Check _cat/shards and _cluster/allocation/explain for the affected indices at the snapshot start time. If primaries were relocating because of a node restart or disk watermark rebalancing, that explains the PARTIAL state.
4. Inspect the repository. List the repository with GET /_snapshot/<repo> to confirm settings. Check repository storage for capacity exhaustion. For object storage, verify credentials, endpoints, and network paths. Trigger a small manual snapshot to prove write access.
5. Check SLM execution history. A misconfigured schedule, missing repository, or policy error can prevent snapshots from starting. Correlate SLM logs with the snapshot list. Check that the policy schedule is valid and the repository name matches.
6. Look for I/O saturation. If snapshots take longer than the interval between them, they will overlap. Check disk I/O wait and network throughput to the repository. Snapshot traffic is throttled by max_snapshot_bytes_per_sec (default 40 MB/s), but if the disk or network is already saturated, the throttle may not be enough.
7. Check for allocation blocks. Nodes above the high disk watermark (90%) trigger relocations. Nodes above flood stage (95%) set index.blocks.read_only_allow_delete. Heavy relocation can destabilize primaries during snapshot windows and create PARTIAL states.
8. Test a manual snapshot. If automated snapshots fail but the repository looks healthy, trigger a manual snapshot with PUT /_snapshot/<repo>/<snapshot>. If the manual snapshot succeeds, the problem is in the scheduler or policy. If it fails, the repository or cluster state is the culprit.
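The "classify the failure mode" step can be sketched in Python as a small helper that groups the failures array by index. This is a minimal sketch, not an official client: it assumes you have already fetched GET /_snapshot/<repo>/<snapshot> and passed in one entry of the response's snapshots list, and that each failure carries index, shard_id, and reason fields. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_shard_failures(snapshot_info):
    """Group per-shard failures by index: {index: [(shard_id, reason), ...]}.

    `snapshot_info` is assumed to be one element of the "snapshots" list
    returned by GET /_snapshot/<repo>/<snapshot>.
    """
    by_index = defaultdict(list)
    for failure in snapshot_info.get("failures", []):
        by_index[failure["index"]].append((failure["shard_id"], failure["reason"]))
    # Sort shards inside each index so repeated offenders are easy to spot.
    return {index: sorted(items) for index, items in by_index.items()}


# Hypothetical sample data, shaped like a PARTIAL snapshot entry.
sample_snapshot = {
    "snapshot": "nightly-example",
    "state": "PARTIAL",
    "failures": [
        {"index": "logs-a", "shard_id": 2, "reason": "primary shard is not allocated"},
        {"index": "logs-a", "shard_id": 0, "reason": "primary shard is not allocated"},
        {"index": "metrics-b", "shard_id": 1, "reason": "primary shard is not allocated"},
    ],
}

for index, items in summarize_shard_failures(sample_snapshot).items():
    print(index, items)
```

If the same index shows up across several PARTIAL snapshots, that is the place to start the allocation investigation.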
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Snapshot state (FAILED / PARTIAL / SUCCESS) | Direct indicator of backup integrity | Any FAILED snapshot; PARTIAL containing critical indices |
| Last successful snapshot age | Measures how far behind your recovery point is | No success in >2x the configured schedule interval |
| Snapshot duration | Reveals repository slowdown or I/O competition | Duration approaching or exceeding the backup interval |
| Manual snapshot test | Catches permission, network, or corruption issues early | Any test failure or timeout |
| Unassigned / relocating shard count | Explains PARTIAL snapshots and recovery risk | Non-zero during scheduled snapshot windows |
| Disk watermark proximity | Predicts allocation storms that disrupt primaries | Any node above the 85% low watermark |
| Snapshot thread pool queue / rejected | Shows backup operations competing for resources | Queue growing or rejections sustained >1 minute |
| Pending cluster tasks | Snapshot metadata operations add master queue pressure | Backlog >20 tasks or tasks aging >30 seconds |
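The warning signs in the table can be expressed as a small rule set. The sketch below is illustrative only: BackupSignals is a hypothetical container for values you collect yourself, the key names in the result are made up, and the 0.8 ratio used for "duration approaching the interval" is an assumption rather than a value from this guide.

```python
from dataclasses import dataclass


@dataclass
class BackupSignals:
    """Hypothetical container; fill it from your own collectors."""
    failed_snapshots: int            # FAILED snapshots in the window you review
    last_success_age_s: float        # seconds since the last SUCCESS snapshot
    schedule_interval_s: float       # configured interval between snapshots
    last_duration_s: float           # duration of the most recent snapshot
    unassigned_or_relocating: int    # shard count during the snapshot window
    max_disk_percent: float          # highest disk usage across nodes
    pending_tasks: int               # pending cluster tasks on the master


def warning_signs(s, duration_ratio=0.8):
    """Return the names of the table's warning signs that currently apply."""
    warnings = []
    if s.failed_snapshots > 0:
        warnings.append("failed_snapshot")
    if s.last_success_age_s > 2 * s.schedule_interval_s:
        warnings.append("stale_success")
    # duration_ratio is an assumed definition of "approaching the interval".
    if s.last_duration_s >= duration_ratio * s.schedule_interval_s:
        warnings.append("slow_snapshot")
    if s.unassigned_or_relocating > 0:
        warnings.append("unstable_shards")
    if s.max_disk_percent > 85:
        warnings.append("disk_watermark")
    if s.pending_tasks > 20:
        warnings.append("pending_tasks")
    return warnings


# Hypothetical values for a daily schedule (86400 seconds).
healthy = BackupSignals(0, 3600, 86400, 600, 0, 70.0, 0)
degraded = BackupSignals(1, 3 * 86400, 86400, 80000, 2, 91.0, 5)
print(warning_signs(healthy))
print(warning_signs(degraded))
```

Thread pool rejections and the manual snapshot test are left out because they need a time series or an active probe rather than a single reading.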
Fixes
Repository unreachable or permission denied
List the repository with GET /_snapshot/<repo> and confirm settings. For S3, GCS, or Azure repositories, check credentials, endpoints, and bucket names. Re-register the repository after correcting settings. If the repository is on a shared filesystem, verify that the mount is available, at the same path and with correct permissions, on every master-eligible and data node.
Repository full or quota exhausted
Expand the repository storage or delete obsolete snapshots. Deleting snapshots is irreversible and generates cleanup load; avoid mass deletions and do not remove your last known good snapshot. After freeing space, run a manual snapshot to confirm the repository accepts writes again. Monitor repository growth against your retention policy.
SLM misconfiguration or silent execution failures
Review SLM policies and compare the schedule to the actual snapshot list. Confirm the repository name and index patterns match your intent. Check the last success and failure timestamps in cluster logs. If a policy is failing, inspect execution history for error messages. A common mistake is renaming a repository without updating the policy.
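The renamed-repository mistake can be checked mechanically. The following is a sketch under assumptions: it expects the parsed JSON of GET /_slm/policy (a mapping of policy id to an entry with a nested policy.repository field) and of GET /_snapshot (a mapping keyed by registered repository name). The function name and the sample values are hypothetical.

```python
def policies_with_missing_repository(slm_policies, repositories):
    """Return {policy_id: repository} for policies whose repository is not registered.

    slm_policies: parsed response of GET /_slm/policy (assumed shape).
    repositories: parsed response of GET /_snapshot (assumed shape).
    """
    registered = set(repositories)
    missing = {}
    for policy_id, entry in slm_policies.items():
        repo = entry.get("policy", {}).get("repository")
        if repo not in registered:
            missing[policy_id] = repo
    return missing


# Hypothetical sample data: "nightly" still points at a repository that was renamed.
sample_policies = {
    "nightly": {"policy": {"schedule": "0 30 1 * * ?", "repository": "backups-old"}},
    "hourly": {"policy": {"schedule": "0 0 * * * ?", "repository": "backups"}},
}
sample_repositories = {"backups": {"type": "fs", "settings": {"location": "/mnt/backups"}}}

print(policies_with_missing_repository(sample_policies, sample_repositories))
```

An empty result only rules out this one mistake; schedule and index pattern errors still need a manual review.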
PARTIAL snapshots from relocating or unassigned shards
Identify which shards were skipped and why. If they were relocating due to a rolling restart, wait for recovery to finish and rerun the snapshot. If shards are unassigned because of ALLOCATION_FAILED, force a retry with POST /_cluster/reroute?retry_failed=true. This triggers shard allocations; only run it after resolving the root cause. For persistent allocation blocks, use GET /_cluster/allocation/explain and resolve the root cause: disk watermark, awareness attributes, or corrupt shards.
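To identify the primaries that would be skipped, you can filter the shard listing before rerunning the snapshot. This sketch assumes the rows come from GET /_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason, where every value is a string or null. The function name and the sample rows are hypothetical.

```python
def unstable_primaries(cat_shards_rows):
    """Return (index, shard, state, unassigned_reason) for primaries that are not STARTED."""
    return [
        (row["index"], int(row["shard"]), row["state"], row.get("unassigned.reason"))
        for row in cat_shards_rows
        if row.get("prirep") == "p" and row.get("state") != "STARTED"
    ]


# Hypothetical sample rows, shaped like the _cat/shards JSON output.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
    {"index": "logs-a", "shard": "1", "prirep": "p", "state": "RELOCATING", "unassigned.reason": None},
    {"index": "logs-a", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "metrics-b", "shard": "0", "prirep": "p", "state": "UNASSIGNED", "unassigned.reason": "ALLOCATION_FAILED"},
]

for item in unstable_primaries(sample_rows):
    print(item)
```

Replicas are ignored on purpose, because snapshots read from primary shards. Going by the explanation above, a snapshot started while this list is non-empty is likely to end as PARTIAL.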
I/O contention and concurrent operation limits
If snapshots overlap with peak traffic or maintenance windows, reschedule them to quieter periods. Reduce snapshot throughput by lowering max_snapshot_bytes_per_sec on the repository to protect production I/O. If the concurrent snapshot limit is reached, wait for in-progress operations to finish before triggering new ones. Avoid raising concurrency limits to mask a slow repository.
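Before lowering the throttle, it helps to estimate whether the snapshot can still finish inside its interval. The arithmetic below is a rough lower bound under stated assumptions: the limit applies per node, the 40 MB/s default is read as 40 MiB/s, and the amount of new segment data on the busiest node is a number you supply yourself. Repository latency and disk contention only make the real duration longer.

```python
MIB = 1024 * 1024
DEFAULT_LIMIT = 40 * MIB  # assumed reading of the 40 MB/s default


def min_snapshot_seconds(new_bytes_on_busiest_node, max_bytes_per_sec=DEFAULT_LIMIT):
    """Lower bound on snapshot duration imposed by the throttle alone."""
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    return new_bytes_on_busiest_node / max_bytes_per_sec


def will_overlap(new_bytes_on_busiest_node, interval_seconds, max_bytes_per_sec=DEFAULT_LIMIT):
    """True when the throttle alone makes the snapshot outlast its schedule interval."""
    return min_snapshot_seconds(new_bytes_on_busiest_node, max_bytes_per_sec) >= interval_seconds


one_tib = 1024 ** 4
print(min_snapshot_seconds(one_tib) / 3600)          # hours needed at the default limit
print(will_overlap(one_tib, interval_seconds=6 * 3600))
```

For example, 1 TiB of new segment data on one node at 40 MiB/s needs 26,214.4 seconds, roughly 7.3 hours, so a six-hour schedule would overlap even with an otherwise idle repository.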
Prevention
Alert on backup age, not just backup failure. Set a page if the last successful snapshot exceeds twice your scheduled interval. Treat PARTIAL snapshots as seriously as FAILED ones for critical indices. Test restores to a staging cluster regularly. Snapshot success does not prove restore success. Keep SLM policies simple, avoid scheduling snapshots during rolling restart windows, and fix unassigned shards promptly so primaries are stable when backup runs.
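The age-based alert can be sketched as follows. It assumes the input is the snapshots list from GET /_snapshot/<repo>/_all, that each entry has state and end_time_in_millis fields, and that only SUCCESS counts as a usable recovery point. The function names and the sample data are hypothetical.

```python
def last_success_age_seconds(snapshots, now_millis):
    """Seconds since the newest SUCCESS snapshot finished, or None if there is none."""
    ends = [
        s["end_time_in_millis"]
        for s in snapshots
        if s.get("state") == "SUCCESS" and "end_time_in_millis" in s
    ]
    if not ends:
        return None
    return (now_millis - max(ends)) / 1000.0


def should_page(snapshots, interval_seconds, now_millis):
    """Page when there is no successful snapshot, or the newest one is older than 2x the interval."""
    age = last_success_age_seconds(snapshots, now_millis)
    return age is None or age > 2 * interval_seconds


# Hypothetical sample: daily schedule, last SUCCESS three days before "now".
DAY_MS = 24 * 3600 * 1000
now = 10 * DAY_MS
sample_snapshots = [
    {"snapshot": "d7", "state": "SUCCESS", "end_time_in_millis": 7 * DAY_MS},
    {"snapshot": "d8", "state": "PARTIAL", "end_time_in_millis": 8 * DAY_MS},
    {"snapshot": "d9", "state": "FAILED", "end_time_in_millis": 9 * DAY_MS},
]

print(last_success_age_seconds(sample_snapshots, now))
print(should_page(sample_snapshots, interval_seconds=24 * 3600, now_millis=now))
```

Because the check is driven by age, it also fires when the scheduler stops producing snapshots at all, which a failure-only alert would miss.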
How Netdata helps
- Correlate snapshot duration spikes with per-node disk I/O wait and network throughput to distinguish repository latency from local node saturation.
- Alert on snapshot health alongside cluster health so you catch silent backup gaps while the cluster is still green.
- Track thread pool queue depths on the snapshot pool to detect when backup operations compete with production traffic.
- Monitor per-node disk watermark proximity to predict allocation storms that can destabilize primaries during snapshot windows.
- Surface unassigned shard counts and relocation rates next to snapshot failure events to explain PARTIAL states without manual correlation.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery







