Elasticsearch snapshot failed or partial: backups that silently stop working

You check backups and the last successful snapshot is three days old. Or you see PARTIAL snapshots where only some indices were captured. Cluster health is green, indexing and search are fine, but your recovery point is slipping. Elasticsearch does not fail the cluster when snapshots break, so the problem surfaces only when someone asks, “When did we last test a restore?”

A FAILED snapshot means the cluster could not write anything useful to the repository. A PARTIAL snapshot means the global cluster state was stored, but at least one primary shard was skipped, usually because it was relocating or unassigned at the time. Snapshots are incremental at the segment level, so even a PARTIAL snapshot may contain valid data for some indices, but it is not a complete backup. A snapshot that reports SUCCESS is not a guarantee that you can restore from it. Until you test a restore, you do not know if your backup is real.

What this means

Elasticsearch snapshots copy segment files from primary shards to a registered repository. The master coordinates the operation, but data transfer happens at the shard level. When you initiate a snapshot, the cluster marks the shards involved and copies only the segments that are not already in the repository. If a primary shard is initializing, relocating, or unassigned, the snapshot skips it. If every shard is available, the snapshot finishes as SUCCESS. If some shards are skipped, the state is PARTIAL. If the repository is unreachable or the master cannot coordinate the operation, the state is FAILED.

PARTIAL and FAILED states do not affect cluster health. A green cluster with a broken backup pipeline looks healthy in every dashboard that only tracks /_cluster/health. SLM policies can continue to trigger, fail quietly, and leave you with an actual recovery point that drifts further past your recovery point objective every day.

flowchart TD
    A[Stale or failed snapshot] --> B{State?}
    B -->|FAILED| C[Repository or SLM issue]
    B -->|PARTIAL| D[Shard allocation problem]
    C --> E[Verify repo connectivity and capacity]
    C --> F[Review SLM execution history]
    D --> G[Check relocating and unassigned shards]
    D --> H[Check allocation explain]
    E --> I[Repair and rerun snapshot]
    F --> I
    G --> I
    H --> I
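
The triage in the flowchart can be sketched as a small shell function. This is only an illustration: triage_snapshot_state is a hypothetical helper name, and the SUCCESS and IN_PROGRESS branches are additions that the flowchart does not cover.

```shell
# Map a snapshot state to the next checks, following the flowchart.
# triage_snapshot_state is a hypothetical helper, not an Elasticsearch tool.
triage_snapshot_state() {
  case "$1" in
    FAILED)
      echo "Verify repository connectivity and capacity"
      echo "Review SLM execution history"
      ;;
    PARTIAL)
      echo "Check relocating and unassigned shards"
      echo "Check allocation explain"
      ;;
    SUCCESS)
      # Assumption beyond the flowchart: SUCCESS still needs a restore test.
      echo "No repair needed; confirm with a restore test"
      ;;
    IN_PROGRESS)
      echo "Snapshot still running; check again later"
      ;;
    *)
      echo "Unrecognized state: $1" >&2
      return 1
      ;;
  esac
}
```

You could feed it the status column of the newest row from _cat/snapshots to print the next two checks for that state.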

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Repository unreachable or permission denied | Snapshots fail quickly with no progress; cloud repos show auth or endpoint errors | GET /_snapshot/<repo> settings and network path |
| Repository full or quota exhausted | Snapshots start but fail after writing partial data; duration grows before failure | Repository storage capacity and growth rate |
| SLM misconfigured or disabled | No recent snapshots despite a policy; schedule does not match expected interval | Snapshot recency via _cat/snapshots and SLM logs |
| Shards relocating or unassigned | Repeated PARTIAL snapshots for the same indices; skipped shards in snapshot details | GET /_cat/shards?v for relocating or unassigned primaries |
| Concurrent snapshot limit reached | New snapshot operations hang or queue while others run | GET /_snapshot/_status for active operations |
| I/O contention from backup traffic | Snapshot duration increases sharply; indexing or search latency rises during backup windows | Disk I/O metrics and max_snapshot_bytes_per_sec throttling |

Quick checks

Run these read-only commands to assess state without affecting the cluster.

# List recent snapshots and their states
curl -s 'http://localhost:9200/_cat/snapshots/<repo>?v&s=end_epoch:desc' | head -10

# Inspect the latest snapshot for per-shard failures
curl -s 'http://localhost:9200/_snapshot/<repo>/<snapshot>'

# Check for currently running snapshots
curl -s 'http://localhost:9200/_snapshot/_status'

# Check for relocating or unassigned shards that could cause PARTIAL snapshots
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state' | grep -E 'RELOCATING|UNASSIGNED'

# Check cluster health for allocation issues
curl -s 'http://localhost:9200/_cluster/health?filter_path=status,unassigned_shards,relocating_shards'

# Check disk watermark proximity that can block allocation and stress nodes
curl -s 'http://localhost:9200/_cat/allocation?v&s=disk.percent:desc'

# Check for snapshot-specific thread pool pressure
curl -s 'http://localhost:9200/_cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected'

How to diagnose it

  1. Find the boundary. List snapshots with _cat/snapshots and identify the first failure in the sequence. If the last good snapshot was days ago, the problem is persistent. If only the latest failed, look for a recent change: rolling restart, ILM rollover, or repository update.

  2. Classify the failure mode. FAILED means the cluster could not write to the repository. PARTIAL means at least one shard was skipped. Look at the failures array in the snapshot details. Each entry names the index, shard, and reason. Reasons like “shard is unassigned” or “primary shard is not active” point to allocation issues.

  3. Correlate with shard movement. Check _cat/shards and _cluster/allocation/explain for the affected indices at the snapshot start time. If primaries were relocating because of a node restart or disk watermark rebalancing, that explains the PARTIAL state.

  4. Inspect the repository. List the repository with GET /_snapshot/<repo> to confirm settings. Check repository storage for capacity exhaustion. For object storage, verify credentials, endpoints, and network paths. Trigger a small manual snapshot to prove write access.

  5. Check SLM execution history. A misconfigured schedule, missing repository, or policy error can prevent snapshots from starting. Correlate SLM logs with the snapshot list. Check that the policy schedule is valid and the repository name matches.

  6. Look for I/O saturation. If snapshots take longer than the interval between them, they will overlap. Check disk I/O wait and network throughput to the repository. Snapshot traffic is throttled by max_snapshot_bytes_per_sec (default 40 MB/s), but if the disk or network is already saturated, the throttle may not be enough.

  7. Check for allocation blocks. Nodes above the high disk watermark (90%) trigger relocations. Nodes above flood stage (95%) set index.blocks.read_only_allow_delete. Heavy relocation can destabilize primaries during snapshot windows and create PARTIAL states.

  8. Test a manual snapshot. If automated snapshots fail but the repository looks healthy, trigger a manual snapshot with PUT /_snapshot/<repo>/<snapshot>. If the manual snapshot succeeds, the problem is in the scheduler or policy. If it fails, the repository or cluster state is the culprit.
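
Step 1, finding the boundary, can be scripted against the _cat/snapshots output. The sketch below assumes the input is requested with h=id,status,end_epoch and sorted oldest first, so each line has exactly three columns and no header. find_snapshot_boundary is a hypothetical helper name. It treats every state other than SUCCESS, including IN_PROGRESS, as a failure candidate.

```shell
# find_snapshot_boundary reads "id status end_epoch" lines, oldest first,
# and prints the last SUCCESS snapshot plus the first non-SUCCESS one after it.
find_snapshot_boundary() {
  awk '
    # A SUCCESS row resets the search: any earlier failure was followed by a good run.
    $2 == "SUCCESS" { last_ok = $1; first_bad = ""; first_bad_state = ""; next }
    # Remember only the first non-SUCCESS row after the most recent SUCCESS.
    first_bad == "" { first_bad = $1; first_bad_state = $2 }
    END {
      print "last_success=" (last_ok == "" ? "none" : last_ok)
      print "first_failure=" (first_bad == "" ? "none" : first_bad " (" first_bad_state ")")
    }
  '
}

# Usage against a live cluster (not executed here):
#   curl -s 'http://localhost:9200/_cat/snapshots/<repo>?h=id,status,end_epoch&s=end_epoch:asc' \
#     | find_snapshot_boundary
```

If first_failure is the newest snapshot, look for a recent change. If several failures follow it, treat the problem as persistent.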

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Snapshot state (FAILED / PARTIAL / SUCCESS) | Direct indicator of backup integrity | Any FAILED snapshot; PARTIAL containing critical indices |
| Last successful snapshot age | Measures how far behind your recovery point is | No success in >2x the configured schedule interval |
| Snapshot duration | Reveals repository slowdown or I/O competition | Duration approaching or exceeding the backup interval |
| Manual snapshot test | Catches permission, network, or corruption issues early | Any test failure or timeout |
| Unassigned / relocating shard count | Explains PARTIAL snapshots and recovery risk | Non-zero during scheduled snapshot windows |
| Disk watermark proximity | Predicts allocation storms that disrupt primaries | Any node above the 85% low watermark |
| Snapshot thread pool queue / rejected | Shows backup operations competing for resources | Queue growing or rejections sustained >1 minute |
| Pending cluster tasks | Snapshot metadata operations add master queue pressure | Backlog >20 tasks or tasks aging >30 seconds |
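
Two of the warning signs in the table reduce to simple arithmetic on epoch seconds. The following is a minimal sketch with hypothetical function names. It assumes you already extracted the end_epoch of the last SUCCESS snapshot and the snapshot duration in seconds, for example from _cat/snapshots.

```shell
# snapshot_is_stale LAST_SUCCESS_EPOCH INTERVAL_SECONDS [NOW_EPOCH]
# Exits 0 when the last success is older than 2x the schedule interval.
snapshot_is_stale() {
  local last_success=$1 interval=$2 now=${3:-$(date +%s)}
  [ $(( now - last_success )) -gt $(( 2 * interval )) ]
}

# snapshot_overruns DURATION_SECONDS INTERVAL_SECONDS
# Exits 0 when a snapshot takes at least as long as the gap between runs.
snapshot_overruns() {
  [ "$1" -ge "$2" ]
}

# Usage with an hourly schedule (values are illustrative):
#   snapshot_is_stale 1700000000 3600 && echo "page: backups are stale"
```

Passing the current time as an optional third argument keeps the check easy to test. In production you would normally omit it.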

Fixes

Repository unreachable or permission denied

List the repository with GET /_snapshot/<repo> and confirm settings. For S3, GCS, or Azure repositories, check credentials, endpoints, and bucket names. Re-register the repository after correcting settings. If the repository is on a shared filesystem, verify mount availability and permissions on every master and data node.
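
After correcting settings, the repository verify API asks the nodes to confirm they can reach the repository. The sketch below wraps that check. ES_URL, DRY_RUN, es, and check_repo are hypothetical names introduced for illustration. With DRY_RUN=1 the requests are printed instead of sent, which lets you review them first.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# es METHOD PATH: send a request, or only print it when DRY_RUN=1.
es() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$1 ${ES_URL}$2"
  else
    curl -s -X "$1" "${ES_URL}$2"
  fi
}

# check_repo REPO: show the registered settings, then ask the nodes
# to verify that they can access the repository.
check_repo() {
  es GET  "/_snapshot/$1"
  es POST "/_snapshot/$1/_verify"
}

# Usage (not executed here):
#   DRY_RUN=1 check_repo my_repo   # print the two requests
#   check_repo my_repo             # run them against ES_URL
```

A verify call that passes still does not replace a small manual snapshot, which exercises the full write path.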

Repository full or quota exhausted

Expand the repository storage or delete obsolete snapshots. Deleting snapshots is irreversible and generates cleanup load; avoid mass deletions and do not remove your last known good snapshot. After freeing space, run a manual snapshot to confirm the repository accepts writes again. Monitor repository growth against your retention policy.
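
One way to avoid removing your last known good snapshot is to compute deletion candidates first and review them by hand. The sketch below only prints snapshot names and never deletes anything. deletion_candidates is a hypothetical helper, and the input is assumed to be _cat/snapshots output requested with h=id,status,end_epoch.

```shell
# deletion_candidates CUTOFF_EPOCH
# Reads "id status end_epoch" lines and prints ids that ended before the
# cutoff, but never the most recent SUCCESS snapshot.
deletion_candidates() {
  awk -v cutoff="$1" '
    NF < 3 { next }                       # ignore blank or malformed rows
    { n++; id[n] = $1; ended[n] = $3 + 0 }
    $2 == "SUCCESS" && ended[n] >= newest_ok_end {
      newest_ok_end = ended[n]; newest_ok = $1
    }
    END {
      for (i = 1; i <= n; i++)
        if (ended[i] < cutoff + 0 && id[i] != newest_ok) print id[i]
    }
  '
}

# Usage (not executed here): list snapshots older than 30 days.
#   cutoff=$(( $(date +%s) - 30 * 86400 ))
#   curl -s 'http://localhost:9200/_cat/snapshots/<repo>?h=id,status,end_epoch' \
#     | deletion_candidates "$cutoff"
```

Each reviewed candidate can then be removed with DELETE /_snapshot/<repo>/<snapshot>, a few at a time to limit cleanup load.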

SLM misconfiguration or silent execution failures

Review SLM policies and compare the schedule to the actual snapshot list. Confirm the repository name and index patterns match your intent. Check the last success and failure timestamps that GET /_slm/policy/<policy> reports for each policy, and confirm with GET /_slm/status that SLM is running. If a policy is failing, inspect its last failure details and the cluster logs for error messages. A common mistake is renaming a repository without updating the policy.
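
The SLM checks can be collected into a few helpers. This is a sketch under assumptions: ES_URL, DRY_RUN, and the function names are hypothetical, and the filter_path expression assumes the policy response is keyed by policy name with last_success and last_failure fields directly underneath. Check that against the response from your Elasticsearch version.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# es METHOD PATH: send a request, or only print it when DRY_RUN=1.
es() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$1 ${ES_URL}$2"
  else
    curl -s -X "$1" "${ES_URL}$2"
  fi
}

# Is SLM running at all?
slm_status()    { es GET "/_slm/status"; }

# Last success and last failure recorded for one policy.
slm_last_runs() { es GET "/_slm/policy/$1?filter_path=*.last_success,*.last_failure"; }

# Run the policy immediately instead of waiting for the schedule.
slm_run_now()   { es POST "/_slm/policy/$1/_execute"; }
```

If slm_run_now produces a snapshot but scheduled runs do not, the schedule expression or the SLM operation mode is the more likely culprit than the repository.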

PARTIAL snapshots from relocating or unassigned shards

Identify which shards were skipped and why. If they were relocating due to a rolling restart, wait for recovery to finish and rerun the snapshot. If shards are unassigned because of ALLOCATION_FAILED, force a retry with POST /_cluster/reroute?retry_failed=true. This triggers shard allocations; only run it after resolving the root cause. For persistent allocation blocks, use GET /_cluster/allocation/explain and resolve the root cause: disk watermark, awareness attributes, or corrupt shards.
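
Because snapshots read from primaries, it helps to filter the shard list down to primaries that are not STARTED. The sketch below assumes _cat/shards was requested with h=index,shard,prirep,state,unassigned.reason and no header. unstable_primaries is a hypothetical helper name.

```shell
# unstable_primaries: reads "index shard prirep state [reason]" lines and
# prints only primaries whose state is not STARTED.
unstable_primaries() {
  awk '$3 == "p" && $4 != "STARTED" {
    print $1 "[" $2 "] " $4 ($5 != "" ? " " $5 : "")
  }'
}

# Usage (not executed here):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
#     | unstable_primaries
#
# Only after the root cause is fixed:
#   curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
```

An empty result before the snapshot window is a reasonable precondition for rerunning the snapshot.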

I/O contention and concurrent operation limits

If snapshots overlap with peak traffic or maintenance windows, reschedule them to quieter periods. Reduce snapshot throughput by lowering max_snapshot_bytes_per_sec on the repository to protect production I/O. If the concurrent snapshot limit is reached, wait for in-progress operations to finish before triggering new ones. Avoid raising concurrency limits to mask a slow repository.
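
Before lowering the throttle, it is worth estimating whether the snapshot can still finish inside the backup interval. The sketch below computes a rough lower bound. It assumes the throttle applies per node, that new data is spread evenly across data nodes, that nothing else is the bottleneck, and that "mb" means 1024 x 1024 bytes. min_snapshot_seconds is a hypothetical helper name.

```shell
# min_snapshot_seconds NEW_BYTES RATE_MB_PER_SEC NODES
# Lower bound on snapshot duration when the per-node throttle is the only
# limit and new data is spread evenly across NODES data nodes (assumptions).
min_snapshot_seconds() {
  local bytes=$1 rate_mb=$2 nodes=$3
  local per_sec=$(( rate_mb * 1024 * 1024 * nodes ))
  echo $(( (bytes + per_sec - 1) / per_sec ))   # integer division, rounded up
}

# Usage: seconds needed for 500 GiB of new segments at 40 MB/s on 3 nodes,
# compared with the same data at a 20 MB/s throttle.
#   min_snapshot_seconds $(( 500 * 1024 * 1024 * 1024 )) 40 3
#   min_snapshot_seconds $(( 500 * 1024 * 1024 * 1024 )) 20 3
```

If the estimate at the lower rate approaches your schedule interval, rescheduling is a safer fix than throttling.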

Prevention

Alert on backup age, not just backup failure. Set a page if the last successful snapshot exceeds twice your scheduled interval. Treat PARTIAL snapshots as seriously as FAILED ones for critical indices. Test restores to a staging cluster regularly. Snapshot success does not prove restore success. Keep SLM policies simple, avoid scheduling snapshots during rolling restart windows, and fix unassigned shards promptly so primaries are stable when backup runs.
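
A restore test can be scripted with the snapshot restore API. The sketch below only builds the request body. restore_test_body is a hypothetical helper, and the restored_ prefix and the choice to skip global state are illustrative defaults, not requirements. Renaming on restore avoids collisions with open indices of the same name on the target cluster.

```shell
# restore_test_body INDICES
# Prints a restore request body that restores INDICES under a "restored_"
# prefix and leaves the target cluster's global state untouched.
restore_test_body() {
  printf '{"indices":"%s","include_global_state":false,"rename_pattern":"(.+)","rename_replacement":"restored_$1"}\n' "$1"
}

# Usage against a staging cluster (hostname is a placeholder, not executed here):
#   curl -s -X POST 'http://staging:9200/_snapshot/<repo>/<snapshot>/_restore' \
#     -H 'Content-Type: application/json' \
#     -d "$(restore_test_body 'logs-*')"
```

Afterwards, check that the restored indices reach green and that document counts look plausible before you call the backup verified.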

How Netdata helps

  • Correlate snapshot duration spikes with per-node disk I/O wait and network throughput to distinguish repository latency from local node saturation.
  • Alert on snapshot health alongside cluster health so you catch silent backup gaps while the cluster is still green.
  • Track thread pool queue depths on the snapshot pool to detect when backup operations compete with production traffic.
  • Monitor per-node disk watermark proximity to predict allocation storms that can destabilize primaries during snapshot windows.
  • Surface unassigned shard counts and relocation rates next to snapshot failure events to explain PARTIAL states without manual correlation.