Redis CLUSTERDOWN / cluster_state:fail: slot coverage and recovery

The error "CLUSTERDOWN The cluster is down" means at least one of the 16384 hash slots lacks a healthy master. With cluster-require-full-coverage yes (the default), a single uncovered slot blocks all writes and, unless cluster-allow-reads-when-down is enabled, all reads as well. This guide covers diagnosing the root cause, recovering safely, and preventing recurrence.

What this means

Redis Cluster shards the keyspace across 16384 hash slots. Each slot must be assigned to a master node that is reachable and healthy to count toward cluster_slots_ok. When cluster_slots_assigned drops below 16384, or cluster_slots_fail becomes non-zero because a node has been marked FAIL by quorum, the cluster transitions to cluster_state:fail. While full coverage is required, clients then receive CLUSTERDOWN for every keyed command, not only for operations hashing to the affected slots.

The four slot counters in CLUSTER INFO are:

  • cluster_slots_assigned: slots bound to any node. Must be 16384 in a healthy cluster.
  • cluster_slots_ok: assigned slots on nodes that are neither FAIL nor PFAIL.
  • cluster_slots_pfail: slots on nodes that the reporting node suspects are down (PFAIL). This is a unilateral opinion and can clear if gossip resumes.
  • cluster_slots_fail: slots on nodes that a majority of masters have agreed are failed. This is a confirmed outage for that slot range.

CLUSTER INFO reflects the local node’s perspective. A node isolated in a minority partition reports cluster_state:fail because it cannot reach a majority of masters, even if the majority partition is healthy. Always query multiple nodes before concluding the cluster is globally down.
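
The multi-node comparison can be scripted. The following is a minimal sketch, not a definitive tool: summarize_views is a hypothetical helper name, and the node list and collection loop shown in the comment are assumptions about your environment. It reads one "<node> <cluster_state>" pair per line and reports whether the nodes agree.

```shell
# Hypothetical helper: reads lines of "<node> <cluster_state>" on stdin and
# reports whether the nodes agree. One way to collect the input (node list
# is an assumption):
#   for n in host1:6379 host2:6379; do
#     s=$(redis-cli -h "${n%:*}" -p "${n#*:}" CLUSTER INFO | tr -d '\r' |
#         awk -F: '$1 == "cluster_state" {print $2}')
#     echo "$n ${s:-unreachable}"
#   done | summarize_views
summarize_views() {
  awk '
    # Any state other than "ok" (fail, unreachable, ...) counts as not ok.
    { total++; if ($2 == "ok") ok++; else bad++ }
    END {
      if (total == 0)    print "no-data"
      else if (bad == 0) print "all-ok"
      else if (ok == 0)  print "none-ok"
      else               print "split-view"
    }'
}
```

A result of split-view points toward a partition or an isolated minority rather than a global outage, while none-ok suggests looking at the slot counters next.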

flowchart TD
    A[Client sees CLUSTERDOWN] --> B{Query CLUSTER INFO on all masters}
    B -->|All report fail| C{cluster_slots_assigned < 16384?}
    B -->|Split views| D[Network partition or minority isolation]
    C -->|Yes| E[Orphaned slots: find gaps in CLUSTER NODES]
    C -->|No| F[Check cluster_slots_fail for dead master]
    D --> G[Fix port+10000 connectivity]
    E --> H[Reassign with CLUSTER ADDSLOTS]
    F --> I{Node recoverable?}
    I -->|Yes| J[Restart or reconnect node]
    I -->|No| H
    J --> K[Verify cluster_slots_ok = 16384]
    H --> K

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Node crash or OOM kill | cluster_slots_fail > 0, one node does not respond to PING | INFO server uptime and dmesg on the missing node |
| Network partition | Split CLUSTER INFO views: some nodes show ok, others fail | CLUSTER INFO from every node; reachability of the gossip port (port+10000) |
| Scale-in without slot migration | cluster_slots_assigned < 16384 after a node was removed | CLUSTER NODES for slots with no master |
| Stuck slot migration | CLUSTER NODES shows importing or migrating flags that never clear | Migration state on source and target |
| Gossip port blocked | Nodes appear healthy individually but disagree on topology | ss -tlnp and network policy for port+10000 |

Quick checks

Run these read-only commands before making changes.

# Cluster state and slot counts from the local node
redis-cli CLUSTER INFO

# Topology, flags, and slot assignments
redis-cli CLUSTER NODES

# Determine if the failure is a local view or global
redis-cli -h <another-master> CLUSTER INFO

# Liveness of the failed node
redis-cli -h <failed-node> PING

# Nodes currently marked as fail or pfail
redis-cli CLUSTER NODES | grep "fail"

# Local gossip port listen state
ss -tlnp | grep redis-server

# Whether full coverage is enforced
redis-cli CONFIG GET cluster-require-full-coverage

How to diagnose it

  1. Query multiple nodes. CLUSTER INFO is a per-node view. A node in a minority partition reports cluster_state:fail even if the majority partition is healthy. Run CLUSTER INFO on at least one node in each partition, or every master if possible.

  2. Read the slot counters.

    • If cluster_slots_assigned < 16384, slots are orphaned. This usually happens after a node was removed without migrating its slots first.
    • If cluster_slots_assigned = 16384 but cluster_slots_fail > 0, a node owning slots has been marked FAIL by quorum.
    • If cluster_slots_pfail > 0 and cluster_slots_fail = 0, a node is suspected failed but quorum has not confirmed it. Act before the cluster escalates to FAIL.

  3. Map slots to nodes. CLUSTER NODES shows which node owns which slot range. Look for flags like fail? (PFAIL) or fail (FAIL), and look for missing slot ranges that have no owner.

  4. Check node liveness independently. A node marked FAIL may simply be partitioned. Try redis-cli -h <node> INFO server and PING. If it responds, check network routes and firewall rules for the gossip bus on port+10000.

  5. Inspect for stuck migrations. In CLUSTER NODES, an entry of the form [slot-<-node-id] marks a slot being imported from that node, and [slot->-node-id] marks a slot being migrated to it. Compare the source and target: if the target died mid-migration, the source may still flag the slot as migrating while no node flags it as importing, leaving the slot in limbo.

  6. Evaluate coverage settings. If cluster-require-full-coverage is no, the cluster can report ok even when some slots have no owner, masking partial failures. If it is yes (the default), all writes, and by default all reads, are blocked during the outage.
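
Step 2 above can be sketched as a small filter over CLUSTER INFO output. This is an illustrative sketch only: classify_slots and the labels it prints are hypothetical names chosen for this example, and the branch order simply follows the bullets in step 2.

```shell
# Hypothetical helper: reads CLUSTER INFO output on stdin and prints which
# branch of step 2 applies.
# Usage: redis-cli -h <node> CLUSTER INFO | classify_slots
classify_slots() {
  # CLUSTER INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_slots_assigned" { assigned = $2 + 0; seen = 1 }
    $1 == "cluster_slots_fail"     { fail = $2 + 0 }
    $1 == "cluster_slots_pfail"    { pfail = $2 + 0 }
    END {
      if (!seen)                 print "no-data"
      else if (assigned < 16384) print "orphaned-slots"
      else if (fail > 0)         print "master-fail"
      else if (pfail > 0)        print "master-pfail"
      else                       print "counters-ok"
    }'
}
```

Because the counters are a per-node view, the result is only meaningful alongside the same check run on other masters.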

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| cluster_state | Binary cluster health | fail sustained for more than 60 seconds |
| cluster_slots_assigned | Must equal 16384 for full coverage | Any value below 16384 |
| cluster_slots_ok | Slots available for queries | Below 16384 |
| cluster_slots_fail | Slots on confirmed-dead masters | Greater than 0 means active outage |
| cluster_slots_pfail | Slots on suspected-dead masters | Sustained value above 0 indicates impending fail |
| cluster_known_nodes | Expected topology size | Drop suggests partition or decommission |

Fixes

Orphaned slots after node removal

When cluster_slots_assigned < 16384, identify orphaned ranges with CLUSTER NODES. Choose a live master with capacity and assign the slots:

redis-cli -h <surviving-master> CLUSTER ADDSLOTS 5461 5462 5463

This is a destructive topology change. Only add slots that are currently unbound. If the receiving node sees the slot as still claimed by another node, the command returns an error. After adding slots, verify that cluster_slots_assigned returns to 16384 on all reachable masters. For large ranges, script multiple CLUSTER ADDSLOTS calls rather than passing thousands of individual arguments in one command. On Redis 7.0 and later, CLUSTER ADDSLOTSRANGE accepts start and end slots directly.
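
Finding the orphaned ranges by eye is error-prone, so a sketch of one way to compute them follows. find_slot_gaps is a hypothetical helper, and it assumes the usual CLUSTER NODES layout in which slot entries start at the ninth field. It reports slots that no node claims at all; slots still assigned to a node marked fail are not listed.

```shell
# Hypothetical helper: reads CLUSTER NODES output on stdin and prints every
# slot range that no node claims.
# Usage: redis-cli -h <node> CLUSTER NODES | find_slot_gaps
find_slot_gaps() {
  tr -d '\r' | awk '
    {
      # Assumption: slot entries begin at field 9 of each node line.
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[/) continue          # skip migrating/importing markers
        n = split($i, r, "-")
        lo = r[1] + 0
        hi = (n > 1 ? r[2] : r[1]) + 0
        for (s = lo; s <= hi; s++) owned[s] = 1
      }
    }
    END {
      start = -1
      # 16384 acts as a sentinel so a gap ending at slot 16383 is flushed.
      for (s = 0; s <= 16384; s++) {
        if (s < 16384 && !(s in owned)) {
          if (start < 0) start = s
        } else if (start >= 0) {
          print (start == s - 1 ? start : start "-" (s - 1))
          start = -1
        }
      }
    }'
}
```

An empty result means every slot has an owner in that node's view, in which case the problem lies with a failed master rather than with unassigned slots.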

Recovering a failed node

If the node is alive but marked FAIL due to gossip timeout, fix the underlying connectivity issue. Once the node resumes gossip and the majority agrees, it exits FAIL state and its slots return to ok automatically.

If the node crashed and restarted with persistent data and the same node ID, allow it to rejoin. If a replica was promoted during the outage, the returning master becomes a replica of the new master and does not reclaim its old slots automatically. Do not manually promote it back unless you intend to trigger another failover. If slots were manually migrated away during the outage, you must either move them back or accept the new topology.

Unsticking a dead migration

If a slot migration hangs because the target died, the source and target may hold stale migrating and importing flags. After confirming the target is permanently gone, use CLUSTER SETSLOT <slot> NODE <node-id> to force ownership to a stable master, then verify with CLUSTER NODES. This is destructive: ensure the master you assign actually holds the keys for that slot, or clients will query an empty keyspace. If the source still lists the slot as migrating and no node lists it as importing, the slot is effectively orphaned and must be reassigned.
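
To see which slots are still flagged, the bracketed entries in CLUSTER NODES can be extracted mechanically. The sketch below uses a hypothetical helper, list_open_migrations, and assumes the [slot->-node-id] and [slot-<-node-id] marker forms. To my understanding a node only reports these markers on its own (myself) line, so treat that as an assumption and run the check against each master.

```shell
# Hypothetical helper: reads CLUSTER NODES output on stdin and prints one line
# per open migration marker: "<state> <slot> <reporting-node-id> <peer-node-id>"
# Usage: redis-cli -h <master> CLUSTER NODES | list_open_migrations
list_open_migrations() {
  tr -d '\r' | awk '
    {
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[[0-9]+->-/) {
          e = substr($i, 2, length($i) - 2)   # drop the surrounding brackets
          split(e, p, "->-")
          print "migrating", p[1], $1, p[2]
        } else if ($i ~ /^\[[0-9]+-<-/) {
          e = substr($i, 2, length($i) - 2)
          split(e, p, "-<-")
          print "importing", p[1], $1, p[2]
        }
      }
    }'
}
```

A slot that appears as migrating on the source with no matching importing line from any other master is the limbo case described above.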

Emergency partial coverage

As a temporary measure to restore writes to healthy slots while you recover a dead node, you can disable full coverage:

redis-cli CONFIG SET cluster-require-full-coverage no

This is dangerous. It allows writes to covered slots but returns errors for uncovered slots, and it masks partial failures. Client libraries may also cache the topology and continue sending requests to the failed node. Revert to yes immediately after recovery.

Prevention

  • Migrate before removing. Always move slots off a node before decommissioning it. Do not rely on the node being deleted to free its slots.
  • Open gossip ports. Ensure port+10000 is reachable between all cluster nodes and not blocked by host-level firewalls, security groups, or network policies.
  • Monitor slot counts. Alert when cluster_slots_assigned is not 16384 or cluster_slots_fail is above zero.
  • Avoid cluster-require-full-coverage no as a default. It makes partial outages silent.
  • Maintain quorum awareness. A three-master cluster can survive the loss of one master, provided that master has a replica to promote. Losing two masters leaves the remaining node unable to reach quorum, creating a zombie state that cannot auto-failover.
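
The quorum point is simple arithmetic: a majority of masters is needed to confirm FAIL and to authorize a failover. The two functions below are hypothetical helper names used only to illustrate it.

```shell
# Majority of masters required, given the total number of masters.
master_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many masters can be lost while the survivors still form a majority.
tolerable_master_losses() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For a three-master cluster, master_quorum 3 gives 2 and tolerable_master_losses 3 gives 1, which matches the one-master tolerance described in the last bullet.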

How Netdata helps

Netdata exposes cluster_state, cluster_slots_assigned, cluster_slots_ok, cluster_slots_fail, and cluster_slots_pfail as time-series metrics. Use them to:

  • Correlate cluster_state:fail with per-node CPU, memory, and network metrics to determine whether the cause is a local crash, OOM kill, or network partition.
  • Alert without delay when cluster_slots_fail is above zero or cluster_slots_assigned drops below 16384.
  • Catch suspected node failures while they are still pfail, before quorum confirms them as fail.
  • Compare cluster_state across multiple nodes to identify minority partitions during network splits.