Redis cluster_slots_pfail > 0: impending node failure in a cluster

cluster_slots_pfail > 0 means at least one hash slot is mapped to a node that a peer suspects is down. In Redis Cluster, PFAIL is unilateral: any node raises it when another stops answering gossip PINGs for longer than cluster-node-timeout. Slots continue to serve traffic; the cluster has not yet agreed the node is dead.

Brief spikes are expected during background saves, AOF rewrites, or any main-thread freeze. Sustained non-zero values indicate a real problem: network partition, node crash, or overload. If the majority of masters confirm the suspicion within twice cluster-node-timeout, PFAIL escalates to FAIL. The affected slots become unavailable until a replica wins election. In a three-master cluster, losing two primaries leaves the survivor without quorum. The cluster enters a zombie state where no failover can proceed. Investigate PFAIL while you still have quorum and before automatic escalation.

What this means

Redis Cluster shards data across 16384 hash slots. Gossip on the cluster bus (Redis port + 10000) maintains topology. When node A has an unanswered PING pending to node B past cluster-node-timeout, node A marks node B PFAIL. Masters and replicas can raise PFAIL against any other node.

PFAIL is one node’s opinion. Slots mapped to a PFAIL node keep serving traffic. Escalation to FAIL requires a single node to collect gossip from the majority of masters confirming the same target is PFAIL, with those reports arriving within twice cluster-node-timeout. Once FAIL is reached, the failed master’s replicas schedule an election, and the replica with the highest replication offset normally starts first. A majority of masters must vote for the candidate to promote it.
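The majority and timing rules above reduce to simple arithmetic. A minimal sketch in shell; fail_quorum and report_window_ms are illustrative helper names, not Redis commands:

```shell
# Hypothetical helpers for reasoning about quorum; not part of Redis.

# Smallest number of masters that forms a majority of N masters.
fail_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Window in milliseconds within which PFAIL reports must arrive,
# following the "twice cluster-node-timeout" rule described above.
report_window_ms() {
    echo $(( $1 * 2 ))
}

# Usage, assuming 3 masters and a 15000 ms timeout:
#   fail_quorum 3            # prints 2
#   report_window_ms 15000   # prints 30000
```

With three masters the majority is two, which is why losing two of them leaves the survivor unable to declare FAIL or authorize a promotion.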

The FAIL flag is mostly one-way. It clears only when the node is reachable again and one of the following holds: the node is a replica, the node is a master serving no slots, or the node is a master and an extended period has passed with no replica promotion detected.

flowchart TD
    A[Node misses PONGs past cluster-node-timeout] --> B[PFAIL raised by one node]
    B --> C{Node recovers and answers PONG?}
    C -->|Yes| D[PFAIL cleared automatically]
    C -->|No| E{Majority of masters confirm PFAIL?}
    E -->|No| F[PFAIL persists without escalation]
    E -->|Yes| G[FAIL declared]
    G --> H[Replica election triggered]
    H --> I{Quorum available to vote?}
    I -->|Yes| J[Replica promoted, failover complete]
    I -->|No| K[Zombie state: no quorum]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Transient pause from fork or GC | cluster_slots_pfail spikes briefly, then returns to 0 during BGSAVE or AOF rewrite | INFO persistence for rdb_bgsave_in_progress or aof_rewrite_in_progress, and latest_fork_usec |
| Network partition or packet loss | PFAIL persists on multiple nodes for the same target, or asymmetric gossip flags | CLUSTER NODES to see if multiple nodes flag the same peer; ping the cluster bus port |
| Node crash or unresponsive event loop | Target node does not respond to direct PING; uptime_in_seconds may have reset | Direct redis-cli PING to the target; check kernel logs for OOM kills |
| Gossip port blocked by firewall | One node sees all others as PFAIL, while the rest see each other as healthy | Compare CLUSTER NODES from the isolated node’s perspective vs a healthy node |
| cluster-node-timeout too aggressive | PFAIL appears during normal I/O latency without actual failure | CONFIG GET cluster-node-timeout and compare to your latest_fork_usec baseline |

Quick checks

# Check cluster-wide PFAIL, FAIL, and node counts
redis-cli CLUSTER INFO | grep -E "cluster_state|cluster_slots_pfail|cluster_slots_fail|cluster_known_nodes"

# View node flags, IDs, and link states
redis-cli CLUSTER NODES

# Test direct reachability to the suspected node
redis-cli -h <target_ip> -p <port> PING

# Check if a fork is blocking the main thread
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|aof_rewrite_in_progress"

# Check recent fork duration (transient PFAIL correlates with high values)
redis-cli INFO stats | grep latest_fork_usec

# Check replication link if the target is a replica
redis-cli -h <target_ip> -p <port> INFO replication | grep master_link_status

# Read the configured failure detection timeout
redis-cli CONFIG GET cluster-node-timeout
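
When sampling repeatedly, it can help to condense the CLUSTER INFO check into one line per sample. A minimal sketch, assuming the key:value output format that CLUSTER INFO returns; cluster_summary is a hypothetical helper name:

```shell
# Hypothetical helper: reads `redis-cli CLUSTER INFO` output on stdin
# and prints state, PFAIL slot count, and FAIL slot count on one line.
cluster_summary() {
    # Strip the carriage returns Redis includes in its line endings.
    tr -d '\r' | awk -F: '
        $1 == "cluster_state"       { state = $2 }
        $1 == "cluster_slots_pfail" { pfail = $2 }
        $1 == "cluster_slots_fail"  { fail  = $2 }
        END { printf "state=%s pfail=%s fail=%s\n", state, pfail, fail }
    '
}

# Usage:
#   redis-cli CLUSTER INFO | cluster_summary
```

Running this against each node in turn shows whether the suspicion is held by one observer or by several.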

How to diagnose it

  1. Confirm duration. A single 30-second spike of cluster_slots_pfail = 2 differs from a sustained plateau over minutes. Sample CLUSTER INFO repeatedly or inspect the metric trend before treating this as an incident.
  2. Identify the affected node. CLUSTER NODES shows flags. Look for fail? (PFAIL) or fail (FAIL). The address and node ID tell you which host to investigate directly.
  3. Correlate with persistence events. If rdb_bgsave_in_progress or aof_rewrite_in_progress is 1 and latest_fork_usec is high, especially above 500 ms (a value of 500000, since the field is reported in microseconds), the PFAIL is likely from a fork freeze. Check Transparent Huge Pages status: cat /sys/kernel/mm/transparent_hugepage/enabled.
  4. Verify network reachability. From the reporting node, run redis-cli -h <target> -p <port> PING. Also test the cluster bus port (port + 10000) with a plain TCP connect check such as nc -z <target> <bus_port>, because the bus speaks a binary protocol that redis-cli PING cannot exercise. If the client port answers but the bus port does not, firewall rules are the most likely cause.
  5. Check for asymmetry. Compare CLUSTER NODES output from multiple masters. If only one node reports PFAIL for the target while others see it healthy, suspect a partition between those two specific nodes rather than total node failure.
  6. Assess quorum health. Count active masters. A three-master cluster needs two healthy primaries to declare FAIL and vote in a replica election. If you are down to one master, the cluster is in a zombie state. Do not wait for automatic failover.
  7. Check the target node directly. instantaneous_ops_per_sec (INFO stats) dropping to zero indicates an event loop block. used_memory_rss (INFO memory) near system limits suggests OOM pressure. loading:1 (INFO persistence) explains why a recently restarted node is temporarily unreachable.
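
Steps 2 and 6 can be partly automated by parsing CLUSTER NODES. The sketch below assumes the standard space-separated output, with comma-separated flags in the third field and slot ranges starting at the ninth. suspect_nodes and healthy_masters are hypothetical helper names, and the result reflects only the view of the node you query:

```shell
# Hypothetical helpers that read `redis-cli CLUSTER NODES` output on stdin.

# Print "<id> <address> <flags>" for nodes flagged fail? (PFAIL) or fail (FAIL).
suspect_nodes() {
    tr -d '\r' | awk '{
        n = split($3, flags, ",")
        for (i = 1; i <= n; i++)
            if (flags[i] == "fail?" || flags[i] == "fail") {
                print $1, $2, $3
                break
            }
    }'
}

# Count masters that own at least one slot and carry neither flag.
healthy_masters() {
    tr -d '\r' | awk '{
        n = split($3, flags, ","); master = 0; bad = 0
        for (i = 1; i <= n; i++) {
            if (flags[i] == "master") master = 1
            if (flags[i] == "fail?" || flags[i] == "fail") bad = 1
        }
        # Field 9 onward holds slot ranges; a master without them cannot vote.
        if (master && !bad && NF >= 9) count++
    } END { print count + 0 }'
}

# Usage:
#   redis-cli CLUSTER NODES | suspect_nodes
#   redis-cli CLUSTER NODES | healthy_masters
```

Comparing the healthy master count against the majority of all slot-owning masters tells you whether automatic failover is still possible.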

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| cluster_slots_pfail | Slots on nodes suspected of failure | Sustained value > 0 |
| cluster_slots_fail | Slots on confirmed-failed nodes | Any value > 0 means active outage |
| cluster_state | Binary cluster health indicator | fail means clients receive CLUSTERDOWN |
| latest_fork_usec | Main-thread freeze during fork | > 500ms correlates with transient PFAIL spikes |
| master_link_status on replicas | Whether replicas are streaming from primaries | down on a replica that should be up narrows scope to replication health |
| cluster_stats_messages_sent / received | Gossip volume between nodes | Asymmetry or drops indicate network partition |
| loading | Startup dataset restoration | loading:1 explains a temporarily unresponsive node after restart |
| connected_slaves on primary | Expected replica count | Drop below expected means fewer failover candidates |

Fixes

Transient fork or I/O pause

If PFAIL correlates with rdb_bgsave_in_progress or aof_rewrite_in_progress, and latest_fork_usec is high, the node froze long enough to miss gossip heartbeats.

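  • Increase cluster-node-timeout temporarily or permanently if your workload requires frequent persistence forks: CONFIG SET cluster-node-timeout 30000. The value is in milliseconds, and CONFIG SET affects only the node it runs on, so apply it to every node. The trade-off is that real failures also take longer to detect.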
  • Disable Transparent Huge Pages at runtime: echo never > /sys/kernel/mm/transparent_hugepage/enabled. THP is a common cause of fork latency spikes. This is not persistent across reboots on most systems.
  • If the node is restarting, wait for loading to return to 0 before assuming failure. Do not force failover during startup restoration.
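
To check whether a fork pause is long enough to matter, the latest_fork_usec value can be converted to milliseconds and compared with a threshold. A minimal sketch; fork_ms and fork_is_slow are hypothetical helper names, and the 500 ms default mirrors the level this guide associates with transient PFAIL rather than any Redis setting:

```shell
# Hypothetical helpers; not part of Redis or redis-cli.

# Reads `redis-cli INFO stats` output on stdin and prints the most
# recent fork duration in whole milliseconds.
fork_ms() {
    tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { print int($2 / 1000) }'
}

# Succeeds (exit status 0) when the fork time in ms exceeds the threshold.
# The second argument is optional and defaults to 500.
fork_is_slow() {
    [ "$1" -gt "${2:-500}" ]
}

# Usage:
#   ms=$(redis-cli INFO stats | fork_ms)
#   fork_is_slow "$ms" && echo "fork pause of ${ms} ms may explain PFAIL"
```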

Network partition or firewall

If the node is reachable on the client port but not the cluster bus, or if CLUSTER NODES shows one-way reachability:

  • Open the cluster bus port (Redis port + 10000) between all nodes. TCP must be allowed bidirectionally.
  • In containers or cloud VPCs, verify security groups and network policies include the bus port. This is the most commonly missed rule in Redis Cluster deployments.
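
A quick way to verify the rule is a TCP connect test against the bus port from each node to every other node. The sketch below assumes the default offset of 10000 and an nc build that supports the -z and -w options; bus_port and check_bus are hypothetical helper names:

```shell
# Hypothetical helpers; assume the default bus port offset of 10000.

# Print the cluster bus port for a given client port.
bus_port() {
    echo $(( $1 + 10000 ))
}

# Attempt a TCP connection to the bus port with a 2-second timeout.
# Succeeds only if the port accepts the connection.
check_bus() {
    nc -z -w 2 "$1" "$(bus_port "$2")"
}

# Usage:
#   check_bus <target_ip> 6379 && echo "bus port reachable"
```

Run the check in both directions, since a rule that allows A to reach B does not guarantee that B can reach A.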

Node crash or overload

If the target node does not respond to PING and uptime_in_seconds has reset or is very low:

  • Check kernel logs for OOM kills.
  • If the node is permanently lost and you have quorum (majority of masters healthy), automatic failover should proceed once FAIL is declared. Watch cluster_slots_fail return to 0 to confirm that a promoted replica has taken over the slots.
  • If you do not have quorum, for example a three-master cluster with two masters down, automatic failover cannot proceed. You must recover at least one master or consider manual promotion.

Stuck PFAIL without FAIL escalation

Sometimes a node stays in PFAIL but never reaches FAIL because not enough masters agree.

  • Investigate why gossip is not propagating. Asymmetric partitions can cause this.
  • If the node is actually down and you need to force promotion, CLUSTER FAILOVER FORCE on an eligible replica starts failover immediately without waiting for FAIL state and without the usual handshake with the master, so it works when the master is unreachable. It still needs votes from a majority of masters.
  • CLUSTER FAILOVER TAKEOVER bypasses the election entirely and promotes the replica unconditionally. Only use this when quorum is lost and you accept the risk of split-brain if the old master reappears. After TAKEOVER, if the old master recovers and still claims the same slots, run CLUSTER FORGET for the old master node ID on every remaining node to prevent two nodes from claiming those slots.

Prevention

  • Size cluster-node-timeout for your fork baseline. Set it well above your typical latest_fork_usec peak, but not so high that real failures go undetected for minutes. In fork-heavy environments, 20 to 30 seconds is common.
  • Monitor fork latency and THP status. Persistent fork latency above 200ms is a leading indicator for transient PFAIL storms.
  • Ensure the cluster bus is routable. Document the port + 10000 requirement in firewall rules and network policies. Verify bidirectionally from every node.
  • Avoid marginal quorum topologies. A three-master cluster survives one master failure but enters a zombie state on the second. If you need higher availability, design for faster manual intervention or additional masters.
  • Tune repl-backlog-size to prevent full resync cascades. A replica that reconnects after a long PFAIL event may need to full resync if the backlog is too small. The default 1MB is usually insufficient. 100MB or more is typical for production.
  • Set cluster-require-full-coverage intentionally. The default is yes, meaning the entire cluster goes to cluster_state:fail if any slot loses all its nodes. Setting it to no allows available slots to keep serving traffic during partial outages, but it masks coverage gaps.
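
For the repl-backlog-size item, one rough sizing heuristic is write throughput multiplied by the longest disconnect you want a replica to survive without a full resync. This is an assumption for illustration, not an official formula, and backlog_bytes is a hypothetical helper name:

```shell
# Hypothetical helper: estimate a backlog size in bytes.
#   $1 = replication write rate in bytes per second (measure it yourself)
#   $2 = longest replica disconnect to tolerate, in seconds
backlog_bytes() {
    echo $(( $1 * $2 ))
}

# Usage, assuming 2 MiB/s of writes and a 60-second outage:
#   backlog_bytes 2097152 60   # prints 125829120 (120 MiB)
```

The example lands near the 100MB range mentioned above, but the right figure depends on your own write rate and outage tolerance.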

How Netdata helps

  • Netdata collects cluster_slots_pfail, cluster_slots_fail, and cluster_state from every node, letting you spot whether a PFAIL event is isolated to one observer or cluster-wide.
  • Correlate cluster_slots_pfail spikes with redis.latest_fork_usec on the same node to identify fork-induced transient pauses.
  • Cross-reference with redis.connected_slaves, redis.master_link_status, and replication offset lag to determine whether a PFAIL node is a master with healthy replicas ready for failover.
  • Latency and CPU saturation metrics help distinguish a slow node from a completely dead one before FAIL is declared.