Redis cluster_slots_pfail > 0: impending node failure in a cluster
cluster_slots_pfail > 0 means at least one hash slot is mapped to a node that a peer suspects is down. In Redis Cluster, PFAIL is unilateral: any node raises it when another stops answering gossip PINGs for longer than cluster-node-timeout. Slots continue to serve traffic; the cluster has not yet agreed the node is dead.
Brief spikes are expected during background saves, AOF rewrites, or any main-thread freeze. Sustained non-zero values indicate a real problem: network partition, node crash, or overload. If the majority of masters confirm the suspicion within twice cluster-node-timeout, PFAIL escalates to FAIL. The affected slots become unavailable until a replica wins election. In a three-master cluster, losing two primaries leaves the survivor without quorum. The cluster enters a zombie state where no failover can proceed. Investigate PFAIL while you still have quorum and before automatic escalation.
What this means
Redis Cluster shards data across 16384 hash slots. Gossip on the cluster bus (Redis port + 10000) maintains topology. When node A has an unanswered PING pending to node B past cluster-node-timeout, node A marks node B PFAIL. Masters and replicas can raise PFAIL against any other node.
PFAIL is one node’s opinion. Slots mapped to a PFAIL node keep serving traffic. Escalation to FAIL requires a single node to collect gossip from the majority of masters confirming the same target is PFAIL, with those reports arriving within twice cluster-node-timeout. Once FAIL is reached, the replicas of the failed master become eligible to start an election. Each waits a short delay ranked by replication offset, so the most up-to-date replica usually requests votes first. A majority of masters must vote for the candidate to promote it.
The FAIL flag is mostly one-way. It clears only when the node is reachable again and one of the following holds: it is a replica; it is a master serving no slots; or it is a master and an extended period (a multiple of cluster-node-timeout) has passed with no replica promotion detected.
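The quorum and timing rules above reduce to simple arithmetic. The following is a minimal sketch; the helper names (masters_quorum, fail_report_window_ms, has_quorum) are hypothetical and not part of Redis.

```shell
# Hypothetical helpers for reasoning about quorum; not part of Redis.

# Votes needed for a majority of N masters: floor(N/2) + 1.
masters_quorum() {
  local masters="$1"
  echo $(( masters / 2 + 1 ))
}

# Window (ms) in which PFAIL reports must arrive to count toward FAIL:
# twice cluster-node-timeout, as described above.
fail_report_window_ms() {
  local node_timeout_ms="$1"
  echo $(( node_timeout_ms * 2 ))
}

# Succeeds if the healthy masters can still form a majority.
# Arguments: total masters, healthy masters.
has_quorum() {
  local total="$1"
  local healthy="$2"
  [ "$healthy" -ge "$(masters_quorum "$total")" ]
}

# Usage:
#   masters_quorum 3        # majority for a three-master cluster
#   has_quorum 3 1 || echo "no quorum: automatic failover cannot proceed"
```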
flowchart TD
A[Node misses PONGs past cluster-node-timeout] --> B[PFAIL raised by one node]
B --> C{Node recovers and answers PONG?}
C -->|Yes| D[PFAIL cleared automatically]
C -->|No| E{Majority of masters confirm PFAIL?}
E -->|No| F[PFAIL persists without escalation]
E -->|Yes| G[FAIL declared]
G --> H[Replica election triggered]
H --> I{Quorum available to vote?}
I -->|Yes| J[Replica promoted, failover complete]
I -->|No| K[Zombie state: no quorum]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Transient pause from fork or another main-thread stall | cluster_slots_pfail spikes briefly, then returns to 0 during BGSAVE or AOF rewrite | INFO persistence for rdb_bgsave_in_progress or aof_rewrite_in_progress, and latest_fork_usec |
| Network partition or packet loss | PFAIL persists on multiple nodes for the same target, or asymmetric gossip flags | CLUSTER NODES to see if multiple nodes flag the same peer; ping the cluster bus port |
| Node crash or unresponsive event loop | Target node does not respond to direct PING; uptime_in_seconds may have reset | Direct redis-cli PING to the target; check kernel logs for OOM kills |
| Gossip port blocked by firewall | One node sees all others as PFAIL, while the rest see each other as healthy | Compare CLUSTER NODES from the isolated node’s perspective vs a healthy node |
| cluster-node-timeout too aggressive | PFAIL appears during normal I/O latency without actual failure | CONFIG GET cluster-node-timeout and compare to your latest_fork_usec baseline |
Quick checks
# Check cluster-wide PFAIL, FAIL, and node counts
redis-cli CLUSTER INFO | grep -E "cluster_state|cluster_slots_pfail|cluster_slots_fail|cluster_known_nodes"
# View node flags, IDs, and link states
redis-cli CLUSTER NODES
# Test direct reachability to the suspected node
redis-cli -h <target_ip> -p <port> PING
# Check if a fork is blocking the main thread
redis-cli INFO persistence | grep -E "rdb_bgsave_in_progress|aof_rewrite_in_progress"
# Check recent fork duration (transient PFAIL correlates with high values)
redis-cli INFO stats | grep latest_fork_usec
# Check replication link if the target is a replica
redis-cli -h <target_ip> -p <port> INFO replication | grep master_link_status
# Read the configured failure detection timeout
redis-cli CONFIG GET cluster-node-timeout
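To tell a brief spike from a sustained plateau, the checks above can be wrapped in a small sampler. This is a sketch under assumptions: the function names cluster_info_field and sample_pfail are hypothetical, and the default sample count and interval are arbitrary.

```shell
# Extract one field from CLUSTER INFO text on stdin.
# redis-cli output typically uses CRLF line endings, so strip the \r first.
cluster_info_field() {
  local field="$1"
  tr -d '\r' | awk -F: -v f="$field" '$1 == f { print $2 }'
}

# Poll one node and print a timestamped cluster_slots_pfail reading.
# Arguments: host, port, number of samples, seconds between samples.
sample_pfail() {
  local host="$1"
  local port="$2"
  local count="${3:-10}"
  local interval="${4:-5}"
  local i=0
  local value
  while [ "$i" -lt "$count" ]; do
    value=$(redis-cli -h "$host" -p "$port" CLUSTER INFO | cluster_info_field cluster_slots_pfail)
    printf '%s cluster_slots_pfail=%s\n' "$(date -u +%FT%TZ)" "${value:-unknown}"
    i=$(( i + 1 ))
    sleep "$interval"
  done
}

# Usage: sample_pfail <node_ip> <port> 12 5
```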
How to diagnose it
- Confirm duration. A single 30-second spike of cluster_slots_pfail = 2 differs from a sustained plateau over minutes. Sample CLUSTER INFO repeatedly or inspect the metric trend before treating this as an incident.
- Identify the affected node. CLUSTER NODES shows flags. Look for fail? (PFAIL) or fail (FAIL). The address and node ID tell you which host to investigate directly.
- Correlate with persistence events. If rdb_bgsave_in_progress or aof_rewrite_in_progress is 1 and latest_fork_usec is high, especially above 500 ms (a value of 500000, since the field is in microseconds), the PFAIL is likely from a fork freeze. Check Transparent Huge Pages status: cat /sys/kernel/mm/transparent_hugepage/enabled.
- Verify network reachability. From the reporting node, run redis-cli -h <target> -p <port> PING. Also test the cluster bus port (port + 10000). If the client port answers but the bus port does not, firewall rules are the most likely cause.
- Check for asymmetry. Compare CLUSTER NODES output from multiple masters. If only one node reports PFAIL for the target while others see it healthy, suspect a partition between those two specific nodes rather than total node failure.
- Assess quorum health. Count active masters. A three-master cluster needs two healthy primaries to declare FAIL and vote in a replica election. If you are down to one master, the cluster is in a zombie state. Do not wait for automatic failover.
- Check the target node directly. In INFO stats, instantaneous_ops_per_sec dropping to zero indicates an event loop block. In INFO memory, used_memory_rss near system limits suggests OOM pressure. In INFO persistence, loading:1 explains why a recently restarted node is temporarily unreachable.
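The flag and asymmetry checks in the steps above can be scripted by filtering CLUSTER NODES output. A minimal sketch, assuming the standard field layout (node ID, address, comma-separated flags); the function name flagged_peers is hypothetical.

```shell
# Print "node_id address flag" for every peer that this node's view marks
# as fail? (PFAIL) or fail (FAIL). Reads CLUSTER NODES text on stdin.
flagged_peers() {
  tr -d '\r' | awk '
    {
      n = split($3, flags, ",")
      for (i = 1; i <= n; i++)
        if (flags[i] == "fail?" || flags[i] == "fail")
          print $1, $2, flags[i]
    }'
}

# Usage: run against several masters and compare the results.
#   redis-cli -h <master_ip> -p <port> CLUSTER NODES | flagged_peers
```

If one master lists a peer as fail? and the others print nothing, that points to a partition between that pair of nodes rather than a dead node.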
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| cluster_slots_pfail | Slots on nodes suspected of failure | Sustained value > 0 |
| cluster_slots_fail | Slots on confirmed-failed nodes | Any value > 0 means active outage |
| cluster_state | Binary cluster health indicator | fail means clients receive CLUSTERDOWN |
| latest_fork_usec | Main-thread freeze during fork, in microseconds | > 500000 (500 ms) correlates with transient PFAIL spikes |
| master_link_status on replicas | Whether replicas are streaming from primaries | down on a replica that should be up narrows scope to replication health |
| cluster_stats_messages_sent / cluster_stats_messages_received | Gossip volume between nodes | Asymmetry or drops indicate network partition |
| loading | Startup dataset restoration | loading:1 explains a temporarily unresponsive node after restart |
| connected_slaves on primary | Expected replica count | Drop below expected means fewer failover candidates |
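Because latest_fork_usec is reported in microseconds while the thresholds in this guide are in milliseconds, a small conversion helper avoids unit mistakes. This is a sketch; the function names fork_latency_ms and fork_was_slow are hypothetical.

```shell
# Convert latest_fork_usec from INFO stats text (stdin) to whole milliseconds.
fork_latency_ms() {
  tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { print int($2 / 1000) }'
}

# Succeeds when the last fork took longer than the given threshold (ms).
# Reads INFO stats text on stdin.
fork_was_slow() {
  local threshold_ms="$1"
  local ms
  ms=$(fork_latency_ms)
  [ -n "$ms" ] && [ "$ms" -gt "$threshold_ms" ]
}

# Usage:
#   redis-cli -h <node_ip> -p <port> INFO stats | fork_was_slow 500 && echo "slow fork"
```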
Fixes
Transient fork or I/O pause
If PFAIL correlates with rdb_bgsave_in_progress or aof_rewrite_in_progress, and latest_fork_usec is high, the node froze long enough to miss gossip heartbeats.
- Increase cluster-node-timeout temporarily or permanently if your workload requires frequent persistence forks: CONFIG SET cluster-node-timeout 30000. Apply the same value on every node, because CONFIG SET only affects the node that receives it. The trade-off is that real failures also take longer to detect.
- Disable Transparent Huge Pages at runtime: echo never > /sys/kernel/mm/transparent_hugepage/enabled. THP is a common cause of fork latency spikes. This is not persistent across reboots on most systems.
- If the node is restarting, wait for loading to return to 0 before assuming failure. Do not force failover during startup restoration.
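The THP setting file lists every mode and marks the active one with square brackets, for example "always madvise [never]". A minimal sketch for reading it in a script; the function name thp_active_mode is hypothetical.

```shell
# Print the active THP mode (the bracketed word) from the file content on stdin.
thp_active_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
#   [ "$(thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled)" = "never" ] \
#     || echo "THP is still active on this host"
```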
Network partition or firewall
If the node is reachable on the client port but not the cluster bus, or if CLUSTER NODES shows one-way reachability:
- Open the cluster bus port (Redis port + 10000) between all nodes. TCP must be allowed bidirectionally.
- In containers or cloud VPCs, verify security groups and network policies include the bus port. This is the most commonly missed rule in Redis Cluster deployments.
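To verify the bus port from each node, the endpoints can be derived from CLUSTER NODES, whose address field has the form ip:port@cport. The sketch below makes two assumptions: the function names bus_endpoints and tcp_open are hypothetical, and an nc (netcat) build that supports -z and -w is installed.

```shell
# Print "host bus_port" for every node in CLUSTER NODES text (stdin).
# Falls back to port + 10000 when the address carries no @cport part.
bus_endpoints() {
  tr -d '\r' | awk '
    {
      addr = $2
      sub(/,.*/, "", addr)                 # drop optional ",hostname" suffix
      cport = ""
      at = index(addr, "@")
      if (at > 0) {
        cport = substr(addr, at + 1)
        addr = substr(addr, 1, at - 1)
      }
      colon = match(addr, /:[0-9]+$/)      # last colon, so IPv6 hosts survive
      if (colon == 0) next
      host = substr(addr, 1, colon - 1)
      port = substr(addr, colon + 1)
      if (cport == "") cport = port + 10000
      if (host != "") print host, cport    # skip nodes with no known address
    }'
}

# Succeeds if a TCP connection to host:port opens within 2 seconds.
tcp_open() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# Usage (repeat from every node, since reachability must hold both ways):
#   redis-cli -h <node_ip> -p <port> CLUSTER NODES | bus_endpoints |
#     while read -r host cport; do
#       tcp_open "$host" "$cport" || echo "bus port unreachable: $host:$cport"
#     done
```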
Node crash or overload
If the target node does not respond to PING and uptime_in_seconds has reset or is very low:
- Check kernel logs for OOM kills.
- If the node is permanently lost and you have quorum (majority of masters healthy), automatic failover should proceed once FAIL is declared. Monitor cluster_slots_fail until it returns to 0 to confirm that a promoted replica has taken over the slots.
- If you do not have quorum, for example a three-master cluster with two masters down, automatic failover cannot proceed. You must recover at least one master or consider manual promotion.
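Checking kernel logs for OOM kills can be narrowed to lines that mention Redis. This is a rough sketch: kernel message wording varies between versions, so the pattern is deliberately loose, and the function name oom_kills_for_redis is hypothetical.

```shell
# Filter kernel log text (stdin) for OOM-killer lines that mention redis.
# The pattern is an assumption about common wording, not an exhaustive match.
oom_kills_for_redis() {
  grep -Ei 'out of memory|oom-kill|oom_reaper' | grep -i 'redis'
}

# Usage:
#   dmesg -T | oom_kills_for_redis
#   journalctl -k | oom_kills_for_redis
```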
Stuck PFAIL without FAIL escalation
Sometimes a node stays in PFAIL but never reaches FAIL because not enough masters agree.
- Investigate why gossip is not propagating. Asymmetric partitions can cause this.
- If the node is actually down and you need to force promotion, CLUSTER FAILOVER FORCE on an eligible replica starts failover immediately without waiting for FAIL state, provided the master is unreachable from the replica’s view. The replica still needs votes from a majority of masters.
- CLUSTER FAILOVER TAKEOVER bypasses the election entirely and promotes the replica unconditionally. Only use this when quorum is lost and you accept the risk of split-brain if the old master reappears. After TAKEOVER, if the old master recovers, run CLUSTER FORGET for the old master node ID on every remaining node to prevent two nodes from claiming the same slots.
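Before forcing a promotion, it helps to list which replicas of the affected master look usable from a healthy node's view. A minimal sketch, assuming the standard CLUSTER NODES field order (flags in field 3, master ID in field 4, link state in field 8); the function name failover_candidates is hypothetical.

```shell
# List replicas of the given master from CLUSTER NODES text (stdin),
# skipping replicas flagged fail? or fail and those whose link is not
# connected. Output: "node_id address" per candidate.
failover_candidates() {
  local master_id="$1"
  tr -d '\r' | awk -v m="$master_id" '
    $4 == m && $8 == "connected" {
      n = split($3, flags, ",")
      bad = 0
      for (i = 1; i <= n; i++)
        if (flags[i] == "fail?" || flags[i] == "fail") bad = 1
      if (!bad) print $1, $2
    }'
}

# Usage:
#   redis-cli -h <healthy_ip> -p <port> CLUSTER NODES | failover_candidates <master_node_id>
# Then, on the chosen replica:
#   redis-cli -h <replica_ip> -p <port> CLUSTER FAILOVER FORCE
```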
Prevention
- Size cluster-node-timeout for your fork baseline. Set it well above your typical latest_fork_usec peak, but not so high that real failures go undetected for minutes. In fork-heavy environments, 20 to 30 seconds is common.
- Monitor fork latency and THP status. Persistent fork latency above 200ms is a leading indicator for transient PFAIL storms.
- Ensure the cluster bus is routable. Document the port + 10000 requirement in firewall rules and network policies. Verify bidirectionally from every node.
- Avoid marginal quorum topologies. A three-master cluster survives one master failure but enters a zombie state on the second. If you need higher availability, design for faster manual intervention or additional masters.
- Tune repl-backlog-size to prevent full resync cascades. A replica that reconnects after a long PFAIL event may need to full resync if the backlog is too small. The default 1MB is usually insufficient. 100MB or more is typical for production.
- Set cluster-require-full-coverage intentionally. The default is yes, meaning the entire cluster goes to cluster_state:fail if any slot loses all its nodes. Setting it to no allows available slots to keep serving traffic during partial outages, but it masks coverage gaps.
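Timeout and backlog settings need to match on every node, so one option is to generate the commands for review before running them. This is a dry-run sketch: the function name render_cluster_settings is hypothetical, and the default values (20000 ms, 100mb) are illustrative assumptions taken from the ranges above, not recommendations.

```shell
# Print (do not run) CONFIG SET commands for each "host port" line on stdin.
# Arguments: cluster-node-timeout in ms, repl-backlog-size.
render_cluster_settings() {
  local timeout_ms="${1:-20000}"
  local backlog="${2:-100mb}"
  local host port
  while read -r host port; do
    [ -n "$host" ] || continue
    echo "redis-cli -h $host -p $port CONFIG SET cluster-node-timeout $timeout_ms"
    echo "redis-cli -h $host -p $port CONFIG SET repl-backlog-size $backlog"
  done
}

# Usage: review the output first, then pipe it to sh if it looks right.
#   printf '10.0.0.2 6379\n10.0.0.3 6379\n' | render_cluster_settings 20000 100mb
```

Values changed with CONFIG SET are lost on restart unless they are also written to the configuration file.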
How Netdata helps
- Netdata collects cluster_slots_pfail, cluster_slots_fail, and cluster_state from every node, letting you spot whether a PFAIL event is isolated to one observer or cluster-wide.
- Correlate cluster_slots_pfail spikes with redis.latest_fork_usec on the same node to identify fork-induced transient pauses.
- Cross-reference with redis.connected_slaves, redis.master_link_status, and replication offset lag to determine whether a PFAIL node is a master with healthy replicas ready for failover.
- Latency and CPU saturation metrics help distinguish a slow node from a completely dead one before FAIL is declared.