Redis cluster bus port blocked: the port+10000 firewall gotcha

CLUSTER INFO reports cluster_state:fail. Nodes show non-zero cluster_slots_pfail. Clients receive CLUSTERDOWN. Yet redis-cli -p 6379 PING returns PONG on every node, application connections are still accepted, and the client port shows no obvious network outage. The cluster behaves as if it is partitioned, but only the bus is broken.

The usual cause is that port 16379, or whatever your client port plus 10000 works out to, is missing from a firewall rule, security group, or container port mapping. The cluster bus carries gossip, failure detection, and node discovery over this separate TCP port. When the bus is unreachable, nodes cannot synchronize the cluster map, so they mark peers as failed and slots lose coverage even though the data port stays healthy. Because firewall rules often cover the client port but omit the bus port, this failure mode is common after infrastructure changes, node replacements, or environment migrations.

flowchart TD
    A[Bus port blocked] --> B[Heartbeat and gossip loss]
    B --> C[Nodes mark peers PFAIL]
    C --> D[Master quorum confirms FAIL]
    D --> E[Slots lose coverage]
    E --> F[cluster_state:fail]
    F --> G[CLUSTERDOWN to clients]

What this means

In Redis Cluster, every node binds a client command port (default 6379) and a cluster bus port (client port + 10000, so 16379 by default; Redis 7.0 and later can override this with the cluster-port setting). The bus carries heartbeat exchange, failure detection, configuration updates, failover authorization, and node discovery.
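As a small illustration of the port arithmetic, the sketch below derives the bus port from a client port. The function name bus_port is hypothetical, and it assumes the default +10000 offset rather than a custom bus port.

```shell
# Derive the default cluster bus port from a client port (client port + 10000).
# Assumption: the default offset is in use; a custom bus port is not handled.
bus_port() {
  local client_port="$1"
  local bus=$(( client_port + 10000 ))
  # TCP ports stop at 65535, so client ports above 55535 cannot use the offset.
  if [ "$bus" -gt 65535 ]; then
    echo "client port ${client_port} has no valid bus port at +10000" >&2
    return 1
  fi
  echo "$bus"
}

# Example: bus_port 7000 prints 17000
```

Feeding the result into firewall templates keeps the client and bus rules from drifting apart when a node runs on a non-default port.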

If a host firewall, cloud security group, Kubernetes NetworkPolicy, or missing container mapping blocks the bus port, nodes on opposite sides of the block lose gossip connectivity. They continue to serve client traffic, but stop trusting each other. Without heartbeats, a node suspects its peer is dead and marks it PFAIL. If a majority of masters report the same peer as PFAIL, the suspicion is promoted to FAIL. If no replica can take over, the failed node’s slots lose coverage, and with the default cluster-require-full-coverage yes, cluster_state transitions to fail and clients receive CLUSTERDOWN. The root cause is not a crashed node or a true network partition. It is a missing allow rule for port+10000.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Cloud security group or VPC firewall missing port+10000 | Cross-AZ or cross-subnet nodes cannot form a healthy cluster; new nodes hang during join | Security group ingress and egress rules for 16379 between all node private IPs |
| Host-level firewall (iptables, firewalld, nftables) | A single node or rack appears isolated; CLUSTER INFO on the isolated node shows fewer known nodes | Local firewall rules and ss -tlnp for the bus port |
| Kubernetes Service or NetworkPolicy exposing only port 6379 | Pods restart and the cluster never re-forms; only the client port is routable | Service spec ports and NetworkPolicy ingress rules covering the bus port |
| Container runtime port mapping only forwarding 6379 | Nodes on different Docker hosts cannot gossip, but single-host clusters work fine | Container port mappings and host firewall between container hosts |

Quick checks

Run these safe, read-only commands to confirm the bus port is the problem.

# Verify the client port responds while cluster state is broken
redis-cli -p 6379 PING

# Check cluster state and gossip counters
redis-cli CLUSTER INFO

# Inspect node topology and flags
redis-cli CLUSTER NODES

# Verify Redis is listening on the bus port
ss -tlnp | grep ':16379'

# Test bus port reachability from a peer using bash TCP (set PEER_IP first)
timeout 2 bash -c "</dev/tcp/${PEER_IP}/16379" && echo "open" || echo "closed or filtered"

# Review local firewall rules for the bus port
sudo iptables -L -n | grep 16379

If CLUSTER INFO shows cluster_state:fail while ss -tlnp confirms Redis is listening on 16379, the process itself is healthy and the problem is most likely on the network path between nodes.
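The rule of thumb above can be sketched as a small function that takes captured command output as text. The name looks_like_blocked_bus is hypothetical, and the sketch assumes the usual CLUSTER INFO "field:value" lines (which arrive with CRLF line endings) and standard ss -tln column layout.

```shell
# Return 0 when the symptoms match a filtered bus port: cluster_state is fail,
# yet something is listening locally on the bus port.
# $1: CLUSTER INFO text, $2: output of `ss -tln`, $3: bus port (default 16379)
looks_like_blocked_bus() {
  local info="$1" listeners="$2" port="${3:-16379}" state
  # CLUSTER INFO lines end in CRLF, so strip \r before comparing values.
  state=$(printf '%s\n' "$info" | tr -d '\r' |
    awk -F: '$1 == "cluster_state" { print $2 }')
  [ "$state" = "fail" ] || return 1
  # Require whitespace after the port so 16379 does not match 163790.
  printf '%s\n' "$listeners" | grep -Eq ":${port}[[:space:]]"
}

# Usage against a live node (not run here):
#   looks_like_blocked_bus "$(redis-cli -p 6379 CLUSTER INFO)" "$(ss -tln)" \
#     && echo "listening but cluster failed: check the network path"
```

A zero exit status only narrows the search to the network path; it does not prove which layer is filtering.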

How to diagnose it

  1. Collect cluster state from every node. Run redis-cli CLUSTER INFO on each node. Look for cluster_state:fail, non-zero cluster_slots_pfail, or non-zero cluster_slots_fail. Record cluster_stats_messages_sent and cluster_stats_messages_received.

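  2. Compare gossip counters. cluster_stats_messages_sent and cluster_stats_messages_received are node-wide totals, not per-peer counts, so sample them twice and compare the rate of change across nodes. If node A’s sent total keeps rising while node B’s received total is flat or grows far more slowly than on its peers, bus traffic is being dropped in one or both directions. This asymmetry is typical of a firewall block.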

  3. Inspect node topology. Run redis-cli CLUSTER NODES. Look for peers flagged fail? (PFAIL) or fail, and for a link-state of disconnected, which means the local node has no working bus link to that peer. If a node has been isolated long enough, it may be marked fail.

  4. Verify local binding. Run ss -tlnp | grep ':16379' to confirm the redis-server process is listening on the bus port. If the port is not bound, check whether the node was started without cluster-enabled yes or whether the port is already in use.

  5. Test peer reachability. From each node, test connectivity to every other node’s bus port using nc -z or the /dev/tcp bashism. If the client port (6379) connects but the bus port (16379) does not, the path is filtered.

  6. Audit all network layers. Check host-level firewalls (iptables, nftables, firewalld), cloud security groups, VPC network ACLs, and container network policies. The most common mistake is an allow rule for 6379 that omits 16379.

  7. Check for one-way rules. A firewall that allows outbound but not inbound on 16379, or vice versa, still breaks gossip. The bus port needs bidirectional reachability between every pair of nodes.
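The peer reachability test in step 5 can be looped over every node. The sketch below is one way to do that; the function names port_open and check_peers and the CLIENT_PORT variable are hypothetical, and it assumes bash (for /dev/tcp) and coreutils timeout are installed. Run it from each node in turn, since a one-way rule only shows up from one side.

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device in a child bash process.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Print one line per peer with client and bus port reachability.
# Peers are passed as arguments; CLIENT_PORT defaults to 6379.
check_peers() {
  local client_port="${CLIENT_PORT:-6379}"
  local bus_port=$(( client_port + 10000 ))
  local peer c b
  for peer in "$@"; do
    port_open "$peer" "$client_port" && c=open || c=closed
    port_open "$peer" "$bus_port" && b=open || b=closed
    if [ "$c" = open ] && [ "$b" = closed ]; then
      echo "$peer client=$c bus=$b  <- bus port filtered or not listening"
    else
      echo "$peer client=$c bus=$b"
    fi
  done
}

# Usage (example addresses): check_peers 10.0.0.2 10.0.0.3
```

A "closed" result covers both refused and silently dropped connections; the 2 second timeout is an arbitrary choice for this sketch.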

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| cluster_state | Binary indicator of cluster health | fail |
| cluster_slots_pfail | Slots on nodes suspected of failure | Non-zero and growing |
| cluster_slots_fail | Slots on confirmed-failed nodes | Non-zero |
| cluster_stats_messages_sent vs received | Asymmetry means gossip is one-way or dropped | Sent count rising while received is flat or zero |
| cluster_known_nodes | Whether the node sees the full topology | Fewer nodes than expected |
| connected_clients | Client port may still accept connections | Healthy count despite cluster failure |

A healthy client port alongside a failing cluster state strongly suggests the bus port is blocked rather than a total node failure.
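To watch the sent and received counters without a monitoring stack, one option is to diff two saved CLUSTER INFO snapshots. The sketch below does that; the function name gossip_delta and the snapshot file names are hypothetical, and it treats a missing field as 0.

```shell
# Print how much the bus message totals changed between two CLUSTER INFO
# snapshots saved to files. The counters are node-wide totals.
gossip_delta() {
  local before="$1" after="$2" field a b
  for field in cluster_stats_messages_sent cluster_stats_messages_received; do
    # Exact field match avoids per-type counters such as ..._ping_sent.
    a=$(tr -d '\r' < "$before" | awk -F: -v f="$field" '$1 == f { print $2 }')
    b=$(tr -d '\r' < "$after" | awk -F: -v f="$field" '$1 == f { print $2 }')
    echo "${field}_delta=$(( ${b:-0} - ${a:-0} ))"
  done
}

# Usage against a live node (not run here):
#   redis-cli -p 6379 CLUSTER INFO > t0.txt; sleep 30
#   redis-cli -p 6379 CLUSTER INFO > t1.txt
#   gossip_delta t0.txt t1.txt
```

A sent delta that keeps growing next to a received delta of zero on the same node is the pattern the table above describes.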

Fixes

Open the bus port in all firewalls

Allow TCP traffic on the bus port (default 16379) between all cluster node IPs. Update cloud security groups, VPC ACLs, host firewalls, and container port mappings to permit bidirectional traffic. You do not need to restart Redis for most firewall changes. The bus port does not need to be exposed to applications or the public internet, but every node must reach every other node’s bus port.
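For host firewalls managed with iptables, the allow rules might look like the output of the sketch below. It only prints commands so they can be reviewed before anyone runs them as root; the function name print_bus_rules and the peer addresses are hypothetical, and it assumes the default +10000 offset.

```shell
# Print (not apply) iptables rules that accept inbound client and bus traffic
# from each peer IP. Run on every node, with the other nodes as peers.
# Note: -A appends to the end of INPUT, so an earlier DROP or REJECT rule
# still wins; adjust the position to fit the existing ruleset.
print_bus_rules() {
  local client_port="$1"
  shift
  local bus_port=$(( client_port + 10000 ))
  local peer port
  for peer in "$@"; do
    for port in "$client_port" "$bus_port"; do
      echo "iptables -A INPUT -p tcp -s ${peer} --dport ${port} -j ACCEPT"
    done
  done
}

# Usage (example addresses): print_bus_rules 6379 10.0.0.2 10.0.0.3
```

Cloud security groups and Kubernetes NetworkPolicies need the equivalent pair of ports expressed in their own syntax.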

Rejoin isolated nodes

If a node was isolated long enough to be marked FAIL, opening the port may not automatically restore full cluster membership, so confirm with CLUSTER NODES that the flag clears once gossip resumes. If the node was removed from the topology during incident response (for example with CLUSTER FORGET), re-add it with CLUSTER MEET or redis-cli --cluster add-node, or restart the node after confirming full connectivity so it rejoins via the remaining nodes. Restarting a node is disruptive and will interrupt client connections.

Recover slot coverage

If the cluster entered fail state and slot assignments were altered during the incident, verify slot ownership with CLUSTER NODES. Correct any misassigned slots before returning the cluster to production traffic. In severe cases where multiple masters independently marked each other failed, restart the affected nodes one at a time after ensuring full bus connectivity. Restarting nodes is disruptive; do this during a maintenance window or with traffic rerouted.
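One way to check slot ownership is to total the slots held by masters that are not flagged fail and compare the result with 16384. The sketch below parses CLUSTER NODES text from stdin; the function name covered_slots is hypothetical, and it assumes the standard field order (flags in field 3, slot ranges from field 9 onward).

```shell
# Count slots owned by masters that are not flagged fail.
# Reads CLUSTER NODES output on stdin and prints a single number.
covered_slots() {
  tr -d '\r' | awk '
    # Flags are comma separated; match whole tokens so "fail?" (PFAIL)
    # is not treated as a confirmed failure.
    $3 ~ /(^|,)master(,|$)/ && $3 !~ /(^|,)fail(,|$)/ {
      for (i = 9; i <= NF; i++) {
        if ($i ~ /^\[/) continue    # skip migrating or importing markers
        n = split($i, r, "-")
        total += ((n == 2) ? (r[2] - r[1] + 1) : 1)
      }
    }
    END { print total + 0 }'
}

# Usage against a live node (not run here):
#   redis-cli -p 6379 CLUSTER NODES | covered_slots
```

A total below 16384 means some slots have no healthy owner in that node's view; run it against several nodes, since views can differ while the bus is impaired.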

Prevention

  • Infrastructure-as-code checklist. Every Redis Cluster node provisioning template must open both the client port and the bus port (port+10000) in all security layers. A single omission in one security group or container mapping is enough to break the cluster.

  • Node bootstrap verification. Before marking a new node as ready, confirm it is listening on the bus port with ss -tlnp | grep ':16379' and verify connectivity from an existing node to the new node’s bus port.

  • Monitor gossip asymmetry. Alert when cluster_stats_messages_sent grows while cluster_stats_messages_received stays flat. This catches one-way firewall rules, asymmetric network policies, or packet loss before they escalate to cluster_state:fail.

  • Track topology size. Monitor cluster_known_nodes during scaling events. A count that does not increase, or that drops back shortly after a new node joins, suggests the bus port is not open to or from the new member.
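The topology-size check can be scripted as a simple threshold test. The sketch below reads CLUSTER INFO text from stdin and compares cluster_known_nodes with an expected count; the function name check_known_nodes and the exit codes are choices made for this example, not part of Redis.

```shell
# Compare cluster_known_nodes with an expected node count.
# Reads CLUSTER INFO text on stdin.
# Exit status: 0 = OK, 1 = fewer nodes than expected, 2 = field missing.
check_known_nodes() {
  local expected="$1" known
  known=$(tr -d '\r' | awk -F: '$1 == "cluster_known_nodes" { print $2 }')
  if [ -z "$known" ]; then
    echo "UNKNOWN cluster_known_nodes not found"
    return 2
  fi
  if [ "$known" -lt "$expected" ]; then
    echo "WARN known=${known} expected=${expected}"
    return 1
  fi
  echo "OK known=${known} expected=${expected}"
}

# Usage against a live node (not run here):
#   redis-cli -p 6379 CLUSTER INFO | check_known_nodes 6
```

Running it on every node after a scaling event, rather than on one, helps catch a member that only some peers can see.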

How Netdata helps

  • Track cluster_state, cluster_slots_pfail, and cluster_slots_fail to correlate cluster-wide failures with the exact moment slot coverage dropped.
  • Surface cluster_stats_messages_sent and cluster_stats_messages_received to detect gossip asymmetry without relying on manual CLUSTER INFO sampling.
  • Correlate healthy connected_clients with failing cluster state to distinguish a bus port block from a total node outage.
  • Monitor cluster_known_nodes per node to detect topology drift before it becomes a full partition.