Cassandra native transport not running: node UP in gossip but refusing CQL clients

A node reports UN in nodetool status but rejects CQL connections on port 9042. Gossip and replication are healthy; the failure is isolated to the native transport layer.

Because the node remains in the token ring, it continues to handle internode replication, gossip, and streaming. Applications see it as down; the cluster sees it as up. The JMX attribute NativeTransportRunning on org.apache.cassandra.db:type=StorageService is false while gossip heartbeats continue. The usual triggers are nodetool disablebinary left active after maintenance, or a firewall blocking TCP 9042.

flowchart TD
    A[CQL clients refused on 9042] --> B{nodetool status}
    B -->|UJ| C[Bootstrap: wait for streaming]
    B -->|DS| D[Drained: restart required]
    B -->|UN| E{nodetool statusbinary}
    E -->|not running| F{Maintenance window?}
    E -->|running| H{Port 9042 reachable?}
    F -->|yes| G[nodetool enablebinary]
    F -->|no| K[Check system.log for BindException]
    H -->|no| I[Firewall or security group]
    H -->|yes| J[Check configured port, TLS, and driver state]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| nodetool disablebinary left active after maintenance | Node UN; nodetool statusbinary returns not running; ops logs show recent maintenance | nodetool statusbinary and maintenance calendar |
| Firewall, security group, or host firewall blocks 9042 | Node UN; statusbinary reports running locally; clients time out from application subnet | nc -vz <node-ip> 9042 from a client host |
| Node joining the cluster | Node shows UJ; native transport binds after bootstrap streaming completes | nodetool status for UJ state |
| nodetool drain confusion | drain stops native transport and gossip; node shows DS and requires restart | nodetool status to confirm state letter |

Quick checks

# Is native transport enabled?
nodetool statusbinary

# Confirm state in nodetool info
nodetool info | grep "Native Transport"

# Gossip state: UN, UJ, DS, etc.
nodetool status

# CQL port reachability from the client subnet
nc -vz <node-ip> 9042

# Local port binding (run as root or the cassandra user to see PIDs)
ss -tlnp | grep 9042

# Configured native transport port
grep -E "^native_transport_port:" /etc/cassandra/cassandra.yaml

# If client encryption is required, check the SSL port
grep -E "^native_transport_port_ssl:" /etc/cassandra/cassandra.yaml

# Connected client count
nodetool clientstats

# Recent native transport stop/start activity in logs (exact messages vary by version)
grep -iE "disablebinary|enablebinary|native transport|listening for CQL clients" /var/log/cassandra/system.log

# Streaming progress on new nodes
nodetool netstats
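
The port checks above hard-code 9042, which gives a false negative on a node with a non-default port. A minimal sketch of reading the configured value first, assuming the stock single-line `native_transport_port: <n>` layout in cassandra.yaml; the function name configured_cql_port is hypothetical:

```shell
# Hypothetical helper: print the configured native transport port,
# falling back to the default 9042 when the key is absent or commented out.
# Usage: configured_cql_port [path-to-cassandra.yaml]
configured_cql_port() {
    local yaml="${1:-/etc/cassandra/cassandra.yaml}" port

    # Match only an uncommented key at the start of a line; the second
    # whitespace-separated field is the port, so trailing comments are ignored.
    port="$(awk '/^native_transport_port:/ { print $2; exit }' "$yaml")"

    echo "${port:-9042}"
}
```

The result can replace the literal port in the other checks, for example `nc -vz <node-ip> "$(configured_cql_port)"`.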

How to diagnose it

  1. Confirm gossip state. Run nodetool status. DS means the node was drained and needs a restart before accepting CQL. UJ means it is bootstrapping; native transport starts only after streaming finishes and the state transitions to NORMAL. UN means the node is in the ring but the binary interface is off.

  2. Check native transport directly. Run nodetool statusbinary and nodetool info | grep "Native Transport". On a traffic-bearing node, statusbinary should report running and nodetool info should show Native Transport active: true. If either reports otherwise, Cassandra is not accepting CQL connections regardless of gossip health.

  3. Distinguish maintenance from incident. Review system.log for disablebinary, enablebinary, Native transport service stopped, or Stopping native transport. Check your maintenance calendar. If nodetool disablebinary was run without a matching enablebinary, the fix is simply to re-enable it. This is the most common cause of a UN node refusing clients.

  4. Verify the network path. Even when native transport is running, a firewall, security group, or iptables rule can block 9042 between the application and the node. Run nc -vz <node-ip> 9042 from an application host. If this fails while ss -tlnp on the Cassandra node shows 9042 in LISTEN, the block is external.

  5. Validate the configured port. If the node uses a non-default native_transport_port in cassandra.yaml, clients and load balancers may target the wrong port. The default is 9042. If client encryption is mandatory, check whether native_transport_port_ssl is set (it is unset by default; the stock cassandra.yaml shows 9142 as a commented-out example) and ensure clients are not attempting plaintext connections to a TLS-only port.

  6. Assess bootstrap progress. If the node is new or replaced, check nodetool netstats. If streaming is active, the node is UJ and intentionally delays native transport until bootstrap finishes. Do not force enablebinary during bootstrap.

  7. Check for port conflicts. In system.log, look for BindException on port 9042. If another process has bound the port, Cassandra native transport will fail to start. This is rare but can occur during failed restarts or container port collisions.
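
Steps 1 and 2 above can be sketched as a small triage function. This is a minimal sketch, assuming nodetool is on PATH and that nodetool status prints the two-letter code and the address as the first two columns of each node line; the function name triage_native_transport and its messages are hypothetical:

```shell
# Hypothetical helper: classify why a node is not serving CQL.
# Usage: triage_native_transport <node-ip>
triage_native_transport() {
    local node_ip="$1" state binary

    # Step 1: two-letter status/state code for this node from nodetool status.
    state="$(nodetool status | awk -v ip="$node_ip" '$2 == ip { print $1; exit }')"

    case "$state" in
        UJ) echo "bootstrapping: wait for streaming to finish"; return 0 ;;
        DS) echo "drained: restart the Cassandra process"; return 0 ;;
        UN) ;;  # in the ring; continue to the transport check
        *)  echo "unexpected state '${state}': inspect nodetool status"; return 1 ;;
    esac

    # Step 2: is the native transport itself running?
    binary="$(nodetool statusbinary)"
    if [ "$binary" != "running" ]; then
        echo "native transport off: check for a leftover disablebinary"
        return 0
    fi

    echo "native transport running: check the network path to 9042"
}
```

The function stops at the point where a human decision is needed; it deliberately does not run enablebinary on its own.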

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| NativeTransportRunning (JMX: StorageService) | Direct boolean for CQL transport state | false while node is UN |
| nodetool statusbinary | Operator-facing transport state | Returns not running on a production node |
| connectedNativeClients (JMX: Client) | Active CQL session count | Sudden drop to zero on a traffic-bearing node |
| Gossip state (nodetool status) | Separates bootstrap/drain from binary-only issues | UJ or DS explains transport unavailability |
| Port 9042 connectivity | Separates Cassandra state from network policy | Refused or timeout from client subnet |
| Client request timeouts / unavailables | Direct client impact | Spikes correlate with transport downtime |
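
The transport signal can be wrapped in a check that a scheduler or monitoring agent runs on each node. A minimal sketch, assuming the common 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN) exit-code convention; the function name and the MAINTENANCE_FLAG path are hypothetical, standing in for however nodes are tagged for maintenance in your environment:

```shell
# Hypothetical flag file that operators touch while a node is in maintenance.
MAINTENANCE_FLAG="${MAINTENANCE_FLAG:-/var/run/cassandra-maintenance}"

# Hypothetical check: report native transport state with monitoring-style exit codes.
check_native_transport() {
    local binary

    # A nodetool failure (for example, JMX unreachable) is reported
    # separately from a transport that is cleanly "not running".
    if ! binary="$(nodetool statusbinary 2>/dev/null)"; then
        echo "UNKNOWN: nodetool statusbinary failed"
        return 3
    fi

    if [ "$binary" = "running" ]; then
        echo "OK: native transport running"
        return 0
    fi

    # Downgrade to a warning when the node is tagged for maintenance.
    if [ -e "$MAINTENANCE_FLAG" ]; then
        echo "WARNING: native transport ${binary} (maintenance flag present)"
        return 1
    fi

    echo "CRITICAL: native transport ${binary}"
    return 2
}
```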

Fixes

Re-enable native transport after maintenance

If nodetool statusbinary reports not running and the node was taken out of client rotation intentionally:

nodetool enablebinary

Verify:

nodetool statusbinary
nodetool clientstats

No restart is required. If multiple nodes in the same rack or replica set were disabled, re-enable them one at a time and verify client connections recover before proceeding. Re-enabling many nodes simultaneously can cause a reconnect storm.

If clients continue to time out after enablebinary, the driver may have temporarily blacklisted the node. Wait for the driver’s reconnection window, or restart the application connection pool if the driver does not retry the node automatically.
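
The one-at-a-time re-enable can be sketched as a loop. This is a minimal sketch, assuming remote JMX access so that nodetool -h <host> works from the operator host; the function name rolling_enablebinary and the SETTLE_SECONDS pause are hypothetical, and the pause is only a stand-in for a real check that client connections have recovered:

```shell
# Assumed pause between nodes that gives drivers time to reconnect.
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"

# Hypothetical rolling re-enable: one node at a time, stop on the first failure.
# Usage: rolling_enablebinary <host> [<host> ...]
rolling_enablebinary() {
    local host attempt
    for host in "$@"; do
        if ! nodetool -h "$host" enablebinary; then
            echo "${host}: enablebinary failed" >&2
            return 1
        fi

        # Poll until the transport reports running (up to 10 attempts).
        attempt=0
        until [ "$(nodetool -h "$host" statusbinary)" = "running" ]; do
            attempt=$((attempt + 1))
            if [ "$attempt" -ge 10 ]; then
                echo "${host}: native transport still not running" >&2
                return 1
            fi
            sleep 1
        done

        echo "${host}: native transport running"
        sleep "$SETTLE_SECONDS"
    done
}
```

Stopping at the first failure keeps the remaining nodes untouched, so a problem on one node does not turn into a cluster-wide reconnect storm.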

Unblock port 9042

If the transport is running locally but clients cannot connect:

  • Cloud security groups: Allow ingress TCP 9042 from the client subnet.
  • Host firewall (iptables/ufw/firewalld): Add an allow rule for 9042.
  • Container networking: Verify the container port is mapped and the CNI policy permits the connection.

Restrict the source to your client subnet. Unless client_encryption_options is enabled, native transport traffic is unencrypted, and exposing it broadly is a security risk.
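
As an illustration of source-restricted rules for the host firewalls listed above, the sketch below prints (rather than runs) an allow rule so it can be reviewed first. The function name print_cql_allow_rule is hypothetical, and the rule placement (for example, iptables chain order) is an assumption that depends on your existing ruleset:

```shell
# Hypothetical helper: print an allow rule for CQL traffic from one client subnet.
# Usage: print_cql_allow_rule <iptables|ufw|firewalld> <client-cidr> [port]
print_cql_allow_rule() {
    local backend="$1" cidr="$2" port="${3:-9042}"
    case "$backend" in
        iptables)
            echo "iptables -I INPUT -p tcp -s ${cidr} --dport ${port} -j ACCEPT" ;;
        ufw)
            echo "ufw allow from ${cidr} to any port ${port} proto tcp" ;;
        firewalld)
            echo "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"${cidr}\" port port=\"${port}\" protocol=\"tcp\" accept'" ;;
        *)
            echo "unknown backend: ${backend}" >&2
            return 1 ;;
    esac
}
```

For example, `print_cql_allow_rule ufw 10.0.1.0/24` prints `ufw allow from 10.0.1.0/24 to any port 9042 proto tcp`. A firewalld rule added with --permanent takes effect only after `firewall-cmd --reload`.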

Wait for bootstrap or restart after drain

If the node is UJ, do not force enablebinary. Native transport binds after bootstrap completes by design. Monitor nodetool netstats until streaming finishes and the state transitions to UN.

If the node is DS after nodetool drain, enablebinary will not restore service. The node requires a full Cassandra process restart.
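
Waiting for bootstrap can be sketched as a polling loop over the operation mode that nodetool netstats prints on its "Mode:" line. This is a minimal sketch; the function name wait_for_bootstrap and the POLL_SECONDS interval are hypothetical:

```shell
# Assumed polling interval; bootstrap streaming can take hours on large nodes.
POLL_SECONDS="${POLL_SECONDS:-60}"

# Hypothetical helper: block until the node leaves JOINING, then report the outcome.
wait_for_bootstrap() {
    local mode
    while true; do
        # Extract the operation mode, e.g. "Mode: JOINING" -> "JOINING".
        mode="$(nodetool netstats | awk '/^Mode:/ { print $2; exit }')"
        case "$mode" in
            JOINING)
                echo "still streaming; checking again in ${POLL_SECONDS}s"
                sleep "$POLL_SECONDS" ;;
            NORMAL)
                echo "bootstrap complete"
                return 0 ;;
            DRAINED)
                echo "node is drained; restart the Cassandra process"
                return 1 ;;
            *)
                echo "unexpected mode '${mode}'" >&2
                return 1 ;;
        esac
    done
}
```

The loop only observes; it never calls enablebinary, in line with the guidance not to force the transport on during bootstrap.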

Prevention

  • Document runbooks clearly. Distinguish nodetool disablebinary (reversible, gossip stays UP) from nodetool drain (requires restart, shows DS). Never use them interchangeably. Post-maintenance verification must include nodetool statusbinary.

  • Monitor native transport state alongside gossip. Alerting only on nodetool status misses this failure mode. Include NativeTransportRunning or nodetool statusbinary in your availability checks, and suppress alerts on connectedNativeClients = 0 when a node is tagged for maintenance.

  • Verify 9042 end-to-end. Run periodic connectivity checks from the client subnet, not just localhost, to catch firewall drift before applications fail. A node can pass all local health checks while a security group change blocks remote clients.
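
A periodic end-to-end check from the client subnet can be sketched with bash's /dev/tcp redirection and timeout(1), which also separates a refused connection from a timed-out one. This is a minimal sketch, assuming bash with network redirections enabled and GNU coreutils timeout; the function name probe_cql_port is hypothetical, and the firewall interpretation in the messages is a heuristic, not a certainty:

```shell
# Hypothetical probe: distinguish "refused" from "timed out" for a CQL port.
# Usage: probe_cql_port <host> [port] [timeout-seconds]
probe_cql_port() {
    local host="$1" port="${2:-9042}" wait="${3:-3}" rc=0

    # /dev/tcp/<host>/<port> opens a TCP connection; timeout bounds the wait
    # and exits with status 124 when the limit is reached.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?

    case "$rc" in
        0)   echo "open: ${host}:${port} accepts TCP connections" ;;
        124) echo "timeout: ${host}:${port} (packets likely dropped by a firewall)" ;;
        *)   echo "refused: ${host}:${port} (nothing listening, or a reject rule)" ;;
    esac
    return "$rc"
}
```

A successful TCP connect only proves the port is reachable; it does not confirm that the node completes a CQL handshake.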

How Netdata helps

  • Correlates NativeTransportRunning = false with gossip UP state to surface nodes that look healthy to the cluster but are unreachable by clients.
  • Tracks connectedNativeClients per node to detect sudden disconnection events that precede application timeouts.
  • Charts client request rates and errors against transport state changes, distinguishing binary transport outages from quorum loss or GC pauses.
  • Alerts on zero connected clients for nodes that historically carry traffic, catching disablebinary or firewall issues before downstream services degrade.