Cassandra native transport not running: node UP in gossip but refusing CQL clients
A node reports UN in nodetool status but rejects CQL connections on port 9042. Gossip and replication are healthy; the failure is isolated to the native transport layer.
Because the node remains in the token ring, it continues to handle internode replication, gossip, and streaming. Applications see it as down; the cluster sees it as up. When the transport itself is off, the JMX attribute NativeTransportRunning on org.apache.cassandra.db:type=StorageService is false while gossip heartbeats continue. The usual trigger is nodetool disablebinary left active after maintenance. A firewall blocking TCP 9042 produces the same client-side symptom, but in that case the transport still reports running on the node.
flowchart TD
A[CQL connections refused on 9042] --> B{nodetool status}
B -->|UJ| C[Bootstrap: wait for streaming]
B -->|DS| D[Drained: restart required]
B -->|UN| E{nodetool statusbinary}
E -->|not running| F{Maintenance window?}
F -->|yes| G[nodetool enablebinary]
F -->|no| H{Port 9042 reachable?}
H -->|no| I[Firewall or security group]
H -->|yes| J[Investigate RPC state]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| nodetool disablebinary left active after maintenance | Node UN; nodetool statusbinary returns not running; ops logs show recent maintenance | nodetool statusbinary and maintenance calendar |
| Firewall, security group, or host firewall blocks 9042 | Node UN; statusbinary reports running locally; clients time out from the application subnet | nc -vz <node-ip> 9042 from a client host |
| Node joining the cluster | Node shows UJ; native transport binds after bootstrap streaming completes | nodetool status for UJ state |
| nodetool drain mistaken for disablebinary | drain stops native transport and gossip; node shows DS and requires restart | nodetool status to confirm state letter |
Quick checks
# Is native transport enabled?
nodetool statusbinary
# Confirm state in nodetool info
nodetool info | grep "Native Transport"
# Gossip state: UN, UJ, DS, etc.
nodetool status
# CQL port reachability from the client subnet
nc -vz <node-ip> 9042
# Local port binding (run as root or the cassandra user to see PIDs)
ss -tlnp | grep 9042
# Configured native transport port
grep -E "^native_transport_port:" /etc/cassandra/cassandra.yaml
# If client encryption is required, check the SSL port
grep -E "^native_transport_port_ssl:" /etc/cassandra/cassandra.yaml
# Connected client count
nodetool clientstats
# Recent disablebinary/enablebinary activity in logs
grep -E "disablebinary|enablebinary" /var/log/cassandra/system.log
# Streaming progress on new nodes
nodetool netstats
How to diagnose it
1. Confirm gossip state. Run nodetool status. DS means the node was drained and needs a restart before accepting CQL. UJ means it is bootstrapping; native transport starts only after streaming finishes and the state transitions to NORMAL. UN means the node is in the ring, so the problem lies in the binary interface or the network path.
2. Check native transport directly. Run nodetool statusbinary and nodetool info | grep "Native Transport". On a traffic-bearing node, statusbinary should report running and nodetool info should show Native Transport active: true. If they report otherwise, Cassandra is not accepting CQL connections regardless of gossip health.
3. Distinguish maintenance from incident. Review system.log for disablebinary, enablebinary, Native transport service stopped, or Stopping native transport. Check your maintenance calendar. If nodetool disablebinary was run without a matching enablebinary, the fix is simply to re-enable it. This is the most common cause of a UN node refusing clients.
4. Verify the network path. Even when native transport is running, a firewall, security group, or iptables rule can block 9042 between the application and the node. Run nc -vz <node-ip> 9042 from an application host. If this fails while ss -tlnp on the Cassandra node shows 9042 in LISTEN, the block is external.
5. Validate the configured port. If the node uses a non-default native_transport_port in cassandra.yaml, clients and load balancers may target the wrong port. The default is 9042. If client encryption is mandatory, check whether native_transport_port_ssl is set (conventionally 9142; it is unset unless configured) and ensure clients are not trying to connect in plaintext.
6. Assess bootstrap progress. If the node is new or replaced, check nodetool netstats. If streaming is active, the node is UJ and intentionally delays native transport until bootstrap finishes. Do not force enablebinary during bootstrap.
7. Check for port conflicts. In system.log, look for BindException on port 9042. If another process has bound the port, Cassandra native transport will fail to start. This is rare but can occur during failed restarts or container port collisions.
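The first steps of this procedure can be condensed into a small triage helper. The following is a minimal sketch, not an official tool: node_state and triage_cql are hypothetical names, the messages are illustrative, and it assumes nodetool status prints the two-letter state in the first column and the address in the second.

```bash
#!/usr/bin/env bash
# Sketch only: node_state and triage_cql are hypothetical helpers, not nodetool features.

# Read `nodetool status` output on stdin and print the two-letter state
# (UN, UJ, ...) for the given address. Assumes state is column 1, address column 2.
node_state() {
  awk -v ip="$1" '$2 == ip { print $1; exit }'
}

# Args: <state> <statusbinary output> <port reachable from a client host: yes|no>
# Prints the suggested next action based on the diagnosis steps in this guide.
triage_cql() {
  local state="$1" binary="$2" reachable="$3"
  case "$state" in
    UJ) echo "wait: bootstrap streaming in progress" ;;
    DS) echo "restart: node was drained" ;;
    UN)
      if [ "$binary" != "running" ]; then
        echo "enablebinary: native transport is off"
      elif [ "$reachable" != "yes" ]; then
        echo "network: check firewall or security group"
      else
        echo "investigate: transport up and port reachable"
      fi
      ;;
    *) echo "unknown: state '$state' not covered here" ;;
  esac
}
```

On a node with address 10.0.0.5 (an example value), it could be called as triage_cql "$(nodetool status | node_state 10.0.0.5)" "$(nodetool statusbinary)" yes. Before acting on an enablebinary result, confirm in system.log and the maintenance calendar that the transport was disabled deliberately and did not fail to bind.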
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| NativeTransportRunning (JMX: StorageService) | Direct boolean for CQL transport state | false while node is UN |
| nodetool statusbinary | Operator-facing transport state | Returns not running on a production node |
| connectedNativeClients (JMX: Client) | Active CQL session count | Sudden drop to zero on a traffic-bearing node |
| Gossip state (nodetool status) | Separates bootstrap/drain from binary-only issues | UJ or DS explains transport unavailability |
| Port 9042 connectivity | Separates Cassandra state from network policy | Refused or timeout from client subnet |
| Client request timeouts / unavailables | Direct client impact | Spikes correlate with transport downtime |
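The first and fourth signals in this table can be combined into a single check that flags the mismatch described in this guide. The sketch below is an assumption-laden illustration: transport_alert is a hypothetical function, the exit codes follow the common 0/1/2 (OK/warning/critical) plugin convention, and the maintenance flag is something your own tooling would have to supply.

```bash
#!/usr/bin/env bash
# Sketch only: transport_alert is a hypothetical helper.
# Args: <gossip state> <statusbinary output> [maintenance: yes|no]
# Exit codes: 0 = OK, 1 = warning, 2 = critical.
transport_alert() {
  local state="$1" binary="$2" maintenance="${3:-no}"
  if [ "$binary" = "running" ]; then
    echo "OK: native transport running (state $state)"
    return 0
  fi
  if [ "$maintenance" = "yes" ]; then
    echo "OK: native transport off during tagged maintenance"
    return 0
  fi
  if [ "$state" = "UN" ]; then
    # The failure mode in this guide: healthy in gossip, closed to clients.
    echo "CRITICAL: node is UN but native transport is $binary"
    return 2
  fi
  # UJ or DS explains the missing transport; surface it without paging.
  echo "WARNING: native transport is $binary (state $state)"
  return 1
}
```

The state and transport values would come from nodetool status and nodetool statusbinary, or from the equivalent JMX attributes if your collector reads them directly.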
Fixes
Re-enable native transport after maintenance
If nodetool statusbinary reports not running and the node was taken out of client rotation intentionally:
nodetool enablebinary
Verify:
nodetool statusbinary
nodetool clientstats
No restart is required. If multiple nodes in the same rack or replica set were disabled, re-enable them one at a time and verify client connections recover before proceeding. Re-enabling many nodes simultaneously can cause a reconnect storm.
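One way to script the one-at-a-time approach is sketched below. It assumes remote JMX access is available so that nodetool -h <host> works from the machine running the script; rolling_enablebinary, the NODETOOL override, and the pause length are illustrative choices, not part of Cassandra.

```bash
#!/usr/bin/env bash
# Sketch only: rolling_enablebinary is a hypothetical helper.
# NODETOOL can be overridden, for example to point at a wrapper or a dry-run stub.
NODETOOL="${NODETOOL:-nodetool}"

# Args: <pause seconds> <host> [<host> ...]
# Re-enables native transport on each host in turn and stops at the first failure.
rolling_enablebinary() {
  local pause="$1" host
  shift
  for host in "$@"; do
    if ! "$NODETOOL" -h "$host" enablebinary; then
      echo "enablebinary failed on $host" >&2
      return 1
    fi
    # Do not move on to the next node unless this one reports running.
    if [ "$("$NODETOOL" -h "$host" statusbinary)" != "running" ]; then
      echo "native transport still not running on $host" >&2
      return 1
    fi
    echo "re-enabled $host"
    sleep "$pause"   # give drivers time to reconnect before the next node
  done
}
```

For example, rolling_enablebinary 60 10.0.0.1 10.0.0.2 would pause a minute between nodes (addresses and pause are placeholders). The script only confirms the transport state; checking nodetool clientstats for returning clients remains a manual step here.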
If clients continue to time out after enablebinary, the driver may have temporarily blacklisted the node. Wait for the driver’s reconnection window, or restart the application connection pool if the driver does not retry the node automatically.
Unblock port 9042
If the transport is running locally but clients cannot connect:
- Cloud security groups: Allow ingress TCP 9042 from the client subnet.
- Host firewall (iptables/ufw/firewalld): Add an allow rule for 9042.
- Container networking: Verify the container port is mapped and the CNI policy permits the connection.
Restrict the source to your client subnet. Unless client_encryption_options is enabled, native transport traffic is unencrypted, and exposing it broadly is a security risk.
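For the host firewall case, a source-restricted iptables rule might look like the sketch below. The helper name allow_cql_from, the IPTABLES override, and the example subnet are assumptions for illustration; ufw and firewalld use different syntax.

```bash
#!/usr/bin/env bash
# Sketch only: allow_cql_from is a hypothetical wrapper around a standard iptables rule.
# IPTABLES can be overridden to preview the arguments without touching the firewall.
IPTABLES="${IPTABLES:-iptables}"

# Args: <client subnet in CIDR form> [port, default 9042]
# Appends an ACCEPT rule for CQL traffic from that subnet only.
allow_cql_from() {
  local subnet="$1" port="${2:-9042}"
  "$IPTABLES" -A INPUT -p tcp -s "$subnet" --dport "$port" -j ACCEPT
}
```

Running allow_cql_from 10.0.1.0/24 as root would append the rule for that example subnet. Because -A appends to the end of the chain, an earlier DROP or REJECT rule would still win; in that case the rule needs to be inserted at a suitable position instead, and persisted with whatever mechanism the host uses.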
Wait for bootstrap or restart after drain
If the node is UJ, do not force enablebinary. Native transport binds after bootstrap completes by design. Monitor nodetool netstats until streaming finishes and the state transitions to UN.
If the node is DS after nodetool drain, enablebinary will not restore service. The node requires a full Cassandra process restart.
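Waiting for a joining node can be automated with a simple polling loop. This is a rough sketch under stated assumptions: wait_for_un and node_state are hypothetical helpers, nodetool status is assumed to print the state in the first column and the address in the second, and the attempt count and interval are arbitrary.

```bash
#!/usr/bin/env bash
# Sketch only: wait_for_un and node_state are hypothetical helpers.
NODETOOL="${NODETOOL:-nodetool}"   # overridable for testing

# Print the two-letter state for one address from `nodetool status` output on stdin.
node_state() {
  awk -v ip="$1" '$2 == ip { print $1; exit }'
}

# Args: <node address> <max polls> <seconds between polls>
# Returns 0 once the node shows UN, 1 if it never does within the poll budget.
wait_for_un() {
  local ip="$1" attempts="$2" interval="$3" state="" i=1
  while [ "$i" -le "$attempts" ]; do
    state="$("$NODETOOL" status | node_state "$ip")"
    if [ "$state" = "UN" ]; then
      echo "UN after $i poll(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still '${state:-unknown}' after $attempts poll(s)" >&2
  return 1
}
```

A call such as wait_for_un 10.0.0.7 120 30 would poll every 30 seconds for up to an hour (values are placeholders). The loop only observes state; it deliberately never calls enablebinary, and it will not help a DS node, which needs a process restart.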
Prevention
- Document runbooks clearly. Distinguish nodetool disablebinary (reversible, gossip stays UP) from nodetool drain (requires restart, shows DS). Never use them interchangeably. Post-maintenance verification must include nodetool statusbinary.
- Monitor native transport state alongside gossip. Alerting only on nodetool status misses this failure mode. Include NativeTransportRunning or nodetool statusbinary in your availability checks, and suppress alerts on connectedNativeClients = 0 when a node is tagged for maintenance.
- Verify 9042 end-to-end. Run periodic connectivity checks from the client subnet, not just localhost, to catch firewall drift before applications fail. A node can pass all local health checks while a security group change blocks remote clients.
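A periodic end-to-end check from a client host could be as small as the sketch below. It swaps nc for bash's built-in /dev/tcp redirection so the behaviour does not depend on which netcat variant is installed; check_cql_port is a hypothetical name, and the 3-second timeout (via coreutils timeout) is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch only: check_cql_port is a hypothetical helper meant to run on a client host.

# Args: <port> <host> [<host> ...]
# Prints "<host> open" or "<host> unreachable" and returns 1 if any host failed.
check_cql_port() {
  local port="$1" host rc=0
  shift
  for host in "$@"; do
    # Opening /dev/tcp/<host>/<port> in bash attempts a TCP connection.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
      echo "$host open"
    else
      echo "$host unreachable"
      rc=1
    fi
  done
  return "$rc"
}
```

Scheduled from the application subnet, for example as check_cql_port 9042 10.0.0.1 10.0.0.2, a non-zero exit can feed whatever alerting you already use. The check does not distinguish a refused connection from a timeout, so a failure still needs the manual nc -vz and ss -tlnp comparison to locate the block.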
How Netdata helps
- Correlates NativeTransportRunning = false with gossip UP state to surface nodes that look healthy to the cluster but are unreachable by clients.
- Tracks connectedNativeClients per node to detect sudden disconnection events that precede application timeouts.
- Charts client request rates and errors against transport state changes, distinguishing binary transport outages from quorum loss or GC pauses.
- Alerts on zero connected clients for nodes that historically carry traffic, catching disablebinary or firewall issues before downstream services degrade.
Related guides
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC
- Cassandra monitoring checklist: the signals every production cluster needs
- Cassandra monitoring maturity model: from survival to expert







