Cassandra monitoring checklist: the signals every production cluster needs

The four maturity levels below are cumulative. Do not instrument level 2 until every level 1 signal is visible and has an alert.

Cassandra’s peer-to-peer architecture and LSM storage engine create failure modes that generic infrastructure monitoring misses. A node can be UP in gossip and accepting CQL connections while dropping mutations or accumulating compaction debt that only surfaces hours later. The signals below cover liveness, performance, saturation, and consistency failures: GC death spirals, compaction avalanches, tombstone storms, and silent data divergence.

Level 1 is survival. Levels 2 and 3 add diagnostics and leading indicators. Level 4 is for capacity planning and post-incident analysis.

flowchart TD
    L4["Level 4: Expert<br/>Runway projections, per-partition telemetry, heap-after-GC trends"]
    L3["Level 3: Mature<br/>Repair tracking, tombstone density, off-heap, speculative retries"]
    L2["Level 2: Operational<br/>Latency p99, compaction, thread pools, GC pauses, hints"]
    L1["Level 1: Survival<br/>UN status, native transport, disk space, heap, dropped messages"]

    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1 - Survival

These six signals tell you whether the node is alive, reachable, and has room to breathe. They require only nodetool, filesystem checks, and JMX. If any of them fire, the cluster is failing or is one event away from stopping writes, so check them first during an incident. A node that is DN in nodetool status but still responds to ping is likely stuck in a GC thrash or a disk hang. Native transport stopped while gossip is UP usually means a CQL port bind failure or excessive heap pressure.

Signal	What to watch	Threshold
Node liveness	`nodetool status` state letters	Any `DN` sustained for more than 5 minutes
Native transport active	`NativeTransportRunning` JMX attribute	`false` while the node is UP in gossip
Disk space	Free space on data and commitlog volumes	Less than 50 percent free for STCS; less than 30 percent for LCS or TWCS
JVM heap pressure	`HeapMemoryUsage` used versus max	Greater than 75 percent of max sustained
Dropped messages	`DroppedMessage` rate for `MUTATION` and `READ`	Non-zero rate sustained for more than 60 seconds
Storage exceptions	`StorageExceptions` counter	Any non-zero rate sustained for more than 30 seconds
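
The node liveness row combines two steps: reading the state letters from `nodetool status` and alerting only when a bad state is sustained. The following is a minimal Python sketch of both, not a definitive implementation. The names `parse_node_states` and `SustainedCondition` and the sample output are illustrative assumptions; real output has more columns and may vary by Cassandra version.

```python
import re
from typing import Dict, Optional

# A node row starts with a two-letter state code followed by the address.
# First letter: U(p) or D(own). Second: N(ormal), L(eaving), J(oining), M(oving).
STATUS_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_node_states(nodetool_output: str) -> Dict[str, str]:
    """Map each node address to its state code, e.g. {'10.0.0.1': 'UN'}."""
    states = {}
    for line in nodetool_output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


class SustainedCondition:
    """Fires only after a condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self.since: Optional[float] = None  # when the condition first became true

    def update(self, active: bool, now: float) -> bool:
        if not active:
            self.since = None  # any recovery resets the timer
            return False
        if self.since is None:
            self.since = now
        return now - self.since > self.hold_seconds


# Hypothetical, abbreviated `nodetool status` output used only for illustration.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1  1.2 GiB   16      ?     aaaa     rack1
DN  10.0.0.2  1.1 GiB   16      ?     bbbb     rack1
"""

states = parse_node_states(SAMPLE)
down = [addr for addr, state in states.items() if state.startswith("D")]
```

In use, one `SustainedCondition(hold_seconds=300)` per node would be updated on every poll with whether that node is currently `DN`. The same helper fits the other "sustained for more than N seconds" thresholds in the tables.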

Level 2 - Operational

These signals explain why a node is struggling. Coordinator latency, error rates, and compaction state turn “the node is slow” into actionable diagnostics. Thread pool backpressure and disk I/O latency separate CPU problems from storage problems. Check these before tuning caches or adding capacity. P99 latency approaching read_request_timeout_in_ms means requests are about to fail outright. Blocked thread pools indicate backpressure failure and imminent dropped messages.

Signal	What to watch	Threshold
Client request latency (coordinator)	`ClientRequest` Read and Write latency p99	Sustained elevation greater than 3 times baseline or approaching `read_request_timeout_in_ms` (reads) or `write_request_timeout_in_ms` (writes)
Client timeouts	`ClientRequest` Read and Write `Timeouts` rate	Greater than zero sustained for more than 60 seconds
Client unavailables	`ClientRequest` Read and Write `Unavailables` rate	More than 5 in 5 minutes and more than 0.1 percent of requests
Compaction pending	`Compaction` `PendingTasks` gauge	Trending upward over 4 or more hours
SSTable count	`LiveSSTableCount` per table	Greater than 50 for STCS; L0 greater than 32 for LCS
Thread pool saturation	`ThreadPools` pending and blocked in ReadStage, MutationStage, Native-Transport-Requests, MemtableFlushWriter	Pending greater than zero sustained for more than 60 seconds; any blocked count
GC overhead	`G1 Old Generation` `CollectionTime` rate for GC overhead; GC logs for max pause	Overhead greater than 5 percent of wall clock; max pause greater than 2 seconds in logs
Disk I/O latency	`iostat` await on data and commitlog devices	SSD await greater than 10 ms sustained; HDD await greater than 50 ms sustained
Hinted handoff	Hints directory size; `TotalHintsInProgress`	Hints present when all nodes should be UP
File descriptor usage	`OpenFileDescriptorCount` versus limit	Greater than 80 percent of ulimit
Request throughput	`ClientRequest` rate	Sudden drop or spike greater than 3 times baseline
Schema agreement	`SchemaVersions` map size	More than one version sustained for more than 5 minutes
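
Two of the thresholds above are simple arithmetic over sampled values. The sketch below shows one way to compute GC overhead from two readings of the cumulative `CollectionTime` counter, and to evaluate the p99 latency rule. The function names, the `GcSample` type, and the 80 percent margin used to interpret "approaching" the timeout are assumptions made for illustration, not values from this checklist.

```python
from dataclasses import dataclass


@dataclass
class GcSample:
    wall_clock_ms: float       # when the sample was taken
    collection_time_ms: float  # cumulative CollectionTime read from the GC MBean


def gc_overhead(prev: GcSample, curr: GcSample) -> float:
    """Fraction of wall-clock time spent in GC between two samples."""
    elapsed = curr.wall_clock_ms - prev.wall_clock_ms
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    spent = curr.collection_time_ms - prev.collection_time_ms
    # A counter that moves backwards suggests a JVM restart; report zero.
    return max(spent, 0.0) / elapsed


def p99_alert(p99_ms: float, baseline_ms: float, timeout_ms: float,
              timeout_margin: float = 0.8) -> bool:
    """True if p99 exceeds 3x baseline or nears the request timeout.

    timeout_margin is an assumed reading of "approaching": 80% of the timeout.
    """
    return p99_ms > 3 * baseline_ms or p99_ms >= timeout_margin * timeout_ms


# 3.6 s of collection time over a 60 s window is 6 percent overhead.
overhead = gc_overhead(GcSample(0, 1_000), GcSample(60_000, 4_600))
gc_alert = overhead > 0.05
```

Computing overhead from counter deltas rather than from a single reading matters because `CollectionTime` only ever accumulates; the raw value says nothing about the current rate.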

Level 3 - Mature

These are leading indicators. Repair tracking prevents data resurrection. Tombstone density and bloom filter accuracy catch data model degradation before it becomes an outage. Off-heap monitoring closes the gap that causes OOM kills despite a “healthy” JVM heap. Tombstone scans above 1,000 per read point to a query that is missing a filter or to a flawed TTL design. A bloom filter false positive ratio well above the table's bloom_filter_fp_chance means the filter is undersized or the data distribution has changed.

Signal	What to watch	Threshold
Local read and write latency	`ReadLatency` and `WriteLatency` per table	Node latency greater than 2 times the cluster median
Tombstones per read	`TombstoneScannedHistogram`; system log warnings	Sustained scans greater than 1,000 tombstones
Bloom filter false positives	`BloomFilterFalseRatio` per table	Greater than 2 times configured `bloom_filter_fp_chance`
Memtable flush pressure	`MemtableFlushWriter` pending; `MemtableSwitchCount` rate	Pending greater than zero sustained; flush rate diverging from write rate
Commitlog pressure	`CommitLog` `PendingTasks`; segment count	`PendingTasks` greater than zero; `WaitingOnSegmentAllocation` greater than zero
Off-heap memory	RSS minus JVM heap; `BloomFilterOffHeapMemoryUsed`	Total process RSS greater than 80 percent of system RAM
Read repair rate	`ReadRepairRequests` per table	Spike greater than 5 times baseline without recent maintenance
Speculative retries	`SpeculativeRetries` per table	Greater than 10 percent of reads
Repair tracking	`system_distributed.repair_history`	Last repair older than 80 percent of `gc_grace_seconds`
Prepared statement evictions	`PreparedStatementsEvicted`	Non-zero rate sustained
Streaming progress	`nodetool netstats`; streaming bytes	Stalled for more than 30 minutes or failed sessions
Cross-DC latency	Internode round-trip between datacenters	Baseline deviation without traffic change
Compaction throughput	`BytesCompacted` versus bytes flushed	Flush rate exceeds compaction throughput by more than
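
The repair tracking threshold in the table can be expressed as a small check. The idea is that tombstones become eligible for purging once they are older than `gc_grace_seconds`, so a replica that missed a delete and is not repaired within that window can bring the deleted data back. This is a sketch under stated assumptions: `repair_overdue` is a hypothetical helper, and the timestamp of the last successful repair is assumed to have already been obtained, for example from `system_distributed.repair_history`.

```python
from datetime import datetime, timezone


def repair_overdue(last_repair: datetime, gc_grace_seconds: int,
                   now: datetime, fraction: float = 0.8) -> bool:
    """True if the last repair is older than `fraction` of gc_grace_seconds."""
    age_seconds = (now - last_repair).total_seconds()
    return age_seconds > fraction * gc_grace_seconds


GC_GRACE = 864_000  # 10 days, Cassandra's default gc_grace_seconds
now = datetime(2024, 1, 10, tzinfo=timezone.utc)

# Repaired 9 days ago: past the 8-day (80 percent) mark, so it is overdue.
stale = repair_overdue(datetime(2024, 1, 1, tzinfo=timezone.utc), GC_GRACE, now)
# Repaired 5 days ago: still inside the window.
fresh = repair_overdue(datetime(2024, 1, 5, tzinfo=timezone.utc), GC_GRACE, now)
```

Alerting at 80 percent rather than 100 percent leaves time to run and finish a repair before the window closes.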

The Netdata solution

Cassandra monitoring with Netdata

Netdata monitors Apache Cassandra with per-second metrics and automatic dashboards. Correlate GC pauses, compaction backlog, tombstone rates, pending hints, and disk usage across nodes to catch a creeping cluster before it tips over.
