Cassandra monitoring checklist: the signals every production cluster needs
The four maturity levels below are cumulative. Do not instrument Level 2 until every Level 1 signal is visible and has an alert.
Cassandra’s peer-to-peer architecture and LSM storage engine create failure modes that generic infrastructure monitoring misses. A node can be UP in gossip and accepting CQL connections while dropping mutations or accumulating compaction debt that only surfaces hours later. The signals below cover liveness, performance, saturation, and consistency failures, including GC death spirals, compaction avalanches, tombstone storms, and silent data divergence.
Level 1 is survival. Levels 2 and 3 add diagnostics and leading indicators. Level 4 is for capacity planning and post-incident analysis.
flowchart TD
L4["Level 4: Expert
Runway projections, per-partition telemetry, heap-after-GC trends"]
L3["Level 3: Mature
Repair tracking, tombstone density, off-heap, speculative retries"]
L2["Level 2: Operational
Latency p99, compaction, thread pools, GC pauses, hints"]
L1["Level 1: Survival
UN status, native transport, disk space, heap, dropped messages, storage exceptions"]
L1 --> L2
L2 --> L3
L3 --> L4

Level 1 - Survival
These six signals tell you whether the node is alive, reachable, and has room to breathe. They require only nodetool, filesystem checks, and JMX. If any of them fires, the cluster is failing or one event away from stopping writes. Check these first during an incident. A node that is DN in nodetool status but still answers pings is likely thrashing in GC or blocked on a hung disk. Native transport stopped while the node is UP in gossip usually means a CQL port bind failure or excessive heap pressure.
| Signal | What to watch | Threshold |
|---|---|---|
| Node liveness | nodetool status state letters | Any DN sustained for more than 5 minutes |
| Native transport active | NativeTransportRunning JMX attribute | false while the node is UP in gossip |
| Disk space | Free space on data and commitlog volumes | Less than 50 percent free for STCS; less than 30 percent for LCS or TWCS |
| JVM heap pressure | HeapMemoryUsage used versus max | Greater than 75 percent of max sustained |
| Dropped messages | DroppedMessage rate for MUTATION and READ | Non-zero rate sustained for more than 60 seconds |
| Storage exceptions | StorageExceptions counter | Any non-zero rate sustained for more than 30 seconds |
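
As a rough illustration of how two of the Level 1 rows could be evaluated, the sketch below parses `nodetool status` text for nodes whose status letter is D and applies the free-space fractions from the table. The function names, the `MIN_FREE` mapping, and the sample output (shortened, with made-up addresses and host IDs) are assumptions for illustration. It does not implement the "sustained for more than 5 minutes" condition, which would need repeated samples over time.

```python
import re

# Rows in `nodetool status` start with a two-letter code:
# first letter U/D (up/down), second N/L/J/M (normal/leaving/joining/moving).
_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")

# Minimum free-space fraction per compaction strategy, taken from the table.
MIN_FREE = {"STCS": 0.50, "LCS": 0.30, "TWCS": 0.30}


def down_nodes(status_text):
    """Return the addresses of nodes whose status letter is D (down)."""
    down = []
    for line in status_text.splitlines():
        match = _ROW.match(line.strip())
        if match and match.group(1) == "D":
            down.append(match.group(3))
    return down


def disk_space_alert(free_bytes, total_bytes, strategy):
    """True when the free fraction is below the strategy's minimum."""
    return free_bytes / total_bytes < MIN_FREE[strategy]


# Hypothetical, abbreviated sample of `nodetool status` output.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB   16      ?     aaaa     rack1
DN  10.0.0.2   1.1 GiB   16      ?     bbbb     rack1
"""

print(down_nodes(SAMPLE))
print(disk_space_alert(40, 100, "STCS"))
```

In a real collector, the free and total byte counts could come from `shutil.disk_usage` on the data and commitlog mount points, and the status text from running `nodetool status` on a schedule.
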
Level 2 - Operational
These signals explain why a node is struggling. Coordinator latency, error rates, and compaction state turn “the node is slow” into actionable diagnostics. Thread pool backpressure and disk I/O latency separate CPU problems from storage problems. Check these before tuning caches or adding capacity. P99 read latency approaching read_request_timeout_in_ms (or write latency approaching write_request_timeout_in_ms) means requests are about to fail outright. Blocked thread pools indicate that backpressure has failed and dropped messages are imminent.
| Signal | What to watch | Threshold |
|---|---|---|
| Client request latency (coordinator) | ClientRequest Read and Write latency p99 | Sustained elevation greater than 3 times baseline, or approaching read_request_timeout_in_ms (reads) or write_request_timeout_in_ms (writes) |
| Client timeouts | ClientRequest Read and Write Timeouts rate | Greater than zero sustained for more than 60 seconds |
| Client unavailables | ClientRequest Read and Write Unavailables rate | More than 5 in 5 minutes and more than 0.1 percent of requests |
| Compaction pending | Compaction PendingTasks gauge | Trending upward over 4 or more hours |
| SSTable count | LiveSSTableCount per table | Greater than 50 for STCS; L0 greater than 32 for LCS |
| Thread pool saturation | ThreadPools pending and blocked in ReadStage, MutationStage, Native-Transport-Requests, MemtableFlushWriter | Pending greater than zero sustained for more than 60 seconds; any blocked count |
| GC overhead | G1 Old Generation CollectionTime rate for GC overhead; GC logs for max pause | Overhead greater than 5 percent of wall clock; max pause greater than 2 seconds in logs |
| Disk I/O latency | iostat await on data and commitlog devices | SSD await greater than 10 ms sustained; HDD await greater than 50 ms sustained |
| Hinted handoff | Hints directory size; TotalHintsInProgress | Hints present when all nodes should be UP |
| File descriptor usage | OpenFileDescriptorCount versus limit | Greater than 80 percent of ulimit |
| Request throughput | ClientRequest rate | Sudden drop or spike greater than 3 times baseline |
| Schema agreement | SchemaVersions map size | More than one version sustained for more than 5 minutes |
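
Two of the Level 2 thresholds are simple arithmetic, and a minimal sketch may make them concrete. GC overhead is derived from two samples of the cumulative `CollectionTime` counter (milliseconds) divided by the wall-clock interval between them. The latency check combines the "3 times baseline" rule with a timeout margin. The function names and the 0.8 margin used to define "approaching" the timeout are assumptions, not values from the checklist.

```python
def gc_overhead(prev_collection_ms, curr_collection_ms, interval_ms):
    """Fraction of wall-clock time spent in GC between two counter samples.

    CollectionTime is cumulative, so the delta between samples is the
    time spent collecting during the interval.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return (curr_collection_ms - prev_collection_ms) / interval_ms


def latency_alert(p99_ms, baseline_ms, timeout_ms, timeout_margin=0.8):
    """True when p99 exceeds 3x baseline or nears the request timeout.

    timeout_margin is an assumed definition of "approaching": the alert
    fires once p99 reaches that fraction of the configured timeout.
    """
    return p99_ms > 3 * baseline_ms or p99_ms >= timeout_margin * timeout_ms


# Two samples taken 60 seconds apart: 3,600 ms of GC in 60,000 ms.
overhead = gc_overhead(10_000, 13_600, 60_000)
print(f"GC overhead: {overhead:.1%}", "ALERT" if overhead > 0.05 else "ok")

# p99 of 4,200 ms against a 5,000 ms timeout (the default read timeout).
print(latency_alert(p99_ms=4200, baseline_ms=2000, timeout_ms=5000))
```

Both functions describe a single evaluation. The "sustained" qualifier in the table still has to be enforced by the alerting layer, for example by requiring several consecutive breaches.
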
Level 3 - Mature
These are leading indicators. Repair tracking prevents data resurrection. Tombstone density and bloom filter accuracy catch data model degradation before it becomes an outage. Off-heap monitoring closes the gap that causes OOM kills despite a “healthy” JVM heap. Tombstone scans above 1,000 per read indicate a missing filter or a TTL design flaw. A bloom filter false-positive ratio well above the configured bloom_filter_fp_chance means the filter is undersized or the data distribution has changed.
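
The repair-tracking rule is worth spelling out: tombstones become eligible for purging after gc_grace_seconds, so a replica that missed a delete and is not repaired within that window can bring the deleted data back. A minimal sketch of the check, assuming the last successful repair time has already been extracted (for example from system_distributed.repair_history) and using the checklist's 80 percent threshold. The function name and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days


def repair_overdue(last_repair, now,
                   gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS,
                   fraction=0.8):
    """True when the last repair is older than `fraction` of gc_grace_seconds."""
    age_seconds = (now - last_repair).total_seconds()
    return age_seconds > fraction * gc_grace_seconds


now = datetime(2024, 1, 11, tzinfo=timezone.utc)
print(repair_overdue(now - timedelta(days=7), now))  # 7 days old, limit is 8
print(repair_overdue(now - timedelta(days=9), now))  # 9 days old, limit is 8
```

Because gc_grace_seconds is set per table, a real check would read the value for each table rather than rely on the default.
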
| Signal | What to watch | Threshold |
|---|---|---|
| Local read and write latency | ReadLatency and WriteLatency per table | Node deviates greater than 2 times from cluster median |
| Tombstones per read | TombstoneScannedHistogram; system log warnings | Sustained scans greater than 1,000 tombstones |
| Bloom filter false positives | BloomFilterFalseRatio per table | Greater than 2 times configured bloom_filter_fp_chance |
| Memtable flush pressure | MemtableFlushWriter pending; MemtableSwitchCount rate | Pending greater than zero sustained; flush rate diverging from write rate |
| Commitlog pressure | CommitLog PendingTasks; segment count | PendingTasks greater than zero; WaitingOnSegmentAllocation greater than zero |
| Off-heap memory | RSS minus JVM heap; BloomFilterOffHeapMemoryUsed | Total process RSS greater than 80 percent of system RAM |
| Read repair rate | ReadRepairRequests per table | Spike greater than 5 times baseline without recent maintenance |
| Speculative retries | SpeculativeRetries per table | Greater than 10 percent of reads |
| Repair tracking | system_distributed.repair_history | Last repair older than 80 percent of gc_grace_seconds |
| Prepared statement evictions | PreparedStatementsEvicted | Non-zero rate sustained |
| Streaming progress | nodetool netstats; streaming bytes | Stalled for more than 30 minutes or failed sessions |
| Cross-DC latency | Internode round-trip between datacenters | Baseline deviation without traffic change |
| Compaction throughput | BytesCompacted versus bytes flushed | Flush rate exceeds compaction throughput by more than |







