Cassandra monitoring checklist: the signals every production cluster needs

The four maturity levels below are cumulative. Do not instrument Level 2 until Level 1 is visible and has alerts configured.

Cassandra’s peer-to-peer architecture and LSM storage engine create failure modes generic infrastructure monitoring misses. A node can be UP in gossip and accepting CQL connections while dropping mutations or accumulating compaction debt that only surfaces hours later. These signals catch liveness, performance, saturation, and consistency failures: GC death spirals, compaction avalanches, tombstone storms, and silent data divergence.

Level 1 is survival. Levels 2 and 3 add diagnostics and leading indicators. Level 4 is for capacity planning and post-incident analysis.

flowchart TD
    L4["Level 4: Expert<br/>Runway projections, per-partition telemetry, heap-after-GC trends"]
    L3["Level 3: Mature<br/>Repair tracking, tombstone density, off-heap, speculative retries"]
    L2["Level 2: Operational<br/>Latency p99, compaction, thread pools, GC pauses, hints"]
    L1["Level 1: Survival<br/>UN status, native transport, disk space, heap, dropped messages"]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1 - Survival

These six signals tell you if the node is alive, reachable, and has room to breathe. They require only nodetool, filesystem checks, and JMX. If any fire, the cluster is failing or one event away from stopping writes. Check these first during an incident. A node that is DN in nodetool status but still pings is likely in GC thrash or disk hang. Native transport stopped while gossip is UP usually means CQL port bind failure or excessive heap pressure.

| Signal | What to watch | Threshold |
| --- | --- | --- |
| Node liveness | nodetool status state letters | Any DN sustained for more than 5 minutes |
| Native transport active | NativeTransportRunning JMX attribute | false while the node is UP in gossip |
| Disk space | Free space on data and commitlog volumes | Less than 50 percent free for STCS; less than 30 percent for LCS or TWCS |
| JVM heap pressure | HeapMemoryUsage used versus max | Greater than 75 percent of max sustained |
| Dropped messages | DroppedMessage rate for MUTATION and READ | Non-zero rate sustained for more than 60 seconds |
| Storage exceptions | StorageExceptions counter | Any non-zero rate sustained for more than 30 seconds |
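
The node-liveness check can be scripted against `nodetool status` output. The following is a minimal sketch, assuming the usual two-letter status/state column at the start of each node row. The helper name `down_nodes` and the sample output (addresses, loads, host IDs) are made up for illustration. The 5-minute "sustained" window is left to whatever alerting layer consumes the result.

```python
import re

# Data rows of `nodetool status` start with two letters, e.g. "UN  10.0.0.1 ...".
# First letter: U(p) or D(own). Second: N(ormal), L(eaving), J(oining), M(oving).
_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def down_nodes(status_output: str) -> list[str]:
    """Return the addresses of nodes whose status letter is D (down)."""
    down = []
    for line in status_output.splitlines():
        match = _ROW.match(line.strip())
        if match and match.group(1) == "D":
            down.append(match.group(3))
    return down


# Illustrative sample only; real output varies with cluster and version.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.1   1.21 GiB   16      66.7%             11111111-1111-1111-1111-111111111111  rack1
DN  10.0.0.2   1.18 GiB   16      66.6%             22222222-2222-2222-2222-222222222222  rack1
UN  10.0.0.3   1.25 GiB   16      66.7%             33333333-3333-3333-3333-333333333333  rack1
"""

print(down_nodes(SAMPLE))
```

In practice the input would come from running `nodetool status` on a node, for example via `subprocess.run(["nodetool", "status"], capture_output=True, text=True)`. The sketch reports any node with status D, which is slightly broader than the DN state named in the table.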

Level 2 - Operational

These signals explain why a node is struggling. Coordinator latency, error rates, and compaction state turn “the node is slow” into actionable diagnostics. Thread pool backpressure and disk I/O latency separate CPU problems from storage problems. Check these before tuning caches or adding capacity. P99 latency approaching the request timeout (read_request_timeout_in_ms for reads, write_request_timeout_in_ms for writes) means requests are about to fail outright. Blocked thread pools indicate backpressure failure and imminent dropped messages.

| Signal | What to watch | Threshold |
| --- | --- | --- |
| Client request latency (coordinator) | ClientRequest Read and Write latency p99 | Sustained elevation greater than 3 times baseline or approaching read_request_timeout_in_ms |
| Client timeouts | ClientRequest Read and Write Timeouts rate | Greater than zero sustained for more than 60 seconds |
| Client unavailables | ClientRequest Read and Write Unavailables rate | Greater than 5 in 5 minutes with rate greater than 0.1 percent of requests |
| Compaction pending | Compaction PendingTasks gauge | Trending upward over 4 or more hours |
| SSTable count | LiveSSTableCount per table | Greater than 50 for STCS; L0 greater than 32 for LCS |
| Thread pool saturation | ThreadPools pending and blocked in ReadStage, MutationStage, Native-Transport-Requests, MemtableFlushWriter | Pending greater than zero sustained for more than 60 seconds; any blocked count |
| GC overhead | G1 Old Generation CollectionTime rate for GC overhead; GC logs for max pause | Overhead greater than 5 percent of wall clock; max pause greater than 2 seconds in logs |
| Disk I/O latency | iostat await on data and commitlog devices | SSD await greater than 10 ms sustained; HDD await greater than 50 ms sustained |
| Hinted handoff | Hints directory size; TotalHintsInProgress | Hints present when all nodes should be UP |
| File descriptor usage | OpenFileDescriptorCount versus limit | Greater than 80 percent of ulimit |
| Request throughput | ClientRequest rate | Sudden drop or spike greater than 3 times baseline |
| Schema agreement | SchemaVersions map size | More than one version sustained for more than 5 minutes |
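
Several thresholds above only fire when a condition is "sustained for more than 60 seconds", which avoids paging on a single bad sample. One way to express that rule is a small state tracker, sketched below. The class name `SustainedCondition`, the 15-second scrape interval, and the sample timeout rates are all assumptions for illustration, not part of any Cassandra or monitoring API.

```python
from typing import Optional


class SustainedCondition:
    """Fires only after a condition has held continuously for more than hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._since: Optional[float] = None  # timestamp when the current breach began

    def update(self, breached: bool, now: float) -> bool:
        """Record one sample; return True if the breach has lasted long enough."""
        if not breached:
            self._since = None  # any healthy sample resets the clock
            return False
        if self._since is None:
            self._since = now
        return now - self._since > self.hold_seconds


# Hypothetical (timestamp_seconds, timeouts_per_second) samples, 15 s apart.
samples = [(0, 0.0), (15, 0.4), (30, 0.2), (45, 0.1),
           (60, 0.3), (75, 0.5), (90, 0.2), (105, 0.0)]

timeouts = SustainedCondition(hold_seconds=60)
fired = [t for t, rate in samples if timeouts.update(rate > 0, now=t)]
print(fired)
```

With these samples the breach starts at t=15, so the alert first fires at t=90, once more than 60 seconds have elapsed, and the healthy sample at t=105 resets it. Most alerting systems offer this behavior natively (for example, a "for" duration on an alert rule), in which case the tracker is unnecessary.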

Level 3 - Mature

These are leading indicators. Repair tracking prevents data resurrection. Tombstone density and bloom filter accuracy catch data model degradation before it becomes an outage. Off-heap monitoring closes the gap that causes OOM kills despite a “healthy” JVM heap. Tombstone scans above 1,000 per read indicate a missing filter or TTL design flaw. Bloom filter false positives above the table’s configured bloom_filter_fp_chance mean the filter is undersized or data distribution has changed.
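
Repair tracking matters because tombstones become eligible for purging once gc_grace_seconds has passed. If a replica missed a delete and is not repaired within that window, the deleted data can reappear. The check below is a minimal sketch of the repair-age rule, using the 80 percent margin from this checklist's repair-tracking threshold. The function name `repair_overdue` and its parameters are hypothetical, and fetching the last successful repair time (for instance from system_distributed.repair_history) is left out.

```python
from datetime import datetime, timedelta, timezone


def repair_overdue(last_repair: datetime, gc_grace_seconds: int,
                   now: datetime, fraction: float = 0.8) -> bool:
    """True when the last repair is older than fraction * gc_grace_seconds."""
    age_seconds = (now - last_repair).total_seconds()
    return age_seconds > fraction * gc_grace_seconds


now = datetime(2024, 1, 11, tzinfo=timezone.utc)  # fixed time so the example is repeatable
gc_grace = 864000  # Cassandra's default gc_grace_seconds (10 days)

# 80 percent of 10 days is 8 days.
print(repair_overdue(now - timedelta(days=7), gc_grace, now))  # 7 days old: within margin
print(repair_overdue(now - timedelta(days=9), gc_grace, now))  # 9 days old: overdue
```

Because gc_grace_seconds is a per-table setting, a real check would evaluate each table against its own value rather than a single cluster-wide number.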

| Signal | What to watch | Threshold |
| --- | --- | --- |
| Local read and write latency | ReadLatency and WriteLatency per table | Node deviates greater than 2 times from cluster median |
| Tombstones per read | TombstoneScannedHistogram; system log warnings | Sustained scans greater than 1,000 tombstones |
| Bloom filter false positives | BloomFilterFalseRatio per table | Greater than 2 times configured bloom_filter_fp_chance |
| Memtable flush pressure | MemtableFlushWriter pending; MemtableSwitchCount rate | Pending greater than zero sustained; flush rate diverging from write rate |
| Commitlog pressure | CommitLog PendingTasks; segment count | PendingTasks greater than zero; WaitingOnSegmentAllocation greater than zero |
| Off-heap memory | RSS minus JVM heap; BloomFilterOffHeapMemoryUsed | Total process RSS greater than 80 percent of system RAM |
| Read repair rate | ReadRepairRequests per table | Spike greater than 5 times baseline without recent maintenance |
| Speculative retries | SpeculativeRetries per table | Greater than 10 percent of reads |
| Repair tracking | system_distributed.repair_history | Last repair older than 80 percent of gc_grace_seconds |
| Prepared statement evictions | PreparedStatementsEvicted | Non-zero rate sustained |
| Streaming progress | nodetool netstats; streaming bytes | Stalled for more than 30 minutes or failed sessions |
| Cross-DC latency | Internode round-trip between datacenters | Baseline deviation without traffic change |
| Compaction throughput | BytesCompacted versus bytes flushed | Flush rate exceeds compaction throughput by more than |
