Cassandra monitoring maturity model: from survival to expert
Cassandra exposes JMX MBeans, virtual tables, and log signals. Without prioritization, teams either miss slow-building problems such as compaction debt or drown in noise. This model structures production monitoring into four cumulative levels. Each level adds signals that reduce mean time to detection and catch the failures that dominate Cassandra incidents: data resurrection from missed repair, compaction death spirals, and GC-induced gossip flapping.
Audit your current instrumentation against these levels. See the Cassandra monitoring checklist for a condensed signal inventory.
Expert signals are only interpretable when foundational health is already visible. Eliminate survival and operational blind spots before tuning advanced alerts.
flowchart TD
L1[Level 1 Survival]
L2[Level 2 Operational]
L3[Level 3 Mature]
L4[Level 4 Expert]
L1 --> L2
L2 --> L3
L3 --> L4
Level 1: Survival
Level 1 answers one question: is the node alive and does it have disk space? These are binary checks that require no Cassandra-specific tooling beyond nodetool status and basic OS commands. If any fail, the cluster is either unavailable or at immediate risk of write blockage.
- Process and port liveness. Verify the JVM process is present with pgrep -f CassandraDaemon or systemctl status cassandra. Also confirm the CQL native transport port (default 9042) is listening with ss -tlnp | grep 9042. A live process that has closed the port is a zombie node. nodetool can hang under GC pressure, so prefer OS-level checks for survival paging.
- Node status UP/DOWN. A DN state in nodetool status means gossip has marked the node unreachable and can lead to quorum loss. Run this from a peer if the local node is unresponsive. The JMX FailureDetector MBean (org.apache.cassandra.net:type=FailureDetector) exposes the same state.
- Disk space remaining. Check filesystem free space on data and commitlog volumes with df -h. Cassandra needs headroom for compaction. Running out of space halts flushes and blocks writes.
- Basic client error rate. Watch driver metrics for WriteTimeoutException, ReadTimeoutException, and UnavailableException. Coarse external signal is sufficient at this tier.
Level 2: Operational
Level 2 shifts from binary liveness to client-visible performance and resource health. The goal is to detect overload, backpressure, and coordination failures before they cause node flapping or data loss. All Level 1 signals remain relevant.
- Client request latency. Track coordinator-level P99 read and write latency via JMX (org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency). Sustained elevation above baseline indicates compaction, GC, or slow replica issues. Compare coordinator latency against replica latency (org.apache.cassandra.metrics:type=Table,keyspace=*,scope=*,name=ReadLatency) to isolate local versus remote slowdown. Use nodetool proxyhistograms only for ad hoc inspection; it is too expensive for polling.
- Client request throughput. Baseline read and write request rates from the Count attribute on ClientRequest latency MBeans. Treat Count as a monotonically increasing counter and compute the rate. Sudden drops suggest client failures; unexpected spikes may indicate retry storms.
- Client request timeouts and unavailables. Monitor the Timeouts and Unavailables meters under ClientRequest (org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts). Distinguishing slow replicas from insufficient live replicas determines the correct incident response.
- Dropped messages. Compute the rate from org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped and equivalent scopes such as READ and RANGE_SLICE. Any sustained nonzero rate means the node is shedding load. Dropped mutations risk replica inconsistency.
- JVM heap usage and GC pause times. Keep heap used after an old GC below roughly 75% of max, and pauses under 2 seconds. Longer pauses disrupt gossip and can trigger phi failure detection.
- Pending compactions. Monitor org.apache.cassandra.metrics:type=Compaction,name=PendingTasks. A monotonically increasing count over hours signals compaction debt that leads to read amplification and disk space growth.
- Node topology match. Confirm nodetool status reports the expected number of nodes in UN state. An unexpected count reveals a stuck decommission or split-brain partition.
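Two of the signals above depend on how raw samples are interpreted: Count and Dropped are cumulative counters that must be turned into rates, and pending compactions matter when they keep growing rather than when they spike. The sketch below illustrates both calculations on samples you have already collected. The function names and the min_samples default are hypothetical, and the choice to skip an interval when the counter goes backwards (for example after a node restart) is an assumption, not something Cassandra prescribes.

```python
from typing import Optional, Sequence


def counter_rate(prev_count: int, prev_ts: float,
                 curr_count: int, curr_ts: float) -> Optional[float]:
    """Per-second rate between two samples of a monotonically increasing counter.

    Returns None when the counter went backwards or no time has elapsed,
    so the caller can skip that interval instead of plotting a bogus value.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0 or curr_count < prev_count:
        return None
    return (curr_count - prev_count) / elapsed


def is_steadily_growing(samples: Sequence[int], min_samples: int = 6) -> bool:
    """True if the samples never decrease and end higher than they started."""
    if len(samples) < min_samples:
        return False
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and samples[-1] > samples[0]
```

With hourly PendingTasks samples, is_steadily_growing flags the "monotonically increasing over hours" pattern while ignoring a backlog that drains between samples.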
Level 3: Mature
Level 3 adds per-table and per-subsystem granularity. At this stage you can distinguish a hot partition from a cluster-wide problem, detect tombstone accumulation before queries abort, and confirm that repair is actually finishing. All Level 2 signals remain relevant.
- Per-table SSTable count and latency. Correlate per-table LiveSSTableCount via JMX or nodetool tablestats with latency histograms from nodetool tablehistograms. Growing SSTable counts increase read amplification and degrade query performance.
- Thread pool pending and blocked tasks. Watch Pending and Blocked counts in nodetool tpstats for stages such as ReadStage and MutationStage. Sustained pending means the node cannot keep up; blocked tasks indicate the queue is full.
- Hinted handoff status and delivery rate. Check JMX hints metrics and the size of the hints directory (default /var/lib/cassandra/hints/, but confirm hints_directory in cassandra.yaml). Hints accumulating while all nodes are healthy indicate replica flapping or network partitions.
- Key and row cache hit rates. Track key cache hit rate via JMX (org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate) or nodetool info. A declining rate below 85% on read-heavy workloads suggests the working set has outgrown capacity. Leave row cache off unless you have verified the heap cost.
- Commitlog pending tasks. Monitor org.apache.cassandra.metrics:type=CommitLog,name=PendingTasks. Sustained nonzero values indicate commitlog device saturation or slow memtable flushes that prevent segment recycling.
- Tombstone scan warnings. Watch system.log for lines containing “Read * live rows and * tombstone cells” that exceed tombstone_warn_threshold (default 1000). Queries that scan excessive tombstones degrade read performance and abort at tombstone_failure_threshold (default 100000).
- Repair completion tracking. Verify repairs complete within gc_grace_seconds. Query system_distributed.repair_history for completed ranges and final status. Use nodetool repair_admin list for scheduled or incremental repair status where applicable. Missing repair windows risk resurrection of deleted data and silent data divergence.
- Disk I/O per-device. Monitor %util, await, and queue depth with iostat -x on separate data and commitlog devices. Saturation on either path degrades write durability or read latency.
- File descriptor usage. Monitor OpenFileDescriptorCount against MaxFileDescriptorCount via java.lang:type=OperatingSystem. Each SSTable opens multiple file handles; approaching the ulimit prevents new SSTables and connections.
- Schema agreement. Run nodetool describecluster and confirm exactly one schema UUID. Disagreement blocks DDL and may indicate a partitioned or stuck node.
- Storage exceptions. Watch the StorageExceptions counter via JMX. Any nonzero rate indicates disk or filesystem integrity issues that require immediate investigation.
- Streaming session status. Check nodetool netstats for active bootstrap, decommission, or repair streams. Failed or stalled sessions leave the topology in an incomplete state.
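The tombstone scan warning above lends itself to simple log parsing. The following sketch extracts the live-row and tombstone counts from lines matching the quoted pattern and keeps those above a threshold. The function name is hypothetical, and the exact wording around the matched fragment varies between Cassandra versions, so the regular expression anchors only on the phrase quoted in the list.

```python
import re
from typing import Iterable, List, Tuple

# Matches the fragment quoted in the tombstone bullet; the rest of the log
# line (query text, token, threshold hint) is deliberately ignored.
TOMBSTONE_LINE = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")


def tombstone_scans(lines: Iterable[str],
                    warn_threshold: int = 1000) -> List[Tuple[int, int]]:
    """Return (live_rows, tombstone_cells) for scans above warn_threshold."""
    hits = []
    for line in lines:
        match = TOMBSTONE_LINE.search(line)
        if match is None:
            continue
        live, tombstones = int(match.group(1)), int(match.group(2))
        if tombstones > warn_threshold:
            hits.append((live, tombstones))
    return hits
```

Feeding this the tail of system.log and counting hits per interval gives a rough trend. Attributing hits to a table would require parsing the query text, which this sketch leaves out.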
Level 4: Expert
Level 4 targets predictive insight and specialized workloads. These signals expose pathological data models, cross-datacenter bottlenecks, and off-heap pressure that JVM heap metrics alone cannot see. All Level 3 signals remain relevant.
- Per-partition size distribution. Sample recent partitions with nodetool toppartitions or check maximum partition size in nodetool tablestats. Note that toppartitions samples only recent traffic. Unbounded partition growth causes GC pressure, compaction stalls, and streaming failures.
- Tombstone density per table. Derive tombstone density from table statistics in nodetool tablestats or virtual tables. A table whose live cells are outnumbered by tombstones has a data model or TTL problem.
- Gossip phi failure detector values. Poll the FailureDetector MBean (org.apache.cassandra.net:type=FailureDetector) for phi values per endpoint. Rising phi on specific peers predicts imminent DOWN marking before gossip flaps.
- Off-heap memory usage. Track bloom filter and compression metadata memory via Table-level JMX MBeans, and monitor process RSS against the JVM max heap. RSS that grows far beyond the heap signals off-heap pressure that can trigger Linux OOM kills.
- Capacity planning runway. Project disk, heap, and IOPS consumption trends. STCS can transiently need up to 100% additional space during major compaction; linear extrapolation of these trends helps prevent surprise exhaustion.
- Inter-DC latency and streaming throughput. Monitor cross-DC latency and streaming throughput during repair. WAN saturation from streaming can trigger timeouts on EACH_QUORUM writes.
- LWT contention metrics. Monitor CASRead and CASWrite scopes separately from standard reads and writes under ClientRequest. Paxos-based transactions have different latency profiles that can mask normal operation health if combined.
- Read repair and speculative retry rates. Track ReadRepairRequests and SpeculativeRetries per table via JMX. Elevated read repair reveals replica inconsistency; high speculative retries double read load on the cluster.
- Bloom filter false-positive ratios. Watch per-table BloomFilterFalseRatio against the configured bloom_filter_fp_chance (default 0.01). A rising ratio wastes I/O on negative lookups.
- Virtual tables. Query system_views for latency histograms, thread pools, caches, and SSTable tasks. Virtual tables expose operational data without JMX polling overhead.
- Guardrail violations. Monitor guardrail violations in Cassandra 4.1+. Soft and hard limits on SSTable count and partition size provide early warnings before hard failures.
- SAI index metrics. Monitor Storage Attached Index metrics in Cassandra 5.0+. SAI adds compaction and query overhead that requires separate tracking.
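The capacity planning runway can be estimated with an ordinary least-squares line through recent usage samples. The sketch below is one way to do that; the function name is hypothetical, and it assumes usage grows roughly linearly over the sampled window. Pass the usable capacity rather than the raw volume size, for example half the volume if you reserve the full STCS major-compaction headroom mentioned above.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    capacity: float) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    samples: (day, used) pairs, with used in the same unit as capacity.
    Returns None if there are too few samples or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples share the same day; no trend to fit
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # flat or shrinking usage has no exhaustion date
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - samples[-1][0])
```

For a 1000 GB volume with 500 GB treated as usable, daily samples of 400, 410, 420, and 430 GB give a runway of about seven days from the last sample.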
Netdata
Netdata collects Cassandra JMX metrics and virtual tables without manual MBean enumeration. Per-second resolution helps correlate GC pauses with gossip flapping or dropped message spikes. Overlay pending compactions, disk I/O utilization, and read latency on the same timeline to spot a compaction death spiral before reads time out. Alert on mature and expert signals, including repair age relative to gc_grace_seconds, off-heap RSS divergence, and per-table speculative retry rates.
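Whatever tool evaluates the alert, repair age relative to gc_grace_seconds reduces to a ratio. The sketch below shows that arithmetic in a tool-agnostic way. It is not Netdata configuration: the function names and the 0.5 and 0.8 thresholds are hypothetical, and it assumes you already have the finish time of the last successful repair for the table, for example from system_distributed.repair_history.

```python
from datetime import datetime, timedelta, timezone


def repair_age_ratio(last_repair_finished: datetime,
                     gc_grace_seconds: int,
                     now: datetime) -> float:
    """Fraction of the gc_grace_seconds window used since the last repair."""
    age_seconds = (now - last_repair_finished).total_seconds()
    return age_seconds / gc_grace_seconds


def repair_alert_level(ratio: float,
                       warn_at: float = 0.5,
                       critical_at: float = 0.8) -> str:
    """Map a repair age ratio to an alert level (thresholds are placeholders)."""
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"
```

With the default gc_grace_seconds of 864000 (10 days), a table last repaired 5 days ago has a ratio of 0.5. A ratio at or above 1.0 means the repair window has already been missed.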







