Cassandra monitoring maturity model: from survival to expert

Cassandra exposes JMX MBeans, virtual tables, and log signals. Without prioritization, teams either miss slow-building problems such as compaction debt or drown in noise. This model structures production monitoring into four cumulative levels. Each level adds signals that reduce mean time to detection and catch the failures that dominate Cassandra incidents: data resurrection from missed repair, compaction death spirals, and GC-induced gossip flapping.

Audit your current instrumentation against these levels. See the Cassandra monitoring checklist for a condensed signal inventory.

Expert signals are only interpretable when foundational health is already visible. Eliminate survival and operational blind spots before tuning advanced alerts.

flowchart TD
    L1[Level 1 Survival]
    L2[Level 2 Operational]
    L3[Level 3 Mature]
    L4[Level 4 Expert]
    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1: Survival

Level 1 answers one question: is the node alive and does it have disk space? These are binary checks that require no Cassandra-specific tooling beyond nodetool status and basic OS commands. If any fail, the cluster is either unavailable or at immediate risk of write blockage.

  • Process and port liveness. Verify the JVM process is present with pgrep -f CassandraDaemon or systemctl status cassandra. Also confirm the CQL native transport port (default 9042) is listening with ss -tlnp | grep 9042. A live process that has closed the port is a zombie node. nodetool can hang under GC pressure, so prefer OS-level checks for survival paging.
  • Node status UP/DOWN. A DN state in nodetool status means gossip has marked the node unreachable and can lead to quorum loss. Run this from a peer if the local node is unresponsive. The JMX FailureDetector MBean (org.apache.cassandra.net:type=FailureDetector) exposes the same state.
  • Disk space remaining. Check filesystem free space on data and commitlog volumes with df -h. Cassandra needs headroom for compaction. Running out of space halts flushes and blocks writes.
  • Basic client error rate. Watch driver metrics for WriteTimeoutException, ReadTimeoutException, and UnavailableException. A coarse external signal is sufficient at this tier.
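
The OS-level checks above can be scripted without nodetool. The following is a minimal sketch, assuming the default CQL port 9042 mentioned in the list. The function names, the data path, and any free-space floor you pass in are illustrative assumptions, not Cassandra defaults.

```python
import shutil
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def free_fraction(path):
    """Return the free fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def survival_check(host, port, data_path, min_free):
    """Collect failed Level 1 checks as human-readable strings.

    min_free is a hypothetical floor chosen by the operator, for example 0.2.
    """
    failures = []
    if not port_is_listening(host, port):
        failures.append(f"CQL port {port} on {host} is not accepting connections")
    free = free_fraction(data_path)
    if free < min_free:
        failures.append(f"{data_path} has {free:.0%} free, below {min_free:.0%}")
    return failures
```

A call such as survival_check("127.0.0.1", 9042, "/var/lib/cassandra/data", 0.2) returns an empty list when both checks pass. Because it never touches JMX or nodetool, it keeps working when the JVM is stalled in GC.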

Level 2: Operational

Level 2 shifts from binary liveness to client-visible performance and resource health. The goal is to detect overload, backpressure, and coordination failures before they cause node flapping or data loss. All Level 1 signals remain relevant.

  • Client request latency. Track coordinator-level P99 read and write latency via JMX (org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency). Sustained elevation above baseline indicates compaction, GC, or slow replica issues. Compare coordinator latency against replica latency (org.apache.cassandra.metrics:type=Table,keyspace=*,scope=*,name=ReadLatency) to isolate local versus remote slowdown. Use nodetool proxyhistograms only for ad hoc inspection; it is too expensive for polling.
  • Client request throughput. Baseline read and write request rates from the Count attribute on ClientRequest latency MBeans. Treat Count as a monotonically increasing counter and compute the rate. Sudden drops suggest client failures; unexpected spikes may indicate retry storms.
  • Client request timeouts and unavailables. Monitor the Timeouts and Unavailables meters under ClientRequest (org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts). Timeouts mean replicas were contacted but responded too slowly; unavailables mean the coordinator did not see enough live replicas to attempt the request. Distinguishing the two determines the correct incident response.
  • Dropped messages. Compute the rate from org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped and equivalent scopes such as READ and RANGE_SLICE. Any sustained nonzero rate means the node is shedding load. Dropped mutations risk replica inconsistency.
  • JVM heap usage and GC pause times. Keep heap used after an old GC below roughly 75% of max, and keep pauses under 2 seconds. Longer pauses disrupt gossip and can trigger phi failure detection.
  • Pending compactions. Monitor org.apache.cassandra.metrics:type=Compaction,name=PendingTasks. A monotonically increasing count over hours signals compaction debt that leads to read amplification and disk space growth.
  • Node topology match. Confirm nodetool status reports the expected number of nodes in UN state. An unexpected count reveals a stuck decommission or split-brain partition.
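
Two of the signals above rely on simple arithmetic over successive samples: turning the cumulative Count attribute into a rate, and spotting a pending compaction gauge that only ever grows. A minimal sketch follows. The function names and the four-sample minimum are assumptions for illustration, and the samples are assumed to come from whatever collector polls JMX.

```python
def counter_rate(prev_count, prev_ts, curr_count, curr_ts):
    """Per-second rate between two samples of a cumulative counter.

    Returns None when no rate can be computed: timestamps that do not
    advance, or a counter that went backwards (typically a node restart).
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0 or curr_count < prev_count:
        return None
    return (curr_count - prev_count) / elapsed


def is_steadily_growing(samples, min_samples=4):
    """True if a gauge never decreases across the window and ends higher.

    samples is an oldest-first list of readings, for example PendingTasks
    polled at a fixed interval over several hours.
    """
    if len(samples) < min_samples:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]
```

Returning None on a counter reset, rather than a negative rate, avoids a false "sudden drop" alert every time a node restarts.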

Level 3: Mature

Level 3 adds per-table and per-subsystem granularity. At this stage you can distinguish a hot partition from a cluster-wide problem, detect tombstone accumulation before queries abort, and confirm that repair is actually finishing. All Level 2 signals remain relevant.

  • Per-table SSTable count and latency. Correlate per-table LiveSSTableCount via JMX or nodetool tablestats with latency histograms from nodetool tablehistograms. Growing SSTable counts increase read amplification and degrade query performance.
  • Thread pool pending and blocked tasks. Watch Pending and Blocked counts in nodetool tpstats for stages such as ReadStage and MutationStage. Sustained pending means the node cannot keep up; blocked tasks indicate the queue is full.
  • Hinted handoff status and delivery rate. Check JMX hints metrics and the size of the hints directory (default /var/lib/cassandra/hints/, but confirm hints_directory in cassandra.yaml). Hints accumulating while all nodes are healthy indicate replica flapping or network partitions.
  • Key and row cache hit rates. Track key cache hit rate via JMX (org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate) or nodetool info. A declining rate below 85% on read-heavy workloads suggests the working set has outgrown capacity. Leave row cache off unless you have verified the heap cost.
  • Commitlog pending tasks. Monitor org.apache.cassandra.metrics:type=CommitLog,name=PendingTasks. Sustained nonzero values indicate commitlog device saturation or slow memtable flushes that prevent segment recycling.
  • Tombstone scan warnings. Watch system.log for lines containing “Read * live rows and * tombstone cells” that exceed tombstone_warn_threshold (default 1000). Queries that scan excessive tombstones degrade read performance and abort at tombstone_failure_threshold (default 100000).
  • Repair completion tracking. Verify that every range is repaired within gc_grace_seconds. Query system_distributed.repair_history for completed ranges and final status. Use nodetool repair_admin list for scheduled or incremental repair status where applicable. Missing the repair window risks resurrection of deleted data and silent data divergence.
  • Disk I/O per-device. Monitor %util, await, and queue depth with iostat -x on separate data and commitlog devices. Saturation on either path degrades write durability or read latency.
  • File descriptor usage. Monitor OpenFileDescriptorCount against MaxFileDescriptorCount via java.lang:type=OperatingSystem. Each SSTable opens multiple file handles; approaching the ulimit prevents new SSTables and connections.
  • Schema agreement. Run nodetool describecluster and confirm exactly one schema UUID. Disagreement blocks DDL and may indicate a partitioned or stuck node.
  • Storage exceptions. Watch the StorageExceptions counter via JMX. Any nonzero rate indicates disk or filesystem integrity issues that require immediate investigation.
  • Streaming session status. Check nodetool netstats for active bootstrap, decommission, or repair streams. Failed or stalled sessions leave the topology in an incomplete state.
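
The tombstone scan warning above lends itself to log parsing. The sketch below matches the "Read * live rows and * tombstone cells" pattern quoted in the list and grades each line against the default thresholds. The function name and severity labels are assumptions, and the exact text surrounding the pattern varies by Cassandra version, so the regular expression deliberately matches only the quoted fragment.

```python
import re

# Matches only the fragment quoted above; the rest of the log line is ignored.
TOMBSTONE_LINE = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")


def classify_tombstone_line(line, warn_threshold=1000, failure_threshold=100000):
    """Return (live_rows, tombstones, severity), or None if the line does not match.

    Default thresholds mirror tombstone_warn_threshold and
    tombstone_failure_threshold; pass the values from your own cassandra.yaml.
    """
    match = TOMBSTONE_LINE.search(line)
    if match is None:
        return None
    live_rows = int(match.group(1))
    tombstones = int(match.group(2))
    if tombstones > failure_threshold:
        severity = "failure"
    elif tombstones > warn_threshold:
        severity = "warn"
    else:
        severity = "ok"
    return live_rows, tombstones, severity
```

Feeding each system.log line through classify_tombstone_line and counting "warn" results per table gives an early trend signal well before queries start aborting.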

Level 4: Expert

Level 4 targets predictive insight and specialized workloads. These signals expose pathological data models, cross-datacenter bottlenecks, and off-heap pressure that JVM heap metrics alone cannot see. All Level 3 signals remain relevant.

  • Per-partition size distribution. Sample recent partitions with nodetool toppartitions or check maximum partition size in nodetool tablestats. Note that toppartitions samples only recent traffic. Unbounded partition growth causes GC pressure, compaction stalls, and streaming failures.
  • Tombstone density per table. Derive tombstone density from table statistics in nodetool tablestats or virtual tables. A table whose live cells are outnumbered by tombstones has a data model or TTL problem.
  • Gossip phi failure detector values. Poll the FailureDetector MBean (org.apache.cassandra.net:type=FailureDetector) for phi values per endpoint. Rising phi on specific peers predicts imminent DOWN marking before gossip flaps.
  • Off-heap memory usage. Track bloom filter and compression metadata memory via Table-level JMX MBeans, and monitor process RSS against the JVM max heap. RSS that grows far beyond the heap signals off-heap pressure that can trigger Linux OOM kills.
  • Capacity planning runway. Project disk, heap, and IOPS consumption trends. STCS can transiently need up to 100% additional space during major compaction, so project against usable capacity rather than raw volume size. Even simple linear extrapolation helps prevent surprise exhaustion.
  • Inter-DC latency and streaming throughput. Monitor cross-DC latency and streaming throughput during repair. WAN saturation from streaming can trigger timeouts on EACH_QUORUM writes.
  • LWT contention metrics. Monitor CASRead and CASWrite scopes separately from standard reads and writes under ClientRequest. Paxos-based transactions have a different latency profile, and combining them with standard requests obscures the health of both.
  • Read repair and speculative retry rates. Track ReadRepairRequests and SpeculativeRetries per table via JMX. Elevated read repair reveals replica inconsistency; each speculative retry sends an extra replica read, so a high rate adds significant read load to the cluster.
  • Bloom filter false-positive ratios. Watch per-table BloomFilterFalseRatio against the configured bloom_filter_fp_chance (default 0.01). A rising ratio wastes I/O on negative lookups.
  • Virtual tables. Query system_views for latency histograms, thread pools, caches, and SSTable tasks. Virtual tables expose operational data without JMX polling overhead.
  • Guardrail violations. Monitor guardrail violations in Cassandra 4.1+. Soft and hard limits on SSTable count and partition size provide early warnings before hard failures.
  • SAI index metrics. Monitor Storage Attached Index metrics in Cassandra 5.0+. SAI adds compaction and query overhead that requires separate tracking.
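
The capacity runway signal above can be estimated with a least-squares line through recent disk usage samples. This is a minimal sketch under stated assumptions: the function name is hypothetical, samples are (day, used_bytes) pairs from your own collector, and the default headroom_fraction of 0.5 reflects the STCS worst case noted in the list. Adjust it for other compaction strategies.

```python
def days_until_full(samples, capacity_bytes, headroom_fraction=0.5):
    """Estimate days from the last sample until usage hits the usable limit.

    samples: oldest-first list of (day, used_bytes) pairs.
    The usable limit is capacity_bytes * (1 - headroom_fraction).
    Returns None if there are too few samples or usage is flat or shrinking.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        return None
    # Ordinary least-squares slope and intercept of used_bytes over time.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    limit = capacity_bytes * (1 - headroom_fraction)
    day_at_limit = (limit - intercept) / slope
    last_day = samples[-1][0]
    return max(day_at_limit - last_day, 0.0)
```

A linear fit is the simplest possible model. It ignores compaction sawtooth and seasonal traffic, so treat the result as a lower-confidence estimate and alert well before the projected date.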

Netdata

Netdata collects Cassandra JMX metrics and virtual tables without manual MBean enumeration. Per-second resolution helps correlate GC pauses with gossip flapping or dropped message spikes. Overlay pending compactions, disk I/O utilization, and read latency on the same timeline to spot a compaction death spiral before reads time out. Alert on mature and expert signals, including repair age relative to gc_grace_seconds, off-heap RSS divergence, or per-table speculative retry rates.
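
Whatever tool raises the alert, the repair-age signal reduces to a ratio of time since the last successful repair to gc_grace_seconds. A minimal sketch, assuming the last repair timestamp has already been extracted (for example from system_distributed.repair_history). The function names and the 0.5 and 0.8 thresholds are illustrative assumptions, and 864000 is the default gc_grace_seconds of 10 days.

```python
from datetime import datetime, timezone


def repair_age_ratio(last_repair_finished, gc_grace_seconds=864000, now=None):
    """Return time since the last repair as a fraction of gc_grace_seconds.

    last_repair_finished and now must both be timezone-aware datetimes.
    A ratio of 1.0 or more means the repair window has been missed.
    """
    if gc_grace_seconds <= 0:
        raise ValueError("gc_grace_seconds must be positive")
    now = now or datetime.now(timezone.utc)
    age_seconds = (now - last_repair_finished).total_seconds()
    return age_seconds / gc_grace_seconds


def repair_alert_level(ratio, warn_at=0.5, critical_at=0.8):
    """Map a repair age ratio to 'ok', 'warn', or 'critical'."""
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warn"
    return "ok"
```

Alerting on the ratio rather than on absolute age keeps the rule valid for tables that override gc_grace_seconds.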