How Cassandra actually works in production: a mental model for operators

If you operate Cassandra in production, you are managing a distributed, partitioned, replicated log-structured merge tree where every node is a peer. There is no master to restart, no central query planner to tune, and no automatic load balancing that will save you from a hot partition. Every write is a sequential append to a commitlog and an in-memory update on multiple replicas. Every read is a merge of memtables and immutable SSTables, filtered by probabilistic bloom filters and reconciled by timestamp.

Cassandra delegates all hard decisions to the operator. Compaction strategy sets your I/O profile, consistency level sets availability boundaries, and the failure detector’s phi threshold sets the stall tolerance before the cluster reconfigures. If you do not understand how the write path, read path, gossip, compaction, and repair interact, you will misdiagnose GC pauses as network partitions, compaction backlog as slow queries, and disk space exhaustion as capacity growth when it is actually a strategy mismatch.

What it is and why it matters

Cassandra uses a peer-to-peer token ring. The partitioner hashes the partition key to a token, and the snitch places replicas clockwise on the ring across racks and datacenters. Any node can receive a CQL request and act as the coordinator, forwarding operations to the replica set and waiting for acknowledgements to satisfy the consistency level. This eliminates single points of failure but makes latency, availability, and consistency emergent properties of local storage behavior, network topology, and the phi accrual failure detector.

The storage engine is an LSM tree. Writes are append-only and sequential; reads merge live memtables with immutable SSTables, applying tombstones and resolving conflicts across replicas. Most production incidents start in the gap between fast writes and expensive reads. Understanding this asymmetry prevents you from adding disk when queries are slow instead of recognizing that compaction cannot keep up with ingestion.

How it works

Four subsystems run continuously and interact to produce the behavior you see in metrics and logs.

Write path. A write arrives at the coordinator, which hashes the partition key to determine the replica nodes that own the token range. On each replica, the mutation is appended to the commitlog and inserted into the memtable, a per-table sorted in-memory structure. The coordinator acknowledges the write once enough replicas have responded to satisfy the consistency level. By default, commitlog durability is periodic (commitlog_sync: periodic), acknowledging before fsync; use batch only if you accept the latency cost for lower durability risk. When a memtable reaches its size threshold, it flushes to disk as an immutable SSTable. The corresponding commitlog segments become eligible for recycling. SSTables accumulate until compaction merges them. On startup, the node replays unflushed commitlog segments to recover memtables.

Read path. The coordinator sends requests to enough replicas to satisfy the consistency level. For levels below ALL, it typically requests full data from one replica and digests from the others. If digests mismatch, it fetches full rows and resolves by timestamp. On each replica, Cassandra checks the memtable first, then consults per-SSTable bloom filters to rule out SSTables that cannot contain the partition. A false positive causes an unnecessary SSTable read; the false positive rate is configurable per table via bloom_filter_fp_chance. For candidate SSTables, the partition summary narrows the index lookup, and the partition index locates the exact byte offset. Data from the memtable and all relevant SSTables is merged, tombstones are applied, and the result is returned.

Background maintenance. Compaction runs continuously to merge SSTables, discard tombstones, and consolidate data. STCS groups SSTables by size and merges similar-sized files, favoring write throughput at the cost of bursty I/O and high space amplification. LCS organizes SSTables into levels with exponentially increasing size targets, bounding read amplification but increasing write I/O. TWCS compacts data within time windows and drops expired windows as units, making it ideal for TTL workloads. UCS (5.0+) uses density-based triggers.

Gossip runs every second. Each node exchanges state with up to three peers, propagating heartbeats, topology, schema versions, and load. The phi accrual failure detector evaluates heartbeat interarrival times; with a default threshold of 8, a node missing heartbeats long enough to exceed the threshold is marked DOWN.

Hinted handoff stores writes locally when a replica is unreachable, replaying them when the target recovers within max_hint_window_in_ms (default 3 hours). Anti-entropy repair compares Merkle trees across replicas and streams differences. It must complete within gc_grace_seconds (default 10 days) to prevent tombstones from expiring on unrepaired data and causing deleted data to reappear. Incremental repair is the default since Cassandra 4.0. It marks repaired SSTables separately, which changes compaction behavior; unrepaired data compacts only with unrepaired data. Full repair is still required periodically to guard against disk corruption and operator error. Streaming handles bulk transfers during bootstrap, decommission, rebuild, and full repair.

flowchart LR
    Client -->|CQL request| Coordinator
    Coordinator -->|hash partition key| TokenRing
    TokenRing -->|forward write| ReplicaA
    TokenRing -->|forward write| ReplicaB
    ReplicaA -->|append| CommitLog
    ReplicaA -->|update| Memtable
    Memtable -->|flush| SSTable
    SSTable -->|merge| Compaction
    ReplicaA -->|exchange state| Gossip
    Gossip -->|phi threshold| FailureDetector
    ReplicaA -->|store replay| HintedHandoff
    ReplicaA -->|compare trees| Repair

Where it shows up in production

The subsystems above compete for a fixed set of resources. Recognizing the competition patterns lets you distinguish root causes from symptoms.

ResourceCompetition pattern
Disk I/OCommitlog writes, memtable flushes, compaction, and reads all compete. Compaction is usually the dominant consumer. Plan for SSDs.
JVM heapMemtables, key cache, query results, compression metadata, and in-flight requests. GC pauses directly impact availability.
Off-heap memoryBloom filters, compression metadata, direct buffers, and chunk cache. Invisible to JVM GC but counts toward Linux OOM.
CPUCompaction, query processing, encryption, and GC.
NetworkClient traffic, inter-node replication, gossip, and streaming.
File descriptorsEach SSTable opens multiple file handles; hundreds of SSTables multiplied by several files per SSTable equals thousands of FDs.
Disk spaceSSTables, commitlog, hints, and snapshots. STCS can transiently need up to 100% additional space during major compaction.

Deployment choices change which signals matter most. Multi-DC deployments add cross-DC latency and streaming saturation during repair. Higher replication factors increase write coordination costs. Virtual nodes (vnodes) change compaction and repair dynamics. Lightweight Transactions (Paxos-based) add roughly four round-trips of latency that must be monitored separately from standard reads and writes.

You can observe these competitions directly. nodetool compactionstats shows whether compaction is keeping up. nodetool tpstats reveals thread pool saturation in the mutation, read, and native transport stages, including the HintedHandoff stage. nodetool info exposes heap usage. Check OS-level file descriptor counts because Cassandra opens thousands of handles across SSTables. Interpreting these outputs requires knowing which subsystem they belong to.

Operational tradeoffs and failure archetypes

Production incidents in Cassandra tend to follow repeatable archetypes that emerge from the interaction of the subsystems described above.

GC death spiral. Heap pressure triggers long GC pauses. During a pause, gossip cannot exchange heartbeats, so peers mark the node DOWN. Clients timeout and retry; other nodes store hints. When the GC finishes, the node is flooded with replayed hints and retried requests, increasing memory pressure and triggering longer pauses. The trigger is often a large partition read, tombstone-heavy scan, or memory leak that promotes objects to the old generation. Once old-gen occupancy exceeds what a full GC can reclaim, the JVM enters continuous stop-the-world collection.

Compaction death spiral. Write rate exceeds compaction throughput. SSTables accumulate, so every read must check more files, and read amplification increases. Latency degrades, but writes remain fast, creating a false sense of health. Disk I/O saturates under the combined load of reads and compaction. Eventually disk space runs out or reads become too slow to meet SLAs. The signature is write latency remaining healthy while read latency degrades.

Tombstone storm. Deletes and TTL expirations create tombstones that persist until compaction after all replicas have been repaired. If tombstones accumulate across many SSTables, reads must scan and merge enormous amounts of dead data. This consumes CPU, memory, and I/O. At tombstone_failure_threshold (default 100,000), queries abort entirely. The root cause is usually a data model mismatch: using Cassandra as a queue, issuing frequent range deletes, or failing to run repair so that tombstones cannot be purged.

Disk space exhaustion. Compaction needs temporary space to write merged SSTables before deleting old ones. With STCS, major compaction can temporarily require space equal to the largest table size. If snapshots, hints, or uncompacted data consume the remaining space, compaction stalls. Without compaction, old SSTables are never deleted, and writes eventually block when the commitlog cannot allocate new segments.

Hint overflow. When a replica is down, coordinators store hints. If the outage exceeds max_hint_window_in_ms (default 3 hours), hints stop being stored. Data written during the remainder of the outage is permanently missing from that replica unless anti-entropy repair is run after recovery.

Signals to watch in production

SignalWhy it mattersWarning sign
Node liveness (gossip/phi)A DOWN node reduces effective replication and can cause quorum loss for affected token ranges.DownEndpointCount > 0 sustained > 5 min; flapping > 3 transitions in 30 min.
Coordinator read/write latency (P99)Direct measure of user experience and end-to-end path cost across replicas and merging.P99 > 3x rolling baseline or approaching half of the configured request timeout.
Pending compactionsLeading indicator of compaction debt. Rising pending means read amplification and disk space consumption will follow.Trending upward over 4+ hours; > 50 sustained in STCS or above expected per-level count in LCS.
GC pause durationPauses exceeding the failure detector threshold mark the node DOWN. Shorter pauses still degrade latency and can cascade.Max pause > 2s or GC time > 5% of wall clock over 5 minutes.
Dropped messages (MUTATION/READ)The node is shedding load. Dropped mutations create silent inconsistency that only repair can fix.Any sustained non-zero rate for > 60 seconds.
SSTable count per tableEach SSTable adds bloom filter checks and merge work to every read.> 50 in STCS, L0 > 32 in LCS, or trending upward over days.
Large partition sizeWide rows cause heap pressure on reads and can stall compaction.nodetool tablestats reports max partition size > 100 MB.
Repair completion timeUnrepaired data beyond gc_grace_seconds allows tombstones to expire and deleted data to resurrect.Last successful repair > 80% of gc_grace_seconds (default > 8 days).
Disk space availableCompaction cannot run without headroom. Exhaustion blocks flushes and writes.< 30% free; with STCS, < 50% free is dangerous due to major compaction amplification.

How Netdata helps

  • Correlate coordinator latency with replica GC pauses. Collect JVM GarbageCollector CollectionTime and ClientRequest latency percentiles to confirm whether a P99 spike originates from a replica death spiral or a network issue.
  • Spot compaction debt before reads degrade. Track Compaction PendingTasks alongside per-table LiveSSTableCount to detect the write-amplification gap that precedes read latency cliffs.
  • Monitor off-heap RSS, not just JVM heap. Track process RSS and OS memory metrics to surface off-heap growth from bloom filters and compression metadata that JMX heap graphs hide.
  • Distinguish phi convict flapping from hardware failure. Combine FailureDetector states with OS-level CPU wait and disk latency to determine if a node is marked DOWN due to GC pressure or actual network loss.
  • Track repair cadence against gc_grace_seconds. Alert on repair completion timestamps to ensure anti-entropy runs before tombstones resurrect deleted data.