Cassandra frequent memtable flushes: small SSTables and compaction burden
When MemtableFlushWriter pending tasks climb on Cassandra nodes, SSTables often land on disk at only tens of megabytes and multiply quickly. Within hours, PendingCompactions rises, read latency drifts, and disk I/O stays pinned despite flat write throughput. Frequent memtable flushing under memory pressure creates compaction debt and read amplification. Once started, the cluster enters a feedback loop that is hard to unwind without targeted tuning.
Cassandra buffers writes in per-table memtables, which share node-wide pools sized by memtable_heap_space_in_mb and memtable_offheap_space_in_mb (named memtable_heap_space and memtable_offheap_space in Cassandra 4.1 and later). When the memory occupied by memtables exceeds the fraction of a pool set by memtable_cleanup_threshold, the node queues the largest memtable for flush. Commitlog space limits can also force flushes. MemtableFlushWriter threads serialize the memtable to disk as an immutable SSTable. If flushes happen too often, each memtable contains relatively little data. The result is a flood of small SSTables.
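As a rough illustration of where the flush trigger sits, the sketch below assumes memtable_cleanup_threshold is left at its documented default of 1 / (memtable_flush_writers + 1). The helper name flush_trigger_mb and the 2048 MB pool with 2 flush writers are hypothetical values chosen for the example, not recommendations.

```shell
# Hypothetical helper: approximate memtable usage (in MB) at which the
# node starts flushing, assuming the default cleanup threshold of
# 1 / (memtable_flush_writers + 1).
flush_trigger_mb() {
  # $1 = memtable pool limit in MB, $2 = memtable_flush_writers
  awk -v limit="$1" -v writers="$2" \
    'BEGIN { printf "%.0f\n", limit / (writers + 1) }'
}

# Illustrative values: 2048 MB pool, 2 flush writers
flush_trigger_mb 2048 2   # prints 683
```

With these assumed values, flushing begins once memtables hold roughly a third of the pool, so the largest single memtable is usually well below the configured limit when it is written out.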
Each SSTable carries fixed overhead: bloom filters, partition indexes, and compaction bookkeeping. One thousand 10 MB SSTables consume significantly more memory and I/O for these structures than ten 1 GB SSTables holding the same data. This magnifies memory pressure, which in turn forces more flushes. A high volume of small files also creates disproportionate scheduling overhead. The cluster spends more time compacting marginally useful files than serving queries.
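To get a feel for the file counts involved, the arithmetic can be sketched in shell. The 100 GB of daily flush volume is an illustrative assumption rather than a measurement, and sstables_per_day is a hypothetical helper name.

```shell
# Hypothetical helper: number of SSTables one day of flushes creates
# (before any compaction) for a given average flush size.
sstables_per_day() {
  # $1 = data flushed per day in MB, $2 = average flush size in MB
  awk -v daily="$1" -v flush="$2" 'BEGIN { printf "%d\n", daily / flush }'
}

sstables_per_day 102400 10     # 100 GB/day at 10 MB per flush  -> 10240
sstables_per_day 102400 1024   # 100 GB/day at 1 GB per flush   -> 100
```

The same data volume produces about a hundred times more files at the smaller flush size, and every one of those files needs its own bloom filter, index, and compaction scheduling.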
If the flush rate exceeds compaction throughput, SSTable count keeps growing. Under SizeTieredCompactionStrategy (STCS), this is especially costly because STCS merges only once enough similarly sized SSTables (min_threshold, 4 by default) collect in a bucket. A stream of small flushes keeps the lowest tier busy, and data must be rewritten through several tiers before it lands in large SSTables, which raises write amplification and delays space reclamation. LeveledCompactionStrategy (LCS) suffers too, as L0 accumulation triggers emergency compactions that steal I/O from reads.
flowchart TD
    A[Memory pressure or undersized thresholds] --> B[Frequent memtable flushes]
    B --> C[Small SSTables created]
    C --> D[Compaction backlog rises]
    D --> E[Read amplification increases]
    D --> F[Disk I/O saturation]
    E --> G[Coordinator latency degrades]
What this means
The write path is a balance between memory, flush throughput, and compaction throughput. Memtables act as the write buffer. Ideally, a memtable grows large enough to produce a reasonably sized SSTable, minimizing the fixed overhead per file. When flushes are forced prematurely by memory pressure or commitlog back-pressure, SSTables are small. Small SSTables mean more files per read, more bloom filter checks, and more compaction tasks to schedule.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Memtable thresholds too low for write volume | High flush rate despite moderate throughput; memtables stay small | memtable_heap_space_in_mb and memtable_offheap_space_in_mb in cassandra.yaml |
| Excessive table count | Hundreds of tables each holding a memtable competing for limited heap and off-heap space | Table count and per-table memtable sizes in nodetool tablestats |
| Commitlog pressure forcing flushes | Commitlog segments accumulate because flushes cannot free them fast enough, or total commitlog space is undersized | Commitlog segment count and commitlog_total_space_in_mb |
| Heap or off-heap memory pressure | GC pauses increase; off-heap usage near limits triggers aggressive cleanup | Heap usage after GC and off-heap bloom filter metrics |
Quick checks
# Check MemtableFlushWriter saturation
nodetool tpstats | grep MemtableFlushWriter
# Inspect memtable sizes and flush counts per table
nodetool tablestats | grep -E "Memtable data size|Memtable switch count"
# Review compaction backlog
nodetool compactionstats
# Count SSTables per table
nodetool tablestats | grep "SSTable count"
# Check commitlog segment accumulation
find <commitlog_directory> -maxdepth 1 -type f | wc -l
# Note: replace <commitlog_directory> with the path configured in cassandra.yaml
# Look for small flush sizes in logs
grep "Completed flushing" /var/log/cassandra/system.log | tail -n 20
# Check heap pressure
nodetool info | grep "Heap Memory"
How to diagnose it
- Establish whether flush frequency is abnormal. Sample Memtable switch count from nodetool tablestats twice over a 10-minute window and compute the rate. A rate that diverges significantly from write throughput indicates pressure-driven flushes rather than natural write volume.
- Identify hot tables. Compare Memtable switch count and Memtable data size across tables in nodetool tablestats. If one table dominates, the issue may be a write hotspot rather than undersized global limits.
- Compare total memtable size to configured limits. Sum Memtable data size across all tables. If the total repeatedly approaches the ceiling, flushes are memory-bound.
- Inspect flush output in system logs. Look for Completed flushing with small byte counts. Repeated sub-hundred-megabyte flushes on a write-heavy node suggest thresholds are undersized.
- Correlate flush timing with compaction health. If PendingCompactions grows in step with flush activity, the flush rate has exceeded compaction throughput.
- Exclude repair and streaming. Check nodetool netstats and system logs for running repairs or bootstraps. These can inflate memtable activity independently of workload pressure.
- Check commitlog segment count. If segments are accumulating, flushes are not keeping up with commitlog rotation. Verify commitlog_total_space_in_mb is not forcing premature flushes.
- Review heap and GC metrics. If old-gen occupancy after GC is climbing, or if young collections are promoting aggressively, memory pressure may be forcing premature flushes even if write rates are flat.
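The first step, computing a flush rate from two samples, can be sketched as follows. This is a minimal sketch that assumes nodetool tablestats prints a Keyspace line, a Table: line, and a Memtable switch count: line for each table; the layout varies slightly between versions, so check it against your own output. The function name switch_rate and the file names before.txt and after.txt are hypothetical.

```shell
# Sketch: per-table flush rate from two saved `nodetool tablestats` outputs.
# Assumes "Keyspace : <name>", "Table: <name>" and
# "Memtable switch count: <n>" lines appear in that order for each table.
switch_rate() {
  # $1 = earlier sample file, $2 = later sample file, $3 = seconds between them
  awk -v secs="$3" '
    /^Keyspace *:/  { sub(/^Keyspace *: */, ""); ks = $0; next }
    /^[ \t]*Table:/ { sub(/^[ \t]*Table: */, ""); tbl = $0; next }
    /Memtable switch count:/ {
      key = ks "." tbl
      if (FILENAME == ARGV[1]) before[key] = $NF      # remember first sample
      else printf "%s %.1f flushes/hour\n", key, ($NF - before[key]) * 3600 / secs
    }
  ' "$1" "$2"
}

# Usage (run on the node being diagnosed):
#   nodetool tablestats > before.txt; sleep 600; nodetool tablestats > after.txt
#   switch_rate before.txt after.txt 600
```

Sorting the output by the rate column makes hot tables stand out, which feeds directly into the second step.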
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Memtable switch count rate | Direct measure of flush frequency | Rate increasing without corresponding write rate increase |
| MemtableFlushWriter pending tasks | Indicates flush pipeline saturation | Pending > 0 sustained for more than a few minutes |
| SSTable count per table | Read amplification and compaction debt | Count growing steadily under STCS or L0 accumulation under LCS |
| Pending compactions | Compaction cannot keep up with SSTable creation | Trending upward over 4+ hours |
| Commitlog segment count | Commitlog cannot recycle segments because flushes are stalled | Segment count growing beyond baseline |
| Bloom filter memory usage | Off-heap pressure from high SSTable count | Growth correlating with SSTable count |
| Disk I/O utilization | Flush and compaction compete for sequential I/O | await elevated on data or commitlog devices |
| Heap usage after GC | Memory pressure forces premature flushing | Post-GC heap > 75% of max |
Fixes
- Raise memtable size limits. Increase memtable_heap_space_in_mb and memtable_offheap_space_in_mb if the node has headroom. Larger memtables produce fewer, bigger SSTables, which are more efficient to compact. Tradeoff: larger memtables hold writes in memory longer and increase the commitlog replay window. Requires restart. Warning: do not raise these beyond available heap or off-heap capacity; doing so risks OOM.
- Reduce table count. If the schema has hundreds or thousands of tables, each active table holds a memtable. Consolidate tables or archive inactive ones. This is often the root cause in multi-tenant schemas and requires schema redesign.
- Increase commitlog total space. If segments are recycling too aggressively and forcing flushes, raise commitlog_total_space_in_mb. This extends the window before commitlog back-pressure triggers an emergency flush. Requires restart.
- Separate commitlog and data devices. If commitlog and data share a disk, commitlog writes compete with flush, compaction, and read I/O. Moving commitlog to a separate volume removes that contention and often reduces flush latency. This is a provisioning change.
- Evaluate memtable allocation type. If using heap_buffers, memtables live on-heap and compete with caches and request processing. Switching to offheap_buffers or offheap_objects moves pressure off the heap, but ensure the node has sufficient native memory. Requires restart.
- Tune flush writers cautiously. Increasing memtable_flush_writers allows more concurrent flushes, but flush threads compete with compaction for CPU and disk. Only raise this if iostat shows the flush device is underutilized and CPU has idle cores. Requires restart.
- Increase compaction throughput. If flushes are necessary and small SSTables are already accumulating, temporarily raise the limit so compaction can drain the backlog. Use nodetool setcompactionthroughput or adjust compaction_throughput_mb_per_sec in cassandra.yaml. Monitor read latency, as aggressive compaction steals I/O from queries.
- Address memory pressure. If heap is the constraint, reduce on-heap consumers such as row cache or key cache using nodetool setcachecapacity, or increase JVM heap size while staying within practical G1GC limits. If off-heap bloom filters are consuming space, check SSTable count, since bloom filter memory scales with SSTable count.
- Consider trie memtables (Cassandra 5.0+). Trie memtables reduce heap object count and can improve flush efficiency. They are commonly adopted together with the trie-indexed BTI SSTable format, which is configured separately. On upgraded clusters, existing SSTables stay in the previous format until they are rewritten, so allow time for conversion.
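The temporary compaction throughput change can be wrapped in a small helper so the value before and after is always recorded. This is a minimal sketch: set_compaction_cap is a hypothetical function name, and it relies only on the nodetool getcompactionthroughput and nodetool setcompactionthroughput subcommands.

```shell
# Sketch: change the compaction throughput cap and log the value
# before and after, so it is easy to restore later.
set_compaction_cap() {
  # $1 = new cap in MB/s (0 removes throttling entirely)
  echo "before: $(nodetool getcompactionthroughput)"
  nodetool setcompactionthroughput "$1"
  echo "after:  $(nodetool getcompactionthroughput)"
}
```

For example, `set_compaction_cap 64` could be run while the backlog drains and `set_compaction_cap 16` afterwards; both numbers are illustrative assumptions, not recommendations. A value set through nodetool does not survive a restart, so a permanent change still belongs in cassandra.yaml.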
Prevention
- Size memtable limits during provisioning based on expected table count and write patterns.
- Monitor memtable switch count rate as a leading indicator. A sudden uptick signals memory pressure before compaction debt becomes visible.
- Avoid schema sprawl. Every active table claims a memtable.
- Validate disk IOPS to ensure the storage can sustain both flush spikes and baseline compaction before deploying write-heavy workloads.
- Track commitlog segment count. A steady-state node should maintain a bounded number of segments.
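Sizing against table count can be sanity-checked with simple arithmetic during provisioning. The sketch below assumes an even split of the memtable pool across actively written tables, which is a deliberate simplification; per_table_budget_mb is a hypothetical helper and the pool size and table counts are illustrative.

```shell
# Hypothetical helper: rough upper bound on memtable space per table,
# assuming the pool is shared evenly by all actively written tables.
per_table_budget_mb() {
  # $1 = memtable pool limit in MB, $2 = number of actively written tables
  awk -v pool="$1" -v tables="$2" 'BEGIN { printf "%.1f\n", pool / tables }'
}

per_table_budget_mb 2048 20    # prints 102.4
per_table_budget_mb 2048 400   # prints 5.1
```

Under these assumptions, a 400-table schema leaves only a few megabytes per memtable, which is the small-SSTable pattern described above. Real budgets are lower still, because flushing starts before the pool is full.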
How Netdata helps
- Correlates disk I/O latency and utilization on commitlog and data volumes to pinpoint contention between flushes and compaction.
- Tracks JVM heap usage and GC pause duration, surfacing memory pressure that forces premature flushes.
- Monitors pending compaction tasks and SSTable count trends, catching the compaction backlog early.
- Alerts on thread pool saturation, including MemtableFlushWriter pending tasks, before writes stall.
- Surfaces off-heap memory growth alongside RSS, helping distinguish heap pressure from native memory limits.
Related guides
- Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup
- Cassandra node stuck in joining (UJ): bootstrap diagnosis
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra commitlog pending tasks: write-path I/O pressure
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery







