Cassandra frequent memtable flushes: small SSTables and compaction burden

When MemtableFlushWriter pending tasks climb on Cassandra nodes, SSTables often land on disk at only tens of megabytes and multiply quickly. Within hours, PendingCompactions rises, read latency drifts, and disk I/O stays pinned despite flat write throughput. Frequent memtable flushing under memory pressure creates compaction debt and read amplification. Once started, the cluster enters a feedback loop that is hard to unwind without targeted tuning.

Cassandra buffers writes in per-table memtables. The limits memtable_heap_space_in_mb and memtable_offheap_space_in_mb cap the memory that all memtables on a node share. When usage exceeds the fraction of that space set by memtable_cleanup_threshold, the node queues the largest memtable for flush; commitlog space pressure can force flushes as well. MemtableFlushWriter threads serialize the memtable to disk as an immutable SSTable. If flushes happen too often, each memtable contains relatively little data when it is written out. The result is a flood of small SSTables.
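
As a rough illustration of where the cleanup flush kicks in, the sketch below estimates the trigger point for one memtable pool. It assumes the documented default of memtable_cleanup_threshold, 1 / (memtable_flush_writers + 1), and that the threshold is evaluated against one pool's limit (heap or off-heap) at a time. The function name flush_trigger_mb and the sample numbers are illustrative, not part of Cassandra.

```shell
# flush_trigger_mb LIMIT_MB FLUSH_WRITERS [CLEANUP_THRESHOLD]
# Hypothetical helper: estimates how many MB of memtable data one pool
# (heap or off-heap) can hold before a cleanup flush is triggered.
# Assumption: if CLEANUP_THRESHOLD is omitted, the default
# 1 / (FLUSH_WRITERS + 1) applies.
flush_trigger_mb() {
  awk -v limit="$1" -v writers="$2" -v threshold="${3:-}" 'BEGIN {
    if (threshold == "") threshold = 1 / (writers + 1)
    printf "%.0f\n", limit * threshold
  }'
}

# Example: flush_trigger_mb 2048 2       (2048 MB pool, 2 flush writers)
# Example: flush_trigger_mb 2048 2 0.5   (explicit threshold of 0.5)
```

With two flush writers, only about a third of the configured space fills before the largest memtable is flushed, which is one reason flushed SSTables can be much smaller than the configured limit suggests.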

Each SSTable carries fixed overhead: bloom filters, partition indexes, and compaction bookkeeping. One thousand 10 MB SSTables consume significantly more memory and I/O for these structures than ten 1 GB SSTables holding the same data. This magnifies memory pressure, which in turn forces more flushes. A high volume of small files also creates disproportionate scheduling overhead. The cluster spends more time compacting marginally useful files than serving queries.
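
To see whether a node already carries this kind of per-file overhead, one option is to count small data files on disk. This is a minimal sketch: count_small_sstables is a hypothetical helper, the 50 MiB default cutoff is an arbitrary illustration, and it assumes data components end in -Data.db and that snapshot and backup directories should be skipped.

```shell
# count_small_sstables DATA_DIR [MAX_BYTES]
# Hypothetical helper: counts SSTable data components smaller than
# MAX_BYTES (default 52428800 bytes = 50 MiB) under DATA_DIR.
# Assumption: data components are named "*-Data.db"; hard-linked copies
# under snapshots/ and backups/ are excluded from the count.
count_small_sstables() {
  dir="$1"
  max_bytes="${2:-52428800}"
  find "$dir" -type f -name '*-Data.db' \
    ! -path '*/snapshots/*' ! -path '*/backups/*' \
    -size "-${max_bytes}c" | wc -l | tr -d ' '
}

# Example: count_small_sstables <data_directory>
# Replace <data_directory> with a path from data_file_directories in cassandra.yaml.
```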

If the flush rate exceeds compaction throughput, SSTable count grows monotonically. Under SizeTieredCompactionStrategy (STCS), this is especially dangerous because STCS merges SSTables of similar size. Rapid small flushes keep compaction occupied with the smallest size tier, so data passes through more merge rounds before it reaches larger tiers, which delays space reclamation. LeveledCompactionStrategy (LCS) suffers too, as L0 accumulation triggers emergency compactions that steal I/O from reads.

flowchart TD
  A[Memory pressure or undersized thresholds] --> B[Frequent memtable flushes]
  B --> C[Small SSTables created]
  C --> D[Compaction backlog rises]
  D --> E[Read amplification increases]
  D --> F[Disk I/O saturation]
  E --> G[Coordinator latency degrades]

What this means

The write path is a balance between memory, flush throughput, and compaction throughput. Memtables act as the write buffer. Ideally, a memtable grows large enough to produce a reasonably sized SSTable, minimizing the fixed overhead per file. When flushes are forced prematurely by memory pressure or commitlog back-pressure, SSTables are small. Small SSTables mean more files per read, more bloom filter checks, and more compaction tasks to schedule.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Memtable thresholds too low for write volume | High flush rate despite moderate throughput; memtables stay small | memtable_heap_space_in_mb and memtable_offheap_space_in_mb in cassandra.yaml |
| Excessive table count | Hundreds of tables each holding a memtable competing for limited heap and off-heap space | Table count and per-table memtable sizes in nodetool tablestats |
| Commitlog pressure forcing flushes | Commitlog segments accumulate because flushes cannot free them fast enough, or total commitlog space is undersized | Commitlog segment count and commitlog_total_space_in_mb |
| Heap or off-heap memory pressure | GC pauses increase; off-heap usage near limits triggers aggressive cleanup | Heap usage after GC and off-heap bloom filter metrics |
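
Several of these first checks come down to reading a handful of cassandra.yaml settings. A small sketch for listing them in one pass follows. show_flush_settings is a hypothetical helper, and it assumes the pre-4.1 parameter names used in this guide; settings that are commented out or absent are not printed and fall back to the version's defaults.

```shell
# show_flush_settings CASSANDRA_YAML
# Hypothetical helper: prints the uncommented flush-related settings
# found at the start of a line in the given cassandra.yaml.
# Assumption: pre-4.1 parameter names (for example memtable_heap_space_in_mb).
show_flush_settings() {
  grep -E '^(memtable_(heap|offheap)_space_in_mb|memtable_cleanup_threshold|memtable_flush_writers|memtable_allocation_type|commitlog_total_space_in_mb):' "$1"
}

# Example: show_flush_settings /etc/cassandra/cassandra.yaml
# The path above is an assumption; use the location of your own cassandra.yaml.
```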

Quick checks

# Check MemtableFlushWriter saturation
nodetool tpstats | grep MemtableFlushWriter

# Inspect memtable sizes and flush counts per table
nodetool tablestats | grep -E "Memtable data size|Memtable switch count"

# Review compaction backlog
nodetool compactionstats

# Count SSTables per table
nodetool tablestats | grep "SSTable count"

# Check commitlog segment accumulation
find <commitlog_directory> -maxdepth 1 -type f | wc -l
# Note: replace <commitlog_directory> with the path configured in cassandra.yaml

# Look for small flush sizes in logs (if system.log has no matches, check debug.log; some versions log this at DEBUG)
grep "Completed flushing" /var/log/cassandra/system.log | tail -n 20

# Check heap pressure
nodetool info | grep "Heap Memory"
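
For scripting or alerting, the MemtableFlushWriter check above can be reduced to a single number. The sketch below assumes the usual tpstats column order (Pool Name, Active, Pending, Completed, ...); flush_writer_pending is a hypothetical helper name.

```shell
# flush_writer_pending: reads `nodetool tpstats` output on stdin and prints
# the Pending column for the MemtableFlushWriter pool.
# Assumption: columns are Pool Name, Active, Pending, Completed, Blocked, ...
flush_writer_pending() {
  awk '$1 == "MemtableFlushWriter" { print $3 }'
}

# Usage on a node: nodetool tpstats | flush_writer_pending
```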

How to diagnose it

  1. Establish whether flush frequency is abnormal. Sample Memtable switch count from nodetool tablestats twice, 10 minutes apart, and compute flushes per minute. A flush rate that rises while write throughput stays flat indicates pressure-driven flushes rather than natural write volume.
  2. Identify hot tables. Compare Memtable switch count and Memtable data size across tables in nodetool tablestats. If one table dominates, the issue may be a write hotspot rather than undersized global limits.
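  3. Compare total memtable size to configured limits. Sum Memtable data size across all tables (tablestats reports bytes unless run with -H) and compare it with the space allowed by memtable_heap_space_in_mb and memtable_offheap_space_in_mb. If the total repeatedly climbs toward the configured space and drops right after a flush, flushes are memory-bound.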
  4. Inspect flush output in system logs. Look for Completed flushing with small byte counts. Repeated sub-hundred-megabyte flushes on a write-heavy node suggest thresholds are undersized.
  5. Correlate flush timing with compaction health. If PendingCompactions grows in step with flush activity, the flush rate has exceeded compaction throughput.
  6. Exclude repair and streaming. Check nodetool netstats and system logs for running repairs or bootstraps. These can inflate memtable activity independently of workload pressure.
  7. Check commitlog segment count. If segments are accumulating, flushes are not keeping up with commitlog rotation. Verify commitlog_total_space_in_mb is not forcing premature flushes.
  8. Review heap and GC metrics. If old-gen occupancy after GC is climbing, or if young collections are promoting aggressively, memory pressure may be forcing premature flushes even if write rates are flat.
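
Steps 1 and 3 can be sketched as two small helpers. Both names are hypothetical, and the parsing assumes tablestats lines of the form "Memtable switch count: 42" with plain numeric values (tablestats run without -H).

```shell
# sum_tablestats_field FIELD: reads `nodetool tablestats` output on stdin
# and sums the value of every line whose label is exactly FIELD.
# Assumption: lines look like "<whitespace>FIELD: <number>".
sum_tablestats_field() {
  awk -F': *' -v field="$1" '
    { label = $1; sub(/^[ \t]+/, "", label) }
    label == field { total += $2 }
    END { printf "%.0f\n", total }
  '
}

# flushes_per_minute FIRST SECOND SECONDS: flush rate between two
# switch-count samples taken SECONDS apart.
flushes_per_minute() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) * 60 / s }'
}

# Usage on a node:
#   before=$(nodetool tablestats | sum_tablestats_field "Memtable switch count")
#   sleep 600
#   after=$(nodetool tablestats | sum_tablestats_field "Memtable switch count")
#   flushes_per_minute "$before" "$after" 600
#   nodetool tablestats | sum_tablestats_field "Memtable data size"
```

The switch count is cumulative, so only the difference between two samples is meaningful; a restart between samples invalidates the rate.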

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Memtable switch count rate | Direct measure of flush frequency | Rate increasing without corresponding write rate increase |
| MemtableFlushWriter pending tasks | Indicates flush pipeline saturation | Pending > 0 sustained for more than a few minutes |
| SSTable count per table | Read amplification and compaction debt | Count growing steadily under STCS or L0 accumulation under LCS |
| Pending compactions | Compaction cannot keep up with SSTable creation | Trending upward over 4+ hours |
| Commitlog segment count | Commitlog cannot recycle segments because flushes are stalled | Segment count growing beyond baseline |
| Bloom filter memory usage | Off-heap pressure from high SSTable count | Growth correlating with SSTable count |
| Disk I/O utilization | Flush and compaction compete for sequential I/O | await elevated on data or commitlog devices |
| Heap usage after GC | Memory pressure forces premature flushing | Post-GC heap > 75% of max |
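
Where no JVM metrics pipeline is in place, heap usage can be approximated from nodetool info. This is only a rough proxy for the last row of the table: it reports current usage, not post-GC occupancy. The helper name heap_used_pct is hypothetical, and the parsing assumes a line of the form "Heap Memory (MB) : 512.00 / 1024.00".

```shell
# heap_used_pct: reads `nodetool info` output on stdin and prints heap
# usage as a whole percentage of max heap.
# Assumption: the line looks like "Heap Memory (MB) : <used> / <max>".
# Note: this is current usage, not post-GC occupancy.
heap_used_pct() {
  awk -F'[:/]' '/^Heap Memory/ && $3 > 0 { printf "%.0f\n", $2 / $3 * 100 }'
}

# Usage on a node: nodetool info | heap_used_pct
```

Because a single reading depends on where the JVM is in its GC cycle, sample it repeatedly and look at the low points rather than one value.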

Fixes

  • Raise memtable size limits. Increase memtable_heap_space_in_mb and memtable_offheap_space_in_mb if the node has headroom. Larger memtables produce fewer, bigger SSTables, which are more efficient to compact. Tradeoff: larger memtables hold writes in memory longer and increase the commitlog replay window. Requires restart. Warning: do not raise these beyond available heap or off-heap capacity; doing so risks OOM.
  • Reduce table count. If the schema has hundreds or thousands of tables, each active table holds a memtable. Consolidate tables or archive inactive ones. This is often the root cause in multi-tenant schemas and requires schema redesign.
  • Increase commitlog total space. If segments are recycling too aggressively and forcing flushes, raise commitlog_total_space_in_mb. This extends the window before commitlog back-pressure triggers an emergency flush. Requires restart.
  • Separate commitlog and data devices. If commitlog and data share a disk, commitlog appends compete with flush, compaction, and read I/O. Moving the commitlog to a separate volume removes that contention and often reduces flush latency. This is a provisioning change.
  • Evaluate memtable allocation type. If memtable_allocation_type is set to heap_buffers, memtables live on-heap and compete with caches and request processing. Switching to offheap_buffers or offheap_objects moves pressure off the heap, but ensure the node has sufficient native memory. Requires restart.
  • Tune flush writers cautiously. Increasing memtable_flush_writers allows more concurrent flushes, but flush threads compete with compaction for CPU and disk. Only raise this if iostat shows the flush device is underutilized and CPU has idle cores. Requires restart.
  • Increase compaction throughput. If flushes are necessary and small SSTables are already accumulating, temporarily raise the limit so compaction can drain the backlog. Use nodetool setcompactionthroughput or adjust compaction_throughput_mb_per_sec in cassandra.yaml. Monitor read latency, as aggressive compaction steals I/O from queries.
  • Address memory pressure. If heap is the constraint, reduce on-heap consumers such as the key cache (or the row cache, if it is configured with an on-heap provider) using nodetool setcachecapacity, or increase JVM heap size while staying within practical G1GC limits. If off-heap bloom filters are consuming space, check SSTable count, since bloom filter memory grows with SSTable count.
  • Consider Trie memtables (Cassandra 5.0+). Trie memtables reduce heap object count and can improve flush efficiency. They are often adopted together with the BTI (trie-indexed) SSTable format, which is a separate setting. Check the documentation for your version before enabling either, and allow time for existing SSTables to be rewritten on upgraded clusters.
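
The runtime compaction throughput change mentioned above can be wrapped in a small guard so that a typo does not reach the node. This is a sketch only: set_compaction_mbps and the NODETOOL override are hypothetical conveniences, while setcompactionthroughput and getcompactionthroughput are the underlying nodetool subcommands. A value of 0 disables throttling.

```shell
# Assumption: NODETOOL may be overridden, for example NODETOOL=echo for a dry run.
NODETOOL="${NODETOOL:-nodetool}"

# set_compaction_mbps MB_PER_SEC: validates that the argument is a whole
# number, then applies it at runtime. The change is not persisted; the node
# returns to the cassandra.yaml value on restart.
set_compaction_mbps() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: set_compaction_mbps <whole MB/s>" >&2
      return 2
      ;;
  esac
  "$NODETOOL" setcompactionthroughput "$1"
}

# Usage on a node:
#   "$NODETOOL" getcompactionthroughput   # note the current value first
#   set_compaction_mbps 64                # 64 is an illustrative value
```

Recording the previous value before raising the limit makes it straightforward to restore once the backlog has drained.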

Prevention

  • Size memtable limits during provisioning based on expected table count and write patterns.
  • Monitor memtable switch count rate as a leading indicator. A sudden uptick signals memory pressure before compaction debt becomes visible.
  • Avoid schema sprawl. Every active table claims a memtable.
  • Validate disk IOPS to ensure the storage can sustain both flush spikes and baseline compaction before deploying write-heavy workloads.
  • Track commitlog segment count. A steady-state node should maintain a bounded number of segments.
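
For the sizing point above, a back-of-the-envelope estimate can show how quickly table count erodes the space available to each memtable. The helper below is hypothetical and assumes an even split across actively written tables, which real workloads rarely match; treat the result as an order-of-magnitude guide.

```shell
# per_table_memtable_mb TOTAL_MB ACTIVE_TABLES
# Hypothetical helper: even-split estimate of memtable space per actively
# written table, using integer division.
# Assumption: every active table receives an equal share of TOTAL_MB.
per_table_memtable_mb() {
  [ "$2" -gt 0 ] || { echo "ACTIVE_TABLES must be greater than 0" >&2; return 2; }
  echo $(( $1 / $2 ))
}

# Example: per_table_memtable_mb 4096 200   (4096 MB shared by 200 tables)
```

If the estimate lands in the tens of megabytes, flushed SSTables are likely to be small regardless of how the other thresholds are tuned.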

How Netdata helps

  • Correlates disk I/O latency and utilization on commitlog and data volumes to pinpoint contention between flushes and compaction.
  • Tracks JVM heap usage and GC pause duration, surfacing memory pressure that forces premature flushes.
  • Monitors pending compaction tasks and SSTable count trends, catching the compaction backlog early.
  • Alerts on thread pool saturation, including MemtableFlushWriter pending tasks, before writes stall.
  • Surfaces off-heap memory growth alongside RSS, helping distinguish heap pressure from native memory limits.