Cassandra too many SSTables per table: read amplification and how to fix it
Your Cassandra read P99 latency is climbing. Queries that used to take milliseconds now time out. The application reports intermittent ReadTimeoutException. You check nodetool tablestats and see one table sitting at 200 SSTables. For LCS, that is a catastrophe. For STCS, anything sustained above 50 means compaction has fallen behind.
Each extra SSTable adds a bloom filter check and a potential disk seek to every read. The read path must merge fragments from memtables plus every SSTable that might contain the partition. When SSTable counts balloon, the node spends more time checking filters and seeking than returning data. This is read amplification.
nodetool compact can fix it, but running it blindly can temporarily double the table's disk usage and starve reads of I/O. Confirm the diagnosis, choose the right intervention, and fix the root cause so the count stays low.
What this means
The number of SSTables per table determines how many files a read must consult. More SSTables means more bloom filter checks, more index lookups, and more merge-sort work per query.
Healthy thresholds depend on your compaction strategy:
- LCS: levels above L0 grow by roughly 10x each, and L1 targets about 10 SSTables at default settings. L0 should stay below 32. Sustained counts above 100 indicate severe level imbalance.
- STCS: should stabilize below 32 in a healthy table. Sustained counts above 50 signal compaction debt. Above 100, reads are effectively broken.
- TWCS: old time windows should compact down to one SSTable. Multiple SSTables in old windows indicate problems.
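As a rough illustration, the numeric thresholds above can be encoded in a small shell helper. This is a sketch, not a Cassandra tool: the function name classify_sstable_count and the status labels (ok, watch, warning, critical) are hypothetical, and TWCS is left out because the list gives no numeric threshold for it.

```shell
# classify_sstable_count STRATEGY COUNT
# Hypothetical helper: maps a sustained per-table SSTable count to a rough
# status using the thresholds from the list above.
classify_sstable_count() {
    strategy=$1
    count=$2
    case "$strategy" in
        STCS)
            if [ "$count" -gt 100 ]; then echo "critical"   # reads effectively broken
            elif [ "$count" -gt 50 ]; then echo "warning"   # compaction debt
            elif [ "$count" -lt 32 ]; then echo "ok"        # healthy range
            else echo "watch"                               # between 32 and 50
            fi ;;
        LCS)
            if [ "$count" -gt 100 ]; then echo "critical"   # severe level imbalance
            else echo "ok"
            fi ;;
        *)
            echo "unknown" ;;                               # no numeric rule given
    esac
}
```

For example, classify_sstable_count STCS 60 prints warning. The count should be a sustained value, not a single sample taken during a flush burst.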
When compaction cannot keep up with the flush rate, SSTables accumulate. This creates a feedback loop: reads slow down, consuming I/O and CPU that compaction also needs, which makes compaction fall further behind.
flowchart TD
A[High write rate or slow compaction] --> B[SSTables accumulate]
B --> C[More bloom filters checked per read]
C --> D[Disk seeks and merge overhead increase]
D --> E[Read latency P99 spikes]
E --> F[Compaction starved of I/O]
F --> B
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Write rate exceeds compaction throughput | Pending compactions rising for days; SSTable count growing steadily; disk I/O near saturation | nodetool compactionstats and disk await |
| Compaction throttled too aggressively | Low disk I/O despite high SSTable count; throughput limit set too low | nodetool getcompactionthroughput |
| LCS L0 backlog | L0 SSTable count > 32; higher levels look balanced | nodetool tablestats SSTables in each level |
| Memtable flush pressure | Many tiny SSTables (few KB each); flush writers busy | nodetool tablestats memtable switch count |
| Repair or streaming burst | SSTable count spike after bootstrap, decommission, or repair | nodetool netstats and repair history |
| Wrong compaction strategy for workload | Read-heavy workload on STCS with runaway SSTable growth; or time-series data not using TWCS | Schema and access pattern review |
Quick checks
# SSTable count for a specific table
nodetool tablestats <keyspace> <table> | grep "SSTable count"
# Pending compaction tasks
nodetool compactionstats
# Disk I/O latency on data device
iostat -x 1
# Thread pool saturation
nodetool tpstats
# Live SSTable count via JMX
# bean: org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=LiveSSTableCount
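To scan every table on a node at once, the SSTable count check can be wrapped in an awk filter. This is a minimal sketch that assumes the tablestats layout in which each keyspace section starts with a "Keyspace :" line and each table with a "Table:" line; the layout can differ between Cassandra versions, and tables_over_threshold is a hypothetical helper name.

```shell
# tables_over_threshold LIMIT
# Reads `nodetool tablestats` output on stdin and prints "keyspace.table count"
# for every table whose SSTable count exceeds LIMIT.
tables_over_threshold() {
    awk -v limit="$1" '
        /^Keyspace *:/           { sub(/^Keyspace *: */, ""); ks = $0 }
        /^[ \t]*Table: /         { sub(/^[ \t]*Table: */, ""); tbl = $0 }
        /^[ \t]*SSTable count: / { n = $NF + 0; if (n > limit) print ks "." tbl, n }
    '
}

# Usage (run on a Cassandra node):
#   nodetool tablestats | tables_over_threshold 50
```

The filter only reads text, so it can also be run against saved tablestats output from several nodes.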
How to diagnose it
- Confirm the symptom. Run nodetool tablestats <keyspace> <table> and look for SSTable count. Compare it against your compaction strategy threshold (LCS > 100, STCS > 50 sustained).
- Determine if compaction is falling behind. Run nodetool compactionstats. If pending tasks are increasing over hours or days, the node is creating SSTables faster than it can merge them.
- Check disk I/O. Run iostat -x 1 on the data volume. If %util is above 90% or await is elevated, compaction is likely I/O-starved.
- Identify level imbalance for LCS. nodetool tablestats shows SSTables in each level. If L0 is swollen (for example, 100/4) while L1+ are balanced, L0 compaction is the bottleneck.
- Correlate with read latency. Check nodetool proxyhistograms or per-table coordinator latency. Rising P99 with stable write volume strongly suggests read amplification.
- Check for tiny SSTables. If memtables are flushing prematurely due to memory pressure, you will see many small SSTables that overwhelm compaction. Review Memtable switch count in nodetool tablestats.
- Verify disk space headroom. Run df -h on the data directory. If usage is above 50% with STCS or above 70% with LCS, compaction may be unable to allocate temporary space.
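The LCS level check in the steps above can be scripted. The sketch below assumes the level line looks like "SSTables in each level: [100/4, 10, 20, 0, ...]", where an entry over its target is printed as count/target; lcs_l0_count is a hypothetical helper name, not a nodetool feature.

```shell
# lcs_l0_count
# Reads `nodetool tablestats <keyspace> <table>` output on stdin and prints
# the L0 entry from the "SSTables in each level: [...]" line.
# An entry may be a bare count ("3") or "count/target" ("100/4").
lcs_l0_count() {
    awk '/SSTables in each level:/ {
        line = $0
        sub(/.*\[/, "", line)        # drop everything up to "["
        split(line, levels, ",")     # levels[1] is the L0 entry
        split(levels[1], l0, "/")    # l0[1] is the count before any "/"
        print l0[1] + 0
        exit
    }'
}

# Usage (run on a Cassandra node):
#   nodetool tablestats <keyspace> <table> | lcs_l0_count
```

If the printed value stays above 32 while the higher levels look balanced, L0 compaction is the likely bottleneck.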
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| LiveSSTableCount | Direct read amplification indicator | LCS > 100 or STCS > 50 sustained |
| Pending compactions | Leading indicator of compaction debt | Trending upward over 4+ hours |
| Disk I/O await | Compaction and reads compete for I/O | SSD await > 10ms or HDD > 50ms sustained |
| Read latency P99 | Client-visible impact | P99 > 3x baseline sustained |
| File descriptor usage | Each SSTable opens ~6 FDs | Open FDs > 80% of ulimit |
| Off-heap memory | Bloom filters and compression metadata scale with SSTable count | RSS minus heap trending up with SSTable count |
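The file descriptor row lends itself to a quick back-of-the-envelope estimate. The sketch below multiplies a node's total SSTable count by the ~6 FDs per SSTable from the table and compares the result with the 80% warning level. It ignores sockets and other non-SSTable descriptors, so treat it as a lower bound; estimate_fd_pressure is a hypothetical helper name.

```shell
# estimate_fd_pressure TOTAL_SSTABLES FD_LIMIT
# Prints "<estimated FDs> <percent of limit> <status>" using the rule of
# thumb of ~6 file descriptors per SSTable and an 80% warning level.
estimate_fd_pressure() {
    sstables=$1
    limit=$2
    est=$((sstables * 6))            # rough FD estimate for SSTables only
    pct=$((est * 100 / limit))       # integer percent of the limit
    if [ "$pct" -gt 80 ]; then status=warning; else status=ok; fi
    echo "$est $pct $status"
}
```

The FD limit of the running process can be read with ulimit -n in the shell that starts Cassandra, assuming the service manager does not override it.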
Fixes
Compaction throughput too low
If compaction_throughput_mb_per_sec (compaction_throughput in Cassandra 4.1 and later) is set too low for your write volume, raise it temporarily:
# Check current limit
nodetool getcompactionthroughput
# Increase to 256 MB/s (reverts to the cassandra.yaml value on restart)
nodetool setcompactionthroughput 256
Tradeoff: Higher throughput steals I/O bandwidth from reads. Increase it only when reads are already degraded and you need compaction to catch up.
Emergency manual compaction
If a single table is critically bloated and disk space permits:
# WARNING: This creates a new full SSTable alongside old ones.
# Ensure you have at least 30-50% free disk space before running.
nodetool compact <keyspace> <table>
Tradeoff: nodetool compact holds old and new SSTables on disk simultaneously. On a nearly full disk it can trigger compaction failure or node instability. With STCS it merges all of the table's SSTables into a single large one. Use this as a bridge, not a cure.
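One way to avoid running a major compaction on a nearly full disk is to gate it on free space. The following is a sketch under stated assumptions: it parses POSIX df -P output, the data path in the usage comment is only an example location, and disk_used_percent and safe_to_compact are hypothetical helper names.

```shell
# disk_used_percent
# Reads `df -P <data_dir>` output on stdin and prints the Capacity column
# of the first filesystem line without the trailing "%".
disk_used_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# safe_to_compact USED_PERCENT MIN_FREE_PERCENT
# Succeeds (exit status 0) when free space is at least MIN_FREE_PERCENT.
safe_to_compact() {
    [ $((100 - $1)) -ge "$2" ]
}

# Usage (the data directory path is an assumption; adjust to your layout):
#   used=$(df -P /var/lib/cassandra/data | disk_used_percent)
#   if safe_to_compact "$used" 50; then
#       nodetool compact <keyspace> <table>
#   else
#       echo "only $((100 - used))% free, skipping major compaction" >&2
#   fi
```

The 50% floor in the usage example is the conservative end of the 30-50% range mentioned above; pick the value that matches your compaction strategy and table size.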
Wrong compaction strategy
If STCS cannot keep up on a read-heavy workload, plan a migration to LCS or UCS (Cassandra 5.0+). Changing strategy triggers a full recompaction, which is disruptive. Schedule it during a maintenance window after verifying disk space.
LCS L0 backlog
If L0 is swollen but higher levels are healthy, increase concurrent_compactors in cassandra.yaml (requires restart) if CPU allows, and verify sstable_size_in_mb is not set below 64 MB. Small SSTables flood L0 faster than compaction can drain it.
Hinted handoff or repair debris
If the spike followed a node recovery or repair, allow compaction to settle. If hints are replaying aggressively and creating compaction debt, reduce hinted_handoff_throttle_in_kb in cassandra.yaml.
Prevention
- Monitor the derivative of pending compaction tasks, not just the absolute value. An increasing trend over 24 hours is a leading indicator.
- Maintain disk headroom: > 50% free for STCS, > 30% for LCS/TWCS. Compaction cannot run without temporary space.
- Place commitlog and data directories on separate devices. Shared I/O between commitlog writes and compaction reads/writes is a common bottleneck.
- Review tables with sustained LiveSSTableCount growth weekly. Catching compaction debt at 30 SSTables is easier than at 300.
- For time-series workloads with TTL, use TWCS so expired windows drop as units instead of requiring full compaction merges.
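Watching the trend of pending compactions rather than a single reading can be approximated with two small filters. This is a sketch, assuming nodetool compactionstats prints a leading "pending tasks: N" line and that samples are collected by some external scheduler such as cron; pending_tasks and pending_trend are hypothetical helper names.

```shell
# pending_tasks
# Reads `nodetool compactionstats` output on stdin and prints the number
# from the first "pending tasks:" line.
pending_tasks() {
    awk -F': *' '/^pending tasks:/ { print $2 + 0; exit }'
}

# pending_trend
# Reads one pending-task sample per line (oldest first) on stdin and prints
# the average change per sample interval followed by a verdict.
pending_trend() {
    awk '
        NF { v[++n] = $1 + 0 }
        END {
            if (n < 2) { print "insufficient data"; exit }
            slope = (v[n] - v[1]) / (n - 1)
            printf "%.1f %s\n", slope, (slope > 0 ? "rising" : "not rising")
        }'
}

# Usage: append a sample periodically, then evaluate the recent history.
#   nodetool compactionstats | pending_tasks >> pending.log
#   tail -n 24 pending.log | pending_trend
```

The slope here is a simple first-to-last average, which is enough to tell a steady climb from noise; a monitoring system with proper trend detection will be more robust to spikes.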
How Netdata helps
- Correlate per-table LiveSSTableCount with read latency P99 and disk await on one timeline to confirm read amplification.
- Track compaction pending tasks with automatic trend detection to surface backlog before it becomes an incident.
- Monitor off-heap memory growth alongside SSTable count to catch bloom filter expansion before it triggers OOM.
- Alert on file descriptor usage approaching the ulimit.
- Flag nodes with read latency deviations from the cluster median to isolate SSTable bloat.
Related guides
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM: /guides/cassandra/cassandra-consistency-levels-explained/
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery: /guides/cassandra/cassandra-gc-death-spiral/
- Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses: /guides/cassandra/cassandra-gc-pauses-too-long/
- Cassandra heap pressure: sizing the JVM heap and tuning G1GC: /guides/cassandra/cassandra-heap-pressure-tuning/
- Cassandra monitoring checklist: the signals every production cluster needs: /guides/cassandra/cassandra-monitoring-checklist/
- Cassandra monitoring maturity model: from survival to expert: /guides/cassandra/cassandra-monitoring-maturity-model/
- Cassandra java.lang.OutOfMemoryError: Java heap space - causes and recovery: /guides/cassandra/cassandra-out-of-memory-error/
- Cassandra pending compactions growing: the compaction backlog runbook: /guides/cassandra/cassandra-pending-compactions-growing/
- How Cassandra actually works in production: a mental model for operators: /guides/cassandra/how-cassandra-works-in-production/







