Cassandra GC pauses too long: diagnosing G1 stop-the-world pauses
ReadTimeoutException and WriteTimeoutException from clients, GCInspector warnings in system.log, and nodes flapping between UP and DOWN in nodetool status without a JVM restart usually mean G1 is producing long stop-the-world pauses. Root causes include promotion pressure, humongous objects, and allocation bursts. Left unchecked, one node’s pauses trigger gossip failures, retries, and hint replay that degrade the whole cluster.
What this means
G1 is widely used with Cassandra 4.x on JDK 11+, but some 4.x packages still ship with CMS enabled in the JVM options files, so confirm which collector the node actually runs. During a stop-the-world pause, every application thread freezes, including gossip, native transport, and compaction. Cassandra logs GCInspector warnings when a pause exceeds the configured threshold, commonly 500 ms. Pauses longer than ~2 seconds cause gossip rounds to be missed; under the default phi accrual failure detector threshold of 8, sustained pauses of tens of seconds result in the node being marked DOWN. While the JVM is paused, mutations queue, reads stall, hints accumulate on peers, and clients retry. On recovery, hint replay and retry bursts raise allocation pressure, creating a self-reinforcing spiral.
flowchart TD
A[Heap pressure] --> B[Long G1 STW pause]
B --> C[Gossip missed]
C --> D[Node marked DOWN]
D --> E[Hints and retries]
E --> F[Replay burst on recovery]
F --> A
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Undersized heap or high promotion pressure | Heap floor after full GC stays above 70-75% of max; old-gen occupancy trending up | nodetool info and GC log “Heap after GC” lines |
| Large partition reads or tombstone scans | P99 latency spikes correlate with specific queries; tombstone warnings in logs | nodetool toppartitions and log greps |
| Humongous objects from undersized G1 regions | GC log shows frequent humongous allocations with auto-calculated regions | grep -i "humongous" /var/log/cassandra/gc.log* |
| Surge in allocation rate (batches, hint replay) | Young GC frequency rises sharply; hint dispatcher active after node recovery | jstat -gcutil YGC/YGCT columns; nodetool tpstats |
| Oversized on-heap caches | Heap floor is high but allocation rate is normal; key cache or row cache large | nodetool info cache lines; jstat -gcutil O column |
Quick checks
# Heap used vs max
nodetool info | grep -i "Heap Memory"
# Cumulative GC counts and time
nodetool gcstats
# Recent pause durations
grep -i "pause" /var/log/cassandra/gc.log* | tail -20
# Old gen occupancy live (O column)
jstat -gcutil <cassandra_pid> 1000
# Dropped messages indicating overload
nodetool tpstats
# Node liveness flapping
nodetool status
# Compaction backlog adding heap pressure
nodetool compactionstats
How to diagnose it
- Quantify pauses. Parse the GC log for G1 Young and G1 Old pause times. Look for GCInspector lines in system.log exceeding 500 ms. Sustained pauses over 2 seconds predict gossip disruption.
- Find the heap floor. After a full GC completes, check the “Heap after GC” value in the log, or sample old generation occupancy with jstat -gcutil. If the floor is above 70-75% of max, the heap is effectively full of long-lived objects.
- Map pauses to workload. Correlate pause timestamps with query patterns. Use nodetool toppartitions to identify large partitions being read. Search system.log for tombstone scan warnings around those times. Check nodetool tablestats for max partition size per table.
- Check G1 region sizing. If using G1 with auto-calculated regions, inspect the GC log for humongous object allocations. Auto-calculation can produce small regions that turn medium objects into humongous allocations. Set an explicit -XX:G1HeapRegionSize matched to your heap; it must be a power of two between 1 MB and 32 MB.
- Measure allocation rate. A climbing young generation collection count (YGC) with short intervals indicates high allocation pressure from batches, retries, or hint replay.
- Check cache footprint. Run nodetool info and review key cache and row cache size. Row cache is disabled by default and should generally stay disabled; if enabled and large, it competes for old generation space.
- Assess cluster-wide impact. Check nodetool status for flapping and nodetool tpstats for dropped MUTATION or READ messages. If dropped mutations are nonzero, the pause is already causing data inconsistency.
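The first two steps can be sketched as a pair of shell filters. This is a minimal sketch, assuming JDK 11+ unified GC logging in which each pause line ends with a duration in ms and carries a before->after(capacity) heap summary in megabytes. The function names long_pauses and heap_floor_pct are hypothetical, and JDK 8 style logs would need different patterns.

```shell
# Assumed line shape (JDK 11+ unified logging):
#   [..][info][gc] GC(812) Pause Young (Normal) (G1 Evacuation Pause) 5120M->3072M(8192M) 742.118ms

# long_pauses THRESHOLD_MS: read GC log lines on stdin, print pause lines above the threshold.
long_pauses() {
  awk -v limit="$1" '
    / Pause / && $NF ~ /^[0-9.]+ms$/ {
      ms = $NF; sub(/ms$/, "", ms)          # strip the unit, keep the number
      if (ms + 0 > limit + 0) print
    }'
}

# heap_floor_pct: read GC log lines on stdin, print heap occupancy after each
# full GC as a percentage of heap capacity.
heap_floor_pct() {
  awk '
    / Pause Full / {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^[0-9]+M->[0-9]+M\([0-9]+M\)$/) {
          split($i, p, /->|\(|\)/)          # p[1]=before, p[2]=after, p[3]=capacity
          printf "%.1f\n", 100 * (p[2] + 0) / (p[3] + 0)
        }
      }
    }'
}
```

For example, `cat /var/log/cassandra/gc.log* | long_pauses 500` would list pauses longer than the 500 ms warning level, and piping the same input to `heap_floor_pct` gives values to compare against the 70-75% floor.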
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| GC pause duration | Direct STW impact on availability | >500 ms (GCInspector warns); >2s risks gossip failure |
| Heap floor after full GC | Leading indicator of promotion pressure; old gen cannot be reclaimed | >70-75% of -Xmx sustained after old GC |
| Old generation % used (jstat -gcutil O) | Old gen fill rate between collections | Trending upward over days |
| Dropped messages (all types) | Node is shedding load while paused | Non-zero rate sustained for >60s |
| Node liveness transitions | Phi detector convicting node due to missed heartbeats | >2 UP/DOWN transitions in 10 minutes |
| Pending compactions | Compaction debt increases metadata and bloom filter heap pressure | Trending upward over >4 hours |
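The old generation signal in the table can be watched with a small filter over jstat output. This is a sketch, assuming the usual jstat -gcutil layout with a header row that names an O column. The function name old_gen_over and the 75 limit in the usage line are illustrative choices, not Cassandra or JDK settings.

```shell
# old_gen_over LIMIT_PCT: read `jstat -gcutil` output on stdin and report
# samples whose old generation occupancy (O column) exceeds the limit.
old_gen_over() {
  awk -v limit="$1" '
    col == 0 {                               # header row: locate the "O" column by name
      for (i = 1; i <= NF; i++) if ($i == "O") col = i
      next
    }
    $col + 0 > limit + 0 { print "old gen at " $col "%" }'
}

# Hypothetical usage against a live node:
#   jstat -gcutil <cassandra_pid> 1000 | old_gen_over 75
```

A single high sample is not conclusive; the table's warning sign is a trend over days, so this is most useful when the output is collected over time.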
Fixes
Immediate load reduction
If the node is flapping and dropping messages, run nodetool disablebinary to reject new native transport connections while keeping the node in the ring. This cuts allocation pressure without a restart. Warning: this drops client connectivity to the node; use it only when the node is already missing client deadlines. Identify the memory consumer before bouncing the process.
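The sequence above could be wrapped as two helpers. This is a minimal sketch, assuming nodetool is on the PATH and can reach the local node; the function names shed_client_load and restore_client_load are hypothetical.

```shell
# shed_client_load: stop the native transport so clients stop sending requests
# to this node, while the node stays in the ring.
shed_client_load() {
  nodetool disablebinary || return 1
  nodetool statusbinary      # report the current native transport state
  nodetool tpstats           # check whether pending and dropped counts stop climbing
}

# restore_client_load: accept client connections again once pauses subside.
restore_client_load() {
  nodetool enablebinary && nodetool statusbinary
}
```

Keeping the restore step next to the shed step makes it harder to leave a node silently unreachable for clients after the incident.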
Heap sizing
If the heap floor is above 75%, increase the heap only if it is below 16 GB. G1 pause duration scales with heap size and object graph complexity; heaps above 16 GB often see worse STW behavior under write load. Stay below the compressed-OOPs threshold (roughly 32 GB; 31 GB is a safe ceiling). Always set -Xms equal to -Xmx to avoid resize cost.
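The -Xms equals -Xmx rule is easy to verify mechanically. The sketch below assumes the heap is set with uncommented -Xms and -Xmx lines in a JVM options file; the function name check_heap_flags and the file path in the usage line are assumptions, and installations that size the heap elsewhere (for example through environment scripts) would need a different check.

```shell
# check_heap_flags FILE: compare the last uncommented -Xms and -Xmx lines in a JVM options file.
check_heap_flags() {
  xms=$(grep -E '^-Xms' "$1" | tail -n 1 | sed 's/^-Xms//')
  xmx=$(grep -E '^-Xmx' "$1" | tail -n 1 | sed 's/^-Xmx//')
  if [ -z "$xms" ] || [ -z "$xmx" ]; then
    echo "heap not set explicitly"
  elif [ "$xms" = "$xmx" ]; then
    echo "ok: -Xms and -Xmx both $xmx"
  else
    echo "mismatch: -Xms$xms vs -Xmx$xmx"
  fi
}

# Hypothetical usage (path is an assumption; adjust to your install):
#   check_heap_flags /etc/cassandra/jvm-server.options
```

The comparison is textual, so 16G and 16384M would be reported as a mismatch even though they are the same size.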
G1 tuning
If the heap is appropriately sized but pauses remain long:
- Set an explicit -XX:G1HeapRegionSize to avoid small auto-calculated regions that inflate humongous object overhead. Choose a power of two between 1 MB and 32 MB based on average object size.
- On hosts with many cores, raise -XX:ParallelGCThreads and -XX:ConcGCThreads if JVM defaults leave cores idle during STW phases.
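To see why auto-calculated regions can be small, the arithmetic can be approximated in shell. This is a rough sketch, assuming the JDK 11 heuristic of targeting about 2048 regions, rounding down to a power of two, and clamping to 1-32 MB, with objects of roughly half a region or more treated as humongous. Newer JDKs may compute this differently, so treat the output as an estimate and confirm the real region size in the GC log or with -XX:+PrintFlagsFinal. The function names are hypothetical.

```shell
# g1_region_mb HEAP_MB: approximate auto-selected G1 region size in MB
# (assumed heuristic: heap / 2048, rounded down to a power of two, clamped to 1..32).
g1_region_mb() {
  target=$(( $1 / 2048 ))
  size=1
  while [ $(( size * 2 )) -le "$target" ] && [ "$size" -lt 32 ]; do
    size=$(( size * 2 ))
  done
  echo "$size"
}

# humongous_threshold_kb REGION_MB: approximate size in KB at which an
# allocation becomes humongous (half of one region).
humongous_threshold_kb() {
  echo $(( $1 * 1024 / 2 ))
}
```

Under these assumptions an 8 GB heap gets 4 MB regions, so allocations around 2 MB are already humongous, and a 31 GB heap still lands on 8 MB regions because 15.5 MB rounds down to the next power of two.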
Workload and schema fixes
- Reduce batch statement sizes and limit in-flight requests.
- Fix large partitions identified by nodetool toppartitions; redesign data models to bound partition size.
- For TTL-heavy workloads, switch affected tables to TimeWindowCompactionStrategy so tombstones are efficiently dropped in whole time windows, reducing tombstone-induced GC pressure.
- Disable row cache if enabled; it stores full rows on-heap and is disabled by default for good reason.
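The compaction strategy change from the list above is a single CQL statement. The sketch below only prints the statement so it can be reviewed first; the function name twcs_cql, the keyspace and table names, and the one-day window in the usage line are hypothetical, and the window should be chosen to match the table's TTL and write pattern.

```shell
# twcs_cql KEYSPACE TABLE UNIT SIZE: print (not execute) an ALTER TABLE
# statement that switches a table to TimeWindowCompactionStrategy.
twcs_cql() {
  printf "ALTER TABLE %s.%s WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': '%s', 'compaction_window_size': %s};\n" "$1" "$2" "$3" "$4"
}

# Hypothetical usage: review the output, then run it through cqlsh.
#   cqlsh -e "$(twcs_cql metrics samples DAYS 1)"
```

Changing the strategy causes existing SSTables to be recompacted, which is itself a source of load, so it is best applied when the node is not already under GC pressure.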
Prevention
- Monitor the heap floor after full GC, not just peak heap usage. The floor is the irreducible minimum of long-lived objects; when it trends upward, promotion pressure is building.
- Keep G1 heaps in the 8-16 GB range for most workloads. Larger heaps increase STW duration under G1.
- Set explicit G1HeapRegionSize during provisioning rather than relying on JVM auto-calculation.
- Track pending compactions and SSTable count trends. Compaction backlog increases on-heap metadata and bloom filter overhead.
- Monitor tombstone warnings and partition size distributions with nodetool toppartitions or slow query logging to catch data-model issues before they trigger allocation spikes.
- Run repair during off-peak hours; the I/O and streaming load competes for resources and can prolong pauses.
How Netdata helps
- Correlate G1 Young and Old Generation CollectionTime with node liveness transitions and dropped messages on the same timeline to confirm GC-induced gossip failures.
- Alert on post-collection old-generation occupancy drawn from JMX to catch heap floor growth before pauses breach 500 ms.
- Visualize thread pool pending tasks and compaction queues alongside GC metrics to distinguish pure heap pressure from compaction-driven memory growth.
- Track client request timeout rates against GC pause duration to quantify the latency cost of each STW event.
- Monitor off-heap memory growth (RSS minus JVM heap) to catch Linux OOM kills caused by native memory exhaustion rather than GC pressure.
Related guides
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM: /guides/cassandra/cassandra-consistency-levels-explained/
- Cassandra GC death spiral: long pauses, gossip flapping, and recovery: /guides/cassandra/cassandra-gc-death-spiral/
- Cassandra monitoring checklist: the signals every production cluster needs: /guides/cassandra/cassandra-monitoring-checklist/
- Cassandra monitoring maturity model: from survival to expert: /guides/cassandra/cassandra-monitoring-maturity-model/
- How Cassandra actually works in production: a mental model for operators: /guides/cassandra/how-cassandra-works-in-production/







