Cassandra Batch too large warning: oversized batches and coordinator OOM
Oversized BEGIN BATCH statements cause warnings of the form "Batch for [ks.table] is of size N, exceeding specified threshold of M by ..." and, eventually, coordinator OutOfMemoryError. Unlike single-partition batches, which provide atomicity within one partition, multi-partition batches force the coordinator to hold mutation buffers for every affected partition until all replicas acknowledge. When the buffers grow large enough, they trigger heap pressure, long GC pauses, and eventual OOM.
Cassandra batches are not a bulk-loading optimization. A logged batch spanning many partitions requires the coordinator to write a batchlog to two additional nodes before forwarding mutations, then retain every mutation in memory until each replica responds. The batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb settings in cassandra.yaml exist to protect the coordinator from this memory pressure. Treat every batch size warning as a pre-incident signal: once the coordinator heap fills, Old Generation collections lengthen, gossip heartbeats miss their phi accrual threshold, and peers mark the node DOWN. After recovery, client retries and hinted handoff replays drive further GC pressure in a feedback loop.
What this means
When a batch arrives, the coordinator serializes every contained mutation into heap memory. For a logged batch, it first writes the batchlog to two other nodes to guarantee atomicity. It then routes each partition’s mutations to the replicas that own the corresponding token ranges. The coordinator buffers all mutations until acknowledgements return from every replica. Multi-partition batches multiply this cost: each distinct partition key triggers separate coordination and memory allocation.
As the batch grows, it consumes JVM heap. A full Old Generation GC pause stops the world. During the pause, the node cannot gossip, so the phi accrual failure detector marks it DOWN. Clients time out and retry. Other nodes store hints. When the coordinator recovers, the retry storm plus hint replay create a feedback loop that drives further GC pressure. The result is a GC Death Spiral that ends in OOM or indefinite flapping.
flowchart TD
A[Client sends multi-partition batch] --> B[Coordinator serializes mutations]
B --> C{Logged batch?}
C -->|Yes| D[Write batchlog to 2 peers]
C -->|No| E[Route to replica nodes]
D --> E
E --> F[Coordinator buffers mutations until all ACKs return]
F --> G[Batch size exceeds threshold]
G --> H[Log WARN: Batch for ks.table is of size N...]
H --> I[Heap pressure on coordinator]
I --> J[Long GC pauses]
J --> K[Gossip marks node DOWN]
K --> L[Client retries and hint replay]
L --> I
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Multi-partition logged batch from application | Exact WARN pattern in system.log; coordinator heap rises before GC spikes | Application CQL for BEGIN BATCH without UNLOGGED spanning multiple partition keys |
| Client driver bulk-load misconfiguration | Uniform batch sizes in logs; often from ETL or stream ingestion | Driver batching settings or custom batch builders that accumulate rows |
| Batches used as a bulk insert optimization | High write latency on the coordinator despite fast replicas; batchlog overhead visible | Whether the code uses batches for throughput rather than atomic single-partition updates |
Quick checks
These checks are read-only and safe to run on a live coordinator.
# Check for batch size warnings in system logs
grep "Batch for.*exceeding specified threshold" /var/log/cassandra/system.log | tail -20
# Check coordinator heap usage
nodetool info | grep -i "Heap Memory"
# Check for dropped mutations and blocked thread pools
nodetool tpstats
# Check coordinator write latency distribution
nodetool proxyhistograms
# Identify connected clients (Cassandra 4.0+)
cqlsh -e "SELECT address, port, driver_name, driver_version FROM system_views.clients;"
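# Check GC logs for pauses longer than 200 ms
# (assumes the duration, e.g. 512.3ms, is the last field on the line; "+ 0" forces a numeric comparison)
grep -i "pause" /var/log/cassandra/gc.log* | awk '$NF + 0 > 200' | tail -20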
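To see which tables produce the warnings, the grep output can be tallied per table. The following is a minimal sketch, assuming the warning text matches the pattern quoted in this guide; the exact wording varies between Cassandra versions, and the sample log lines and the `count_batch_warnings` helper are hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the warning text quoted in this guide. Adjust it if your
# Cassandra version words the message differently.
WARN_RE = re.compile(
    r"Batch for \[([^\]]+)\] is of size (\S+), exceeding specified threshold"
)

def count_batch_warnings(lines):
    """Return a Counter mapping each keyspace.table to its number of warnings."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            # The bracketed part may list several tables separated by commas.
            for table in match.group(1).split(","):
                counts[table.strip()] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample_lines = [
    "WARN  [Native-Transport-Requests-1] BatchStatement.java - Batch for [shop.orders] is of size 7168, exceeding specified threshold of 5120 by 2048.",
    "WARN  [Native-Transport-Requests-2] BatchStatement.java - Batch for [shop.orders, shop.events] is of size 9216, exceeding specified threshold of 5120 by 4096.",
    "INFO  [CompactionExecutor-1] CompactionTask.java - Compacted 4 sstables",
]

if __name__ == "__main__":
    for table, n in count_batch_warnings(sample_lines).most_common():
        print(table, n)
```

In practice you would pass the lines of system.log (for example, an open file object) instead of `sample_lines`.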
How to diagnose it
1. Confirm the warning pattern. Look for "Batch for [ks.table] is of size N, exceeding specified threshold of M by ..." in system.log. Note the keyspace, table, and reported size. Sustained warnings mean the application is continuously emitting oversized batches.
2. Correlate with coordinator heap pressure. Run nodetool info and compare heap usage against the max. If used heap is above 80% of max and climbing, and the timestamps align with batch warnings, the batches are the likely allocation source.
3. Check GC behavior. Parse GC logs for Old Generation pauses longer than 500 ms. G1 GC pauses appear as Pause Full or Pause Young; CMS pauses appear as concurrent mode failure or promotion failed. If pauses correlate with warning timestamps, the coordinator is entering the GC Death Spiral.
4. Check for load shedding. Run nodetool tpstats. In the Dropped section, a rising MUTATION counter means the node is shedding load. In the Thread Pools section, check MutationStage and Native-Transport-Requests: if pending tasks are consistently non-zero while active threads are at the pool maximum, the write path is saturated.
5. Identify the client source. On Cassandra 4.0+, query system_views.clients to find which application hosts are connected. Map the client address back to an application instance using your infrastructure metadata. On earlier versions, check network connections with ss -tnp | grep 9042 or application-side connection logs. If multiple clients share an IP behind NAT, check application query logs instead.
6. Determine batch scope. Review application code for the offending table. Count how many partition keys the batch touches. If it is more than one, it is a multi-partition batch. If the partition key is composite, ensure all components match across every statement in the batch. Check whether the code uses BEGIN BATCH (logged) or BEGIN UNLOGGED BATCH. Logged batches carry extra coordination overhead.
7. Review threshold configuration. Check cassandra.yaml for batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb. Defaults are typically 5 KB warn and 50 KB fail. If an operator previously raised them to suppress noise, treat that as a smoking gun and revert.
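The heap comparison described above can be scripted so it is easy to repeat while you correlate timestamps. This is a rough sketch, assuming nodetool info prints a line of the form "Heap Memory (MB) : used / max"; the sample output and the `heap_used_percent` helper are hypothetical, and the 80% figure is the guideline from this guide rather than a hard limit.

```python
import re

# Anchored at line start so the "Off Heap Memory (MB)" line is not matched.
HEAP_RE = re.compile(
    r"^Heap Memory \(MB\)\s*:\s*([\d.]+)\s*/\s*([\d.]+)", re.MULTILINE
)

def heap_used_percent(nodetool_info_output):
    """Return used heap as a percentage of max heap, parsed from nodetool info text."""
    match = HEAP_RE.search(nodetool_info_output)
    if match is None:
        raise ValueError("no 'Heap Memory (MB)' line found")
    used_mb = float(match.group(1))
    max_mb = float(match.group(2))
    return 100.0 * used_mb / max_mb

# Hypothetical nodetool info excerpt, for illustration only.
sample_output = """\
Gossip active          : true
Heap Memory (MB)       : 6800.00 / 8000.00
Off Heap Memory (MB)   : 12.50
"""

if __name__ == "__main__":
    pct = heap_used_percent(sample_output)
    flag = "above" if pct > 80.0 else "below"
    print(f"heap used: {pct:.1f}% ({flag} the 80% guideline)")
```

A single reading says little on its own. Sample it repeatedly and compare the trend against the warning timestamps.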
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Batch warning rate in logs | Direct indicator of oversized batches before OOM | Any sustained Batch for ... exceeding specified threshold messages |
| JVM heap used / max | Coordinator must buffer all batch mutations in heap | Used heap trending above 80% of max, especially after Old GC |
| GC pause duration | Long pauses block gossip and request processing | Old Generation pauses > 500 ms |
| Dropped MUTATION messages | Node is shedding load it cannot process | Non-zero or rising rate in nodetool tpstats |
| Coordinator write latency P99 | Reflects batch coordination and batchlog overhead | P99 write latency spiking while local replica latency stays flat |
| MutationStage pending tasks | Backpressure on the write path | Pending tasks > 0 sustained for more than 60 seconds |
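As an illustration of the GC pause signal in the table above, the sketch below pulls pause durations out of GC log lines and keeps those over 500 ms. It assumes JDK unified-logging style lines that end in a duration such as 2350.123ms; older JVMs and different logging flags produce other formats, and the sample lines and the `long_pauses` helper are hypothetical.

```python
import re

# Matches a pause kind and the trailing duration in milliseconds.
PAUSE_RE = re.compile(r"Pause (Young|Full|Remark|Cleanup).*?(\d+(?:\.\d+)?)ms\s*$")

def long_pauses(lines, threshold_ms=500.0):
    """Return (kind, duration_ms) for each pause line longer than threshold_ms."""
    found = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and float(match.group(2)) > threshold_ms:
            found.append((match.group(1), float(match.group(2))))
    return found

# Hypothetical GC log lines, for illustration only.
sample_gc_lines = [
    "[2024-01-01T12:00:00.000+0000][info][gc] GC(11) Pause Young (Normal) (G1 Evacuation Pause) 4100M->3900M(8192M) 45.200ms",
    "[2024-01-01T12:00:05.000+0000][info][gc] GC(12) Pause Full (Allocation Failure) 7800M->7600M(8192M) 2350.123ms",
]

if __name__ == "__main__":
    for kind, ms in long_pauses(sample_gc_lines):
        print(f"{kind} pause of {ms} ms exceeds threshold")
```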
Fixes
Stop the immediate bleed
If a coordinator is in a GC Death Spiral, disable native transport to stop new batches from arriving while you investigate. This interrupts all client traffic to the node.
# Dangerous: interrupts all client traffic to this node
nodetool disablebinary
Only use this when the node is already flapping between UP and DOWN and you need to break the retry storm. Once the heap recovers, re-enable with nodetool enablebinary.
Fix the client write pattern
Raising batch_size_warn_threshold_in_kb or batch_size_fail_threshold_in_kb masks the symptom and moves OOM risk to a higher number. Do not raise thresholds to accommodate a misbehaving client.
Replace multi-partition batches with individual async writes. Use your driver’s async execution to fire multiple independent writes in parallel. This distributes coordination across all replica nodes instead of concentrating memory pressure on one coordinator. Batches are not a performance optimization in Cassandra. Use them only when you need atomicity within a single partition.
If atomicity is required, restrict the batch to a single partition. Single-partition batches are safe because the coordinator only coordinates with replicas that own one token range. The memory footprint is bounded and predictable.
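One way to keep batches single-partition is to group pending rows by their partition key before building any batch. The sketch below shows only the grouping step on plain dictionaries; the `group_by_partition` helper, the column names, and the rows are hypothetical, and building and executing the actual batch statements is left to your driver.

```python
from collections import defaultdict

def group_by_partition(rows, partition_key_columns):
    """Group row dicts by partition key values.

    Each resulting group touches exactly one partition, so it can become
    one single-partition batch.
    """
    groups = defaultdict(list)
    for row in rows:
        # A composite partition key only matches when every component matches.
        key = tuple(row[col] for col in partition_key_columns)
        groups[key].append(row)
    return dict(groups)

# Hypothetical rows for a table whose partition key is (tenant_id, day).
rows = [
    {"tenant_id": "a", "day": "2024-01-01", "seq": 1},
    {"tenant_id": "a", "day": "2024-01-01", "seq": 2},
    {"tenant_id": "b", "day": "2024-01-01", "seq": 1},
    {"tenant_id": "a", "day": "2024-01-02", "seq": 1},
]

if __name__ == "__main__":
    for key, members in group_by_partition(rows, ("tenant_id", "day")).items():
        print(key, len(members))
```

A group with many rows may still need to be split into smaller chunks so that each batch stays under the size thresholds.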
Switch from logged to unlogged only if atomicity is unnecessary. Unlogged batches skip the batchlog write to two additional nodes, which removes some overhead. However, the coordinator still buffers mutations for every partition until acknowledgements arrive. Unlogged batches reduce but do not eliminate coordinator memory pressure for multi-partition workloads.
Add client-side backpressure. If the application generates batches from a streaming source, implement rate limiting or bounded queues so that row accumulation cannot grow without bound. If you use the Java driver, configure a request throttler or place a semaphore around session.executeAsync() to bound in-flight requests. Without backpressure, async writes can shift overload from the coordinator to the client and the cluster.
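The semaphore idea can be sketched independently of any driver. In the example below, `write_row` is a hypothetical stand-in for a driver's async execute call, and `max_in_flight` is an assumed tuning knob, not a recommended value. The peak counter exists only to demonstrate that the bound holds.

```python
import asyncio

async def write_row(row):
    """Hypothetical stand-in for a driver's async write; replace with a real call."""
    await asyncio.sleep(0.001)  # simulate a network round trip
    return row

async def write_all(rows, max_in_flight=32):
    """Issue one independent write per row with at most max_in_flight running at once.

    Returns the write results and the peak number of concurrent writes observed.
    """
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def guarded(row):
        nonlocal in_flight, peak
        async with semaphore:  # waits here when the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await write_row(row)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(row) for row in rows))
    return results, peak

if __name__ == "__main__":
    results, peak = asyncio.run(write_all(list(range(100)), max_in_flight=8))
    print(len(results), "writes completed, peak in-flight:", peak)
```

This sketch still materializes every pending row up front. For an unbounded stream, pair the semaphore with a bounded queue so that accumulation is capped as well.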
Prevention
Educate developers that Cassandra batches are for atomicity, not throughput. Flag any BEGIN BATCH that touches multiple partition keys during code review.
Monitor batch warnings as a first-class signal. Treat any sustained batch size warning as a ticket-level finding.
Load test with realistic data sizes. Behavior that looks safe in development with 10 rows per batch can become catastrophic in production with 10,000 rows.
Keep thresholds at conservative defaults. Fix the client instead of raising batch limits.
Review ETL and migration jobs separately from application code. Batch misuse often appears in one-off scripts that use the same CQL driver but lack production tuning.
How Netdata helps
- Correlate Cassandra log warnings with JVM heap usage charts to confirm that batch size spikes precede memory pressure.
- Monitor GC pause duration alongside batch warning events to detect the GC Death Spiral before gossip marks the node DOWN.
- Alert on dropped mutation rates from nodetool tpstats as a lagging indicator that the coordinator is shedding load.
- Track coordinator write latency percentiles to spot batch-induced tail latency before client timeouts trigger.
- Surface sudden connection count changes that may indicate a misbehaving client driver or batch loader connecting to the cluster.
Related guides
- Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup
- Cassandra node stuck in joining (UJ): bootstrap diagnosis
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra commitlog disk full: segment exhaustion and forced flushes
- Cassandra commitlog pending tasks: write-path I/O pressure
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding
- Cassandra dropped reads and other messages: reading nodetool tpstats Dropped







