Cassandra Batch too large warning: oversized batches and coordinator OOM

Oversized BEGIN BATCH statements cause "Batch for [ks.table] is of size N, exceeding specified threshold of M by ..." warnings and, eventually, OutOfMemoryError on the coordinator. Unlike single-partition batches, which provide atomicity within one partition, multi-partition batches force the coordinator to hold mutation buffers for every affected partition until all replicas acknowledge. When those buffers grow large enough, they cause heap pressure, long GC pauses, and eventual OOM.

Cassandra batches are not a bulk-loading optimization. A logged batch spanning many partitions requires the coordinator to write a batchlog to two additional nodes before forwarding mutations, then retain every mutation in memory until each replica responds. The batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb settings in cassandra.yaml exist to protect the coordinator from this memory pressure. Treat every batch size warning as a pre-incident signal: once the coordinator heap fills, Old Generation collections lengthen, gossip heartbeats miss their phi accrual threshold, and peers mark the node DOWN. After recovery, client retries and hinted handoff replays drive further GC pressure in a feedback loop.

What this means

When a batch arrives, the coordinator serializes every contained mutation into heap memory. For a logged batch, it first writes the batchlog to two other nodes to guarantee atomicity. It then routes each partition’s mutations to the replicas that own the corresponding token ranges. The coordinator buffers all mutations until acknowledgements return from every replica. Multi-partition batches multiply this cost: each distinct partition key triggers separate coordination and memory allocation.

Larger batches consume more JVM heap. Once the heap fills, a full Old Generation collection stops the world. During the pause, the node cannot gossip, so the phi accrual failure detector on its peers marks it DOWN. Clients time out and retry. Other nodes store hints. When the coordinator recovers, the retry storm plus hint replay create a feedback loop that drives further GC pressure. The result is a GC Death Spiral that ends in OOM or indefinite flapping.

flowchart TD
    A[Client sends multi-partition batch] --> B[Coordinator serializes mutations]
    B --> C{Logged batch?}
    C -->|Yes| D[Write batchlog to 2 peers]
    C -->|No| E[Route to replica nodes]
    D --> E
    E --> F[Coordinator buffers all mutations until ACKs arrive]
    F --> G[Batch size exceeds threshold]
    G --> H[Log WARN: Batch for ks.table is of size N...]
    H --> I[Heap pressure on coordinator]
    I --> J[Long GC pauses]
    J --> K[Gossip marks node DOWN]
    K --> L[Client retries and hint replay]
    L --> I

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Multi-partition logged batch from application | Exact WARN pattern in system.log; coordinator heap rises before GC spikes | Application CQL for BEGIN BATCH without UNLOGGED spanning multiple partition keys |
| Client driver bulk-load misconfiguration | Uniform batch sizes in logs; often from ETL or stream ingestion | Driver batching settings or custom batch builders that accumulate rows |
| Batches used as a bulk insert optimization | High write latency on the coordinator despite fast replicas; batchlog overhead visible | Whether the code uses batches for throughput rather than atomic single-partition updates |

Quick checks

These checks are read-only and safe to run on a live coordinator.

# Check for batch size warnings in system logs
grep "Batch for.*exceeding specified threshold" /var/log/cassandra/system.log | tail -20

# Check coordinator heap usage
nodetool info | grep -i "Heap Memory"

# Check for dropped mutations and blocked thread pools
nodetool tpstats

# Check coordinator write latency distribution
nodetool proxyhistograms

# Identify connected clients (Cassandra 4.0+)
cqlsh -e "SELECT address, port, driver_name, driver_version FROM system_views.clients;"

# Check GC logs for pauses longer than 200 ms
# (assumes JDK 9+ unified GC logging, where pause lines end in a duration such as "312.456ms")
grep -hi "pause" /var/log/cassandra/gc.log* | awk '($NF + 0) > 200' | tail -20
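
The grep check for batch warnings lists individual messages but does not show which tables or time windows dominate. The following is a minimal sketch that groups them. It assumes the default logback layout, where each log line carries a timestamp such as 2024-05-01 10:00:01,123, and the function names are illustrative, not part of any Cassandra tooling.

```shell
# Count batch size warnings per table (or table set), most frequent first.
# Reads the files given as arguments, or stdin when none are given.
batch_warnings_by_table() {
  grep -h 'Batch for .*exceeding specified threshold' "$@" |
    sed -n 's/.*Batch for \[\([^]]*\)].*/\1/p' |
    awk '{ count[$0]++ } END { for (t in count) print count[t], t }' |
    sort -rn
}

# Count batch size warnings per minute, oldest first, so the result can be
# lined up against heap and GC pause timestamps.
# Assumption: timestamps look like "YYYY-MM-DD HH:MM:SS,mmm".
batch_warnings_per_minute() {
  grep -h 'Batch for .*exceeding specified threshold' "$@" |
    sed -n 's/^.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} [0-9]\{2\}:[0-9]\{2\}\):[0-9]\{2\}.*$/\1/p' |
    sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

For example, batch_warnings_by_table /var/log/cassandra/system.log prints one "count table" line per offending table, and batch_warnings_per_minute on the same file prints "date time count" lines.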

How to diagnose it

  1. Confirm the warning pattern. Look for "Batch for [ks.table] is of size N, exceeding specified threshold of M by ..." in system.log. Note the keyspace, table, and reported size. Sustained warnings mean the application is continuously emitting oversized batches.

  2. Correlate with coordinator heap pressure. Run nodetool info and compare heap usage against the max. If used heap is above 80% of max and climbing, and the timestamps align with batch warnings, the batches are the likely allocation source.

  3. Check GC behavior. Parse GC logs for Old Generation pauses longer than 500 ms. With G1, stop-the-world events are logged as Pause Young or Pause Full, and a Pause Full is the one that signals an exhausted heap; with CMS, look for concurrent mode failure or promotion failed. If long pauses correlate with warning timestamps, the coordinator is entering the GC Death Spiral.

  4. Check for load shedding. Run nodetool tpstats. In the Dropped section, a rising MUTATION counter means the node is shedding load. In the Thread Pools section, check MutationStage and Native-Transport-Requests: if pending tasks are consistently non-zero while active threads are at the pool maximum, the write path is saturated.

  5. Identify the client source. On Cassandra 4.0+, query system_views.clients to find which application hosts are connected. Map the client address back to an application instance using your infrastructure metadata. On earlier versions, check network connections with ss -tnp | grep 9042 or application-side connection logs. If multiple clients share an IP behind NAT, check application query logs instead.

  6. Determine batch scope. Review application code for the offending table. Count how many partition keys the batch touches. If it is more than one, it is a multi-partition batch. If the partition key is composite, ensure all components match across every statement in the batch. Check whether the code uses BEGIN BATCH (logged) or BEGIN UNLOGGED BATCH. Logged batches carry extra coordination overhead.

  7. Review threshold configuration. Check cassandra.yaml for batch_size_warn_threshold_in_kb and batch_size_fail_threshold_in_kb. On Cassandra 4.1 and later the same settings are named batch_size_warn_threshold and batch_size_fail_threshold and take a unit suffix. Defaults are typically 5 KB warn and 50 KB fail. If an operator previously raised them to suppress noise, treat that as a smoking gun and revert.
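
Steps 5 and 7 can be sketched as two small helpers. This is an illustration under stated assumptions rather than a definitive tool: the function names are made up, /etc/cassandra/cassandra.yaml is only a common package-install path, and the first helper expects plain `ss -tn` output, where the fourth and fifth columns are the local and peer address.

```shell
# Count established native-protocol connections per client address.
# Expects `ss -tn` output on stdin; 9042 is the default native transport port.
clients_by_connection_count() {
  awk -v port="${1:-9042}" '
    $4 ~ (":" port "$") {
      peer = $5
      sub(/:[0-9]+$/, "", peer)   # strip the client-side ephemeral port
      conns[peer]++
    }
    END { for (p in conns) print conns[p], p }
  ' | sort -rn
}

# Print the active (uncommented) batch size threshold settings.
# The pattern matches both the *_in_kb names and the unit-suffixed 4.1+ names.
# Assumption: the default path below; pass the real path as the first argument.
show_batch_thresholds() {
  grep -E '^[[:space:]]*batch_size_(warn|fail)_threshold' "${1:-/etc/cassandra/cassandra.yaml}"
}
```

Typical use would be ss -tn | clients_by_connection_count on the coordinator, followed by show_batch_thresholds with the path to that node's cassandra.yaml. A client address with an unusually high connection count is a candidate, not proof; confirm against application logs.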

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| Batch warning rate in logs | Direct indicator of oversized batches before OOM | Any sustained Batch for ... exceeding specified threshold messages |
| JVM heap used / max | Coordinator must buffer all batch mutations in heap | Used heap trending above 80% of max, especially after Old GC |
| GC pause duration | Long pauses block gossip and request processing | Old Generation pauses > 500 ms |
| Dropped MUTATION messages | Node is shedding load it cannot process | Non-zero or rising rate in nodetool tpstats |
| Coordinator write latency P99 | Reflects batch coordination and batchlog overhead | P99 write latency spiking while local replica latency stays flat |
| MutationStage pending tasks | Backpressure on the write path | Pending tasks > 0 sustained for more than 60 seconds |
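
Without a metrics pipeline, two of these signals can be read directly from nodetool output. The sketch below uses illustrative function names and assumes the usual column layout: a "Heap Memory (MB) : used / max" line in nodetool info, and Pending as the third column of the thread pool table in nodetool tpstats. The dropped-message row is assumed to be named MUTATION, or MUTATION_REQ on newer versions; check your own output before relying on it.

```shell
# Print used heap as a whole-number percentage of max heap.
# Expects `nodetool info` output on stdin.
heap_used_percent() {
  awk -F: '/^Heap Memory/ { split($2, v, "/"); printf "%.0f\n", v[1] / v[2] * 100 }'
}

# Print MutationStage pending tasks and the dropped mutation count.
# Expects `nodetool tpstats` output on stdin.
write_path_pressure() {
  awk '
    $1 == "MutationStage"                    { print "mutation_stage_pending=" $3 }
    $1 == "MUTATION" || $1 == "MUTATION_REQ" { print "dropped_mutations=" $2 }
  '
}
```

Run them as nodetool info | heap_used_percent and nodetool tpstats | write_path_pressure. Both values are point-in-time samples, so compare several readings taken a minute apart before concluding that heap is climbing or that pending tasks are sustained.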

Fixes

Stop the immediate bleed

If a coordinator is in a GC Death Spiral, disable native transport to stop new batches from arriving while you investigate. This interrupts all client traffic to the node.

# Dangerous: interrupts all client traffic to this node
nodetool disablebinary

Only use this when the node is already flapping between UP and DOWN and you need to break the retry storm. Once the heap recovers, re-enable with nodetool enablebinary.

Fix the client write pattern

Raising batch_size_warn_threshold_in_kb or batch_size_fail_threshold_in_kb masks the symptom and only moves the OOM risk to a larger batch size. Do not raise thresholds to accommodate a misbehaving client.

Replace multi-partition batches with individual async writes. Use your driver’s async execution to fire multiple independent writes in parallel. With token-aware routing, each write is coordinated by a node that holds a replica of its partition, which spreads coordination across the cluster instead of concentrating memory pressure on one coordinator. Batches are not a performance optimization in Cassandra. Use them only when you need atomicity within a single partition.

If atomicity is required, restrict the batch to a single partition. Single-partition batches are safe because the coordinator only coordinates with replicas that own one token range. The memory footprint is bounded and predictable.

Switch from logged to unlogged only if atomicity is unnecessary. Unlogged batches skip the batchlog write to two additional nodes, which removes some overhead. However, the coordinator still buffers mutations for every partition until acknowledgements arrive. Unlogged batches reduce but do not eliminate coordinator memory pressure for multi-partition workloads.

Add client-side backpressure. If the application generates batches from a streaming source, implement rate limiting or bounded queues so that row accumulation cannot grow without bound. If you use the Java driver, configure a request throttler or place a semaphore around session.executeAsync() to bound in-flight requests. Without backpressure, async writes can shift overload from the coordinator to the client and the cluster.

Prevention

Educate developers that Cassandra batches are for atomicity, not throughput. Flag any BEGIN BATCH that touches multiple partition keys during code review.

Monitor batch warnings as a first-class signal. Treat any sustained batch size warning as a ticket-level finding.

Load test with realistic data sizes. Behavior that looks safe in development with 10 rows per batch can become catastrophic in production with 10,000 rows.
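
To make the 10-row versus 10,000-row gap concrete, here is a rough back-of-the-envelope sketch. The 200 bytes per row used in the example is purely an assumed figure, the function name is illustrative, and real serialized mutation sizes depend on the schema, so treat the result as an order-of-magnitude estimate only. The 5 KB and 50 KB defaults are the warn and fail thresholds discussed earlier.

```shell
# Estimate a batch's size and compare it to the warn/fail thresholds.
# Arguments: rows, assumed bytes per row, warn KB (default 5), fail KB (default 50).
classify_batch() {
  rows=$1
  bytes_per_row=$2
  warn_kb=${3:-5}
  fail_kb=${4:-50}
  size_kb=$(( rows * bytes_per_row / 1024 ))   # integer KB, rounded down
  if [ "$size_kb" -gt "$fail_kb" ]; then
    verdict=fail
  elif [ "$size_kb" -gt "$warn_kb" ]; then
    verdict=warn
  else
    verdict=ok
  fi
  echo "${size_kb}KB ${verdict}"
}

classify_batch 10 200      # development-sized batch
classify_batch 10000 200   # production-sized batch
```

Under the assumed row size, the first call stays well below the warn threshold while the second is far past the fail threshold, which is why a batch pattern that passes in development can be rejected or cause heap pressure in production.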

Keep thresholds at conservative defaults. Fix the client instead of raising batch limits.

Review ETL and migration jobs separately from application code. Batch misuse often appears in one-off scripts that use the same CQL driver but lack production tuning.

How Netdata helps

  • Correlate Cassandra log warnings with JVM heap usage charts to confirm that batch size spikes precede memory pressure.
  • Monitor GC pause duration alongside batch warning events to detect the GC Death Spiral before gossip marks the node DOWN.
  • Alert on dropped mutation rates from nodetool tpstats as a lagging indicator that the coordinator is shedding load.
  • Track coordinator write latency percentiles to spot batch-induced tail latency before client timeouts trigger.
  • Surface sudden connection count changes that may indicate a misbehaving client driver or batch loader connecting to the cluster.