Cassandra dropped mutations: silent write loss and load shedding

Your application logs successful writes, but reads return stale or missing data. An alert fires on the DroppedMessage rate for the MUTATION scope. The client never received an error, yet a replica discarded the write after it sat in the MutationStage queue past the timeout. This is Cassandra load shedding. The loss is silent whenever enough other replicas succeed to meet the consistency level: the client sees success while the overloaded replica never applies the write.

Dropped mutations are a lagging indicator. By the time they appear, the replica is already choking on commitlog I/O, CPU starvation, GC pauses, or thread pool exhaustion. The DroppedMessage counter is cumulative since JVM start, so a single large value means little on its own; what matters is whether the count is still rising. Any sustained non-zero rate is abnormal and demands immediate investigation.
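Because the counter is cumulative, the useful number is the difference between two samples divided by the time between them. The following is a minimal sketch of that arithmetic; the function name drop_rate and the reset handling (treating a lower second sample as a JVM restart) are illustrative assumptions, not part of Cassandra.

```shell
# drop_rate PREV CURR INTERVAL_SECONDS
# Prints drops per second between two samples of a cumulative counter.
# Assumption: if CURR < PREV the JVM restarted and the counter reset,
# so CURR itself is treated as the increase since the restart.
drop_rate() {
    awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
        if (secs <= 0) exit 1
        delta = (curr >= prev) ? curr - prev : curr
        printf "%.2f\n", delta / secs
    }'
}

# Example: two samples taken 60 seconds apart.
drop_rate 1200 1260 60    # prints 1.00
```

A result of 0.00 across several consecutive intervals suggests the drops were historical; any repeated positive value is the sustained rate the rest of this guide is about.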

What this means

When a coordinator forwards a write to a replica, the replica places the mutation into the MutationStage queue. A worker thread eventually picks it up, appends it to the commitlog, inserts it into the memtable, and sends an ACK. If the queue is backed up because the node cannot process mutations fast enough, the mutation ages past write_request_timeout_in_ms (default 2000 ms) and is silently discarded.

The coordinator may have already ACKed the write to the client if enough other replicas responded. The drop is invisible to the application. The missing data on that replica creates an inconsistency that will not self-heal unless the key is later touched by read repair or by an anti-entropy repair run within gc_grace_seconds. Hinted handoff does not protect against this: hints are generated only when a replica is marked DOWN by gossip, not when it is UP but overloaded and dropping mutations.
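The arithmetic behind the invisible drop can be sketched as follows, assuming writes at consistency level QUORUM (the text above does not name a level, so QUORUM is an assumption chosen for illustration). The function names are hypothetical.

```shell
# quorum RF: number of replicas that must ACK a QUORUM write, floor(RF/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# write_outcome RF ACKS: what the client observes at QUORUM when ACKS replicas respond.
# The body runs in a subshell so the helper variable does not leak.
write_outcome() (
    need=$(quorum "$1")
    if [ "$2" -ge "$need" ]; then
        if [ "$2" -lt "$1" ]; then
            echo "success, $(( $1 - $2 )) replica(s) missing the write"
        else
            echo "success, all replicas consistent"
        fi
    else
        echo "timeout error returned to client"
    fi
)

write_outcome 3 2    # one replica dropped the mutation, client still sees success
```

With a replication factor of 3, two ACKs satisfy QUORUM, so a single overloaded replica can drop the mutation without the client ever noticing.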

flowchart TD
    A[Coordinator forwards write] --> B[Replica MutationStage queue]
    B --> C[Thread processes mutation]
    C --> D[Commitlog append + memtable insert]
    D --> E[ACK to coordinator]
    F[Commitlog disk slow] --> G[Queue backlog]
    H[GC pause] --> G
    I[CPU saturation] --> G
    G --> J[Mutation exceeds timeout]
    J --> K[MUTATION silently dropped]
    K --> L[Data loss on replica]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Commitlog disk I/O saturation | High w_await or %util on the commitlog device; commitlog pending tasks grow | iostat -x 1 on the commitlog volume |
| MutationStage thread pool saturation | Pending > 0 or Blocked > 0 in MutationStage; CPU high | nodetool tpstats |
| Long GC pauses | GC logs show pauses > 500 ms; heap after GC > 75%; drops correlate with GC spikes | nodetool info and GC logs |
| Compaction backlog starving disk I/O | PendingCompactions trending up; disk %util > 80% | nodetool compactionstats |
| CPU starvation | High sys/user CPU with low idle; multiple thread pools lag | mpstat or top |

Quick checks

Run these safe, read-only commands to triage the scope of the drops and identify the saturated resource.

# Check dropped mutation counts and thread pool state
nodetool tpstats
# Check commitlog and data disk saturation
iostat -x 1
# Check compaction debt
nodetool compactionstats
# Check heap usage and pressure
nodetool info | grep "Heap Memory"
# Check for long GC pauses in recent logs
grep -iE "pause|stopped" /var/log/cassandra/gc.log | tail -20
# Check coordinator write latency
nodetool proxyhistograms
# Check node liveness and cluster view
nodetool status
# Check disk space on commitlog and data volumes
df -h /var/lib/cassandra/commitlog /var/lib/cassandra/data
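
The three numbers that matter most in the tpstats output can be pulled out with a small filter. This is a sketch that assumes the common column layout (pool rows as Name, Active, Pending, Completed, Blocked, All time blocked; dropped rows as type followed by count). The layout differs between Cassandra versions, and some releases may label the row MUTATION_REQ, so check it against your own output. The function name mutation_summary is hypothetical.

```shell
# Reads `nodetool tpstats` output on stdin and prints MutationStage pending,
# MutationStage blocked, and the cumulative dropped MUTATION count.
# Assumed layout: pool rows are "Name Active Pending Completed Blocked AllTimeBlocked",
# dropped rows are "TYPE Dropped ...". Missing rows are reported as 0.
mutation_summary() {
    awk '
        $1 == "MutationStage" { pending = $3; blocked = $5 }
        $1 == "MUTATION" || $1 == "MUTATION_REQ" { dropped = $2 }
        END { printf "pending=%d blocked=%d dropped=%d\n", pending, blocked, dropped }
    '
}

# Usage on a live node:
#   nodetool tpstats | mutation_summary
```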

How to diagnose it

  1. Confirm a sustained drop rate. Run nodetool tpstats and note the Dropped section. The counters are cumulative. Sample the value, wait 60 seconds, and sample again. Any increase in the MUTATION row is a problem.
  2. Inspect the MutationStage queue. In nodetool tpstats, look for Pending > 0 or Blocked > 0 on the MutationStage row. Sustained pending means the write path cannot keep up; blocked means the queue is full and new tasks are stalled waiting for a slot.
  3. Check commitlog disk latency. Use iostat -x on the commitlog device. If await is elevated or %util is near 100%, fsync stalls are backing up the write pipeline. If commitlog and data share a device, separation is strongly recommended.
  4. Check GC health. Parse GC logs for stop-the-world pauses. Pauses > 500 ms freeze all stage processing and cause queued mutations to expire. If heap after full GC > 75% of max, memory pressure is the root cause.
  5. Check compaction status. Run nodetool compactionstats. If pending tasks are trending upward over hours, compaction is stealing I/O bandwidth from commitlog writes and read processing.
  6. Check for asymmetric patterns. If only one rack or node shows drops, inspect that specific host for hardware degradation, uneven traffic, or a hot partition. Use nodetool tablehistograms to see if a single table dominates write latency.
  7. Verify hinted handoff is not masking the issue. Hints are stored when a replica is marked DOWN by gossip. An overloaded node that remains UP will not receive hints for dropped mutations, so do not rely on hint replay to backfill lost data.
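
The GC check in step 4 can be sketched as a filter that keeps only pauses above a threshold. GC log formats depend on the JVM version and logging flags, so the two patterns below are assumptions: a JDK 8 style "application threads were stopped: N seconds" line, and a JDK 9+ unified-logging line that ends in a duration such as 812.345ms. The function name long_pauses is hypothetical.

```shell
# long_pauses THRESHOLD_MS: reads GC log lines on stdin and prints, in whole
# milliseconds, every pause longer than the threshold.
# Assumed formats (both vary with JVM flags):
#   JDK 8:  "... application threads were stopped: 0.7500000 seconds, ..."
#   JDK 9+: "... Pause Young (Normal) ... 812.345ms"
long_pauses() {
    awk -v limit="$1" '
        /stopped: [0-9.]+ seconds/ {
            s = $0
            sub(/.*stopped: /, "", s)     # drop everything before the number
            sub(/ seconds.*/, "", s)      # drop everything after it
            ms = s * 1000
            if (ms > limit) printf "%.0f\n", ms
            next
        }
        /Pause.* [0-9.]+ms$/ {
            ms = $NF
            sub(/ms$/, "", ms)
            if (ms + 0 > limit) printf "%.0f\n", ms
        }
    '
}

# Usage, with the log path used earlier in this guide:
#   long_pauses 500 < /var/log/cassandra/gc.log
```

Lining up the timestamps of the surviving lines with increases in the dropped MUTATION count shows whether GC is the trigger or only a bystander.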

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| DroppedMessage rate for MUTATION | Direct measure of silent write loss | Any sustained non-zero rate |
| MutationStage pending tasks | Backpressure on the write path | Pending > 0 for > 60 seconds |
| Commitlog pending tasks | Commitlog fsync cannot keep up | Pending > 0 sustained |
| Disk await on commitlog device | Physical I/O bottleneck delaying durability | await > 10 ms sustained |
| GC pause duration | STW freezes all stage processing | Pause > 500 ms |
| Compaction pending tasks | Background I/O debt steals bandwidth from writes | Trending upward over hours |
| Client write timeouts | Coordinator-side view of replica slowness | Rate > 0.1% of write rate |
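
The disk await threshold can be checked from iostat output with a short filter. This sketch locates the w_await column by its header name because the column position differs between sysstat versions; older versions expose only a combined await column, in which case nothing is printed. The function name high_await and the device name in the usage line are hypothetical.

```shell
# high_await DEVICE LIMIT_MS: reads `iostat -x` output on stdin and prints each
# w_await sample for DEVICE that exceeds LIMIT_MS.
# The column is found by header name; if no "w_await" header exists, no rows match.
high_await() {
    awk -v dev="$1" -v limit="$2" '
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "w_await") col = i
            next
        }
        col && $1 == dev && $col + 0 > limit { print $col }
    '
}

# Usage: three 5-second samples for an assumed commitlog device.
#   iostat -x 5 3 | high_await nvme1n1 10
```

The first report iostat prints is an average since boot, so judge the device by the later samples.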

Fixes

Fix the root cause; do not restart Cassandra as a first response. Restarting clears the backlog temporarily but destroys the diagnostic state and does not prevent immediate recurrence.

Commitlog I/O saturation

If the commitlog device is saturated, move the commitlog to a dedicated volume. This requires updating cassandra.yaml and restarting the node, so plan for maintenance. As temporary relief, reduce write throughput from clients or batch jobs. If you are using commitlog_sync: batch, which fsyncs before the ACK, switching to periodic lowers write latency at the cost of durability: a crash can lose up to commitlog_sync_period_in_ms (default 10000 ms) of acknowledged writes on that node. Understand the data-loss implications before changing this.
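
Before changing anything, it helps to see what the node is actually configured with. The following sketch prints the uncommented commitlog sync and directory settings from a cassandra.yaml file. The function name commitlog_settings is hypothetical, and the file location in the usage line is an assumption that depends on how Cassandra was installed.

```shell
# commitlog_settings FILE: prints the active (uncommented) commitlog sync and
# directory settings from a cassandra.yaml file. Commented-out lines start
# with "#" and therefore do not match the anchored pattern.
commitlog_settings() {
    grep -E '^commitlog_(sync|directory)' "$1"
}

# Usage (path is an assumption; adjust to your installation):
#   commitlog_settings /etc/cassandra/cassandra.yaml
```

Comparing the printed commitlog directory with the data directories in the df output from the quick checks shows whether the two share a device.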

MutationStage saturation from CPU pressure

Thread pool saturation is usually a symptom, not a root cause. Identify what is starving CPU. If compaction is the consumer, you can temporarily increase compaction_throughput_mb_per_sec to clear debt faster, or decrease it to throttle compaction and leave headroom for writes. If GC is consuming CPU, see the Cassandra GC death spiral guide. Identify large partitions, tombstone scans, or misconfigured caches rather than blindly resizing pools.

Disk I/O contention from compaction backlog

Increase compaction_throughput_mb_per_sec to let compaction catch up, but monitor read latency because this adds I/O load. Postpone repairs, bootstraps, and streaming that compete for disk. If you use SizeTieredCompactionStrategy and disk usage is above 50%, you are at risk of space exhaustion during a major compaction. Clear old snapshots or add capacity first so that compaction has room to recover.
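
The throughput cap can be adjusted at runtime with nodetool, without editing cassandra.yaml. The wrapper below is a sketch: the function name and the NODETOOL override variable are hypothetical conveniences, while getcompactionthroughput and setcompactionthroughput are the nodetool subcommands being wrapped. The value applied this way lasts only until the next restart.

```shell
# NODETOOL can be overridden (for example NODETOOL=echo) to dry-run the commands.
NODETOOL="${NODETOOL:-nodetool}"

# set_compaction_throughput MB: shows the current cap, then applies a new one.
# The change is runtime-only; the cassandra.yaml value applies again after a restart.
set_compaction_throughput() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_compaction_throughput <MB per second>" >&2
            return 1
            ;;
    esac
    "$NODETOOL" getcompactionthroughput
    "$NODETOOL" setcompactionthroughput "$1"
}

# Usage on a live node (64 is an arbitrary example value):
#   set_compaction_throughput 64
```

Printing the current value first gives you the number to restore once the backlog has cleared.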

Memory pressure and GC pauses

Reduce in-flight writes by throttling clients at the application layer. Check for large batch statements or unbounded partitions that allocate massive objects on heap. If heap after full GC > 85%, a rolling restart with increased heap may be needed. Do not exceed roughly 16 GB with G1GC, or pause times will worsen.

Prevention

  • Alert on the rate of DroppedMessage, not the cumulative count. Counters reset on JVM restart, so rate is the only actionable signal.
  • Keep commitlog on a dedicated disk, separate from data directories.
  • Watch MutationStage pending tasks as a leading indicator. Any sustained pending predicts future drops.
  • Monitor compaction pending trends, not just absolute values. A rising trend means I/O debt is accumulating.
  • Keep JVM heap after full GC below 75% of max. Track this from GC logs, not just HeapMemoryUsage.
  • Run anti-entropy repair within gc_grace_seconds so that any inconsistencies from transient drops are eventually reconciled.

How Netdata helps

Netdata collects DroppedMessage rates and MutationStage pending tasks from JMX and places them on the same timeline as disk I/O latency and GC pauses. This correlation shows whether drops lag a commitlog stall or a GC spike by seconds.

Per-disk await and utilization metrics for the commitlog volume let you distinguish disk saturation from CPU saturation without logging into the node.

Process RSS tracking surfaces off-heap memory pressure that JVM heap metrics miss, catching OOM-killer risk before it triggers.

Composite alerting on drops plus thread pool saturation plus GC pauses reduces false positives from single-metric blips.