Cassandra lightweight transaction contention: Paxos round-trips and CAS latency
A normal Cassandra write reaches a coordinator, gets appended to the commitlog and memtable on the replicas, and returns. One network round-trip. A lightweight transaction (LWT) using IF NOT EXISTS, IF EXISTS, or IF <condition> triggers Paxos consensus instead: four round-trips of coordination overhead, plus serialization of contending requests against the same partition. If your monitoring only tracks aggregate read and write latency, LWT tail latency is invisible until clients time out.
Most operators first encounter LWT contention when application latency spikes while standard Cassandra dashboards look healthy. The usual root cause is either a monitoring pipeline that aggregates CASRead and CASWrite into generic read/write latency buckets, or an application that funnels high-frequency updates through a small set of partition keys.
What it is and why it matters
Lightweight transactions provide linearizable consistency for compare-and-set operations via the Paxos consensus protocol. Every conditional operation runs through prepare, read, propose, and commit phases across a quorum of replicas. In Cassandra versions prior to 4.1, this is a four-round-trip protocol. Cassandra 4.1+ introduces Paxos v2, which improves LWT performance, but CAS operations remain fundamentally more expensive than normal writes.
Normal writes to the same partition proceed independently on each replica and are reconciled later by write timestamp. LWT operations targeting the same partition are serialized by the protocol: only one CAS operation on a given partition can make progress at a time. Others block or retry until the active transaction completes or times out. A hot partition under LWT load becomes a hard bottleneck, and latency degrades superlinearly as concurrency increases.
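The serialization effect can be illustrated with a deliberately simple queueing sketch. It assumes that all requests for one partition arrive at the same moment, that each CAS takes a fixed service time, and that there are no retries, backoff, or timeouts. The function names and the 8 ms service time are hypothetical and chosen only for illustration. Real contention adds retries on top of this, so treat the numbers as an optimistic lower bound.

```python
from typing import List


def serialized_cas_latencies(n_requests: int, service_ms: float) -> List[float]:
    """Latency seen by each of n requests that arrive together and are
    served strictly one at a time (toy model: no retries, no timeouts)."""
    return [service_ms * (position + 1) for position in range(n_requests)]


def max_cas_throughput(service_ms: float) -> float:
    """Upper bound on CAS operations per second for ONE partition
    when each operation holds the partition for service_ms."""
    return 1000.0 / service_ms


if __name__ == "__main__":
    # Hypothetical: each uncontended CAS takes 8 ms end to end.
    for depth in (1, 4, 16):
        worst = serialized_cas_latencies(depth, 8.0)[-1]
        print(f"{depth:>2} concurrent requests: last one waits {worst:.0f} ms")
    print(f"ceiling: {max_cas_throughput(8.0):.0f} CAS/s on this partition")
```

Under these assumptions the ceiling is a property of the partition, not of the cluster, which is why adding nodes does not raise it.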
nodetool proxyhistograms exposes CASRead and CASWrite as separate scopes from normal Read and Write. Aggregating them into a single latency metric hides contention. A cluster can show healthy p99 read and write latency while contended CAS operations stack up in thread pools and time out.
How it works
When a coordinator receives a CAS write, it executes quorum-dependent phases. It sends prepare messages to replicas to establish a unique ballot. After receiving quorum promises, it performs a serial read of the current partition state to evaluate the condition. If the condition is satisfied, it proposes the new value and waits for quorum acceptance. Finally, it commits. Each phase is a network round-trip. Any phase can fail or stall if replicas are slow or if another coordinator is contending for the same partition key.
flowchart LR
A[CAS request arrives] --> B[Prepare / Promise]
B --> C[Serial read]
C --> D[Propose / Accept]
D --> E[Commit]
E --> F[Return applied status]

Because the protocol uses strict per-partition serialization, concurrent CAS requests to the same key queue behind the active transaction. If that transaction stalls due to a slow replica or a long GC pause, the queue lengthens and subsequent requests time out or return [applied]=false even though the cluster is otherwise healthy. This is a coordination problem, not resource exhaustion. Use Paxos-specific diagnostics rather than standard read or write saturation checks.
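The four phases can be sketched as a toy in-memory model. This is not Cassandra's implementation: it assumes a single partition, replicas that never fail, caller-supplied integer ballots, and no recovery of in-progress proposals. `Replica`, `Coordinator`, and `ContentionError` are hypothetical names invented for the sketch. It only shows where the round-trips are spent and how a newer ballot from a competing coordinator forces the older one to give up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


class ContentionError(Exception):
    """A quorum rejected our ballot because a newer ballot exists."""


@dataclass
class Replica:
    promised: int = 0            # highest ballot this replica has promised
    value: Optional[str] = None  # committed value of the single toy partition

    def prepare(self, ballot: int) -> bool:
        # Promise only ballots newer than anything seen so far.
        if ballot > self.promised:
            self.promised = ballot
            return True
        return False

    def propose(self, ballot: int) -> bool:
        # Accept only if no newer ballot was promised in the meantime.
        return ballot == self.promised

    def commit(self, value: str) -> None:
        self.value = value


class Coordinator:
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1
        self.round_trips = 0

    def _has_quorum(self, acks: List[bool]) -> bool:
        self.round_trips += 1  # each phase costs one round-trip
        return sum(acks) >= self.quorum

    def cas(self, ballot: int, expected: Optional[str], new_value: str) -> bool:
        # Phase 1: prepare / promise.
        if not self._has_quorum([r.prepare(ballot) for r in self.replicas]):
            raise ContentionError("ballot superseded during prepare")
        # Phase 2: serial read, then evaluate the condition.
        self.round_trips += 1
        current = Counter(r.value for r in self.replicas).most_common(1)[0][0]
        if current != expected:
            return False  # corresponds to [applied]=false
        # Phase 3: propose / accept.
        if not self._has_quorum([r.propose(ballot) for r in self.replicas]):
            raise ContentionError("ballot superseded during propose")
        # Phase 4: commit.
        self.round_trips += 1
        for r in self.replicas:
            r.commit(new_value)
        return True


if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    first = Coordinator(replicas)
    print(first.cas(1, None, "v1"), first.round_trips)
    second = Coordinator(replicas)
    print(second.cas(2, None, "v2"), second.round_trips)
```

In this model an applied CAS costs four round-trips, while a CAS whose condition fails stops after the serial read. A coordinator whose ballot is older than one already promised fails in the first phase and has to retry with a newer ballot, which is the retry traffic that builds up on a hot partition.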
Where it shows up in production
Hot partition contention. A small set of partition keys receiving concurrent updates causes Paxos collision. Coordinator CASWrite latency climbs while regular Write latency stays flat. The bottleneck is the serialization window inside the Paxos state machine, not disk I/O or compaction backlog. Adding nodes or faster disks will not help if the application continues to hammer the same partition key with conditional updates.
Monitoring blind spots. If your metrics system rolls CAS percentiles into generic latency averages, contention is invisible until client timeouts trigger alerts. The CAS latency histograms and counters reset on node restart, so establish per-node baselines and allow post-restart warmup before alerting.
Quorum fragility. A normal write at LOCAL_QUORUM needs one round of acknowledgments. A CAS write needs quorums at multiple phases. A single replica experiencing GC pauses or network degradation can stall the entire Paxos round. LWT is disproportionately sensitive to the tail latency of individual nodes.
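Cross-DC amplification. If replicas span datacenters and the operation uses the default SERIAL serial consistency, every Paxos round-trip pays the inter-DC latency penalty. A CAS write in a multi-DC cluster can accumulate hundreds of milliseconds of coordination overhead even under zero contention, and a single slow remote replica can dominate end-to-end latency. LOCAL_SERIAL confines the Paxos rounds to the local datacenter, but then linearizability holds only within that datacenter.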
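A back-of-the-envelope calculation shows how round-trips and replica placement combine into a latency floor. The sketch assumes that the phases run strictly one after another, that each phase completes when the quorum-th fastest replica answers, and that there is no contention. The replica round-trip times (1 ms local, 60 ms remote) and the function names are hypothetical. The four-round-trip count comes from the pre-4.1 protocol described above.

```python
from typing import List

LWT_ROUND_TRIPS = 4     # prepare, serial read, propose, commit (pre-4.1)
NORMAL_ROUND_TRIPS = 1  # a plain write needs one round of acknowledgments


def quorum_rtt_ms(replica_rtts_ms: List[float]) -> float:
    """Time until a majority of the given replicas has answered:
    the round-trip time of the quorum-th fastest replica."""
    quorum = len(replica_rtts_ms) // 2 + 1
    return sorted(replica_rtts_ms)[quorum - 1]


def latency_floor_ms(round_trips: int, replica_rtts_ms: List[float]) -> float:
    """Lower bound on coordinator latency with zero contention."""
    return round_trips * quorum_rtt_ms(replica_rtts_ms)


if __name__ == "__main__":
    local = [1.0, 1.0, 1.0]      # hypothetical RTTs to local-DC replicas
    remote = [60.0, 60.0, 60.0]  # hypothetical RTTs to remote-DC replicas

    print("plain write, local quorum:", latency_floor_ms(NORMAL_ROUND_TRIPS, local))
    print("LWT, local replicas only: ", latency_floor_ms(LWT_ROUND_TRIPS, local))
    print("LWT, quorum over both DCs:", latency_floor_ms(LWT_ROUND_TRIPS, local + remote))
```

With three replicas per datacenter, a majority of six is four, so a quorum over both datacenters always includes at least one remote replica and every phase pays the inter-DC round-trip.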
Tradeoffs and when to use it
LWT is a correctness tool, not a throughput tool. Use it only when linearizable semantics are strictly required and the access pattern is naturally low-contention.
Appropriate uses include rare metadata updates such as schema registration, feature-flag toggles, or low-frequency inventory reservation where partition keys are well distributed. The Paxos overhead is acceptable when the operation rate is low and the correctness guarantee is worth the latency cost.
Avoid LWT for high-volume write paths, event-stream deduplication, frequent counter increments, queue implementations, or any workload where the same partition key might see concurrent updates. The Paxos serialization bottleneck will dominate performance and can destabilize healthy nodes by backing up thread pools.
If you run Cassandra 4.1 or later, Paxos v2 reduces overhead relative to earlier releases, but it does not remove the serialization constraint or the multi-round-trip latency profile. Upgrade if you rely on LWT, but do not treat the upgrade as a license to increase CAS volume.
Signals to watch in production
Monitor CAS scopes independently from normal traffic. Correlating CASRead and CASWrite with regular read/write latency, thread pool saturation, and GC pauses reveals whether Paxos is the bottleneck.
| Signal | Why it matters | Warning sign |
|---|---|---|
| CASRead latency p99 | Coordinator time for CAS read phases | Sustained elevation above rolling baseline |
| CASWrite latency p99 | Coordinator time for full CAS write rounds | Sustained elevation or > 3x rolling baseline |
| CASRead / CASWrite timeouts | Replicas too slow during Paxos phases | Any sustained non-zero rate |
| CASRead / CASWrite unavailables | Insufficient live replicas for a quorum | Immediate risk of CAS failure |
| Thread pool pending (MutationStage, ReadStage) | Backpressure from serialized CAS operations | Pending > 0 sustained for > 60 seconds |
| Node liveness (gossip DOWN) | Paxos needs quorums at every phase | Any replica DOWN in the token range |
| GC pause duration | Pauses stall Paxos rounds and inflate CAS latency | Pauses > 500 ms sustained |
nodetool proxyhistograms reports coordinator-level latency in microseconds and resets on node restart. Measure baseline deviations against a rolling window, not an absolute threshold, and allow cold-start warmup before alerting.
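A rolling-baseline check with warmup might look like the following sketch. It assumes you already collect one CASWrite p99 sample (in microseconds) per interval. The class name, the window size, and the warmup count are hypothetical, and the 3x multiplier mirrors the warning sign in the table above.

```python
from collections import deque


class RollingBaselineAlert:
    """Flag a sample that exceeds multiplier x the mean of recent samples."""

    def __init__(self, window: int, multiplier: float = 3.0, warmup: int = 1) -> None:
        self.samples = deque(maxlen=window)  # most recent p99 samples
        self.multiplier = multiplier
        self.warmup = max(1, warmup)         # samples required before alerting

    def observe(self, p99_micros: float) -> bool:
        """Record a sample; return True if it breaches the rolling baseline."""
        breached = False
        if len(self.samples) >= self.warmup:
            baseline = sum(self.samples) / len(self.samples)
            breached = p99_micros > self.multiplier * baseline
        # Note: breaching samples also enter the window and raise the baseline.
        self.samples.append(p99_micros)
        return breached

    def reset(self) -> None:
        """Call after a node restart, since the source metrics start over."""
        self.samples.clear()


if __name__ == "__main__":
    alert = RollingBaselineAlert(window=5, warmup=3)
    for sample in (1000, 1100, 900, 1000, 5000):
        print(sample, alert.observe(sample))
```

Calling `reset()` after a restart forces the check back into warmup, so the first post-restart samples are not compared against a stale baseline.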
How Netdata helps
Netdata collects Cassandra JMX metrics per scope so you can monitor CAS traffic independently from normal reads and writes.
- Isolate CAS latency. View CASRead and CASWrite percentiles alongside standard Read and Write latency. A widening gap between Write and CASWrite p99 signals Paxos overhead or partition contention.
- Correlate with saturation. Overlay CAS latency with MutationStage pending tasks, dropped messages, and GC pause duration. If CAS latency spikes while pending tasks climb and GC is quiet, you have partition-level serialization rather than systemic resource exhaustion.
- Detect quorum fragility. Track ClientRequest timeouts and unavailables for the CASRead and CASWrite scopes. A sustained rate of CAS unavailables means your replica set is too small or unstable for reliable LWT.
- Baseline-aware alerting. Use relationship-based thresholds, such as CASWrite p99 exceeding three times the one-hour rolling average, to catch contention early without false positives from restart-induced metric resets.
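The correlation logic in the list above can be written down as a small triage function. This is a sketch of the reasoning, not a Netdata feature: the `Snapshot` fields, the labels, and the function name are hypothetical. The 3x ratio and the 500 ms GC threshold are taken from the signals table, and should be tuned to your own baselines.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cas_write_ratio: float  # current CASWrite p99 / rolling baseline
    write_ratio: float      # current Write p99 / rolling baseline
    gc_pause_ms: float      # longest recent GC pause


def triage(s: Snapshot, ratio_threshold: float = 3.0,
           gc_threshold_ms: float = 500.0) -> str:
    """Classify a latency snapshot using the heuristics from the text."""
    if s.cas_write_ratio < ratio_threshold:
        return "ok"
    if s.gc_pause_ms > gc_threshold_ms:
        return "gc-stall"        # pauses are stalling Paxos rounds
    if s.write_ratio >= ratio_threshold:
        return "systemic"        # normal writes are slow too
    return "cas-contention"      # CAS slow, Write flat, GC quiet


if __name__ == "__main__":
    print(triage(Snapshot(cas_write_ratio=4.0, write_ratio=1.1, gc_pause_ms=50.0)))
```

The order of the checks is a design choice: GC is tested before the comparison with normal writes because a long pause can inflate CAS latency on its own.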
Related guides
- Cassandra adding and removing nodes safely: vnodes, tokens, and cleanup
- Cassandra Batch too large warning: oversized batches and coordinator OOM
- Cassandra node stuck in joining (UJ): bootstrap diagnosis
- Cassandra compaction strategies: STCS vs LCS vs TWCS vs UCS
- Cassandra clock skew: how NTP drift silently corrupts data
- Cassandra commitlog disk full: segment exhaustion and forced flushes
- Cassandra commitlog pending tasks: write-path I/O pressure
- Cassandra compaction death spiral: when writes outrun compaction throughput
- Cassandra consistency levels explained: QUORUM, ONE, LOCAL_QUORUM, and EACH_QUORUM
- Cassandra zombie data resurrection: gc_grace_seconds and unrepaired tombstones
- Cassandra disk space exhaustion: emergency recovery when the data volume fills
- Cassandra dropped mutations: silent write loss and load shedding







