Cassandra lightweight transaction contention: Paxos round-trips and CAS latency

A normal Cassandra write reaches a coordinator, gets appended to the commitlog and memtable on the replicas, and returns. One network round-trip. A lightweight transaction (LWT) using IF NOT EXISTS, IF EXISTS, or IF <condition> triggers Paxos consensus instead: four round-trips of coordination overhead, plus serialization of contending requests against the same partition. If your monitoring only tracks aggregate read and write latency, LWT tail latency is invisible until clients time out.

Most operators first encounter LWT contention when application latency spikes while standard Cassandra dashboards look healthy. The usual root cause is either a monitoring pipeline that aggregates CASRead and CASWrite into generic read/write latency buckets, or an application that funnels high-frequency updates through a small set of partition keys.

What it is and why it matters

Lightweight transactions provide linearizable consistency for compare-and-set (CAS) operations via the Paxos consensus protocol. Every conditional operation runs through prepare, read, propose, and commit phases, each of which needs responses from a quorum of replicas. In Cassandra versions prior to 4.1, this is a four-round-trip protocol. Cassandra 4.1 introduces Paxos v2, which improves LWT performance, but CAS operations remain fundamentally more expensive than normal writes.

Normal writes to the same partition are applied independently on each replica, with no coordination between writers, and conflicting versions are reconciled later by timestamp. LWT operations targeting the same partition are serialized by the protocol: only one CAS operation on a given partition can make progress at a time. Others block or retry until the active transaction completes or times out. A hot partition under LWT load becomes a hard bottleneck, and because every preempted attempt has to be retried, latency can degrade superlinearly as concurrency increases.
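As a rough illustration of why a hot partition turns into a bottleneck, the sketch below models the best case: a batch of CAS requests arrives at one partition at the same moment, each needs four round-trips, and they complete strictly one after another with no wasted rounds. The function name, the 2 ms round-trip time, and the strict-queue assumption are illustrative only, not Cassandra internals. Real contention adds preempted ballots and retries on top, so treat the numbers as a floor.

```python
from typing import List


def serialized_cas_latencies(contenders: int, rtt_ms: float, round_trips: int = 4) -> List[float]:
    """Completion time (ms) of each request when CAS rounds on one partition run back to back."""
    per_op_ms = rtt_ms * round_trips
    # The request in position n waits for the n - 1 rounds ahead of it, then runs its own.
    return [per_op_ms * position for position in range(1, contenders + 1)]


for k in (1, 10, 50):
    latencies = serialized_cas_latencies(k, rtt_ms=2.0)  # 2 ms RTT is an assumed value
    mean_ms = sum(latencies) / len(latencies)
    print(f"{k:>3} contenders: mean {mean_ms:.0f} ms, worst {latencies[-1]:.0f} ms")
```

Even in this optimistic model, the last of 50 simultaneous contenders waits 400 ms for an operation that costs 8 ms when uncontended.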

nodetool proxyhistograms exposes CASRead and CASWrite as separate scopes from normal Read and Write. Aggregating them into a single latency metric hides contention. A cluster can show healthy p99 read and write latency while contended CAS operations stack up in thread pools and time out.
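If you collect these metrics over JMX rather than through nodetool, the same separation shows up in the MBean names. The helper below is a small sketch that builds one object name per scope and metric so that CAS scopes are never folded into Read or Write. The function and constant names are hypothetical, and you should verify the object-name pattern against the metrics documentation for your Cassandra version before relying on it.

```python
from typing import Dict, Tuple

# Scopes and metric names taken from the signals discussed in this guide.
CLIENT_REQUEST_SCOPES = ("Read", "Write", "CASRead", "CASWrite")
METRIC_NAMES = ("Latency", "Timeouts", "Unavailables")


def client_request_mbean(scope: str, name: str) -> str:
    """Build the JMX object name for one ClientRequest metric (pattern assumed, see lead-in)."""
    return f"org.apache.cassandra.metrics:type=ClientRequest,scope={scope},name={name}"


def mbeans_by_scope() -> Dict[Tuple[str, str], str]:
    """One entry per (scope, metric) pair, so every scope keeps its own time series."""
    return {
        (scope, name): client_request_mbean(scope, name)
        for scope in CLIENT_REQUEST_SCOPES
        for name in METRIC_NAMES
    }


for key, mbean in mbeans_by_scope().items():
    print(key, mbean)
```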

How it works

When a coordinator receives a CAS write, it executes four phases, each of which depends on a quorum of replicas. It sends prepare messages to replicas to establish a unique ballot. After receiving quorum promises, it performs a serial read of the current partition state to evaluate the condition. If the condition is satisfied, it proposes the new value and waits for quorum acceptance. Finally, it commits. Each phase is a network round-trip. Any phase can fail or stall if replicas are slow or if another coordinator is contending for the same partition key.

flowchart LR
    A[CAS request arrives] --> B[Prepare / Promise]
    B --> C[Serial read]
    C --> D[Propose / Accept]
    D --> E[Commit]
    E --> F[Return applied status]
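
The four phases can be sketched as a toy in-process model. This is not Cassandra's implementation: ballots here are plain integers (Cassandra uses time-based UUIDs), replicas are local objects rather than network peers, and the step in which a coordinator finishes another coordinator's in-progress proposal is left out. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Replica:
    """Toy per-partition Paxos state held by one replica."""
    promised: int = 0                           # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None  # proposal accepted but not yet committed
    committed: Optional[str] = None             # committed value; None means the row does not exist

    def prepare(self, ballot: int) -> bool:
        # Promise only to ballots newer than anything seen before.
        if ballot > self.promised:
            self.promised = ballot
            return True
        return False

    def propose(self, ballot: int, value: str) -> bool:
        # Accept unless a newer ballot has been promised in the meantime.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

    def commit(self, value: str) -> None:
        self.committed = value
        self.accepted = None


def insert_if_not_exists(replicas: List[Replica], ballot: int, value: str) -> Tuple[str, int]:
    """Run one CAS attempt and return (outcome, round_trips_used)."""
    quorum = len(replicas) // 2 + 1

    # Round-trip 1: prepare / promise.
    promised = [r for r in replicas if r.prepare(ballot)]
    if len(promised) < quorum:
        return ("preempted", 1)

    # Round-trip 2: serial read to evaluate the IF NOT EXISTS condition.
    if any(r.committed is not None for r in promised):
        return ("not_applied", 2)

    # Round-trip 3: propose / accept.
    accepted = [r for r in promised if r.propose(ballot, value)]
    if len(accepted) < quorum:
        return ("preempted", 3)

    # Round-trip 4: commit.
    for r in accepted:
        r.commit(value)
    return ("applied", 4)


replicas = [Replica() for _ in range(3)]
print(insert_if_not_exists(replicas, ballot=1, value="alice"))  # ('applied', 4)
print(insert_if_not_exists(replicas, ballot=2, value="bob"))    # ('not_applied', 2)
print(insert_if_not_exists(replicas, ballot=1, value="carol"))  # ('preempted', 1)
```

The third call shows the contention path: a coordinator holding a stale ballot is rejected in the first phase and has to start over with a higher ballot, which is where the extra round-trips under contention come from.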

Because the protocol uses strict per-partition serialization, concurrent CAS requests to the same key queue behind the active transaction. If that transaction stalls due to a slow replica or a long GC pause, the queue lengthens and subsequent requests time out or return [applied]=false even though the cluster is otherwise healthy. This is a coordination problem, not resource exhaustion. Use Paxos-specific diagnostics rather than standard read or write saturation checks.

Where it shows up in production

Hot partition contention. A small set of partition keys receiving concurrent updates causes Paxos collision. Coordinator CASWrite latency climbs while regular Write latency stays flat. The bottleneck is the serialization window inside the Paxos state machine, not disk I/O or compaction backlog. Adding nodes or faster disks will not help if the application continues to hammer the same partition key with conditional updates.

Monitoring blind spots. If your metrics system rolls CAS percentiles into generic latency averages, contention is invisible until client timeouts trigger alerts. The per-node CAS latency histograms also reset on node restart, so establish per-node baselines and allow post-restart warmup before alerting.

Quorum fragility. A normal write at LOCAL_QUORUM needs one round of acknowledgments. A CAS write needs quorums at multiple phases. A single replica experiencing GC pauses or network degradation can stall the entire Paxos round. LWT is disproportionately sensitive to the tail latency of individual nodes.

Cross-DC amplification. If replicas span datacenters, every Paxos round-trip pays the inter-DC latency penalty. A CAS write in a multi-DC cluster can accumulate hundreds of milliseconds of coordination overhead even under zero contention, and a single slow remote replica can dominate end-to-end latency.
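
A back-of-the-envelope calculation makes the amplification concrete. The sketch below assumes that each Paxos phase completes when the quorum-th fastest replica responds, that there are four phases, and that there is no contention. The round-trip times are made-up values for a cluster with three replicas per datacenter, and the function name is hypothetical.

```python
from typing import List


def cas_latency_floor_ms(replica_rtts_ms: List[float], round_trips: int = 4) -> float:
    """Uncontended CAS latency floor: every phase waits for the quorum-th fastest replica."""
    quorum = len(replica_rtts_ms) // 2 + 1
    phase_ms = sorted(replica_rtts_ms)[quorum - 1]
    return phase_ms * round_trips


single_dc = [1.0, 1.5, 2.0]                  # 3 replicas, quorum of 2
two_dcs = [1.0, 1.5, 2.0, 80.0, 82.0, 85.0]  # 6 replicas, quorum of 4 must include a remote one

print(cas_latency_floor_ms(single_dc))  # 6.0
print(cas_latency_floor_ms(two_dcs))    # 320.0
```

With six replicas split across two datacenters, a quorum of four cannot be formed from local replicas alone, so every phase pays the inter-DC round-trip. The single-DC figure corresponds to what you would see if the Paxos quorum were confined to the local datacenter, for example with the LOCAL_SERIAL serial consistency level, at the cost of giving up linearizability across datacenters.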

Tradeoffs and when to use it

LWT is a correctness tool, not a throughput tool. Use it only when linearizable semantics are strictly required and the access pattern is naturally low-contention.

Appropriate uses include rare metadata updates such as schema registration, feature-flag toggles, or low-frequency inventory reservation where partition keys are well distributed. The Paxos overhead is acceptable when the operation rate is low and the correctness guarantee is worth the latency cost.

Avoid LWT for high-volume write paths, event-stream deduplication, frequent counter increments, queue implementations, or any workload where the same partition key might see concurrent updates. The Paxos serialization bottleneck will dominate performance and can destabilize healthy nodes by backing up thread pools.

If you run Cassandra 4.1 or later, Paxos v2 reduces overhead relative to earlier releases, but it does not remove the serialization constraint or the multi-round-trip latency profile. Upgrade if you rely on LWT, but do not treat the upgrade as a license to increase CAS volume.

Signals to watch in production

Monitor CAS scopes independently from normal traffic. Correlating CASRead and CASWrite with regular read/write latency, thread pool saturation, and GC pauses reveals whether Paxos is the bottleneck.

Signal	Why it matters	Warning sign
`CASRead` latency p99	Coordinator time for CAS read phases	Sustained elevation above rolling baseline
`CASWrite` latency p99	Coordinator time for full CAS write rounds	Sustained elevation or > 3x rolling baseline
`CASRead` / `CASWrite` timeouts	Replicas too slow during Paxos phases	Any sustained non-zero rate
`CASRead` / `CASWrite` unavailables	Insufficient live replicas for a quorum	Immediate risk of CAS failure
Thread pool pending (MutationStage, ReadStage)	Backpressure from serialized CAS operations	Pending > 0 sustained for > 60 seconds
Node liveness (gossip DOWN)	Paxos needs quorums at every phase	Any replica DOWN in the token range
GC pause duration	Pauses stall Paxos rounds and inflate CAS latency	Pauses > 500 ms sustained

nodetool proxyhistograms reports coordinator-level latency in microseconds and resets on node restart. Measure baseline deviations against a rolling window, not an absolute threshold, and allow cold-start warmup before alerting.
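
One way to implement a baseline-relative check with a warmup period is sketched below. The class name, the window size, the 3x factor, and the minimum sample count are all assumptions to tune for your own cluster. It expects you to feed it one CASWrite p99 reading per scrape interval, in microseconds.

```python
from collections import deque


class RollingBaselineAlert:
    """Flags a reading that exceeds `factor` times the mean of the recent window."""

    def __init__(self, window: int = 60, factor: float = 3.0, min_samples: int = 10):
        self.samples = deque(maxlen=window)  # oldest readings fall out automatically
        self.factor = factor
        self.min_samples = min_samples

    def reset(self) -> None:
        """Call after a node restart so stale readings do not form the baseline."""
        self.samples.clear()

    def observe(self, p99_micros: float) -> bool:
        """Record one reading; return True if it breaches the rolling baseline."""
        warmed_up = len(self.samples) >= self.min_samples
        baseline = sum(self.samples) / len(self.samples) if self.samples else 0.0
        breach = warmed_up and p99_micros > self.factor * baseline
        # The reading joins the window either way, so a long incident slowly raises the baseline.
        self.samples.append(p99_micros)
        return breach


alert = RollingBaselineAlert(window=60, factor=3.0, min_samples=5)
readings = [4000, 4000, 4000, 4000, 4000, 4200, 15000]  # made-up p99 values in microseconds
print([alert.observe(value) for value in readings])
```

Only the final reading is flagged: the first five fall inside the warmup period, and 4200 is well within three times the 4000 microsecond baseline.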

How Netdata helps

Netdata collects Cassandra JMX metrics per scope so you can monitor CAS traffic independently from normal reads and writes.

Isolate CAS latency. View CASRead and CASWrite percentiles alongside standard Read and Write latency. A widening gap between Write and CASWrite p99 signals Paxos overhead or partition contention.
Correlate with saturation. Overlay CAS latency with MutationStage pending tasks, dropped messages, and GC pause duration. If CAS latency spikes while pending tasks climb and GC is quiet, you have partition-level serialization rather than systemic resource exhaustion.
Detect quorum fragility. Track ClientRequest timeouts and unavailables for the CASRead and CASWrite scopes. A sustained rate of CAS unavailables means your replica set is too small or unstable for reliable LWT.
Baseline-aware alerting. Use relationship-based thresholds, such as CASWrite p99 exceeding three times the one-hour rolling average, to catch contention early without false positives from restart-induced metric resets.
