Cassandra clock skew: how NTP drift silently corrupts data
Cassandra’s last-write-wins conflict resolution is simple, fast, and unforgiving. Every write carries a timestamp; when replicas disagree, the highest timestamp wins. The database assumes larger timestamps correspond to later wall-clock events. When node clocks drift, that assumption collapses. A write that happened first can carry a later timestamp, or a later write an earlier one. The result is not a timeout, an UnavailableException, or an ERROR log entry. It is silent data loss, permanent shadowing of valid writes, or the sudden return of deleted rows.
Unlike systems that use vector clocks or logical timestamps, Cassandra does not track causality for regular mutations. Each write carries a timestamp in microseconds since the Unix epoch, taken either from the coordinator’s local system clock or from a client-supplied value. Replicas reconcile during read repair or anti-entropy repair by comparing these integers. There is no concurrent-write detection, no merge logic, and no validation that a timestamp is near the receiver’s clock. There is only integer comparison. When clocks skew, the larger integer is not necessarily the correct write.
This article explains the mechanics of timestamp-based conflict resolution, the specific failure modes that clock drift creates, and why these errors never surface as Cassandra exceptions. It also covers how to detect skew across the ring and the operational discipline required to prevent a quiet catastrophe.
What it is and why it matters
Clock skew in a Cassandra cluster occurs when nodes disagree about the current time by more than a few milliseconds. Cassandra relies on time for two critical functions: write timestamps determine which write survives when replicas hold different values, and node clocks decide when a tombstone or TTL-expired row can be safely garbage-collected. The database does not validate that a timestamp is close to the coordinator’s wall clock. It accepts whatever timestamp arrives and resolves conflicts arithmetically.
The danger is that clock skew produces logically wrong results without any error code. A client receives a success response for every write. The cluster remains UP. Gossip may even stay stable if the skew is small. But the dataset is silently corrupted. A user updates a profile, yet the old profile returns on the next read. A delete executes, yet the data reappears days later because a tombstone was compacted away on one replica while a drifted clock on another replica produced a future-timestamped resurrection. These are not hypothetical edge cases. They are direct consequences of last-write-wins semantics combined with unsynchronized clocks. Even drift of just a few seconds is enough to break data correctness.
How timestamp conflict resolution works
Every write in Cassandra carries a timestamp. If the client does not provide one, the coordinator stamps the mutation with its local system time in microseconds. On each replica, the write is appended to the commitlog and inserted into the memtable with that timestamp. When a read request reaches a replica, Cassandra gathers all versions of the row from memtables and SSTables, compares their timestamps, and returns the value with the highest timestamp to the coordinator. During read repair or anti-entropy repair, the same rule applies: the higher timestamp wins, and the losing value is discarded.
This is last-write-wins in its purest form. It is fast because it requires no version vectors, no locking, and no multi-round consensus. But it embeds a single assumption: that timestamps are monotonic with respect to real-world event order. When all nodes use the same time source, the assumption mostly holds. When they do not, it fails catastrophically.
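The read-path rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra’s implementation: `Version` and `reconcile` are hypothetical names, and the real server reconciles per cell and has additional rules for ties and deletes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One stored version of a value (hypothetical model, not a Cassandra type)."""
    value: str
    timestamp_us: int  # write timestamp, microseconds since the Unix epoch
    source: str        # where the version was found, e.g. "memtable"


def reconcile(versions):
    """Return the version with the highest timestamp (last-write-wins).

    Nothing here checks causality or arrival order; only the integers are compared.
    """
    return max(versions, key=lambda v: v.timestamp_us)


versions = [
    Version("alice@old.example", 1_700_000_000_000_000, "sstable-1"),
    Version("alice@new.example", 1_700_000_005_000_000, "memtable"),
    Version("alice@mid.example", 1_700_000_002_000_000, "sstable-2"),
]
winner = reconcile(versions)
print(winner.value)  # alice@new.example
```

The function has no notion of which write the application issued last. Whatever integer is largest wins, which is why the quality of the clocks that produce those integers matters so much.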
flowchart TD
W1[Write X] -->|t=1000| N1[Node A, correct clock]
W2[Write Y, 2s later] -->|t=999| N2[Node B, clock -3s]
N1 --> R[LWW conflict resolution]
N2 --> R
R --> O[Result: X wins, Y lost]
In the diagram above, the second write arrives two seconds later in wall-clock time but carries an earlier timestamp because Node B’s clock is three seconds slow. During repair or read repair, Cassandra discards Y and keeps X. The application sent two writes, received two acknowledgments, and ended up with the wrong result. No error was logged.
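The same scenario can be reproduced numerically. The sketch below assumes coordinator-assigned timestamps and models a node’s clock as true time plus a fixed offset; `coordinator_timestamp` is a hypothetical helper, not a Cassandra API.

```python
MICROS = 1_000_000


def coordinator_timestamp(true_time_s, clock_offset_s):
    """Timestamp a coordinator would assign: its local clock, in microseconds.

    clock_offset_s is how far the node's clock is from true time
    (negative means the clock runs behind).
    """
    return (true_time_s + clock_offset_s) * MICROS


# Write X goes through node A (accurate clock) at true time t0.
# Write Y goes through node B (clock 3 s slow) 2 s later.
t0 = 1_700_000_000
write_x = ("X", coordinator_timestamp(t0, clock_offset_s=0))
write_y = ("Y", coordinator_timestamp(t0 + 2, clock_offset_s=-3))

# Last-write-wins: keep the write with the larger timestamp.
winner = max(write_x, write_y, key=lambda w: w[1])

print(winner[0])                # X
print(write_x[1] - write_y[1])  # 1000000
```

Although Y happened two seconds after X, its timestamp is one second (1,000,000 microseconds) smaller, so the earlier write wins.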
Where drift surfaces in production
Clock skew does not produce a single symptom. It fractures into several failure modes that are usually misattributed to other causes.
Silent write loss. A node with a fast clock assigns future timestamps to its writes. When a normally clocked node later receives an update to the same partition, its timestamp is lower than the drifted write. Repair or read repair overwrites the legitimate update with the stale drifted value. The application sees the old data return.
Future-dated tombstone shadowing. A delete executed on a node with a fast clock receives a timestamp in the future. Because tombstones are also resolved by timestamp, that tombstone shadows every legitimate write to the data it covers until real time passes its timestamp. New writes succeed, yet the data keeps appearing deleted. This is the inverse of zombie data: live data is buried under a tombstone that compaction cannot purge until the wall clock catches up to the future timestamp.
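A small simulation makes the shadowing concrete. It is a sketch under assumptions: a tombstone is modeled as a cell whose value is `None`, the deleting node is assumed to run one hour fast, and `visible_value` is a hypothetical stand-in for timestamp resolution.

```python
MICROS = 1_000_000


def visible_value(cells):
    """Resolve versions of one cell by timestamp; None models a tombstone."""
    value, _ts = max(cells, key=lambda c: c[1])
    return value  # None means the read returns nothing


now_s = 1_700_000_000
fast_clock_skew_s = 3600  # assumption: the deleting node runs one hour fast

cells = [
    ("v1", now_s * MICROS),                             # original write
    (None, (now_s + 10 + fast_clock_skew_s) * MICROS),  # delete, future-dated
    ("v2", (now_s + 60) * MICROS),                      # later legitimate write
]
print(visible_value(cells))  # None: v2 is shadowed by the tombstone

# A write stamped after the tombstone's timestamp wins again.
cells.append(("v3", (now_s + 10 + fast_clock_skew_s + 1) * MICROS))
print(visible_value(cells))  # v3
```

The write of `v2` is acknowledged like any other, but it stays invisible for as long as correctly clocked writers produce timestamps below the tombstone’s.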
TTL and expiration anomalies. TTL expiration is evaluated against local clock time during compaction. A node with a slow clock delays expiration, keeping data alive past its intended lifetime. A node with a fast clock expires data prematurely. If replicas skew in opposite directions, clients see inconsistent results depending on which replica answers.
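The divergence can be illustrated with a simplified model in which each replica compares a fixed expiry time against its own clock. The node names and offsets below are hypothetical, and `is_expired` is a sketch rather than the server’s logic.

```python
def is_expired(expires_at_s, true_now_s, clock_offset_s):
    """True if a node whose clock is off by clock_offset_s considers the cell expired."""
    local_now_s = true_now_s + clock_offset_s
    return local_now_s >= expires_at_s


written_at_s = 1_700_000_000
ttl_s = 300
expires_at_s = written_at_s + ttl_s

# Hypothetical replicas and their clock offsets in seconds.
replicas = {"node-a": 0, "node-b": -20, "node-c": 20}

# Ten seconds before the intended expiry, in true time.
true_now_s = expires_at_s - 10
early = {n: is_expired(expires_at_s, true_now_s, off) for n, off in replicas.items()}
print(early)  # {'node-a': False, 'node-b': False, 'node-c': True}

# Ten seconds after the intended expiry.
true_now_s = expires_at_s + 10
late = {n: is_expired(expires_at_s, true_now_s, off) for n, off in replicas.items()}
print(late)   # {'node-a': True, 'node-b': False, 'node-c': True}
```

In both snapshots at least one replica disagrees with the others, so the answer a client gets depends on which replica serves the read.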
Permanent inconsistency after repair. When skewed timestamps have already propagated, running nodetool repair does not restore correctness. Repair applies last-write-wins just like read repair. If the surviving replica holds a drifted timestamp, repair streams that value to other replicas, cementing the error across the entire ring. The damage becomes durable and consistent, which is worse than transient inconsistency.
Same-timestamp collision. When two writes carry identical timestamps, Cassandra cannot order them by time and falls back to a tie-break that ignores arrival order: a delete beats a live value, and between two live values the greater value wins. From the application’s point of view the outcome is arbitrary. High-throughput systems or environments that mix client-side and server-side timestamp generators increase collision probability. Skew across nodes makes collisions more likely because nodes operate in overlapping timestamp windows instead of sequential ones.
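The tie-break is arbitrary from the application’s point of view, but it can be modeled. The sketch below assumes the commonly described rules (on equal timestamps a delete beats a live value, and between two live values the greater value by byte order wins). Exact behavior varies by Cassandra version, so treat `resolve` as an illustrative assumption rather than the server’s code.

```python
def resolve(a, b):
    """Pick a winner between two cells given as (value, timestamp_us).

    value is bytes for a live cell or None for a tombstone.
    """
    (val_a, ts_a), (val_b, ts_b) = a, b
    if ts_a != ts_b:
        return a if ts_a > ts_b else b
    # Equal timestamps: a tombstone wins over a live value.
    if val_a is None or val_b is None:
        return a if val_a is None else b
    # Both live: the greater value by byte order wins, regardless of arrival order.
    return a if val_a >= val_b else b


ts = 1_700_000_000_000_000
first = (b"status=shipped", ts)
second = (b"status=cancelled", ts)  # arrived later, same timestamp

print(resolve(first, second)[0])       # b'status=shipped'
print(resolve(second, first)[0])       # b'status=shipped'
print(resolve((None, ts), second)[0])  # None
```

The later write loses simply because its bytes sort lower. Nothing in the outcome reflects what the application intended.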
Detection and operational discipline
Clock skew is uniquely dangerous because it never raises a Cassandra error metric. There is no JMX bean for timestamp conflicts. You must measure time synchronization directly and treat it as a first-class operational signal.
Check node timestamps. Compare system clocks across the ring. Any delta over 100 milliseconds is a problem. Target sub-50 ms.
# Compare NTP offset across the ring
for node in node1 node2 node3; do
  echo -n "$node: "
  ssh "$node" chronyc tracking | grep "Last offset"
done
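To turn collected output into an alert, the offsets can be parsed and compared pairwise. The following Python sketch assumes the `System time` line format printed by `chronyc tracking`. The sample output and node offsets are made up for illustration, and `system_offset_ms` and `skew_alerts` are hypothetical helpers.

```python
import re

# Illustrative `chronyc tracking` output; the values are made up for this example.
SAMPLE = """\
Reference ID    : A9FEA97B (169.254.169.123)
Stratum         : 4
System time     : 0.137000000 seconds slow of NTP time
Last offset     : -0.000012345 seconds
"""

SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (slow|fast) of NTP time", re.M
)


def system_offset_ms(tracking_output):
    """Return the signed clock offset in milliseconds (negative = clock is slow)."""
    match = SYSTEM_TIME.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found")
    seconds = float(match.group(1))
    sign = -1.0 if match.group(2) == "slow" else 1.0
    return sign * seconds * 1000.0


def skew_alerts(offsets_ms, limit_ms=100.0):
    """Return (node, node, delta_ms) for each pair whose clocks differ by more than limit_ms."""
    nodes = sorted(offsets_ms)
    return [
        (a, b, abs(offsets_ms[a] - offsets_ms[b]))
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
        if abs(offsets_ms[a] - offsets_ms[b]) > limit_ms
    ]


offsets = {"node1": system_offset_ms(SAMPLE), "node2": 2.0, "node3": -1.5}
alerts = skew_alerts(offsets)
for a, b, delta in alerts:
    print(f"{a} and {b} differ by {delta:.1f} ms")
```

Comparing pairs rather than each node against the reference matters because two nodes drifting in opposite directions can each look acceptable while their mutual skew exceeds the limit.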
Monitor NTP or Chrony state. On each node, verify offset from the reference clock. Chrony has largely replaced legacy NTPd in modern distributions because it converges faster after large offsets and handles virtualized environments better.
# Check Chrony offset and stratum
chronyc tracking
# For legacy ntpd
ntpq -p
Stratum should be low and offset should remain below a few milliseconds. If you are running in virtualized environments, standard NTP peers may not be sufficient.
Use cloud provider time services. AWS EC2 instances should point to the Amazon Time Sync Service at 169.254.169.123 rather than generic internet NTP pools. On Azure, modern VMs expose /dev/ptp0 for Precision Time Protocol via the hypervisor. If your deployment is on bare metal or on-prem, configure multiple stratum-1 or stratum-2 sources and ensure your time daemon handles leap second discontinuities gracefully. A backward step is worse than a skewed clock.
Standardize timestamp sources. If your applications supply client-side timestamps, the application host clock becomes the source of truth. If those hosts are not synchronized to the same NTP source as the database, you have introduced skew at a layer that Cassandra cannot detect. Standardize on one approach per cluster and synchronize every participant to the same reference. Check application servers with the same rigor you apply to database nodes.
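If you do generate timestamps on the client, a monotonic generator at least keeps a single process from reusing or rewinding timestamps when its clock stalls or steps backward. The class below is a hypothetical sketch of that idea, not a driver API. It does nothing about skew between different hosts, which still requires synchronized clocks.

```python
import time


class MonotonicTimestamps:
    """Hands out strictly increasing microsecond timestamps (hypothetical helper).

    If the wall clock stalls or steps backward, the generator keeps counting up
    from the last value instead of reusing or rewinding timestamps.
    """

    def __init__(self, clock_ns=time.time_ns):
        self._clock_ns = clock_ns  # callable returning nanoseconds since the epoch
        self._last_us = 0

    def next_us(self):
        now_us = self._clock_ns() // 1000
        self._last_us = max(now_us, self._last_us + 1)
        return self._last_us


# Simulated clock readings in nanoseconds: the third reading steps backward.
readings = iter([100_000_000_000, 100_000_001_000, 99_500_000_000, 100_000_002_000])
gen = MonotonicTimestamps(clock_ns=lambda: next(readings))
stamps = [gen.next_us() for _ in range(4)]
print(stamps)  # [100000000, 100000001, 100000002, 100000003]
```

Values produced this way could be attached to writes as client-supplied timestamps. The trade-off is that after a backward step the generator runs ahead of the wall clock until real time catches up.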
Signals to watch in production
| Signal | Why it matters | Warning sign |
|---|---|---|
| Cross-node clock skew | LWW resolution assumes synchronized clocks | Any node pair offset > 100 ms |
| NTP offset and stratum | Degraded time source causes drift | Offset > 50 ms sustained |
| Read repair rate | Skewed timestamps trigger unexpected reconciliation | Spike without maintenance or node recovery |
| Tombstone scan warnings | Skew accelerates or delays TTL tombstone expiration | Sustained warnings on TTL-heavy tables |
| Repair completion window | Unrepaired data amplifies skew-induced divergence | Last repair > 80% of gc_grace_seconds |
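The numeric thresholds in the table can be encoded as a simple check. This is a minimal sketch: the signal names in the dictionary are hypothetical, and the readings would come from whatever monitoring system you already run.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default gc_grace_seconds (10 days)


def repair_overdue(seconds_since_repair,
                   gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS,
                   fraction=0.8):
    """True if the last repair is older than the given fraction of gc_grace_seconds."""
    return seconds_since_repair > fraction * gc_grace_seconds


def evaluate(signals):
    """Map hypothetical signal readings to the warning signs in the table."""
    warnings = []
    if signals["max_pair_skew_ms"] > 100:
        warnings.append("cross-node clock skew")
    if signals["sustained_ntp_offset_ms"] > 50:
        warnings.append("NTP offset")
    if repair_overdue(signals["seconds_since_repair"], signals["gc_grace_seconds"]):
        warnings.append("repair window")
    return warnings


signals = {
    "max_pair_skew_ms": 140,
    "sustained_ntp_offset_ms": 12,
    "seconds_since_repair": 9 * 86_400,  # nine days
    "gc_grace_seconds": DEFAULT_GC_GRACE_SECONDS,
}
print(evaluate(signals))  # ['cross-node clock skew', 'repair window']
```

Read repair rate and tombstone warnings are left out because the table defines them relative to a baseline rather than a fixed number.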
How Netdata helps
Netdata collects the system-level and Cassandra JMX metrics needed to correlate clock health with data integrity.
- System clock offset monitoring. Netdata’s NTP collector tracks offset, stratum, and reachability per node. Correlate a rising offset with read repair spikes or tombstone warnings to identify skew-induced reconciliation storms.
- JMX signal correlation. Netdata exposes Cassandra’s read repair rate, dropped message counters, and tombstone scan histograms alongside host metrics. A read repair spike without a corresponding node-down event is a strong clue that timestamps are diverging.
- Per-node comparison. Clock skew is a cluster-wide problem best seen by comparing nodes. Netdata’s distributed view aligns NTP offsets across the ring with latency and repair metrics.
- TTL and expiration monitoring. Netdata tracks table-level metrics that expose tombstone density. Rising tombstone counts on TTL tables paired with clock offset anomalies indicate that expiration is not happening uniformly across replicas.







