How CockroachDB actually works in production: a mental model for operators

CockroachDB is a distributed, strongly consistent SQL database built on a replicated key-value store. It layers SQL execution on top of a transactional KV engine that uses Raft consensus for replication and MVCC for concurrency control. To reason about its failures, you must hold several interacting subsystems in your head simultaneously.

This is not a tutorial. It is the mental model that experienced operators use when diagnosing slow queries, unavailable ranges, clock skew, and compaction death spirals. Each subsystem has its own failure modes, but the most damaging incidents happen when failures cascade across layers.

The key insight: CockroachDB’s layers are not independent. A disk stall at the Pebble layer cascades upward through Raft, through range availability, through admission control, and into SQL latency. A clock skew problem at the HLC layer causes read restarts that look like contention. Understanding the request path from SQL client to SSTable is the foundation for every diagnostic decision you will make.

The layered architecture

CockroachDB is organized in layers. Each layer has its own resource profile, its own failure modes, and its own observable signals.

SQL layer. Client connections arrive via the pgwire protocol (PostgreSQL wire protocol) at a gateway node. The gateway parses, plans, and coordinates query execution. Each connection gets a goroutine. There is no built-in connection pooler; pooling is the client’s responsibility.

Distributed SQL (DistSQL). The gateway transforms the optimized logical plan into a directed acyclic graph of physical SQL operators. DistSQL creates flows: pipelines of processors connected by streams that shuffle data between nodes. A single query can saturate inter-node bandwidth and consume memory on multiple nodes simultaneously.

Transactional KV layer. Below SQL sits a transactional key-value store. CockroachDB uses serializable snapshot isolation via MVCC timestamps. Each transaction acquires a timestamp from the node’s Hybrid Logical Clock. Transactions that conflict may be pushed (timestamp advanced) or aborted. Write intents (uncommitted writes) remain in storage until they are resolved, either by the transaction coordinator or by other transactions that encounter them. Intents left behind by abandoned transactions are cleaned up asynchronously.

Range abstraction. The entire keyspace is divided into ranges, each up to roughly 512 MiB by default (raised from 64 MiB in v20.1). Every range is a unit of replication, load balancing, and Raft consensus. A range has multiple replicas (default 3) spread across nodes and zones. One replica is the leaseholder: it serves all reads and coordinates writes. One replica is the Raft leader: it drives consensus. In the common case, leaseholder and Raft leader are co-located on the same node.

Raft consensus. Every write to a range must be proposed through Raft and committed by a quorum of replicas before acknowledgment. Raft heartbeats flow continuously between replicas. A node with 10,000 ranges runs 10,000 Raft state machines concurrently, each consuming CPU for ticking, proposing, and applying entries. This per-range overhead is a non-obvious resource multiplier.
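The quorum requirement above reduces to simple majority arithmetic. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of any CockroachDB API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` voting members."""
    if replicas < 1:
        raise ValueError("a Raft group needs at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can be lost while a quorum still exists."""
    return replicas - quorum_size(replicas)


for n in (3, 5, 7):
    print(f"replicas={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

With the default of 3 replicas, a range stays available through the loss of one replica but not two, which is why simultaneous failures on nodes holding replicas of the same range matter so much.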

Storage layer (Pebble LSM). Each node stores data in Pebble, a Log-Structured Merge Tree engine (Pebble replaced RocksDB as default in v20.2; it is the only option since v21.1). Writes go into an in-memory memtable (default 64 MB), which is flushed to sorted SSTable files on disk at Level 0. Background compaction merges SSTables from L0 down through L6. When compaction falls behind ingestion, L0 file count and sublevel count grow, causing read latency to spike nonlinearly. This is the single most common performance cliff in CockroachDB.
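To see why L0 growth hurts reads, a simplified model of read amplification helps. The sketch below assumes a point read may have to consult every L0 sublevel plus at most one SSTable in each non-empty lower level (L1 through L6); it ignores memtables, bloom filters, and block caching, so treat it as an upper-bound intuition rather than Pebble's actual accounting.

```python
LOWER_LEVELS = 6  # L1 through L6


def read_amplification(l0_sublevels: int, nonempty_lower_levels: int = LOWER_LEVELS) -> int:
    """Approximate number of places a point read may need to look."""
    if l0_sublevels < 0:
        raise ValueError("l0_sublevels must be non-negative")
    if not 0 <= nonempty_lower_levels <= LOWER_LEVELS:
        raise ValueError("nonempty_lower_levels must be between 0 and 6")
    return l0_sublevels + nonempty_lower_levels


for sublevels in (2, 10, 20):
    print(f"L0 sublevels={sublevels} read amplification={read_amplification(sublevels)}")
```

In this model the lower levels contribute a fixed cost, so every additional L0 sublevel adds directly to the work done by each read.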

The request path through the layers

A SQL write traverses every layer. Understanding this path is the foundation for correlating symptoms across subsystems.

flowchart TD
    C["SQL client"] -->|"pgwire"| GW["Gateway: SQL planning"]
    GW -->|"physical plan"| DS["DistSQL: cross-node execution"]
    DS -->|"KV operations"| LH["Leaseholder: reads, write coord"]
    LH -->|"propose"| RL["Raft leader: consensus"]
    RL -->|"fsync commit"| PB["Pebble LSM: L0 to L6"]
    RL -.->|"log replication"| FL["Follower replicas on other nodes"]

A client connects via pgwire to a gateway node. The gateway parses and optimizes the SQL statement, then either executes it locally or distributes it via DistSQL flows to nodes that hold the relevant range data. Each DistSQL processor performs a fragment of the work (scan, filter, join, aggregate) and streams results through inter-node connections.

For writes, the leaseholder for each affected range coordinates the transaction. It proposes the write through Raft. The Raft leader drives consensus: the write is committed once a quorum of replicas acknowledge. On each replica, the committed entry is written to the Pebble LSM: first to the memtable, then flushed to an SSTable at L0, eventually compacted down to L6.

Reads are simpler. The leaseholder serves reads directly from its local Pebble instance, subject to MVCC visibility rules and HLC uncertainty checks. No Raft round-trip is needed for a single-key read at the leaseholder.

Cross-cutting subsystems

Three subsystems span all layers and determine how the system behaves under load.

Admission control (v21.2+, default since v22.1). An internal flow control system that queues work to prevent overload. Five queues regulate different work types: kv, sql-kv-response, sql-sql-response, elastic-cpu, and store-write. The store-write queue is directly tied to LSM L0 health. It begins shaping regular traffic at 5 L0 sublevels and elastic traffic at 1 sublevel. When admission control is queuing, latency rises but the system stays stable. When it is overwhelmed, you get cascading failures.
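The store-write thresholds described above can be expressed as a small lookup. This is a sketch of the stated thresholds only, not of the admission control implementation; whether the comparison is inclusive is an assumption, and the names are hypothetical.

```python
ELASTIC_SUBLEVEL_THRESHOLD = 1  # from the text: elastic traffic shaped at 1 sublevel
REGULAR_SUBLEVEL_THRESHOLD = 5  # from the text: regular traffic shaped at 5 sublevels


def shaped_traffic(l0_sublevels: int) -> list[str]:
    """Return which traffic classes the store-write queue would be shaping."""
    shaped = []
    if l0_sublevels >= ELASTIC_SUBLEVEL_THRESHOLD:
        shaped.append("elastic")
    if l0_sublevels >= REGULAR_SUBLEVEL_THRESHOLD:
        shaped.append("regular")
    return shaped


for sublevels in (0, 3, 8):
    print(sublevels, shaped_traffic(sublevels))
```

The gap between the two thresholds is the useful part: background (elastic) work is throttled well before foreground traffic feels anything.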

Hybrid Logical Clocks (HLC). HLC combines physical wall-clock time with a logical counter. CockroachDB enforces a maximum clock offset between nodes (default 500ms). If a node detects its clock is skewed beyond 80% of max-offset relative to a majority of peers, it self-terminates. Clock skew within the allowed window causes read uncertainty restarts, which are easy to overlook unless you track restart causes but which directly affect tail latency.
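The combination of wall time and a logical counter is easiest to understand from the textbook HLC update rules. The sketch below follows the generic published algorithm, not CockroachDB's source; the class and method names are hypothetical, and it omits the max-offset check that rejects remote timestamps too far in the future.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int     # physical component, e.g. nanoseconds
    logical: int  # tie-breaker when wall time does not advance


class HLC:
    def __init__(self, physical_clock):
        self._physical_clock = physical_clock  # callable returning an int
        self._last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        """Timestamp for a local event or an outgoing message."""
        wall = self._physical_clock()
        if wall > self._last.wall:
            self._last = Timestamp(wall, 0)
        else:
            # Wall clock stalled or went backwards: bump the logical counter.
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        wall = self._physical_clock()
        last = self._last
        if wall > last.wall and wall > remote.wall:
            self._last = Timestamp(wall, 0)
        elif remote.wall > last.wall:
            self._last = Timestamp(remote.wall, remote.logical + 1)
        elif last.wall > remote.wall:
            self._last = Timestamp(last.wall, last.logical + 1)
        else:
            self._last = Timestamp(last.wall, max(last.logical, remote.logical) + 1)
        return self._last


# Demo with a fake physical clock that stalls and then runs behind a peer.
ticks = iter([100, 100, 90])
clock = HLC(lambda: next(ticks))
print(clock.now(), clock.now(), clock.update(Timestamp(250, 7)))
```

The property to notice is that timestamps never move backwards, even when the local wall clock does, and that a fast peer drags the local clock forward. That forward pull is why skew on one node affects the rest of the cluster.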

Node liveness. Each node renews a liveness record via heartbeat with a fixed expiry. If a node fails to renew, the cluster considers it dead and redistributes leases and replicas. Liveness records are stored in the KV layer, not a SQL table. Recent versions are transitioning toward store-level liveness and lease-based failure detection, reducing dependence on the centralized liveness record.

How failures propagate through the layers

The layered design means failures do not stay contained. Understanding propagation is the core diagnostic skill.

LSM compaction death spiral. Write rate exceeds compaction throughput. L0 sublevel count climbs past 10, then 20. Read amplification increases, making reads and compaction itself slower. This is a positive feedback loop. Eventually Pebble stalls writes. The node cannot process Raft heartbeats during stalls, causing it to lose range leases and appear partially unavailable. If multiple nodes hit this simultaneously, the entire cluster becomes unavailable.

Raft liveness failure. A node becomes slow (GC pause, disk stall, CPU saturation), cannot process Raft heartbeats, loses leadership, and the cluster redistributes leases. This creates cascading unavailability windows. The oscillation pattern is characteristic: liveness lost, then recovered, then lost again.
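The lost, recovered, lost again pattern can be detected mechanically from a series of liveness samples. This is an illustrative sketch that assumes you already collect a chronological list of booleans per node (True meaning live); the function names and the flap threshold are hypothetical.

```python
def count_liveness_flaps(samples: list[bool]) -> int:
    """Count live -> not-live transitions in a chronological sample list."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev and not cur)


def is_oscillating(samples: list[bool], min_flaps: int = 2) -> bool:
    """True when liveness was lost at least `min_flaps` separate times."""
    return count_liveness_flaps(samples) >= min_flaps


flapping = [True, True, False, True, False, True]
hard_down = [True, False, False, False]
print(is_oscillating(flapping), is_oscillating(hard_down))
```

Separating a flapping node from a node that is simply down matters because the first points at a slow but running process (disk stall, CPU saturation), while the second points at a crash or partition.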

Clock skew cascade. NTP failure causes drift. Initially the only effect is that more reads land inside the uncertainty interval, causing more read restarts and higher latency. If drift exceeds 80% of max-offset (400ms by default), nodes self-terminate. If several nodes share the failing NTP infrastructure, they can terminate together, and ranges lose quorum.
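The thresholds in this cascade are plain arithmetic on max-offset. The sketch below classifies a single measured offset against the 80% self-termination limit and a 50% early-warning level; the real check compares against a majority of peers, which this simplification ignores, and the function name and state labels are hypothetical.

```python
def clock_offset_state(
    offset_ms: float,
    max_offset_ms: float = 500.0,
    fatal_fraction: float = 0.8,
    warn_fraction: float = 0.5,
) -> str:
    """Classify a clock offset relative to the configured max-offset."""
    magnitude = abs(offset_ms)  # skew in either direction counts
    if magnitude > fatal_fraction * max_offset_ms:
        return "self-terminate"
    if magnitude > warn_fraction * max_offset_ms:
        return "warning"
    return "ok"


for offset in (100, 300, -450):
    print(offset, clock_offset_state(offset))
```

Alerting at the warning level leaves room to repair time synchronization before any node crosses the fatal limit.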

Hot range bottleneck. A single range receives disproportionate traffic due to sequential key patterns (timestamp-prefixed primary keys, auto-incrementing IDs, single-row counters). One node saturates while the rest of the cluster idles. The asymmetry is the diagnostic marker.
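The asymmetry can be checked by comparing the busiest node with the median of the others. This is a sketch under stated assumptions: the input is a hypothetical mapping of node name to CPU utilization percent, and the 2x ratio is an arbitrary starting threshold, not a documented value.

```python
from statistics import median


def find_hot_node(cpu_by_node: dict[str, float], ratio_threshold: float = 2.0):
    """Return the busiest node if it clearly outpaces its peers, else None."""
    if len(cpu_by_node) < 2:
        return None
    node, peak = max(cpu_by_node.items(), key=lambda item: item[1])
    others = [value for name, value in cpu_by_node.items() if name != node]
    baseline = median(others)
    if baseline <= 0:
        return node if peak > 0 else None
    return node if peak / baseline >= ratio_threshold else None


print(find_hot_node({"n1": 92.0, "n2": 18.0, "n3": 22.0}))
print(find_hot_node({"n1": 60.0, "n2": 55.0, "n3": 58.0}))
```

A cluster that is uniformly busy returns None here, which is the point: uniform load suggests a capacity problem, while a single outlier suggests a key-distribution problem.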

Intent accumulation. Long-running or abandoned transactions leave write intents that block other transactions. Any transaction that encounters an intent must resolve it before proceeding, so when intent resolution cannot keep up, the added latency spreads across the cluster.

Design tradeoffs that shape operations

Every design decision in CockroachDB creates operational consequences. Understanding these tradeoffs helps you anticipate where the system will break under your specific workload.

Serializable isolation by default. CockroachDB enforces serializable snapshot isolation. This is correctness-first: no read phenomena, no anomalies. The cost is higher contention under concurrent writes to the same keys. Transaction restarts (`writetooold`, `txnpush`) are normal in a serializable database, but they are also the primary mechanism by which contention manifests. If your workload has hot keys, serializable isolation will make them visible.

Per-range Raft overhead. Every range runs an independent Raft state machine. At 10,000 ranges per node, Raft ticking consumes real CPU even with zero query traffic. At 50,000+ ranges, Raft processing alone may consume a significant fraction of a core. This is why range count is a scaling dimension distinct from data volume. Adding data without adding nodes increases per-node range count and per-node CPU baseline.
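Per-node range count can be estimated from data volume before it becomes a problem. The sketch below assumes every range is full at the 512 MiB default and that replicas are perfectly balanced across nodes; real ranges are often smaller, so actual counts tend to be higher than this estimate. The function name is hypothetical.

```python
import math


def replicas_per_node(
    logical_data_gib: float,
    nodes: int,
    range_size_mib: int = 512,
    replication_factor: int = 3,
) -> int:
    """Lower-bound estimate of range replicas (Raft groups) hosted per node."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    ranges = math.ceil(logical_data_gib * 1024 / range_size_mib)
    return math.ceil(ranges * replication_factor / nodes)


# 10 TiB of logical data on 6 nodes, then the same data on 3 nodes.
print(replicas_per_node(10 * 1024, nodes=6))
print(replicas_per_node(10 * 1024, nodes=3))
```

Halving the node count doubles the per-node Raft baseline even though the data volume has not changed, which is the scaling dimension described above.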

LSM storage cliff-edge behavior. The LSM compaction model has a positive feedback loop: when compaction falls behind, reads get slower, which makes compaction itself slower because more disk seeks are needed per compaction job. The degradation curve is cliff-edge, not gradual. A system at 5 L0 sublevels today can hit 20+ and write stalls within minutes during a write burst. This is fundamentally different from B-tree storage engines where degradation is more linear.
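A toy simulation makes the cliff visible. The model below is deliberately crude and is not how Pebble schedules compactions: it assumes compaction throughput shrinks in proportion to the current sublevel count, and every parameter name and value is invented for illustration.

```python
def simulate_l0(
    ingest_per_tick: float,
    base_compaction_per_tick: float,
    slowdown_per_sublevel: float = 0.05,
    ticks: int = 60,
    start: float = 0.0,
) -> list[float]:
    """Track L0 sublevels when compaction slows as sublevels accumulate."""
    sublevels = start
    history = []
    for _ in range(ticks):
        # Feedback: more sublevels means each compaction does more work.
        effective = base_compaction_per_tick / (1.0 + slowdown_per_sublevel * sublevels)
        sublevels = max(0.0, sublevels + ingest_per_tick - effective)
        history.append(sublevels)
    return history


steady = simulate_l0(ingest_per_tick=1.4, base_compaction_per_tick=1.5)
after_burst = simulate_l0(ingest_per_tick=1.4, base_compaction_per_tick=1.5, start=5.0)
print(f"from 0 sublevels: {steady[-1]:.1f}  from 5 sublevels: {after_burst[-1]:.1f}")
```

The same ingest rate is sustainable from an empty L0 but runs away once a burst has pushed the node to 5 sublevels. In this model, recovering requires reducing write load below the degraded compaction rate, not just returning to the previous load.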

No heartbeat prioritization. CockroachDB uses a single gRPC port for all inter-node communication: Raft heartbeats, log entries, DistSQL shuffling, and snapshots. There is no way to prioritize Raft heartbeats over data transfer at the application level. Under network saturation, heartbeats compete with bulk data transfers, which can cause liveness failures.

Signals to watch in production

The layered architecture means each layer has its own observable signals. The most effective monitoring correlates signals across layers rather than treating each in isolation.

Signal	Layer	Why it matters	Warning sign
L0 sublevel count (`storage_l0_sublevels`)	Storage	Most predictive signal for storage-driven degradation. Biggest blind spot for most teams.	Greater than 10 sustained. Greater than 20 means write stalls imminent.
WAL fsync latency (`storage_wal_fsync_latency`)	Storage	WAL fsync is on the critical path for every write. Directly impacts Raft commit time.	P99 above 50ms on SSDs.
Range unavailability (`ranges_unavailable`)	Raft / Range	Any nonzero value means some keyspace cannot be read or written. Live user impact.	Any nonzero value.
Clock offset (`clock_offset_meannanos`)	HLC	Clock skew causes read restarts and, at extreme values, node self-termination.	Above 250ms (50% of default max-offset).
Admission control wait time (`admission.wait_durations.*`)	Admission control	Queuing means the system is at capacity. The store-write queue directly reflects LSM health.	Sustained average wait above 10ms.
Node liveness status	Liveness	Loss of liveness triggers lease redistribution and transient unavailability.	Unexpected transition to not-live.
Transaction restart rate by cause (`txn_restarts`)	Transaction	Distinguishes contention (`writetooold`), clock skew (`readwithinuncertainty`), and app conflicts (`txnpush`).	Above 10% of total transactions. Any `readwithinuncertainty` is a clock signal.
MVCC garbage bytes	Storage / Transaction	Dead data accumulating. Protected timestamps can silently block GC.	Growing without bound.
KV write latency (`exec_latency`)	KV	Isolates storage and replication latency from SQL planning overhead.	P99 at 5x baseline.
Raft log commit latency (`raft.process.logcommit.latency`)	Raft / Storage	Rising means WAL writes are getting slower. Disk I/O bottleneck.	Above 50ms on SSDs.
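The transaction restart row in the table can be turned into a small triage helper. This is a sketch that assumes you have already collected restart counts keyed by the cause names used above; the function name, input shape, and output fields are hypothetical.

```python
CAUSE_MEANING = {
    "writetooold": "contention",
    "readwithinuncertainty": "clock skew",
    "txnpush": "application conflicts",
}


def summarize_restarts(restarts_by_cause: dict[str, int], total_txns: int) -> dict:
    """Apply the table's restart-rate and clock-signal rules to raw counts."""
    total_restarts = sum(restarts_by_cause.values())
    rate = total_restarts / total_txns if total_txns else 0.0
    dominant = max(restarts_by_cause, key=restarts_by_cause.get) if total_restarts else None
    return {
        "restart_rate": rate,
        "above_warning": rate > 0.10,  # table: above 10% of transactions
        "clock_signal": restarts_by_cause.get("readwithinuncertainty", 0) > 0,
        "dominant_cause": CAUSE_MEANING.get(dominant, dominant),
    }


print(summarize_restarts({"writetooold": 120, "txnpush": 30, "readwithinuncertainty": 0}, 1000))
```

Splitting by cause first avoids the common mistake of treating every restart as contention when some of them are actually a time-synchronization problem.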

How Netdata helps

Per-second granularity matters for CockroachDB because sub-second write stalls and brief unavailability windows are invisible at 15 to 30 second scrape intervals.
Correlating L0 sublevel count with admission control queue depth and SQL latency in a single view shortens the path from “queries are slow” to “compaction is behind and admission control is throttling.”
Clock offset and readwithinuncertainty restart rate side by side make clock skew immediately visible before it reaches the self-termination threshold.
WAL fsync latency correlated with disk I/O utilization and Raft log commit latency isolates whether write latency is storage-driven or network-driven.
Per-node breakdowns prevent hot range bottlenecks and single-node degradation from hiding under cluster averages.
ML-based anomaly detection on L0 sublevel trajectory catches the slowly growing pattern that precedes most compaction death spirals.
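The granularity point is easy to demonstrate with synthetic data. The sketch below builds a made-up 60-second latency series containing a 3-second stall and compares what a 1-second and a 15-second sampler would observe; it models point-in-time gauge sampling only, and the values are invented.

```python
def sampled_peak(per_second_values: list[float], interval_s: int) -> float:
    """Highest value seen when sampling the series every `interval_s` seconds."""
    if interval_s < 1:
        raise ValueError("interval_s must be at least 1")
    return max(per_second_values[::interval_s])


latency_ms = [2.0] * 60           # quiet baseline
for second in range(20, 23):      # a 3-second write stall
    latency_ms[second] = 900.0

print("1s sampling peak:", sampled_peak(latency_ms, 1))
print("15s sampling peak:", sampled_peak(latency_ms, 15))
```

The 15-second sampler lands on seconds 0, 15, 30, and 45, so the stall falls entirely between samples and the series looks healthy.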

CockroachDB monitoring with Netdata brings these signals together with per-second metrics and ML anomaly detection.
