CockroachDB LSM compaction death spiral: L0 sublevels, read amplification, and write stalls

SQL P99 latency jumps from milliseconds to seconds. KV write latency climbs. Nodes transfer leases. Logs show Pebble write stall messages. This is the LSM compaction death spiral: writes outpace the storage engine’s ability to compact data from Level 0 down the LSM tree. L0 sublevels stack up, read amplification rises, and the node eventually stalls writes to protect itself. By the time write stalls appear, the node is already at risk of losing Raft leases and appearing partially unavailable. This guide shows how to diagnose the spiral, stop it, and prevent it.

What this means

CockroachDB stores data in Pebble, a Log-Structured Merge Tree engine. Writes enter an in-memory memtable, which flushes to sorted SSTable files on disk at Level 0. Background compaction continuously merges L0 SSTables down through deeper levels. Each level below L0 contains non-overlapping key ranges, so a read consults at most one file per level. L0 is different: its files overlap, so Pebble organizes them into sublevels. Each sublevel is internally non-overlapping, but sublevels overlap with each other. A read must check every sublevel.
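The read path described above can be put into numbers. The following is a minimal sketch, assuming a point read consults every L0 sublevel plus at most one file in each non-empty lower level. The function name and the example counts are illustrative and are not part of Pebble or CockroachDB.

```python
def estimated_read_amplification(l0_sublevels: int, nonempty_lower_levels: int) -> int:
    """Upper bound on the number of files a point read may have to consult."""
    if l0_sublevels < 0 or nonempty_lower_levels < 0:
        raise ValueError("counts must be non-negative")
    # Every L0 sublevel must be checked; each lower level contributes at most one file.
    return l0_sublevels + nonempty_lower_levels


# Hypothetical healthy store: 2 sublevels over 5 populated lower levels.
print(estimated_read_amplification(2, 5))   # 7
# The same store with 25 sublevels stacked up.
print(estimated_read_amplification(25, 5))  # 30
```

The lower-level term stays roughly constant, so growth in read amplification during a spiral comes almost entirely from the L0 sublevel count.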

When write ingestion exceeds compaction throughput, flushed SSTables accumulate faster than they can be merged. The sublevel count climbs. At low counts the overhead is minimal. Past 10, read amplification becomes visible in KV latency. Past 20, write stalls are imminent or already occurring. Admission control throttles elastic traffic at low L0 sublevel counts and regular traffic as sublevels rise. Because every Raft log entry must be written to the WAL and acknowledged, a stalled storage engine blocks Raft progress. The node misses heartbeats, loses its leases, and the cluster redistributes its ranges. If multiple nodes enter this state at once, quorum can be lost across large portions of the keyspace.
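The thresholds in this section can be written down as a small classifier. This is a sketch only: the function name and the band labels are made up for illustration, and the cut-offs of 10 and 20 come from this guide rather than from any CockroachDB setting.

```python
def classify_l0_sublevels(sublevels: int) -> str:
    """Map a per-store L0 sublevel count to the bands used in this guide."""
    if sublevels < 0:
        raise ValueError("sublevel count must be non-negative")
    if sublevels > 20:
        return "critical"   # write stalls imminent or already occurring
    if sublevels > 10:
        return "degraded"   # read amplification visible in KV latency
    return "ok"             # overhead is minimal


for count in (3, 14, 27):
    print(count, classify_l0_sublevels(count))
```

Apply it per store, not per node, since one hot store can hide behind a healthy node-level average.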

flowchart TD
    A[Write ingestion exceeds compaction] --> B[L0 sublevels grow]
    B --> C[Read amplification rises]
    C --> D[Compaction slows further]
    D --> B
    B --> E[Admission control throttling]
    E --> F[KV write latency spikes]
    B --> G[Pebble write stalls]
    G --> H[Raft heartbeat backlog]
    H --> I[Lease transfers and client timeouts]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bulk ingestion without rate limiting | storage_l0_sublevels spikes during IMPORT, RESTORE, or heavy batch inserts; specific stores hotter than others | crdb_internal.jobs for running IMPORT/RESTORE jobs |
| Undersized disk I/O | Disk utilization pegged; compaction throughput flat against device ceiling; cloud volume throttling | iostat -xz 1 and provisioned IOPS/throughput limits |
| Backup or snapshot competing for bandwidth | Scheduled backup overlapping write-heavy workload; stalls correlate with backup windows | crdb_internal.jobs for backup jobs and snapshot send rates |
| MVCC tombstone pressure | L0 grows after large DELETE or DROP; MVCC garbage bytes are elevated and not decreasing | Protected timestamp records and garbage collection metrics |

Quick checks

Run these read-only checks from any node, or from a monitoring host that can reach the DB Console HTTP port (8080 by default). When running remotely, replace localhost with the node's address.

# Check L0 sublevels per store (the primary signal)
curl -s http://localhost:8080/_status/vars | grep storage_l0_sublevels

# Check write stall count per store
curl -s http://localhost:8080/_status/vars | grep storage_write_stalls

# Check KV execution and Raft log commit latency
curl -s http://localhost:8080/_status/vars | grep -E 'exec_latency|raft_process_logcommit_latency'

# Check admission control overload signals
curl -s http://localhost:8080/_status/vars | grep admission

# Check node liveness status
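curl -s http://localhost:8080/_status/nodes | python3 -c "
import json, sys
data = json.load(sys.stdin)
# Field names depend on the JSON encoding (camelCase or snake_case); try both.
liveness = data.get('livenessByNodeId') or data.get('liveness_by_node_id') or {}
for n in data['nodes']:
    node_id = n['desc'].get('nodeId', n['desc'].get('node_id'))
    print('Node {}: {}'.format(node_id, liveness.get(str(node_id), 'UNKNOWN')))"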

# Check unavailable ranges
curl -s http://localhost:8080/_status/vars | grep ranges_unavailable

# Check lease transfer rate
curl -s http://localhost:8080/_status/vars | grep leases_transfers_success

# Check disk I/O latency and utilization (run for several intervals)
iostat -xz 1 5
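
The grep checks above return raw Prometheus-format lines. To compare stores side by side, the output can be parsed into a per-store map. This is a minimal sketch that assumes each sample carries a `store` label, as in `storage_l0_sublevels{store="1"} 4`; the helper name and the sample text are hypothetical and were not captured from a real cluster.

```python
import re
from typing import Dict

# One sample per line, e.g.: storage_l0_sublevels{store="1"} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
STORE_RE = re.compile(r'(?:^|,)\s*store="([^"]*)"')


def per_store_values(metrics_text: str, metric_name: str) -> Dict[str, float]:
    """Return {store_id: value} for one metric from Prometheus-format text."""
    values: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        store = STORE_RE.search(match.group("labels"))
        if store is not None:
            values[store.group(1)] = float(match.group("value"))
    return values


# Hypothetical scrape output for illustration only.
SAMPLE = """\
# TYPE storage_l0_sublevels gauge
storage_l0_sublevels{store="1"} 4
storage_l0_sublevels{store="2"} 23
storage_write_stalls{store="1"} 0
"""

print(per_store_values(SAMPLE, "storage_l0_sublevels"))
```

In practice the text would come from the `/_status/vars` endpoint used in the curl commands; matching on the exact metric name avoids picking up other metrics that merely share a prefix.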

How to diagnose it

  1. Confirm L0 sublevels are elevated and rising. Use storage_l0_sublevels. Values above 10 indicate active degradation; above 20 mean write stalls are imminent or already occurring. Check per-store, not per-node. A node with multiple stores can have one hot store and one healthy store.
  2. Verify the trend direction. If L0 is elevated but decreasing, compaction is catching up and the system is healing. If it is rising or flat at a high level, the deficit is accumulating.
  3. Check write stall activity. Any nonzero storage_write_stalls during normal OLTP workload is abnormal. A sustained rate above 1 per second means foreground writes are materially impaired.
  4. Correlate with disk I/O. Run iostat -xz 1. On SSDs, average write latency above 5 ms indicates queueing. If utilization is high and compaction throughput is flat at the device ceiling, disk I/O is the bottleneck.
  5. Check admission control store-write queue. If admission_io_overload is elevated and the store-write queue is deep, the system is deliberately throttling writes due to LSM pressure.
  6. Check KV and Raft latency. Rising exec_latency without SQL-level changes points to storage degradation. Rising raft_process_logcommit_latency points to WAL fsync slowdown, often from disk I/O saturation.
  7. Look for active bulk jobs. Query crdb_internal.jobs for IMPORT, RESTORE, or backup operations that could be overwhelming L0. For example:
    SELECT job_id, job_type, status, fraction_completed
    FROM crdb_internal.jobs
    WHERE status = 'running' AND job_type IN ('IMPORT', 'RESTORE', 'BACKUP');
    
  8. Check MVCC garbage and protected timestamps. Review protected timestamp metrics and crdb_internal.jobs for stalled changefeeds or backups that hold old timestamps. Growing MVCC garbage bytes prevent compaction from reclaiming space efficiently, which silently inflates L0.
  9. Confirm lease transfer rate. Elevated leases_transfers_success indicates nodes are losing leases because they cannot process Raft heartbeats during stalls. Correlate with node liveness status.
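
Steps 2 and 3 both turn raw samples into a judgement: is L0 rising or falling, and how fast are write stalls accumulating. A rough sketch of both follows. It assumes `storage_write_stalls` is read as a cumulative counter and that samples are taken at a fixed interval; the function names and the tolerance of one sublevel are illustrative choices, not CockroachDB defaults.

```python
from typing import Sequence


def l0_trend(samples: Sequence[float], tolerance: float = 1.0) -> str:
    """Compare the newer half of a sample window with the older half."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    half = len(samples) // 2
    older = sum(samples[:half]) / half
    newer = sum(samples[half:]) / (len(samples) - half)
    if newer - older > tolerance:
        return "rising"
    if older - newer > tolerance:
        return "falling"
    return "flat"


def stalls_per_second(prev_count: float, curr_count: float, interval_s: float) -> float:
    """Rate between two readings of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # A lower reading means the counter reset (for example, after a restart).
    delta = curr_count if curr_count < prev_count else curr_count - prev_count
    return delta / interval_s


print(l0_trend([8, 9, 12, 15]))          # rising
print(l0_trend([22, 20, 15, 12]))        # falling
print(stalls_per_second(100, 160, 30))   # 2.0
```

A "flat" result at a high sublevel count should be treated like "rising": the deficit is not being paid down.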

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| storage_l0_sublevels | The single most predictive signal for storage-driven degradation | Sustained > 10; critical if > 20 and rising |
| storage_write_stalls | Active write unavailability; node cannot process Raft log entries | Any nonzero during normal workload; rate > 1/sec sustained |
| exec_latency | KV execution time isolates storage from SQL overhead | P99 rising without workload changes |
| raft_process_logcommit_latency | WAL fsync is on the critical path for every write | P99 > 50 ms on SSDs |
| admission_io_overload | Shows whether flow control is throttling due to LSM pressure | Elevated with store-write queue depth |
| rocksdb_read_amplification | Measures LSM inefficiency; high values mean more disk I/O per read | Sustained > 25 |
| ranges_unavailable | Direct measure of lost availability | Any nonzero sustained value |
| leases_transfers_success | Indicates nodes are losing leaseholder status due to unresponsiveness | Rate > 10x baseline without operational cause |
Fixes

Reduce write rate immediately

Pause or cancel bulk ingest jobs. Canceling an IMPORT or RESTORE is disruptive and requires cleanup; prefer pausing if the job supports it. If admission control has not already brought the write rate under control, the system is past the point where it can self-regulate without application impact. Reduce client connection count or batch size to lower ingestion pressure. Cutting the write rate is the fastest way to give compaction room to catch up.

Relieve disk I/O contention

If a backup or snapshot is running and competing for disk bandwidth, pause it or reschedule it to a low-traffic window. If you are on cloud block storage (for example AWS EBS gp3 or Google Cloud Persistent Disk), verify that you are not hitting provisioned IOPS or throughput caps. Baseline gp3 provides 3,000 IOPS and 125 MiB/s, which moderate CockroachDB workloads can exceed.
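
To make the cap check concrete, observed iostat figures can be compared against the volume's limits. The sketch below uses the baseline gp3 numbers quoted above as defaults. The class and function names, the 90% threshold, and the sample readings are assumptions for illustration; substitute the limits actually provisioned for your volume.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VolumeLimits:
    iops: float
    throughput_mib_s: float


# Baseline gp3 figures from the text; replace with your provisioned values.
GP3_BASELINE = VolumeLimits(iops=3000, throughput_mib_s=125)


def cap_utilization(observed_iops: float, observed_mib_s: float,
                    limits: VolumeLimits = GP3_BASELINE) -> Tuple[float, float]:
    """Return the fraction of each cap in use as (iops, throughput)."""
    return (observed_iops / limits.iops,
            observed_mib_s / limits.throughput_mib_s)


def near_cap(observed_iops: float, observed_mib_s: float,
             limits: VolumeLimits = GP3_BASELINE, threshold: float = 0.9) -> bool:
    """True if either dimension is at or above `threshold` of its cap."""
    return max(cap_utilization(observed_iops, observed_mib_s, limits)) >= threshold


# Hypothetical iostat readings: 1800 IOPS at 118 MiB/s.
print(cap_utilization(1800, 118))
print(near_cap(1800, 118))
```

In this example IOPS looks comfortable while throughput is close to its limit, which is why both dimensions are checked separately.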

If WAL and compaction share the same device and you need immediate relief, separating WAL onto a dedicated fast device is a high-impact operational lever, though it requires planning and a restart.

Address MVCC garbage and tombstones

If L0 pressure follows a large DELETE, DROP, or UPDATE, check for protected timestamp records that block garbage collection. Query crdb_internal.jobs and protected timestamp metrics. Cancel or resume stalled changefeeds or backups that hold old timestamps. Once protected timestamps release, monitor MVCC garbage byte reclamation.

Scale storage or add nodes

If write ingestion legitimately exceeds what a single store’s disk can compact, you need more compaction bandwidth. Options include upgrading to higher-IOPS storage, adding nodes to spread ranges and write load, or increasing the cluster’s aggregate compaction capacity. Adding nodes reduces ranges per node, which also lowers Raft CPU overhead. Do not decommission nodes during an active spiral; rebalancing adds background load and compaction pressure.

Do not restart nodes as a first fix. A restart forces Raft log replay and lease reacquisition, which adds write load and can worsen L0 pressure.

Avoid reducing compaction concurrency

A common reflex is to lower compaction concurrency to reduce background I/O. This trades temporary foreground relief for faster L0 growth. It accelerates the death spiral.

Prevention

  • Instrument storage_l0_sublevels as a primary storage signal. It gives early warning before write stalls. Most teams alert only on disk utilization or IOPS, which miss LSM tree health entirely.
  • Size disk I/O for headroom. Compaction throughput should be at least 2x the sustained write ingestion rate. The LSM cliff is sharp; once L0 starts growing, minutes matter.
  • Rate-limit bulk operations. IMPORT, RESTORE, and large batch inserts should run with explicit rate limits or during off-peak windows.
  • Monitor protected timestamps and MVCC garbage. Stalled CDC or backup jobs silently prevent GC and inflate read amplification until disk space becomes critical.
  • Use /health?ready=1 for load balancer health checks. Simple TCP checks route traffic to nodes that are listening but write-stalled. The readiness endpoint returns HTTP 503 when the node is draining or not live, so the load balancer can take it out of rotation.
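
The 2x sizing rule from the list above reduces to a single ratio. The sketch below is a minimal illustration, assuming you have measured sustained compaction throughput and sustained write ingestion in the same units; the function names and the example figures are hypothetical.

```python
def compaction_headroom(compaction_mib_s: float, ingest_mib_s: float) -> float:
    """Ratio of sustained compaction throughput to sustained write ingestion."""
    if ingest_mib_s <= 0:
        raise ValueError("ingest rate must be positive")
    return compaction_mib_s / ingest_mib_s


def meets_headroom_rule(compaction_mib_s: float, ingest_mib_s: float,
                        required_ratio: float = 2.0) -> bool:
    """True if compaction throughput is at least `required_ratio` times ingestion."""
    return compaction_headroom(compaction_mib_s, ingest_mib_s) >= required_ratio


# Hypothetical store compacting at 90 MiB/s.
print(meets_headroom_rule(90, 40))  # True  (ratio 2.25)
print(meets_headroom_rule(90, 60))  # False (ratio 1.5)
```

Re-evaluate the ratio against peak ingestion, not the daily average, since the spiral starts during bursts.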

How Netdata helps

  • Correlate storage_l0_sublevels per store with disk I/O latency and utilization in the same timeline to confirm whether compaction is disk-bound.
  • Trigger alerts only when L0 sublevels > 20, storage_write_stalls is rising, and ranges_unavailable is nonzero, which filters out most false positives from transient bulk loads.
  • Track raft_process_logcommit_latency alongside WAL fsync latency and admission control queue depth to distinguish storage saturation from network or CPU issues.
  • Visualize lease transfer rate spikes correlated with node liveness transitions to confirm that write stalls are causing Raft lease loss.
  • Long-term retention of LSM and compaction metrics makes it possible to spot slow trends, such as L0 growing from 3 to 8 over weeks, before they become critical.
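
The compound alert condition described in the list above can be expressed as a single predicate. The following is plain Python for illustration, not Netdata alert syntax. The dataclass, its field names, and the sample values are hypothetical, and `storage_write_stalls` is assumed to be a cumulative counter read at two points in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreSnapshot:
    l0_sublevels: float
    write_stalls_prev: float   # counter reading at the previous sample
    write_stalls_curr: float   # counter reading at the current sample
    ranges_unavailable: float


def should_alert(snapshot: StoreSnapshot, l0_threshold: float = 20) -> bool:
    """Fire only when all three conditions hold at once."""
    stalls_rising = snapshot.write_stalls_curr > snapshot.write_stalls_prev
    return (snapshot.l0_sublevels > l0_threshold
            and stalls_rising
            and snapshot.ranges_unavailable > 0)


# Transient bulk load: L0 is high, but no new stalls and no unavailable ranges.
print(should_alert(StoreSnapshot(24, 10, 10, 0)))   # False
# Active spiral: all three signals present.
print(should_alert(StoreSnapshot(24, 10, 35, 3)))   # True
```

Requiring all three signals trades sensitivity for precision, so pair it with a lower-severity warning on `storage_l0_sublevels` alone to keep the early-warning benefit.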