CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls

storage_l0_sublevels climbing on a single store is the earliest actionable signal for CockroachDB storage distress. The cluster may still serve traffic with slightly elevated SQL latency and no active write stalls. L0 sublevels typically provide 10-30 minutes of warning before write stalls begin, but only when monitored per-store.

As L0 sublevels rise, Pebble consults more SSTables per read, read amplification increases nonlinearly, and compaction falls behind. Admission control shapes regular traffic at 5 sublevels and elastic traffic at 1 sublevel. Past 20 sublevels, write stalls become imminent. Because this metric is per-store, a node with multiple stores can mask a hot disk behind healthy node-level aggregates.

What this means

CockroachDB uses Pebble, an LSM-tree engine. Writes buffer in a memtable, then flush to Level 0 as SSTables. Background compaction moves SSTables to deeper levels. L0 sublevels count how many SSTable layers stack up before compaction drains them.

  • 0-5 sublevels: healthy
  • 5-10: compaction lagging, admission control shaping traffic
  • 10-20: active degradation, read latency rising
  • 20+: write stalls imminent or active

This is a per-store metric. A node running multiple stores can have one store in distress while others are cool. If you only monitor aggregate node-level disk metrics, you will miss the hot store until it stalls.
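
To illustrate how an aggregate can hide a hot store, the following is a minimal sketch that reads Prometheus-format metric lines and prints the hottest store's value next to the mean across stores. The function name l0_summary, the label layout, and the sample values are assumptions for illustration, not output captured from a real cluster.

```shell
# Summarize storage_l0_sublevels samples read from stdin (Prometheus text format).
# Prints the highest per-store value, the mean across stores, and the hottest series.
l0_summary() {
  awk '
    # Match only the exact metric name; "# HELP" and "# TYPE" lines are skipped.
    /^storage_l0_sublevels[{ ]/ {
      v = $NF + 0
      sum += v; n++
      if (n == 1 || v > max) { max = v; hot = $1 }
    }
    END {
      if (n == 0) { print "no storage_l0_sublevels samples"; exit 1 }
      printf "max=%d mean=%.1f hottest=%s\n", max, sum / n, hot
    }'
}

# Hypothetical three-store node: one hot store, two healthy ones.
printf '%s\n' \
  'storage_l0_sublevels{store="1"} 2' \
  'storage_l0_sublevels{store="2"} 22' \
  'storage_l0_sublevels{store="3"} 3' | l0_summary
```

In this sample the mean is 9.0 while store 2 is already at 22, which is why the alert should key on the per-store maximum. Against a live node, the same function could be fed from curl -s http://localhost:8080/_status/vars.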

flowchart TD
    A[Write rate exceeds compaction capacity] --> B[L0 sublevels climb]
    B --> C[Read amplification rises]
    C --> D[Compaction slows further]
    D --> B
    B --> E[Admission control throttles at 5 sublevels]
    E --> F[L0 exceeds 20 sublevels]
    F --> G[Write stalls begin]
    G --> H[Node loses Raft leases]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bulk ingestion without rate limiting | L0 spikes during IMPORT, RESTORE, or heavy batch inserts; SQL write throughput high | Active IMPORT/RESTORE jobs in crdb_internal.jobs; sql_insert_count rate |
| Insufficient disk I/O bandwidth | Compaction throughput flat at disk ceiling; I/O utilization high; L0 grows steadily during normal workload | iostat -xz 1 for device saturation; compaction bytes/sec vs write ingestion |
| MVCC tombstone pressure | Large DELETE or UPDATE workload; garbage bytes high; L0 grows despite flat write rate | MVCC garbage metrics; protected timestamp records blocking GC |
| Backup or snapshot consuming bandwidth | L0 rises during backup windows; disk I/O spikes correlate with backup start | Job status for BACKUP; snapshot send/receive rates |
| Admission control throttling compaction too conservatively | L0 grows while compaction throughput is below disk capacity; store-write queue deep with low foreground load | Admission control store-write queue depth and wait times |

Quick checks

# Check L0 sublevels per store
curl -s http://localhost:8080/_status/vars | grep storage_l0_sublevels

# Check for active write stalls
curl -s http://localhost:8080/_status/vars | grep storage_write_stalls

# Check admission control state
curl -s http://localhost:8080/_status/vars | grep admission

# Check compaction backlog and L0 file count
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|l0_num_files'

# Check WAL fsync latency
curl -s http://localhost:8080/_status/vars | grep storage_wal_fsync_latency

# Check KV execution latency for storage layer impact
curl -s http://localhost:8080/_status/vars | grep exec_latency

# Check disk I/O saturation at the OS level
iostat -xz 1

-- List active bulk jobs (SQL: run in a cockroach sql session, not the OS shell)
SELECT job_id, job_type, status, running_status
FROM crdb_internal.jobs
WHERE status = 'running'
  AND job_type IN ('IMPORT', 'RESTORE', 'BACKUP');

How to diagnose it

  1. Confirm scope and trajectory. Check storage_l0_sublevels per store. Is one store hot or is it cluster-wide? Is the value climbing steadily or plateauing? A brief spike during a known bulk job that resolves is different from sustained growth.
  2. Check admission control state. Look at the store-write queue and admission control overload metrics. If admission control is already shaping traffic, the system is protecting itself but has no headroom.
  3. Determine if compaction is disk-bound. Compare compaction throughput to your disk’s provisioned limits. If compaction bytes/sec is flat at the device ceiling, you need more I/O or less write load.
  4. Identify the write source. Correlate L0 growth with SQL write throughput, active IMPORT/RESTORE jobs, or MVCC tombstone generation from large deletes.
  5. Check WAL fsync latency. Elevated WAL fsync latency means the write path is stalling.
  6. Correlate with latency impact. Rising KV execution or SQL service latency without SQL query changes confirms the storage layer is the bottleneck.
  7. Rule out transient warmup. If the node recently restarted, expect brief L0 elevation during Raft log replay. Gate alerts on node uptime above 10 minutes.
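
The "climbing or plateauing" question in step 1 can be answered roughly by fitting a line through recent samples for one store. The sketch below does that with a least-squares slope; the function name l0_trend and the tolerance of 0.1 sublevels per sample are hypothetical choices, and the samples are assumed to be evenly spaced and ordered oldest first.

```shell
# Classify the trajectory of L0 sublevel samples for one store.
# Arguments: numeric samples, oldest first, taken at a fixed interval.
l0_trend() {
  printf '%s\n' "$@" | awk -v tol=0.1 '
    # Accumulate sums for a least-squares fit with x = 0, 1, 2, ...
    { x = NR - 1; sx += x; sy += $1; sxy += x * $1; sxx += x * x; n++ }
    END {
      if (n < 2) { print "insufficient"; exit 1 }
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      if (slope > tol) print "climbing"
      else if (slope < -tol) print "falling"
      else print "plateau"
    }'
}

# Hypothetical samples from one store.
l0_trend 3 5 8 12 17
l0_trend 9 9 10 9 9
```

A "climbing" result on a store that is not running a known bulk job is the case worth escalating; a "plateau" at an elevated value still deserves the admission control and disk checks in steps 2 and 3.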

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| storage_l0_sublevels | Most predictive storage signal; measures LSM tree health before stalls | >5 sustained (planning), >10 (ticket), >20 (page) |
| storage_write_stalls | Pebble pausing writes; active unavailability | Any nonzero during normal workload; rate >1/sec sustained |
| admission.io.overload | Admission control reacting to LSM pressure | Elevated with store-write queue depth |
| storage_wal_fsync_latency | Direct measure of write-path health | P99 >50ms on SSDs |
| exec_latency | KV-layer latency isolating storage from SQL | P99 rising without SQL query changes |
| rocksdb_read_amplification | Files consulted per read; rises with L0 debt | >25 sustained |
| storage_marked_for_compaction_files | Compaction backlog proxy | Increasing over 30+ minutes |
| sql_service_latency | Client-visible tail latency | P99 >2x rolling baseline |

Fixes

Bulk ingestion overwhelming L0

Pause or cancel active IMPORT, RESTORE, or large batch operations. Reduce client write concurrency if admission control is not already limiting it. If the bulk job is necessary, schedule it during low-traffic windows and monitor L0 continuously. Tradeoff: data loads more slowly, but foreground traffic is protected.

Insufficient disk I/O bandwidth

If compaction throughput is flat against the device ceiling, you are underprovisioned. In cloud environments, increase provisioned IOPS or throughput. If using network-attached storage, ensure burst credits are not exhausted. Where possible, ensure WAL and data directories reside on low-latency storage. Tradeoff: higher infrastructure cost or operational complexity.
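
One way to judge whether compaction is pinned at the device ceiling is to turn two readings of a cumulative compaction bytes-written counter into a rate and compare it with the provisioned throughput. The sketch below assumes you supply those readings yourself; the function name compaction_utilization and the example numbers are hypothetical, and no specific metric name is implied.

```shell
# Estimate compaction throughput as a percentage of provisioned disk throughput.
# Arguments: counter at t0 (bytes), counter at t1 (bytes),
#            seconds between readings, provisioned throughput (bytes/sec).
compaction_utilization() {
  awk -v b0="$1" -v b1="$2" -v dt="$3" -v cap="$4" 'BEGIN {
    # Reject non-positive intervals or limits and counter resets (b1 < b0).
    if (dt <= 0 || cap <= 0 || b1 < b0) { print "invalid input"; exit 1 }
    rate = (b1 - b0) / dt
    printf "%.0f%%\n", 100 * rate / cap
  }'
}

# Hypothetical: 6 GB compacted in 60 s on a volume provisioned for 125 MB/s.
compaction_utilization 0 6000000000 60 125000000
```

Compaction is not the only consumer of disk bandwidth (WAL writes, flushes, and reads share the device), so a high percentage here understates total device load; confirm with iostat -xz 1.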

MVCC tombstone pressure

If L0 growth follows a large DELETE or UPDATE, check whether MVCC garbage collection is keeping up. Verify that protected timestamp records from CDC changefeeds or backups are not blocking GC. If garbage bytes are growing without bound, investigate stalled jobs. Tradeoff: reducing GC TTL or clearing protected timestamps affects data retention and CDC consistency.

Backup or snapshot consuming bandwidth

Reschedule full backups to off-peak windows. If snapshot transfers during recovery are compounding the issue, verify that kv.snapshot_rebalance.max_rate and kv.snapshot_recovery.max_rate are not set so high that they starve compaction. Tradeoff: slower recovery or longer backup windows.

Prevention

Treat storage_l0_sublevels as a primary storage signal from Level 2 of your monitoring maturity model. Alert on it per-store, not per-node. Set thresholds at 5 sublevels for planning review, 10 for tickets, and 20 for pages. Gate alerts on node uptime above 10 minutes to avoid cold-start noise.
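
The tiering and uptime gate described above can be sketched as a small decision function. The name l0_alert_tier is hypothetical, the thresholds are treated as inclusive (an assumption), inputs are assumed to be whole numbers, and the "sustained" condition is not modeled here; in practice it would come from the alerting system's evaluation window.

```shell
# Map one store's L0 sublevel count to an alert tier.
# Arguments: sublevel count (integer), node uptime in seconds (integer).
l0_alert_tier() {
  # Suppress alerts until the node has been up for more than 10 minutes.
  if [ "$2" -le 600 ]; then
    echo "suppressed"
    return 0
  fi
  if [ "$1" -ge 20 ]; then
    echo "page"
  elif [ "$1" -ge 10 ]; then
    echo "ticket"
  elif [ "$1" -ge 5 ]; then
    echo "planning"
  else
    echo "ok"
  fi
}

# Hypothetical readings: a freshly restarted node, then a long-running one.
l0_alert_tier 25 300
l0_alert_tier 25 3600
```

Evaluate the function once per store rather than on a node-level aggregate, so that a single hot disk reaches the page tier on its own.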

Size disk I/O so compaction is not device-bound during sustained write load. Compaction amplifies writes, so provision bandwidth well above raw ingestion rates. Monitor disk I/O utilization and WAL fsync latency alongside L0 sublevels to catch I/O saturation before LSM debt accumulates.

Rate-limit bulk operations at the application or job level. Do not rely solely on admission control to absorb unplanned ingestion spikes. Review backup schedules and snapshot rate limits regularly to ensure background work cannot monopolize disk bandwidth.

Monitor MVCC garbage bytes and protected timestamp records weekly. A blocked GC process will eventually create enough tombstones to pressure L0, even with a stable write rate.

How Netdata helps

  • Per-store storage_l0_sublevels correlated with disk I/O, WAL fsync latency, and admission control queue depth in one view.
  • Tiered alerts at 5, 10, and 20 sublevels correlated with storage_write_stalls and KV latency to reduce false positives.
  • L0 growth contextualized against SQL throughput and active job metrics to distinguish bulk ingestion from gradual compaction debt.