CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls
storage_l0_sublevels climbing on a single store is the earliest actionable signal for CockroachDB storage distress. The cluster may still serve traffic with slightly elevated SQL latency and no active write stalls. L0 sublevels typically provide 10-30 minutes of warning before write stalls begin, but only when monitored per-store.
As L0 sublevels rise, Pebble consults more SSTables per read, read amplification increases nonlinearly, and compaction falls behind. Admission control shapes regular traffic at 5 sublevels and elastic traffic at 1 sublevel. Past 20 sublevels, write stalls become imminent. Because this metric is per-store, a node with multiple stores can mask a hot disk behind healthy node-level aggregates.
What this means
CockroachDB uses Pebble, an LSM-tree engine. Writes buffer in a memtable, then flush to Level 0 as SSTables. Background compaction moves SSTables to deeper levels. L0 sublevels count how many SSTable layers stack up before compaction drains them.
- 0-5 sublevels: healthy
- 5-10: compaction lagging, admission control shaping traffic
- 10-20: active degradation, read latency rising
- 20+: write stalls imminent or active
This is a per-store metric. A node running multiple stores can have one store in distress while others are cool. If you only monitor aggregate node-level disk metrics, you will miss the hot store until it stalls.
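To make the masking effect concrete, here is a minimal sketch that classifies sublevel counts into the bands above and compares a node-level average against the worst store. The store names and values are hypothetical, and treating exactly 5, 10, and 20 as the start of the higher band is an assumption, since the listed ranges overlap at their boundaries.

```python
from statistics import mean


def classify_l0(sublevels: float) -> str:
    """Map an L0 sublevel count to the bands listed above.

    Assumption: a boundary value (5, 10, 20) belongs to the higher band.
    """
    if sublevels >= 20:
        return "write stalls imminent or active"
    if sublevels >= 10:
        return "active degradation"
    if sublevels >= 5:
        return "compaction lagging"
    return "healthy"


def summarize_node(per_store: dict[str, float]) -> dict[str, str]:
    """Classify a node by its average and by its worst store."""
    values = list(per_store.values())
    return {
        "node_average": classify_l0(mean(values)),
        "worst_store": classify_l0(max(values)),
    }


# Hypothetical node with four stores, one of them hot.
stores = {"s1": 1, "s2": 2, "s3": 0, "s4": 14}
summary = summarize_node(stores)
print(summary)
```

With these illustrative numbers the average is 4.25, which lands in the healthy band, while store s4 is already in active degradation.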
flowchart TD
A[Write rate exceeds compaction capacity] --> B[L0 sublevels climb]
B --> C[Read amplification rises]
C --> D[Compaction slows further]
D --> B
B --> E[Admission control throttles at 5 sublevels]
E --> F[L0 exceeds 20 sublevels]
F --> G[Write stalls begin]
G --> H[Node loses Raft leases]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Bulk ingestion without rate limiting | L0 spikes during IMPORT, RESTORE, or heavy batch inserts; SQL write throughput high | Active IMPORT/RESTORE jobs in crdb_internal.jobs; sql_insert_count rate |
| Insufficient disk I/O bandwidth | Compaction throughput flat at disk ceiling; I/O utilization high; L0 grows steadily during normal workload | iostat -xz 1 for device saturation; compaction bytes/sec vs write ingestion |
| MVCC tombstone pressure | Large DELETE or UPDATE workload; garbage bytes high; L0 grows despite flat write rate | MVCC garbage metrics; protected timestamp records blocking GC |
| Backup or snapshot consuming bandwidth | L0 rises during backup windows; disk I/O spikes correlate with backup start | Job status for BACKUP; snapshot send/receive rates |
| Admission control throttling compaction too conservatively | L0 grows while compaction throughput is below disk capacity; store-write queue deep with low foreground load | Admission control store-write queue depth and wait times |
Quick checks
# Check L0 sublevels per store
curl -s http://localhost:8080/_status/vars | grep storage_l0_sublevels
# Check for active write stalls
curl -s http://localhost:8080/_status/vars | grep storage_write_stalls
# Check admission control state
curl -s http://localhost:8080/_status/vars | grep admission
# Check compaction backlog and L0 file count
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|l0_num_files'
# Check WAL fsync latency
curl -s http://localhost:8080/_status/vars | grep storage_wal_fsync_latency
# Check KV execution latency for storage layer impact
curl -s http://localhost:8080/_status/vars | grep exec_latency
# Check disk I/O saturation at the OS level
iostat -xz 1
-- List active bulk jobs (run in a SQL shell such as cockroach sql)
SELECT job_id, job_type, status, running_status
FROM crdb_internal.jobs
WHERE status = 'running'
AND job_type IN ('IMPORT', 'RESTORE', 'BACKUP');
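The per-store check can also be scripted rather than read by eye. The sketch below parses Prometheus-style text, such as the output of the first curl command, and returns one value per store. The sample text is hypothetical: it assumes each sample carries a `store` label, and real output may include additional labels.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def per_store_values(text: str, metric: str = "storage_l0_sublevels") -> dict[str, float]:
    """Return {store label: value} for one metric in Prometheus text format."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        # Assumption: the store is identified by a label named "store".
        values[labels.get("store", "unknown")] = float(match.group("value"))
    return values


# Hypothetical excerpt; not captured from a real cluster.
sample = """# TYPE storage_l0_sublevels gauge
storage_l0_sublevels{store="1"} 3
storage_l0_sublevels{store="2"} 12
storage_l0_num_files{store="1"} 40
"""

result = per_store_values(sample)
print(result)
```

Matching on the exact metric name avoids the false hits that a plain grep for a substring can produce.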
How to diagnose it
- Confirm scope and trajectory. Check storage_l0_sublevels per store. Is one store hot or is it cluster-wide? Is the value climbing steadily or plateauing? A brief spike during a known bulk job that resolves is different from sustained growth.
- Check admission control state. Look at the store-write queue and admission control overload metrics. If admission control is already shaping traffic, the system is protecting itself but has no headroom.
- Determine if compaction is disk-bound. Compare compaction throughput to your disk’s provisioned limits. If compaction bytes/sec is flat at the device ceiling, you need more I/O or less write load.
- Identify the write source. Correlate L0 growth with SQL write throughput, active IMPORT/RESTORE jobs, or MVCC tombstone generation from large deletes.
- Check WAL fsync latency. Elevated WAL fsync latency means the write path is stalling.
- Correlate with latency impact. Rising KV execution or SQL service latency without SQL query changes confirms the storage layer is the bottleneck.
- Rule out transient warmup. If the node recently restarted, expect brief L0 elevation during Raft log replay. Gate alerts on node uptime above 10 minutes.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| storage_l0_sublevels | Most predictive storage signal; measures LSM tree health before stalls | >5 sustained (planning), >10 (ticket), >20 (page) |
| storage_write_stalls | Pebble pausing writes; active unavailability | Any nonzero during normal workload; rate >1/sec sustained |
| admission.io.overload | Admission control reacting to LSM pressure | Elevated with store-write queue depth |
| storage_wal_fsync_latency | Direct measure of write-path health | P99 >50ms on SSDs |
| exec_latency | KV-layer latency isolating storage from SQL | P99 rising without SQL query changes |
| rocksdb_read_amplification | Files consulted per read; rises with L0 debt | >25 sustained |
| storage_marked_for_compaction_files | Compaction backlog proxy | Increasing over 30+ minutes |
| sql_service_latency | Client-visible tail latency | P99 >2x rolling baseline |
Fixes
Bulk ingestion overwhelming L0
Pause or cancel active IMPORT, RESTORE, or large batch operations. Reduce client write concurrency if admission control is not already limiting it. If the bulk job is necessary, schedule it during low-traffic windows and monitor L0 continuously. Tradeoff: data loads more slowly, but foreground traffic is protected.
Insufficient disk I/O bandwidth
If compaction throughput is flat against the device ceiling, you are underprovisioned. In cloud environments, increase provisioned IOPS or throughput. If using network-attached storage, ensure burst credits are not exhausted. Where possible, ensure WAL and data directories reside on low-latency storage. Tradeoff: higher infrastructure cost or operational complexity.
MVCC tombstone pressure
If L0 growth follows a large DELETE or UPDATE, check whether MVCC garbage collection is keeping up. Verify that protected timestamp records from CDC changefeeds or backups are not blocking GC. If garbage bytes are growing without bound, investigate stalled jobs. Tradeoff: reducing GC TTL or clearing protected timestamps affects data retention and CDC consistency.
Backup or snapshot consuming bandwidth
Reschedule full backups to off-peak windows. If snapshot transfers during recovery are compounding the issue, verify that kv.snapshot_rebalance.max_rate and kv.snapshot_recovery.max_rate are not set so high that they starve compaction. Tradeoff: slower recovery or longer backup windows.
Prevention
Treat storage_l0_sublevels as a primary storage signal from Level 2 of your monitoring maturity model. Alert on it per-store, not per-node. Set thresholds at 5 sublevels for planning review, 10 for tickets, and 20 for pages. Gate alerts on node uptime above 10 minutes to avoid cold-start noise.
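The alerting rule described here can be sketched as a small decision function evaluated per store. The names `alert_severity` and `MIN_UPTIME_SECONDS` are hypothetical, and interpreting "sustained" as "every sample in the recent window exceeds the threshold" is an assumption applied to all three tiers for simplicity.

```python
from typing import Optional

# (threshold, severity), checked from most to least severe.
THRESHOLDS = [(20, "page"), (10, "ticket"), (5, "planning")]
MIN_UPTIME_SECONDS = 600  # 10-minute cold-start gate


def alert_severity(recent: list[float], uptime_seconds: float) -> Optional[str]:
    """Return the alert tier for one store, or None if no alert applies.

    recent: sublevel samples from the evaluation window for a single store.
    """
    if uptime_seconds < MIN_UPTIME_SECONDS or not recent:
        return None  # suppress alerts during warmup or with no data
    floor = min(recent)  # assumption: sustained means the whole window is above
    for threshold, severity in THRESHOLDS:
        if floor > threshold:
            return severity
    return None


# Hypothetical windows of samples.
print(alert_severity([6, 7, 8], uptime_seconds=3600))     # planning
print(alert_severity([12, 25, 30], uptime_seconds=3600))  # ticket
print(alert_severity([25, 22], uptime_seconds=120))       # None (warmup)
```

Using the window minimum means a single spike never escalates on its own, at the cost of reacting one window later to a genuine climb.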
Size disk I/O so compaction is not device-bound during sustained write load. Compaction amplifies writes, so provision bandwidth well above raw ingestion rates. Monitor disk I/O utilization and WAL fsync latency alongside L0 sublevels to catch I/O saturation before LSM debt accumulates.
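As a back-of-the-envelope illustration of why raw ingestion rates understate the need, the sketch below multiplies ingestion by a write amplification factor and leaves headroom. The formula is a common rule of thumb rather than anything CockroachDB prescribes, and every number is hypothetical; measure write amplification on your own workload before sizing.

```python
def required_write_bandwidth(
    ingest_mib_s: float,
    write_amplification: float,
    target_utilization: float = 0.7,
) -> float:
    """Estimate disk write bandwidth (MiB/s) needed to keep compaction ahead.

    write_amplification: bytes written to disk per byte ingested (assumed).
    target_utilization: fraction of the device you are willing to use.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return ingest_mib_s * write_amplification / target_utilization


# Hypothetical workload: 20 MiB/s ingested, 10x amplification, 50% headroom.
needed = required_write_bandwidth(20, 10, target_utilization=0.5)
print(needed)
```

In this example a 20 MiB/s ingest rate translates into roughly 400 MiB/s of provisioned write bandwidth.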
Rate-limit bulk operations at the application or job level. Do not rely solely on admission control to absorb unplanned ingestion spikes. Review backup schedules and snapshot rate limits regularly to ensure background work cannot monopolize disk bandwidth.
Monitor MVCC garbage bytes and protected timestamp records weekly. A blocked GC process will eventually create enough tombstones to pressure L0, even with a stable write rate.
How Netdata helps
- Per-store storage_l0_sublevels correlated with disk I/O, WAL fsync latency, and admission control queue depth in one view.
- Tiered alerts at 5, 10, and 20 sublevels correlated with storage_write_stalls and KV latency to reduce false positives.
- L0 growth contextualized against SQL throughput and active job metrics to distinguish bulk ingestion from gradual compaction debt.







