CockroachDB storage_l0_sublevels climbing: the earliest warning of write stalls

A climbing storage_l0_sublevels value on a single store is the earliest actionable signal of CockroachDB storage distress. At that point the cluster may still be serving traffic with only slightly elevated SQL latency and no active write stalls. L0 sublevels typically give 10-30 minutes of warning before write stalls begin, but only when monitored per store.

As L0 sublevels rise, Pebble consults more SSTables per read, read amplification increases nonlinearly, and compaction falls behind. Admission control shapes regular traffic at 5 sublevels and elastic traffic at 1 sublevel. Past 20 sublevels, write stalls become imminent. Because this metric is per-store, a node with multiple stores can mask a hot disk behind healthy node-level aggregates.

What this means

CockroachDB uses Pebble, an LSM-tree engine. Writes buffer in a memtable, then flush to Level 0 (L0) as SSTables. Background compaction moves SSTables to deeper levels. The L0 sublevel count is the number of overlapping SSTable layers that have stacked up in L0 waiting for compaction to drain them.

0-5 sublevels: healthy
5-10: compaction lagging, admission control shaping traffic
10-20: active degradation, read latency rising
20+: write stalls imminent or active
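The bands above can be written as a small shell function. This is a minimal sketch: the name `l0_band` is hypothetical, and because the listed ranges share their endpoints, it assumes each boundary value (5, 10, 20) belongs to the higher band.

```shell
# l0_band: map an integer L0 sublevel count to the health bands listed above.
# Assumption: boundary values (5, 10, 20) fall into the higher band.
l0_band() {
  n=$1
  case $n in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if [ "$n" -ge 20 ]; then
    echo "write stalls imminent or active"
  elif [ "$n" -ge 10 ]; then
    echo "active degradation"
  elif [ "$n" -ge 5 ]; then
    echo "compaction lagging"
  else
    echo "healthy"
  fi
}
```

For example, `l0_band 12` prints `active degradation`.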

This is a per-store metric. A node running multiple stores can have one store in distress while others are cool. If you only monitor aggregate node-level disk metrics, you will miss the hot store until it stalls.
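To see how an aggregate hides a hot store, the sketch below prints each store's value next to the mean and the maximum. It reads Prometheus text on stdin, so it can be fed from `/_status/vars`. The function name `l0_by_store` is hypothetical, and the sample-line shape (a `store="..."` label, value in the last field) is an assumption to verify against your own endpoint output.

```shell
# l0_by_store: read Prometheus text on stdin and print the L0 sublevel count
# for every store, followed by the mean and the maximum across stores.
# Assumption: samples look like  storage_l0_sublevels{store="1",...} 3
l0_by_store() {
  awk '
    /^storage_l0_sublevels[{ ]/ {
      store = "unknown"
      if (match($0, /store="[^"]*"/)) {
        # skip the 7-character prefix store=" and drop the closing quote
        store = substr($0, RSTART + 7, RLENGTH - 8)
      }
      v = $NF + 0
      printf "store=%s sublevels=%d\n", store, v
      sum += v; n++
      if (n == 1 || v > max) max = v
    }
    END {
      if (n > 0) printf "mean=%.1f max=%d\n", sum / n, max
    }
  '
}
```

Usage: `curl -s http://localhost:8080/_status/vars | l0_by_store`. With three stores at 2, 22, and 3 sublevels the summary line is `mean=9.0 max=22`: the mean sits in the "compaction lagging" band while one store is already past 20.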

flowchart TD
    A[Write rate exceeds compaction capacity] --> B[L0 sublevels climb]
    B --> C[Read amplification rises]
    C --> D[Compaction slows further]
    D --> B
    B --> E[Admission control throttles at 5 sublevels]
    E --> F[L0 exceeds 20 sublevels]
    F --> G[Write stalls begin]
    G --> H[Node loses Raft leases]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Bulk ingestion without rate limiting | L0 spikes during IMPORT, RESTORE, or heavy batch inserts; SQL write throughput high | Active IMPORT/RESTORE jobs in `crdb_internal.jobs`; `sql_insert_count` rate |
| Insufficient disk I/O bandwidth | Compaction throughput flat at disk ceiling; I/O utilization high; L0 grows steadily during normal workload | `iostat -xz 1` for device saturation; compaction bytes/sec vs write ingestion |
| MVCC tombstone pressure | Large DELETE or UPDATE workload; garbage bytes high; L0 grows despite flat write rate | MVCC garbage metrics; protected timestamp records blocking GC |
| Backup or snapshot consuming bandwidth | L0 rises during backup windows; disk I/O spikes correlate with backup start | Job status for BACKUP; snapshot send/receive rates |
| Admission control throttling compaction too conservatively | L0 grows while compaction throughput is below disk capacity; store-write queue deep with low foreground load | Admission control `store-write` queue depth and wait times |

Quick checks

# Check L0 sublevels per store
curl -s http://localhost:8080/_status/vars | grep storage_l0_sublevels

# Check for active write stalls
curl -s http://localhost:8080/_status/vars | grep storage_write_stalls

# Check admission control state
curl -s http://localhost:8080/_status/vars | grep admission

# Check compaction backlog and L0 file count
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|l0_num_files'

# Check WAL fsync latency
curl -s http://localhost:8080/_status/vars | grep storage_wal_fsync_latency

# Check KV execution latency for storage layer impact
curl -s http://localhost:8080/_status/vars | grep exec_latency

# Check disk I/O saturation at the OS level
iostat -xz 1

-- List active bulk jobs (SQL: run this in a SQL client such as cockroach sql, not in the shell)
SELECT job_id, job_type, status, running_status
FROM crdb_internal.jobs
WHERE status = 'running'
  AND job_type IN ('IMPORT', 'RESTORE', 'BACKUP');

How to diagnose it

1. Confirm scope and trajectory. Check storage_l0_sublevels per store. Is one store hot or is it cluster-wide? Is the value climbing steadily or plateauing? A brief spike during a known bulk job that resolves is different from sustained growth.
2. Check admission control state. Look at the store-write queue and admission control overload metrics. If admission control is already shaping traffic, the system is protecting itself but has no headroom.
3. Determine if compaction is disk-bound. Compare compaction throughput to your disk’s provisioned limits. If compaction bytes/sec is flat at the device ceiling, you need more I/O or less write load.
4. Identify the write source. Correlate L0 growth with SQL write throughput, active IMPORT/RESTORE jobs, or MVCC tombstone generation from large deletes.
5. Check WAL fsync latency. Elevated WAL fsync latency means the write path is stalling.
6. Correlate with latency impact. Rising KV execution or SQL service latency without SQL query changes confirms the storage layer is the bottleneck.
7. Rule out transient warmup. If the node recently restarted, expect brief L0 elevation during Raft log replay. Gate alerts on node uptime above 10 minutes.
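The first diagnostic question, whether the value is climbing or plateauing, comes down to a slope over a short window. A minimal sketch, assuming you have already collected `epoch_seconds sublevels` pairs for one store (for example by polling the endpoint once a minute). The name `l0_trend` and the 0.1 sublevels/min tolerance are hypothetical choices, not CockroachDB defaults.

```shell
# l0_trend: read "epoch_seconds sublevels" pairs for ONE store on stdin
# (oldest first) and report the average change per minute between the
# first and last sample.
# Assumption: the 0.1/min tolerance separating "flat" from a trend is arbitrary.
l0_trend() {
  awk '
    NF >= 2 {
      if (n == 0) { t0 = $1 + 0; v0 = $2 + 0 }
      t1 = $1 + 0; v1 = $2 + 0; n++
    }
    END {
      if (n < 2 || t1 <= t0) { print "insufficient data"; exit }
      slope = (v1 - v0) / ((t1 - t0) / 60)
      state = "flat"
      if (slope > 0.1) state = "climbing"
      else if (slope < -0.1) state = "draining"
      printf "%s %+.2f sublevels/min\n", state, slope
    }
  '
}
```

Samples of 4, 6, and 9 sublevels taken at 0, 300, and 600 seconds yield `climbing +0.50 sublevels/min`. A two-point slope is deliberately crude: it ignores what happened between the endpoints, so a spike that has already resolved reads as flat.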

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `storage_l0_sublevels` | Most predictive storage signal; measures LSM tree health before stalls | >5 sustained (planning), >10 (ticket), >20 (page) |
| `storage_write_stalls` | Pebble pausing writes; active unavailability | Any nonzero during normal workload; rate >1/sec sustained |
| `admission.io.overload` | Admission control reacting to LSM pressure | Elevated with `store-write` queue depth |
| `storage_wal_fsync_latency` | Direct measure of write-path health | P99 >50ms on SSDs |
| `exec_latency` | KV-layer latency isolating storage from SQL | P99 rising without SQL query changes |
| `rocksdb_read_amplification` | Files consulted per read; rises with L0 debt | >25 sustained |
| `storage_marked_for_compaction_files` | Compaction backlog proxy | Increasing over 30+ minutes |
| `sql_service_latency` | Client-visible tail latency | P99 >2x rolling baseline |
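The "rate >1/sec" warning sign for `storage_write_stalls` requires turning two readings into a per-second rate. A minimal sketch, assuming the exported value is cumulative since process start. The name `counter_rate` is hypothetical, and treating a decrease as a reset to zero is a common convention, not something this guide specifies.

```shell
# counter_rate: per-second rate of a cumulative counter between two samples.
# Arguments: T_OLD V_OLD T_NEW V_NEW (times in seconds).
counter_rate() {
  awk -v t0="$1" -v v0="$2" -v t1="$3" -v v1="$4" 'BEGIN {
    if ((t1 + 0) <= (t0 + 0)) { print "invalid interval"; exit 1 }
    d = v1 - v0
    if (d < 0) d = v1 + 0   # value went down: assume a restart and count from zero
    printf "%.2f\n", d / (t1 - t0)
  }'
}
```

For example, `counter_rate 0 100 60 220` prints `2.00`: 120 stalls over 60 seconds.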

Fixes

Bulk ingestion overwhelming L0

Pause or cancel active IMPORT, RESTORE, or large batch operations. Reduce client write concurrency if admission control is not already limiting it. If the bulk job is necessary, schedule it during low-traffic windows and monitor L0 continuously. Tradeoff: data loads more slowly in exchange for protecting foreground traffic.
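Pausing is done with the SQL statement `PAUSE JOB <job_id>`. As a hedged sketch, the helper below turns a list of job IDs (for example, the `job_id` column from the jobs query in Quick checks) into statements you can review before running. The name `pause_job_sql` is hypothetical.

```shell
# pause_job_sql: turn job IDs (one per line on stdin) into PAUSE JOB statements.
# Lines that are not purely numeric are skipped, so a stray header row
# such as "job_id" never reaches the SQL shell.
pause_job_sql() {
  awk '/^[0-9]+$/ { printf "PAUSE JOB %s;\n", $0 }'
}
```

Review the generated statements first, then pipe them into `cockroach sql` with your own connection flags. Printing the statements instead of executing them keeps a human in the loop before a production job is paused.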

Insufficient disk I/O bandwidth

If compaction throughput is flat against the device ceiling, you are underprovisioned. In cloud environments, increase provisioned IOPS or throughput. If using network-attached storage, ensure burst credits are not exhausted. Where possible, ensure WAL and data directories reside on low-latency storage. Tradeoff: higher infrastructure cost or operational complexity.

MVCC tombstone pressure

If L0 growth follows a large DELETE or UPDATE, check whether MVCC garbage collection is keeping up. Verify that protected timestamp records from CDC changefeeds or backups are not blocking GC. If garbage bytes are growing without bound, investigate stalled jobs. Tradeoff: reducing GC TTL or clearing protected timestamps affects data retention and CDC consistency.

Backup or snapshot consuming bandwidth

Reschedule full backups to off-peak windows. If snapshot transfers during recovery are compounding the issue, verify that kv.snapshot_rebalance.max_rate and kv.snapshot_recovery.max_rate are not set so high that they starve compaction. Tradeoff: slower recovery or longer backup windows.

Prevention

Treat storage_l0_sublevels as a primary storage signal from Level 2 of your monitoring maturity model. Alert on it per-store, not per-node. Set thresholds at 5 sublevels for planning review, 10 for tickets, and 20 for pages. Gate alerts on node uptime above 10 minutes to avoid cold-start noise.
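The tiering and the uptime gate can be sketched as one function. The assumptions are stated loudly here: the name `l0_alert_tier` is hypothetical, "sustained" is modelled as every sample in the window being at or above the threshold, all arguments are integers, and the uptime value comes from whatever process-uptime source your monitoring stack provides.

```shell
# l0_alert_tier: choose an alert tier for ONE store.
# Usage: l0_alert_tier UPTIME_SECONDS SAMPLE [SAMPLE...]
# Assumption: "sustained" means every sample in the window meets the threshold,
# so the tier is decided by the smallest sample.
l0_alert_tier() {
  up_s=$1
  shift
  # Gate on uptime to skip cold-start noise (10 minutes = 600 seconds).
  if [ "$up_s" -lt 600 ]; then
    echo "suppressed"
    return 0
  fi
  if [ "$#" -eq 0 ]; then
    echo "none"
    return 0
  fi
  min=$1
  for s in "$@"; do
    if [ "$s" -lt "$min" ]; then
      min=$s
    fi
  done
  if [ "$min" -ge 20 ]; then
    echo "page"
  elif [ "$min" -ge 10 ]; then
    echo "ticket"
  elif [ "$min" -ge 5 ]; then
    echo "planning"
  else
    echo "none"
  fi
}
```

For example, `l0_alert_tier 3600 6 7 12` prints `planning`, because the lowest sample in the window is 6, while `l0_alert_tier 120 25 30` prints `suppressed`, because the node has been up for less than 10 minutes. Run it once per store, not on a node-level aggregate.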

Size disk I/O so compaction is not device-bound during sustained write load. Compaction amplifies writes, so provision bandwidth well above raw ingestion rates. Monitor disk I/O utilization and WAL fsync latency alongside L0 sublevels to catch I/O saturation before LSM debt accumulates.
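As a rough worked example of that sizing, the sketch below multiplies the ingestion rate by a write-amplification factor and divides by a target utilization. All three inputs and the name `required_disk_mbps` are hypothetical. Measure the write amplification of your own workload instead of assuming a value.

```shell
# required_disk_mbps: rough write bandwidth needed so compaction is not
# device-bound. Arguments: INGEST_MBPS WRITE_AMP TARGET_UTILIZATION
# (utilization is a fraction between 0 and 1, leaving the rest as headroom).
required_disk_mbps() {
  awk -v ingest="$1" -v wamp="$2" -v util="$3" 'BEGIN {
    if ((util + 0) <= 0 || (util + 0) > 1) { print "invalid utilization"; exit 1 }
    printf "%.0f\n", ingest * wamp / util
  }'
}
```

With 40 MB/s of ingestion, an assumed write amplification of 10, and a target of 80% utilization, `required_disk_mbps 40 10 0.8` prints `500`, more than twelve times the raw ingestion rate.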

Rate-limit bulk operations at the application or job level. Do not rely solely on admission control to absorb unplanned ingestion spikes. Review backup schedules and snapshot rate limits regularly to ensure background work cannot monopolize disk bandwidth.

Monitor MVCC garbage bytes and protected timestamp records weekly. A blocked GC process will eventually create enough tombstones to pressure L0, even with a stable write rate.

How Netdata helps

Per-store storage_l0_sublevels correlated with disk I/O, WAL fsync latency, and admission control queue depth in one view.
Tiered alerts at 5, 10, and 20 sublevels correlated with storage_write_stalls and KV latency to reduce false positives.
L0 growth contextualized against SQL throughput and active job metrics to distinguish bulk ingestion from gradual compaction debt.
