CockroachDB compaction backlog growing: when Pebble can't keep pace with writes

Pebble’s background compaction threads sometimes fall behind foreground writes. In CockroachDB, this does not cause immediate failure. SSTable files accumulate in Level 0 and the compaction queue, read amplification rises, and the node drifts toward write stalls. Because the database continues to serve traffic, the backlog is easy to miss until it becomes severe.

The earliest visible sign is a gentle upward trend in Level 0 file counts or marked-for-compaction files over several hours. These metrics fluctuate with workload bursts, but a sustained upward slope means the store is consuming headroom. As a rule of thumb, a healthy store can sustain compaction throughput of at least twice its write ingestion rate. Anything less leaves no margin for bursts, MVCC garbage collection, or rebalancing.

This guide explains how to read those signals, distinguish normal fluctuation from a dangerous trend, and intervene before the backlog triggers an LSM compaction death spiral. If your cluster is already exhibiting write stalls or cascading lease transfers, see the related guide on the full death spiral pattern.

What this means

CockroachDB uses Pebble, a Log-Structured Merge Tree engine. Writes buffer in a memtable, then flush to immutable SSTable files in Level 0. Background compaction merges these files down through deeper levels. When ingestion exceeds compaction throughput, files accumulate faster than they merge.

Two metrics serve as proxies for backlog depth: storage_l0_num_files and storage_marked_for_compaction_files. An upward trend over hours means the store is approaching its write capacity. As the backlog grows, L0 sublevels accumulate. Past 10 sublevels, read latency degrades. Past 20, admission control throttles regular traffic at the store-write queue and shapes elastic traffic. If compaction continues to fall behind, Pebble triggers write stalls, pausing new writes entirely. During a stall, the node can still serve reads, but it may lose Raft leadership because it cannot append log entries. That loss cascades into lease transfers and transient unavailability.

Backlog is per-store, not per-node. A node with multiple stores can have one hot store and one cool store. Aggregating at the node level hides the bottleneck.
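The sublevel thresholds described above can be expressed as a small classifier to run against each store's reading. This is a minimal sketch: the function name classify_l0_sublevels and the band labels are illustrative assumptions, not part of CockroachDB, and the 10 and 20 cutoffs are simply the values quoted in this guide.

```shell
# Classify one store's storage_l0_sublevels reading.
# Bands follow this guide: > 10 degrades reads, > 20 engages admission control.
# The function name and labels are hypothetical, chosen for illustration.
classify_l0_sublevels() {
  awk -v n="$1" 'BEGIN {
    print (n + 0 > 20 ? "throttling" : (n + 0 > 10 ? "degraded" : "ok"))
  }'
}
```

Calling classify_l0_sublevels 14 prints degraded. Run it once per store rather than on a node-wide average, since a single hot store is what triggers the stall.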

flowchart TD
    A[Write ingestion exceeds compaction] --> B[L0 files accumulate]
    B --> C[L0 sublevels rise]
    C --> D[Read amplification spikes]
    D --> E[Compaction slows]
    E --> B
    C --> F[Write stalls begin]
    F --> G[Raft commits stall]
    G --> H[Leases transfer away]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Sustained write burst | L0 files rise across all stores during batch load or IMPORT | SQL throughput and running jobs |
| Insufficient disk I/O | Compaction throughput flat at device ceiling; `iostat` shows elevated await or %util | Disk I/O utilization and cloud volume caps |
| Backup or snapshot competing | Backlog grows during backup windows; compaction bytes drop | Active backup jobs and snapshot rates |
| MVCC tombstone pressure | Garbage bytes high; read amplification rising even without write spikes | Protected timestamps blocking GC |
| Single hot store | One store's L0 count spikes while others stay flat | Per-store `storage_l0_num_files` imbalance |

Quick checks

Run these read-only checks to assess backlog depth and disk health.

# Check L0 file count, backlog, and sublevels per store
curl -s http://localhost:8080/_status/vars | grep -E 'storage_l0_num_files|storage_marked_for_compaction_files|storage_l0_sublevels'

# Check compaction throughput counters
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|compacted'

# Check for active or recent write stalls
curl -s http://localhost:8080/_status/vars | grep 'storage_write_stalls'

# Check admission control store-write queue state
curl -s http://localhost:8080/_status/vars | grep 'admission'

# Check disk I/O latency and utilization
iostat -xz 1 3

If storage_l0_num_files or storage_marked_for_compaction_files is markedly higher on one store than on the others, the bottleneck is local to that disk. If the values are uniformly elevated across stores, the workload has exceeded cluster-wide compaction capacity.
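To make the per-store comparison easier to read, the metrics output can be reduced to one line per store. The sketch below assumes the endpoint emits Prometheus text format with a store="N" label on each storage metric; that label layout is an assumption to verify against your own output, and metric_by_store is a hypothetical helper name.

```shell
# Reads Prometheus-format text on stdin and prints "<store> <value>" for the
# metric named in $1, highest value first.
# Assumes each sample line carries a store="N" label (verify on your cluster).
metric_by_store() {
  awk -v m="$1" '
    # Keep only sample lines for exactly this metric, e.g.
    # storage_l0_num_files{store="1",...} 42
    index($0, m "{") == 1 {
      if (match($0, /store="[^"]*"/)) {
        # Skip the 7 characters of store=" and drop the closing quote.
        store = substr($0, RSTART + 7, RLENGTH - 8)
        print store, $NF
      }
    }' | sort -k2,2nr
}
```

For example, piping `curl -s http://localhost:8080/_status/vars` into `metric_by_store storage_l0_num_files` lists every store on the node with the busiest first, which makes a single hot store stand out immediately.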

How to diagnose it

Confirm the trend, not a spike. A brief jump during a bulk load is normal if it recovers within minutes. An upward slope over 30 minutes or more is the danger signal.
Check per-store distribution. Compare the metrics store by store: a single node can have one hot store and one cool store, and a node-level aggregate hides the difference.
Compare compaction bytes to write bytes. If compaction throughput is near the disk write bandwidth ceiling, there is no headroom. Healthy clusters maintain compaction throughput at roughly twice the sustained write rate.
Look for competing I/O. Correlate backlog growth with backup jobs, snapshot transfers, or schema change backfills. These consume the same disk bandwidth as compaction.
Watch L0 sublevels. If storage_l0_sublevels climbs past 10 and trends toward 20, the store is entering active degradation. Past 20, admission control is already throttling writes and write stalls become a real risk.
Check admission control queues. Deep, sustained queuing in the store-write queue confirms the storage engine is throttling itself.
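The first step, separating a trend from a spike, can be approximated with a least-squares slope over collected samples. The sketch below is one possible approach, not a CockroachDB tool: it assumes you have already recorded one store's storage_l0_num_files as lines of "<unix_seconds> <value>", and the function name l0_trend and the output format are hypothetical.

```shell
# Reads "<unix_seconds> <value>" samples on stdin, oldest first.
# Prints the least-squares slope in units per minute and the window length.
# Input format and function name are assumptions made for this sketch.
l0_trend() {
  awk '
    NR == 1 { t0 = $1 }
    {
      x = ($1 - t0) / 60          # minutes since the first sample
      y = $2
      n++; sx += x; sy += y; sxx += x * x; sxy += x * y
      last = x
    }
    END {
      if (n < 2) { print "insufficient samples"; exit }
      denom = n * sxx - sx * sx
      if (denom == 0) { print "insufficient samples"; exit }
      printf "slope_per_min=%.2f window_min=%.0f\n", (n * sxy - sx * sy) / denom, last
    }'
}
```

A positive slope over a window of 30 minutes or more matches the danger signal described in step 1, while a brief bulk-load spike that recovers tends to produce a slope near zero over the same window. Fitting a line rather than comparing the first and last samples keeps a single noisy reading from dominating the result.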

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `storage_l0_num_files` | Direct backlog proxy | Trending upward over hours |
| `storage_marked_for_compaction_files` | Pending compaction work | Growing while L0 is elevated |
| `storage_l0_sublevels` | Predictor of read amplification and stalls | Sustained > 10 |
| Compaction throughput (bytes/sec) | Whether background I/O keeps pace | Flat at disk ceiling |
| `storage_write_stalls` | Last-line defense triggering | Any nonzero during normal workload |
| Admission control `store-write` queue | Internal throttling due to LSM pressure | Sustained depth > 0 |
| Disk I/O await / %util | Physical storage saturation | SSD write await > 5 ms sustained |
| WAL fsync latency | Write path health | P99 > 50 ms on SSDs |

Fixes

Reduce write ingestion. If the backlog coincides with a bulk load, IMPORT, RESTORE, or heavy batch job, pause or throttle it. Reducing the write rate is the fastest way to let compaction catch up. Do not restart the node; a restart forces Raft log replay and cache warmup, which adds I/O pressure and delays recovery.

Add disk I/O capacity. If compaction is at the device ceiling and the workload is legitimate, scale the storage layer. On cloud volumes, increase provisioned IOPS and throughput. Adding nodes reduces per-node write rate and range count, which lowers compaction demand on each store.

Move competing work off the critical path. Reschedule backups to low-traffic windows, or reduce snapshot transfer rates if recovery traffic is competing with compaction. Schema change backfills also generate heavy writes; avoid running multiple backfills concurrently on the same table.

Separate WAL from data. Placing the WAL on a dedicated fast device removes the most latency-sensitive I/O from the compaction bottleneck. This is a significant operational change that needs careful planning, but it is one of the most effective levers for write-path latency.

Unblock MVCC garbage collection. If MVCC garbage bytes are high and protected timestamp records are stale, cancel or resume stalled changefeeds or backups to allow GC to reclaim tombstones. A tombstone avalanche increases compaction work and can push a lagging store into stalls.

Prevention

Maintain compaction throughput at least 2x sustained write ingestion.
Monitor storage_l0_num_files and storage_l0_sublevels continuously; do not wait for write stalls.
Run bulk ingestion during low-traffic windows and with rate limiting.
Size storage with enough provisioned IOPS and throughput to cover foreground writes, compaction, backups, and rebalancing simultaneously.
Keep 20-30% disk free to ensure compaction has room for temporary space amplification.
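Two of the prevention targets above lend themselves to a quick arithmetic check: the 2x compaction headroom and the free-disk floor. The sketch below is illustrative only. The function names are hypothetical, the throughput figures are values you supply from your own monitoring (this guide does not name specific metrics for them), and the disk check assumes POSIX `df -P` output with the usage percentage in the fifth column.

```shell
# Compare compaction throughput to write ingestion (same units, e.g. MB/s).
# $1 = sustained write rate, $2 = compaction throughput; both user-supplied.
compaction_headroom() {
  awk -v w="$1" -v c="$2" 'BEGIN {
    if (w + 0 <= 0) { print "no writes"; exit }
    r = c / w
    printf "ratio=%.2f %s\n", r, (r >= 2 ? "ok" : "low")
  }'
}

# Reads `df -P <store path>` output on stdin and reports free space.
# $1 = minimum free percent (defaults to 20, the lower bound in this guide).
disk_free_check() {
  awk -v min="${1:-20}" '
    NR == 2 {
      used = $5; sub(/%/, "", used)   # Capacity column, e.g. 85%
      free = 100 - used
      printf "free=%d%% %s\n", free, (free >= min ? "ok" : "low")
    }'
}
```

For example, compaction_headroom 50 120 prints ratio=2.40 ok, and piping `df -P` for a store directory into disk_free_check 20 reports whether the volume still meets the free-space floor.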

How Netdata helps

Correlate Pebble backlog signals (storage_l0_num_files, storage_marked_for_compaction_files) with disk I/O latency and utilization per device.
Surface sustained upward trends in L0 metrics before write stalls trigger.
Overlay SQL latency, KV latency, and admission control queue depth on the same timeline to distinguish storage pressure from application contention.
Break down metrics per store to isolate whether a backlog is node-wide or localized to a single disk.
