CockroachDB compaction backlog growing: when Pebble can’t keep pace with writes

Pebble’s background compaction threads sometimes fall behind foreground writes. In CockroachDB, this does not cause immediate failure. SSTable files accumulate in Level 0 and the compaction queue, read amplification rises, and the node drifts toward write stalls. Because the database continues to serve traffic, the backlog is easy to miss until it becomes severe.

The earliest visible sign is a gentle upward trend in Level 0 file counts or marked-for-compaction files over hours. These metrics fluctuate with workload bursts, but a sustained upward slope means the store is consuming headroom. As a rule of thumb, a healthy store sustains compaction throughput of at least twice the write ingestion rate. Anything less leaves no margin for bursts, MVCC garbage collection, or rebalancing.
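
The headroom rule can be turned into a small calculation. The sketch below assumes you have two samples of a cumulative bytes counter for compaction and for ingestion; the function names and the sampling approach are illustrative and not part of CockroachDB.

```python
def rate_per_sec(prev_count: float, curr_count: float, interval_s: float) -> float:
    """Average rate between two samples of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter that moved backwards usually means the process restarted;
    # treat that interval as zero instead of reporting a negative rate.
    return max(curr_count - prev_count, 0.0) / interval_s


def headroom_ratio(compaction_bps: float, ingest_bps: float) -> float:
    """Compaction throughput divided by write ingestion rate."""
    if ingest_bps <= 0:
        return float("inf")  # no writes, so any compaction rate keeps pace
    return compaction_bps / ingest_bps


def has_headroom(compaction_bps: float, ingest_bps: float, min_ratio: float = 2.0) -> bool:
    """True when compaction meets the 2x rule of thumb (min_ratio is adjustable)."""
    return headroom_ratio(compaction_bps, ingest_bps) >= min_ratio
```

For example, a store compacting 150 MB/s against 100 MB/s of ingestion has a ratio of 1.5 and fails the check, even though compaction is nominally faster than writes.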

This guide explains how to read those signals, distinguish normal fluctuation from a dangerous trend, and intervene before the backlog triggers an LSM compaction death spiral. If your cluster is already exhibiting write stalls or cascading lease transfers, see the related guide on the full death spiral pattern.

What this means

CockroachDB uses Pebble, a Log-Structured Merge Tree engine. Writes buffer in a memtable, then flush to immutable SSTable files in Level 0. Background compaction merges these files down through deeper levels. When ingestion exceeds compaction throughput, files accumulate faster than they merge.

Two metrics serve as proxies for backlog depth: storage_l0_num_files and storage_marked_for_compaction_files. An upward trend over hours means the store is approaching its write capacity. As the backlog grows, L0 sublevels accumulate. Past 10 sublevels, read latency degrades. Past 20, admission control throttles regular traffic at the store-write queue and shapes elastic traffic. If compaction continues to fall behind, Pebble triggers write stalls, pausing new writes entirely. During a stall, the node can still serve reads, but it may lose Raft leadership because it cannot append log entries, which cascades into lease transfers and transient unavailability.
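
The sublevel thresholds above map naturally onto a small classifier. This is a minimal sketch: the state names are made up for illustration, and the cutoffs of 10 and 20 are the ones quoted in this guide, not values read from the cluster.

```python
def classify_l0_sublevels(sublevels: float) -> str:
    """Map an L0 sublevel count onto the stages described in this guide."""
    if sublevels > 20:
        # Admission control throttles writes; stalls follow if this persists.
        return "throttling"
    if sublevels > 10:
        # Read latency degrades as read amplification rises.
        return "degraded"
    return "healthy"
```

Applied per store, a reading of 14 would come back as "degraded", which is the point at which intervention is still cheap.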

Backlog is per-store, not per-node. A node with multiple stores can have one hot store and one cool store. Aggregating at the node level hides the bottleneck.

flowchart TD
    A[Write ingestion exceeds compaction] --> B[L0 files accumulate]
    B --> C[L0 sublevels rise]
    C --> D[Read amplification spikes]
    D --> E[Compaction slows]
    E --> B
    C --> F[Write stalls begin]
    F --> G[Raft commits stall]
    G --> H[Leases transfer away]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Sustained write burst | L0 files rise across all stores during batch load or IMPORT | SQL throughput and running jobs |
| Insufficient disk I/O | Compaction throughput flat at device ceiling; iostat shows elevated await or %util | Disk I/O utilization and cloud volume caps |
| Backup or snapshot competing | Backlog grows during backup windows; compaction bytes drop | Active backup jobs and snapshot rates |
| MVCC tombstone pressure | Garbage bytes high; read amplification rising even without write spikes | Protected timestamps blocking GC |
| Single hot store | One store’s L0 count spikes while others stay flat | Per-store storage_l0_num_files imbalance |

Quick checks

Run these read-only checks to assess backlog depth and disk health.

# Check L0 file count, backlog, and sublevels per store
curl -s http://localhost:8080/_status/vars | grep -E 'storage_l0_num_files|storage_marked_for_compaction_files|storage_l0_sublevels'

# Check compaction throughput counters
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|compacted'

# Check for active or recent write stalls
curl -s http://localhost:8080/_status/vars | grep 'storage_write_stalls'

# Check admission control store-write queue state
curl -s http://localhost:8080/_status/vars | grep 'admission'

# Check disk I/O latency and utilization
iostat -xz 1 3

If storage_l0_num_files or storage_marked_for_compaction_files is higher on one store than others, the bottleneck is local to that disk. If the values are uniform across stores, the workload has exceeded cluster-wide compaction capacity.
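
The per-store comparison can be scripted against the text returned by the metrics endpoint. The sketch below assumes each sample line carries a `store="..."` label in the Prometheus text format; the sample text, the function names, and the 3x imbalance factor are hypothetical choices for illustration.

```python
import re
from statistics import median
from typing import Dict, List

# Matches lines such as: storage_l0_num_files{store="1"} 4
_SAMPLE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_STORE = re.compile(r'store="(?P<store>[^"]+)"')

# Hypothetical excerpt of metrics text, used only to demonstrate the parser.
SAMPLE = '''# TYPE storage_l0_num_files gauge
storage_l0_num_files{store="1"} 4
storage_l0_num_files{store="2"} 5
storage_l0_num_files{store="3"} 60
storage_l0_sublevels{store="1"} 1
'''


def per_store_values(metrics_text: str, metric: str) -> Dict[str, float]:
    """Collect one metric's value for each store label found in the text."""
    values: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        sample = _SAMPLE.match(line)
        if not sample or sample.group("name") != metric:
            continue
        store = _STORE.search(sample.group("labels"))
        if store:
            values[store.group("store")] = float(sample.group("value"))
    return values


def hot_stores(values: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Stores whose value exceeds `factor` times the median of the other stores."""
    if len(values) < 2:
        return []
    hot = []
    for store, value in values.items():
        others = [v for s, v in values.items() if s != store]
        # Floor the baseline at 1 so near-empty stores do not flag everything else.
        if value > factor * max(median(others), 1.0):
            hot.append(store)
    return sorted(hot)
```

With the sample above, only store 3 is flagged. An empty result with uniformly high values points at a cluster-wide capacity problem instead of a single disk.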

How to diagnose it

  1. Confirm the trend, not a spike. A brief jump during a bulk load is normal if it recovers within minutes. An upward slope over 30 minutes or more is the danger signal.
  2. Check per-store distribution. A node can have one hot store and one cool store, so compare stores individually instead of relying on node-level aggregates.
  3. Compare compaction bytes to write bytes. If compaction throughput is near the disk write bandwidth ceiling, there is no headroom. Healthy clusters maintain compaction throughput at roughly twice the sustained write rate.
  4. Look for competing I/O. Correlate backlog growth with backup jobs, snapshot transfers, or schema change backfills. These consume the same disk bandwidth as compaction.
  5. Watch L0 sublevels. If storage_l0_sublevels climbs past 10 and trends toward 20, the store is entering active degradation. Past 20, admission control is already throttling writes, and write stalls follow if compaction does not recover.
  6. Check admission control queues. Deep, sustained queuing in the store-write queue confirms the storage engine is throttling itself.
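
Step 1, telling a trend from a spike, can be approximated with a least-squares slope over a window of at least 30 minutes. This is a sketch under stated assumptions: samples are (unix_seconds, value) pairs you have collected yourself, and the minimum slope of 0.1 files per minute is an arbitrary placeholder to tune per workload.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_seconds, metric_value)


def slope_per_minute(samples: List[Sample]) -> float:
    """Ordinary least-squares slope of value against time, in units per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return 0.0  # all samples share one timestamp
    return num / den * 60.0


def is_sustained_rise(samples: List[Sample],
                      min_window_s: float = 1800.0,
                      min_slope_per_min: float = 0.1) -> bool:
    """True when the window spans at least 30 minutes and the fitted slope is positive enough."""
    if len(samples) < 2:
        return False
    span = samples[-1][0] - samples[0][0]
    return span >= min_window_s and slope_per_minute(samples) >= min_slope_per_min
```

A spike that recovers inside the window contributes little to the fitted slope, while a steady climb does. A spike at the very end of the window can still raise the slope, so re-check a few minutes later before acting.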

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| storage_l0_num_files | Direct backlog proxy | Trending upward over hours |
| storage_marked_for_compaction_files | Pending compaction work | Growing while L0 is elevated |
| storage_l0_sublevels | Predictor of read amplification and stalls | Sustained > 10 |
| Compaction throughput (bytes/sec) | Whether background I/O keeps pace | Flat at disk ceiling |
| storage_write_stalls | Last-line defense triggering | Any nonzero during normal workload |
| Admission control store-write queue | Internal throttling due to LSM pressure | Sustained depth > 0 |
| Disk I/O await / %util | Physical storage saturation | SSD write await > 5ms sustained |
| WAL fsync latency | Write path health | P99 > 50ms on SSDs |
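
The numeric warning signs in the table can be checked together for one store. The sketch below assumes each field already holds a value averaged over a recent window (so "sustained" is handled upstream); the `StoreSnapshot` type and its field names are hypothetical and do not correspond to CockroachDB metric names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreSnapshot:
    """Windowed readings for one store; field names are illustrative."""
    l0_sublevels: float
    write_stalls_delta: int          # new stalls observed during the window
    store_write_queue_depth: float   # admission control store-write queue
    disk_write_await_ms: float
    wal_fsync_p99_ms: float


def warnings_for(s: StoreSnapshot) -> List[str]:
    """Return the warning signs from the table that this snapshot trips."""
    checks = [
        (s.l0_sublevels > 10, "L0 sublevels above 10"),
        (s.write_stalls_delta > 0, "write stalls observed"),
        (s.store_write_queue_depth > 0, "store-write queue is non-empty"),
        (s.disk_write_await_ms > 5, "disk write await above 5 ms"),
        (s.wal_fsync_p99_ms > 50, "WAL fsync P99 above 50 ms"),
    ]
    return [message for tripped, message in checks if tripped]
```

The trend-based rows (L0 file count, marked-for-compaction files, compaction throughput at the disk ceiling) need a time series and are left out of this snapshot check.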

Fixes

Reduce write ingestion. If the backlog coincides with a bulk load, IMPORT, RESTORE, or heavy batch job, pause or throttle it. Reducing the write rate is the fastest way to let compaction catch up. Do not restart the node; a restart forces Raft log replay and cache warmup, which adds I/O pressure and delays recovery.

Add disk I/O capacity. If compaction is at the device ceiling and the workload is legitimate, scale the storage layer. On cloud volumes, increase provisioned IOPS and throughput. Adding nodes reduces per-node write rate and range count, which lowers compaction demand on each store.

Move competing work off the critical path. Reschedule backups to low-traffic windows, or reduce snapshot transfer rates if recovery traffic is competing with compaction. Schema change backfills also generate heavy writes; avoid running multiple backfills concurrently on the same table.

Separate WAL from data. Placing WAL on a dedicated fast device removes the most latency-sensitive I/O from the compaction bottleneck. This is a high-impact operational change, but it is one of the most effective levers for write-path latency.

Unblock MVCC garbage collection. If MVCC garbage bytes are high and protected timestamp records are stale, cancel or resume stalled changefeeds or backups to allow GC to reclaim tombstones. A tombstone avalanche increases compaction work and can push a lagging store into stalls.

Prevention

  • Maintain compaction throughput at least 2x sustained write ingestion.
  • Monitor storage_l0_num_files and storage_l0_sublevels continuously; do not wait for write stalls.
  • Run bulk ingestion during low-traffic windows and with rate limiting.
  • Size storage with enough provisioned IOPS and throughput to cover foreground writes, compaction, backups, and rebalancing simultaneously.
  • Keep 20-30% disk free to ensure compaction has room for temporary space amplification.
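
The free-space guideline is easy to check from the host. This is a minimal sketch assuming the store directory path is known to you; the function names are illustrative and the 20% default is the lower bound quoted above.

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the filesystem that is still free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def has_compaction_room(path: str, min_free: float = 0.20) -> bool:
    """True when the filesystem holding `path` has at least `min_free` free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return free_fraction(usage.total, usage.free) >= min_free
```

Run it against each store directory separately, since stores on different volumes fill at different rates.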

How Netdata helps

  • Correlate Pebble backlog signals (storage_l0_num_files, storage_marked_for_compaction_files) with disk I/O latency and utilization per device.
  • Surface sustained upward trends in L0 metrics before write stalls trigger.
  • Overlay SQL latency, KV latency, and admission control queue depth on the same timeline to distinguish storage pressure from application contention.
  • Break down metrics per store to isolate whether a backlog is node-wide or localized to a single disk.