CockroachDB Pebble write stalls: when the storage engine refuses writes

When application writes time out or return ambiguous errors, check the CockroachDB logs for "pebble: write stall" messages and watch the storage_write_stalls metric. Pebble pauses writes to a store when Level 0 (L0) compaction debt exceeds safe thresholds. Until compaction drains L0, that store cannot accept new writes.

Write stalls are the most severe storage signal in CockroachDB. A brief stall during bulk loading may be harmless, but stalls sustained at one per second mean the node cannot meet its Raft obligations. The node may still serve reads, but it can lose Raft leadership because it cannot append log entries, which cascades into lease transfers and temporary range unavailability.

The root cause is almost always that writes arrive faster than the disk can compact, or that compaction is blocked by disk space exhaustion or I/O contention. The stall is protective: without it, L0 would grow without bound and read amplification would explode. Break the feedback loop by reducing write pressure or increasing compaction capacity.

What this means

Pebble is a log-structured merge tree. Writes enter a memtable and flush to SSTables in Level 0. Background compaction merges L0 files downward. When the write rate outpaces compaction, the L0 sublevel count grows.

Admission control shapes regular traffic when L0 reaches 5 sublevels and elastic traffic at 1 sublevel. These are soft backpressure signals. When L0 crosses roughly 20 sublevels, Pebble triggers a hard write stall. Foreground writes pause until compaction drains L0.
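As a rough illustration of how these thresholds layer, the sketch below maps an L0 sublevel count to the behavior described above. The function name classify_l0 and the state labels are hypothetical; only the 1, 5, and 20 sublevel thresholds come from this guide, and the actual cutoffs may differ with your CockroachDB version and cluster settings.

```shell
# Classify an L0 sublevel count using the thresholds described in this guide.
# classify_l0 and its labels are illustrative, not part of CockroachDB.
classify_l0() {
  sublevels=${1%.*}   # drop any fractional part, e.g. "3.0" becomes "3"
  if [ "$sublevels" -ge 20 ]; then
    echo "stall-risk"        # roughly where hard write stalls begin
  elif [ "$sublevels" -ge 5 ]; then
    echo "regular-shaping"   # admission control shapes regular traffic
  elif [ "$sublevels" -ge 1 ]; then
    echo "elastic-shaping"   # admission control shapes elastic traffic
  else
    echo "healthy"
  fi
}
```

Calling classify_l0 with the value of storage_l0_sublevels for a store gives a quick label for dashboards or shell alerts.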

During a stall, SQL writes block. The node cannot persist Raft log entries, which can cause it to forfeit leadership. The cluster transfers leases away, creating brief unavailability for affected ranges.

flowchart TD
    A[Heavy writes or bulk ingestion] --> B[L0 sublevel count climbs]
    B --> C[Read amplification rises]
    C --> D[Compaction slows]
    D --> B
    B --> E[Pebble write stall]
    E --> F[Raft log writes blocked]
    F --> G[Lease transfers and unavailable ranges]

The cycle is self-reinforcing: a large L0 increases read amplification and compaction cost, which lets L0 grow further.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Write rate exceeds disk compaction throughput | storage_l0_sublevels climbing steadily; compaction throughput at disk ceiling; disk I/O saturated | Per-store L0 count versus compaction bytes per second |
| Bulk ingestion without rate limiting | IMPORT or RESTORE running; stalls correlate with job execution; admission control store-write queue deep | crdb_internal.jobs for active IMPORT, RESTORE, or index backfills |
| Backup or snapshot I/O consuming disk bandwidth | Backup window overlaps with peak traffic; disk I/O utilization jumps before L0 spikes | Job metrics and disk I/O latency |
| MVCC tombstone pressure from large deletes | Elevated L0 after a batch DELETE or UPDATE; MVCC garbage bytes growing | Recent DML patterns and MVCC garbage metrics |
| Disk space exhaustion preventing compaction | Low free space; L0 grows even though foreground write rate is moderate | capacity_available per store |

Quick checks

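Run these read-only checks from a node or monitoring host. The commands assume the node's HTTP endpoint is reachable at localhost:8080; adjust the host and port, and use https on secure clusters.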

# Check write stall counter per store
curl -s http://localhost:8080/_status/vars | grep 'storage_write_stalls'

# Check L0 sublevel count per store
curl -s http://localhost:8080/_status/vars | grep 'storage_l0_sublevels'

# Check admission control signals
curl -s http://localhost:8080/_status/vars | grep 'admission'

# Check compaction activity
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|compacted'

# Check per-store disk usage
curl -s http://localhost:8080/_status/vars | grep -E 'capacity_used|capacity_available'

# Check for disk stall detection (distinct from Pebble stalls)
curl -s http://localhost:8080/_status/vars | grep 'storage_disk_stalled'

# List running jobs that may generate heavy writes
# Add connection flags (e.g., --host, --certs-dir) as required for your deployment
cockroach sql -e "SELECT job_id, job_type, status FROM crdb_internal.jobs WHERE status = 'running';"

# Search logs for write stall events (adjust path to your deployment)
grep "pebble: write stall" /path/to/logs/cockroach.log

A stall rate of one per minute is usually not material during bulk operations. A rate of one per second sustained for more than a minute is an active emergency.
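To turn the raw counter into a rate you can compare against those two thresholds, sample it twice and divide by the interval. The sketch below is one way to do that, assuming the Prometheus text format served at /_status/vars; the helper names sum_metric and stall_rate are hypothetical.

```shell
# Sum a metric across all stores from Prometheus-format text on stdin.
# Matches the exact metric name, with or without a label set.
sum_metric() {
  awk -v name="$1" '
    index($0, name) == 1 {
      next_char = substr($0, length(name) + 1, 1)
      if (next_char == "{" || next_char == " ") total += $NF
    }
    END { printf "%d\n", total }'
}

# Print stalls per second from two counter readings and the seconds between them.
# Prints "n/a" for a non-positive interval or a counter reset (e.g. node restart).
stall_rate() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    if (secs <= 0 || curr < prev) { print "n/a"; exit }
    printf "%.3f\n", (curr - prev) / secs
  }'
}
```

For example, capture prev with curl -s http://localhost:8080/_status/vars | sum_metric storage_write_stalls, wait 60 seconds, capture curr the same way, then run stall_rate "$prev" "$curr" 60. A result near 0.017 is about one stall per minute; a result of 1.000 or higher matches the emergency threshold.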

How to diagnose it

  1. Confirm the stall is active. Verify that storage_write_stalls is incrementing or that stall log lines have appeared within the last minute. Use a short time window; brief spikes during batch jobs can self-resolve.
  2. Identify the affected store or stores. storage_l0_sublevels is a per-store gauge. If one store is elevated while others are flat, investigate that specific disk. A node with multiple stores can have one hot and one cool.
  3. Determine whether L0 is rising or falling. A falling L0 during a stall means compaction is catching up. A rising L0 means the stall will persist or worsen. If L0 does not match the stall pattern, check the exact stall reason in the Pebble log line: Pebble can also stall when unflushed memtables accumulate, not only on L0 growth.
  4. Correlate with foreground workload. Check sql_insert_count, sql_update_count, and active jobs. If SQL throughput is normal but a backup is running, the backup is likely saturating I/O. If SQL throughput itself is high, the workload has outgrown the disk.
  5. Check disk I/O saturation. Use OS-level metrics or Prometheus disk stats. If compaction is already at the device throughput ceiling, ingestion exceeds physical capacity. Upgrade the disk or add nodes.
  6. Check disk space. If capacity_available is below 15-20% of total capacity, compaction may be unable to stage temporary files. This creates a space-amplification death spiral where lack of free space prevents the very operation that would reclaim it.
  7. Look for MVCC tombstone spikes. Large DELETE or UPDATE operations generate tombstones that increase compaction work. Check intent and MVCC garbage metrics if L0 spiked after a schema change or data purge. Protected timestamps from stalled changefeeds can also block garbage collection, amplifying compaction debt.
  8. Evaluate Raft impact. Check leases_transfers_success and ranges_unavailable. If leadership is flapping, the stall is affecting cluster availability, not just latency. You may need to shed client traffic from the affected node until L0 recovers.
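Steps 2 and 3 can be sketched as a small filter over the metrics endpoint that ranks stores by L0 sublevel count. This is a minimal sketch, assuming each storage_l0_sublevels sample carries a store label in Prometheus text format; the function name l0_by_store is hypothetical.

```shell
# Read Prometheus-format text on stdin and print "store<TAB>sublevels",
# one line per store, with the highest sublevel count first.
l0_by_store() {
  awk '
    /^storage_l0_sublevels[{ ]/ {
      store = "unknown"
      # Extract the value of the store="..." label if present.
      if (match($0, /store="[^"]*"/)) {
        store = substr($0, RSTART + 7, RLENGTH - 8)
      }
      printf "%s\t%s\n", store, $NF
    }' | sort -t "$(printf '\t')" -k2,2nr
}
```

Run it as curl -s http://localhost:8080/_status/vars | l0_by_store on each node. Repeating the command a minute later shows whether the hottest store is rising or falling.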

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| storage_write_stalls | Direct count of write refusal events | Any nonzero rate during normal OLTP; > 1/sec sustained |
| storage_l0_sublevels | Leading indicator before stalls occur | Sustained > 10; > 20 with rising trend |
| admission.io.overload and store-write queue | Intentional throttling before hard stalls | Queue depth > 0 with growing wait times |
| Compaction throughput | Whether background maintenance keeps pace | Throughput at disk maximum with backlog growing |
| capacity_available | Compaction needs free space to operate | < 20% free and trending down |
| raft.process.logcommit.latency | WAL path health; stalls block this directly | Elevated commit latency correlating with stall events |
| leases_transfers_success | Leadership instability from inability to write | Spike during stall windows |
| storage_disk_stalled | Distinct signal for disk hardware failure | Nonzero value indicates possible node self-termination |
| intentbytes or MVCC garbage | Tombstones inflate compaction work | Growth after large DELETE or UPDATE |

Fixes

Reduce write pressure

Pause any running IMPORT, RESTORE, or large index backfills. Reduce client concurrency or batch sizes. Verify admission control is enabled so the store-write queue can throttle elastic work before hard stalls occur. If you cannot pause the workload, redirect write traffic away from the affected node using zone configurations or load balancer weights. Tradeoff: slower bulk operations, reduced ingest throughput, and possible temporary imbalance.
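One cautious way to act on the first step is to generate the pause statements and review them before running anything. The sketch below assumes tab-separated output with a header row, as produced by cockroach sql --format=tsv, and job_type values of IMPORT and RESTORE; the function name pause_statements is hypothetical. It only prints PAUSE JOB statements and executes nothing. Index backfills usually run under schema change jobs and are deliberately left out, since pausing a schema change is better decided by hand.

```shell
# Read TSV rows (job_id, job_type, status) with a header line on stdin and
# print a PAUSE JOB statement for each IMPORT or RESTORE job. Dry run only.
pause_statements() {
  awk -F '\t' 'NR > 1 && ($2 == "IMPORT" || $2 == "RESTORE") {
    printf "PAUSE JOB %s;\n", $1
  }'
}
```

For example: cockroach sql --format=tsv -e "SELECT job_id, job_type, status FROM crdb_internal.jobs WHERE status = 'running';" | pause_statements. Add connection flags as required, inspect the output, and run only the statements you intend to apply.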

Reschedule or throttle background I/O

If a backup or snapshot stream is competing with foreground writes, move the backup window to off-peak hours or reduce snapshot recovery rates to free disk I/O headroom for compaction. Tradeoff: longer backup windows and slower node recovery, but foreground availability is preserved.

Add disk I/O capacity

If sustained write ingestion exceeds the disk’s compaction capacity, upgrade to higher-IOPS storage or add nodes to distribute the write load. Each store runs its own Pebble instance and compacts independently, so spreading data across multiple stores on a node can also help isolate a hot disk. In cloud environments, check whether EBS burst credits or provisioned IOPS limits are throttling compaction. Tradeoff: infrastructure cost and possible rebalancing traffic during the upgrade.

Free disk space

When disk space is critically low, compaction cannot run. Remove unnecessary data, cancel hung jobs that generate temp files, or expand the underlying volume. If protected timestamps are blocking MVCC garbage collection, resolve the stalled changefeed or backup that owns the record. Monitor spanconfig_kvsubscriber_protected_record_count to confirm the blockage. Tradeoff: deleting data is destructive; expanding volumes may require maintenance windows depending on your storage layer.
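To see which stores are closest to the free-space floor, the sketch below computes the free percentage per store from the metrics endpoint. It assumes a per-store capacity metric (total store capacity) is exported alongside capacity_available; confirm that name against your version's metric list. The function name free_pct_by_store is hypothetical.

```shell
# Read Prometheus-format text on stdin and print "store<TAB>free percent"
# for each store, computed as capacity_available / capacity.
free_pct_by_store() {
  awk '
    # Return the value of the store="..." label, or "unknown" if absent.
    function store_of(line) {
      if (match(line, /store="[^"]*"/)) return substr(line, RSTART + 7, RLENGTH - 8)
      return "unknown"
    }
    /^capacity[{ ]/           { total[store_of($0)] = $NF }
    /^capacity_available[{ ]/ { avail[store_of($0)] = $NF }
    END {
      for (s in total) if (total[s] > 0) {
        printf "%s\t%.1f\n", s, 100 * avail[s] / total[s]
      }
    }' | sort -n
}
```

Run it as curl -s http://localhost:8080/_status/vars | free_pct_by_store. Any store reporting less than 20 is below the free-space level this guide treats as a warning sign.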

Split hot ranges

If a single range is driving L0 growth on one store, a manual split can reduce per-range write load and let the cluster rebalance the resulting ranges. Tradeoff: splits create brief unavailability windows for the range and can increase total range count, which raises per-node Raft overhead.

Prevention

  • Monitor storage_l0_sublevels proactively. Do not wait for write stalls. A sustained L0 above 10 is an early warning that compaction is lagging.
  • Size disk I/O for compaction headroom. Compaction throughput should be at least twice the sustained write ingestion rate to absorb bursts.
  • Keep disk utilization below 80%. CockroachDB recommends at least 20% free space (30% preferred) so compaction can stage temporary files.
  • Rate-limit bulk jobs. Schedule IMPORT, RESTORE, and large schema changes during low-traffic windows, and keep admission control enabled so it can shape elastic traffic once L0 reaches 1 sublevel.
  • Track MVCC garbage and protected timestamps. A stalled changefeed or abandoned backup can block garbage collection, silently inflating compaction work until disk fills.
  • Watch backup duration trends. A backup that grows from minutes to hours may soon overlap with peak traffic and saturate I/O.

How Netdata helps

  • Chart storage_write_stalls against storage_l0_sublevels to spot the leading indicator before application impact.
  • Monitor admission control queue depths and wait durations alongside stall counters to distinguish workload overload from disk hardware issues.
  • Cross-reference write stalls with per-node disk I/O, WAL fsync latency, and CPU to determine whether the bottleneck is storage bandwidth or tombstone pressure.
  • Alert on L0 sublevels > 10, and escalate only when the write stall rate exceeds 1 per second, to reduce noise from brief bulk-load events.
  • Overlay lease transfer rates and unavailable range counts to reveal whether a stall is cascading into Raft leadership loss.
  • Track disk space and protected timestamp age to catch the silent precursors of compaction failure.