CockroachDB Pebble write stalls: when the storage engine refuses writes

When application writes time out or return ambiguous errors, check the CockroachDB logs for `pebble: write stall` and watch the `storage_write_stalls` metric. Pebble pauses writes to a store when Level 0 compaction debt exceeds safe thresholds. Until compaction drains L0, that store cannot accept new writes.

Write stalls are among the most severe storage signals in CockroachDB. A brief stall during bulk loading may be harmless, but stalls sustained at one per second mean the node cannot meet its Raft obligations. The node may still serve reads, but it can lose Raft leadership because it cannot append log entries. That cascades into lease transfers and temporary range unavailability.

The root cause is almost always that writes arrive faster than the disk can compact, or that compaction is blocked by disk space exhaustion or I/O contention. The stall is protective: without it, L0 would grow without bound and read amplification would explode.

What this means

Pebble is a log-structured merge tree. Writes enter a memtable and flush to SSTables in Level 0. Background compaction merges L0 files downward. When the write rate outpaces compaction, the L0 sublevel count grows.

Admission control begins shaping elastic traffic when L0 reaches 1 sublevel and regular traffic at 5 sublevels. These are soft backpressure signals. When L0 crosses roughly 20 sublevels, Pebble triggers a hard write stall, and foreground writes pause until compaction drains L0.
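The thresholds above can be sketched as a small classifier. This is an illustration only: the function name and the state labels are hypothetical, and the cutoffs (1, 5, and roughly 20 sublevels) are simply the figures quoted in this guide, so verify them against your CockroachDB version.

```shell
#!/usr/bin/env bash
# Map an L0 sublevel count to the backpressure stage described in the text.
# Labels are illustrative, not CockroachDB terminology.
classify_l0() {
  # Drop any fractional part so values such as "7.0" compare as integers.
  local sublevels="${1%.*}"
  if   [ "$sublevels" -ge 20 ]; then echo "stall-risk"      # roughly where hard stalls begin
  elif [ "$sublevels" -ge 5 ];  then echo "regular-shaped"  # regular traffic is throttled
  elif [ "$sublevels" -ge 1 ];  then echo "elastic-shaped"  # only elastic work is throttled
  else                               echo "healthy"
  fi
}
```

For example, `classify_l0 7` prints `regular-shaped`, meaning admission control should already be slowing foreground traffic before any hard stall occurs.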

During a stall, SQL writes block. The node cannot persist Raft log entries, which can cause it to forfeit leadership. The cluster transfers leases away, creating brief unavailability for affected ranges.

flowchart TD
    A[Heavy writes or bulk ingestion] --> B[L0 sublevel count climbs]
    B --> C[Read amplification rises]
    C --> D[Compaction slows]
    D --> B
    B --> E[Pebble write stall]
    E --> F[Raft log writes blocked]
    F --> G[Lease transfers and unavailable ranges]

The cycle is self-reinforcing: a large L0 increases read amplification and compaction cost, which lets L0 grow further. Break it by reducing write pressure or increasing compaction capacity.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Write rate exceeds disk compaction throughput | `storage_l0_sublevels` climbing steadily; compaction throughput at disk ceiling; disk I/O saturated | Per-store L0 count versus compaction bytes per second |
| Bulk ingestion without rate limiting | IMPORT or RESTORE running; stalls correlate with job execution; admission control store-write queue deep | `crdb_internal.jobs` for active IMPORT, RESTORE, or index backfills |
| Backup or snapshot I/O consuming disk bandwidth | Backup window overlaps with peak traffic; disk I/O utilization jumps before L0 spikes | Job metrics and disk I/O latency |
| MVCC tombstone pressure from large deletes | Elevated L0 after a batch DELETE or UPDATE; MVCC garbage bytes growing | Recent DML patterns and MVCC garbage metrics |
| Disk space exhaustion preventing compaction | Low free space; L0 grows even though foreground write rate is moderate | `capacity_available` per store |

Quick checks

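Run these read-only checks from a node or monitoring host. The examples assume the node's HTTP endpoint is at localhost:8080; substitute the node's address when running from a monitoring host.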

# Check write stall counter per store
curl -s http://localhost:8080/_status/vars | grep 'storage_write_stalls'

# Check L0 sublevel count per store
curl -s http://localhost:8080/_status/vars | grep 'storage_l0_sublevels'

# Check admission control signals
curl -s http://localhost:8080/_status/vars | grep 'admission'

# Check compaction activity
curl -s http://localhost:8080/_status/vars | grep -E 'compaction|compacted'

# Check per-store disk usage
curl -s http://localhost:8080/_status/vars | grep -E 'capacity_used|capacity_available'

# Check for disk stall detection (distinct from Pebble stalls)
curl -s http://localhost:8080/_status/vars | grep 'storage_disk_stalled'

# List running jobs that may generate heavy writes
# Add connection flags (e.g., --host, --certs-dir) as required for your deployment
cockroach sql -e "SELECT job_id, job_type, status FROM crdb_internal.jobs WHERE status = 'running';"

# Search logs for write stall events (adjust path to your deployment)
grep "pebble: write stall" /path/to/logs/cockroach.log

A stall rate of one per minute is usually not material during bulk operations. A rate of one per second sustained for more than a minute is an active emergency.
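To turn the raw counter into a rate, sample it twice and divide by the interval. The sketch below assumes the Prometheus text format served by `/_status/vars`, with one `storage_write_stalls` line per store; the helper names `sum_metric` and `stall_rate_per_min` are hypothetical.

```shell
#!/usr/bin/env bash
# Sum one metric across all stores from Prometheus-format text on stdin.
sum_metric() {
  awk -v name="$1" '
    $1 ~ /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      n = index($1, "{")                      # strip the {labels} suffix, if any
      key = (n > 0) ? substr($1, 1, n - 1) : $1
      if (key == name) total += $NF           # the value is the last field
    }
    END { printf "%d\n", total }'
}

# Convert two counter readings taken <seconds> apart into stalls per minute.
stall_rate_per_min() {
  local before="$1" after="$2" seconds="$3"
  [ "$seconds" -gt 0 ] || { echo "interval must be positive" >&2; return 1; }
  echo $(( (after - before) * 60 / seconds ))
}

# Usage against a live node (address is the same assumption as the checks above):
#   a=$(curl -s http://localhost:8080/_status/vars | sum_metric storage_write_stalls)
#   sleep 60
#   b=$(curl -s http://localhost:8080/_status/vars | sum_metric storage_write_stalls)
#   stall_rate_per_min "$a" "$b" 60
```

A result near 1 matches the "usually not material" case during bulk operations, while a result of 60 or more corresponds to one stall per second.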

How to diagnose it

Confirm the stall is active. Verify that storage_write_stalls is still incrementing or that stall log lines have appeared within the last minute. Use a short time window; brief spikes during batch jobs can self-resolve.
Identify the affected store or stores. storage_l0_sublevels is a per-store gauge. If one store is elevated while others are flat, investigate that specific disk. A node with multiple stores can have one hot and one cool.
Determine whether L0 is rising or falling. A falling L0 during a stall means compaction is catching up. A rising L0 means the stall will persist or worsen. If L0 does not track the stall pattern, check the stall reason recorded in the Pebble log line; Pebble can also stall when too many memtables are waiting to flush.
Correlate with foreground workload. Check sql_insert_count, sql_update_count, and active jobs. If SQL throughput is normal but a backup is running, the backup is likely saturating I/O. If SQL throughput itself is high, the workload has outgrown the disk.
Check disk I/O saturation. Use OS-level metrics or Prometheus disk stats. If compaction is already at the device throughput ceiling, ingestion exceeds physical capacity. Upgrade the disk or add nodes.
Check disk space. If capacity_available is below 15-20% of total capacity, compaction may be unable to stage temporary files. This creates a space-amplification death spiral where lack of free space prevents the very operation that would reclaim it.
Look for MVCC tombstone spikes. Large DELETE or UPDATE operations generate tombstones that increase compaction work. Check intent and MVCC garbage metrics if L0 spiked after a schema change or data purge. Protected timestamps from stalled changefeeds can also block garbage collection, amplifying compaction debt.
Evaluate Raft impact. Check leases_transfers_success and ranges_unavailable. If leadership is flapping, the stall is affecting cluster availability, not just latency. You may need to shed client traffic from the affected node until L0 recovers.
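The store-identification step can be sketched as a filter over the metrics endpoint. This assumes each per-store line carries a `store="N"` label, as in the Prometheus output of `/_status/vars`; `per_store` and `hot_stores` are hypothetical helper names, and the default threshold of 10 is the early-warning level used elsewhere in this guide.

```shell
#!/usr/bin/env bash
# Print "store value" pairs for one per-store metric read from stdin.
per_store() {
  awk -v name="$1" '
    $1 ~ /^#/ { next }                          # skip HELP/TYPE comment lines
    index($1, name "{") == 1 {                  # exact metric name followed by labels
      store = "unknown"
      if (match($1, /store="[^"]*"/)) {
        # Skip the 7 characters of store=" and drop the closing quote.
        store = substr($1, RSTART + 7, RLENGTH - 8)
      }
      print store, $NF
    }'
}

# Print the IDs of stores whose value exceeds the threshold (default 10).
hot_stores() {
  local threshold="${2:-10}"
  per_store "$1" | awk -v t="$threshold" '$2 + 0 > t + 0 { print $1 }'
}

# Usage:
#   curl -s http://localhost:8080/_status/vars | hot_stores storage_l0_sublevels
```

If only one store ID is printed on a multi-store node, the investigation can focus on that disk rather than on the node as a whole.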

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| `storage_write_stalls` | Direct count of write refusal events | Any nonzero rate during normal OLTP; > 1/sec sustained |
| `storage_l0_sublevels` | Leading indicator before stalls occur | Sustained > 10; > 20 with rising trend |
| `admission.io.overload` and store-write queue | Intentional throttling before hard stalls | Queue depth > 0 with growing wait times |
| Compaction throughput | Whether background maintenance keeps pace | Throughput at disk maximum with backlog growing |
| `capacity_available` | Compaction needs free space to operate | < 20% free and trending down |
| `raft.process.logcommit.latency` | WAL path health; stalls block this directly | Elevated commit latency correlating with stall events |
| `leases_transfers_success` | Leadership instability from inability to write | Spike during stall windows |
| `storage_disk_stalled` | Distinct signal for disk hardware failure | Nonzero value indicates possible node self-termination |
| `intentbytes` or MVCC garbage | Tombstones inflate compaction work | Growth after large DELETE or UPDATE |

Fixes

Reduce write pressure

Pause any running IMPORT, RESTORE, or large index backfills. Reduce client concurrency or batch sizes. Verify admission control is enabled so the store-write queue can throttle elastic work before hard stalls occur. If you cannot pause the workload, redirect write traffic away from the affected node using zone configurations or load balancer weights. Tradeoff: slower bulk operations, reduced ingest throughput, and possible temporary imbalance.
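Pausing the offending jobs can be scripted. The sketch below only builds the SQL text so it can be reviewed before execution; `pause_jobs_sql` is a hypothetical helper, and it assumes that `PAUSE JOBS` accepts a job-ID subquery and that `crdb_internal.jobs` reports types such as `IMPORT` and `RESTORE` in your CockroachDB version.

```shell
#!/usr/bin/env bash
# Build a PAUSE JOBS statement covering running jobs of the given types.
pause_jobs_sql() {
  [ "$#" -gt 0 ] || { echo "usage: pause_jobs_sql TYPE [TYPE...]" >&2; return 1; }
  local types="" t
  for t in "$@"; do
    # Append each type as a quoted SQL literal, comma-separated.
    types="${types:+$types, }'$t'"
  done
  echo "PAUSE JOBS (SELECT job_id FROM crdb_internal.jobs WHERE status = 'running' AND job_type IN ($types));"
}

# Usage (add connection flags as required for your deployment):
#   pause_jobs_sql IMPORT RESTORE            # review the statement first
#   cockroach sql -e "$(pause_jobs_sql IMPORT RESTORE)"
```

Paused jobs keep their progress, so they can be resumed once L0 has drained.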

Reschedule or throttle background I/O

If a backup or snapshot stream is competing with foreground writes, move the backup window to off-peak hours or reduce snapshot recovery rates to free disk I/O headroom for compaction. Tradeoff: longer backup windows and slower node recovery, but foreground availability is preserved.

Add disk I/O capacity

If sustained write ingestion exceeds the disk’s compaction capacity, upgrade to higher-IOPS storage or add nodes to distribute the write load. Pebble compacts each store independently, so giving a node multiple stores on separate disks can also help isolate a hot disk. In cloud environments, check whether EBS burst credits or provisioned IOPS limits are throttling compaction. Tradeoff: infrastructure cost and possible rebalancing traffic during the upgrade.

Free disk space

When disk space is critically low, compaction cannot run. Remove unnecessary data, cancel hung jobs that generate temp files, or expand the underlying volume. If protected timestamps are blocking MVCC garbage collection, resolve the stalled changefeed or backup that owns the record. Monitor spanconfig_kvsubscriber_protected_record_count to confirm the blockage. Tradeoff: deleting data is destructive; expanding volumes may require maintenance windows depending on your storage layer.

Split hot ranges

If a single range is driving L0 growth on one store, a manual split can reduce per-range write load and let the cluster rebalance the resulting ranges. Tradeoff: splits create brief unavailability windows for the range and can increase total range count, which raises per-node Raft overhead.

Prevention

Monitor storage_l0_sublevels proactively. Do not wait for write stalls. A sustained L0 above 10 is an early warning that compaction is lagging.
Size disk I/O for compaction headroom. Compaction throughput should be at least twice the sustained write ingestion rate to absorb bursts.
Keep disk utilization below 80%. CockroachDB recommends at least 20% free space (30% preferred) so compaction can stage temporary files.
Rate-limit bulk jobs. Schedule IMPORT, RESTORE, and large schema changes during low-traffic windows and rely on admission control to shape elastic traffic at 1 L0 sublevel.
Track MVCC garbage and protected timestamps. A stalled changefeed or abandoned backup can block garbage collection, silently inflating compaction work until disk fills.
Watch backup duration trends. A backup that grows from minutes to hours may soon overlap with peak traffic and saturate I/O.
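The free-space guideline above can be checked with a little arithmetic. This is a minimal sketch: `free_pct` and `needs_space` are hypothetical helpers, the inputs are assumed to be a store's available and total capacity in bytes, and the 20% default is the minimum quoted in this guide.

```shell
#!/usr/bin/env bash
# Print the free-space percentage for a store, to one decimal place.
free_pct() {
  awk -v available="$1" -v capacity="$2" 'BEGIN {
    if (capacity + 0 <= 0) { print "capacity must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", 100 * available / capacity
  }'
}

# Succeed (exit 0) when free space is below the minimum percentage (default 20).
needs_space() {
  local pct
  pct="$(free_pct "$1" "$2")" || return 2
  awk -v p="$pct" -v min="${3:-20}" 'BEGIN { exit !(p + 0 < min + 0) }'
}

# Usage:
#   needs_space "$available_bytes" "$capacity_bytes" && echo "free space below 20%"
```

With 150 GB available on a 1000 GB store, `free_pct` reports 15.0, which is under the 20% floor and would warrant action before compaction starts to struggle.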

How Netdata helps

Chart storage_write_stalls against storage_l0_sublevels to spot the leading indicator before application impact.
Monitor admission control queue depths and wait durations alongside stall counters to distinguish workload overload from disk hardware issues.
Cross-reference write stalls with per-node disk I/O, WAL fsync latency, and CPU to determine whether the bottleneck is storage bandwidth or tombstone pressure.
Alert on L0 sublevels > 10 and escalate when write stall rate exceeds 1 per second to reduce noise from brief bulk-load events.
Overlay lease transfer rates and unavailable range counts to reveal whether a stall is cascading into Raft leadership loss.
Track disk space and protected timestamp age to catch the silent precursors of compaction failure.
