Kubernetes etcd disk fsync latency: detection and tuning

etcd serializes every Kubernetes write. When disk fsync latency rises, Raft heartbeats stall, leaders step down, and the API server returns 500s and 429s while controllers retry into a death spiral. Unlike CPU or memory pressure, etcd disk latency is invisible until it is catastrophic. This guide shows how to detect it early, isolate the root cause between storage hardware, database size, and configuration, and tune the cluster to survive production load.

What this means

etcd writes every mutation to a write-ahead log (WAL) and fsyncs before acknowledging the write. etcd_disk_wal_fsync_duration_seconds measures this fsync latency; etcd_disk_backend_commit_duration_seconds tracks the periodic BoltDB backend commit. WAL fsync is on the critical path of every write. Backend commit is periodic but affects read performance and compaction.

In production, WAL fsync p99 should stay below 10ms. Above 50ms, the cluster is stressed. Above 100ms, a single fsync outlasts the default 100ms Raft heartbeat interval, so heartbeats start to arrive late. If the leader cannot fsync and send heartbeats within the 1000ms election timeout, followers start a new election. Each election causes a brief write outage, and the resulting retry storm amplifies load.
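
As a rough illustration, the latency bands above can be encoded in a small shell helper. The function name classify_fsync_p99 and the band labels are invented for this sketch; only the 10ms, 50ms, and 100ms boundaries come from the text.

```shell
# Classify a WAL fsync p99 value (in seconds) into the bands described above.
# classify_fsync_p99 is a hypothetical helper, not part of etcd or etcdctl.
classify_fsync_p99() {
  awk -v p99="$1" 'BEGIN {
    if      (p99 <= 0.010) print "ok"        # up to 10ms: healthy
    else if (p99 <= 0.050) print "warning"   # above 10ms: investigate
    else if (p99 <= 0.100) print "stressed"  # above 50ms: cluster is stressed
    else                   print "critical"  # above the 100ms heartbeat interval
  }'
}

classify_fsync_p99 0.004
classify_fsync_p99 0.120
```

The input is expected in seconds, matching the unit of etcd_disk_wal_fsync_duration_seconds.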

When a single WAL fsync exceeds 1 second, etcd logs the string "slow fdatasync". Any occurrence indicates the disk was saturated at that moment; repeated occurrences confirm chronic saturation.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Shared or slow disk | WAL fsync p99 > 10ms, steady growth, leader flapping | iostat -x on the etcd data volume |
| Database approaching quota | Backend commit latency spikes, writes failing | etcd_mvcc_db_total_size_in_bytes vs --quota-backend-bytes |
| Compaction or defragmentation running | Periodic latency spikes aligned with maintenance window | etcd logs for compaction or defrag activity |
| Noisy neighbor on same volume | Fsync latency spikes without matching etcd write rate | Host-level disk utilization and competing processes |
| Network-attached storage | Highly variable fsync latency, "slow fdatasync" logs | Storage mount type (NAS/NFS is unsuitable) |

Quick checks

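# Adjust the metrics URL for your cluster. kubeadm typically serves etcd
# metrics on http://127.0.0.1:2381/metrics; port 2379 may require TLS client certs.
# WAL fsync latency histogram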
curl -s http://localhost:2379/metrics | grep ^etcd_disk_wal_fsync_duration_seconds

# Backend commit latency
curl -s http://localhost:2379/metrics | grep ^etcd_disk_backend_commit_duration_seconds

# Slow fdatasync log lines (adjust for systemd vs static Pod)
journalctl -u etcd --since "1 hour ago" | grep "slow fdatasync"

# Leader stability
curl -s http://localhost:2379/metrics | grep ^etcd_server_leader_changes_seen_total

# Database size vs quota
etcdctl endpoint status --cluster -w table

# NOSPACE alarm
etcdctl alarm list

# Disk I/O wait on the etcd node
iostat -x 1 5

If you have Prometheus available, calculate WAL fsync p99 with:

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

A p99 above 0.01 (10ms) warrants investigation. A p99 above 0.1 (100ms) warrants a page.
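
Without Prometheus, a similar estimate can be derived directly from the cumulative bucket lines in /metrics. The sketch below assumes a POSIX awk, assumes the buckets appear in ascending le order (as the endpoint normally emits them), and uses a hypothetical function name, fsync_quantile. It interpolates linearly inside the matching bucket, roughly as histogram_quantile does, but over counts accumulated since the process started rather than a 5-minute rate, so it reflects lifetime latency, not recent latency.

```shell
# Estimate a quantile from etcd_disk_wal_fsync_duration_seconds_bucket lines.
# $1 = quantile (for example 0.99); stdin = Prometheus text exposition.
# fsync_quantile is a hypothetical helper name used only for this sketch.
fsync_quantile() {
  awk -v q="$1" '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      le = $0
      sub(/^.*le="/, "", le)      # drop everything before the bucket bound
      sub(/".*$/, "", le)         # drop everything after the bucket bound
      n++
      bound[n] = le
      count[n] = $NF + 0          # cumulative observations up to this bound
    }
    END {
      if (n == 0 || count[n] == 0) { print "no data"; exit 1 }
      rank = q * count[n]         # the last bucket (+Inf) holds the total
      for (i = 1; i <= n; i++) {
        if (count[i] >= rank) {
          if (bound[i] == "+Inf") { print bound[i-1]; exit }
          lo  = (i > 1) ? bound[i-1] : 0
          clo = (i > 1) ? count[i-1] : 0
          printf "%.6f\n", lo + (bound[i] - lo) * (rank - clo) / (count[i] - clo)
          exit
        }
      }
    }'
}

# Usage against a live member (adjust the URL for your cluster):
#   curl -s http://localhost:2379/metrics | fsync_quantile 0.99
```

The result is in seconds, so it can be compared directly with the 0.01 and 0.1 thresholds.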

How to diagnose it

  1. Confirm the symptom from etcd metrics. Look at etcd_disk_wal_fsync_duration_seconds p99. If it is elevated, check etcd_disk_backend_commit_duration_seconds p99. If only WAL fsync is high, the disk subsystem is the bottleneck. If both are high, the database may be fragmented or near quota.

  2. Correlate with leader changes. Check etcd_server_leader_changes_seen_total. A sustained increase in an otherwise stable cluster suggests disk latency is delaying heartbeats long enough to trigger elections. If leader changes are zero but WAL fsync p99 is at or above 50ms, the cluster is still at risk of leader instability.

  3. Check etcd logs for “slow fdatasync”. This log line appears when a WAL fsync exceeds 1 second. Its presence confirms the disk is not merely slow but saturated.

  4. Inspect disk I/O at the OS layer. Run iostat -x 1 on the etcd host. Look for high %util, high await, or elevated queue depth on the device holding the etcd data directory. If the disk is shared with the OS, container logs, or other workloads, the competition is likely the root cause.

  5. Measure database size and quota. Use etcdctl endpoint status --cluster -w table to view DB SIZE, or query etcd_mvcc_db_total_size_in_bytes. Compare against --quota-backend-bytes. If the database is above 80% of quota, BoltDB write amplification is a likely contributor to commit latency.

  6. Identify maintenance-induced spikes. If latency spikes are periodic and align with the compaction interval (default every 5 minutes via --etcd-compaction-interval on the API server), the spikes are expected but their magnitude should stay under 100ms. Defragmentation causes larger spikes and should be run only during maintenance windows, one member at a time.

  7. Verify hardware expectations. etcd requires dedicated locally attached SSD or NVMe. If the data directory lives on a cloud burst-IOPS volume, spinning disk, or network-attached storage, move it. Measure sequential write fsync with fio targeting the etcd data directory to validate the disk before it enters production.

  8. Review heartbeat and election timeouts only after storage issues are ruled out. The defaults are 100ms and 1000ms respectively, and the election timeout should stay at about 10x the heartbeat interval. Raising these values masks disk problems but does not fix them. Tuning is appropriate only when cross-AZ latency legitimately consumes a significant fraction of the heartbeat interval.
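
The fio measurement in step 7 might look like the sketch below. The write size (22m) and block size (2300 bytes) follow a commonly circulated recipe for imitating small WAL appends and are assumptions here, as are the function name run_fsync_bench, the DRY_RUN switch, and the scratch path. Point it at a scratch directory on the same volume, not at a live member's data directory.

```shell
# Benchmark fdatasync latency on the volume intended for etcd data.
# ETCD_BENCH_DIR is a hypothetical scratch path on the target volume.
ETCD_BENCH_DIR="${ETCD_BENCH_DIR:-/var/lib/etcd-bench}"

run_fsync_bench() {
  dir="$1"
  # Sequential small writes, each followed by fdatasync, similar to WAL appends.
  set -- fio --name=etcd-fsync-test --directory="$dir" \
    --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"                      # default: only print the command for review
  else
    mkdir -p "$dir" && "$@"        # DRY_RUN=0: create the directory and run fio
  fi
}

run_fsync_bench "$ETCD_BENCH_DIR"
```

After a real run (DRY_RUN=0), read the 99th percentile in the fsync/fdatasync section of the fio report and compare it with the 10ms target.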

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| etcd_disk_wal_fsync_duration_seconds | Critical path for every write | p99 > 10ms |
| etcd_disk_backend_commit_duration_seconds | BoltDB commit and compaction pressure | p99 > 250ms |
| etcd_server_leader_changes_seen_total | Raft instability from missed heartbeats | Any increase per hour in a stable cluster |
| etcd_mvcc_db_total_size_in_bytes | Space pressure and write amplification | > 80% of --quota-backend-bytes |
| etcd_server_slow_apply_total | Operations backing up behind slow disk | Counter increasing |
| Disk I/O wait (iowait) | Root cause indicator for fsync latency | Sustained > 20% on etcd data device |
| etcd_server_has_leader | Quorum health | Value 0 on any member |
| API server mutating request latency | Cascade indicator from etcd latency | p99 > 1s correlated with etcd fsync |
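
The database size signal is easiest to act on as a percentage of quota. A minimal sketch, assuming the size is read from etcd_mvcc_db_total_size_in_bytes and the quota defaults to 2 GiB (2147483648 bytes) when --quota-backend-bytes is not set; quota_usage_percent is a hypothetical helper name.

```shell
# Print database size as a percentage of the backend quota.
# $1 = database size in bytes, $2 = quota in bytes (optional, default 2 GiB).
quota_usage_percent() {
  awk -v size="$1" -v quota="${2:-2147483648}" \
    'BEGIN { printf "%.1f\n", 100 * size / quota }'
}

quota_usage_percent 1800000000              # against the default 2 GiB quota
quota_usage_percent 4294967296 8589934592   # 4 GiB used of an 8 GiB quota
```

A result above 80 matches the warning sign listed for etcd_mvcc_db_total_size_in_bytes.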

Fixes

If the cause is storage hardware or contention

Move etcd to a dedicated local SSD or NVMe. Do not share the etcd data directory with the OS, logs, or other containers. In cloud environments, use provisioned-IOPS volumes rather than general-purpose burst volumes. If etcd is stacked with the API server on the same node, separate their disks immediately. If NAS or NFS is in use, migrate to locally attached storage; etcd requires durable local writes.

If the cause is database size or fragmentation

Compact historical revisions to reclaim logical space. Enable --auto-compaction-mode=periodic and set --auto-compaction-retention. After compaction, defragment each member individually to reclaim physical disk space. Defragmentation blocks the member, so run it on one member at a time while the cluster retains quorum. If the database consistently approaches the default 2GB quota, increase --quota-backend-bytes up to 8GB and plan capacity accordingly.
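
The one-member-at-a-time defragmentation described above could be scripted roughly as follows. This is a sketch under assumptions: the endpoint addresses are placeholders, the run and defrag_members helpers are hypothetical, TLS flags are omitted, and DRY_RUN defaults to printing the commands instead of executing them.

```shell
# Defragment etcd members sequentially, checking health after each one.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

defrag_members() {
  for ep in "$@"; do
    # Defragmentation blocks this member, so only one endpoint is targeted.
    run etcdctl --endpoints="$ep" defrag || return 1
    # Stop if the member does not report healthy before moving on.
    run etcdctl --endpoints="$ep" endpoint health || return 1
  done
}

# Placeholder endpoints; replace with your members and add TLS flags as needed.
defrag_members https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379
```

Review the printed commands first, then rerun with DRY_RUN=0 during a maintenance window.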

If the cause is snapshot or heartbeat pressure

The default --snapshot-count is 100000 in etcd 3.2 through 3.5 and 10000 from etcd 3.6; confirm the value for your release. If snapshot write spikes correlate with fsync latency spikes, snapshot frequency is a contributor, but the root cause is still disk throughput. Do not change the snapshot count to hide disk limits. If the cluster spans availability zones and network round-trip time consumes a material fraction of the 100ms heartbeat, you may increase --heartbeat-interval and --election-timeout while maintaining the 10x ratio. Treat this as a configuration workaround, not a storage fix.
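
To keep the 10x ratio intact when adjusting these flags, a small guard can validate the pair before it reaches the etcd configuration. The helper name raft_timeout_flags and the 250/2500 values are illustrative assumptions; both flags take milliseconds.

```shell
# Emit heartbeat and election flags only if the election timeout is at
# least 10x the heartbeat interval. Values are in milliseconds.
raft_timeout_flags() {
  hb="$1"
  el="$2"
  if [ "$el" -lt $((hb * 10)) ]; then
    echo "election timeout ${el}ms is below 10x heartbeat ${hb}ms" >&2
    return 1
  fi
  echo "--heartbeat-interval=${hb} --election-timeout=${el}"
}

raft_timeout_flags 100 1000    # the defaults
raft_timeout_flags 250 2500    # hypothetical cross-AZ values
```

Apply the same values to every member so the cluster does not run with mixed timeouts.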

If the cause is a noisy neighbor

Identify competing I/O workloads on the etcd volume using iotop or pidstat -d. Stop non-essential processes, move log aggregation off the etcd disk, and ensure the container runtime image store resides on a separate volume. If the node is a VM, verify that the hypervisor is not oversubscribing storage.

Prevention

  • Dedicated disk. Provision a separate SSD or NVMe volume for the etcd data directory. Never run etcd on the root filesystem or a shared VM disk.
  • Baseline before provisioning. Run a sequential fsync benchmark against the target volume before production deployment to confirm it sustains sub-10ms fsync under load.
  • Monitor WAL fsync p99 as a first-class metric. Set alert thresholds: 10ms warning, 50ms critical, 100ms page.
  • Enable auto-compaction. Use periodic compaction and monitor the etcd_mvcc_db_total_size_in_bytes trend. Alert early, when it exceeds 50% of quota, so there is time to act before the 80% warning level.
  • Defragment during maintenance. Schedule defragmentation during low-traffic windows, one member at a time.
  • Avoid stacked etcd disk sharing. In kubeadm deployments, ensure etcd and kube-apiserver do not contend for the same physical disk.
  • Track leader changes. Any leader change outside of maintenance is an automatic investigation trigger.

How Netdata helps

Netdata correlates etcd disk latency with control plane metrics to expose the cascade:

  • Plots etcd WAL fsync p99 alongside API server mutating request latency and 429 rejection rate.
  • Shows node-level iowait and per-device utilization aligned with etcd fsync spikes to isolate hardware contention.
  • Tracks etcd database size and leader change counters, providing leading indicators before quota or heartbeat thresholds are breached.
  • Correlates etcd_server_slow_apply_total with API Priority and Fairness (APF) queue depth to show when disk latency backs up the control plane.

Diagnostic decision flow (Mermaid source):

flowchart TD
    A[WAL fsync p99 elevated] --> B{Leader changes increasing?}
    B -->|Yes| C[Disk latency crossing heartbeat threshold]
    B -->|No| D[Disk stressed but stable]
    C --> E[Check iostat for disk saturation]
    D --> E
    E --> F{Shared or slow disk?}
    F -->|Yes| G[Move etcd to dedicated local SSD/NVMe]
    F -->|No| H[Check db size vs quota]
    H --> I{Near quota or fragmented?}
    I -->|Yes| J[Compact and defragment one member at a time]
    I -->|No| K[Check for noisy neighbors or NAS]
    K --> L[Remove competing workloads]
    J --> M[Rebaseline with fio]
    G --> M
    L --> M