Kubernetes API server etcd latency: detection and cascading failures
When etcd slows down, the entire control plane slows with it. A few extra milliseconds on disk fsync turns into hung kubectl commands, backed-up controller queues, and eventually a cluster that cannot schedule pods or update endpoints. Detect the etcd latency cascade, confirm whether storage is the root cause, and break the feedback loop before the cluster becomes effectively read-only.
What this means
etcd serializes every Kubernetes mutation. Every API server write becomes a Raft proposal that must fsync to the WAL before etcd acknowledges it. When the disk under etcd is slow, every fsync waits longer. The API server holds mutating requests open until etcd responds. Requests pile up in the inflight queue. Once the queue hits the limit, the API server returns 429 Too Many Requests. Controllers that depend on writes (scheduler, replica set controller, and others) fall behind and retry. Retries generate more write load. The result is a feedback loop: slow disk -> slow etcd -> slow API server -> retry storm -> amplified etcd load.
The failure is asymmetric. Read operations served from the API server watch cache may still respond quickly, so kubectl get can look healthy while kubectl create or kubectl delete hangs. This asymmetry makes the cascade easy to misdiagnose as an API server problem rather than a storage problem.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Disk I/O saturation on the etcd host | WAL fsync p99 climbing above 10ms; leader election storms | iostat -x 1 on etcd nodes |
| Network-attached storage latency | Variable fsync spikes; cloud burst credit exhaustion | Disk type and burst balance |
| etcd database approaching quota | DB size near 80% of the default 2GB; writes fail with NOSPACE alarm | etcdctl endpoint status --write-out=table |
| etcd compaction or defragmentation | Periodic latency spikes aligned with maintenance windows | etcd logs for “compact” or “defrag” |
| Network partition between API server and etcd | Uniform mutating latency elevation; API server readyz etcd check fails | etcdctl endpoint health and peer RTT |
Quick checks
Run these in order. All are read-only.
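# Check etcd WAL fsync latency (etcd metrics endpoint)
# Note: kubeadm clusters usually serve plaintext metrics on http://127.0.0.1:2381/metrics; port 2379 normally requires TLS client certificates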
curl -s http://localhost:2379/metrics | grep ^etcd_disk_wal_fsync_duration_seconds
# Check etcd backend commit latency
curl -s http://localhost:2379/metrics | grep ^etcd_disk_backend_commit_duration_seconds
# Check etcd cluster health and member status
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health
# Check etcd DB size, leader status, and Raft index
ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Check API server readyz etcd sub-check
kubectl get --raw='/readyz?verbose' | grep -A2 etcd
# Check disk I/O wait on the etcd host
iostat -x 1 5
# Check API server inflight requests (request_kind="mutating" and request_kind="readOnly")
kubectl get --raw='/metrics' | grep ^apiserver_current_inflight_requests
# Check 429 rejection rate
kubectl get --raw='/metrics' | grep 'apiserver_request_total' | grep 'code="429"'
# Check etcd leader changes
curl -s http://localhost:2379/metrics | grep ^etcd_server_leader_changes_seen_total
# Check pending Raft proposals
curl -s http://localhost:2379/metrics | grep ^etcd_server_proposals_pending
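The fsync and backend commit checks above print raw cumulative histogram buckets, not a percentile. The sketch below shows one way to turn that output into a rough p99. The function name histogram_quantile_bound is made up for this example, and it assumes the histogram has a single series with buckets listed in ascending order, which is how etcd exposes its disk histograms. Because the buckets count every sample since etcd started, the result reflects lifetime behavior rather than the last few minutes.

```shell
# histogram_quantile_bound METRIC QUANTILE
# Reads Prometheus text-format metrics on stdin and prints the smallest
# bucket upper bound (le, in seconds) whose cumulative count covers
# QUANTILE of all samples. This is an upper bound, not an interpolated value.
histogram_quantile_bound() {
  awk -v metric="$1" -v q="$2" '
    # Only consider bucket lines of the requested histogram.
    index($0, metric "_bucket{") == 1 && match($0, /le="[^"]*"/) {
      n++
      bound[n] = substr($0, RSTART + 4, RLENGTH - 5)  # text between the quotes
      count[n] = $NF + 0                               # cumulative sample count
      if (bound[n] == "+Inf") total = count[n]         # +Inf bucket holds the total
    }
    END {
      if (total == 0) { print "no-samples"; exit }
      for (i = 1; i <= n; i++)
        if (count[i] >= q * total) { print bound[i]; exit }
    }
  '
}
```

A possible invocation is curl -s http://localhost:2379/metrics | histogram_quantile_bound etcd_disk_wal_fsync_duration_seconds 0.99. A result of 0.008 would mean that 99% of fsyncs recorded so far completed within 8ms.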
How to diagnose it
Confirm the cascade and find the root cause.
1. Confirm mutating API latency is elevated. Check apiserver_request_duration_seconds for POST, PUT, and PATCH verbs. If p99 is above 500ms sustained, the control plane is degrading. If it is above 1s, the cluster is in active failure.
2. Check etcd WAL fsync latency. Look at etcd_disk_wal_fsync_duration_seconds on the etcd metrics endpoint. In a healthy cluster, p99 is below 10ms. Above 100ms is critical. This is the root cause signal. If it is elevated, the problem is under etcd, not in the API server.
3. Check etcd leader stability. Look at etcd_server_leader_changes_seen_total. In a stable cluster, this should be near zero. If it is incrementing, the etcd leader is missing heartbeats because disk latency is exceeding the default 100ms heartbeat interval or the 1000ms election timeout.
4. Check etcd database size versus quota. Run etcdctl endpoint status --write-out=table. Compare DB SIZE to the configured --quota-backend-bytes (default 2GB). If the database is above 80% of quota, etcd is close to raising the NOSPACE alarm, after which it rejects writes until space is reclaimed and the alarm is disarmed.
5. Check API server inflight requests and 429 rate. Look at apiserver_current_inflight_requests and apiserver_request_total{code="429"}. If inflight is climbing toward the limit (default 200 mutating, 400 read-only) and 429s are appearing, the API server is saturated because it is waiting on etcd.
6. Check disk I/O on the etcd host. Run iostat -x 1 and look for high %util, elevated await, or queue depth near the device limit. If disk utilization is near 100%, the storage subsystem is the bottleneck. If the disk is network-attached, check for burst credit exhaustion.
7. Distinguish from admission webhook slowdown. Check apiserver_admission_webhook_admission_duration_seconds. If webhook latency is normal while mutating API latency is high, etcd is the culprit. If webhook latency is also elevated, the bottleneck may be a slow webhook instead.
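The inflight check in the steps above comes down to comparing two gauges against their limits. As a minimal sketch, the function below reads API server metrics on stdin and reports how close each request kind is to its limit. The name inflight_utilization and the SATURATING marker are invented for this example; the default limits of 200 and 400 and the 80% warning level are the values quoted in this guide. With API Priority and Fairness enabled, the two limits are combined into one shared concurrency budget, so treat the per-kind percentages as a rough signal only.

```shell
# inflight_utilization [MUTATING_LIMIT] [READONLY_LIMIT]
# Reads API server metrics on stdin and prints, per request kind,
# current/limit and the utilization percentage.
inflight_utilization() {
  awk -v mlimit="${1:-200}" -v rlimit="${2:-400}" '
    index($0, "apiserver_current_inflight_requests{") == 1 {
      if ($0 ~ /request_kind="mutating"/)      { kind = "mutating"; limit = mlimit }
      else if ($0 ~ /request_kind="readOnly"/) { kind = "readOnly"; limit = rlimit }
      else next
      pct = ($NF + 0) * 100 / limit
      # Flag anything above 80% of the configured limit.
      printf "%s %d/%d %.0f%%%s\n", kind, $NF, limit, pct, (pct > 80 ? " SATURATING" : "")
    }
  '
}
```

It could be fed with kubectl get --raw='/metrics' | inflight_utilization, passing the configured limits as arguments if they differ from the defaults.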
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| etcd_disk_wal_fsync_duration_seconds | WAL fsync is on the critical path of every write | p99 > 10ms trending upward |
| etcd_disk_backend_commit_duration_seconds | Backend commit affects read performance and compaction | p99 > 25ms sustained |
| etcd_request_duration_seconds | API server’s client-side view of etcd latency | p99 > 100ms for writes |
| apiserver_request_duration_seconds (mutating verbs) | End-to-end latency of writes through the API server | p99 > 500ms sustained |
| etcd_server_leader_changes_seen_total | Leader elections cause brief write outages | Any increase in a stable cluster |
| etcd_mvcc_db_total_size_in_bytes | Approaching quota causes write rejection | > 50% of --quota-backend-bytes |
| apiserver_current_inflight_requests | Indicates API server saturation | > 80% of configured limit |
| apiserver_request_total{code="429"} | Confirms APF or inflight saturation | Sustained rate above zero |
| etcd_server_proposals_pending | Rising value means Raft cannot reach consensus fast enough | Value increasing over time |
| etcd_network_peer_round_trip_time_seconds | High peer RTT causes leader instability | p99 > 1ms between peers |
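Several signals in the table are counters, so "any increase" or "rate above zero" can only be judged by comparing two scrapes taken some time apart. The sketch below shows one way to do that without a monitoring stack, given two saved copies of the metrics output. The function names counter_sum and counter_delta are made up for this example, and the optional label filter is a plain substring match, not a real label selector.

```shell
# counter_sum METRIC [LABEL_FILTER]
# Sums every series of METRIC found on stdin; when LABEL_FILTER is given,
# only lines containing that substring are counted.
counter_sum() {
  awk -v metric="$1" -v filter="$2" '
    (index($0, metric "{") == 1 || index($0, metric " ") == 1) &&
    (filter == "" || index($0, filter) > 0) { sum += $NF }
    END { printf "%d\n", sum }
  '
}

# counter_delta BEFORE_FILE AFTER_FILE METRIC [LABEL_FILTER]
# Prints how much the counter grew between the two saved scrapes.
counter_delta() {
  before=$(counter_sum "$3" "$4" < "$1")
  after=$(counter_sum "$3" "$4" < "$2")
  echo $((after - before))
}
```

For example, after saving two scrapes a minute apart, counter_delta before.txt after.txt etcd_server_leader_changes_seen_total shows whether a leader election happened in that window, and counter_delta before.txt after.txt apiserver_request_total 'code="429"' shows how many requests were rejected. A counter reset, such as a process restart between scrapes, would produce a negative number.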
Fixes
If the cause is disk I/O saturation
Identify competing I/O workloads on the etcd host. If etcd is stacked with the API server or with logging agents, move etcd to dedicated SSD or NVMe storage. Do not run etcd on network-attached storage in production. If the disk is degraded and the affected member is the leader, transfer leadership to a healthy member with etcdctl move-leader, if one is available.
If the cause is database size or fragmentation
Compact old revisions with etcdctl compaction (requires a target revision; read the current revision from etcdctl endpoint status --write-out=json), then defragment one member at a time with etcdctl defrag (followers first, leader last). Defragmentation blocks the member while it runs and can cause latency spikes. After freeing space, disarm any active NOSPACE alarm with etcdctl alarm disarm. Consider increasing --quota-backend-bytes if the cluster legitimately needs more than 2GB.
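The ordering matters more than the individual commands, so it can help to write the plan out before running anything. The sketch below only prints the commands in the order described above; it does not execute them. The function name defrag_plan and its input format (one "ENDPOINT IS_LEADER" pair per line) are assumptions made for this example. The pairs would have to be taken from the IS LEADER column of etcdctl endpoint status, and the TLS flags shown in the quick checks are omitted for brevity.

```shell
# defrag_plan REVISION
# stdin: one "ENDPOINT IS_LEADER" pair per line, IS_LEADER being true or false.
# Prints the maintenance commands in a safe order: compaction first,
# then followers, then the leader, then the alarm reset.
defrag_plan() {
  rev="$1"
  echo "etcdctl compaction $rev"
  awk '
    $2 == "true" { leader[++nl] = $1; next }           # hold the leader back
    NF >= 2      { print "etcdctl --endpoints=" $1 " defrag" }
    END {
      for (i = 1; i <= nl; i++)
        print "etcdctl --endpoints=" leader[i] " defrag"
    }
  '
  echo "etcdctl alarm disarm"
}
```

Reviewing the printed plan and then running each line by hand, checking etcdctl endpoint health between members, keeps only one member blocked at a time.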
If the cause is periodic compaction or defragmentation
Compaction causes predictable latency spikes. Ensure the Kubernetes API server --etcd-compaction-interval and etcd’s own --auto-compaction-retention are aligned and not conflicting. Schedule defragmentation during maintenance windows, not during peak load.
If the cause is network latency or partition
Check etcd_network_peer_round_trip_time_seconds between members. If RTT is above 1ms, investigate the network path. Ensure etcd members are deployed with odd cardinality (3 or 5) so the cluster can tolerate member loss without losing quorum. If the API server cannot reach etcd, verify network policies, firewalls, and certificate validity on the etcd client paths.
Prevention
- Monitor etcd_disk_wal_fsync_duration_seconds with the same urgency as API server latency. Alert when p99 exceeds 10ms.
- Keep etcd database size below 50% of quota. Track the trend and schedule compaction before reaching 75%.
- Run etcd on dedicated local SSD or NVMe. Never share the disk with workloads, logging, or the API server if stacked.
- Alert on any etcd leader change in a stable cluster. Even one per hour indicates disk or network stress.
- Ensure client certificate rotation is working. Expired etcd client or peer certificates can appear as latency or connectivity failures.
- Size API server inflight limits and APF concurrency shares to leave headroom for bursts. Sustained utilization above 50% of inflight capacity should trigger capacity review.
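The database size rule above is simple arithmetic on one gauge. As a rough sketch, the function below reads etcd metrics on stdin and reports the database size as a percentage of quota, using the 50% and 75% levels from this list. The name db_quota_pct and the labels ok, watch, and compact are invented for this example, and the default quota is assumed to be etcd's 2GB default expressed as 2147483648 bytes; pass the real value if --quota-backend-bytes is set.

```shell
# db_quota_pct [QUOTA_BYTES]
# Reads etcd metrics on stdin and prints the database size as a
# percentage of quota, followed by a coarse level.
db_quota_pct() {
  awk -v quota="${1:-2147483648}" '
    $1 == "etcd_mvcc_db_total_size_in_bytes" { size = $2 + 0; found = 1 }
    END {
      if (!found) { print "metric-not-found"; exit 1 }
      pct = size * 100 / quota
      if (pct >= 75)      level = "compact"   # schedule compaction now
      else if (pct >= 50) level = "watch"     # above the target ceiling
      else                level = "ok"
      printf "%.1f%% %s\n", pct, level
    }
  '
}
```

It could be run as curl -s http://localhost:2379/metrics | db_quota_pct from a cron job or a pre-maintenance check.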
How Netdata helps
- Correlate etcd_disk_wal_fsync_duration_seconds with apiserver_request_duration_seconds on the same timeline to confirm the cascade.
- Track apiserver_current_inflight_requests and 429 rates alongside etcd metrics to watch saturation build before an outage.
- Monitor disk I/O wait, utilization, and queue depth on etcd nodes to distinguish disk saturation from application-level slowdown.
- Alert on etcd leader changes and database size trends.
Related guides
- Kubernetes API server slow or unresponsive: causes and fixes
- Kubernetes node NotReady: kubelet, runtime, and network diagnosis
- Kubernetes pod stuck Pending: scheduling failures explained
- Kubernetes pod CrashLoopBackOff: causes, diagnosis, and fixes
- Kubernetes monitoring checklist: the signals every production cluster needs
- Kubernetes node DiskPressure: detection, eviction, and recovery
flowchart TD
A[Slow disk I/O] --> B[etcd WAL fsync delay]
B --> C[Leader misses heartbeat]
C --> D[Raft leader election]
D --> E[Brief write unavailability]
E --> F[API server mutating requests timeout]
F --> G[Inflight requests accumulate]
G --> H[429 Too Many Requests]
H --> I[Controllers retry]
I --> J[Amplified write load on etcd]
J --> B





