Kafka enable.auto.commit data loss: committed offsets that outrun processing
Downstream systems are missing events, but your Kafka consumer group reports lag near zero and the group state is Stable. There are no broker errors, no rebalances, and no visible backpressure. The application crashed or was redeployed minutes ago, then caught up instantly on recovery. Messages are lost silently. This is the signature of enable.auto.commit=true: offsets are committed to __consumer_offsets on a fixed schedule, not after processing completes.
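When enable.auto.commit=true, the consumer commits offsets on a timer (auto.commit.interval.ms, 5 seconds by default), using the position reached by the most recent poll() rather than the progress of your processing code. The committed offset is the next record the consumer will read, not the last one it processed. If the application crashes after the commit fires but before processing completes, those messages are skipped on restart. How easily this happens depends on the client: the Java client performs the timed commit from inside poll(), so the gap opens when records are handed to worker threads or poll() is called again before the previous batch is finished, while librdkafka-based clients commit from a background thread. The broker only knows the committed offset, so standard lag metrics look healthy while your application silently drops data.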
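The arithmetic behind that failure can be sketched with a small simulation. This is an illustrative model only, not Kafka client code: `simulate_crash` and its parameters are hypothetical names, and the offsets mirror the example batch of 100-199 with a crash after offset 150.

```python
def simulate_crash(batch_start, batch_end, crash_after, commit_before_processing):
    """Model one poll batch that is interrupted by a crash.

    batch_start..batch_end are the inclusive offsets returned by one poll.
    crash_after is the last offset fully processed before the crash.
    Returns (restart_offset, lost_offsets, redelivered_offsets).
    """
    if commit_before_processing:
        # The commit fired while the batch was still in flight:
        # the committed offset is the NEXT record to read.
        committed = batch_end + 1
    else:
        # Nothing from this batch was committed before the crash.
        committed = batch_start

    restart_offset = committed
    lost = list(range(crash_after + 1, restart_offset))
    redelivered = list(range(restart_offset, crash_after + 1))
    return restart_offset, lost, redelivered


restart, lost, redelivered = simulate_crash(100, 199, 150, commit_before_processing=True)
print(restart, lost[0], lost[-1], len(lost))   # 200 151 199 49

restart, lost, redelivered = simulate_crash(100, 199, 150, commit_before_processing=False)
print(restart, len(lost), len(redelivered))    # 100 0 51
```

The second call shows the other side of the tradeoff: when the commit waits for processing, nothing is lost, but offsets 100-150 are delivered a second time.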
sequenceDiagram
participant App as Application
participant Client as Consumer client
participant Broker as Broker
Client->>Broker: poll() returns offsets 100-199
Broker-->>Client: record batch
Note over Client: auto.commit fires
Client->>Broker: commit offset 200
loop Process batch
Client->>App: deliver records
end
Note over App: Crash after offset 150
App--xClient:
Client->>Broker: restart, fetch from 200
Note over Broker: Lag = 0, offsets 151-199 lost
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Crash between auto-commit and processing completion | Missing records downstream; broker-reported lag is flat at zero; consumer group remains Stable | Application logs and orchestration events for restarts, OOM kills, or SIGKILL during the incident window |
| Slow processing exceeding auto.commit.interval.ms | Intermittent skipped offsets; consumer appears healthy but downstream watermark drifts ahead of processed data | End-to-end processing latency per poll batch versus the configured auto-commit interval |
| Reliance on default consumer configuration | Newly deployed services lose data silently; team assumed offsets commit after processing finishes | Consumer configuration for enable.auto.commit and auto.commit.interval.ms across all deployment layers |
Quick checks
# Check broker-reported lag and active members for the group.
# Look for CURRENT-OFFSET close to LOG-END-OFFSET while downstream reports gaps.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}
# Verify consumer client configuration for auto-commit settings.
# Check consumer.properties, environment variables, and framework overrides.
grep -E "enable.auto.commit|auto.commit.interval.ms" /path/to/consumer.properties
# Inspect application logs for crashes or restarts in the incident window.
journalctl -u consumer-service --since "30 minutes ago" | grep -iE "error|fatal|oom|sigterm"
# Check for container or process restarts that fall inside the incident window.
kubectl get pods -l app=consumer --sort-by=.status.startTime
# Sample committed offsets twice to observe automatic advance.
# Run this, wait 10 s, then run again. If CURRENT-OFFSET moves while the app is still processing, auto-commit is active.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}
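A grep that returns nothing does not mean auto-commit is off, because the Java consumer defaults to enable.auto.commit=true with a 5000 ms interval. The following is a minimal sketch of resolving the effective setting from a properties file. The function name is hypothetical, it only handles `key=value` lines, and it cannot see overrides applied in code or by a framework.

```python
def effective_auto_commit(properties_text):
    """Return (auto_commit_enabled, interval_ms) for Java-style properties text."""
    # Kafka Java consumer defaults, used when a key is absent from the file.
    settings = {"enable.auto.commit": "true", "auto.commit.interval.ms": "5000"}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() in settings:
            settings[key.strip()] = value.strip()
    enabled = settings["enable.auto.commit"].lower() == "true"
    return enabled, int(settings["auto.commit.interval.ms"])


print(effective_auto_commit("group.id=orders\n"))                         # (True, 5000)
print(effective_auto_commit("enable.auto.commit=false\ngroup.id=orders"))  # (False, 5000)
```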
How to diagnose it
- Confirm auto-commit is enabled. Check application configuration, environment variables, and framework defaults. Some stream processing frameworks override explicit properties or default to auto-commit even when not explicitly set.
- Collect broker-reported lag at the time of the incident. If lag was zero or near-zero while downstream systems report a data gap, the committed offset outran processing.
- Map missing record offsets to poll batches. The lost messages will be contiguous offsets immediately preceding the committed position. If your application logs the last successfully processed offset, compare it to the broker’s CURRENT-OFFSET for the partition. Any gap is the loss window.
- Check application logs and orchestration events. Look for crashes, restarts, or SIGTERM delivery between poll return and expected processing completion.
- Calculate processing latency per batch. If average batch processing time exceeds auto.commit.interval.ms, a commit can fire while records are still being processed, which opens the loss window.
- Reproduce in a test environment. Consume a test topic with enable.auto.commit=true, kill the consumer mid-batch, and observe skipped offsets on restart.
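The offset comparison and the latency check in the steps above reduce to simple arithmetic. A minimal sketch, assuming you have the last processed offset from application logs and the committed offset from the CURRENT-OFFSET column; both function names are hypothetical, and the 5000 ms default is the Java client's.

```python
def loss_window(last_processed, committed):
    """Return the inclusive (first, last) offsets committed but never processed, or None."""
    first_unprocessed = last_processed + 1
    if committed > first_unprocessed:
        return first_unprocessed, committed - 1
    return None  # the commit did not outrun processing


def commit_can_outrun(batch_seconds, auto_commit_interval_ms=5000):
    """True if a batch takes longer to process than the auto-commit interval."""
    return batch_seconds * 1000 > auto_commit_interval_ms


print(loss_window(last_processed=150, committed=200))  # (151, 199)
print(loss_window(last_processed=199, committed=200))  # None
print(commit_can_outrun(batch_seconds=8.0))            # True
```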
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group lag | Reflects committed offset against log-end offset, not processed offset | Lag near zero while downstream reports missing data |
| Application processing watermark | Tracks the last offset successfully handled by your code, independent of broker commits | Broker-reported lag is near zero but application watermark lags behind |
| Consumer group state | Rebalances and member loss expose the commit window | Unexpected transitions to Empty or rapid rebalance cycles |
| Consumer rebalance rate | Repeated rebalances increase the chance of commits during unstable processing | More than 2-3 rebalances per hour outside deployments |
| Broker bytes out per second | Confirms the consumer was actively fetching before failure | Sudden drop correlated with application restarts |
| Application restart rate | Each restart is a potential commit-to-processing gap | Restarts occurring within auto.commit.interval.ms windows |
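The application processing watermark has to be produced by your own code, since the broker cannot see it. One possible shape is sketched below, assuming records may finish out of order (for example on worker threads). The class and method names are illustrative and do not belong to any Kafka client.

```python
class ProcessingWatermark:
    """Tracks, per partition, the next offset that has NOT been processed yet."""

    def __init__(self, start_offsets):
        # start_offsets: {partition: first offset this consumer will process}
        self._next = dict(start_offsets)
        self._pending = {p: set() for p in start_offsets}

    def mark_processed(self, partition, offset):
        if offset < self._next[partition]:
            return  # duplicate delivery, already counted
        self._pending[partition].add(offset)
        # Advance only across a contiguous run of completed offsets.
        while self._next[partition] in self._pending[partition]:
            self._pending[partition].remove(self._next[partition])
            self._next[partition] += 1

    def next_offset(self, partition):
        return self._next[partition]

    def committed_but_unprocessed(self, partition, committed_offset):
        """Number of records the broker considers consumed that are not processed."""
        return max(0, committed_offset - self._next[partition])


wm = ProcessingWatermark({0: 100})
wm.mark_processed(0, 100)
wm.mark_processed(0, 102)        # 101 is still in flight, watermark stays at 101
print(wm.next_offset(0))         # 101
wm.mark_processed(0, 101)
print(wm.next_offset(0))         # 103
print(wm.committed_but_unprocessed(0, committed_offset=200))  # 97
```

Exporting `committed_but_unprocessed` as a gauge gives you the warning sign from the table: a value above zero while broker-reported lag is near zero.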
Fixes
Manual commits after processing
Set enable.auto.commit=false. Call commitSync() or commitAsync() only after the application has successfully processed the batch. This couples commit timing to completion and eliminates the loss window. The tradeoff is at-least-once delivery: if the commit fails or the consumer crashes before committing, records are re-delivered on restart. Your application or downstream system must handle duplicates.
Use commitSync() when you need a clear failure signal and can tolerate the blocking round-trip. Use commitAsync() with a failure callback when throughput is critical, but handle errors to avoid silent commit loss.
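The process-then-commit ordering can be sketched without a broker. `InMemoryConsumer` below is a hypothetical single-partition stand-in, not a real client API; its `commit` plays the role of commitSync() with auto-commit disabled.

```python
class InMemoryConsumer:
    """Hypothetical stand-in for a consumer on one partition (not a real client)."""

    def __init__(self, log, committed=0, max_poll_records=100):
        self.log = log
        self.committed = committed
        self.position = committed      # a restart resumes from the committed offset
        self.max_poll_records = max_poll_records

    def poll(self):
        end = min(self.position + self.max_poll_records, len(self.log))
        batch = [(offset, self.log[offset]) for offset in range(self.position, end)]
        self.position = end
        return batch

    def commit(self, next_offset):
        self.committed = next_offset


def consume_once(consumer, handle):
    """Poll one batch, process every record, and only then commit."""
    batch = consumer.poll()
    for offset, value in batch:
        handle(offset, value)          # if this raises, nothing is committed
    if batch:
        consumer.commit(batch[-1][0] + 1)
    return len(batch)


log = [f"event-{i}" for i in range(10)]
seen = []

def crashing_handler(offset, value):
    if offset == 5:
        raise RuntimeError("simulated crash")
    seen.append(offset)

first = InMemoryConsumer(log)
try:
    consume_once(first, crashing_handler)
except RuntimeError:
    pass

restarted = InMemoryConsumer(log, committed=first.committed)
consume_once(restarted, lambda offset, value: seen.append(offset))
print(first.committed, restarted.committed)  # 0 10
print(seen)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

No offset is skipped, but offsets 0-4 are handled twice, which is the at-least-once tradeoff described above.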
Idempotent processing
If you cannot switch to manual commits (for example, in a legacy consumer or a framework that abstracts offset management), make re-processing the same record produce the same downstream state. Use natural keys, conditional writes, or upsert semantics so that a re-delivered record overwrites identical state instead of creating a second copy. This does not close the loss window or recover skipped records; it only ensures that re-delivery, when it happens, does not corrupt downstream data.
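As one illustration of upsert semantics, the sketch below writes events to SQLite keyed on a natural key, so a duplicate delivery replaces the existing row instead of adding one. The table layout, the `event_id` key, and the `apply_event` helper are assumptions made for the example; your sink and key will differ.

```python
import sqlite3


def apply_event(conn, event_id, payload):
    """Write an event idempotently: the same event_id always maps to one row."""
    conn.execute(
        "INSERT OR REPLACE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL)")

# "order-1" arrives twice, as it would after a re-delivery.
for event_id, payload in [("order-1", "paid"), ("order-2", "paid"), ("order-1", "paid")]:
    apply_event(conn, event_id, payload)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2
```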
Shrink the exposure window
Lowering max.poll.records so batches finish faster reduces the likelihood that a crash lands inside the commit interval, but it does not eliminate the risk. The only way to close the window is to tie commit timing to processing completion.
Prevention
- Set enable.auto.commit=false for any consumer where message loss is unacceptable.
- Commit offsets only after successful processing, never before.
- Monitor application-level processing watermarks or lag, not only broker-reported lag.
- Test consumer failure injection in staging: kill a consumer mid-batch and verify whether records are re-delivered or lost.
- Document your consumer’s delivery guarantee in runbooks so on-call engineers know whether a restart can cause gaps.
How Netdata helps
- Netdata collects broker-side consumer group lag and state, so you can compare them against application-level signals. Low lag alone does not mean the data was fully processed.
- Correlate consumer group state transitions with system-level events such as OOM kills or container restarts. If a restart falls inside an auto-commit interval, the loss window is open.
- Monitor broker BytesOutPerSec and consumer fetch latency alongside application health to detect consumers that fetch but do not finish processing.
- Track consumer group member count drops to spot ungraceful exits that can precede silent offset jumps.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation
- Kafka leadership imbalance: LeaderCount skew and preferred replica election







