Kafka enable.auto.commit data loss: committed offsets that outrun processing

Downstream systems are missing events, but your Kafka consumer group reports lag near zero and the group state is Stable. There are no broker errors, no rebalances, and no visible backpressure. The application crashed or was redeployed minutes ago, then caught up instantly on recovery. Messages are lost silently. This is the signature of enable.auto.commit=true: offsets are committed to __consumer_offsets on a fixed schedule, not after processing completes.

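When enable.auto.commit=true, the consumer commits its current position every auto.commit.interval.ms (5 seconds by default in the Java client), whether or not the application has finished with the records it received. The committed offset is the next record the consumer will read, not the last one it processed. In the Java client the commit is issued from inside poll() (and on close or rebalance) and covers the records returned by earlier polls, so the gap opens when records are handed to worker threads, buffered, or processed asynchronously while the poll loop keeps running. Other clients, such as those built on librdkafka, commit from a background thread. If the application crashes after the commit fires but before processing completes, those messages are skipped on restart. The broker only knows the committed offset, so standard lag metrics look healthy while your application silently drops data.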

sequenceDiagram
    participant App as Application
    participant Client as Consumer client
    participant Broker as Broker

    Client->>Broker: poll() returns offsets 100-199
    Broker-->>Client: record batch
    Note over Client: auto.commit fires
    Client->>Broker: commit offset 200
    loop Process batch
        Client->>App: deliver records
    end
    Note over App: Crash after offset 150
    App--xClient: 
    Client->>Broker: restart, fetch from 200
    Note over Broker: Lag = 0<br/>Offsets 151-199 lost
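
The sequence in the diagram can be replayed with a small in-memory simulation. This is a sketch only: SimulatedGroup and run_until_crash are hypothetical helpers, no Kafka client is involved, and the offsets are the ones used in the diagram.

```python
from dataclasses import dataclass


@dataclass
class SimulatedGroup:
    """Stands in for the broker's view of one partition (hypothetical helper)."""
    log_end_offset: int
    committed: int


def run_until_crash(group, batch_size, crash_after_offset):
    """Poll one batch, commit its end position, then crash mid-batch.

    Returns the offsets the application actually finished processing.
    """
    start = group.committed
    end = min(start + batch_size, group.log_end_offset)  # poll() returns [start, end)
    group.committed = end  # the commit fires before processing has finished
    processed = []
    for offset in range(start, end):
        processed.append(offset)
        if offset == crash_after_offset:
            break  # the process dies right after this record
    return processed


group = SimulatedGroup(log_end_offset=200, committed=100)
processed = run_until_crash(group, batch_size=100, crash_after_offset=150)

# On restart the consumer resumes from the committed offset, not from `processed`.
lost = [o for o in range(100, 200) if o not in processed]
lag = group.log_end_offset - group.committed
print(f"restart position={group.committed} lag={lag} lost={lost[0]}-{lost[-1]}")
```

Under these assumptions the restarted consumer resumes at 200 with a lag of 0, while offsets 151 through 199 were never processed.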

Common causes

Cause | What it looks like | First thing to check
Crash between auto-commit and processing completion | Missing records downstream; broker-reported lag is flat at zero; consumer group remains Stable | Application logs and orchestration events for restarts, OOM kills, or SIGKILL during the incident window
Slow processing exceeding auto.commit.interval.ms | Intermittent skipped offsets; consumer appears healthy but the committed offset drifts ahead of processed data | End-to-end processing latency per poll batch versus the configured auto-commit interval
Reliance on default consumer configuration | Newly deployed services lose data silently; team assumed offsets commit after processing finishes | Consumer configuration for enable.auto.commit and auto.commit.interval.ms across all deployment layers

Quick checks

# Check broker-reported lag and active members for the group.
# Look for CURRENT-OFFSET close to LOG-END-OFFSET while downstream reports gaps.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}

# Verify consumer client configuration for auto-commit settings.
# Check consumer.properties, environment variables, and framework overrides.
grep -E "enable.auto.commit|auto.commit.interval.ms" /path/to/consumer.properties

# Inspect application logs for crashes or restarts in the incident window.
journalctl -u consumer-service --since "30 minutes ago" | grep -iE "error|fatal|oom|sigterm"

# Check for container or process restarts that could have landed inside the commit window.
kubectl get pods -l app=consumer --sort-by=.status.startTime

# Sample committed offsets twice to observe automatic advance.
# Run this, wait 10 s, then run again. If CURRENT-OFFSET advances past offsets the app has not finished processing, commits are outrunning processing.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group {group-id}

How to diagnose it

  1. Confirm auto-commit is enabled. Check application configuration, environment variables, and framework defaults. enable.auto.commit defaults to true in the Java client, so a missing setting means it is on. Some stream processing frameworks also override explicit properties.
  2. Collect broker-reported lag at the time of the incident. If lag was zero or near-zero while downstream systems report a data gap, the committed offset outran processing.
  3. Map missing record offsets to poll batches. The lost messages will be contiguous offsets immediately preceding the committed position. If your application logs the last successfully processed offset, compare it to the broker’s CURRENT-OFFSET for the partition. Any gap is the loss window.
  4. Check application logs and orchestration events. Look for crashes, restarts, or SIGTERM delivery between poll return and expected processing completion.
  5. Calculate processing latency per batch. If batch processing time regularly exceeds auto.commit.interval.ms, a commit can fire while records from that batch are still in flight, which opens the loss window.
  6. Reproduce in a test environment. Consume a test topic with enable.auto.commit=true, kill the consumer mid-batch, and observe skipped offsets on restart.
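
Steps 3 and 5 reduce to simple arithmetic. The sketch below assumes the application logs the last offset it finished processing; loss_window and commit_can_outrun are hypothetical helper names, not part of any Kafka tooling.

```python
def loss_window(last_processed_offset, committed_offset):
    """Return the inclusive (first, last) offsets that were committed but never
    processed, or None when the committed offset has not outrun processing."""
    first_lost = last_processed_offset + 1
    if committed_offset <= first_lost:
        return None  # nothing skipped; at worst some records are re-delivered
    return (first_lost, committed_offset - 1)


def commit_can_outrun(batch_seconds, auto_commit_interval_ms):
    """True when one batch takes longer to process than the auto-commit interval."""
    return batch_seconds * 1000 > auto_commit_interval_ms


# Example: the app logged offset 150 as done, the broker shows CURRENT-OFFSET 200.
print(loss_window(150, 200))
# Example: batches take about 8 s against a 5000 ms commit interval.
print(commit_can_outrun(8.0, 5000))
```

A non-empty window together with a batch time above the interval is consistent with this failure mode, but it is not proof on its own; confirm against the restart timeline from step 4.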

Metrics and signals to monitor

Signal | Why it matters | Warning sign
Consumer group lag | Reflects committed offset against log-end offset, not processed offset | Lag near zero while downstream reports missing data
Application processing watermark | Tracks the last offset successfully handled by your code, independent of broker commits | Broker-reported lag is near zero but application watermark lags behind
Consumer group state | Rebalances and member loss expose the commit window | Unexpected transitions to Empty or rapid rebalance cycles
Consumer rebalance rate | Repeated rebalances increase the chance of commits during unstable processing | More than 2-3 rebalances per hour outside deployments
Broker bytes out per second | Confirms the consumer was actively fetching before failure | Sudden drop correlated with application restarts
Application restart rate | Each restart is a potential commit-to-processing gap | Restarts occurring within auto.commit.interval.ms windows
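
An application processing watermark can be as small as a per-partition counter that your handler updates after each record. The sketch below is one possible shape; ProcessingWatermark and its method names are hypothetical, and exporting the value to a metrics system is left out.

```python
class ProcessingWatermark:
    """Tracks, per partition, the next offset the application still has to process."""

    def __init__(self):
        self._next_offset = {}

    def mark_processed(self, partition, offset):
        """Call after the record at `offset` has been fully handled."""
        current = self._next_offset.get(partition, 0)
        self._next_offset[partition] = max(current, offset + 1)

    def commit_lead(self, partition, committed_offset):
        """Number of records the committed offset is ahead of the application.

        Returns 0 when nothing has been recorded for the partition yet.
        """
        next_offset = self._next_offset.get(partition, committed_offset)
        return max(0, committed_offset - next_offset)


watermark = ProcessingWatermark()
for offset in range(100, 151):
    watermark.mark_processed(partition=0, offset=offset)

# Committed offset as reported by the broker for the same partition.
print(watermark.commit_lead(partition=0, committed_offset=200))
```

A commit lead that stays above zero means the broker's lag figure is ahead of what the application has actually handled.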

Fixes

Manual commits after processing

Set enable.auto.commit=false. Call commitSync() or commitAsync() only after the application has successfully processed the batch. This couples commit timing to completion and eliminates the loss window. The tradeoff is at-least-once delivery: if the commit fails or the consumer crashes before committing, records are re-delivered on restart. Your application or downstream system must handle duplicates.

Use commitSync() when you need a clear failure signal and can tolerate the blocking round-trip. Use commitAsync() with a failure callback when throughput is critical, but handle errors to avoid silent commit loss.
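
The commit-after-processing pattern and its at-least-once tradeoff can be illustrated without a broker. FakeConsumer below is a hypothetical in-memory stand-in, not a real client API; in a real consumer the commit call would be commitSync() or commitAsync().

```python
class FakeConsumer:
    """In-memory stand-in for a consumer with auto-commit disabled (hypothetical)."""

    def __init__(self, log_end_offset, committed):
        self.log_end_offset = log_end_offset
        self.committed = committed
        self.position = committed  # a restarted consumer resumes at the committed offset

    def poll(self, max_records):
        end = min(self.position + max_records, self.log_end_offset)
        batch = list(range(self.position, end))
        self.position = end
        return batch

    def commit(self, next_offset):
        self.committed = next_offset


def consume_batch(consumer, handle, max_records=100):
    batch = consumer.poll(max_records)
    for offset in batch:
        handle(offset)  # may raise, which simulates a crash
    if batch:
        consumer.commit(batch[-1] + 1)  # commit only after the whole batch succeeded
    return batch


def make_handler(seen, crash_at=None):
    def handle(offset):
        if offset == crash_at:
            raise RuntimeError("simulated crash")
        seen.append(offset)
    return handle


seen = []
first = FakeConsumer(log_end_offset=200, committed=100)
try:
    consume_batch(first, make_handler(seen, crash_at=151))
except RuntimeError:
    pass  # the process died before committing anything

restarted = FakeConsumer(log_end_offset=200, committed=first.committed)
consume_batch(restarted, make_handler(seen))

missing = sorted(set(range(100, 200)) - set(seen))
duplicates = len(seen) - len(set(seen))
print(f"missing={missing} duplicates={duplicates} committed={restarted.committed}")
```

In this run nothing is skipped, but offsets 100 through 150 are handled twice, which is why the handler or the downstream system has to tolerate duplicates.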

Idempotent processing

If you cannot switch to manual commits (for example, in a legacy consumer or a framework that abstracts offset management), make re-processing the same record produce the same downstream state. Use natural keys, conditional writes, or upsert semantics so that a re-delivered record overwrites identical state instead of creating a second copy. This does not prevent the loss window, but it prevents data corruption when a restart ends in re-delivery rather than a skip.
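
One way to get upsert semantics is to key each write on a natural identifier carried by the record. The sketch below uses an in-memory SQLite table; the payments table, its columns, and the apply_event helper are assumptions made for illustration.

```python
import sqlite3


def apply_event(conn, payment_id, account, amount):
    """Write one event keyed by its natural ID; replaying it leaves the same row."""
    conn.execute(
        "INSERT OR REPLACE INTO payments (payment_id, account, amount) VALUES (?, ?, ?)",
        (payment_id, account, amount),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, account TEXT, amount INTEGER)"
)

# The same record delivered twice, as happens after a restart with at-least-once delivery.
apply_event(conn, "pay-42", "acct-1", 500)
apply_event(conn, "pay-42", "acct-1", 500)

rows = conn.execute("SELECT payment_id, account, amount FROM payments").fetchall()
print(rows)
```

The second delivery replaces the row with identical values, so the table ends up with a single entry for the payment.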

Shrink the exposure window

Lowering max.poll.records so batches finish faster reduces the likelihood that a crash lands inside the commit interval, but it does not eliminate the risk. The only way to close the window is to tie commit timing to processing completion.

Prevention

  • Set enable.auto.commit=false for any consumer where message loss is unacceptable.
  • Commit offsets only after successful processing, never before.
  • Monitor application-level processing watermarks or lag, not only broker-reported lag.
  • Test consumer failure injection in staging: kill a consumer mid-batch and verify whether records are re-delivered or lost.
  • Document your consumer’s delivery guarantee in runbooks so on-call engineers know whether a restart can cause gaps.

How Netdata helps

  • Netdata collects broker-side consumer group lag and state, so you can check low lag against your application's processing watermark instead of assuming it means the data was fully processed.
  • Correlate consumer group state transitions with system-level events such as OOM kills or container restarts. If a restart falls inside an auto-commit interval, the loss window is open.
  • Monitor broker BytesOutPerSec and consumer fetch latency alongside application health to detect consumers that fetch but do not finish processing.
  • Track consumer group member count drops to spot ungraceful exits that can precede silent offset jumps.