Kafka OffsetOutOfRangeException: when retention deletes data before the consumer reads it

OffsetOutOfRangeException means the consumer requested an offset the broker does not hold. In the case covered here, retention deleted the log segment containing the consumer’s committed position before the consumer caught up. This is not a transient fetch error; the deleted records are gone, and what happens next depends entirely on auto.offset.reset. Many clients, including the Java client, default to latest, which turns this exception into silent skipping.

This is a lag problem disguised as a fetch error. The consumer was too slow, paused too long, or was offline longer than the topic’s retention.ms. Once the log start offset moves past the committed offset, every fetch from that position fails until the position is reset.

What this means

Kafka maintains a log start offset per partition. A segment becomes eligible for deletion when its newest record is older than retention.ms, or when the partition’s total size exceeds retention.bytes, and it is always deleted whole. The retention checker runs every log.retention.check.interval.ms (default 5 minutes). Because Kafka deletes entire segments, the log start offset advances in segment-sized jumps rather than record by record.

A Fetch request for an offset below the current log start offset returns the OFFSET_OUT_OF_RANGE error. The Java client surfaces this as OffsetOutOfRangeException; other clients may name it differently.

When the consumer sees this error, the next poll() applies auto.offset.reset:

  • earliest: jump to the current log start offset. The deleted records are still lost, but nothing that remains on the broker is skipped.
  • latest: jump to the log end offset. This silently skips the deleted data.
  • none: the exception propagates and the consumer stops.
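
The three policies can be modelled as a small pure function. This is an illustrative sketch of the decision described above, not the client's actual implementation; the names resolve_position, records_skipped, and OffsetOutOfRange are hypothetical.

```python
class OffsetOutOfRange(Exception):
    """Stand-in for the client's out-of-range error (hypothetical name)."""


def resolve_position(committed, log_start, log_end, policy):
    """Return the offset the consumer would read from next.

    committed: last committed offset for the partition
    log_start: oldest offset the broker still holds
    log_end:   offset the next produced record will receive
    policy:    value of auto.offset.reset
    """
    if log_start <= committed <= log_end:
        return committed  # still in range, no reset needed
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    if policy == "none":
        raise OffsetOutOfRange(
            f"offset {committed} outside [{log_start}, {log_end}]"
        )
    raise ValueError(f"unknown policy: {policy}")


def records_skipped(committed, new_position):
    """Offsets between the committed offset and the new position are never read."""
    return max(0, new_position - committed)
```

For example, with a committed offset of 100, a log start offset of 500, and a log end offset of 900, earliest resumes at 500 and latest resumes at 900. In both cases the offsets between 100 and the new position are never delivered to the consumer.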

The same error hits follower replicas whose replication lag exceeds the leader’s retention window. A follower whose log end is behind the leader’s log start offset cannot continue from its own position; it has to discard its stale log and restart replication from the leader’s log start offset, and it stays out of the ISR until it catches up. That produces UnderReplicatedPartitions independently of any consumer error.

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Consumer throughput is lower than producer throughput | Lag grows monotonically; exception hits during traffic peaks | Consumer group lag against production rate |
| Consumer group rebalance storm | Group state cycles between Stable and PreparingRebalance; lag grows during cycles | Group state and rebalance rate |
| Follower replica lag exceeds leader retention | UnderReplicatedPartitions rises on the leader; follower fetch threads log errors | ReplicaFetcherManager lag on the follower broker |
| Consumer restarted after extended downtime | Committed offset is far behind; segments deleted while consumer was offline | Gap between committed offset and LogStartOffset |

Quick checks

Run these read-only checks to confirm the failure and measure the gap.

# Identify the partition leader
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic "$TOPIC"
# Consumer lag and committed offsets
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "$GROUP_ID"
# Retention and cleanup policy
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type topics --entity-name "$TOPIC"
# Log start offset for the partition. Run against the broker that hosts the leader.
echo "get -b kafka.log:type=Log,name=LogStartOffset,topic=$TOPIC,partition=$PARTITION Value" | java -jar jmxterm.jar -l "$LEADER_HOST:9999"
# Broker-wide failed fetch rate
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Group state to detect rebalance loops
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "$GROUP_ID" --state
# Follower replica lag if a broker follower is affected
echo "get -b kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Value" | java -jar jmxterm.jar -l "$FOLLOWER_HOST:9999"

How to diagnose it

  1. Map the exception to a partition and offset. The consumer log or client metric names the partition and the offset it attempted to fetch.
  2. Compare the committed offset to the log start offset. Use the LogStartOffset MBean on the partition leader. If the committed offset is lower, retention deleted the segment. You can also inspect the lowest segment file base offset in the partition directory under the broker’s configured log.dirs.
  3. Calculate lag as time. Approximate lag in seconds as lag_offsets / produce_rate_per_sec, then compare it with retention.ms converted to seconds (retention.ms / 1000). If the lag approaches or exceeds that value, the consumer is inside the deletion window.
  4. Check for a rebalance lag spike. kafka-consumer-groups.sh --describe --state shows whether the group was stuck in PreparingRebalance or CompletingRebalance while data accumulated.
  5. Confirm broker-side fetch failures. A sustained spike in FailedFetchRequestsPerSec correlates with consumers or followers hitting the out-of-range condition.
  6. Distinguish consumer lag from replica lag. If the affected client is a follower broker, examine Replica Max Lag and the follower’s disk I/O and network metrics. Follower fetch failures surface as UnderReplicatedPartitions, not consumer exceptions.
flowchart TD
    A[Growing consumer lag] --> B[Lag exceeds retention.ms]
    B --> C[Log segment deleted]
    C --> D[Consumer fetches stale offset]
    D --> E[Broker returns OFFSET_OUT_OF_RANGE]
    E --> F{auto.offset.reset}
    F -->|earliest| G[Re-read from log start]
    F -->|latest| H[Skip data silently]
    F -->|none| I[Exception halts consumer]
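
Steps 2 and 3 of the diagnosis reduce to a few lines of arithmetic. The sketch below assumes you have already collected the committed offset, log start offset, log end offset, and an average production rate for one partition; the function names and the returned dictionary keys are hypothetical.

```python
def lag_seconds(lag_offsets, produce_rate_per_sec):
    """Approximate how far behind the consumer is, in seconds of production."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return lag_offsets / produce_rate_per_sec


def retention_exposure(committed, log_start, log_end,
                       produce_rate_per_sec, retention_ms):
    """Summarise steps 2 and 3 for a single partition."""
    lag_s = lag_seconds(max(0, log_end - committed), produce_rate_per_sec)
    return {
        # Step 2: committed offset below log start means the data is gone.
        "data_deleted": committed < log_start,
        # Step 3: lag expressed as time.
        "lag_seconds": lag_s,
        # retention.ms is in milliseconds, so convert before comparing.
        "fraction_of_retention": lag_s / (retention_ms / 1000.0),
    }
```

As a worked example, a partition with a committed offset of 1,000,000, a log end offset of 4,600,000, and a production rate of 50 records per second is 72,000 seconds (20 hours) behind. Against a 24-hour retention.ms of 86,400,000, that is roughly 83% of the retention window.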

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Consumer group lag as time | Absolute offset lag is meaningless without production rate; time-based lag reveals true exposure to retention | Lag-as-time exceeds 50% of retention.ms |
| Failed fetch requests per second | Includes OffsetOutOfRange errors from consumers and followers | Sustained nonzero rate outside leader elections |
| Log start offset vs committed offset | Measures the safety margin between the consumer and the deletion boundary | Committed offset within 10% of log start offset |
| Consumer group state | Rebalances pause consumption and allow lag to accumulate | Stuck in PreparingRebalance longer than 5 minutes |
| Replica max lag | Follower lag beyond leader retention triggers ISR shrinks and fetch failures | Lag nonzero and growing for more than 10 minutes |
| UnderReplicatedPartitions | Indicates follower replication is stalled, often from the same lag mechanism | Nonzero and growing across leaders |

Fixes

Scale consumer throughput

Add consumer instances up to the partition count, or reduce per-record processing time. If the partition count is too low to parallelize further, increase it. Be aware that more partitions increase file descriptor usage and controller metadata overhead. This is the correct fix when the application is genuinely slower than the producer.

Raise retention

Increase retention.ms or retention.bytes to give consumers more runway. Apply topic-level changes with kafka-configs.sh --alter, or broker-level changes in server.properties followed by a rolling restart. Topic-level overrides take precedence and apply dynamically. Before raising retention, verify disk headroom. Retention drives steady-state disk usage per broker, assuming partitions are spread evenly:

( bytes_in_per_sec * retention_seconds * replication_factor ) / number_of_brokers

As a rule of thumb, keep at least 15-20% of each volume free to absorb compaction bursts and partition reassignment. Do not raise retention without confirming the space exists.
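
The sizing formula and the free-space check can be combined into a quick calculation. This is a rough sketch that assumes partitions are evenly distributed, each broker has a single data volume, and index files and compression are ignored; the function names and the 20% default are illustrative.

```python
def per_broker_bytes(bytes_in_per_sec, retention_seconds,
                     replication_factor, number_of_brokers):
    """Steady-state bytes retained on each broker, per the formula above."""
    return (bytes_in_per_sec * retention_seconds * replication_factor
            / number_of_brokers)


def fits_with_headroom(required_bytes, volume_bytes, min_free_fraction=0.20):
    """True if the retained data leaves at least min_free_fraction of the volume free."""
    return required_bytes <= volume_bytes * (1 - min_free_fraction)
```

With 10 MB/s of ingress, 7 days of retention (604,800 seconds), a replication factor of 3, and 6 brokers, each broker holds about 3.02 TB. That fits on a 4 TB volume with 20% kept free, but doubling retention to 14 days (about 6.05 TB per broker) does not.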

Fix rebalance storms

If lag spikes correlate with rebalances, reduce max.poll.records, increase max.poll.interval.ms, or switch to the cooperative sticky assignor:

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

While the group rebalances, consumption pauses and lag grows at the full production rate.

Reset offsets manually

If auto.offset.reset=none halted the consumer and the data is gone, reset the group to a valid offset manually. This is destructive and will skip or reprocess data. Stop all consumers in the group first, because the tool only resets offsets for an inactive group, then preview the result with --dry-run:

kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group "$GROUP_ID" \
  --topic "$TOPIC:$PARTITION" \
  --reset-offsets --to-earliest \
  --dry-run

Replace --dry-run with --execute only after confirming the segments are deleted and the target offset is acceptable. Without --execute, the tool does not change the committed offsets.

Recover a lagging follower

If a follower broker is hitting OFFSET_OUT_OF_RANGE on replication, check that broker’s disk I/O (iostat -x, look at r_await/w_await) and network metrics. If the disk is degraded, replacing the broker may be faster than waiting for a full log rebuild.

Prevention

Alert on lag-as-time, not absolute offsets. A lag of one million offsets can be ten seconds or ten hours depending on throughput. Convert lag to seconds and alert when it exceeds a fraction of retention.ms.

Monitor committed offset vs log start offset. When the gap shrinks, the consumer is in the danger zone regardless of absolute lag.
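
One way to watch that gap is to sample both offsets periodically and track the margin between them. The sketch below is a hypothetical check, not a Kafka or Netdata API; the rule of treating three or more strictly shrinking samples as dangerous is an assumption you should tune.

```python
def margins(samples):
    """samples: list of (committed_offset, log_start_offset) in time order."""
    return [committed - log_start for committed, log_start in samples]


def in_danger(samples, min_margin=0):
    """True when the newest margin is at or below min_margin, or when the
    margin has shrunk on every consecutive sample (three or more samples)."""
    m = margins(samples)
    if not m:
        return False
    if m[-1] <= min_margin:
        return True
    shrinking = all(later < earlier for earlier, later in zip(m, m[1:]))
    return len(m) >= 3 and shrinking
```

A consumer whose samples are (1000, 100), (1100, 400), (1150, 800) has margins of 900, 700, and 350: it is still moving forward, but retention is gaining on it.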

Set auto.offset.reset=none for critical consumers. This forces a hard stop rather than silent skipping. It is painful, but it prevents a consumer from pretending data was processed after jumping past a retention gap.

Test consumer recovery time. Measure how long a group takes to restart, rebalance, and catch up after a total shutdown. Keep retention.ms at least 2-3 times the worst-case recovery time.
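
If you cannot run a full shutdown test, a rough estimate is a starting point. The sketch below assumes constant production and consumption rates and treats downtime as including restart and rebalance time; the function names and the default safety factor of 3 are illustrative, taken from the upper end of the 2-3 times guidance above.

```python
def catch_up_seconds(downtime_s, produce_rate, consume_rate):
    """Time to drain the backlog built during downtime while producers keep writing."""
    if consume_rate <= produce_rate:
        raise ValueError("consume_rate must exceed produce_rate to catch up")
    backlog = produce_rate * downtime_s
    # The backlog only shrinks at the rate the consumer outpaces the producer.
    return backlog / (consume_rate - produce_rate)


def min_retention_ms(downtime_s, produce_rate, consume_rate, safety_factor=3):
    """Suggested lower bound for retention.ms, in milliseconds."""
    recovery_s = downtime_s + catch_up_seconds(downtime_s, produce_rate, consume_rate)
    return int(recovery_s * safety_factor * 1000)
```

For a one-hour outage with producers at 1,000 records per second and consumers at 1,500 records per second, catching up takes another two hours, so worst-case recovery is three hours and the suggested retention.ms is 32,400,000 (nine hours). Measured recovery time should replace this estimate once you have it.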

Watch page cache pressure during backfills. A sudden backfill consumer can thrash the page cache, spike FetchConsumer LocalTimeMs, and indirectly slow tail consumers until they fall behind retention. Track pgmajfault from /proc/vmstat.

How Netdata helps

  • Correlates consumer lag growth with broker FetchConsumer LocalTimeMs to separate a slow application from broker read latency caused by page cache misses.
  • Surfaces FailedFetchRequestsPerSec alongside consumer group lag in the same view, linking the broker error rate to the client exception.
  • Calculates lag-as-time by combining offset lag with per-topic production rate, exposing when consumers are inside the retention danger window.
  • Alerts on consumer lag growth rate rather than static thresholds, catching consumers that are falling behind before they cross retention.ms.
  • Tracks OS-level page cache pressure (pgmajfault rate) to identify backfill or cache thrashing that indirectly causes lag spikes.