Kafka retention not deleting old segments: retention.ms, retention.bytes, and the active segment
You set retention.ms to 24 hours, but broker disk keeps climbing. A partition shows segment files older than the threshold, or the active segment has grown so large it consumes most of the volume. A topic with both retention.ms and retention.bytes may still appear to ignore them.
Kafka retention is not a continuous sweep. retention.ms and retention.bytes apply independently to closed segments, and the check interval adds latency. The active segment is never deleted by retention alone. On compacted topics, retention.bytes is ignored. Per-topic overrides shadow broker defaults.
This guide covers segment deletion mechanics and read-only checks to find the cause.
What this means
Kafka reclaims disk through a periodic retention check that runs every log.retention.check.interval.ms (default 5 minutes). During each run, the broker evaluates closed log segments against two independent policies:
- Time-based: retention.ms (or log.retention.hours). A closed segment is eligible when the age of the largest timestamp among its records exceeds the threshold.
- Size-based: retention.bytes (or log.retention.bytes). When the total size of all segments in the partition exceeds the threshold, closed segments become eligible starting from the oldest.
These policies operate as an OR. Three rules exempt segments:
- The active segment, the segment currently receiving writes, is never deleted by retention. It rolls when segment.bytes, segment.ms, or the index file size limit is reached. Each offset-index entry is 8 bytes and covers roughly log.index.interval.bytes (default 4 KiB) of log data, so a 10 MiB index limits a segment to about 5 GiB. Once rolled, the segment becomes eligible in the next check cycle.
- After a segment is marked for deletion, file.delete.delay.ms (default 1 minute) must pass before the OS removes the file. Disk space is not freed instantly.
- For topics with cleanup.policy=compact (without delete), both retention.ms and retention.bytes are ignored. Compaction retains the latest record for each key until a newer value arrives. Use max.compaction.lag.ms to bound how long a record can remain uncompacted, but that is not time retention.
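The two policies and the active-segment exemption can be sketched as a small model. This is a simplified illustration, not Kafka's implementation: the Segment class and the eligible_for_deletion function are hypothetical names, and -1 is taken to disable a policy as it does in Kafka configs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    largest_timestamp_ms: int
    active: bool = False  # the segment currently receiving writes


def eligible_for_deletion(segments: List[Segment], now_ms: int,
                          retention_ms: int, retention_bytes: int) -> List[int]:
    """Return base offsets of closed segments that either policy would delete."""
    ordered = sorted(segments, key=lambda s: s.base_offset)
    closed = [s for s in ordered if not s.active]  # active segment is exempt
    doomed = set()

    # Time-based: the newest record in the segment is older than retention.ms.
    if retention_ms >= 0:
        for seg in closed:
            if now_ms - seg.largest_timestamp_ms > retention_ms:
                doomed.add(seg.base_offset)

    # Size-based: drop the oldest closed segments for as long as the
    # partition would still be at or above retention.bytes afterwards.
    if retention_bytes >= 0:
        excess = sum(s.size_bytes for s in ordered) - retention_bytes
        for seg in closed:
            if excess - seg.size_bytes < 0:
                break
            doomed.add(seg.base_offset)
            excess -= seg.size_bytes

    return sorted(doomed)
```

Because the results of both checks are merged, a segment that breaches either limit is selected, while a partition whose only segment is the active one never yields anything to delete.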
Per-topic configs set via kafka-configs.sh take precedence over broker defaults in server.properties. A forgotten override means the broker default will not apply.
Because of the 5-minute check interval and the 1-minute deletion delay, retention.ms is a lower bound, not a real-time guarantee. A low-volume topic with a 10-minute retention can retain data for 20 minutes or more if the active segment has not rolled.
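As a rough worked estimate of that lower-bound behavior, the delays can simply be added up. The sketch below assumes the newest record in the segment is written at roll time, so the retention clock for the whole segment starts only then; worst_case_lifetime_ms and roll_wait_ms are hypothetical names, and the defaults are the 5-minute check interval and 1-minute deletion delay quoted above.

```python
def worst_case_lifetime_ms(retention_ms: int, roll_wait_ms: int,
                           check_interval_ms: int = 300_000,
                           delete_delay_ms: int = 60_000) -> int:
    """Approximate upper bound on how long the oldest record stays on disk.

    roll_wait_ms is the time between the record being written and its
    segment rolling (assumed equal to the age of the segment's newest record).
    """
    return roll_wait_ms + retention_ms + check_interval_ms + delete_delay_ms


# 10-minute retention, segment rolls 4 minutes after the record was written.
lifetime_min = worst_case_lifetime_ms(600_000, 240_000) / 60_000
```

With these inputs lifetime_min is 20.0. If the segment rolls only when segment.ms elapses, roll_wait_ms dominates the result.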
flowchart TD
A[Disk growing despite retention] --> B{Topic compacted?}
B -->|Yes| C{Policy includes delete?}
C -->|No| D[retention.bytes ignored<br/>check cleaner health]
B -->|No| E{Closed segments past<br/>retention.ms?}
C -->|Yes| E
E -->|No| F[Active segment exempt<br/>check segment roll settings]
E -->|Yes| G{retention.bytes set?}
G -->|Yes, not breached| H[Size limit not met<br/>time policy still active]
G -->|Breached / Not set| I[Wait for check interval<br/>and file deletion delay]
I --> J[Verify per-topic overrides<br/>vs broker defaults]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Active segment is the only segment | Partition has one giant .log file and no closed segments | Segment roll settings (segment.bytes, segment.ms) and write rate |
| Only one retention policy is breached | Disk exceeds size limit but segments are too young, or vice versa | Effective config for both values |
| Topic is compacted without delete | cleanup.policy=compact; retention.bytes ignored; disk grows with unique key count | cleanup.policy and log cleaner dirty ratio |
| Per-topic override shadows broker default | Broker default is 7 days but topic override is -1 or longer | kafka-configs.sh --describe output |
| Segment rolls prematurely due to index size | Segments roll at ~5 GiB because log.index.size.max.bytes (default 10 MiB) fills first | Segment sizes vs segment.bytes target |
| File deletion delay not elapsed | Segments disappeared from Kafka metadata but df shows no freed space | Wait for file.delete.delay.ms; check open file handles |
Quick checks
# Effective topic config (without --all, only topic-level overrides are listed)
kafka-configs.sh --bootstrap-server localhost:9092 \
--describe --entity-type topics --entity-name <topic>
# Broker defaults
grep -E 'log.retention|log.segment|cleanup.policy' /etc/kafka/server.properties
# Segment files in partition dir
ls -lth /var/lib/kafka-logs/<topic>-<partition>/
# Disk usage per log dir
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe
# Cleaner errors in broker logs
grep -iE 'cleaner|compaction' /var/log/kafka/server.log
# Open file descriptors
ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l
cat /proc/$(pgrep -f kafka.Kafka)/limits | grep "Max open files"
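For a more structured, read-only view than ls, a short script can list the .log files of one partition and flag the highest-numbered one as the active segment. This is a sketch under assumptions: list_segments is a hypothetical helper, it relies on segment files being named after their base offset, and the directory layout is the one used in the commands above.

```python
from pathlib import Path
from typing import List, Tuple


def list_segments(partition_dir: str) -> List[Tuple[int, int, bool]]:
    """Return (base_offset, size_bytes, is_active) per .log file, oldest first.

    Only reads directory metadata; nothing is modified.
    """
    logs = sorted(Path(partition_dir).glob("*.log"), key=lambda p: int(p.stem))
    rows = []
    for i, path in enumerate(logs):
        is_active = i == len(logs) - 1  # highest base offset = active segment
        rows.append((int(path.stem), path.stat().st_size, is_active))
    return rows
```

A single row flagged as active, carrying most of the partition's bytes, points at segment roll settings rather than at retention.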
How to diagnose it
- Confirm the effective configuration. Run kafka-configs.sh --describe for the topic. Compare retention.ms, retention.bytes, cleanup.policy, segment.ms, and segment.bytes to broker defaults in server.properties. Per-topic overrides shadow defaults completely.
- Identify the active segment. In the partition directory, the active segment is the highest-numbered .log file and is exempt from retention. If it is the only segment or contains data far older than retention.ms, it has not rolled. Check segment.bytes, segment.ms, and the index size limit. Low throughput keeps the segment open until segment.ms elapses.
- Verify closed segments are eligible. For closed segments, compare the age of the largest timestamp in the segment against retention.ms; Kafka uses the time index, not filesystem modification time. Use kafka-dump-log.sh --files <segment>.timeindex if necessary. For size-based retention, sum the segment sizes in the partition and compare against retention.bytes. The policies apply independently: a 10 GiB partition will not delete anything on size if retention.bytes is 20 GiB.
- Check if compaction is the culprit. If cleanup.policy=compact (without delete), retention.bytes is ignored. Check the log cleaner dirty ratio via JMX (kafka.log:type=LogCleaner,name=max-dirty-percent) or broker logs. A dead cleaner lets compacted topics grow without bound.
- Account for check interval and deletion delay. Even eligible segments wait for log.retention.check.interval.ms and file.delete.delay.ms. Do not expect instant reclamation.
- Estimate expected usage. Multiply BytesInPerSec for the topic by retention.ms (in seconds) and the replication factor. If actual disk usage is much higher, retention or compaction is failing.
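The last step is plain arithmetic and can be sketched as follows. The function name is hypothetical, and the estimate assumes bytes_in_per_sec is the topic's total ingress summed across brokers and that throughput is steady; it ignores index files and the extra data held by segments that have not yet aged out.

```python
def expected_footprint_bytes(bytes_in_per_sec: int, retention_ms: int,
                             replication_factor: int) -> int:
    """Steady-state cluster-wide disk usage implied by time-based retention."""
    return bytes_in_per_sec * retention_ms * replication_factor // 1000


# 1 MiB/s into a topic with 24-hour retention and replication factor 3.
expected = expected_footprint_bytes(1024**2, 24 * 3600 * 1000, 3)
expected_gib = expected / 1024**3
```

Here expected_gib is 253.125. Actual usage several times larger than this figure suggests that segments are not being deleted.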
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Disk space utilization on log.dirs volumes | Retention failure leads directly to disk full and broker outage | >75% TICKET; >90% or runway <4h PAGE |
| Log cleaner dirty ratio (max-dirty-percent) | On compacted topics, a silent cleaner failure mimics a retention failure | >50% sustained or climbing |
| BytesInPerSec per topic | Validates whether disk growth aligns with throughput and retention math | Rate exceeding retention reclaim capacity |
| Open file descriptor count | Each segment consumes FDs; a segment explosion pressures this resource | >75% of ulimit -n |
| Offline log directory count | Final consequence of unchecked disk growth; broker takes the directory offline | Any nonzero value PAGE |
| Partition count per broker | High partition counts amplify the active-segment headroom problem | >4,000 partitions per broker |
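The disk thresholds in the first row can be expressed as a small classifier. This is a sketch only: disk_alert and its parameters are hypothetical, and net_growth_bytes_per_sec is assumed to be the measured growth of the volume after retention has reclaimed what it can.

```python
def disk_alert(used_pct: float, free_bytes: float,
               net_growth_bytes_per_sec: float) -> str:
    """Map disk usage and runway to the TICKET/PAGE levels from the table."""
    if net_growth_bytes_per_sec > 0:
        runway_hours = free_bytes / net_growth_bytes_per_sec / 3600
    else:
        runway_hours = float("inf")  # flat or shrinking: no time-to-full

    if used_pct > 90 or runway_hours < 4:
        return "PAGE"
    if used_pct > 75:
        return "TICKET"
    return "OK"
```

A volume at 50% used can still page if the free space would be consumed in under four hours at the current growth rate, which is the case a percentage-only alert misses.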
Fixes
WARNING: Altering topic configuration can delete data or cause OffsetOutOfRangeException for lagging consumers. Check consumer lag first and run during a maintenance window when possible.
Lower retention. Reduce retention.ms or retention.bytes with kafka-configs.sh --alter. This takes effect at the next check interval. Do not delete segment files manually; Kafka must remove them to keep indexes and metadata consistent.
Force faster segment rolling. Reduce segment.ms on the topic. The active segment rolls once the age of its first record exceeds the new threshold, then becomes eligible for retention. Tradeoff: smaller segments increase file count and FD usage.
Enable size-based deletion on compacted topics. Change cleanup.policy to compact,delete. This restores retention.bytes enforcement. Tradeoff: older keys may be deleted entirely, changing semantics for consumers that expect indefinite retention.
Resolve per-topic override conflicts. Remove or lower unexpected topic overrides with kafka-configs.sh --alter. Verify with --describe.
Restart a broker with a dead cleaner. A controlled restart restarts the cleaner. Check broker logs for the underlying corrupt segment or error first, or the cleaner may die again. Tradeoff: restart evicts page cache and triggers ISR recovery.
Expand disk before adjusting retention. Add capacity to log.dirs first if you must preserve data. A full disk can take the directory offline.
Prevention
- Treat retention.ms and retention.bytes as independent guardrails, not a combined policy. Set both and verify which triggers first for your throughput.
- Monitor log cleaner dirty ratio and DeadThreadCount on every cluster with compacted topics, including __consumer_offsets.
- Size segment.bytes and log.index.size.max.bytes together. If you raise segment.bytes above 5 GiB, raise log.index.size.max.bytes proportionally.
- Keep 15-20% headroom per log.dirs volume to accommodate compaction overhead, reassignment, and the delay between eligibility and deletion.
- Convert disk monitoring from percentage used to time-to-full by correlating BytesInPerSec with retention configuration.
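The relationship between index size and segment size can be checked with a few lines of arithmetic. The sketch assumes 8-byte offset-index entries and one entry per log.index.interval.bytes of log data, which is the densest case; both function names are hypothetical.

```python
INDEX_ENTRY_BYTES = 8  # assumed: 4-byte relative offset + 4-byte file position


def index_covered_bytes(index_max_bytes: int = 10 * 1024**2,
                        index_interval_bytes: int = 4096) -> int:
    """Log bytes a full offset index can cover before it forces a roll."""
    return (index_max_bytes // INDEX_ENTRY_BYTES) * index_interval_bytes


def index_bytes_needed(segment_bytes: int,
                       index_interval_bytes: int = 4096) -> int:
    """Index size needed so the index does not fill before the segment does."""
    entries = -(-segment_bytes // index_interval_bytes)  # ceiling division
    return entries * INDEX_ENTRY_BYTES


covered_gib = index_covered_bytes() / 1024**3
```

With the defaults, covered_gib is 5.0, which reproduces the 5 GiB figure. index_bytes_needed gives the index limit to set for a chosen segment size.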
How Netdata helps
Netdata exposes disk utilization per mount, Kafka JMX metrics such as BytesInPerSec per topic, and OS signals such as disk I/O latency and file-descriptor usage. Use these to calculate retention runway, detect a stuck log cleaner, and separate retention failures from producer backfills or reassignment traffic.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
- Kafka disk space planning: retention, replication, and runway estimation
- Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses







