Kafka log compaction falling behind: the dead cleaner thread and unbounded disk growth

Disk utilization climbs steadily on one or more brokers while producer traffic stays flat. __consumer_offsets balloons from a few gigabytes to hundreds. Under-replicated partitions are at zero, producers show no errors, and standard Kafka alerts are silent. The usual culprit is a dead log cleaner thread: it hit a corrupt record or an OOM during compaction, crashed, and never restarted. Every compacted topic now accumulates records without bound.

This failure is silent by design. Kafka has no built-in alert for a dead cleaner thread. Without cleaner-specific JMX metrics, the first sign is a disk space ticket or a broker taking its log directory offline.

What this means

Log compaction retains the latest record for each key on compacted topics. The log cleaner thread rewrites older segments and discards superseded records. When it dies, compaction stops. Records that should have been removed accumulate indefinitely.
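
To make the retention rule concrete, here is a minimal sketch of what a compaction pass keeps. It is a toy model, not Kafka's implementation: compact_log is a hypothetical helper, records are assumed to be tab-separated key/value lines in offset order, and tombstones and segment boundaries are ignored.

```shell
# Toy model of log compaction: keep only the last record seen for each key.
# Input (stdin): "key<TAB>value" lines in offset order.
compact_log() {
  awk -F '\t' '
    { key[NR] = $1; line[NR] = $0; last[$1] = NR }   # remember the newest offset per key
    END {
      for (n = 1; n <= NR; n++)
        if (last[key[n]] == n) print line[n]        # emit a record only if nothing superseded it
    }
  '
}

# Example: the first k1 record is superseded and dropped.
# printf 'k1\ta\nk2\tb\nk1\tc\n' | compact_log
```

When the cleaner is dead, no pass like this ever runs, so every superseded record stays on disk.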

Compacted topics with cleanup.policy=compact (the default for __consumer_offsets) ignore retention.bytes and retention.ms unless you explicitly set cleanup.policy=compact,delete. They grow with the number of unique keys and the write rate. __consumer_offsets exists on every cluster. When the cleaner dies, this topic grows without bound. The problem stays invisible until disk utilization crosses a threshold or the broker takes the log directory offline.
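
If a compacted topic should also be bounded by time, the combined policy can be set per topic. The sketch below assumes a broker at localhost:9092; set_compact_delete, the KAFKA_CONFIGS and BOOTSTRAP variables, and the topic name in the example are illustrative, and the retention value is a placeholder. Be cautious about applying this to __consumer_offsets, since deleting old segments there can discard committed offsets that have not been rewritten recently.

```shell
# Hypothetical wrapper around kafka-configs.sh.
# Set KAFKA_CONFIGS=echo to print the arguments instead of running the tool.
KAFKA_CONFIGS="${KAFKA_CONFIGS:-kafka-configs.sh}"

set_compact_delete() {
  topic="$1"          # topic to alter
  retention_ms="$2"   # time bound applied on top of compaction
  # Square brackets group a value that itself contains commas.
  "$KAFKA_CONFIGS" --bootstrap-server "${BOOTSTRAP:-localhost:9092}" \
    --entity-type topics --entity-name "$topic" \
    --alter --add-config "cleanup.policy=[compact,delete],retention.ms=${retention_ms}"
}

# Example (7 days): set_compact_delete my-compacted-topic 604800000
```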

flowchart TD
    A[Cleaner thread crashes on corrupt record or OOM] --> B[Compaction stops]
    B --> C[Compacted topics retain every record]
    C --> D[__consumer_offsets grows without bound]
    D --> E[Disk utilization climbs steadily]
    E --> F[Broker log directory goes offline or broker crashes]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Dead cleaner thread from corrupt record or OOM | Disk growing despite stable BytesInPerSec; evidence of dead threads in JMX or logs | Broker logs for ERROR or FATAL lines mentioning “cleaner” |
| Cleaner overwhelmed by write volume | max-dirty-percent climbing steadily; compaction cannot keep up | Ingest rate to compacted topics versus log.cleaner.threads |
| Dedup buffer exhaustion | Intermittent cleaner crashes tied to high unique key counts | log.cleaner.dedupe.buffer.size relative to topic key cardinality |
| Unique-key explosion on a compacted topic | Topic grows rapidly even with low traffic; every message has a unique key | Producer key distribution for the topic |

Quick checks

# Check the maximum dirty ratio across compacted topics
echo "get -b kafka.log:type=LogCleanerManager,name=max-dirty-percent Value" | java -jar jmxterm.jar -l localhost:9999

# Check whether any cleaner threads have died
echo "get -b kafka.log:type=LogCleaner,name=DeadThreadCount Value" | java -jar jmxterm.jar -l localhost:9999

# List log directory sizes and find bloated topics
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe

# Search for cleaner thread crashes in broker logs (include rotated logs)
grep -i "cleaner\|compaction" /var/log/kafka/server.log*

# Check disk utilization for the log directories
grep '^log\.dirs=' /etc/kafka/server.properties | cut -d= -f2 | tr ',' '\n' | while read -r d; do df -h "$d"; done

# Review current cleaner configuration
grep -E "^log.cleaner" /etc/kafka/server.properties

How to diagnose it

  1. Confirm unbounded growth on a compacted topic. Run kafka-log-dirs.sh --describe and look for partitions of topics configured with cleanup.policy=compact (including __consumer_offsets) that are disproportionately large relative to their expected key cardinality. Compare the reported size of each replica across brokers. A single broker hosting a much larger replica of __consumer_offsets strongly suggests that broker’s cleaner is dead.

  2. Check compaction health. Query max-dirty-percent. It measures the ratio of dirty to total compacted log bytes, reported as a percentage. In steady state it should stay near or below log.cleaner.min.cleanable.ratio (default 0.5, that is, 50%). A value climbing toward 100% means compaction is stalled or falling behind.

  3. Check for dead threads. Query DeadThreadCount. If it is nonzero, at least one cleaner thread has exited and will not restart on its own. If the metric is unavailable, search broker logs for FATAL or ERROR from LogCleaner followed by silence. A dead thread produces no subsequent INFO lines about compaction resuming.

  4. Correlate with broker logs. Search /var/log/kafka/server.log* for ERROR and cleaner. Look for CorruptRecordException, IllegalArgumentException, BufferOverflowException, or out-of-memory traces (java.lang.OutOfMemoryError in the broker log, or Killed process in the kernel log if the OOM killer intervened). BufferOverflowException usually means the dedupe buffer is undersized. An OOM signals insufficient JVM heap or a memory leak. The crash is logged even though no JMX alert fires.

  5. Rule out retention misconfiguration. Verify that growth is specific to compacted topics. Non-compacted topics growing points to retention.ms, retention.bytes, or log.retention.check.interval.ms issues rather than a dead cleaner.
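
The log correlation in step 4 can be sketched as a small filter that counts the exception types named above in the lines following a cleaner error. This is an illustration under assumptions: summarize_cleaner_errors is a hypothetical helper, the five-line window after each ERROR or FATAL line is an arbitrary choice, and real stack traces may need a wider window.

```shell
# Count known exception types that appear shortly after an ERROR/FATAL line
# mentioning the cleaner. Arguments: one or more broker log files.
summarize_cleaner_errors() {
  awk '
    /ERROR|FATAL/ && tolower($0) ~ /cleaner/ { window = 5 }   # open a 5-line window (assumed size)
    window > 0 {
      if (match($0, /CorruptRecordException|IllegalArgumentException|BufferOverflowException|OutOfMemoryError/))
        count[substr($0, RSTART, RLENGTH)]++
      window--
    }
    END { for (e in count) print count[e], e }
  ' "$@" | sort -rn
}

# Example: summarize_cleaner_errors /var/log/kafka/server.log*
```

An empty result does not prove the cleaner is healthy; it only means none of the listed exceptions followed a cleaner error in the files searched.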

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kafka.log:type=LogCleanerManager,name=max-dirty-percent | Measures the ratio of dirty to total compacted log bytes awaiting compaction. | Sustained value above 50% and climbing. |
| kafka.log:type=LogCleaner,name=DeadThreadCount | Count of cleaner threads that have exited permanently. | Any nonzero value. |
| kafka.log:type=LogCleaner,name=max-clean-time-secs | How long the slowest compaction pass takes. | Sustained increase or sudden spikes before a crash. |
| Disk utilization on log.dirs volumes | Compaction failure eventually causes disk exhaustion. | Alert at 75%; page at 90% or when runway is under 4 hours. |
| kafka.log:type=LogManager,name=OfflineLogDirectoryCount | Kafka takes log directories offline when disk is full or I/O errors occur. | Any nonzero value. |
| kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Distinguishes compaction failure from actual traffic growth. | Stable or flat while disk still grows. |
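
The disk thresholds in the table can be turned into a simple runway calculation. The sketch below assumes two samples of used bytes taken a known number of seconds apart and a linear growth rate; disk_alert_level is a hypothetical helper, not part of Kafka or any monitoring tool.

```shell
# Classify a log.dirs volume from two usage samples.
# $1 used bytes (earlier), $2 used bytes (now), $3 seconds between samples, $4 capacity in bytes
disk_alert_level() {
  awk -v s1="$1" -v s2="$2" -v dt="$3" -v cap="$4" 'BEGIN {
    util = 100 * s2 / cap
    growth = s2 - s1
    # hours until full at the observed rate; -1 means usage is not growing
    runway = (growth > 0) ? (cap - s2) * dt / growth / 3600 : -1
    level = "ok"
    if (util >= 75) level = "warn"
    if (util >= 90 || (runway >= 0 && runway < 4)) level = "page"
    if (runway < 0) printf "%s util=%.1f%% runway=n/a\n", level, util
    else            printf "%s util=%.1f%% runway=%.1fh\n", level, util, runway
  }'
}

# Example: 800 GB used an hour ago, 900 GB now, 1000 GB volume.
# disk_alert_level 800000000000 900000000000 3600 1000000000000
```

A linear projection is crude, but it is usually enough to decide whether a stalled cleaner is an issue for today or for next week.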

Fixes

Restart the cleaner thread

A broker restart recreates the cleaner threads. This is disruptive but reliable. Restart the affected brokers one at a time and wait for each to rejoin the ISR before moving on; restarting several brokers at once can take partitions offline. Before restarting, capture the relevant log lines so you can identify the corrupt segment or configuration issue that caused the crash. If the crash was caused by a specific corrupt record in __consumer_offsets, the cleaner may die again after restart when it reaches the same offset.

In that case, you may need to stop the broker and remove the offending segment and its index files from the replica on that broker, then restart and let the broker truncate and re-replicate. Warning: this is destructive. Only proceed when the partition is fully replicated, the ISR is healthy, and the target broker is not the leader. Do not delete segments on a leader broker.
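
The three preconditions above can be checked mechanically before touching any files. The sketch below parses the usual tab-separated output of kafka-topics.sh --describe (the Partition:, Leader:, Replicas:, and Isr: labels are assumed from typical output and may differ between versions); safe_to_remove_segment is a hypothetical helper and does not replace manual verification.

```shell
# Check one partition against the preconditions for removing a segment on a broker.
# $1 broker id, $2 partition number; stdin: output of `kafka-topics.sh --describe`
safe_to_remove_segment() {
  awk -v broker="$1" -v part="$2" '
    {
      p = leader = replicas = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") p = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      if (p == "" || p != part) next          # skip the header and other partitions
      found = 1
      nr = split(replicas, r, ","); ni = split(isr, s, ",")
      if (leader == broker) { print "unsafe: broker is leader"; exit }
      if (ni < nr)          { print "unsafe: ISR is smaller than the replica set"; exit }
      print "ok"; exit
    }
    END { if (!found) print "unknown: partition not found" }
  '
}

# Example (broker 2, partition 17):
# kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets \
#   | safe_to_remove_segment 2 17
```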

Increase cleaner throughput

If max-dirty-percent is high but there is no evidence of dead threads, the cleaner is alive but cannot keep up. Increase log.cleaner.threads from its default of 1. A change made in server.properties takes effect only after a broker restart; on Kafka 1.1 and later the setting can also be updated dynamically with kafka-configs.sh. Adding threads increases parallelism but also raises disk contention; monitor iowait and disk latency after the change.

Also evaluate log.cleaner.dedupe.buffer.size. If the buffer is too small for the number of unique keys in the topic, the cleaner makes poor progress per pass and may crash. Raising the buffer increases memory pressure; ensure the JVM has sufficient headroom or the cleaner will OOM.
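
A rough way to judge whether the buffer is large enough is to estimate how many unique keys one cleaner thread can track per pass. The arithmetic below rests on assumptions about the cleaner's internals as I understand them: the buffer is split evenly across threads, each key costs about 24 bytes (a 16-byte hash plus an 8-byte offset), and the map is filled to the default load factor of 0.9 (log.cleaner.io.buffer.load.factor). dedupe_keys_per_thread is a hypothetical helper.

```shell
# Estimate unique keys per cleaner thread per pass.
# $1 log.cleaner.dedupe.buffer.size in bytes, $2 log.cleaner.threads
dedupe_keys_per_thread() {
  buffer_bytes="$1"
  threads="$2"
  per_thread=$(( buffer_bytes / threads ))   # buffer is shared evenly (assumption)
  slots=$(( per_thread / 24 ))               # ~24 bytes per key (assumption)
  echo $(( slots * 9 / 10 ))                 # 0.9 load factor (assumed default)
}

# Example: the default 128 MiB buffer with one thread.
# dedupe_keys_per_thread 134217728 1
```

If the dirty section of a partition holds many more unique keys than this estimate, expect the cleaner to need several passes to work through it.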

Fix producer key hygiene

If a compacted topic is growing because every message carries a unique key, compaction provides no value and the topic behaves like an unbounded log. This must be fixed in the producer. Ensure producers write deterministic keys for compacted topics. If the workload genuinely requires unbounded retention, migrate it to a topic with cleanup.policy=delete and size retention accordingly.
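
One way to check key hygiene is to sample keys from the topic and measure how many are distinct. The sketch below assumes keys are printable and arrive one per line; unique_key_ratio is a hypothetical helper, and the topic name and sample size in the example are placeholders.

```shell
# Report total keys, distinct keys, and the distinct/total ratio.
# stdin: one record key per line
unique_key_ratio() {
  awk '
    { total++; if (!($0 in seen)) { seen[$0] = 1; distinct++ } }
    END {
      if (total == 0) { print "no keys"; exit }
      printf "%d %d %.2f\n", total, distinct, distinct / total
    }
  '
}

# Example: sample keys only, without values.
# kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-compacted-topic \
#   --from-beginning --max-messages 100000 \
#   --property print.key=true --property print.value=false | unique_key_ratio
```

A ratio close to 1.00 on a large sample suggests that nearly every record has its own key and compaction has little to remove.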

Prevention

  • Monitor max-dirty-percent and dead thread indicators. These are the only early warnings of a silent cleaner death. Set an alert on any transition from zero to nonzero dead threads, or on max-dirty-percent sustained above 50%.
  • Track disk utilization per log.dirs volume. Compacted topics with pure compact policy do not respect retention.bytes. Do not rely on size-based retention to protect them unless you use compact,delete. A dedicated disk alert on compacted-topic brokers is essential.
  • Size cleaner resources for peak traffic. The default of one cleaner thread (log.cleaner.threads=1) may not keep up with many compacted partitions under high write load. Plan thread count and log.cleaner.dedupe.buffer.size based on the number of unique keys in your compacted topics.
  • Audit compacted topic key cardinality. A compacted topic with unbounded unique keys will fill disk regardless of cleaner health. Review producer key choices during design reviews.
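
The first alert rule above can be sketched as a small evaluation function. It assumes the metric values have already been collected (for example with the jmxterm commands shown earlier) and treats "sustained" as every supplied sample exceeding the threshold; cleaner_health is a hypothetical helper, not a Kafka or Netdata feature.

```shell
# Evaluate cleaner health from collected metric values.
# $1 DeadThreadCount; remaining arguments: consecutive max-dirty-percent samples (integers)
cleaner_health() {
  dead="$1"; shift
  if [ "$dead" -gt 0 ]; then
    echo "critical: $dead dead cleaner thread(s)"
    return 0
  fi
  if [ "$#" -eq 0 ]; then echo "ok"; return 0; fi
  for dirty in "$@"; do
    # a single sample at or below 50% means the breach is not sustained
    if [ "$dirty" -le 50 ]; then echo "ok"; return 0; fi
  done
  echo "warning: max-dirty-percent above 50% for $# consecutive samples"
}

# Example: no dead threads, three samples taken five minutes apart.
# cleaner_health 0 62 71 78
```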

How Netdata helps

  • Correlate rising disk utilization with flat BytesInPerSec to isolate compaction failure from traffic growth.
  • Alert on DeadThreadCount and max-dirty-percent using JMX collectors to catch silent cleaner deaths before disk fills.
  • Surface per-broker disk utilization for each log.dirs volume alongside OS-level I/O metrics to distinguish cleaner-induced growth from replication or backfill traffic.
  • Track JVM heap utilization and GC behavior to identify OOM conditions that can kill cleaner threads.

Related guides

  • How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
  • Kafka enable.auto.commit data loss: committed offsets that outrun processing: /guides/kafka/kafka-auto-commit-silent-data-loss/
  • Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery: /guides/kafka/kafka-broker-out-of-disk/
  • Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
  • Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
  • Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
  • Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
  • Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
  • Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
  • Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker: /guides/kafka/kafka-disk-io-latency-high/
  • Kafka disk space planning: retention, replication, and runway estimation: /guides/kafka/kafka-disk-space-runway-planning/
  • Kafka fetch request latency high: FetchConsumer vs FetchFollower and page cache misses: /guides/kafka/kafka-fetch-request-latency-high/