Kafka JVM heap and Full GC pauses: ISR drops, session timeouts, and right-sizing the heap

Sporadic UnderReplicatedPartitions and ISR shrinks that do not correlate with disk I/O or network faults, combined with consumer rebalances and NotEnoughReplicasException from producers using acks=all, point to broker JVM heap pressure. Check broker logs for GC pauses in the Old Generation lasting several seconds.

Brokers use the JVM heap for metadata, request buffers, and message format conversion. They do not store messages on the heap; the OS page cache handles that. When the heap is misconfigured or under pressure, garbage collection pauses can freeze a broker long enough to trigger ZooKeeper session expirations, follower lag, and cascading availability issues. As a rule of thumb, these symptoms begin once Full GC pauses exceed about five seconds; the exact point depends on zookeeper.session.timeout.ms and replica.lag.time.max.ms.

Do not simply add heap. An oversized heap causes longer pauses and steals RAM from the page cache, degrading read performance. Target a right-sized heap, typically 4-8 GB, with healthy post-GC headroom and minimal Old Generation activity.

What this means

A Full GC or long G1 Old Generation pause stops all application threads. During the pause, the broker cannot process produce or fetch requests, send heartbeats to ZooKeeper, or serve follower replication fetches. If the pause is long enough, ZooKeeper expires the broker’s session, triggering leader changes for all partitions the broker led and, if the broker was the active controller, a controller re-election. Simultaneously, followers on the pausing broker fall behind their leaders, causing those leaders to shrink the ISR. If enough replicas drop out, min.insync.replicas may no longer be satisfied, and producers with acks=all are rejected.

If the stalled broker is a group coordinator, it also delays consumer heartbeat handling, causing consumer session timeouts and unnecessary rebalances. This failure mode is invisible to basic process monitoring because the process remains alive. By the time the broker rejoins the cluster, the cascade of ISR shrinks, rebalances, and leader elections has already created noise and potential data-loss windows.

flowchart TD
    A[Heap pressure or oversized heap] --> B[Full GC pause > 5s]
    B --> C[Broker request threads stall]
    C --> D[ZK session timeout]
    C --> E[Follower fetch delays]
    E --> F[ISR shrinks]
    F --> G[UnderReplicatedPartitions rises]
    C --> H[Produce p99 latency spikes]
    D --> I[Controller re-election]
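
Whether a given pause actually triggers the branches in the flowchart depends on how it compares with the broker's configured timeouts. The sketch below is one way to express that comparison. The function name pause_impact is hypothetical, and the timeout values passed to it are assumed to be read from your own broker config (zookeeper.session.timeout.ms and replica.lag.time.max.ms); nothing is defaulted because shipped defaults differ between Kafka versions. It is a rough model: real expiry also depends on when the last heartbeat or fetch happened before the pause.

```shell
# Sketch: classify one observed GC pause against two broker timeouts.
# Arguments (all in milliseconds):
#   $1 = observed pause, $2 = zookeeper.session.timeout.ms,
#   $3 = replica.lag.time.max.ms
pause_impact() {
  if [ "$1" -ge "$2" ]; then
    echo "zk-session-expiry-likely"
  fi
  if [ "$1" -ge "$3" ]; then
    echo "isr-shrink-likely"
  fi
  if [ "$1" -lt "$2" ] && [ "$1" -lt "$3" ]; then
    echo "within-budget"
  fi
}

# Example with illustrative values only:
#   pause_impact 7000 6000 10000
```

Treat the output as a first-pass triage label, not a prediction; a pause shorter than either timeout can still push p99 latency up.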

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Heap too small or memory leak | Used after GC climbs toward max over hours or days; frequent Full GC | HeapMemoryUsage used vs. max trend |
| Oversized heap (the 32GB mistake) | Very long GC pauses, low page cache hit rate, elevated read latency | Heap max setting and OS available memory |
| Message format down-conversion | GC spikes after upgrading brokers or adding old clients | MessageConversionsPerSec |
| Large batches with broker-side decompression | Young GC frequency spikes during peak traffic, high allocation rate | MessagesInPerSec vs. BytesInPerSec ratio |
| Too many partitions or consumer groups on controller | Gradual heap growth on controller broker only | PartitionCount and LeaderCount on controller |

Quick checks

Run these safe, read-only checks to confirm heap pressure and cluster impact. Substitute $KAFKA_PID with the broker process ID, and replace localhost:9999 with the broker's JMX host and port if they differ.

# Check live GC utilization and generation distribution
jstat -gcutil $KAFKA_PID 1000

# Check heap used, committed, and max via JMX
echo "get -b java.lang:type=Memory HeapMemoryUsage" | java -jar jmxterm.jar -l localhost:9999

# Check cumulative Old Generation collection time
echo "get -b java.lang:type=GarbageCollector,name=G1\ Old\ Generation CollectionTime" | java -jar jmxterm.jar -l localhost:9999

# Check ISR shrink velocity on this broker
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions to scope blast radius
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Confirm produce latency spikes are broker-side
echo "get -b kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# Rule out disk I/O as the primary bottleneck
iostat -xz 1
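
The raw jstat output is wide, and the two columns that matter most here are O (Old Generation utilization) and FGC (Full GC count). As a minimal sketch, the filter below pulls out just those two. The function name old_gen_summary is hypothetical. It looks columns up by header name rather than position, on the assumption that the header row starts with S0, because the set of columns varies between JDK versions.

```shell
# Sketch: reduce `jstat -gcutil` output to Old Gen utilization and Full GC count.
# Reads jstat output on stdin; columns are located by header name.
old_gen_summary() {
  awk '
    # Header row: remember the position of every column name.
    $1 == "S0" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: print only the O and FGC columns.
    ("O" in col) && ("FGC" in col) {
      printf "old=%s%% full_gcs=%s\n", $(col["O"]), $(col["FGC"])
    }
  '
}

# Usage (five one-second samples):
#   jstat -gcutil "$KAFKA_PID" 1000 5 | old_gen_summary
```

A rising full_gcs value alongside old staying high between samples is the pattern the diagnosis steps below look for.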

How to diagnose it

  1. Confirm GC is the trigger. Overlay GC timestamps from broker GC logs with IsrShrinksPerSec and p99 TotalTimeMs spikes. If ISR shrinks and latency spikes align with Old Gen collections, GC is the smoking gun.
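  2. Measure pause duration. CollectionTime from java.lang:type=GarbageCollector,name=G1 Old Generation is a cumulative total in milliseconds, not a per-pause value. Divide its increase over an interval by the increase in CollectionCount to get the average pause, and read individual pause durations from the GC log. If individual pauses exceed 5 seconds, the broker is at risk of ZK session timeout and follower lag.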
  3. Check heap utilization after GC. Query HeapMemoryUsage. If used stays above 80% of max after Full GC, the heap is undersized or there is a leak.
  4. Identify the heap consumer. Correlate GC spikes with MessageConversionsPerSec. If conversion rate is non-zero during GC events, broker-side buffer materialization from old clients is the likely culprit.
  5. Locate the pausing follower. ISR shrinks are recorded by partition leaders, not by the follower that fell behind. If IsrShrinksPerSec spikes on brokers that show no GC pauses themselves, the pausing broker is a follower for their partitions. Cross-reference UnderReplicatedPartitions across all brokers to find the replica that is the common denominator.
  6. Rule out page cache pressure. Monitor the delta of pgmajfault in /proc/vmstat over an interval. A high rate indicates that reads are missing the page cache and going to disk, which adds latency on top of any GC pauses.
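
For steps 1 and 2, the GC log is the most direct source of individual pause durations. The filter below is a sketch that assumes the JDK 9+ unified logging format, in which each pause line contains the word Pause and ends with a duration such as 6234.567ms; JDK 8 logs use a different layout and would need a different pattern. The function name long_pauses and the log path in the usage comment are hypothetical.

```shell
# Sketch: print GC log lines whose pause duration is at or above a threshold.
# $1 = threshold in milliseconds; the GC log is read from stdin.
long_pauses() {
  awk -v limit="$1" '
    # Only consider pause lines that end in "<number>ms".
    /Pause/ && $NF ~ /^[0-9.]+ms$/ {
      ms = $NF
      sub(/ms$/, "", ms)
      if (ms + 0 >= limit + 0) print
    }
  '
}

# Usage (path is an assumption; use your broker's GC log location):
#   long_pauses 5000 < /path/to/kafkaServer-gc.log
```

The timestamps on the matching lines are what you overlay on IsrShrinksPerSec and p99 TotalTimeMs.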

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| java.lang:type=Memory HeapMemoryUsage | Reveals actual heap pressure after collection | Used > 80% of max after Full GC |
| java.lang:type=GarbageCollector,name=G1 Old Generation CollectionTime | Measures duration of stop-the-world pauses | Any pause > 5s; more than 2-3 Full GCs in 10 min |
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained OneMinuteRate > 0 outside maintenance |
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Cumulative impact of ISR shrinks on durability | Nonzero and growing across multiple brokers |
| kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | End-to-end latency spike confirmation | p99 spikes aligned with GC timestamps |
| kafka.server:type=SessionExpireListener,name=ZooKeeperExpiresPerSec | Confirms ZK session loss from pauses (ZK mode) | Any nonzero OneMinuteRate outside maintenance |
| kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | Shows if GC pauses are starving request threads | Sustained drop below 0.3 during GC events |
| kafka.network:type=RequestChannel,name=RequestQueueSize | Backpressure from paused I/O threads | Sustained elevation above 250 |
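
The first two signals need a little arithmetic before they can be compared with the warning signs. The helpers below are a sketch of that arithmetic; the function names are hypothetical, and the inputs are assumed to be values you have already read from JMX (the used and max fields of HeapMemoryUsage in bytes, and two samples of CollectionTime and CollectionCount from the G1 Old Generation MBean).

```shell
# Percent of max heap in use, as an integer.
# $1 = HeapMemoryUsage used (bytes), $2 = HeapMemoryUsage max (bytes)
heap_used_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Average pause in milliseconds between two samples of a GarbageCollector MBean.
# $1 = CollectionTime at sample 1, $2 = CollectionCount at sample 1,
# $3 = CollectionTime at sample 2, $4 = CollectionCount at sample 2
avg_pause_ms() {
  # No new collections between samples: report 0 instead of dividing by zero.
  if [ "$4" -le "$2" ]; then
    echo 0
    return
  fi
  echo $(( ($3 - $1) / ($4 - $2) ))
}

# Example with illustrative numbers:
#   heap_used_pct 4831838208 6442450944
#   avg_pause_ms 12000 3 24500 5
```

An average can hide a single very long pause among several short ones, so use it for alerting trends and rely on the GC log for exact durations.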

Fixes

Right-size the heap

Target 4-8 GB. Set -Xms equal to -Xmx to avoid heap resizing pauses. Avoid the 32GB mistake: heaps approaching or exceeding that size cause longer pauses and steal memory from the OS page cache without proportional benefit. If you need more than 8 GB to stay below 70% utilization after GC, investigate a memory leak or reduce metadata load rather than simply growing the heap.

Tradeoff: Larger heaps reduce out-of-memory risk but increase pause duration and reduce page cache available for log segments.
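
As a minimal sketch of the setting above, the fragment below builds matching -Xms and -Xmx flags and exports them through KAFKA_HEAP_OPTS, the environment variable read by the stock kafka-server-start.sh script. The helper name heap_opts_for_gb is hypothetical, and 6 GB is only an illustrative value inside the 4-8 GB range.

```shell
# Hypothetical helper: emit identical initial and maximum heap flags
# for a heap size given in whole gigabytes.
heap_opts_for_gb() {
  echo "-Xms${1}g -Xmx${1}g"
}

# Set before starting the broker. 6 is an example, not a recommendation;
# pick the size from your own heap-after-GC measurements.
export KAFKA_HEAP_OPTS="$(heap_opts_for_gb 6)"
```

If the broker is started by a service manager or container entrypoint, set the variable in that unit's environment rather than in an interactive shell.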

Eliminate message down-conversion

Upgrade legacy clients to match the broker’s current message format version. This removes the large on-heap buffers that trigger Old GC.

Tradeoff: Requires client deployment coordination and validation.

Reduce allocation rate

If large batches are causing frequent Young GC, work with producers to reduce batch sizes or align compression codecs to avoid broker-side decompression and recompression.

Tradeoff: Smaller batches increase per-request overhead and may reduce throughput.

Address controller metadata load

If the controller broker alone shows heap growth, reduce total partition and consumer group load. Controller memory scales with cluster-wide metadata and consumer group state.

Tradeoff: Consolidating topics or reassigning partitions to reduce this load generates transient under-replication and I/O load.

Isolate a broker in a GC death spiral

If one broker is repeatedly triggering Full GC cascades, perform a controlled shutdown to migrate leadership cleanly. Investigate a heap dump offline before restarting. Do not restart blindly; if the issue is a leak or corrupt segment, it will recur.

Tradeoff: Brief under-replication during the controlled shutdown window.

Prevention

  • Keep heap after GC below 70-80%. This leaves headroom for allocation spikes and prevents frequent Full GC cycles.
  • Enable GC logging. Chart pause duration and heap after GC trends to catch gradual leaks.
  • Maintain a 4-8 GB heap. Do not oversize because the host has RAM; that memory belongs to the page cache.
  • Monitor MessageConversionsPerSec. Plan client upgrades before broker version changes force on-heap down-conversion.
  • Game-day test broker failure recovery under full load. Observe GC behavior during stress and verify pause budgets stay under 5 seconds.
  • Reserve host RAM for the page cache. In containers, ensure memory limits account for both JVM heap and OS cache; starving the host of RAM forces disk reads and amplifies GC pressure.
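
To check whether the page cache is actually being starved, one option is to sample the pgmajfault counter in /proc/vmstat twice and compute a rate. The sketch below assumes a Linux host; the function names are hypothetical, and the optional file argument exists only so the parser can be tried against a saved copy of /proc/vmstat.

```shell
# Read the cumulative major page fault counter.
# $1 = file to read (optional, defaults to /proc/vmstat)
pgmajfault_count() {
  awk '$1 == "pgmajfault" { print $2 }' "${1:-/proc/vmstat}"
}

# Major faults per second between two readings.
# $1 = earlier reading, $2 = later reading, $3 = seconds between them
majfault_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Usage on a broker host:
#   before=$(pgmajfault_count); sleep 10; after=$(pgmajfault_count)
#   majfault_rate "$before" "$after" 10
```

There is no universal threshold for this rate; compare it against the same host during a known-healthy period.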

How Netdata helps

  • Netdata collects HeapMemoryUsage and GC CollectionTime, so you can overlay GC pauses on IsrShrinksPerSec and produce latency to confirm correlation.
  • The Kafka collector surfaces UnderReplicatedPartitions, IsrShrinksPerSec, and request latency percentiles without manual JMXterm queries.
  • OS-level metrics including pgmajfault rate and disk await are available on the same dashboard, helping distinguish heap pressure from page cache thrashing or disk degradation.
  • Anomaly detection on heap utilization and GC pause duration flags gradual leaks before Full GC cascades.