Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas

Producers are throwing org.apache.kafka.common.errors.NotEnoughReplicasException or NotEnoughReplicasAfterAppendException, and acks=all writes are failing while acks=1 or acks=0 writes may still succeed. The affected partitions no longer have enough in-sync replicas to satisfy min.insync.replicas. The immediate operational question is whether the ISR shrink is a transient recovery blip or a sustained degradation that will block writes until you fix the follower.

What this means

The leader keeps every follower that has caught up within replica.lag.time.max.ms in the in-sync replica set (ISR). For acks=all, the leader waits for all current ISR members before acknowledging the producer. If the ISR size drops below min.insync.replicas, the leader rejects the produce request. The broker-level default for min.insync.replicas is 1, so a lone leader can acknowledge alone. In practice, with replication.factor=3 and acks=all, set min.insync.replicas=2: losing one follower is tolerated, and losing a second blocks writes instead of silently weakening durability.
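
The acceptance rule above reduces to a size comparison. The following is a minimal sketch of that rule, not broker code; the function names are hypothetical and exist only for illustration.

```python
def accepts_acks_all(isr_size: int, min_insync_replicas: int) -> bool:
    """Return True if a leader with this ISR size would accept an acks=all write."""
    return isr_size >= min_insync_replicas


def tolerated_replica_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """Number of replicas that can leave a full ISR before acks=all writes are rejected."""
    return max(replication_factor - min_insync_replicas, 0)


if __name__ == "__main__":
    # replication.factor=3, min.insync.replicas=2:
    # a full ISR and an ISR of two accept writes, a lone leader does not.
    for isr_size in (3, 2, 1):
        print(isr_size, accepts_acks_all(isr_size, min_insync_replicas=2))
```

With replication.factor=3, tolerated_replica_losses returns 1 for min.insync.replicas=2 and 0 for min.insync.replicas=3, which is why the second setting rejects writes on any single follower loss.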

NotEnoughReplicasException is thrown before the record is appended. NotEnoughReplicasAfterAppendException means the leader appended the record and then found the ISR below min.insync.replicas before all followers acknowledged. The distinction matters for retries: in the second case the record is already in the leader log, so a retry from a producer without idempotence enabled can write a duplicate.

flowchart TD
    A[Broker disk slows or GC pauses] --> B[Follower fetch lag grows]
    B --> C[Lag exceeds replica.lag.time.max.ms]
    C --> D[ISR shrinks]
    D --> E[UnderReplicatedPartitions rises]
    D --> F{ISR < min.insync.replicas?}
    F -->|No| G[acks=all writes succeed]
    F -->|Yes| H[Leader rejects produce]
    H --> I[NotEnoughReplicasException]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow follower (disk I/O, GC, network saturation) | IsrShrinksPerSec spikes on the leader; UnderReplicatedPartitions rises; follower disk await elevated | iostat -xz 1 on the follower and Replica Max Lag |
| Transient broker restart or recovery | ISR shrinks are clustered around one broker; broker uptime low; shrinks resolve within 1-2 times replica.lag.time.max.ms | Broker uptime and IsrExpandsPerSec |
| min.insync.replicas misconfigured relative to replication factor | AtMinIsrPartitionCount above zero even in steady state; topic config shows min.insync.replicas is close to or equal to replication.factor | kafka-configs.sh --describe --all for topic configs |
| Network partition between leader and follower | ISR drops one follower suddenly; follower appears healthy locally with no disk or GC issues | Network connectivity and FetchFollower latency from the leader |
| Cascading ISR shrink from multiple broker stress | UnderReplicatedPartitions rises across many leaders; RequestHandlerAvgIdlePercent drops on multiple brokers | Cluster-wide disk and network saturation signals |

Quick checks

# Partitions currently under min ISR
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions
# Under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# UnderMinIsrPartitionCount via JMX
echo "get -b kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount Value" | java -jar jmxterm.jar -l localhost:9999
# ISR shrink and expand velocity
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=ReplicaManager,name=IsrExpandsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Disk latency on the suspected follower
iostat -xz 1
# Topic config: compare min.insync.replicas to replication.factor
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type topics --entity-name <topic> --all
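
The --describe output can be post-processed to list which replicas are missing from each ISR. The sketch below assumes partition lines of the form "Topic: <name> Partition: <n> Leader: <id> Replicas: <ids> Isr: <ids>"; that layout and the sample text are assumptions to verify against your Kafka version, and parse_describe is a hypothetical helper name.

```python
import re

# Assumed shape of a partition line in `kafka-topics.sh --describe` output.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(output: str) -> list:
    """Return one dict per partition line, including replicas missing from the ISR."""
    rows = []
    for line in output.splitlines():
        match = _PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary lines and blank lines are skipped
        replicas = [int(b) for b in match.group("replicas").split(",") if b]
        isr = [int(b) for b in match.group("isr").split(",") if b]
        rows.append({
            "topic": match.group("topic"),
            "partition": int(match.group("partition")),
            "leader": match.group("leader"),
            "replicas": replicas,
            "isr": isr,
            "missing": [b for b in replicas if b not in isr],
        })
    return rows


if __name__ == "__main__":
    # Illustrative sample only; not captured from a real cluster.
    sample = (
        "Topic: orders\tPartitionCount: 2\tReplicationFactor: 3\tConfigs: min.insync.replicas=2\n"
        "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1\n"
        "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1\n"
    )
    for row in parse_describe(sample):
        print(row["topic"], row["partition"], "missing from ISR:", row["missing"])
```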

How to diagnose it

  1. Confirm the blast radius. Use UnderMinIsrPartitionCount and kafka-topics.sh --describe --under-min-isr-partitions to identify which partitions are rejecting writes. A nonzero value means the error is cluster-side for those partitions, not a client issue.
  2. Map partitions to leaders. UnderReplicatedPartitions on leader brokers shows which brokers are leading the degraded partitions.
  3. Read the ISR composition. If the ISR is [Leader] for an RF=3 partition, two followers are missing. If it is [Leader, Follower1], one is missing. This tells you how many replicas must recover.
  4. Find the common lagging follower. Cross-reference UnderReplicatedPartitions across all brokers to identify which follower appears most often in under-replicated sets. Inter-broker replication metrics are reported on the leader, so correlate by elimination.
  5. Inspect follower health. On the lagging broker, check disk await, GC pause logs, and NetworkProcessorAvgIdlePercent. High FetchFollower latency on the leader also points to a slow follower.
  6. Determine if the condition is transient. If IsrExpandsPerSec is rising and IsrShrinksPerSec has stopped, the broker is recovering. If both are sustained, you have ISR flapping caused by an intermittent resource issue.
  7. Validate configuration. Ensure min.insync.replicas is not equal to or greater than replication.factor for topics receiving acks=all.
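
Steps 3, 4, and 6 can be sketched as small pure functions. This is an illustration under stated assumptions: the inputs are (replicas, isr) pairs of broker IDs you have already collected, the function names are hypothetical, step 6 is simplified to "rate above zero" rather than "rising", and the "degrading" and "no movement" labels are additions that do not appear in the steps above.

```python
from collections import Counter


def missing_from_isr(replicas, isr):
    """Step 3: replicas assigned to a partition that are not currently in its ISR."""
    in_sync = set(isr)
    return [broker for broker in replicas if broker not in in_sync]


def rank_lagging_brokers(partitions):
    """Step 4: count how often each broker is absent from an ISR it should belong to.

    `partitions` is an iterable of (replicas, isr) pairs of broker IDs.
    Returns (broker_id, count) pairs, most frequent first.
    """
    counts = Counter()
    for replicas, isr in partitions:
        counts.update(missing_from_isr(replicas, isr))
    return counts.most_common()


def classify_isr_trend(shrinks_per_sec, expands_per_sec):
    """Step 6, simplified: label the current shrink and expand rates."""
    if shrinks_per_sec > 0 and expands_per_sec > 0:
        return "flapping"
    if expands_per_sec > 0:
        return "recovering"
    if shrinks_per_sec > 0:
        return "degrading"
    return "no movement"


if __name__ == "__main__":
    observed = [([1, 2, 3], [1, 2]), ([2, 3, 1], [2, 1]), ([1, 2, 3], [1])]
    print(rank_lagging_brokers(observed))  # broker 3 is missing from all three ISRs
    print(classify_isr_trend(shrinks_per_sec=0.0, expands_per_sec=0.4))
```

A broker that tops the ranking by a wide margin is the one to inspect in step 5; an even spread across brokers points toward the cascading case instead.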

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | Direct confirmation that acks=all writes are being rejected | > 0 for more than 2 minutes outside maintenance |
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Leading indicator that replicas are lagging before min ISR is breached | Nonzero and growing across brokers |
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Velocity of replicas leaving ISR; sustained shrinks mean active degradation | OneMinuteRate > 0 for more than 5 minutes |
| kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount | ISR equals min.insync.replicas; one more loss triggers exceptions | > 0 in steady state |
| kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec | Producer-visible error rate including NotEnoughReplicas | Sustained nonzero OneMinuteRate |
| kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce | Time the leader spends waiting for follower acks | p99 spiking for acks=all workloads |
| kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce | Requests stuck waiting for ISR acknowledgments | Sustained growth above 2x baseline |
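
Several of the warning signs above are duration conditions ("> 0 for more than 2 minutes"). One way to evaluate them over scraped samples is sketched below. The sample format, a time-ordered list of (unix_seconds, value) pairs, and the function name sustained_above are assumptions; a monitoring system's own alert syntax would normally replace this.

```python
def sustained_above(samples, threshold, min_duration_s):
    """Return True if the latest run of samples above `threshold` spans more than `min_duration_s`.

    `samples` is a list of (unix_seconds, value) pairs in ascending time order.
    Only the unbroken run ending at the most recent sample is considered.
    """
    if not samples:
        return False
    end_ts = samples[-1][0]
    run_start_ts = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break  # the run of breaching samples ends here
        run_start_ts = ts
    return run_start_ts is not None and end_ts - run_start_ts > min_duration_s


if __name__ == "__main__":
    # UnderMinIsrPartitionCount scraped every 30 seconds (illustrative values).
    under_min_isr = [(0, 0), (30, 2), (60, 2), (90, 2), (120, 2), (150, 2), (180, 2)]
    print(sustained_above(under_min_isr, threshold=0, min_duration_s=120))
```

The same function covers the IsrShrinksPerSec row by passing OneMinuteRate samples with min_duration_s=300.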

Fixes

Recover the lagging follower

If one broker is the common denominator in under-replicated partitions, inspect its resources. For disk saturation, plan a reassignment with kafka-reassign-partitions.sh or add capacity. For GC pauses, tune heap or investigate leaks. For network issues, fix connectivity or reduce colocated traffic. Do not restart the lagging follower as a first response: a restart takes its replicas fully offline while the broker reloads its logs and catches up again, and a host reboot also empties the page cache, both of which extend the recovery window.

Reduce min.insync.replicas temporarily

If you must restore write availability before the follower recovers, lower min.insync.replicas at the topic or broker level. This is a destructive durability tradeoff. Writes will succeed with fewer replicas, and if the leader fails before the follower catches up, you risk data loss or an offline partition. Document the override and revert it as soon as the ISR recovers.
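
To make the revert hard to forget, the override and its rollback can be generated together. The sketch below only builds kafka-configs.sh command strings and runs nothing; the helper name, the default bootstrap address, and the example topic are assumptions. When the topic had no explicit value before, the revert deletes the topic-level override so the broker default applies again.

```python
import shlex


def min_isr_override_commands(topic, temporary_value, previous_value=None,
                              bootstrap="localhost:9092"):
    """Return (apply_command, revert_command) strings for a temporary topic override.

    `previous_value` is the topic's explicit min.insync.replicas before the change,
    or None if the topic was inheriting the broker default.
    """
    if temporary_value < 1:
        raise ValueError("min.insync.replicas must be at least 1")
    base = [
        "kafka-configs.sh", "--bootstrap-server", bootstrap,
        "--alter", "--entity-type", "topics", "--entity-name", topic,
    ]
    apply_cmd = base + ["--add-config", f"min.insync.replicas={temporary_value}"]
    if previous_value is None:
        # No explicit topic value before: remove the override entirely.
        revert_cmd = base + ["--delete-config", "min.insync.replicas"]
    else:
        revert_cmd = base + ["--add-config", f"min.insync.replicas={previous_value}"]
    return shlex.join(apply_cmd), shlex.join(revert_cmd)


if __name__ == "__main__":
    apply_cmd, revert_cmd = min_isr_override_commands("orders", 1, previous_value=2)
    print("apply: ", apply_cmd)
    print("revert:", revert_cmd)
```

Recording both strings in the incident notes at the time of the change gives whoever closes the incident the exact rollback.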

Perform a controlled shutdown of a sick broker

If a broker is clearly degraded and dragging down multiple ISRs, a controlled shutdown lets the controller migrate leadership cleanly. Shutting it down also removes its replicas from every ISR they were still part of, so use it only if the remaining ISR still meets min.insync.replicas or you have already lowered the threshold.

Fix topic configuration

If min.insync.replicas is greater than or equal to replication.factor, acks=all cannot tolerate even one missing replica. Reduce min.insync.replicas to at most replication.factor - 1, or increase the replication factor. For replication.factor=3, use min.insync.replicas=2.
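
The rule in this section is easy to check mechanically, for example in a topic-creation review. A minimal sketch, assuming you already have both values as integers; check_min_isr is a hypothetical name, and the warning for min.insync.replicas=1 reflects this article's recommendation rather than a Kafka-enforced limit.

```python
def check_min_isr(replication_factor: int, min_insync_replicas: int) -> list:
    """Return human-readable problems; an empty list means the pair follows the rule."""
    rf, min_isr = replication_factor, min_insync_replicas
    if rf < 1 or min_isr < 1:
        return ["replication.factor and min.insync.replicas must both be at least 1"]
    problems = []
    if min_isr > rf:
        problems.append(
            f"min.insync.replicas={min_isr} > replication.factor={rf}: "
            "the ISR can never be large enough, so acks=all writes always fail"
        )
    elif min_isr == rf:
        problems.append(
            f"min.insync.replicas={min_isr} == replication.factor={rf}: "
            "acks=all writes fail as soon as one replica leaves the ISR"
        )
    elif min_isr == 1:
        problems.append(
            "min.insync.replicas=1: a lone leader can acknowledge acks=all writes"
        )
    return problems


if __name__ == "__main__":
    for rf, min_isr in [(3, 2), (3, 3), (3, 1)]:
        print(rf, min_isr, check_min_isr(rf, min_isr) or "ok")
```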

Prevention

  • Set min.insync.replicas=2 for topics with replication.factor=3 and acks=all producers. The default of 1 lets the leader acknowledge alone and negates the durability intent of acks=all.
  • Monitor AtMinIsrPartitionCount as an early warning when ISR equals the minimum.
  • Monitor follower disk I/O and GC. Alert on disk await above 20 ms for SSDs or 50 ms for HDDs, and on Full GC pauses above 200 ms.
  • Validate topic configs at creation. min.insync.replicas must be less than replication.factor for any topic intended for acks=all.
  • Test failure recovery. Game-day a broker shutdown and measure how long ISR recovery takes. If it exceeds your recovery time objective, tune replica.lag.time.max.ms or add capacity.

How Netdata helps

  • Correlates UnderMinIsrPartitionCount with per-broker disk latency and CPU saturation to identify the lagging follower without manual JMX scraping.
  • Tracks IsrShrinksPerSec and IsrExpandsPerSec on one timeline to expose flapping replicas caused by intermittent GC or network issues.
  • Surfaces FailedProduceRequestsPerSec alongside RequestHandlerAvgIdlePercent to distinguish replication trouble from general broker overload.
  • Alerts on UnderReplicatedPartitions before it escalates to UnderMinIsrPartitionCount.
  • JVM heap and GC dashboards help catch pause-induced ISR shrinks that correlate with NotEnoughReplicasException spikes.