Kafka NotEnoughReplicasException: acks=all writes rejected below min.insync.replicas
Producers are throwing org.apache.kafka.common.errors.NotEnoughReplicasException or NotEnoughReplicasAfterAppendException, and acks=all writes are failing while acks=1 or acks=0 writes may still succeed. The affected partitions no longer have enough in-sync replicas to satisfy min.insync.replicas. The immediate operational question is whether the ISR shrink is a transient recovery blip or a sustained degradation that will block writes until you fix the follower.
What this means
The leader tracks followers that have caught up within replica.lag.time.max.ms in the In-Sync Replica set (ISR). For acks=all, the leader waits for all current ISR members before acknowledging the producer. If the ISR size drops below min.insync.replicas, the leader rejects the produce request. The broker-level default for min.insync.replicas is 1, so a lone leader can acknowledge alone. In practice, with replication.factor=3 and acks=all, set min.insync.replicas=2: writes continue after the loss of one follower, and the loss of a second blocks writes instead of silently weakening durability.
NotEnoughReplicasException is thrown before the record is appended. NotEnoughReplicasAfterAppendException means the leader found the ISR below min.insync.replicas after it had already appended the record. The latter matters because the record is already in the leader log: a retry from a non-idempotent producer can write a duplicate, whereas a retry from an idempotent producer is deduplicated by the broker.
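To make the threshold concrete, here is a minimal Python sketch of the leader-side check described above. It is a toy model under stated assumptions, not broker code: the function name produce_outcome and its string return values are invented for illustration.

```python
def produce_outcome(acks: str, isr_size: int, min_isr: int) -> str:
    """Toy model of the leader-side check described above (not broker code)."""
    # min.insync.replicas is only enforced for acks=all (also written acks=-1).
    if acks in ("all", "-1") and isr_size < min_isr:
        return "NotEnoughReplicasException"
    return "accepted"


# replication.factor=3, min.insync.replicas=2: walk the ISR down from 3 to 1
# and compare an acks=all producer with an acks=1 producer.
for isr_size in (3, 2, 1):
    print(isr_size, produce_outcome("all", isr_size, 2), produce_outcome("1", isr_size, 2))
```

In this model the acks=all producer is still accepted with two in-sync replicas and is rejected only when the leader is alone in the ISR, while the acks=1 producer is accepted throughout.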
flowchart TD
A[Broker disk slows or GC pauses] --> B[Follower fetch lag grows]
B --> C[Lag exceeds replica.lag.time.max.ms]
C --> D[ISR shrinks]
D --> E[UnderReplicatedPartitions rises]
D --> F{ISR < min.insync.replicas?}
F -->|No| G[acks=all writes succeed]
F -->|Yes| H[Leader rejects produce]
H --> I[NotEnoughReplicasException]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Slow follower (disk I/O, GC, network saturation) | IsrShrinksPerSec spikes on the leader; UnderReplicatedPartitions rises; follower disk await elevated | iostat -xz 1 on the follower and Replica Max Lag |
| Transient broker restart or recovery | ISR shrinks are clustered around one broker; broker uptime low; shrinks resolve within 1-2 times replica.lag.time.max.ms | Broker uptime and IsrExpandsPerSec |
| min.insync.replicas misconfigured relative to replication factor | AtMinIsrPartitionCount above zero even in steady state; topic config shows min.insync.replicas is close to or equal to replication.factor | kafka-configs.sh --describe --all for topic configs |
| Network partition between leader and follower | ISR drops one follower suddenly; follower appears healthy locally with no disk or GC issues | Network connectivity and FetchFollower latency from the leader |
| Cascading ISR shrink from multiple broker stress | UnderReplicatedPartitions rises across many leaders; RequestHandlerAvgIdlePercent drops on multiple brokers | Cluster-wide disk and network saturation signals |
Quick checks
# Partitions currently under min ISR
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions
# Under-replicated partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# UnderMinIsrPartitionCount via JMX
echo "get -b kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount Value" | java -jar jmxterm.jar -l localhost:9999
# ISR shrink and expand velocity
echo "get -b kafka.server:type=ReplicaManager,name=IsrShrinksPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.server:type=ReplicaManager,name=IsrExpandsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Disk latency on the suspected follower
iostat -xz 1
# Topic config: compare min.insync.replicas to replication.factor
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type topics --entity-name <topic> --all
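The shrink and expand rates read above are easier to act on when interpreted together. The following Python sketch encodes the rule of thumb this guide uses (expands without shrinks means recovery, both at once means flapping). The function name, the extra labels "degrading" and "stable", and the floor cutoff are assumptions for illustration; OneMinuteRate decays gradually, so some small cutoff is needed before a rate can be treated as stopped.

```python
def classify_isr_trend(shrinks_per_sec: float, expands_per_sec: float,
                       floor: float = 0.01) -> str:
    """Interpret IsrShrinksPerSec / IsrExpandsPerSec OneMinuteRate values.

    `floor` is an assumed cutoff below which a decaying rate counts as stopped.
    """
    shrinking = shrinks_per_sec > floor
    expanding = expands_per_sec > floor
    if shrinking and expanding:
        return "flapping"      # replicas leave and rejoin repeatedly
    if expanding:
        return "recovering"    # followers are rejoining, shrinks have stopped
    if shrinking:
        return "degrading"     # replicas are still dropping out
    return "stable"


print(classify_isr_trend(0.0, 0.4))   # recovering
print(classify_isr_trend(0.3, 0.3))   # flapping
```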
How to diagnose it
- Confirm the blast radius. Use UnderMinIsrPartitionCount and kafka-topics.sh --describe --under-min-isr-partitions to identify which partitions are rejecting writes. A nonzero value means the error is cluster-side for those partitions, not a client issue.
- Map partitions to leaders. UnderReplicatedPartitions on leader brokers shows which brokers are leading the degraded partitions.
- Read the ISR composition. If the ISR is [Leader] for an RF=3 partition, two followers are missing. If it is [Leader, Follower1], one is missing. This tells you how many replicas must recover.
- Find the common lagging follower. Cross-reference UnderReplicatedPartitions across all brokers to identify which follower appears most often in under-replicated sets. Inter-broker replication metrics are reported on the leader, so correlate by elimination.
- Inspect follower health. On the lagging broker, check disk await, GC pause logs, and NetworkProcessorAvgIdlePercent. High FetchFollower latency on the leader also points to a slow follower.
- Determine if the condition is transient. If IsrExpandsPerSec is rising and IsrShrinksPerSec has stopped, the broker is recovering. If both are sustained, you have ISR flapping caused by an intermittent resource issue.
- Validate configuration. Ensure min.insync.replicas is not equal to or greater than replication.factor for topics receiving acks=all.
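The "find the common lagging follower" step can be sketched in Python by counting which broker ids are listed under Replicas but absent from Isr in kafka-topics.sh --describe output. This is a rough sketch, not a supported tool: the sample text is made up, the helper names are hypothetical, and the parser assumes each partition line carries "Replicas: ..." and "Isr: ..." fields, which may be laid out differently in your Kafka version.

```python
import re
from collections import Counter

# Match the "Replicas: 1,2,3" and "Isr: 1,2" fields of a partition line.
_REPLICAS = re.compile(r"\bReplicas:\s*([\d,]+)")
_ISR = re.compile(r"\bIsr:\s*([\d,]*)")


def _ids(csv: str) -> set:
    """Turn "1,2,3" into {1, 2, 3}; an empty string gives an empty set."""
    return {int(x) for x in csv.split(",") if x}


def missing_from_isr(describe_output: str) -> Counter:
    """Count, per broker id, the partitions where it is a replica but not in the ISR."""
    counts = Counter()
    for line in describe_output.splitlines():
        replicas, isr = _REPLICAS.search(line), _ISR.search(line)
        if not (replicas and isr):
            continue  # header or unrelated line
        counts.update(_ids(replicas.group(1)) - _ids(isr.group(1)))
    return counts


# Illustrative output only; real lines come from
# kafka-topics.sh --describe --under-replicated-partitions
sample = """\
Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2
Topic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1
Topic: orders\tPartition: 2\tLeader: 1\tReplicas: 1,3,2\tIsr: 1
"""
print(missing_from_isr(sample).most_common())  # [(3, 3), (2, 1)]
```

Here broker 3 is missing from every ISR, which makes it the first follower to inspect.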
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | Direct confirmation that acks=all writes are being rejected | > 0 for more than 2 minutes outside maintenance |
| kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | Leading indicator that replicas are lagging before min ISR is breached | Nonzero and growing across brokers |
| kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | Velocity of replicas leaving ISR; sustained shrinks mean active degradation | OneMinuteRate > 0 for more than 5 minutes |
| kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount | ISR equals min.insync.replicas; one more loss triggers exceptions | > 0 in steady state |
| kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec | Producer-visible error rate including NotEnoughReplicas | Sustained nonzero OneMinuteRate |
| kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce | Time the leader spends waiting for follower acks | p99 spiking for acks=all workloads |
| kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce | Requests stuck waiting for ISR acknowledgments | Sustained growth above 2x baseline |
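Several warning signs in the table are duration-based, such as "> 0 for more than 2 minutes". A minimal Python sketch of that kind of check over scraped samples follows. The (timestamp_seconds, value) sample format and the function name are assumptions, and a real monitoring system would normally express this in its own alert rule language.

```python
from typing import Sequence, Tuple


def sustained_above_zero(samples: Sequence[Tuple[float, float]],
                         min_duration_s: float) -> bool:
    """True if the latest unbroken run of values > 0 has lasted longer than min_duration_s.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    """
    run_start = None
    for ts, value in samples:
        if value > 0:
            if run_start is None:
                run_start = ts   # a new nonzero run begins here
        else:
            run_start = None     # a zero sample resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration_s


# UnderMinIsrPartitionCount scraped every 30 s (made-up values).
history = [(0, 0), (30, 1), (60, 1), (90, 2), (120, 2), (150, 2), (180, 2)]
print(sustained_above_zero(history, 120))  # True: nonzero from t=30 to t=180
```

Resetting the run on any zero sample keeps a brief maintenance blip from accumulating toward the alert threshold.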
Fixes
Recover the lagging follower
If one broker is the common denominator in under-replicated partitions, inspect its resources. For disk saturation, plan a reassignment with kafka-reassign-partitions.sh or add capacity. For GC pauses, tune heap or investigate leaks. For network issues, fix connectivity or reduce colocated traffic. Do not restart the lagging follower as a first response: a restart will empty its page cache and extend the recovery window.
Reduce min.insync.replicas temporarily
If you must restore write availability before the follower recovers, lower min.insync.replicas at the topic or broker level. This is a destructive durability tradeoff. Writes will succeed with fewer replicas, and if the leader fails before the follower catches up, you risk data loss or an offline partition. Document the override and revert it as soon as the ISR recovers.
Perform a controlled shutdown of a sick broker
If a broker is clearly degraded and dragging down multiple ISRs, a controlled shutdown lets the controller migrate leadership cleanly. This reduces follower count, so use it only if the remaining ISR still meets min.insync.replicas or you have already lowered the threshold.
Fix topic configuration
If min.insync.replicas is greater than or equal to replication.factor, acks=all cannot tolerate even one missing replica. Reduce min.insync.replicas to at most replication.factor - 1, or increase the replication factor. For replication.factor=3, use min.insync.replicas=2.
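The arithmetic behind this rule is simply replication.factor minus min.insync.replicas. A small Python sketch, with a hypothetical function name, shows how many replicas can drop out of the ISR before acks=all writes are rejected. Rejecting out-of-range values is a choice made in this sketch, not something this guide claims Kafka enforces.

```python
def tolerable_replica_losses(replication_factor: int, min_isr: int) -> int:
    """Replicas that can leave the ISR before acks=all writes are rejected."""
    if not 1 <= min_isr <= replication_factor:
        # Sketch-level guard: such a topic could never satisfy acks=all.
        raise ValueError("min.insync.replicas must be between 1 and replication.factor")
    return replication_factor - min_isr


print(tolerable_replica_losses(3, 2))  # 1: one follower may be lost
print(tolerable_replica_losses(3, 3))  # 0: any missing replica blocks writes
```

A result of 0 is the misconfiguration described above; a topic-creation check could reject it for any topic that receives acks=all traffic.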
Prevention
- Set min.insync.replicas=2 for topics with replication.factor=3 and acks=all producers. The default of 1 lets the leader acknowledge alone and negates the durability intent of acks=all.
- Monitor AtMinIsrPartitionCount as an early warning when ISR equals the minimum.
- Monitor follower disk I/O and GC. Alert on disk await above 20 ms for SSDs or 50 ms for HDDs, and on Full GC pauses above 200 ms.
- Validate topic configs at creation. min.insync.replicas must be less than replication.factor for any topic intended for acks=all.
- Test failure recovery. Game-day a broker shutdown and measure how long ISR recovery takes. If it exceeds your recovery time objective, tune replica.lag.time.max.ms or add capacity.
How Netdata helps
- Correlates UnderMinIsrPartitionCount with per-broker disk latency and CPU saturation to identify the lagging follower without manual JMX scraping.
- Tracks IsrShrinksPerSec and IsrExpandsPerSec on one timeline to expose flapping replicas caused by intermittent GC or network issues.
- Surfaces FailedProduceRequestsPerSec alongside RequestHandlerAvgIdlePercent to distinguish replication trouble from general broker overload.
- Alerts on UnderReplicatedPartitions before it escalates to UnderMinIsrPartitionCount.
- JVM heap and GC dashboards help catch pause-induced ISR shrinks that correlate with NotEnoughReplicasException spikes.







