Kafka min.insync.replicas and acks: configuring durability you actually have

Most operators set producers to acks=all and assume the cluster acks only when every replica has the message. It does not. With acks=all, the leader waits only for the current in-sync replica set (ISR). Because the ISR shrinks dynamically when followers lag, a partition with replication factor three can have an ISR of one: the leader itself. Unless you raise min.insync.replicas from its default of 1, the leader then acks with zero followers caught up. Your durability guarantee collapses to leader-only persistence, and you find out only when the leader dies and data is missing.

This guide covers how acks and min.insync.replicas interact, why the defaults create a silent durability gap, and how to configure the pair so you get the fault tolerance you expect without turning a single broker restart into a write outage.

What it is and why it matters

acks lives on the producer. min.insync.replicas lives on the broker or topic. Set in different places, they jointly define how many replicas must persist a message before the producer considers it sent.

A partition has one leader and replication.factor - 1 followers. The leader maintains the ISR: followers caught up within replica.lag.time.max.ms. A follower that lags because of disk saturation, GC, or a network blip gets dropped; when it catches up, it rejoins. The ISR size changes independently of the replication factor.
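The membership rule can be sketched as a toy model. This is not broker code: the Follower class and current_isr function are hypothetical names, and the 30-second threshold is an assumed value for replica.lag.time.max.ms, so check your own broker configuration.

```python
from dataclasses import dataclass

REPLICA_LAG_TIME_MAX_MS = 30_000  # assumed value; check your broker config


@dataclass
class Follower:
    broker_id: int
    last_caught_up_ms: int  # last time this follower was fully caught up to the leader


def current_isr(leader_id: int, followers: list[Follower], now_ms: int) -> list[int]:
    """Return the broker ids in the ISR: the leader plus recently caught-up followers."""
    isr = [leader_id]
    for f in followers:
        if now_ms - f.last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS:
            isr.append(f.broker_id)
    return isr


followers = [
    Follower(broker_id=2, last_caught_up_ms=95_000),  # 5 s behind
    Follower(broker_id=3, last_caught_up_ms=40_000),  # 60 s behind
]
isr = current_isr(leader_id=1, followers=followers, now_ms=100_000)
print(isr)  # [1, 2]
```

The replication factor is still three here, but only two replicas count toward an acks=all write.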

The producer’s acks setting controls how many replicas must have the write before the leader responds:

  • acks=0: The producer sends and forgets. No broker acknowledgment. If the leader crashes, the message is lost.
  • acks=1: The leader writes to its local log and responds immediately. Followers replicate asynchronously. If the leader crashes before the followers catch up, the message is lost.
  • acks=all: The leader waits for acknowledgment from every broker currently in the ISR.

The gap is in the third case. Because the ISR can shrink to just the leader, acks=all without an adequate min.insync.replicas lets the leader ack a write even when no follower has it. The write sits on one broker, unreplicated, yet the producer gets a success response. This is the silent durability trap.

min.insync.replicas closes the gap. It sets the minimum ISR size the broker requires to accept a write with acks=all. If the current ISR is smaller, the broker returns NotEnoughReplicasException and rejects the produce request. The broker enforces this floor only for acks=all. A producer using acks=1 bypasses the check entirely; the broker acks after its local write regardless of follower state.
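The interaction between the two settings can be written down as a small decision function. This is a simplified model of the behavior described above, not the broker's implementation; handle_produce and NotEnoughReplicas are hypothetical names used only for illustration.

```python
class NotEnoughReplicas(Exception):
    """Stand-in for the broker's NotEnoughReplicasException."""


def handle_produce(acks: str, isr_size: int, min_insync_replicas: int = 1) -> str:
    """Toy model of how a partition leader answers a produce request."""
    if acks == "0":
        return "no response expected"
    if acks == "1":
        # min.insync.replicas is not consulted for acks=1.
        return "acked after leader write"
    if acks == "all":
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR size {isr_size} < min.insync.replicas {min_insync_replicas}"
            )
        return f"acked with {isr_size} in-sync replica(s) holding the write"
    raise ValueError(f"unknown acks value: {acks!r}")


# Default min.insync.replicas=1 with an ISR of one: success, but leader-only.
print(handle_produce("all", isr_size=1))

# Same ISR with min.insync.replicas=2: the write is rejected instead.
try:
    handle_produce("all", isr_size=1, min_insync_replicas=2)
except NotEnoughReplicas as exc:
    print("rejected:", exc)
```

Note that the acks=1 branch never looks at min_insync_replicas, which mirrors the point that the floor applies only to acks=all.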

How it works

When a producer sends a batch with acks=all, a network thread hands it to an I/O thread, which appends the records to the active log segment. The request then enters the purgatory, a delayed-operation timer wheel, where it waits until every follower in the current ISR has fetched the records or the request times out. Followers do not push acknowledgments; the leader infers their progress from the offsets in their fetch requests.

flowchart TD
    P[Producer sends<br/>acks=all]
    L[Leader appends to log]
    ISR{ISR size >=<br/>min.insync.replicas?}
    ACK[Broker acks<br/>producer]
    NER[NotEnoughReplicasException]
    BG[Background replication<br/>to out-of-sync followers]
    P --> L
    L --> ISR
    ISR -->|Yes| ACK
    ISR -->|No| NER
    L --> BG

If the leader has two followers in the ISR and min.insync.replicas is two, the write proceeds after both followers acknowledge. If one follower is dropped from the ISR, the required acks shrink to the leader plus the remaining follower. The write still proceeds because the ISR still meets the minimum.

If the second follower also falls out, the ISR drops to one. With min.insync.replicas=2, the broker refuses the write and returns NotEnoughReplicasException; the producer retries. With the default min.insync.replicas=1, the leader acks the write alone.
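The walkthrough above can be replayed as a short simulation. It is a sketch of the described behavior only; copies_at_ack is a hypothetical helper, and the timeline of ISR sizes is made up to match the RF=3 example.

```python
from typing import Optional


def copies_at_ack(isr_size: int, min_insync_replicas: int) -> Optional[int]:
    """Replicas holding an acks=all write when it is acked, or None if rejected."""
    if isr_size < min_insync_replicas:
        return None  # the broker answers with NotEnoughReplicasException
    return isr_size


# RF=3 partition: healthy, one follower dropped, both followers dropped.
timeline = [3, 2, 1]
for min_isr in (2, 1):
    results = [copies_at_ack(isr, min_isr) for isr in timeline]
    print(f"min.insync.replicas={min_isr}: {results}")
```

With a minimum of two, the last step yields None (a rejected write); with the default of one, it yields 1, meaning the write was acked with a single copy.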

Background replication continues regardless. An out-of-ISR follower keeps fetching from the leader and rejoins the ISR once caught up. A temporary shrink does not permanently lose data; it narrows the durability window until the follower recovers. Even when only two replicas ack a write, the third continues to catch up and will eventually become consistent.

With replication.factor=N and min.insync.replicas=M, the cluster tolerates N - M broker failures and still accepts acks=all writes. For RF=3:

  • min.insync.replicas=1 tolerates two failures but a single produce may be acked by the leader alone.
  • min.insync.replicas=2 tolerates one failure and guarantees at least one follower has the data before the producer gets an ack.
  • min.insync.replicas=3 tolerates zero failures; any outage or lag event blocks writes.
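The N - M arithmetic is simple enough to check in a few lines. The function names below are illustrative only.

```python
def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Broker failures a partition can absorb while still accepting acks=all writes."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


def guaranteed_follower_copies(min_insync_replicas: int) -> int:
    """Followers guaranteed to hold a write before an acks=all producer is acked."""
    return min_insync_replicas - 1


for m in (1, 2, 3):
    print(
        f"RF=3, min.insync.replicas={m}: "
        f"tolerates {tolerated_failures(3, m)} failure(s), "
        f"{guaranteed_follower_copies(m)} follower copy/copies guaranteed at ack"
    )
```

Reading the two numbers side by side makes the tradeoff explicit: every step up in guaranteed follower copies costs one tolerated failure.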

Where it shows up in production

Rolling restarts. Restarting a broker removes the replicas it hosts from their partitions’ ISRs. For a fully in-sync RF=3 partition, the ISR shrinks from three to two. With min.insync.replicas=2, writes continue. With min.insync.replicas=3, the restart immediately blocks all acks=all producers for those partitions until the broker returns and catches up. min.insync.replicas=RF is a write-blocking footgun during normal maintenance.
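A pre-restart check along these lines can flag the partitions a restart would block. This is a sketch under assumptions: PartitionState and partitions_blocked_by_restart are hypothetical names, and the sample data is invented; in practice the ISR and minimum would come from whatever tool you use to describe topics.

```python
from typing import NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    isr: frozenset  # broker ids currently in sync
    min_insync_replicas: int


def partitions_blocked_by_restart(
    broker_id: int, partitions: list[PartitionState]
) -> list[PartitionState]:
    """Partitions whose ISR would fall below min.insync.replicas if broker_id left it."""
    blocked = []
    for p in partitions:
        remaining = len(p.isr - {broker_id})
        if remaining < p.min_insync_replicas:
            blocked.append(p)
    return blocked


# Invented sample state for illustration.
state = [
    PartitionState("orders", 0, frozenset({1, 2, 3}), 2),    # healthy, min 2
    PartitionState("orders", 1, frozenset({1, 2}), 2),       # already degraded
    PartitionState("payments", 0, frozenset({1, 2, 3}), 3),  # min equals RF
]
blocked = partitions_blocked_by_restart(2, state)
print([f"{p.topic}-{p.partition}" for p in blocked])  # ['orders-1', 'payments-0']
```

The already-degraded partition is the easy one to miss: the setting is sane, but the ISR has no headroom left at the moment of the restart.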

Follower disk degradation. A follower with a failing disk or saturated I/O cannot fetch fast enough to stay within replica.lag.time.max.ms. The leader drops it. With min.insync.replicas=1, the cluster keeps accepting writes with only the leader. New data now has a single point of failure, and you will not discover the exposure until the leader fails.

Correlated network blips or GC pauses. A follower is temporarily partitioned from the leader and drops from the ISR. If a second follower suffers a long enough GC pause to fall out of sync, the ISR drops to one. With min.insync.replicas=2, the broker rejects writes rather than accepting them unreplicated. With min.insync.replicas=1, the broker acks them, leaving you exposed to data loss on a single node failure.

Producer retry cascades. NotEnoughReplicasException is retriable, and the Java producer retries by default. When the ISR stays below the minimum, producers loop. Request volume rises while successful throughput falls, potentially saturating the request queues of the remaining brokers. This positive feedback loop can turn a single slow follower into cluster-wide request saturation.
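A rough upper bound on the amplification is easy to estimate. The sketch below assumes a fixed retry backoff and a broker that rejects each attempt immediately; clients that back off exponentially will send fewer attempts. The function name and the 5 ms round trip are assumptions, while 120000 ms and 100 ms are the Java producer defaults for delivery.timeout.ms and retry.backoff.ms.

```python
def max_attempts(delivery_timeout_ms: int, retry_backoff_ms: int, round_trip_ms: int) -> int:
    """Rough upper bound on send attempts for one batch that is rejected every time."""
    per_attempt_ms = retry_backoff_ms + round_trip_ms
    return max(1, delivery_timeout_ms // per_attempt_ms)


# Default-like settings: 2 minute delivery timeout, 100 ms fixed backoff.
print(max_attempts(delivery_timeout_ms=120_000, retry_backoff_ms=100, round_trip_ms=5))

# A longer backoff cuts the request load on the surviving brokers.
print(max_attempts(delivery_timeout_ms=120_000, retry_backoff_ms=1_000, round_trip_ms=5))
```

Under these assumptions a single stuck batch can turn into on the order of a thousand produce requests, which is why a longer backoff is worth considering for producers that write to partitions with a strict minimum.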

Rack-aware replication. In rack-aware clusters, a rack failure should shrink the ISR but not lose availability. With RF=3 across three racks and min.insync.replicas=2, losing one rack leaves two in sync and writes continue. With min.insync.replicas=3, a rack failure blocks writes even though the cluster is designed to survive it.

Common misuses and tradeoffs

  • Leaving min.insync.replicas at the default of 1. The leader can ack with zero followers in the ISR, so acks=all silently degrades to leader-only durability when followers lag.

  • Setting min.insync.replicas equal to replication.factor. Any single broker outage or transient follower timeout causes NotEnoughReplicasException and blocks writes. This trades availability for theoretical durability and should be used only when data loss is categorically unacceptable and downtime is acceptable.

  • Monitoring total replica counts instead of ISR size. A partition may show three replicas while only one is in the ISR. The full replica set does not determine acknowledgment behavior; only the ISR does.

  • Assuming acks=all means the full replica set. Out-of-sync followers still replicate in the background, but they do not count toward the ack. The producer receives an ack as soon as the current ISR members respond.

  • Ignoring silent producer retries. Because NotEnoughReplicasException is retriable, producers keep retrying without surfacing an error until delivery.timeout.ms expires. Until then, the symptom is not a client exception but rising broker FailedProduceRequestsPerSec and flat throughput.
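Several of these misuses can be caught by a simple audit of each partition's settings and current state. The following is a minimal sketch; audit is a hypothetical function and the inputs are assumed to come from your own tooling.

```python
def audit(replication_factor: int, min_insync_replicas: int, isr_size: int) -> list[str]:
    """Return human-readable findings for one partition."""
    findings = []
    if replication_factor > 1 and min_insync_replicas == 1:
        findings.append("min.insync.replicas=1: acks=all can be satisfied by the leader alone")
    if replication_factor > 1 and min_insync_replicas == replication_factor:
        findings.append("min.insync.replicas equals replication factor: any replica outage blocks writes")
    if isr_size < replication_factor:
        findings.append(f"ISR size {isr_size} is below replica count {replication_factor}")
    if isr_size < min_insync_replicas:
        findings.append("ISR is below min.insync.replicas: acks=all writes are being rejected")
    return findings


# Three replicas on paper, one in sync, default minimum: two findings.
for finding in audit(replication_factor=3, min_insync_replicas=1, isr_size=1):
    print(finding)

# RF=3 with a minimum of two and a full ISR: nothing to report.
print(audit(replication_factor=3, min_insync_replicas=2, isr_size=3))  # []
```

The check compares ISR size, not replica count, against the minimum, which is the distinction the third item in the list turns on.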

Signals to watch in production

Signal                    | Why it matters                                        | Warning sign
UnderMinIsrPartitionCount | Counts partitions currently rejecting acks=all writes | Nonzero outside maintenance windows
IsrShrinksPerSec          | Rate at which replicas leave the ISR                  |