Kafka FailedProduceRequestsPerSec rising: the single best ‘producers are hurting’ signal

A rising FailedProduceRequestsPerSec is broker-side confirmation that producers are actively being rejected. Unlike client-side timeouts caused by network blips or producer memory pressure, it increments only when the broker receives a produce request and fails to complete it. The MBean kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec rolls up server-side produce failures, including NotEnoughReplicasException, NotLeaderOrFollowerException, and CorruptRecordException, into a single rate. A sustained nonzero rate means data is not being accepted, and the window for data loss or unavailability is open.

Because it aggregates multiple error types, FailedProduceRequestsPerSec is the best single signal that producers are hurting, but it is not the only one to watch. Correlate it with UnderMinIsrPartitionCount to confirm cluster-wide write rejection for acks=all, with OfflinePartitionsCount to detect total unavailability, and with UnderReplicatedPartitions to spot durability degradation before writes are fully blocked. Treat this metric as impact confirmation, not root cause.

Its companion, FailedFetchRequestsPerSec, plays the same role for consumers and follower replicas. When you see one, check the other. If both are elevated, the broker is likely saturated or transitioning leadership. If only produce failures are elevated, the problem is usually in the write path: replication, ISR policy, or disk.

flowchart TD
    A[FailedProduce rising] --> B{UnderMinIsr > 0?}
    B -->|Yes| C[ISR shrink: check follower disk and network]
    B -->|No| D{OfflinePartitions > 0?}
    D -->|Yes| E[Leaderless: check controller health]
    D -->|No| F{HandlerIdle < 0.3?}
    F -->|Yes| G[Saturated: check queue and disk]
    F -->|No| H[Check election rate and broker logs]
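
The flowchart can be read as a short function that tests the branches in the same order. The following is a minimal sketch, not a production alerting rule: the BrokerSnapshot class and its field names are illustrative, and it assumes you have already collected the three metric values by whatever means you use.

```python
from dataclasses import dataclass


@dataclass
class BrokerSnapshot:
    """Point-in-time metric values; the field names are illustrative."""
    under_min_isr: int       # UnderMinIsrPartitionCount
    offline_partitions: int  # OfflinePartitionsCount
    handler_idle: float      # RequestHandlerAvgIdlePercent, 0.0 to 1.0


def triage(s: BrokerSnapshot, idle_floor: float = 0.3) -> str:
    """Walk the same branches as the flowchart, in the same order."""
    if s.under_min_isr > 0:
        return "ISR shrink: check follower disk and network"
    if s.offline_partitions > 0:
        return "Leaderless: check controller health"
    if s.handler_idle < idle_floor:
        return "Saturated: check queue and disk"
    return "Check election rate and broker logs"
```

The order matters: replication problems are checked before saturation because a shrinking ISR explains rejected acks=all writes even on a broker with plenty of idle handler capacity.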

What this means

FailedProduceRequestsPerSec is a rate meter exposed per broker under kafka.server:type=BrokerTopicMetrics. Omit the topic tag for the all-topic aggregate. It increments each time the broker rejects or fails to complete a produce request server-side. Common triggers:

  • NOT_ENOUGH_REPLICAS: The ISR is smaller than min.insync.replicas, so the broker refuses acks=all writes.
  • NOT_LEADER_OR_FOLLOWER: The request arrived at a broker that is not the current leader.
  • CORRUPT_RECORD: The record batch failed checksum or deserialization on the broker side.
  • Other fatal broker-side errors that prevent completion.

This metric is purely broker-side. Client-side timeouts, such as a producer blocking on buffer.memory until max.block.ms, do not increment it. If producers are failing but this metric is zero, check client logs for TimeoutException and network connectivity.
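
The broker-side versus client-side split can be sketched as a two-input check. This is an illustration under assumptions: locate_failure is a hypothetical helper, and client_error_rate stands for whatever error rate your producers report themselves (for example, the Java producer's record-error-rate metric).

```python
def locate_failure(broker_failed_rate: float, client_error_rate: float) -> str:
    """Decide where to look first, given two rates in events per second."""
    if broker_failed_rate > 0:
        # The broker saw the request and failed it: start with broker metrics.
        return "broker"
    if client_error_rate > 0:
        # Producers report errors the broker never counted: start with
        # client logs (TimeoutException) and network connectivity.
        return "client"
    return "none"
```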

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| ISR shrink below min.insync.replicas | Sustained FailedProduce rate correlated with UnderMinIsrPartitionCount > 0 and UnderReplicatedPartitions rising | UnderMinIsrPartitionCount and per-broker UnderReplicatedPartitions |
| Partition leader offline | FailedProduce spike with OfflinePartitionsCount > 0; affected partitions are completely unwritable | OfflinePartitionsCount and ActiveControllerCount |
| Broker request handler saturation | FailedProduce rising alongside RequestHandlerAvgIdlePercent dropping below 0.3 and RequestQueueSize growing | RequestHandlerAvgIdlePercent and RequestQueueSize |
| Disk I/O degradation | Produce LocalTimeMs elevated, disk await high, and follower replication lagging | iostat -xz 1 and produce LocalTimeMs breakdown |
| Leadership transition or corrupt records | Brief or spiky FailedProduce while replication and saturation metrics are clean | LeaderElectionRateAndTimeMs and broker logs |

Quick checks

# Check broker-side failed produce rate
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check partitions actively rejecting acks=all writes
echo "get -b kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check partitions with no active leader
echo "get -b kafka.controller:type=KafkaController,name=OfflinePartitionsCount Value" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# List leaderless partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check broker request handler saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check disk I/O latency on the broker
iostat -xz 1

How to diagnose it

  1. Confirm the signal is sustained. Brief spikes during rolling restarts or preferred replica elections are normal. A sustained nonzero rate outside maintenance is abnormal.
  2. Check cluster-wide impact. Query UnderMinIsrPartitionCount and OfflinePartitionsCount. If either is nonzero, the cluster is actively rejecting writes or partitions are leaderless.
  3. If UnderMinIsrPartitionCount > 0, trace the ISR shrink. Cross-reference UnderReplicatedPartitions across all brokers to find which follower is the common denominator. Check that follower’s disk I/O and network metrics.
  4. If OfflinePartitionsCount > 0, verify controller health. ActiveControllerCount must sum to exactly 1 across the cluster. If the controller is healthy but partitions remain offline, all replicas for those partitions may be down.
  5. If replication metrics are clean, look for saturation. Check RequestHandlerAvgIdlePercent. If it is below 0.3, break down TotalTimeMs for Produce into RequestQueueTimeMs (thread starvation), LocalTimeMs (disk), and RemoteTimeMs (follower lag for acks=all).
  6. For acks=all with high RemoteTimeMs, check follower health. If follower fetch latency is spiking and IsrShrinksPerSec is active, a follower is too slow to acknowledge.
  7. If the metric is spiky but everything else looks healthy, check broker logs. Look for CorruptRecordException or repeated NotLeaderOrFollowerException errors that might indicate a stuck metadata view or damaged log segments.
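
Step 3, finding the follower that is the common denominator, amounts to counting which broker is most often assigned as a replica but missing from the ISR. A rough sketch, assuming you have already extracted the Replicas and Isr lists for each under-replicated partition (for example from the kafka-topics.sh --describe output); the function name and input shape are illustrative:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def lagging_brokers(
    partitions: Iterable[Tuple[List[int], List[int]]]
) -> List[Tuple[int, int]]:
    """Count how often each broker is an assigned replica but absent from the ISR.

    Each item is (replicas, isr) for one under-replicated partition.
    Returns (broker_id, missing_count) pairs, most frequent first.
    """
    missing = Counter()
    for replicas, isr in partitions:
        in_sync = set(isr)
        for broker in replicas:
            if broker not in in_sync:
                missing[broker] += 1
    return missing.most_common()


# Hypothetical data: broker 3 is out of sync on three partitions, broker 4 on one.
urp = [([1, 2, 3], [1, 2]), ([2, 3, 1], [2, 1]), ([3, 1, 2], [1, 2]), ([1, 2, 4], [1, 2])]
print(lagging_brokers(urp))  # [(3, 3), (4, 1)]
```

A single broker dominating the count points at that broker's disk or network. An even spread across brokers points at something shared instead, such as a saturated leader or a network segment.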

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| FailedProduceRequestsPerSec | Direct measure of producer-visible broker rejections | Any sustained nonzero rate outside maintenance |
| UnderMinIsrPartitionCount | Confirms writes are actively rejected for acks=all | > 0 sustained for more than 2 minutes |
| OfflinePartitionsCount | Partitions completely unwritable | > 0 sustained |
| UnderReplicatedPartitions | Leading indicator of durability degradation | Nonzero and growing across multiple brokers |
| RequestHandlerAvgIdlePercent | Broker processing capacity headroom | Sustained below 0.3 |
| Produce RequestQueueTimeMs | Time requests wait for a request handler thread | p99 spiking above baseline |
| Produce LocalTimeMs | Disk write latency on the broker | p99 sustained above normal baseline |
| IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained > 0 outside maintenance |
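
Most of these warning signs hinge on the word "sustained". One way to make that concrete is to require the latest run of breaching samples to span a minimum duration. The helper below is a sketch under assumptions: sustained_above is a hypothetical name, and samples are assumed to be time-ordered (unix_timestamp, value) pairs from your own collector.

```python
from typing import Sequence, Tuple


def sustained_above(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if the most recent samples have stayed above `threshold`
    for at least `min_duration_s` seconds.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards until the first sample that is not in breach.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    if breach_start is None:
        return False
    return latest_ts - breach_start >= min_duration_s
```

For the UnderMinIsrPartitionCount row, this would be called with threshold=0 and min_duration_s=120. A brief dip back to zero resets the window, which is what keeps short leadership transitions from paging.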

Fixes

When ISR shrink causes rejection

Do NOT lower min.insync.replicas to silence the alert. That removes durability guarantees without fixing the underlying replication problem. Find the sick follower by correlating UnderReplicatedPartitions across brokers. If a broker shows elevated disk await or severe GC pauses, initiate a controlled shutdown to evacuate leadership cleanly. Forcing an unclean shutdown generates more controller events and extends recovery time.

When partitions are offline

Verify the controller is active (ActiveControllerCount sums to exactly 1 across brokers). If the controller event queue is backed up, do not restart additional brokers. That adds more events to an already overwhelmed queue. If all replicas for a partition are down and unclean.leader.election.enable=false (the safe default), the partition stays offline until replicas recover. Do not enable unclean elections reactively unless you explicitly accept losing acknowledged data.
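
The controller check is a sum over per-broker values. A minimal sketch, assuming you have already read ActiveControllerCount from every broker into a mapping of broker id to value; the function name and return strings are illustrative:

```python
from typing import Mapping


def controller_status(active_by_broker: Mapping[int, int]) -> str:
    """Classify the cluster from per-broker ActiveControllerCount values."""
    total = sum(active_by_broker.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"


print(controller_status({1: 0, 2: 1, 3: 0}))  # ok
```

A missing broker in the mapping is indistinguishable from a broker reporting 0, so a result of "no active controller" is only trustworthy if every broker was actually scraped.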

When the broker is saturated

Use producer quotas to throttle traffic temporarily and break retry cascades. Producers with retries=Integer.MAX_VALUE (the default since Kafka 2.1, bounded in practice by delivery.timeout.ms) amplify load when the broker is slow, creating a positive feedback loop. Increasing num.io.threads helps only if the bottleneck is concurrency, not disk I/O. If LocalTimeMs is high, more threads contending for the same slow disk will worsen latency.
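
Before touching num.io.threads, it helps to see which part of the produce latency dominates. The sketch below compares three components of Produce TotalTimeMs at the same percentile. It is a simplification: dominant_component is a hypothetical helper, and it ignores the response-side components of the request timeline.

```python
def dominant_component(queue_ms: float, local_ms: float, remote_ms: float) -> str:
    """Name the largest of three Produce latency components.

    All inputs should be the same percentile (for example p99), in milliseconds.
    """
    components = {
        "RequestQueueTimeMs": queue_ms,  # waiting for a request handler thread
        "LocalTimeMs": local_ms,         # leader's local append, mostly disk
        "RemoteTimeMs": remote_ms,       # waiting on followers for acks=all
    }
    return max(components, key=components.get)


print(dominant_component(2.0, 80.0, 10.0))  # LocalTimeMs
```

Only a result of RequestQueueTimeMs with LocalTimeMs near its normal baseline suggests that more threads might help. When LocalTimeMs dominates, the disk is the limit. When RemoteTimeMs dominates, the leader is waiting on followers.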

When leadership transitions are the cause

Brief spikes during rolling restarts or controller re-elections are normal and self-healing. If spikes persist, check LeaderElectionRateAndTimeMs and the controller event queue. A controller that cannot keep up with metadata changes will leave partitions in a state where brokers disagree about leadership, causing sustained NotLeaderOrFollowerException errors.

Prevention

  • Monitor FailedProduceRequestsPerSec from day one. Any sustained nonzero rate is abnormal in steady state.
  • Set min.insync.replicas=2 when replication.factor=3 and producers use acks=all. With the default min.insync.replicas=1, acks=all is satisfied by the leader alone once every follower has dropped out of the ISR, so an acknowledged write may exist on only one broker.
  • Keep partitions and leadership balanced across brokers. Leadership skew concentrates request load and makes ISR shrink events more likely when a hot broker hiccups.
  • Maintain headroom on RequestHandlerAvgIdlePercent. A broker at 0.45 idle looks healthy but has little buffer for absorbing a failed peer's load.
  • Run game days. Know how long ISR rebuild and page cache warmup take after a broker restart before you learn it in an incident.
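
The min.insync.replicas recommendation comes down to simple arithmetic, sketched here with two hypothetical helpers. The rule itself follows from the text above: the leader accepts an acks=all write only while the ISR is at least as large as min.insync.replicas.

```python
def accepts_acks_all(isr_size: int, min_insync_replicas: int) -> bool:
    """The leader accepts acks=all writes only while ISR size >= min.insync.replicas."""
    return isr_size >= min_insync_replicas


def replicas_that_can_fail(replication_factor: int, min_insync_replicas: int) -> int:
    """How many replicas can leave the ISR before acks=all writes are rejected."""
    return max(replication_factor - min_insync_replicas, 0)


print(replicas_that_can_fail(3, 2))  # 1: one replica may fall behind, writes continue
print(accepts_acks_all(1, 1))        # True: the leader alone satisfies acks=all
print(accepts_acks_all(1, 2))        # False: NOT_ENOUGH_REPLICAS
```

With replication.factor=3 and min.insync.replicas=2, one replica can fail without blocking writes, and every acknowledged write still exists on at least two brokers.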

How Netdata helps

  • Correlates FailedProduceRequestsPerSec with UnderMinIsrPartitionCount, UnderReplicatedPartitions, and OfflinePartitionsCount to show whether producers are being rejected by durability policy or total unavailability.
  • Surfaces per-broker RequestHandlerAvgIdlePercent and the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) without manual JMXterm queries.
  • Composite alerts require FailedProduceRequestsPerSec to correlate with replication or saturation signals before paging, reducing false positives from brief leadership transitions.
  • Tracks OS-level disk await and page cache pressure alongside Kafka metrics to distinguish thread saturation from disk degradation.