Kafka FailedProduceRequestsPerSec rising: the single best 'producers are hurting' signal

A rising FailedProduceRequestsPerSec is broker-side confirmation that producers are actively being rejected. Unlike client-side timeouts caused by network blips or producer memory pressure, it increments only when the broker receives a produce request and fails to complete it. The MBean kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec rolls server-side produce failures such as NotEnoughReplicasException, NotLeaderOrFollowerException, and CorruptRecordException into a single rate. A sustained nonzero rate means the broker is refusing data, and the window for data loss or unavailability is open.

Because it aggregates multiple error types, FailedProduceRequestsPerSec is the best single signal that producers are hurting, but it does not say why. Correlate it with UnderMinIsrPartitionCount to confirm that acks=all writes are being rejected, with OfflinePartitionsCount to detect total unavailability, and with UnderReplicatedPartitions to spot durability degradation before writes are fully blocked. Treat this metric as impact confirmation, not root cause.

Its companion, FailedFetchRequestsPerSec, plays the same role for consumers and follower replicas. When you see one, check the other. If both are elevated, the broker is likely saturated or transitioning leadership. If only produce failures are elevated, the problem is usually in the write path: replication, ISR policy, or disk.

flowchart TD
    A[FailedProduce rising] --> B{UnderMinIsr > 0?}
    B -->|Yes| C[ISR shrink: check follower disk and network]
    B -->|No| D{OfflinePartitions > 0?}
    D -->|Yes| E[Leaderless: check controller health]
    D -->|No| F{HandlerIdle < 0.3?}
    F -->|Yes| G[Saturated: check queue and disk]
    F -->|No| H[Check election rate and broker logs]
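
The branch order in the flowchart can be sketched as a small shell function. This is an illustration only, not part of Kafka or Netdata: the function name triage_failed_produce and its argument order are assumptions, and it expects values you have already collected from JMX. The 0.3 idle threshold is the one used in the chart.

```shell
# Sketch of the flowchart's branch order (hypothetical helper).
#   $1 = UnderMinIsrPartitionCount (integer)
#   $2 = OfflinePartitionsCount (integer)
#   $3 = RequestHandlerAvgIdlePercent as a 0..1 fraction
triage_failed_produce() {
    if [ "$1" -gt 0 ]; then
        echo "isr-shrink: check follower disk and network"
    elif [ "$2" -gt 0 ]; then
        echo "leaderless: check controller health"
    elif awk -v idle="$3" 'BEGIN { exit !(idle < 0.3) }'; then
        # awk handles the floating-point comparison that POSIX test cannot
        echo "saturated: check request queue and disk"
    else
        echo "check election rate and broker logs"
    fi
}
```

For example, `triage_failed_produce 0 0 0.22` takes the saturation branch, while `triage_failed_produce 3 0 0.8` stops at the ISR branch without looking at the other two values.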

What this means

FailedProduceRequestsPerSec is a rate meter that each broker exposes under kafka.server:type=BrokerTopicMetrics. Without a topic key the MBean reports the all-topic aggregate; with topic=<name> it reports a single topic. It increments each time the broker rejects or fails to complete a produce request server-side. Common triggers:

NOT_ENOUGH_REPLICAS: The ISR is smaller than min.insync.replicas, so the broker refuses acks=all writes.
NOT_LEADER_OR_FOLLOWER: The request arrived at a broker that is not the current leader.
CORRUPT_RECORD: The record batch failed checksum or deserialization on the broker side.
Other fatal broker-side errors that prevent completion.

This metric is purely broker-side. Client-side timeouts, such as a producer blocking on buffer.memory until max.block.ms, do not increment it. If producers are failing but this metric is zero, check client logs for TimeoutException and network connectivity.
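
One way to apply that distinction is to compare the broker-side rate with a count of timeout lines in the producer's own log. The helper below is a hypothetical sketch: the name locate_failure, the idea of counting log lines, and the log path in the usage note are assumptions about your setup, not Kafka tooling.

```shell
# Decide where produce failures are being observed (hypothetical helper).
#   $1 = broker FailedProduceRequestsPerSec OneMinuteRate
#   $2 = number of TimeoutException lines found in the producer log
locate_failure() {
    if awk -v rate="$1" 'BEGIN { exit !(rate > 0) }'; then
        echo "broker-side: requests reach the broker and fail there"
    elif [ "$2" -gt 0 ]; then
        echo "client-side: check producer buffers and network connectivity"
    else
        echo "no failures observed"
    fi
}
```

The timeout count could come from something like `grep -c TimeoutException /var/log/myapp/producer.log`, where the path is a placeholder for wherever your producer logs.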

Common causes

Cause	What it looks like	First thing to check
ISR shrink below `min.insync.replicas`	Sustained `FailedProduce` rate correlated with `UnderMinIsrPartitionCount > 0` and `UnderReplicatedPartitions` rising	`UnderMinIsrPartitionCount` and per-broker `UnderReplicatedPartitions`
Partition leader offline	`FailedProduce` spike with `OfflinePartitionsCount > 0`; affected partitions are completely unwritable	`OfflinePartitionsCount` and `ActiveControllerCount`
Broker request handler saturation	`FailedProduce` rising alongside `RequestHandlerAvgIdlePercent` dropping below 0.3 and `RequestQueueSize` growing	`RequestHandlerAvgIdlePercent` and `RequestQueueSize`
Disk I/O degradation	Produce `LocalTimeMs` elevated, disk `await` high, and follower replication lagging	`iostat -xz 1` and produce `LocalTimeMs` breakdown
Leadership transition or corrupt records	Brief or spiky `FailedProduce` while replication and saturation metrics are clean	`LeaderElectionRateAndTimeMs` and broker logs

Quick checks

# Check broker-side failed produce rate
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check partitions actively rejecting acks=all writes
echo "get -b kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount Value" | java -jar jmxterm.jar -l localhost:9999

# Check partitions with no active leader
echo "get -b kafka.controller:type=KafkaController,name=OfflinePartitionsCount Value" | java -jar jmxterm.jar -l localhost:9999

# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# List leaderless partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

# Check broker request handler saturation (this MBean is a meter, so read OneMinuteRate rather than Value)
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999

# Check disk I/O latency on the broker
iostat -xz 1
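
To use these readings in a script, you usually need just the number. The sketch below assumes jmxterm prints each attribute on a line shaped like `OneMinuteRate = 0.42;`; verify that against your jmxterm version before relying on it. The function name jmx_attr_value is an assumption, not a jmxterm feature.

```shell
# Extract one attribute's value from jmxterm `get` output read on stdin.
# Assumes lines of the form "AttributeName = value;" (check your jmxterm version).
#   $1 = attribute name, e.g. OneMinuteRate or Value
jmx_attr_value() {
    awk -v attr="$1" '$1 == attr && $2 == "=" {
        sub(/;$/, "", $3)   # drop the trailing semicolon
        print $3
        exit
    }'
}
```

Appending `| jmx_attr_value OneMinuteRate` to the first command above would then print only the rate, which is easier to compare against a threshold.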

How to diagnose it

Confirm the signal is sustained. Brief spikes during rolling restarts or preferred replica elections are normal. A sustained nonzero rate outside maintenance is abnormal.
Check cluster-wide impact. Query UnderMinIsrPartitionCount and OfflinePartitionsCount. If either is nonzero, the cluster is actively rejecting writes or partitions are leaderless.
If UnderMinIsrPartitionCount > 0, trace the ISR shrink. Cross-reference UnderReplicatedPartitions across all brokers to find which follower is the common denominator. Check that follower’s disk I/O and network metrics.
If OfflinePartitionsCount > 0, verify controller health. ActiveControllerCount must sum to exactly 1 across the cluster. If the controller is healthy but partitions remain offline, all replicas for those partitions may be down.
If replication metrics are clean, look for saturation. Check RequestHandlerAvgIdlePercent. If it is below 0.3, break down TotalTimeMs for Produce into RequestQueueTimeMs (thread starvation), LocalTimeMs (disk), and RemoteTimeMs (follower lag for acks=all).
For acks=all with high RemoteTimeMs, check follower health. If follower fetch latency is spiking and IsrShrinksPerSec is active, a follower is too slow to acknowledge.
If the metric is spiky but everything else looks healthy, check broker logs. Look for CorruptRecordException or repeated NotLeaderOrFollowerException errors that might indicate a stuck metadata view or damaged log segments.
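
The latency breakdown in the saturation step reduces to a comparison once you have the three timings. Kafka exposes them under kafka.network:type=RequestMetrics with request=Produce. The function below is only a sketch, with the name dominant_produce_stage and its argument order as assumptions; pass the same statistic for each stage, for example the 99thPercentile attribute.

```shell
# Report which produce stage dominates and what it suggests (hypothetical helper).
# Source MBeans: kafka.network:type=RequestMetrics,name=<stage>,request=Produce
#   $1 = RequestQueueTimeMs, $2 = LocalTimeMs, $3 = RemoteTimeMs (milliseconds)
dominant_produce_stage() {
    awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
        if (q >= l && q >= r)
            print "RequestQueueTimeMs: handler thread starvation"
        else if (l >= r)
            print "LocalTimeMs: disk write latency"
        else
            print "RemoteTimeMs: follower lag for acks=all"
    }'
}
```

With timings of 2 ms queue, 40 ms local, and 5 ms remote, `dominant_produce_stage 2 40 5` points at disk write latency.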

Metrics and signals to monitor

Signal	Why it matters	Warning sign
`FailedProduceRequestsPerSec`	Direct measure of producer-visible broker rejections	Any sustained nonzero rate outside maintenance
`UnderMinIsrPartitionCount`	Confirms writes are actively rejected for `acks=all`	> 0 sustained for more than 2 minutes
`OfflinePartitionsCount`	Partitions completely unwritable	> 0 sustained
`UnderReplicatedPartitions`	Leading indicator of durability degradation	Nonzero and growing across multiple brokers
`RequestHandlerAvgIdlePercent`	Broker processing capacity headroom	Sustained below 0.3
`Produce RequestQueueTimeMs`	Time requests wait for a request handler thread	p99 spiking above baseline
`Produce LocalTimeMs`	Disk write latency on the broker	p99 sustained above normal baseline
`IsrShrinksPerSec`	Velocity of replicas falling out of sync	Sustained > 0 outside maintenance
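
"Sustained" in the table above means several consecutive samples, not a single reading. A minimal sketch, assuming one sample every 30 seconds so that four consecutive samples cover 2 minutes; the function name and the sampling interval are illustrative, not a Netdata or Kafka feature.

```shell
# Succeed (exit 0) only when the most recent N samples are all greater than zero.
#   $1 = N, the number of consecutive samples required
#   remaining arguments = samples, oldest first
sustained_nonzero() {
    need="$1"
    shift
    # Not enough history yet: do not alert.
    [ "$#" -ge "$need" ] || return 1
    # Discard older samples so only the last N remain.
    while [ "$#" -gt "$need" ]; do
        shift
    done
    for sample in "$@"; do
        awk -v s="$sample" 'BEGIN { exit !(s > 0) }' || return 1
    done
    return 0
}
```

For example, `sustained_nonzero 4 0 0.1 0.2 0.3 0.4` succeeds because the last four samples are nonzero, while a single zero inside that window makes it fail.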

Fixes

When ISR shrink causes rejection

Do NOT lower min.insync.replicas to silence the alert. That removes durability guarantees without fixing the underlying replication problem. Find the sick follower by correlating UnderReplicatedPartitions across brokers. If a broker shows elevated disk await or severe GC pauses, initiate a controlled shutdown to evacuate leadership cleanly. Forcing an unclean shutdown generates more controller events and extends recovery time.

When partitions are offline

Verify the controller is active (ActiveControllerCount sums to 1). If the controller event queue is backed up, do not restart additional brokers. That adds more events to an already overwhelmed queue. If all replicas for a partition are down and unclean.leader.election.enable=false (the safe default), the partition stays offline until replicas recover. Do not enable unclean elections reactively unless you explicitly accept acknowledged data loss.

When the broker is saturated

Use producer quotas to throttle traffic temporarily and break retry cascades. Producers default to retries=2147483647 (Integer.MAX_VALUE), bounded only by delivery.timeout.ms, so they amplify load when the broker is slow and create a positive feedback loop. Increasing num.io.threads helps only if the bottleneck is concurrency, not disk I/O. If LocalTimeMs is high, more threads contending for the same slow disk will worsen latency.
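
Client quotas are set with kafka-configs.sh. The sketch below only prints the commands so they can be reviewed before anyone runs them; the function names, the localhost:9092 address, and the client ID and byte rate in the usage note are placeholders.

```shell
# Print a command that caps one client.id's produce throughput.
# The quota is enforced per broker, in bytes per second.
#   $1 = client.id to throttle, $2 = producer_byte_rate
producer_quota_cmd() {
    printf 'kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name %s --add-config producer_byte_rate=%s\n' "$1" "$2"
}

# Print the command that removes the quota once the broker has recovered.
#   $1 = client.id
producer_quota_remove_cmd() {
    printf 'kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type clients --entity-name %s --delete-config producer_byte_rate\n' "$1"
}
```

For instance, `producer_quota_cmd ingest-service 1048576` prints a command limiting a client called ingest-service to 1 MiB/s on each broker. Printing first is a deliberate choice: during an incident it is easy to throttle the wrong client.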

When leadership transitions are the cause

Brief spikes during rolling restarts or controller re-elections are normal and self-healing. If spikes persist, check LeaderElectionRateAndTimeMs and the controller event queue. A controller that cannot keep up with metadata changes will leave partitions in a state where brokers disagree about leadership, causing sustained NotLeaderOrFollowerException errors.

Prevention

Monitor FailedProduceRequestsPerSec from day one. Any sustained nonzero rate is abnormal in steady state.
Set min.insync.replicas=2 when replication.factor=3 and producers use acks=all. With the default min.insync.replicas=1, an acks=all write can be acknowledged by the leader alone once the ISR has shrunk to just the leader, so a single broker loss can drop acknowledged data.
Keep partitions and leadership balanced across brokers. Leadership skew concentrates request load and makes ISR shrink events more likely when a hot broker hiccups.
Maintain headroom on RequestHandlerAvgIdlePercent. A broker at 0.45 idle looks healthy, but in a small cluster it may have too little buffer to absorb a failed peer's load without falling below the 0.3 warning level.
Run game days. Know how long ISR rebuild and page cache warmup take after a broker restart before you learn it in an incident.
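
Two of the points above come down to simple arithmetic. The sketch below is a rough model, not a Kafka formula: it assumes brokers are equally loaded, that a failed broker's load spreads evenly across the survivors, and that handler busy time scales linearly with load. Both function names are made up for illustration.

```shell
# Broker failures tolerated while acks=all writes are still accepted.
#   $1 = replication.factor, $2 = min.insync.replicas
tolerated_broker_failures() {
    echo $(( $1 - $2 ))
}

# Projected idle fraction on each surviving broker after one broker fails.
# Assumes equal load per broker and even redistribution (a simplification).
#   $1 = number of brokers before the failure (must be at least 2)
#   $2 = current RequestHandlerAvgIdlePercent as a 0..1 fraction
projected_idle_after_failure() {
    awk -v n="$1" -v idle="$2" 'BEGIN {
        busy = 1 - idle
        printf "%.3f\n", 1 - busy * n / (n - 1)
    }'
}
```

With replication.factor=3 and min.insync.replicas=2, `tolerated_broker_failures 3 2` gives 1. Under this model, `projected_idle_after_failure 3 0.45` gives 0.175 for a three-broker cluster, which is already below the 0.3 warning level used in this guide.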

How Netdata helps

Correlates FailedProduceRequestsPerSec with UnderMinIsrPartitionCount, UnderReplicatedPartitions, and OfflinePartitionsCount to show whether producers are being rejected by durability policy or total unavailability.
Surfaces per-broker RequestHandlerAvgIdlePercent and the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) without manual JMXterm queries.
Composite alerts require FailedProduceRequestsPerSec to correlate with replication or saturation signals before paging, reducing false positives from brief leadership transitions.
Tracks OS-level disk await and page cache pressure alongside Kafka metrics to distinguish thread saturation from disk degradation.