Kafka FailedProduceRequestsPerSec rising: the single best ‘producers are hurting’ signal
FailedProduceRequestsPerSec rising is broker-side confirmation that producers are actively being rejected. Unlike client-side timeouts from network blips or producer memory pressure, it only increments when the broker receives a produce request and fails to complete it. The MBean kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec rolls up server-side produce failures into a single rate: NotEnoughReplicasException, NotLeaderOrFollowerException, CorruptRecordException, and others. A sustained nonzero rate means data is not being accepted, and the window for data loss or unavailability is open.
Because it aggregates multiple error types, FailedProduceRequestsPerSec is the best single signal that producers are hurting, but it is not the only one you need. Correlate it with UnderMinIsrPartitionCount to confirm cluster-wide write rejection for acks=all, with OfflinePartitionsCount to detect total unavailability, and with UnderReplicatedPartitions to spot durability degradation before writes are fully blocked. Treat this metric as impact confirmation, not root cause.
Its companion, FailedFetchRequestsPerSec, plays the same role for consumers and follower replicas. When you see one, check the other. If both are elevated, the broker is likely saturated or transitioning leadership. If only produce failures are elevated, the problem is usually in the write path: replication, ISR policy, or disk.
flowchart TD
A[FailedProduce rising] --> B{UnderMinIsr > 0?}
B -->|Yes| C[ISR shrink: check follower disk and network]
B -->|No| D{OfflinePartitions > 0?}
D -->|Yes| E[Leaderless: check controller health]
D -->|No| F{HandlerIdle < 0.3?}
F -->|Yes| G[Saturated: check queue and disk]
F -->|No| H[Check election rate and broker logs]
What this means
FailedProduceRequestsPerSec is a rate meter exposed per broker under kafka.server:type=BrokerTopicMetrics. Omit the topic tag for the all-topic aggregate. It increments each time the broker rejects or fails to complete a produce request server-side. Common triggers:
- NOT_ENOUGH_REPLICAS: The ISR is smaller than min.insync.replicas, so the broker refuses acks=all writes.
- NOT_LEADER_OR_FOLLOWER: The request arrived at a broker that is not the current leader.
- CORRUPT_RECORD: The record batch failed checksum or deserialization on the broker side.
- Other fatal broker-side errors that prevent completion.
This metric is purely broker-side. Client-side timeouts, such as a producer blocking on buffer.memory until max.block.ms, do not increment it. If producers are failing but this metric is zero, check client logs for TimeoutException and network connectivity.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| ISR shrink below min.insync.replicas | Sustained FailedProduce rate correlated with UnderMinIsrPartitionCount > 0 and UnderReplicatedPartitions rising | UnderMinIsrPartitionCount and per-broker UnderReplicatedPartitions |
| Partition leader offline | FailedProduce spike with OfflinePartitionsCount > 0; affected partitions are completely unwritable | OfflinePartitionsCount and ActiveControllerCount |
| Broker request handler saturation | FailedProduce rising alongside RequestHandlerAvgIdlePercent dropping below 0.3 and RequestQueueSize growing | RequestHandlerAvgIdlePercent and RequestQueueSize |
| Disk I/O degradation | Produce LocalTimeMs elevated, disk await high, and follower replication lagging | iostat -xz 1 and produce LocalTimeMs breakdown |
| Leadership transition or corrupt records | Brief or spiky FailedProduce while replication and saturation metrics are clean | LeaderElectionRateAndTimeMs and broker logs |
Quick checks
# Check broker-side failed produce rate
echo "get -b kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check partitions actively rejecting acks=all writes
echo "get -b kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount Value" | java -jar jmxterm.jar -l localhost:9999
# Check partitions with no active leader
echo "get -b kafka.controller:type=KafkaController,name=OfflinePartitionsCount Value" | java -jar jmxterm.jar -l localhost:9999
# List under-replicated partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# List leaderless partitions cluster-wide
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# Check broker request handler saturation
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Check disk I/O latency on the broker
iostat -xz 1
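If you want to feed these checks into a script, the number has to be pulled out of the jmxterm output first. The sketch below assumes jmxterm prints attribute lines shaped like `OneMinuteRate = 0.42;`, and `jmx_value` is a hypothetical helper name, not something jmxterm or Kafka ships.

```shell
# Extract the value of one attribute from jmxterm "get" output on stdin.
# Assumption: jmxterm prints lines shaped like "OneMinuteRate = 0.42;".
# jmx_value is a hypothetical helper name, not part of jmxterm or Kafka.
jmx_value() {
  awk -v attr="$1" '
    $1 == attr && $2 == "=" {
      gsub(/;/, "", $3)   # drop the trailing semicolon
      print $3
      exit
    }'
}

# Canned example; in practice pipe the jmxterm command output in instead.
printf '%s\n' \
  '#mbean = kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec:' \
  'OneMinuteRate = 0.42;' | jmx_value OneMinuteRate
```

The same helper works for gauge beans by asking for `Value` instead of `OneMinuteRate`.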
How to diagnose it
- Confirm the signal is sustained. Brief spikes during rolling restarts or preferred replica elections are normal. A sustained nonzero rate outside maintenance is abnormal.
- Check cluster-wide impact. Query UnderMinIsrPartitionCount and OfflinePartitionsCount. If either is nonzero, the cluster is actively rejecting writes or partitions are leaderless.
- If UnderMinIsrPartitionCount > 0, trace the ISR shrink. Cross-reference UnderReplicatedPartitions across all brokers to find which follower is the common denominator. Check that follower’s disk I/O and network metrics.
- If OfflinePartitionsCount > 0, verify controller health. ActiveControllerCount must sum to exactly 1 across the cluster. If the controller is healthy but partitions remain offline, all replicas for those partitions may be down.
- If replication metrics are clean, look for saturation. Check RequestHandlerAvgIdlePercent. If it is below 0.3, break down TotalTimeMs for Produce into RequestQueueTimeMs (thread starvation), LocalTimeMs (disk), and RemoteTimeMs (follower lag for acks=all).
- For acks=all with high RemoteTimeMs, check follower health. If follower fetch latency is spiking and IsrShrinksPerSec is active, a follower is too slow to acknowledge.
- If the metric is spiky but everything else looks healthy, check broker logs. Look for CorruptRecordException or repeated NotLeaderOrFollowerException errors that might indicate a stuck metadata view or damaged log segments.
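The order of those checks can be captured in a few lines of shell. This is a minimal sketch, not a full runbook: `triage` is a hypothetical helper that only classifies values you have already collected, and the 0.3 threshold is the one used in this guide.

```shell
# triage UNDER_MIN_ISR OFFLINE_PARTITIONS HANDLER_IDLE
# Applies the checks in the same order as the steps above.
# triage is a hypothetical helper; it does not query the broker itself.
triage() {
  under_min_isr="$1"; offline="$2"; idle="$3"
  if [ "$under_min_isr" -gt 0 ]; then
    echo "isr-shrink: check follower disk and network"
  elif [ "$offline" -gt 0 ]; then
    echo "leaderless: check controller health"
  elif awk -v idle="$idle" 'BEGIN { exit !(idle < 0.3) }'; then
    # awk handles the fractional comparison that [ ] cannot do
    echo "saturated: check request queue and disk"
  else
    echo "check leader election rate and broker logs"
  fi
}

triage 0 0 0.22
```

With no under-min-ISR or offline partitions and an idle ratio of 0.22, the call above lands on the saturation branch.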
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| FailedProduceRequestsPerSec | Direct measure of producer-visible broker rejections | Any sustained nonzero rate outside maintenance |
| UnderMinIsrPartitionCount | Confirms writes are actively rejected for acks=all | > 0 sustained for more than 2 minutes |
| OfflinePartitionsCount | Partitions completely unwritable | > 0 sustained |
| UnderReplicatedPartitions | Leading indicator of durability degradation | Nonzero and growing across multiple brokers |
| RequestHandlerAvgIdlePercent | Broker processing capacity headroom | Sustained below 0.3 |
| Produce RequestQueueTimeMs | Time requests wait for a request handler thread | p99 spiking above baseline |
| Produce LocalTimeMs | Disk write latency on the broker | p99 sustained above normal baseline |
| IsrShrinksPerSec | Velocity of replicas falling out of sync | Sustained > 0 outside maintenance |
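Several warning signs in the table depend on the word "sustained". One way to make that concrete, sketched here under the assumption that you sample a metric at a fixed interval, is to require the last N samples to all be nonzero. `is_sustained` is a hypothetical helper, and the interval (and so what N means in minutes) is your choice.

```shell
# is_sustained N: read one sample per line on stdin and succeed only if the
# last N samples are all greater than zero.
# is_sustained is a hypothetical helper, not part of any Kafka tooling.
is_sustained() {
  awk -v n="$1" '
    { v[NR] = $1 }
    END {
      if (NR < n) exit 1            # not enough history yet
      for (i = NR - n + 1; i <= NR; i++)
        if (v[i] + 0 <= 0) exit 1   # a zero sample breaks the streak
      exit 0
    }'
}

# Five samples taken 30 seconds apart; the last four cover two minutes.
printf '%s\n' 0 0.4 1.2 0.9 0.7 | is_sustained 4 && echo "sustained"
```

A single zero inside the window resets the streak, which keeps brief spikes during rolling restarts from tripping the check.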
Fixes
When ISR shrink causes rejection
Do NOT lower min.insync.replicas to silence the alert. That removes durability guarantees without fixing the underlying replication problem. Find the sick follower by correlating UnderReplicatedPartitions across brokers. If a broker shows elevated disk await or severe GC pauses, initiate a controlled shutdown to evacuate leadership cleanly. Forcing an unclean shutdown generates more controller events and extends recovery time.
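Finding the common denominator can be done by counting which broker IDs appear in Replicas but not in Isr. The sketch below assumes each line of `kafka-topics.sh --describe --under-replicated-partitions` output carries `Replicas: 1,2,3` and `Isr: 1,3` tokens with a non-empty ISR. `missing_from_isr` is a hypothetical helper name.

```shell
# Count how often each broker ID is listed in Replicas but missing from Isr.
# Input: lines from `kafka-topics.sh --describe --under-replicated-partitions`.
# Assumption: each line has "Replicas: 1,2,3" and "Isr: 1,3" tokens.
# missing_from_isr is a hypothetical helper name.
missing_from_isr() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      if (replicas == "") next          # skip lines without partition detail
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      split("", in_isr)                 # clear the lookup table
      for (j = 1; j <= m; j++) in_isr[s[j]] = 1
      for (j = 1; j <= n; j++)
        if (!(r[j] in in_isr)) missing[r[j]]++
    }
    END { for (b in missing) print missing[b], b }' | sort -rn
}

# Canned example; in practice pipe the kafka-topics.sh output in instead.
printf '%s\n' \
  'Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,3' \
  'Topic: orders Partition: 1 Leader: 3 Replicas: 3,2,1 Isr: 3,1' \
  'Topic: events Partition: 4 Leader: 1 Replicas: 1,3,2 Isr: 1' | missing_from_isr
```

Each output line is a count followed by a broker ID. In the canned example broker 2 is missing from three ISRs and broker 3 from one, so broker 2 is the first place to look.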
When partitions are offline
Verify the controller is active (ActiveControllerCount sums to 1). If the controller event queue is backed up, do not restart additional brokers. That adds more events to an already overwhelmed queue. If all replicas for a partition are down and unclean.leader.election.enable=false (the safe default), the partition stays offline until replicas recover. Do not enable unclean elections reactively unless you explicitly accept acknowledged data loss.
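The "sums to 1" check is easy to script once you have one ActiveControllerCount value per broker. This is a minimal sketch: `controller_check` is a hypothetical helper, and how you collect the per-broker values (JMX, your metrics pipeline) is left open.

```shell
# controller_check: read one ActiveControllerCount value per broker on stdin
# and report whether the cluster-wide sum is exactly 1.
# controller_check is a hypothetical helper name.
controller_check() {
  awk '
    { sum += $1 }
    END {
      if (sum == 1) { print "ok: exactly one active controller"; exit 0 }
      if (sum == 0) { print "problem: no active controller"; exit 1 }
      print "problem: " sum " brokers claim to be controller"; exit 1
    }'
}

# Three brokers, the second one is the controller.
printf '%s\n' 0 1 0 | controller_check
```

A sum of 0 means no broker is acting as controller, and a sum above 1 means more than one broker believes it holds the role.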
When the broker is saturated
Use producer quotas to throttle traffic temporarily and break retry cascades. Producers with retries left at its default of Integer.MAX_VALUE amplify load when the broker is slow, creating a positive feedback loop. Increasing num.io.threads helps only if the bottleneck is concurrency, not disk I/O. If LocalTimeMs is high, more threads contending for the same slow disk will worsen latency.
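Before touching thread counts, it helps to see which part of the Produce latency breakdown dominates. The sketch below simply names the largest of the three components. `dominant_component` is a hypothetical helper, and the inputs are whatever percentile you already track for each metric.

```shell
# dominant_component QUEUE_MS LOCAL_MS REMOTE_MS
# Name the largest Produce latency component and its usual suspect.
# dominant_component is a hypothetical helper name.
dominant_component() {
  awk -v q="$1" -v l="$2" -v r="$3" 'BEGIN {
    if (q >= l && q >= r) print "RequestQueueTimeMs: request handler thread starvation"
    else if (l >= r)      print "LocalTimeMs: slow local disk, more threads will not help"
    else                  print "RemoteTimeMs: followers slow to acknowledge acks=all"
  }'
}

dominant_component 4 180 35
```

In the example call LocalTimeMs dominates, which is the case where raising num.io.threads is likely to make things worse rather than better.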
When leadership transitions are the cause
Brief spikes during rolling restarts or controller re-elections are normal and self-healing. If spikes persist, check LeaderElectionRateAndTimeMs and the controller event queue. A controller that cannot keep up with metadata changes will leave partitions in a state where brokers disagree about leadership, causing sustained NotLeaderOrFollowerException errors.
Prevention
- Monitor FailedProduceRequestsPerSec from day one. Any sustained nonzero rate is abnormal in steady state.
- Set min.insync.replicas=2 when replication.factor=3 and producers use acks=all. Without this, acks=all only guarantees the leader received the message, even with zero followers in ISR.
- Keep partitions and leadership balanced across brokers. Leadership skew concentrates request load and makes ISR shrink events more likely when a hot broker hiccups.
- Maintain headroom on RequestHandlerAvgIdlePercent. A broker at 0.45 idle looks healthy but has no buffer for absorbing a peer failure.
- Run game days. Know how long ISR rebuild and page cache warmup take after a broker restart before you learn it in an incident.
How Netdata helps
- Correlates FailedProduceRequestsPerSec with UnderMinIsrPartitionCount, UnderReplicatedPartitions, and OfflinePartitionsCount to show whether producers are being rejected by durability policy or total unavailability.
- Surfaces per-broker RequestHandlerAvgIdlePercent and the produce latency breakdown (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs) without manual JMXterm queries.
- Composite alerts require FailedProduceRequestsPerSec to correlate with replication or saturation signals before paging, reducing false positives from brief leadership transitions.
- Tracks OS-level disk await and page cache pressure alongside Kafka metrics to distinguish thread saturation from disk degradation.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka ‘Broker may not be available’: clients that can’t connect or stay connected
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka __consumer_offsets growing huge: compaction failure on the offsets topic
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker







