Kafka quota throttling: throttle-time, runaway clients, and protecting the cluster
Your Kafka producers are timing out. Consumer lag is growing. Client dashboards show elevated request latency, but the brokers are healthy: UnderReplicatedPartitions is zero and disk I/O looks fine. The culprit is often quota throttling: the cluster is enforcing per-client or per-user byte-rate limits, and one or more clients have hit the wall. The broker exposes this as throttle-time in JMX. That backpressure slows the client, but if the client is a runaway service or a backfill consumer, the throttling can cascade into widespread latency.
Throttling is not a broker failure. It is the cluster protecting shared resources. Operators must distinguish legitimate workload growth, a runaway service, and a backfill consumer thrashing page cache. The fix differs in each case: raise a quota, apply a harder limit, or stop a producer timeout cascade.
What this means
Kafka enforces per-client and per-user quotas for produce and fetch byte rates. When a client exceeds its configured limit, the broker records throttle-time for that user and client-id combination. The client spends more time waiting, which reduces its effective throughput.
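To make the link between overage and delay concrete, the sketch below applies the delay formula from Kafka's quota design: delay = (observed rate - quota) / quota x window. The helper name estimate_throttle_ms and the 11-second default window are assumptions for illustration. The real window depends on the broker's quota.window.num and quota.window.size.seconds settings, and brokers may cap the delay applied to a single response, so treat the result as a rough estimate only.

```shell
# Estimate the throttle delay (ms) implied by a byte-rate overage.
# Assumes: delay = (observed - quota) / quota * window
# estimate_throttle_ms is a hypothetical helper, not a Kafka tool.
estimate_throttle_ms() {
  observed="$1"            # observed byte rate (bytes/s)
  quota="$2"               # configured byte-rate quota (bytes/s)
  window_ms="${3:-11000}"  # measurement window in ms (assumed default)
  awk -v o="$observed" -v q="$quota" -v w="$window_ms" 'BEGIN {
    if (o <= q) { print 0; exit }          # under quota: no delay
    printf "%d\n", (o - q) / q * w + 0.5   # round to nearest ms
  }'
}

# A client sending 1.5 MB/s against a 1 MB/s quota
estimate_throttle_ms 1500000 1000000
```

With these inputs the overage is 50% of the quota, so the estimate is half the window.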
Throttling is a signal, not a root cause. The client may have grown legitimately, a service may be retrying in a loop, or a consumer may have reset its offsets and is re-reading the entire log. The throttle-time metric tells you enforcement is active; you still need to determine whether the client or the quota is the problem.
Because throttling applies backpressure rather than rejection, clients typically retry or wait. This can mask the root cause. A producer timeout cascade can look like broker saturation when it is actually a client exceeding a modest quota.
The key broker-side JMX metrics are scoped by user and client-id under kafka.server:type=Produce and kafka.server:type=Fetch, with the attribute throttle-time. An aggregate view per request type is available through kafka.network:type=RequestMetrics,name=ThrottleTimeMs, with request=Produce or request=FetchConsumer.
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Legitimate workload growth | Throttle-time rises during business hours for a known service | Per-client throttle-time and BytesInPerSec or BytesOutPerSec |
| Runaway producer retry cascade | FailedProduceRequestsPerSec increases; RequestQueueSize grows; MessagesInPerSec flat while BytesInPerSec rises | Aggregate ThrottleTimeMs and producer error logs |
| Backfill consumer | Consumer lag drops rapidly from a large baseline; FetchConsumer LocalTimeMs spikes; disk read I/O jumps | Consumer group lag trend and fetch throttle-time |
| Missing or low default quotas | Multiple unrelated clients hit throttling simultaneously after traffic shifts | kafka-configs.sh --describe for users and clients |
Quick checks
# Aggregate produce throttle time across all clients (use request=FetchConsumer for fetch)
echo "get -b kafka.network:type=RequestMetrics,name=ThrottleTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999
# List per-client produce metrics to identify throttled users/client-ids
echo "beans -d kafka.server" | java -jar jmxterm.jar -l localhost:9999 -n | grep "type=Produce"
# Check a specific client (replace user and client-id)
echo "get -b kafka.server:type=Produce,user=svc-account,client-id=payments-app throttle-time" | java -jar jmxterm.jar -l localhost:9999
# Review configured quotas
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type users
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type clients
# Check for backfill behavior in a consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group backfill-group-id
# Check if the broker itself is saturated
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
How to diagnose it
1. Confirm throttling is active. Check kafka.network:type=RequestMetrics,name=ThrottleTimeMs for request=Produce and request=FetchConsumer. If the p99 is sustained above zero, the broker is enforcing quotas. Brief spikes during consumer group rebalances or client startup are normal; sustained elevation is not.
2. Find the throttled client. Query the per-client kafka.server:type=Produce and kafka.server:type=Fetch MBeans to find which user and client-id combinations report throttle-time above zero. You may need to iterate over all matching beans.
3. Determine produce versus fetch. Produce throttling shows up under type=Produce; fetch throttling under type=Fetch. Produce problems often correlate with BytesInPerSec spikes. Fetch problems correlate with consumer activity.
4. Check the quota configuration. Run kafka-configs.sh --describe for users and clients to see current byte-rate limits. Compare the configured producer_byte_rate or consumer_byte_rate to the client's actual throughput.
5. Correlate with workload changes. Did a service deploy just before throttling began? Did a consumer group reset offsets? Did a new mirror or migration job start? Check deployment pipelines and consumer group history.
6. Rule out broker saturation. Low RequestHandlerAvgIdlePercent and a growing RequestQueueSize mean the broker is overloaded. In that case, the client-side latency points to a saturated broker rather than a runaway client.
7. Detect a producer timeout cascade. If FailedProduceRequestsPerSec is rising, RequestQueueTimeMs is growing, and MessagesInPerSec is flat while BytesInPerSec climbs, producers are retrying large batches. Check producer logs for TimeoutException.
8. Detect a backfill consumer. If one consumer group's lag drops rapidly from a high baseline, records-consumed-rate is elevated, and FetchConsumer LocalTimeMs has spiked alongside disk read I/O, the consumer is likely backfilling. Verify with kafka-consumer-groups.sh.
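
The saturation check can be reduced to a small decision helper. This is a minimal sketch, assuming you have already read RequestHandlerAvgIdlePercent and RequestQueueSize from JMX. The function name classify_throttling and the queue threshold of 100 are hypothetical; the 0.3 idle threshold is the warning level this guide uses.

```shell
# classify_throttling is a hypothetical helper. It applies this guide's rule
# of thumb to values you have already collected from JMX.
classify_throttling() {
  idle="$1"                # RequestHandlerAvgIdlePercent as a fraction (0.0-1.0)
  queue_size="$2"          # RequestQueueSize
  queue_limit="${3:-100}"  # assumed threshold for a "high" queue size
  awk -v i="$idle" -v q="$queue_size" -v l="$queue_limit" 'BEGIN {
    if (i < 0.3 && q >= l) print "broker-saturation"
    else print "quota-enforcement"
  }'
}

# Example: handlers are 15% idle and 450 requests are queued
classify_throttling 0.15 450
```

Tune the queue threshold to your own baseline. A broker that normally runs with a near-empty request queue warrants a much lower limit than the placeholder used here.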
flowchart TD
A[Throttle-time sustained above 0] --> B[Identify user and client-id via per-client MBeans]
B --> C{Produce or Fetch?}
C -->|Produce| D[Check BytesInPerSec and producer quotas]
C -->|Fetch| E[Check consumer lag and fetch quotas]
D --> F{Runaway growth or legitimate?}
E --> F
F -->|Runaway| G[Reduce quota or stop client]
F -->|Legitimate| H[Raise quota]
F -->|Backfill| I[Set consumer_byte_rate]
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.server:type=Produce/Fetch throttle-time | Direct measure of per-client enforcement | Sustained above 0 for critical clients |
| kafka.network:type=RequestMetrics,name=ThrottleTimeMs | Aggregate throttling per request type | p99 sustained above 0 |
| BytesInPerSec / BytesOutPerSec per topic | Identifies which topic drives violations | Sudden spike correlating with throttle-time |
| Consumer group lag | Flags backfill consumers that hit fetch quotas | Lag decreasing rapidly from a high baseline |
| FetchConsumer LocalTimeMs | Page cache thrashing from large historical reads | Spike paired with elevated disk read I/O |
| RequestHandlerAvgIdlePercent | Distinguishes broker overload from quota enforcement | Below 0.3 with high RequestQueueSize means broker saturation |
| FailedProduceRequestsPerSec | Producer timeout cascade indicator | Rising while MessagesInPerSec stays flat |
Fixes
Raise the quota for legitimate growth
If the client is a known production service that has scaled up, increase its producer_byte_rate or consumer_byte_rate via dynamic configuration. Review current values first with kafka-configs.sh --describe, then apply the new limit.
Tradeoff: Higher quotas reduce the cluster’s ability to absorb sudden traffic spikes from that client. Monitor BytesInPerSec after the change.
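
As an illustration, raising a producer quota might look like the sketch below. It assumes a Kafka version where kafka-configs.sh manages quotas through --bootstrap-server, as in the quick checks above. The helper names, the 30% headroom default, and the svc-account and payments-app entity names are placeholders, not recommendations.

```shell
KAFKA_CONFIGS="${KAFKA_CONFIGS:-kafka-configs.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Compute a new quota as observed peak plus headroom (integer bytes/s).
# quota_with_headroom is a hypothetical helper; 30% is an assumed default.
quota_with_headroom() {
  peak="$1"
  headroom_pct="${2:-30}"
  echo $(( peak + peak * headroom_pct / 100 ))
}

# Apply a producer quota to one user and client-id pair.
raise_producer_quota() {
  user="$1"; client="$2"; rate="$3"
  "$KAFKA_CONFIGS" --bootstrap-server "$BOOTSTRAP" --alter \
    --add-config "producer_byte_rate=${rate}" \
    --entity-type users --entity-name "$user" \
    --entity-type clients --entity-name "$client"
}

# Example (peak of 20 MB/s observed):
# raise_producer_quota svc-account payments-app "$(quota_with_headroom 20000000)"
```

Set the quota at the same entity level as the existing one (user, client-id, or the pair). Otherwise you create a second quota entry instead of changing the one currently being enforced.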
Throttle a backfill consumer
When a consumer is reading historical data and evicting page cache, set a consumer_byte_rate quota to limit its fetch rate. This protects tail-consumer latency for the rest of the cluster.
Tradeoff: The backfill takes longer, but the rest of the cluster remains stable.
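
A minimal sketch of capping a backfill and lifting the cap afterwards, assuming the consumer is identified by its client-id alone. The function names and the example client-id are hypothetical.

```shell
KAFKA_CONFIGS="${KAFKA_CONFIGS:-kafka-configs.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Cap fetch throughput for one client-id while it backfills.
throttle_backfill() {
  client="$1"; rate="$2"   # rate in bytes/s
  "$KAFKA_CONFIGS" --bootstrap-server "$BOOTSTRAP" --alter \
    --add-config "consumer_byte_rate=${rate}" \
    --entity-type clients --entity-name "$client"
}

# Remove the cap once the backfill has caught up.
unthrottle_backfill() {
  client="$1"
  "$KAFKA_CONFIGS" --bootstrap-server "$BOOTSTRAP" --alter \
    --delete-config consumer_byte_rate \
    --entity-type clients --entity-name "$client"
}

# Example: limit a backfill client to 5 MB/s, then lift the limit later
# throttle_backfill backfill-client 5242880
# unthrottle_backfill backfill-client
```

If the consumer authenticates as a user that already has a user-level quota, that quota takes precedence over a client-id quota. In that case the entity would need to be the user or the user and client-id pair.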
Starve a runaway producer
If a producer is stuck in a retry loop and overwhelming the broker, lower its producer_byte_rate quota to reduce its throughput, then stop the process. Do not rely on quota cuts alone to break client-side retries.
Tradeoff: The producer will be throttled harder until you restore the quota or stop it. Use this only while you fix the underlying broker issue or terminate the faulty client.
Stop the abusive client
If the client is misconfigured or malicious, terminate the application. Throttling is a safety net, not a substitute for killing a runaway process.
Warning: Terminating the application is disruptive. Verify the client-id with connection logs or kafka-configs.sh before acting.
Prevention
- Set default quotas. Configure default user and client quotas so new clients cannot overwhelm the cluster on their first connection.
- Monitor per-client byte rates before they hit limits. Track BytesInPerSec and BytesOutPerSec per topic to spot growth trends.
- Alert on throttle-time. Any sustained throttle-time above zero for tier-1 services should create a ticket.
- Review quotas during capacity planning. When a service is expected to double traffic, adjust quotas before the deploy.
- Require client-id registration. Unrecognized client-ids that suddenly report high throttle-time are easier to investigate when every id has an owner.
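
Default quotas from the first item can be set with the --entity-default flag of kafka-configs.sh. The sketch below is one way to wrap that. The function name and the byte rates in the example are placeholders to adapt to your cluster's capacity.

```shell
KAFKA_CONFIGS="${KAFKA_CONFIGS:-kafka-configs.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# Set default produce and fetch quotas for every user or every client-id
# that has no more specific quota. set_default_quota is a hypothetical helper.
set_default_quota() {
  entity_type="$1"      # "users" or "clients"
  producer_rate="$2"    # bytes/s
  consumer_rate="$3"    # bytes/s
  "$KAFKA_CONFIGS" --bootstrap-server "$BOOTSTRAP" --alter \
    --add-config "producer_byte_rate=${producer_rate},consumer_byte_rate=${consumer_rate}" \
    --entity-type "$entity_type" --entity-default
}

# Example: 10 MB/s produce and 20 MB/s fetch for any user without its own quota
# set_default_quota users 10485760 20971520
```

Explicit quotas for a named user or client-id still override these defaults, so known high-volume services keep their own limits.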
How Netdata helps
- Correlates throttle-time with per-broker RequestHandlerAvgIdlePercent and RequestQueueSize to distinguish quota enforcement from broker saturation.
- Surfaces per-topic BytesInPerSec and BytesOutPerSec to pinpoint which topic drives quota violations.
- Tracks consumer lag and consumer group state to detect backfill behavior before fetch quotas max out.
- Exposes aggregate request latency breakdowns so you can see whether throttling elevates ResponseQueueTimeMs or ResponseSendTimeMs.
- Provides cluster-wide throttle-time visibility without manual JMXterm queries during an incident.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka authentication failures: SASL/mTLS errors, credential rotation, and brute force
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka ‘Broker may not be available’: clients that can’t connect or stay connected
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka network egress saturation: BytesOutPerSec, replication amplification, and fan-out
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka connection storms: connection-count spikes, FD pressure, and network threads
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka __consumer_offsets growing huge: compaction failure on the offsets topic







