Kafka quota throttling: throttle-time, runaway clients, and protecting the cluster

Your Kafka producers are timing out. Consumer lag is growing. Client dashboards show elevated request latency, but the brokers are healthy: UnderReplicatedPartitions is zero and disk I/O looks fine. The culprit is often quota throttling: the cluster is enforcing per-client or per-user byte-rate limits, and one or more clients have hit the wall. The broker exposes this as throttle-time in JMX. That backpressure slows the client, but if the client is a runaway service or a backfill consumer, the throttling can cascade into widespread latency.

Throttling is not a broker failure. It is the cluster protecting shared resources. Operators must distinguish three cases: legitimate workload growth, a runaway service, and a backfill consumer thrashing the page cache. The fix differs in each: raise the quota, stop or hard-limit the client, or cap the backfill's fetch rate.

What this means

Kafka enforces per-client and per-user quotas for produce and fetch byte rates. When a client exceeds its configured limit, the broker records throttle-time for that user and client-id combination. The client spends more time waiting, which reduces its effective throughput.
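To make the backpressure concrete, here is a small worked calculation. It assumes the delay formula from Kafka's original quota design (KIP-13), roughly (observed rate - quota) / quota x measurement window. The helper name `throttle_ms` and the 10-second window are illustrative assumptions, and a real broker may cap or adjust the value.

```shell
# throttle_ms OBSERVED_BPS QUOTA_BPS WINDOW_MS
# Approximate delay a broker would impose, assuming the KIP-13 formula
# (observed - quota) / quota * window. Hypothetical helper for illustration.
throttle_ms() {
  awk -v o="$1" -v t="$2" -v w="$3" 'BEGIN {
    if (o <= t) { print 0; exit }          # under quota: no throttling
    printf "%.0f\n", (o - t) / t * w       # over quota: delay in milliseconds
  }'
}

# A client pushing 15 MB/s against a 10 MB/s quota, measured over a 10 s window:
throttle_ms 15000000 10000000 10000   # prints 5000
```

The further a client overshoots its quota, the longer each response is delayed, which is why a modest quota can look like a slow broker from the client side.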

Throttling is a signal, not a root cause. The client may have grown legitimately, a service may be retrying in a loop, or a consumer may have reset its offsets and is re-reading the entire log. The throttle-time metric tells you enforcement is active; you still need to determine whether the client or the quota is the problem.

Because throttling applies backpressure rather than rejection, clients typically retry or wait. This can mask the root cause. A producer timeout cascade can look like broker saturation when it is actually a client exceeding a modest quota.

The key broker-side JMX metrics are scoped by user and client-id under kafka.server:type=Produce and kafka.server:type=Fetch, each exposing a throttle-time attribute. An aggregate view per request type is available through kafka.network:type=RequestMetrics,name=ThrottleTimeMs, for example with request=Produce or request=FetchConsumer.

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Legitimate workload growth | Throttle-time rises during business hours for a known service | Per-client throttle-time and BytesInPerSec or BytesOutPerSec |
| Runaway producer retry cascade | FailedProduceRequestsPerSec increases; RequestQueueSize grows; MessagesInPerSec flat while BytesInPerSec rises | Aggregate ThrottleTimeMs and producer error logs |
| Backfill consumer | Consumer lag drops rapidly from a large baseline; FetchConsumer LocalTimeMs spikes; disk read I/O jumps | Consumer group lag trend and fetch throttle-time |
| Missing or low default quotas | Multiple unrelated clients hit throttling simultaneously after traffic shifts | kafka-configs.sh --describe for users and clients |

Quick checks

# Aggregate throttle time across all clients (repeat with request=FetchConsumer for fetches)
echo "get -b kafka.network:type=RequestMetrics,name=ThrottleTimeMs,request=Produce 99thPercentile" | java -jar jmxterm.jar -l localhost:9999

# List per-client produce metrics to identify throttled users/client-ids
echo "beans -d kafka.server" | java -jar jmxterm.jar -l localhost:9999 | grep "type=Produce"

# Check a specific client (replace user and client-id)
echo "get -b kafka.server:type=Produce,user=svc-account,client-id=payments-app throttle-time" | java -jar jmxterm.jar -l localhost:9999

# Review configured quotas
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type users
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type clients

# Check for backfill behavior in a consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group backfill-group-id

# Check if the broker itself is saturated
echo "get -b kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
echo "get -b kafka.network:type=RequestChannel,name=RequestQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
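The single-client check above covers one bean at a time. As a rough sketch of scanning every per-client quota bean, the helper below turns a jmxterm bean listing into one `get` command per Produce or Fetch bean. The function name `quota_get_commands` is hypothetical, and the sketch assumes bean names need no extra quoting and that jmxterm accepts several commands on stdin in non-interactive mode (`-n`).

```shell
# quota_get_commands: read a jmxterm bean listing on stdin and emit one
# "get ... throttle-time" command for each per-client quota bean.
quota_get_commands() {
  grep -E '^kafka\.server:type=(Produce|Fetch),' |
    while IFS= read -r bean; do
      printf 'get -b %s throttle-time\n' "$bean"
    done
}

# Usage against a live broker (assumes jmxterm.jar and JMX on localhost:9999):
#   echo "beans -d kafka.server" | java -jar jmxterm.jar -l localhost:9999 -n \
#     | quota_get_commands \
#     | java -jar jmxterm.jar -l localhost:9999 -n
```

Any bean whose throttle-time comes back above zero identifies a user and client-id pair the broker is currently slowing down.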

How to diagnose it

  1. Confirm throttling is active. Check kafka.network:type=RequestMetrics,name=ThrottleTimeMs for request=Produce and request=FetchConsumer. If the p99 is sustained above zero, the broker is enforcing quotas. Brief spikes during consumer group rebalances or client startup are normal; sustained elevation is not.

  2. Find the throttled client. Query the per-client kafka.server:type=Produce and kafka.server:type=Fetch MBeans to find which user and client-id combinations report throttle-time above zero. You may need to iterate over all matching beans.

  3. Determine produce versus fetch. Produce throttling shows up under type=Produce; fetch throttling under type=Fetch. Produce problems often correlate with BytesInPerSec spikes. Fetch problems correlate with consumer activity.

  4. Check the quota configuration. Run kafka-configs.sh --describe for users and clients to see current byte-rate limits. Compare the configured producer_byte_rate or consumer_byte_rate to the client’s actual throughput.

  5. Correlate with workload changes. Did a service deploy just before throttling began? Did a consumer group reset offsets? Did a new mirror or migration job start? Check deployment pipelines and consumer group history.

  6. Rule out broker saturation. Low RequestHandlerAvgIdlePercent and a growing RequestQueueSize mean the broker is overloaded. In that case, the latency clients see comes from a saturated broker rather than from a runaway client, and changing quotas alone will not fix it.

  7. Detect a producer timeout cascade. If FailedProduceRequestsPerSec is rising, RequestQueueTimeMs is growing, and MessagesInPerSec is flat while BytesInPerSec climbs, producers are retrying large batches. Check producer logs for TimeoutException.

  8. Detect a backfill consumer. If one consumer group’s lag drops rapidly from a high baseline, records-consumed-rate is elevated, and FetchConsumer LocalTimeMs has spiked alongside disk read I/O, the consumer is likely backfilling. Verify with kafka-consumer-groups.sh.
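For step 8, one way to quantify "lag drops rapidly" is to sample total group lag twice and compute the drain rate. This is a minimal sketch: the helper names `total_lag` and `drain_rate` are hypothetical, and the parser assumes the `--describe` output has a header row containing a `LAG` column, which it locates by name because column order differs between Kafka versions.

```shell
# total_lag: sum the LAG column of `kafka-consumer-groups.sh --describe` (stdin).
total_lag() {
  awk '
    # Skip lines until the header row, then remember where the LAG column is.
    lagcol == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lagcol = i; next }
    # Add numeric lag values; "-" (no committed offset) is ignored.
    $lagcol ~ /^[0-9]+$/ { sum += $lagcol }
    END { printf "%.0f\n", sum }
  '
}

# drain_rate LAG_BEFORE LAG_AFTER SECONDS: messages per second by which lag shrank.
drain_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.0f\n", (a - b) / s }'
}

# Usage against a live cluster (group name is a placeholder):
#   before=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#              --describe --group backfill-group-id | total_lag)
#   sleep 60
#   after=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#              --describe --group backfill-group-id | total_lag)
#   drain_rate "$before" "$after" 60
```

A drain rate far above the topic's normal MessagesInPerSec suggests the group is reading historical data rather than keeping up with live traffic.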

flowchart TD
    A[Throttle-time sustained above 0] --> B[Identify user and client-id<br/>via per-client MBeans]
    B --> C{Produce or Fetch?}
    C -->|Produce| D[Check BytesInPerSec<br/>and producer quotas]
    C -->|Fetch| E[Check consumer lag<br/>and fetch quotas]
    D --> F{Runaway growth<br/>or legitimate?}
    E --> F
    F -->|Runaway| G[Reduce quota or stop client]
    F -->|Legitimate| H[Raise quota]
    F -->|Backfill| I[Set consumer_byte_rate]

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| kafka.server:type=Produce/Fetch throttle-time | Direct measure of per-client enforcement | Sustained above 0 for critical clients |
| kafka.network:type=RequestMetrics,name=ThrottleTimeMs | Aggregate cluster throttling per request type | p99 sustained above 0 |
| BytesInPerSec / BytesOutPerSec per topic | Identifies which topic drives violations | Sudden spike correlating with throttle-time |
| Consumer group lag | Flags backfill consumers that hit fetch quotas | Lag decreasing rapidly from a high baseline |
| FetchConsumer LocalTimeMs | Page cache thrashing from large historical reads | Spike paired with elevated disk read I/O |
| RequestHandlerAvgIdlePercent | Distinguishes broker overload from quota enforcement | Below 0.3 with high RequestQueueSize means broker saturation |
| FailedProduceRequestsPerSec | Producer timeout cascade indicator | Rising while MessagesInPerSec stays flat |
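"Sustained above 0" needs a concrete definition before it can drive an alert. The sketch below treats throttling as sustained when the most recent N samples are all above zero. The helper name `sustained_throttle` and the choice of N are assumptions to tune against your scrape interval.

```shell
# sustained_throttle N
# Reads one throttle-time sample per line on stdin (oldest first).
# Exits 0 if the most recent N samples are all above zero, 1 otherwise.
sustained_throttle() {
  awk -v n="$1" '
    { if ($1 + 0 > 0) run++; else run = 0 }   # length of the current nonzero streak
    END { exit (run >= n + 0 ? 0 : 1) }
  '
}

# Example: three consecutive nonzero samples at the end of the series.
if printf '%s\n' 0 12 30 45 | sustained_throttle 3; then
  echo "throttling sustained: open a ticket"
fi
```

A single zero sample resets the streak, so brief spikes during rebalances or client startup do not trigger the check.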

Fixes

Raise the quota for legitimate growth

If the client is a known production service that has scaled up, increase its producer_byte_rate or consumer_byte_rate via dynamic configuration. Review current values first with kafka-configs.sh --describe, then apply the new limit.
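As an illustration of applying the new limit, the sketch below derives a quota from an observed peak plus headroom and prints the matching `kafka-configs.sh --alter` command rather than running it. The helper names, the 25% headroom, and the `svc-account` / `payments-app` identifiers are assumptions, so review the printed command before executing it.

```shell
# new_quota PEAK_BPS HEADROOM_PCT: observed peak plus headroom, in bytes/sec.
new_quota() {
  awk -v p="$1" -v h="$2" 'BEGIN { printf "%.0f\n", p * (100 + h) / 100 }'
}

# raise_quota_cmd USER CLIENT_ID KEY BYTES_PER_SEC
# Prints (does not run) the command that sets a (user, client-id) quota.
raise_quota_cmd() {
  printf 'kafka-configs.sh --bootstrap-server %s --alter --add-config %s=%s --entity-type users --entity-name %s --entity-type clients --entity-name %s\n' \
    "${BOOTSTRAP:-localhost:9092}" "$3" "$4" "$1" "$2"
}

# Observed peak of 12 MB/s with 25% headroom gives a 15 MB/s quota.
raise_quota_cmd svc-account payments-app producer_byte_rate "$(new_quota 12000000 25)"
```

Printing the command first keeps the change reviewable, and the same helper works for consumer_byte_rate.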

Tradeoff: Higher quotas reduce the cluster’s ability to absorb sudden traffic spikes from that client. Monitor BytesInPerSec after the change.

Throttle a backfill consumer

When a consumer is reading historical data and evicting page cache, set a consumer_byte_rate quota to limit its fetch rate. This protects tail-consumer latency for the rest of the cluster.

Tradeoff: The backfill takes longer, but the rest of the cluster remains stable.
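To size that tradeoff, a back-of-the-envelope estimate helps. Kafka enforces byte-rate quotas per broker, so the sketch below multiplies the quota by the number of brokers serving the consumer's partitions. The helper name `backfill_hours`, the backlog size, and the `backfill-client` id are illustrative assumptions, and the estimate ignores new data arriving during the backfill.

```shell
# backfill_hours BACKLOG_BYTES QUOTA_BPS BROKERS
# Rough time to read a backlog when each broker caps the client at QUOTA_BPS.
backfill_hours() {
  awk -v b="$1" -v q="$2" -v n="$3" 'BEGIN { printf "%.1f\n", b / (q * n) / 3600 }'
}

# 360 GB backlog, 10 MB/s quota, partitions spread over 5 brokers:
backfill_hours 360000000000 10000000 5   # prints 2.0

# The quota itself could then be applied with (client-id is a placeholder):
#   kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#     --add-config 'consumer_byte_rate=10000000' \
#     --entity-type clients --entity-name backfill-client
```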

Starve a runaway producer

If a producer is stuck in a retry loop and overwhelming the broker, lower its producer_byte_rate quota to reduce its throughput, then stop the process. Do not rely on quota cuts alone to break client-side retries.

Tradeoff: The producer will be throttled harder until you restore the quota or stop it. Use this only while you fix the underlying broker issue or terminate the faulty client.

Stop the abusive client

If the client is misconfigured or malicious, terminate the application. Throttling is a safety net, not a substitute for killing a runaway process.

Warning: Terminating the application is disruptive. Verify the client-id with connection logs or kafka-configs.sh before acting.

Prevention

  • Set default quotas. Configure default user and client quotas so new clients cannot overwhelm the cluster on their first connection.
  • Monitor per-client byte rates before they hit limits. Track BytesInPerSec and BytesOutPerSec per topic to spot growth trends.
  • Alert on throttle-time. Any sustained throttle-time above zero for tier-1 services should create a ticket.
  • Review quotas during capacity planning. When a service is expected to double traffic, adjust quotas before the deploy.
  • Require client-id registration. Unrecognized client-ids that suddenly report high throttle-time are easier to investigate when every id has an owner.
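For the first bullet, default quotas can be set with `kafka-configs.sh` and `--entity-default`. The sketch below prints the commands for review instead of executing them. The 10 MiB/s and 20 MiB/s values and the helper name `default_quota_cmd` are placeholders, not recommendations.

```shell
# Assumed values: 10 MiB/s produce and 20 MiB/s fetch per broker; tune to your cluster.
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"
PRODUCE_BPS=$((10 * 1024 * 1024))
FETCH_BPS=$((20 * 1024 * 1024))

# default_quota_cmd ENTITY_TYPE   (ENTITY_TYPE is "users" or "clients")
# Prints (does not run) the command that sets default byte-rate quotas.
default_quota_cmd() {
  printf 'kafka-configs.sh --bootstrap-server %s --alter --add-config producer_byte_rate=%s,consumer_byte_rate=%s --entity-type %s --entity-default\n' \
    "$BOOTSTRAP" "$PRODUCE_BPS" "$FETCH_BPS" "$1"
}

default_quota_cmd users
default_quota_cmd clients
```

Specific overrides for a named user or client-id take precedence over these defaults, so known heavy services can still be given higher limits.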

How Netdata helps

  • Correlates throttle-time with per-broker RequestHandlerAvgIdlePercent and RequestQueueSize to distinguish quota enforcement from broker saturation.
  • Surfaces per-topic BytesInPerSec and BytesOutPerSec to pinpoint which topic drives quota violations.
  • Tracks consumer lag and consumer group state to detect backfill behavior before fetch quotas max out.
  • Exposes aggregate request latency breakdowns so you can see whether throttling elevates ResponseQueueTimeMs or ResponseSendTimeMs.
  • Provides cluster-wide throttle-time visibility without manual JMXterm queries during an incident.