Kafka NetworkProcessorAvgIdlePercent low: network thread saturation and TLS overhead
NetworkProcessorAvgIdlePercent drops on one or more brokers. An alert may fire when the value falls below 0.3, or producers and consumers may time out while broker CPU looks fine and I/O threads are not obviously saturated. Network thread saturation blocks all socket activity, including metadata requests, so every client suffers. With the default num.network.threads set to 3, brokers running TLS and dense client pools are especially vulnerable. Sustained values below 0.1 make the broker effectively unreachable even though the process is still running.
What this means
Kafka uses a small pool of network threads to accept connections, read requests from sockets, and write responses back to clients. When NetworkProcessorAvgIdlePercent drops, those threads are spending most of their time doing work and cannot keep up with socket I/O. Because Kafka uses a reactor pattern where network threads handle the wire protocol before handing off to I/O handlers, saturation here means the broker cannot accept new connections or drain completed responses fast enough. This often manifests as client timeouts and metadata request failures before you see any backlog in the request queue.
This is distinct from RequestHandlerAvgIdlePercent, which measures the I/O handler threads that process the actual produce and fetch logic. Network thread saturation happens before requests ever reach the handlers, so the request queue may look normal while clients experience connection timeouts and elevated latency. The metric is exposed through JMX as kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent and ranges from 0.0 (fully saturated) to 1.0 (fully idle). Healthy brokers usually idle above 0.5. Below 0.3 you should investigate; below 0.1 the broker is in critical distress.
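The thresholds above can be sketched as a small shell helper. This is only an illustration: the function name `classify_idle` and the band labels (including the intermediate "watch" band between 0.3 and 0.5) are assumptions of this sketch, not Kafka terminology.

```shell
# classify_idle VALUE
# Maps a NetworkProcessorAvgIdlePercent reading (0.0 to 1.0) onto the
# bands described in the text. Band names are illustrative only.
classify_idle() {
  awk -v v="$1" 'BEGIN {
    if (v < 0.1)      print "critical"
    else if (v < 0.3) print "investigate"
    else if (v < 0.5) print "watch"
    else              print "healthy"
  }'
}

# Example with made-up readings:
for reading in 0.72 0.41 0.22 0.05; do
  printf '%s -> %s\n' "$reading" "$(classify_idle "$reading")"
done
```

In practice the argument would come from whatever tool reads the JMX value on your brokers.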
flowchart TD
A[Idle percent < 0.3] --> B{One broker or all?}
B -->|One| C[Check connections and auth]
B -->|All| D[Check deployments and TLS]
C --> E{ResponseQueueTimeMs high?}
D --> E
E -->|Yes| F[Response bottleneck:<br/>slow clients or large fetches]
E -->|No| G[Accept bottleneck:<br/>connection storm or TLS]
F --> H[Tune fetch bytes or isolate consumers]
G --> I[Increase network threads or offload TLS]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| TLS handshake overhead | Low idle percent paired with elevated CPU; spike after cert rotation or client restart | Per-core CPU via mpstat; connection count relative to num.network.threads |
| Connection storm | Connection count jumps 2x or more above baseline; all network processors drop together | ss connection count; broker logs for reconnect bursts |
| Large fetch responses | BytesOutPerSec spikes; ResponseQueueTimeMs rises; consumers configured with large max.partition.fetch.bytes | Consumer configs and per-topic outbound traffic |
| Slow client readers | ResponseSendTimeMs high; response queue growing; specific listeners or IPs lagging | Response latency breakdown and per-listener metrics |
| Undersized network thread pool | Idle percent chronically low during peak hours with no acute event; default of 3 threads on a multicore host | num.network.threads in server.properties versus core count |
Quick checks
Use these read-only commands to confirm scope and likely cause.
# Confirm network thread idle percent
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999
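# Check active connection count (run as root so ss can see the owning process;
# assumes a single Kafka broker process on the host)
ss -tnp | grep -c "pid=$(pgrep -f kafka.Kafka | head -n 1),"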
# Inspect response queue pressure
echo "get -b kafka.network:type=RequestChannel,name=ResponseQueueSize Value" | java -jar jmxterm.jar -l localhost:9999
# Check outbound traffic for large fetch bursts
echo "get -b kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec OneMinuteRate" | java -jar jmxterm.jar -l localhost:9999
# Review CPU utilization for TLS overhead
mpstat -P ALL 1 3
# Check network interface counters (cumulative; sample twice to derive throughput)
cat /proc/net/dev
# Verify configured network thread count (no output means the default of 3 applies)
grep -E '^num\.network\.threads' /etc/kafka/server.properties
How to diagnose it
- Validate the metric and scope. Check NetworkProcessorAvgIdlePercent on every broker. If only one broker is affected, look for a hot spot such as leadership imbalance, a noisy client, or local NIC issues. If the cluster is affected, look for a systemic trigger such as a deployment, certificate rotation, or consumer group restart.
- Compare with I/O thread idle percent. Read RequestHandlerAvgIdlePercent. If it is healthy (above 0.3) while network threads are saturated, the bottleneck is strictly in socket I/O, not request processing.
- Check connection count against baseline. A sudden doubling of connections suggests a connection storm. Use ss to count TCP connections to the broker process and compare to the baseline from the same time window on previous days.
- Inspect the response path. Read ResponseQueueSize, ResponseQueueTimeMs, and ResponseSendTimeMs. If the response queue is growing and send time is high, network threads are busy writing large or numerous responses, often because of slow consumers or oversized fetch configs.
- Look for TLS and authentication load. If TLS is enabled, check CPU per core. TLS handshakes are CPU-intensive and run on network threads. Bursts of AuthenticationException in broker logs can signal a re-authentication storm that compounds saturation.
- Correlate with consumer behavior. Check if any consumer group changed configuration recently, especially max.partition.fetch.bytes. Large fetches inflate response sizes and keep network threads busy longer.
- Review broker logs. Look for connection errors or rapid disconnect and reconnect cycles that indicate a client behaving badly or a load balancer health check storm.
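The first few steps can be condensed into a rough triage helper. This is a sketch under stated assumptions: the function name `triage`, its argument order, and the 2x factor applied to send time are hypothetical choices for illustration, and the inputs are values you have already collected from JMX and ss.

```shell
# triage NET_IDLE HANDLER_IDLE CONN_NOW CONN_BASELINE SEND_MS SEND_BASELINE_MS
# Prints one hint per matched condition. The 0.3 thresholds and the 2x
# connection factor follow the text; the 2x send-time factor is assumed.
triage() {
  awk -v net="$1" -v io="$2" -v conn="$3" -v conn_base="$4" \
      -v send="$5" -v send_base="$6" 'BEGIN {
    if (net >= 0.3) { print "network threads not saturated"; exit }
    if (io > 0.3) print "bottleneck is socket I/O, not request handling"
    else          print "request handlers are also busy: check I/O threads"
    if (conn_base > 0 && conn >= 2 * conn_base)
      print "possible connection storm"
    if (send_base > 0 && send >= 2 * send_base)
      print "response path is slow: check consumers and fetch sizes"
  }'
}

# Example with made-up numbers: saturated network threads, healthy
# handlers, connections at more than twice the baseline.
triage 0.2 0.7 900 400 5 5
```

The output is a starting point for the remaining steps, not a verdict; log review and consumer config changes still need a human look.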
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| NetworkProcessorAvgIdlePercent | Direct measure of network thread saturation | Sustained below 0.3; critical below 0.1 |
| ResponseQueueSize | Shows responses waiting to be sent by network threads | Consistently above baseline or trending up |
| ResponseQueueTimeMs / ResponseSendTimeMs | Reveals whether the bottleneck is queuing or actual socket writes | p99 spikes above normal baseline |
| Connection count | Each connection consumes thread time and file descriptors | Sustained 2x above baseline |
| BytesOutPerSec | Large or bursty egress keeps network threads busy | Sustained above 70% of NIC capacity |
| RequestHandlerAvgIdlePercent | Rules in or out I/O thread saturation as the root cause | Healthy while network threads are low confirms the layer |
| CPU utilization per core | TLS handshakes burn CPU on network threads | Sustained above 80% with low network idle |
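"Sustained" in the table above implies looking at several consecutive samples rather than a single reading. A minimal sketch of that check, assuming one sample per line on stdin; the function name `sustained_below` and the choice of three consecutive samples in the example are illustrative, not a recommended window.

```shell
# sustained_below THRESHOLD COUNT
# Reads one idle-percent sample per line on stdin and prints "alert" if
# COUNT consecutive samples fall below THRESHOLD, otherwise "ok".
sustained_below() {
  awk -v t="$1" -v n="$2" '
    NF == 0    { next }                              # skip blank lines
    $1 + 0 < t { run++; if (run >= n) hit = 1; next }
               { run = 0 }                           # streak broken
    END        { print (hit ? "alert" : "ok") }
  '
}

# Example: five made-up samples, alert on three in a row below 0.3.
printf '%s\n' 0.45 0.28 0.25 0.19 0.40 | sustained_below 0.3 3
```

A monitoring system would normally handle this windowing for you; the sketch only makes the logic explicit.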
Fixes
Increase num.network.threads. The default of 3 is often too low for production clusters using TLS or high connection counts. A common guideline is to set num.network.threads to roughly half the number of CPU cores. Do not raise network threads without ensuring num.io.threads and queued.max.requests can absorb the increased handoff; otherwise you will shift saturation to the I/O layer. A change made in server.properties takes effect only after a broker restart; Kafka 1.1 and later can also resize the pool at runtime as a dynamic broker config through kafka-configs.sh, within limits relative to the current size. More threads consume additional memory for socket buffers and thread stacks, so monitor heap and file descriptors after the change.
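The "half the cores" guideline is simple arithmetic, sketched below. The function name `suggest_network_threads` is hypothetical, and the floor of 3 (Kafka's default) is an assumption of this sketch rather than part of the guideline.

```shell
# suggest_network_threads CORES
# Applies the rough "half the CPU cores" guideline, never suggesting
# fewer than the default of 3 (the floor is an assumption of this sketch).
suggest_network_threads() {
  suggested=$(( $1 / 2 ))
  if [ "$suggested" -lt 3 ]; then
    suggested=3
  fi
  echo "$suggested"
}

suggest_network_threads 16   # prints 8
suggest_network_threads 4    # prints 3
```

On a Linux broker host, `suggest_network_threads "$(nproc)"` would feed in the local core count. Treat the result as a starting point to validate under load, not a target.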
Reduce fetch response sizes. If large consumer fetches are the trigger, lower max.partition.fetch.bytes on consumers. The tradeoff is more fetch requests, which adds slight per-request overhead, but it prevents network threads from being monopolized by megabyte-scale responses.
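To see how much a smaller max.partition.fetch.bytes could shrink responses, a rough upper bound can be computed. This sketch assumes the response is also capped by the consumer setting fetch.max.bytes, which is not mentioned above, and it ignores the case where a single oversized record batch is returned anyway. The function name is hypothetical.

```shell
# max_fetch_response_bytes PARTITIONS MAX_PARTITION_FETCH_BYTES FETCH_MAX_BYTES
# Rough upper bound for one fetch response: the per-partition limit times
# the partitions fetched from this broker, capped by the per-request limit.
max_fetch_response_bytes() {
  per_partition_total=$(( $1 * $2 ))
  if [ "$per_partition_total" -gt "$3" ]; then
    echo "$3"
  else
    echo "$per_partition_total"
  fi
}

# 20 partitions at 1 MiB each versus 256 KiB each, both under a
# 50 MiB per-request cap (all numbers are example inputs).
max_fetch_response_bytes 20 1048576 52428800   # prints 20971520
max_fetch_response_bytes 20 262144 52428800    # prints 5242880
```

In this example the worst-case response drops from about 20 MiB to about 5 MiB, at the cost of more fetch round trips.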
Tune or offload TLS. If CPU and network idle percent move together, TLS handshakes are likely the culprit. Ensure you are using the most efficient TLS implementation available for your Kafka version. Some modern configurations allow moving handshake work off the network threads; enable that if supported.
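Break connection storms. Identify and restart misbehaving clients that are reconnecting in a tight loop. If a load balancer is health-checking aggressively, tune its interval. Note that Kafka client quotas throttle bandwidth and request time, not connection rate, so they will not stop a reconnection storm. Broker-level limits such as max.connections.per.ip, and max.connection.creation.rate on Kafka versions that support it, are the settings that bound connection counts and connection rate.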
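One way to find the clients behind a storm is to group connections by peer IP. The helper below is a sketch: the name `top_peers` is hypothetical, and it assumes input in the default `ss -tn` column layout without a header, where the peer address:port is the fifth column.

```shell
# top_peers [N]
# Reads ss-style lines on stdin (peer address:port in column 5) and
# prints the N busiest peer IPs with their connection counts (default 5).
top_peers() {
  awk '{ peer = $5; sub(/:[0-9]+$/, "", peer); count[peer]++ }
       END { for (p in count) print count[p], p }' |
    sort -k1,1nr -k2,2 | head -n "${1:-5}"
}

# Example with made-up input lines:
printf '%s\n' \
  'ESTAB 0 0 10.0.0.5:9092 10.0.1.17:51234' \
  'ESTAB 0 0 10.0.0.5:9092 10.0.1.17:51240' \
  'ESTAB 0 0 10.0.0.5:9092 10.0.2.8:40022' | top_peers 2
```

On a broker this might be fed from something like `ss -Htn '( sport = :9092 )' | top_peers`, where 9092 is an assumed listener port and `-H` requires a reasonably recent iproute2. Avoid adding a `state` filter, since ss then omits the state column and the peer is no longer in column 5.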
Address slow consumers. If ResponseSendTimeMs is high, the broker is waiting for clients to read data. Fix or restart the slow consumers. If backpressure is insufficient, consider reducing broker socket send buffer sizes so threads unblock faster, at the cost of higher per-fetch overhead.
Prevention
Monitor NetworkProcessorAvgIdlePercent and alert when it drops below 0.3. Establish a baseline for connection count per broker and alert on significant deviations. If you run TLS, size num.network.threads during provisioning rather than waiting for saturation. Review consumer configurations before deployment to prevent accidentally large fetch sizes. Keep authentication credential rotations staged so clients do not all reconnect simultaneously. During rolling restarts, monitor for reconnect storms and be prepared to increase thread counts preemptively if connection counts are already high. Include network thread saturation in game day scenarios, especially for TLS-enabled clusters, so you know the threshold at which your broker size becomes insufficient.
How Netdata helps
- Charts NetworkProcessorAvgIdlePercent alongside RequestHandlerAvgIdlePercent, connection count, CPU per core, and NIC throughput so you can distinguish network thread saturation from I/O or disk bottlenecks.
- Baselines connection count per broker automatically and highlights deviations that precede saturation.
- Correlates broker-side ResponseQueueTimeMs and BytesOutPerSec spikes with consumer fetch latency to pinpoint whether large responses or slow readers are the root cause.
- Alerts on sustained NetworkProcessorAvgIdlePercent values below 0.3 and critical drops below 0.1 with context from adjacent signals.
Related guides
- How Kafka actually works in production: a mental model for operators: /guides/kafka/how-kafka-works-in-production/
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts: /guides/kafka/kafka-commit-failed-exception/
- Kafka consumer group stuck Empty or Dead: no members consuming: /guides/kafka/kafka-consumer-group-empty-stuck/
- Kafka consumer group lag growing: detection, lag-as-time, and root causes: /guides/kafka/kafka-consumer-group-lag-growing/
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors: /guides/kafka/kafka-consumer-group-rebalancing-frequently/
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms: /guides/kafka/kafka-consumer-rebalance-storm/
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata: /guides/kafka/kafka-controller-event-queue-backup/
- Kafka ISR shrinking: IsrShrinksPerSec, flapping, and the cascade to offline: /guides/kafka/kafka-isr-shrink-storm/
- Kafka KRaft metadata log lag: standby controllers and brokers falling behind: /guides/kafka/kafka-kraft-metadata-log-lag/
- Kafka KRaft quorum has no leader: current-leader = -1 and frozen metadata: /guides/kafka/kafka-kraft-quorum-no-leader/
- Kafka LeaderElectionRateAndTimeMs spiking: election storms and slow elections: /guides/kafka/kafka-leader-election-rate-high/
- Kafka LEADER_NOT_AVAILABLE: causes during elections, restarts, and topic creation: /guides/kafka/kafka-leader-not-available/







