Kafka connection storms: connection-count spikes, FD pressure, and network threads

Connection count jumps on one or more brokers. An alert fires for open file descriptors, or consumers report timeouts while broker logs show nothing dramatic. A connection storm in Kafka is insidious: every TCP connection costs one file descriptor plus network-thread time, and under the reactor architecture, saturated network threads delay everything, including metadata requests.

Incremental fetch sessions keep consumer connections open even when idle. This is normal, but it means brokers run with a high steady-state connection count. The danger is the rate of change, not the absolute number. When connections spike above twice the established baseline and either network threads drop below 30% idle or FD usage approaches the process limit, the broker is in a connection storm.

This article shows how to distinguish idle-consumer connections from a real storm, diagnose whether the bottleneck is network threads or FD exhaustion, and fix it without restarting the cluster.

What this means

Kafka uses a reactor pattern. A small pool of network threads (num.network.threads, default 3 per listener) handles all socket I/O: registering connections handed over by the listener's acceptor thread, reading requests, and writing responses. Each connection consumes one FD. Inter-broker traffic alone creates (N-1) connections per listener on every broker. Producers, consumers, Connect workers, and admin clients add to the total.
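The connection arithmetic above can be turned into a quick estimate. The sketch below is illustrative only: the function names and the example figures are assumptions, and it simply applies the (N-1)-per-listener rule of thumb and the 2x-baseline threshold used in this article.

```shell
# Rough steady-state connection estimate for one broker, following the
# (N-1)-per-listener rule of thumb from the text. All inputs are illustrative.
# Usage: expected_connections BROKERS INTERBROKER_LISTENERS CLIENT_CONNECTIONS
expected_connections() {
  brokers=$1
  listeners=$2
  clients=$3
  echo $(( (brokers - 1) * listeners + clients ))
}

# Connection count at which the article's 2x-baseline rule would flag a storm.
# Usage: storm_threshold BASELINE
storm_threshold() {
  echo $(( 2 * $1 ))
}
```

For a hypothetical 6-broker cluster with one inter-broker listener and about 400 client connections per broker, expected_connections 6 1 400 prints 405 and storm_threshold 405 prints 810. A measured baseline is still the better reference; this only gives a sanity check for it.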

A connection storm happens when the aggregate connection cost exceeds network-thread capacity or the OS FD limit. Saturation does not only slow produce or fetch requests. It also delays metadata responses, which triggers client-side reconnection loops that worsen the storm. Once threads are saturated, backpressure spreads to request and response queues, and clients see latency before any explicit broker error.

flowchart TD
    A[Connection count spikes to over 2x baseline] --> B[FD usage climbs]
    A --> C[NetworkProcessorAvgIdlePercent drops]
    C --> D[Request and response queues grow]
    D --> E[Client timeouts trigger retries]
    E --> A
    B --> F[Broker cannot open new log segments or accept connections]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Client connection leak | Connection count grows monotonically; low request rate per connection | ss -tn showing many connections from a single client IP |
| Mass client restart | Step jump in connections correlated with a deployment; may recover if threads can keep up | Connection count across all brokers spikes simultaneously |
| TLS handshake overhead | NetworkProcessorAvgIdlePercent drops even though connection count looks normal; CPU also elevated | Listener type (SSL vs PLAINTEXT) and num.network.threads |
| Low OS FD limit | OpenFileDescriptorCount near limit; broker logs show Too many open files | /proc/{pid}/limits for Max open files |
| Inter-broker listener churn | Elevated connections on internal listeners only; correlated with broker restart or network partition | Per-listener connection counts |

Quick checks

# Find Kafka PID (adjust pattern if your process line differs)
KAFKA_PID=$(pgrep -f 'kafka\.Kafka')

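# Count TCP connections owned by the broker (run as root or the broker's user so ss can show process info)
ss -tnp | grep -c "pid=$KAFKA_PID,"

# Breakdown by TCP state. SYN-RECV and TIME-WAIT sockets carry no PID,
# so filter by listener port instead (9092 is an assumed port; adjust to yours)
ss -tan 'sport = :9092' | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# Count open file descriptors for the broker process
ls /proc/"$KAFKA_PID"/fd | wc -l

# Check the effective limits (the soft limit is enforced; the hard limit is the ceiling)
grep "Max open files" /proc/"$KAFKA_PID"/limits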

# Network thread idle percent via JMX
echo "get -b kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent Value" | java -jar jmxterm.jar -l localhost:9999 -n -v silent

JMX exposes connection-count per listener and per network processor under kafka.server:type=socket-server-metrics. Aggregate across all matching beans to get the broker total.
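As a minimal sketch of that aggregation, the helper below sums connection-count values read from standard input. The function name is hypothetical, and the input format is an assumption: one line per bean in the form "connection-count = 12;", which is how a JMX command-line client typically prints an attribute. Adjust the pattern to whatever your tool emits.

```shell
# Sum per-bean connection-count values into a broker total.
# Assumed input: one line per bean, e.g. "connection-count = 12;".
# Lines that do not mention connection-count are ignored.
sum_connection_counts() {
  awk -F'=' '
    /connection-count/ {
      gsub(/[ ;]/, "", $2)   # strip spaces and the trailing semicolon
      total += $2
    }
    END { print total + 0 }  # print 0 when nothing matched
  '
}
```

Feed it the collected output of one attribute read per listener and network processor bean, then compare the result with the ss count from the quick checks.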

How to diagnose it

  1. Confirm the spike is abnormal. Incremental fetch sessions keep connections open for idle consumers. A high connection count with low request rate is normal. Compare the current count to your 7-day baseline. A storm is typically greater than 2x baseline.
  2. Correlate with network thread idle percent. If NetworkProcessorAvgIdlePercent is sustained below 0.3, the storm is actively saturating the broker. If idle percent is healthy, the connections are mostly idle and the immediate risk is FD exhaustion, not thread saturation.
  3. Check FD pressure. If OpenFileDescriptorCount exceeds 75% of MaxFileDescriptorCount, the broker is heading for a cliff-edge failure. When the limit is hit, the broker cannot accept connections or open new log segment files.
  4. Separate listeners. Use the per-listener connection-count to determine whether the load is from clients or inter-broker replication. Inter-broker connections scale as (N-1) per listener and should be stable.
  5. Check request and response queues. If RequestQueueSize or ResponseQueueSize is growing while network threads are saturated, the reactor is backing up. Clients will see elevated latency before explicit errors.
  6. Inspect broker logs and connection states. Look for Too many open files, SocketServer accept errors, or TLS handshake timeouts. If threads are saturated but FDs are fine, check for slow clients that do not read responses, or asymmetric routing that leaves half-open connections. Half-open connections still consume an FD and thread until the broker-side socket timeout fires. A high count of SYN-RECV in ss output indicates the broker cannot accept new connections fast enough, which points to thread saturation rather than an FD limit.
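Steps 1 to 3 can be sketched as a small triage helper. This is an illustration under the thresholds stated above (2x baseline, 0.3 idle, 75% of the FD limit), not a definitive rule set; the function name and the output labels are made up for this example.

```shell
# Classify a broker snapshot using the thresholds from steps 1-3.
# Usage: classify_storm CONNECTIONS BASELINE IDLE_RATIO FD_USED FD_LIMIT
#   IDLE_RATIO is NetworkProcessorAvgIdlePercent on its 0-1 scale.
classify_storm() {
  awk -v c="$1" -v b="$2" -v idle="$3" -v fd="$4" -v max="$5" 'BEGIN {
    if (c <= 2 * b) { print "within-baseline"; exit }
    sat = (idle < 0.3)          # network threads saturated
    fdp = (fd > 0.75 * max)     # FD usage above 75% of the limit
    if (sat && fdp) print "thread-saturation+fd-pressure"
    else if (sat)   print "thread-saturation"
    else if (fdp)   print "fd-pressure"
    else            print "mostly-idle-connections"
  }'
}
```

For example, classify_storm 5000 2000 0.12 20000 100000 prints thread-saturation, which points at the network-thread fix rather than the FD limit.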

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| kafka.server:type=socket-server-metrics connection-count (aggregated) | Each connection costs an FD and thread time. | Sustained over 2x baseline |
| kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | Direct measure of network thread saturation. | Sustained below 0.3 |
| java.lang:type=OperatingSystem OpenFileDescriptorCount | FD exhaustion is a hard failure. | Above 75% of limit |
| kafka.network:type=RequestChannel,name=RequestQueueSize | Backpressure between network and I/O threads. | Consistently above 250 (50% of default 500 max) |
| kafka.network:type=RequestChannel,name=ResponseQueueSize | Network threads cannot drain responses fast enough. | Sustained growth above baseline |
| OS ss -tnp count | Ground-truth check against JMX aggregation gaps. | Large delta from JMX aggregate |

Fixes

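Network thread saturation. If NetworkProcessorAvgIdlePercent is below 0.3 and the listener uses SSL/TLS, the default num.network.threads=3 is likely too low. Increase it, keeping in mind that more network threads consume more CPU and that TLS handshakes are expensive. On Kafka 1.1 and later, num.network.threads is a dynamically updatable, cluster-wide broker config, so no restart is needed: for example, kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-default --alter --add-config num.network.threads=6 (the address and value are placeholders). Each dynamic update is limited to between half and double the current value, so larger changes take several steps. A value changed only in server.properties still needs a rolling restart to take effect. If the spike follows a consumer group coordinator migration, also verify client session.timeout.ms and heartbeat.interval.ms; aggressive timeouts cause faster rejoins and connection churn.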

FD limit exhaustion. Production brokers should have ulimit -n set to at least 100,000. If you are in a container, verify that the runtime limit matches the host expectation. Docker, containerd, and systemd impose their own ceilings. Check /proc/{pid}/limits to confirm the effective limit, not just /etc/security/limits.conf. Changing a systemd LimitNOFILE value requires daemon-reload and a service restart to take effect. If the broker is already near the limit, raising the ceiling does not close existing connections; you still need to address the source of the spike.
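To compare usage against the effective limit rather than a configured one, a sketch along these lines may help. The function names are hypothetical; the parsing assumes the usual /proc/{pid}/limits layout, where the soft limit is the fourth whitespace-separated field of the Max open files row.

```shell
# Extract the soft "Max open files" limit from /proc/<pid>/limits content on stdin.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }'
}

# Print FD usage as a whole-number percentage of the limit.
# Usage: fd_usage_pct USED LIMIT
fd_usage_pct() {
  used=$1
  limit=$2
  echo $(( used * 100 / limit ))
}

# Example wiring on a broker host (KAFKA_PID as set in the quick checks):
#   limit=$(soft_nofile_limit < /proc/"$KAFKA_PID"/limits)
#   used=$(ls /proc/"$KAFKA_PID"/fd | wc -l)
#   fd_usage_pct "$used" "$limit"
```

A result above 75 matches the warning threshold used in this article.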

Connection leaks. If one client IP or PID dominates ss -tnp, the application is likely opening connections without closing them. This is common with short-lived admin clients or misconfigured proxies. Fix the client. Do not just reboot the broker.
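One way to see whether a single client dominates is to group established connections by peer IP. The sketch below parses ss -tn output from standard input; the function name is hypothetical, and port 9092 in the usage line is an assumed client listener port.

```shell
# List the top 10 peer IPs by established connections to one listener port.
# Reads `ss -tn` output on stdin. Usage: ss -tn | top_peers 9092
top_peers() {
  port=$1
  awk -v port="$port" '
    $1 == "ESTAB" && $4 ~ (":" port "$") {
      peer = $5
      sub(/:[0-9]+$/, "", peer)   # drop the ephemeral client port
      count[peer]++
    }
    END { for (p in count) print count[p], p }
  ' | sort -rn | head -n 10
}
```

If one address holds a large share of the total while its request rate stays low, that client is the first candidate for a leak.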

Mass reconnect events. If a deployment or coordinator change causes all clients to reconnect simultaneously, stagger the client restart or increase the broker’s connection headroom. If you cannot control client rollout timing, ensure num.network.threads can absorb at least a 50% connection spike.
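Staggering can be as simple as restarting clients in fixed-size batches with a pause between them. The helper below is a hypothetical sketch of that schedule; the batch size and gap are values you would choose, not recommendations from Kafka.

```shell
# Seconds to wait before restarting client instance INDEX (0-based) when
# restarting BATCH instances at a time with GAP seconds between batches.
# Usage: restart_delay INDEX BATCH GAP
restart_delay() {
  index=$1
  batch=$2
  gap=$3
  echo $(( index / batch * gap ))
}
```

With batches of 5 and a 30-second gap, instances 0 to 4 restart immediately, instances 5 to 9 wait 30 seconds, and so on, which spreads the reconnects instead of landing them on the brokers at once.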

Prevention

  • Baseline connections per broker. Document inter-broker (N-1) per listener, steady-state clients, and idle consumers. Know your normal before the incident.
  • Monitor FD usage as a percentage of limit. Absolute counts are meaningless without the limit context. Alert when usage exceeds 75%.
  • Size network threads for SSL and peak connection count. If you use TLS, treat num.network.threads=3 as a development default. Production brokers with many SSL clients often need 6 to 12 or more.
  • Isolate inter-broker traffic on a dedicated listener. If client and replication traffic share a listener, a client storm starves partition replication and metadata propagation.
  • Track connection-count per listener separately. Client listeners should be stable; inter-broker listeners should hold steady at roughly (N-1) connections per listener per broker. Deviation on the internal listener points to broker churn or network partitions.
  • Alert on connection-count rate of change, not just absolute value. Idle consumers after a weekend deployment can legitimately double connections. A sudden 10x spike in 60 seconds is the storm.
  • Watch for stale connections after network blips. Asymmetric routing or firewall drops can leave connections in ESTABLISHED on the broker after the client is gone. These still hold FDs and thread registration until TCP retransmits time out or the broker idle timeout (connections.max.idle.ms, 10 minutes by default) fires. If they accumulate, the broker bleeds capacity even without a true storm.
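The rate-of-change alert from the list above can be sketched as a comparison between two samples taken one interval apart. The function name is hypothetical, and the default factor of 10 comes from the 10x-in-60-seconds example; tune both the factor and the sampling interval to your own baseline.

```shell
# Succeed (exit 0) when the current sample is at least FACTOR times the
# previous one. Counts are whole numbers; FACTOR defaults to 10.
# Usage: is_connection_spike PREVIOUS CURRENT [FACTOR]
is_connection_spike() {
  prev=$1
  cur=$2
  factor=${3:-10}
  [ "$prev" -gt 0 ] && [ "$cur" -ge $(( prev * factor )) ]
}
```

For example, with samples 60 seconds apart, is_connection_spike 500 5200 succeeds, while is_connection_spike 500 1000 does not, because a doubling alone can be legitimate idle-consumer growth.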

How Netdata helps

  • Charts connection-count, NetworkProcessorAvgIdlePercent, and OpenFileDescriptorCount together so you can see which resource becomes the bottleneck first.
  • Per-second resolution detects connection storms that vanish between slower JMX scraping intervals.
  • OS-level TCP and FD charts validate JMX aggregates. A gap between the two often means a connection leak outside the Kafka listener stack.
  • Baseline-aware alerts fire on deviation from normal connection count rather than static thresholds, reducing false positives from incremental fetch session behavior.