Kafka ‘Broker may not be available’: clients that can’t connect or stay connected
Client logs show a warning like this:
WARN [Producer clientId=...] Connection to node -1 could not be established. Broker may not be available.
Python clients may raise kafka.errors.NoBrokersAvailable, while librdkafka-based clients report Connection refused against the broker address. The bootstrap server is often reachable; ping and telnet succeed, yet the client still fails. This happens because Kafka clients use the bootstrap connection only to fetch metadata. After that, they disconnect and try to open fresh TCP connections to the host:port pairs advertised in the metadata response. If those endpoints are unreachable, misconfigured, or secured differently than the client expects, the connection fails even though the bootstrap succeeded.
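The node ID in the Java client warning hints at which hop failed. The sketch below assumes the Java client convention that negative node IDs (-1, -2, ...) are placeholders for entries in bootstrap.servers, while non-negative IDs are brokers learned from a metadata response. The helper name failed_hop is hypothetical, and other client libraries may log differently.

```python
import re

# Matches the Java client warning shown above; the node ID may be negative.
_NODE_RE = re.compile(r"Connection to node (-?\d+)")


def failed_hop(log_line: str) -> str:
    """Return 'bootstrap', 'advertised', or 'unknown' for a client warning line."""
    match = _NODE_RE.search(log_line)
    if match is None:
        return "unknown"
    node_id = int(match.group(1))
    # Assumption: negative IDs stand for bootstrap.servers entries,
    # non-negative IDs for brokers taken from the metadata response.
    return "bootstrap" if node_id < 0 else "advertised"
```

If the result is "bootstrap", start with the bootstrap address itself. If it is "advertised", the metadata fetch worked and the advertised endpoint is the place to look.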
What this means
This is a client-side symptom of a failed second-hop connection. The bootstrap succeeded and the client received a metadata response containing broker endpoints. The direct connection to one of those brokers then failed. The failure can happen at multiple layers: TCP routing, DNS resolution, TLS handshake, SASL authentication, or because the target broker is genuinely down. In containerized environments, the most common root cause is an advertised.listeners configuration that returns an internal hostname or IP to external clients. The broker is healthy, but it is telling clients to connect to an address they cannot reach.
flowchart TD
A[Client sees Broker may not be available] --> B{Bootstrap host:port reachable?}
B -->|No| C[Fix client network or DNS]
B -->|Yes| D{kafka-broker-api-versions.sh works from client?}
D -->|No| E[Check bootstrap listener and firewall]
D -->|Yes| F{Advertised endpoint reachable from client?}
F -->|No| G[Fix advertised.listeners or NAT rules]
F -->|Yes| H{Broker logs show TLS or SASL errors?}
H -->|Yes| I[Fix handshake or auth config]
H -->|No| J[Check broker process health and load]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| advertised.listeners misconfiguration (Docker, K8s, NAT) | Client receives metadata but cannot resolve or reach the advertised host:port. Connections work inside the container network but fail from outside. | kafka-broker-api-versions.sh from the client’s network. Compare the returned broker list against what the client can reach. |
| TLS or SASL handshake failure | TCP connection establishes but is immediately reset or hangs. Broker logs contain SSLHandshakeException or AuthenticationException. | Broker logs for handshake or authentication errors. Verify client and broker protocol versions and SASL mechanisms match. |
| Firewall or security group drop | TCP SYN to the advertised port times out. The bootstrap port is open, but the data port is not. | nc -zv or Bash /dev/tcp to the advertised host:port from the client host. |
| Broker process down or overloaded | Connection refused to the advertised port, or the broker appears in metadata but does not respond to API requests. | Broker process liveness, port binding with ss -tlnp, and NetworkProcessorAvgIdlePercent via JMX. |
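The "What it looks like" column can be roughly reproduced from the error a plain TCP connect raises. This is a minimal sketch using only the Python standard library. The function name classify_connect and its return labels are hypothetical, and the mapping is a heuristic: a firewall that rejects with a reset also shows up as "refused", and a successful TCP connect says nothing about TLS or SASL.

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and label the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "tcp-ok"        # TCP works; look at TLS/SASL or broker load next
    except socket.gaierror:
        return "dns-failure"       # advertised hostname does not resolve from here
    except ConnectionRefusedError:
        return "refused"           # nothing listening, or an active reject
    except socket.timeout:
        return "timeout"           # typical of a silent firewall or security group drop
    except OSError as exc:
        return f"other: {exc}"     # e.g. no route to host
```

Run it from the client host against the advertised host:port, not only against the bootstrap address.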
Quick checks
# TCP reachability to bootstrap
nc -zv <bootstrap-host> <bootstrap-port>
# Bash built-in alternative
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/<bootstrap-host>/<bootstrap-port>'
# Kafka protocol reachability from client
kafka-broker-api-versions.sh --bootstrap-server <bootstrap-host>:<bootstrap-port>
# Broker logs for listener and handshake errors
grep -E "advertised\.listeners|listeners=|ERROR.*SSLHandshakeException|ERROR.*AuthenticationException" /var/log/kafka/server.log
# Broker listening interfaces
ss -tlnp | grep java
# KRaft quorum health
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# File descriptor limits
cat /proc/$(pgrep -f kafka.Kafka)/limits | grep "Max open files"
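The file descriptor check can be turned into a headroom figure. The sketch below parses the text of /proc/<pid>/limits on Linux and compares the soft limit with an open descriptor count obtained elsewhere (for example by counting entries in /proc/<pid>/fd). The function names are hypothetical, and the column layout is assumed to match the usual Linux format.

```python
def _limit(value: str):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def max_open_files(limits_text: str):
    """Return (soft, hard) for 'Max open files', or None if the row is missing."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return _limit(soft), _limit(hard)
    return None


def fd_usage_ratio(open_fds: int, soft_limit) -> float:
    """Fraction of the soft limit in use; 0.0 when the limit is unlimited."""
    if not soft_limit:
        return 0.0
    return open_fds / soft_limit
```

A ratio close to 1.0 suggests the broker may soon be unable to accept new connections.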
How to diagnose it
- Test basic TCP connectivity to the bootstrap server using nc or Bash /dev/tcp. If this fails, investigate routing, DNS, or firewalls on both sides before changing broker configuration.
- Run kafka-broker-api-versions.sh --bootstrap-server <host>:<port> from the client machine. This performs a full Kafka protocol handshake. If it fails, the client cannot reach the Kafka protocol layer. This often indicates a firewall blocking the port or the broker listening on an interface the client cannot reach.
- If the API versions check succeeds, the cluster is reachable at the bootstrap address. The issue is likely the second hop. Inspect the broker’s advertised.listeners in server.properties or broker logs. Ensure the advertised host:port is resolvable and routable from the client’s network. In Docker or Kubernetes, the default advertised address is often the pod IP or container hostname, which external clients cannot resolve.
- Test TCP connectivity from the client to the specific advertised host:port that the client is complaining about. Use nc -zv <advertised-host> <advertised-port>. If this fails while the bootstrap works, you have a NAT, routing, or advertised.listeners mismatch.
- Check broker logs at /var/log/kafka/server.log for SSLHandshakeException, AuthenticationException, or SASL mechanism mismatch errors. If TCP connects but the connection drops during negotiation, the issue is at the security protocol layer, not the network layer.
- On the broker, verify the process is running and bound to the correct network interface. A common misconfiguration is listeners=PLAINTEXT://localhost:9092, which prevents external connections entirely. Use ss -tlnp to confirm the listening address.
- Check whether the broker is saturated. Even a running broker can refuse connections if network threads are exhausted. Query the JMX MBean kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. If the value is below 0.1 (10% idle), the broker is overloaded and may drop connections.
- In KRaft mode, verify quorum health with kafka-metadata-quorum.sh --bootstrap-server <host>:<port> describe --status. If there is no quorum leader, metadata may be stale and brokers may advertise incorrect endpoints.
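The advertised.listeners inspection in the steps above can be partly automated. The sketch below parses a listener string and flags entries that an external client is unlikely to reach: no explicit host, localhost, or a loopback or wildcard IP. The function names and reason strings are hypothetical. Hostnames pass through unflagged because only a DNS lookup from the client network can validate them.

```python
import ipaddress


def parse_listeners(value: str) -> dict:
    """Parse 'NAME://host:port,NAME2://host:port' into {name: (host, port)}."""
    listeners = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, _, address = entry.partition("://")
        host, _, port = address.rpartition(":")
        listeners[name] = (host.strip("[]"), int(port))  # strip IPv6 brackets
    return listeners


def suspicious_advertised(value: str) -> list:
    """Return (listener name, reason) pairs for addresses clients likely cannot use."""
    findings = []
    for name, (host, _port) in parse_listeners(value).items():
        if host in ("", "localhost"):
            findings.append((name, "no explicit host or localhost"))
            continue
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            continue  # a hostname: verify resolution from the client network instead
        if address.is_loopback or address.is_unspecified:
            findings.append((name, "loopback or wildcard address"))
    return findings
```

An empty result does not prove the endpoint is reachable. It only rules out the most obvious misconfigurations before the TCP test from the client side.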
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Authentication failures | Failed TLS or SASL handshakes often present as broker unavailability to clients. | Sustained AuthenticationException in broker logs. |
| Network Processor Average Idle Percent | Saturated network threads cannot accept new connections or complete handshakes. | Sustained value below 0.3 (30% idle). |
| Offline Partitions Count | Confirms whether a broker is genuinely down and its partitions are leaderless. | Nonzero value reported by the active controller. |
| Active Controller Count | Stale or missing metadata means clients may receive bad broker lists. | Cluster-wide sum is not exactly 1. |
| Connection Count | Approaching the file descriptor or thread limit causes silent connection refusal. | Count exceeds 2x baseline or nears the OS FD limit. |
| Failed produce requests | Distinguishes a pure connectivity issue from a protocol or authorization failure. | Sustained nonzero rate outside of rolling restarts. |
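The numeric warning signs in the table can be expressed as a simple threshold check. This is a sketch only: the dictionary keys are hypothetical names for values you would collect from JMX or your monitoring system, and the thresholds are the ones listed in the table, not universal limits.

```python
def warning_signs(sample: dict, baseline_connections: int) -> list:
    """Return the names of signals that cross the warning thresholds above.

    Hypothetical sample keys:
      network_idle        NetworkProcessorAvgIdlePercent as a fraction (0.0 to 1.0)
      offline_partitions  Offline Partitions Count from the active controller
      active_controllers  cluster-wide sum of Active Controller Count
      connections         current broker connection count
    """
    warnings = []
    if sample["network_idle"] < 0.3:
        warnings.append("network_idle")
    if sample["offline_partitions"] > 0:
        warnings.append("offline_partitions")
    if sample["active_controllers"] != 1:
        warnings.append("active_controllers")
    if sample["connections"] > 2 * baseline_connections:
        warnings.append("connections")
    return warnings
```

The table describes sustained conditions, so in practice the check would be applied over a time window, not to a single sample.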
Fixes
Fix advertised.listeners mismatches
Edit server.properties on each broker and set advertised.listeners to an address that clients in their respective networks can resolve and reach. In containerized environments, this usually means advertising the node IP, load balancer, or ingress hostname instead of the pod IP.
Use separate listeners for inter-broker and client traffic:
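listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://broker-1.internal:9092,EXTERNAL://kafka.example.com:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
# With custom listener names, name the inter-broker listener explicitly
inter.broker.listener.name=INTERNAL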
Map each listener name to the correct security protocol. After changing advertised.listeners, restart the broker. Clients must then use the external bootstrap server to ensure they receive the external advertised endpoints.
Tradeoff: Multiple listeners add configuration complexity and require careful firewall rules, but they eliminate the single biggest source of “broker may not be available” errors in NATed environments.
Do not rely on /etc/hosts workarounds on client machines. They are fragile and break when broker IPs change.
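A multi-listener configuration is easy to get subtly wrong, so a consistency check before restarting the broker can help. The sketch below reads server.properties text and reports advertised listener names that are missing from listeners or from listener.security.protocol.map. The function names are hypothetical. When the map is not set, the sketch assumes the Kafka default names PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL.

```python
DEFAULT_PROTOCOL_NAMES = {"PLAINTEXT", "SSL", "SASL_PLAINTEXT", "SASL_SSL"}


def load_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def listener_names(value: str) -> list:
    """Extract listener names from 'NAME://host:port,...'."""
    return [e.strip().split("://", 1)[0] for e in value.split(",") if e.strip()]


def check_listener_names(text: str) -> list:
    """Return human-readable problems; an empty list means the names line up."""
    props = load_properties(text)
    bound = set(listener_names(props.get("listeners", "")))
    advertised = listener_names(props.get("advertised.listeners", ""))
    raw_map = props.get("listener.security.protocol.map")
    if raw_map is None:
        mapped = DEFAULT_PROTOCOL_NAMES  # assumption: broker defaults apply
    else:
        mapped = {p.split(":", 1)[0].strip() for p in raw_map.split(",") if p.strip()}
    problems = []
    for name in advertised:
        if name not in bound:
            problems.append(f"{name}: advertised but not in listeners")
        if name not in mapped:
            problems.append(f"{name}: no entry in listener.security.protocol.map")
    return problems
```

This only checks that the names agree with each other. Whether each advertised address is reachable still has to be tested from the client network.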
Fix TLS or SASL handshake failures
Ensure the client and broker support a common TLS version. Many modern Kafka deployments on recent JVMs negotiate TLS 1.2 and higher. If the client is restricted to TLS 1.3 and the broker only offers 1.2, the handshake will fail.
For SASL, verify the mechanism matches on both sides. For example, AWS MSK with IAM authentication requires the client to use the AWS_MSK_IAM mechanism, not SCRAM-SHA-256. Check broker logs for the exact AuthenticationException message.
Inspect certificate validity, trust stores, and whether the broker presents the full certificate chain. A missing intermediate certificate causes client-side trust failures that look like network timeouts.
Tradeoff: Stricter TLS cipher suites and certificate validation improve security but can break legacy clients.
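A TLS handshake can be tested independently of any Kafka client, which separates certificate and protocol-version problems from SASL or Kafka-level ones. This is a minimal sketch using the Python standard library. The function names are hypothetical, it assumes a TLS listener without client certificate authentication, and it does not attempt SASL.

```python
import socket
import ssl


def tls_context(min_version=ssl.TLSVersion.TLSv1_2, cafile=None):
    """Build a verifying client context; cafile is an optional CA bundle path."""
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.minimum_version = min_version
    return ctx


def probe_tls(host: str, port: int, ctx: ssl.SSLContext, timeout: float = 5.0):
    """Return (True, negotiated TLS version) or (False, error description)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # Covers untrusted CAs, missing intermediates, and hostname mismatches.
        return False, f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        # Covers protocol version and cipher mismatches.
        return False, f"TLS handshake failed: {exc}"
    except OSError as exc:
        # Refused, timed out, reset, or DNS failure before or during the handshake.
        return False, f"connection failed: {exc}"
```

A call such as probe_tls("kafka.example.com", 9093, tls_context()) against the advertised TLS listener shows whether the chain validates with the given trust store and which protocol version is negotiated.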
Fix firewall or security group blocks
Open the advertised broker port from the client subnet to the broker host. Remember that the client connects to the advertised endpoint, not just the bootstrap endpoint. If you use a two-listener setup, ensure both the bootstrap listener port and the advertised data listener port are accessible.
Verify that return traffic is allowed. Stateful firewalls usually handle this, but asymmetric routing or strict ACLs can block response packets.
Recover a genuinely down or overloaded broker
If the broker process is not running, investigate why before restarting. Check related guides for disk exhaustion, OOM kills, or controller queue backups. A restart is not free: the broker loses its page cache, must re-fetch replicas to join the ISR, and triggers client reconnections across the cluster.
If the process is running but unresponsive, check for GC pauses via JMX heap metrics, disk I/O latency via iostat, or request queue saturation. If NetworkProcessorAvgIdlePercent is near zero, the broker is overloaded. Temporarily throttle producers with quotas or migrate partitions to reduce load.
Tradeoff: A rolling restart of a sick broker can restore service, but expect elevated UnderReplicatedPartitions and latency for several minutes to hours depending on partition size.
Prevention
- Validate advertised.listeners from a client-equivalent network location after every broker deployment, container reschedule, or infrastructure change.
- Monitor broker authentication failure rates via JMX or logs to catch TLS and SASL drift before clients fail.
- Manage firewall rules and security groups in infrastructure-as-code so advertised ports remain open by default.
- Avoid making a single broker the only bootstrap host for critical clients. If that broker is down, clients cannot even fetch metadata.
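The last point can be checked mechanically against client configuration. The sketch below parses a bootstrap.servers value and reports whether it names fewer than two distinct hosts. The function names are hypothetical, and the check is a heuristic: a single DNS name or load balancer that fronts several brokers is not a single point of failure even though it counts as one host here.

```python
def parse_bootstrap_servers(value: str) -> list:
    """Parse 'host1:9092,host2:9092' into [(host, port), ...]."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


def single_bootstrap_host(value: str) -> bool:
    """True when the list names fewer than two distinct hosts."""
    hosts = {host for host, _port in parse_bootstrap_servers(value)}
    return len(hosts) < 2
```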
How Netdata helps
- Correlate broker authentication failure rates with client connection errors to distinguish network issues from security misconfigurations.
- Alert on low Network Processor Average Idle Percent to catch network thread saturation before clients time out.
- Monitor Offline Partitions Count and Active Controller Count to confirm whether a broker is genuinely down or the cluster metadata plane is degraded.
- Track broker process uptime and OS-level TCP metrics to spot file descriptor exhaustion or listen-queue overflows that silently reject connections.
- Cross-reference client-visible errors with broker request queue size and request handler idle percentage to separate overload from connectivity failures.
Related guides
- How Kafka actually works in production: a mental model for operators
- Kafka enable.auto.commit data loss: committed offsets that outrun processing
- Kafka broker out of disk: log.dirs full, the cliff-edge shutdown, and recovery
- Kafka CommitFailedException: rebalanced-out consumers and poll loop timeouts
- Kafka consumer group stuck Empty or Dead: no members consuming
- Kafka consumer group lag growing: detection, lag-as-time, and root causes
- Kafka consumer group rebalancing too often: heartbeats, session timeout, and assignors
- Kafka __consumer_offsets growing huge: compaction failure on the offsets topic
- Kafka consumer rebalance storm: stuck in PreparingRebalance and max.poll.interval.ms
- Kafka controller event queue backing up: overwhelmed controller and stalled metadata
- Kafka disk I/O latency high: await, LocalTimeMs, and the slow-disk broker
- Kafka disk space planning: retention, replication, and runway estimation







