Kafka ‘Broker may not be available’: clients that can’t connect or stay connected

Client logs show a warning like this:

WARN [Producer clientId=...] Connection to node -1 could not be established. Broker may not be available.

Python clients may raise kafka.errors.NoBrokersAvailable, while librdkafka-based clients report Connection refused against the broker address. The bootstrap server is often reachable: ping and telnet succeed, yet the client still fails.

This happens because Kafka clients use the bootstrap connection only to fetch metadata. After that, they open new TCP connections to the host:port pairs advertised in the metadata response. If those endpoints are unreachable, misconfigured, or secured differently than the client expects, the connection fails even though the bootstrap succeeded. In the Java client, the node ID in the warning tells the two cases apart: a negative ID such as -1 refers to an entry in bootstrap.servers, while a non-negative ID is a broker ID learned from metadata.

What this means

This is a client-side symptom of a failed second-hop connection. The bootstrap succeeded and the client received a metadata response containing broker endpoints. The direct connection to one of those brokers then failed. The failure can happen at multiple layers: TCP routing, DNS resolution, TLS handshake, SASL authentication, or because the target broker is genuinely down. In containerized environments, the most common root cause is an advertised.listeners configuration that returns an internal hostname or IP to external clients. The broker is healthy, but it is telling clients to connect to an address they cannot reach.
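
The layers listed above can be told apart from the client host with a few lines of standard-library Python. The following is a minimal sketch, not part of any Kafka client: probe_endpoint is a hypothetical helper name, and the mapping of outcomes to causes (timeout suggests a dropped SYN, refused suggests nothing is listening) is a rule of thumb rather than a guarantee.

```python
import socket


def probe_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'ok', 'dns', 'refused', 'timeout', or 'error'."""
    try:
        # Resolve first so a DNS failure is reported separately from a TCP failure.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    result = "error"
    for family, socktype, proto, _canonname, sockaddr in candidates:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return "ok"
        except socket.timeout:
            result = "timeout"  # no answer at all: often a firewall or security group drop
        except ConnectionRefusedError:
            result = "refused"  # host answered, but nothing is listening on that port
        except OSError:
            result = "error"    # for example, no route to host
        finally:
            sock.close()
    return result
```

Run it once against the bootstrap address and once against the advertised host:port named in the client error. A bootstrap result of ok combined with dns, timeout, or refused on the advertised endpoint points at the second hop.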

flowchart TD
    A[Client sees Broker may not be available] --> B{Bootstrap host:port reachable?}
    B -->|No| C[Fix client network or DNS]
    B -->|Yes| D{kafka-broker-api-versions.sh works from client?}
    D -->|No| E[Check bootstrap listener and firewall]
    D -->|Yes| F{Advertised endpoint reachable from client?}
    F -->|No| G[Fix advertised.listeners or NAT rules]
    F -->|Yes| H{Broker logs show TLS or SASL errors?}
    H -->|Yes| I[Fix handshake or auth config]
    H -->|No| J[Check broker process health and load]

Common causes

Cause | What it looks like | First thing to check
--- | --- | ---
advertised.listeners misconfiguration (Docker, K8s, NAT) | Client receives metadata but cannot resolve or reach the advertised host:port. Connections work inside the container network but fail from outside. | kafka-broker-api-versions.sh from the client’s network. Compare the returned broker list against what the client can reach.
TLS or SASL handshake failure | TCP connection establishes but is immediately reset or hangs. Broker logs contain SSLHandshakeException or AuthenticationException. | Broker logs for handshake or authentication errors. Verify client and broker protocol versions and SASL mechanisms match.
Firewall or security group drop | TCP SYN to the advertised port times out. The bootstrap port is open, but the data port is not. | nc -zv or Bash /dev/tcp to the advertised host:port from the client host.
Broker process down or overloaded | Connection refused to the advertised port, or the broker appears in metadata but does not respond to API requests. | Broker process liveness, port binding with ss -tlnp, and NetworkProcessorAvgIdlePercent via JMX.

Quick checks

# TCP reachability to bootstrap
nc -zv <bootstrap-host> <bootstrap-port>

# Bash built-in alternative
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/<bootstrap-host>/<bootstrap-port>'

# Kafka protocol reachability from client
kafka-broker-api-versions.sh --bootstrap-server <bootstrap-host>:<bootstrap-port>

# Broker logs for listener and handshake errors
grep -E "advertised\.listeners|listeners ?=|SSLHandshakeException|AuthenticationException|Failed authentication" /var/log/kafka/server.log

# Broker listening interfaces
ss -tlnp | grep java

# KRaft quorum health
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# File descriptor limits
grep "Max open files" "/proc/$(pgrep -f kafka.Kafka | head -n 1)/limits"
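
To turn the file descriptor check into a number you can compare against the broker's connection count, the limits file can be parsed directly. This is a rough sketch that assumes the Linux /proc/<pid>/limits layout (a "Max open files" row followed by soft and hard values); parse_max_open_files and fd_usage_ratio are hypothetical helper names.

```python
def _to_limit(field: str):
    """Convert a limits-file field to an int, or None when it reads 'unlimited'."""
    return None if field == "unlimited" else int(field)


def parse_max_open_files(limits_text: str):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' row found")


def fd_usage_ratio(open_fds: int, soft_limit) -> float:
    """Fraction of the soft limit in use; 0.0 when the limit is unlimited."""
    if soft_limit is None:
        return 0.0
    return open_fds / soft_limit
```

On the broker host, the number of open descriptors can be taken from the entry count of /proc/<pid>/fd. A ratio that climbs toward 1.0 means new connections are at risk of being rejected.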

How to diagnose it

  1. Test basic TCP connectivity to the bootstrap server using nc or Bash /dev/tcp. If this fails, investigate routing, DNS, or firewalls on both sides before changing broker configuration.
  2. Run kafka-broker-api-versions.sh --bootstrap-server <host>:<port> from the client machine. This performs a full Kafka protocol handshake. If it fails, the client cannot reach the Kafka protocol layer. This often indicates a firewall blocking the port or the broker listening on an interface the client cannot reach.
  3. If the API versions check succeeds, the cluster is reachable at the bootstrap address. The issue is likely the second hop. Inspect the broker’s advertised.listeners in server.properties or broker logs. Ensure the advertised host:port is resolvable and routable from the client’s network. In Docker or Kubernetes, the default advertised address is often the pod IP or container hostname, which external clients cannot resolve.
  4. Test TCP connectivity from the client to the specific advertised host:port that the client is complaining about. Use nc -zv <advertised-host> <advertised-port>. If this fails while the bootstrap works, you have a NAT, routing, or advertised.listeners mismatch.
  5. Check broker logs at /var/log/kafka/server.log for SSLHandshakeException, AuthenticationException, or SASL mechanism mismatch errors. If TCP connects but the connection drops during negotiation, the issue is at the security protocol layer, not the network layer.
  6. On the broker, verify the process is running and bound to the correct network interface. A common misconfiguration is listeners=PLAINTEXT://localhost:9092, which prevents external connections entirely. Use ss -tlnp to confirm the listening address.
  7. Check whether the broker is saturated. Even a running broker can refuse connections if network threads are exhausted. Query the JMX MBean kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent. If the value is below 0.1 (10% idle), the broker is overloaded and may drop connections.
  8. In KRaft mode, verify quorum health with kafka-metadata-quorum.sh --bootstrap-server <host>:<port> describe --status. If there is no quorum leader, cluster metadata may be stale and clients may receive outdated broker endpoints.
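
Steps 2 to 4 can be partly automated by parsing the broker list that kafka-broker-api-versions.sh prints and testing each advertised endpoint. The sketch below assumes each broker appears on a header line shaped like "host:port (id: N rack: R) -> ("; verify that against the output of your Kafka version before relying on it. The function names are hypothetical.

```python
import re
import socket

# Assumed header line format: "broker-1.internal:9092 (id: 1 rack: null) -> ("
BROKER_LINE = re.compile(
    r"^(?P<host>\S+):(?P<port>\d+) \(id: (?P<id>-?\d+) rack: (?P<rack>\S+)\) ->"
)


def parse_advertised_brokers(output: str) -> list:
    """Extract (broker_id, host, port) tuples from the tool's text output."""
    brokers = []
    for line in output.splitlines():
        match = BROKER_LINE.match(line.strip())
        if match:
            brokers.append((int(match["id"]), match["host"], int(match["port"])))
    return brokers


def tcp_connects(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_brokers(brokers: list, can_connect=tcp_connects) -> list:
    """Return the brokers whose advertised endpoint cannot be reached from here."""
    return [b for b in brokers if not can_connect(b[1], b[2])]
```

Feed the captured command output to parse_advertised_brokers, then pass the result to unreachable_brokers on the client machine. Any broker it returns is a second-hop candidate for the advertised.listeners, NAT, or firewall checks above.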

Metrics and signals to monitor

Signal | Why it matters | Warning sign
--- | --- | ---
Authentication failures | Failed TLS or SASL handshakes often present as broker unavailability to clients. | Sustained AuthenticationException in broker logs.
Network Processor Average Idle Percent | Saturated network threads cannot accept new connections or complete handshakes. | Sustained value below 0.3 (30% idle).
Offline Partitions Count | Confirms whether a broker is genuinely down and its partitions are leaderless. | Nonzero value reported by the active controller.
Active Controller Count | Stale or missing metadata means clients may receive bad broker lists. | Cluster-wide sum is not exactly 1.
Connection Count | Approaching the file descriptor or thread limit causes silent connection refusal. | Count exceeds 2x baseline or nears the OS FD limit.
Failed produce requests | Distinguishes a pure connectivity issue from a protocol or authorization failure. | Sustained nonzero rate outside of rolling restarts.
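
The thresholds in this guide can be encoded as a small check once the values have been collected from JMX. This is an illustrative sketch only: it uses the 0.3 warning level from the table and the 0.1 overload level from the diagnosis steps, treats the idle metric as a ratio between 0 and 1, and uses hypothetical function names. How the values are scraped is left out.

```python
def classify_network_idle(idle_ratio: float) -> str:
    """Map NetworkProcessorAvgIdlePercent (0.0 to 1.0) to 'ok', 'warning', or 'overloaded'."""
    if not 0.0 <= idle_ratio <= 1.0:
        raise ValueError("idle ratio must be between 0.0 and 1.0")
    if idle_ratio < 0.1:
        return "overloaded"  # below 10% idle: the broker may drop connections
    if idle_ratio < 0.3:
        return "warning"     # below 30% idle: network threads are getting saturated
    return "ok"


def controller_count_healthy(per_broker_counts: list) -> bool:
    """The cluster-wide sum of ActiveControllerCount should be exactly 1."""
    return sum(per_broker_counts) == 1
```

A single low sample is not meaningful on its own; both signals are described here as sustained conditions, so evaluate them over a window rather than per scrape.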

Fixes

Fix advertised.listeners mismatches

Edit server.properties on each broker and set advertised.listeners to an address that clients in their respective networks can resolve and reach. In containerized environments, this usually means advertising the node IP, load balancer, or ingress hostname instead of the pod IP.

Use separate listeners for inter-broker and client traffic:

listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://broker-1.internal:9092,EXTERNAL://kafka.example.com:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT

Map each listener name to the correct security protocol. After changing advertised.listeners, restart the broker. Clients must then use the external bootstrap server to ensure they receive the external advertised endpoints.
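
Before restarting, the listener settings can be sanity-checked offline. The following sketch parses the three properties shown above and flags common mistakes. It is a simplified illustration, not Kafka's own validation: the function names are hypothetical, IPv6 literals are not handled, and the fallback protocol map assumes the four standard listener names.

```python
DEFAULT_PROTOCOL_MAP = "PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL"
UNUSABLE_HOSTS = {"", "0.0.0.0", "localhost", "127.0.0.1"}


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def parse_listeners(value: str) -> dict:
    """Turn 'NAME://host:port,...' into {NAME: (host, port)}."""
    result = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, _, hostport = entry.partition("://")
        host, _, port = hostport.rpartition(":")
        result[name] = (host, int(port))
    return result


def check_listener_config(props: dict) -> list:
    """Return human-readable problems; an empty list means no obvious mismatch."""
    problems = []
    listeners = parse_listeners(props.get("listeners", ""))
    if "advertised.listeners" not in props:
        problems.append("advertised.listeners is not set, so the broker advertises its listeners value")
    advertised = parse_listeners(props.get("advertised.listeners", ""))
    mapped = set()
    for pair in props.get("listener.security.protocol.map", DEFAULT_PROTOCOL_MAP).split(","):
        if pair.strip():
            mapped.add(pair.split(":", 1)[0].strip())
    for name, (host, _port) in advertised.items():
        if name not in listeners:
            problems.append(f"{name}: advertised but missing from listeners")
        if host in UNUSABLE_HOSTS:
            problems.append(f"{name}: advertises '{host}', which remote clients cannot use")
    for name in listeners:
        if name not in mapped:
            problems.append(f"{name}: no entry in listener.security.protocol.map")
    return problems
```

A clean result only shows that the file is internally consistent. Whether kafka.example.com:9093 is actually reachable still has to be tested from the client's network.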

Tradeoff: Multiple listeners add configuration complexity and require careful firewall rules, but they eliminate the single biggest source of “broker may not be available” errors in NATed environments.

Do not rely on /etc/hosts workarounds on client machines. They are fragile and break when broker IPs change.

Fix TLS or SASL handshake failures

Ensure the client and broker support a common TLS version. Many modern Kafka deployments on recent JVMs negotiate TLS 1.2 and higher. If the client is restricted to TLS 1.3 and the broker only offers 1.2, the handshake will fail.

For SASL, verify the mechanism matches on both sides. For example, AWS MSK with IAM authentication requires the client to set sasl.mechanism to AWS_MSK_IAM; a client configured for a SCRAM mechanism will be rejected. Check broker logs for the exact AuthenticationException message.

Inspect certificate validity, trust stores, and whether the broker presents the full certificate chain. A missing intermediate certificate causes client-side trust failures that look like network timeouts.
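
To see whether a failure is at the TCP, certificate, or handshake stage, a TLS connection can be attempted from the client host with Python's ssl module. This sketch assumes a listener that speaks TLS directly (SSL or SASL_SSL) and does not require a client certificate; tls_probe is a hypothetical helper, and the optional cafile argument stands in for whatever CA bundle your clients trust.

```python
import socket
import ssl


def tls_probe(host: str, port: int, timeout: float = 5.0, cafile=None):
    """Return (stage, detail) where stage is 'tcp', 'certificate', 'handshake', or 'ok'."""
    context = ssl.create_default_context(cafile=cafile)
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return "tcp", str(exc)  # never got as far as TLS
    try:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return "ok", tls.version()  # negotiated protocol, e.g. 'TLSv1.3'
    except ssl.SSLCertVerificationError as exc:
        # Untrusted issuer, incomplete chain, expired certificate, or hostname mismatch.
        return "certificate", exc.verify_message
    except OSError as exc:
        # Protocol version or cipher mismatch, reset, or a handshake that timed out.
        return "handshake", str(exc)
    finally:
        raw.close()
```

A certificate result with a message about the issuer is a hint that the broker is not presenting the full chain or that the client trust store is missing a CA. A handshake timeout against a port that accepts TCP often means the listener is not a TLS listener at all.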

Tradeoff: Stricter TLS cipher suites and certificate validation improve security but can break legacy clients.

Fix firewall or security group blocks

Open the advertised broker port from the client subnet to the broker host. Remember that the client connects to the advertised endpoint, not just the bootstrap endpoint. If you use a two-listener setup, ensure both the bootstrap listener port and the advertised data listener port are accessible.

Verify that return traffic is allowed. Stateful firewalls usually handle this, but asymmetric routing or strict ACLs can block response packets.

Recover a genuinely down or overloaded broker

If the broker process is not running, investigate why before restarting. Check related guides for disk exhaustion, OOM kills, or controller queue backups. A restart is not free: the broker loses its page cache, must catch up on replication before its replicas rejoin the ISR, and triggers client reconnections across the cluster.

If the process is running but unresponsive, check for GC pauses via JMX heap metrics, disk I/O latency via iostat, or request queue saturation. If NetworkProcessorAvgIdlePercent is near zero, the broker is overloaded. Temporarily throttle producers with quotas or migrate partitions to reduce load.

Tradeoff: A rolling restart of a sick broker can restore service, but expect elevated UnderReplicatedPartitions and latency for several minutes to hours depending on partition size.

Prevention

  • Validate advertised.listeners from a client-equivalent network location after every broker deployment, container reschedule, or infrastructure change.
  • Monitor broker authentication failure rates via JMX or logs to catch TLS and SASL drift before clients fail.
  • Manage firewall rules and security groups in infrastructure-as-code so advertised ports remain open by default.
  • Avoid making a single broker the only bootstrap host for critical clients. If that broker is down, clients cannot even fetch metadata.
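
The last point can be enforced with a small check on client configuration at deploy time. This is a minimal sketch with hypothetical function names; it counts distinct host names in a bootstrap.servers value and does not know whether a single name resolves to several brokers behind DNS or a load balancer.

```python
def parse_bootstrap_servers(value: str) -> list:
    """Split a bootstrap.servers string into (host, port) tuples."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


def has_bootstrap_redundancy(value: str, minimum: int = 2) -> bool:
    """True when at least `minimum` distinct hosts are listed."""
    hosts = {host for host, _port in parse_bootstrap_servers(value)}
    return len(hosts) >= minimum
```

For example, has_bootstrap_redundancy("kafka-1.example.com:9093") is False, which flags a client that cannot fetch metadata if that one broker is down.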

How Netdata helps

  • Correlate broker authentication failure rates with client connection errors to distinguish network issues from security misconfigurations.
  • Alert on low Network Processor Average Idle Percent to catch network thread saturation before clients time out.
  • Monitor Offline Partitions Count and Active Controller Count to confirm whether a broker is genuinely down or the cluster metadata plane is degraded.
  • Track broker process uptime and OS-level TCP metrics to spot file descriptor exhaustion or listen-queue overflows that silently reject connections.
  • Cross-reference client-visible errors with broker request queue size and request handler idle percentage to separate overload from connectivity failures.