Elasticsearch TLS certificate expiry: cluster fragmentation and client lockout

A node reboots and never rejoins the cluster. Kibana shows connection errors while your data pipeline buffers events. curl returns a TLS handshake failure even though the Elasticsearch process is still listening on port 9200. In Elasticsearch 8.x, security is enabled by default, so every node and client relies on TLS certificates. When they expire, failure is abrupt and total. Transport-layer expiry fragments the cluster by rejecting inter-node handshakes. HTTP-layer expiry locks out clients while the cluster internals may still operate. There is no built-in grace period: from the expiry timestamp onward, new TLS handshakes fail, often with no prior warning in the application logs.

What this means

Elasticsearch uses TLS in two layers. The transport layer secures inter-node communication. In 8.x, nodes present certificates to each other and validation is required for cluster membership. When a transport certificate expires, peer nodes treat the presenting node as untrusted. It is not gracefully decommissioned; it becomes unreachable. The master's fault detection checks against that node fail, the node is removed from the cluster, and its shards go unassigned. If multiple nodes are affected, the cluster can lose quorum or fragment.

The HTTP layer secures the REST API. When an HTTP certificate expires, clients including Kibana, Logstash, Beats, and application services receive TLS handshake failures. The cluster may still report green internally and the transport layer may be intact, but the cluster is effectively offline to external users. Transport and HTTP certificates can expire at different times, so one layer can fail while the other works. Auto-generated certificates from elasticsearch-certutil have configurable expiry, and the default varies: clusters provisioned at different times may have different deadlines.

flowchart TD
    A[Certificate expiry] --> B{Which layer?}
    B -->|Transport| C[Nodes reject peer handshakes]
    C --> D[Node leaves cluster]
    D --> E[Unassigned shards and master instability]
    B -->|HTTP| F[Clients reject server cert]
    F --> G[REST API unavailable and Kibana lockout]

Common causes

| Cause | What it looks like | First thing to check |
| --- | --- | --- |
| Transport certificate expired | Node count drops; unassigned shards rise; logs indicate TLS handshake failures | GET /_ssl/certificates filtered for transport entries |
| HTTP certificate expired | External clients fail with SSL errors; cluster health may still be green | GET /_ssl/certificates filtered for HTTP entries |
| Mismatched renewal dates | Cluster works internally but clients are locked out, or vice versa | Compare expiry timestamps across all certificate paths returned by the API |

Quick checks

# All Elasticsearch API calls require authentication and CA trust.
# Export ES_USER, ES_PASS, and ES_CACERT, or pass -u and --cacert inline.

curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_ssl/certificates' | jq '.[] | {path, expiry, has_private_key}'

# If the HTTP API is unreachable, inspect the certificate directly with openssl
echo | openssl s_client -connect localhost:9200 -servername localhost 2>/dev/null | openssl x509 -noout -dates

# Same approach for the transport layer on the configured transport port (default 9300).
# The handshake will fail without a client certificate, but the server certificate is emitted first.
echo | openssl s_client -connect localhost:9300 2>/dev/null | openssl x509 -noout -dates

# Cluster node count and health
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,unassigned_shards'

# Nodes visible to the master
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent'

# Elected master
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cat/master?v'

# Master backlog caused by node departures
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cluster/pending_tasks?pretty'
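
When the openssl fallback above has to be scripted, the notAfter line can be turned into a timestamp for comparison. The following is a minimal Python sketch, not a definitive tool: the function names are hypothetical, it assumes the openssl binary is on the PATH, and it shells out to the same two openssl commands used above.

```python
import ssl
import subprocess
from datetime import datetime, timezone


def parse_not_after(openssl_dates_output: str) -> datetime:
    """Extract notAfter from `openssl x509 -noout -dates` output as a UTC datetime."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            # OpenSSL prints the date like "notAfter=Jan  5 09:34:43 2027 GMT".
            seconds = ssl.cert_time_to_seconds(line.split("=", 1)[1].strip())
            return datetime.fromtimestamp(seconds, tz=timezone.utc)
    raise ValueError("no notAfter line found in openssl output")


def read_server_not_after(host: str, port: int) -> datetime:
    """Hypothetical helper: fetch the server certificate and return its expiry."""
    # The handshake may fail (for example on the transport port without a
    # client certificate), so its exit code is deliberately not checked.
    handshake = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}", "-servername", host],
        input=b"", capture_output=True, timeout=15,
    )
    dates = subprocess.run(
        ["openssl", "x509", "-noout", "-dates"],
        input=handshake.stdout, capture_output=True, check=True, timeout=15,
    )
    return parse_not_after(dates.stdout.decode())
```

Calling read_server_not_after("localhost", 9200) on the node and comparing the result with the current UTC time shows whether the HTTP certificate has already expired, without going through the REST API.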

How to diagnose it

  1. Determine the failure scope. If the cluster itself is intact (for example, node count and health look normal when queried on the node with certificate verification temporarily disabled via curl -k) but clients cannot connect, suspect HTTP-layer expiry. If the master shows fewer nodes than expected or shards are unassigned, suspect transport-layer expiry.
  2. Query GET /_ssl/certificates from a node that still answers. Review every entry. The expiry field is an ISO-8601 timestamp. Do not assume transport and HTTP certificates share the same date.
  3. If the REST API is blocked by an expired HTTP certificate, use openssl s_client directly on the node to read the notAfter date without relying on the Elasticsearch API.
  4. Check Elasticsearch logs for SSLHandshakeException, CertPathValidatorException, or "certificate has expired" messages near the incident timestamp. These messages confirm TLS rejection rather than a network partition.
  5. Correlate the notAfter timestamp with the incident start time. Handshake failures caused by expiry begin at the notAfter boundary; a node with long-lived connections may only show symptoms at its first reconnect or restart after that time.
  6. Check cluster health and node count. Transport fragmentation shows a reduced number_of_nodes and possibly unassigned shards. Use GET /_cluster/allocation/explain to confirm that shards are unassigned because the node is unavailable, not because of disk or shard limits.
  7. Verify master stability. If master-eligible nodes lost transport connectivity, check GET /_cat/master and GET /_cluster/pending_tasks for election stalls or allocation backlogs.
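
Steps 2 and 5 above can be combined into one check: parse the expiry field of each entry returned by GET /_ssl/certificates and list the certificates that had already expired when the incident began. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is the already-decoded JSON array from the API, and only the path and expiry fields are used.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def parse_expiry(value: str) -> datetime:
    """Parse the ISO-8601 expiry field; a trailing Z is rewritten for older Pythons."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat a timestamp without an offset as UTC so comparisons stay valid.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def expired_before(certificates: List[dict],
                   incident_start: datetime) -> List[Tuple[str, datetime]]:
    """Return (path, expiry) pairs that expired at or before the incident start."""
    hits = []
    for cert in certificates:
        expiry = parse_expiry(cert["expiry"])
        if expiry <= incident_start:
            hits.append((cert["path"], expiry))
    # Oldest expiry first, so the earliest failure is at the top.
    return sorted(hits, key=lambda pair: pair[1])
```

An empty result suggests the incident has another cause. A result that contains only one layer's certificate paths points to that layer.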

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| GET /_ssl/certificates expiry | Built-in early warning before hard failure | Any certificate within 7 days of expiry |
| Node count | Transport expiry removes nodes from cluster membership | Unplanned drop in number_of_nodes |
| Unassigned shard count | Fragmentation leaves shards without active copies | Sustained rise without rolling restarts |
| Cluster health status | Composite view of fragmentation impact | Red or yellow with no disk or heap pressure |
| Pending cluster tasks | Node loss creates allocation and state publication work for the master | Backlog growing while nodes are missing |
| TLS handshake exceptions in logs | Distinguishes cert expiry from network partition | SSLHandshakeException or "certificate has expired" messages at incident start |
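
The log signal in the last row can be checked with a plain substring scan. The sketch below is illustrative only: the function name is hypothetical, the markers are the three strings named in this guide, and the log file location is left to the caller because it depends on the installation.

```python
from typing import Iterable, List

# Substrings that indicate TLS rejection rather than a network partition.
TLS_FAILURE_MARKERS = (
    "SSLHandshakeException",
    "CertPathValidatorException",
    "certificate has expired",
)


def find_tls_failures(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain any of the TLS failure markers."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in TLS_FAILURE_MARKERS)
    ]
```

Passing an open file handle for the Elasticsearch log works because the function only iterates over lines. The timestamp of the first hit can then be compared with the incident start.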

Fixes

Transport certificate expired

Generate replacement transport certificates using your existing PKI or elasticsearch-certutil. Distribute the new certificate material to every node. Restart Elasticsearch on each node in a rolling fashion, verifying that each node rejoins with GET /_cat/nodes before proceeding. If the cluster has already fragmented, start with the current master or a master-eligible node to preserve cluster state, then bring data nodes back online one by one.

Warning: Do not remove an old CA from truststores until every node presents a certificate signed by the new or retained authority. Premature removal partitions the cluster.
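
The rejoin check between restarts in the rolling procedure above can be automated. The following Python sketch assumes the caller supplies a function that returns the body of GET /_cat/nodes?format=json&h=name as text; that callable, the function names, and the retry defaults are all hypothetical choices for illustration.

```python
import json
import time
from typing import Callable, Iterable, List


def missing_nodes(expected: Iterable[str], cat_nodes_json: str) -> List[str]:
    """Return the expected node names absent from a _cat/nodes JSON response."""
    present = {row["name"] for row in json.loads(cat_nodes_json)}
    return sorted(set(expected) - present)


def wait_for_rejoin(fetch_cat_nodes: Callable[[], str], node: str,
                    attempts: int = 30, delay_seconds: float = 10.0) -> bool:
    """Poll until the restarted node appears in the node list or attempts run out."""
    for _ in range(attempts):
        if not missing_nodes([node], fetch_cat_nodes()):
            return True
        time.sleep(delay_seconds)
    return False
```

Proceeding to the next node only when wait_for_rejoin returns True keeps the restart rolling. A False result is a signal to stop and inspect the node's logs before touching anything else.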

HTTP certificate expired

Generate replacement HTTP certificates, install them on each node, and restart. If the CA has changed, update client truststores and Kibana configuration. Verify client connectivity with a direct curl before declaring the incident resolved.

If both layers expired

Treat transport first. Restore cluster membership and master stability so that shard allocation and state publication work correctly. Then restore HTTP client access. Attempting to fix HTTP while the cluster is split risks conflicting state across nodes and makes diagnosis harder.

Prevention

  • Poll GET /_ssl/certificates at least daily and alert when any certificate is within 7 days of expiry.
  • Maintain separate tracking for transport and HTTP expiry dates.
  • Renew certificates before they expire. Elasticsearch does not support extending an existing certificate’s validity period.
  • Test the renewal and restart procedure in a staging cluster.
  • Keep a fallback access channel that does not depend on the HTTP TLS path, such as direct host access or an unproxied localhost connection, for diagnosis during lockout.
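
The daily poll from the first bullet could look roughly like the sketch below. It is an assumption-laden example rather than a finished monitor: it reuses the ES_USER, ES_PASS, and ES_CACERT environment variables from the quick checks, the function names and the "ok" / "warning" / "expired" labels are hypothetical, and scheduling and alert delivery are left out.

```python
import base64
import json
import os
import ssl
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_expiry(expiry_iso: str, now: datetime,
                    warn_window: timedelta = timedelta(days=7)) -> str:
    """Label one certificate as 'expired', 'warning', or 'ok'."""
    expiry = datetime.fromisoformat(expiry_iso.replace("Z", "+00:00"))
    remaining = expiry - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= warn_window:
        return "warning"
    return "ok"


def fetch_certificates(base_url: str) -> list:
    """Call GET /_ssl/certificates with basic auth and the configured CA."""
    credentials = f"{os.environ['ES_USER']}:{os.environ['ES_PASS']}"
    token = base64.b64encode(credentials.encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/_ssl/certificates",
        headers={"Authorization": f"Basic {token}"},
    )
    context = ssl.create_default_context(cafile=os.environ["ES_CACERT"])
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def certificates_needing_attention(certificates: list,
                                   now: Optional[datetime] = None) -> list:
    """Return (path, status) for every certificate that is not 'ok'."""
    now = now or datetime.now(timezone.utc)
    statuses = [(c["path"], classify_expiry(c["expiry"], now)) for c in certificates]
    return [(path, status) for path, status in statuses if status != "ok"]
```

Running certificates_needing_attention(fetch_certificates("https://localhost:9200")) from a scheduler and alerting on any non-empty result covers both layers, because the API returns transport and HTTP certificates in the same response.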

How Netdata helps

Netdata health alerts for node reachability and cluster health surface node-count drops caused by transport-layer fragmentation. Per-node HTTP check failures on port 9200 distinguish HTTP certificate expiry from application crashes. JVM heap and disk metrics help rule out resource-pressure cascades that mimic fragmentation. Netdata can also monitor the TLS certificate notAfter date on the HTTP endpoint directly, which works even when the REST API is locked out. Set alert thresholds at 7 days and 24 hours.