Elasticsearch TLS certificate expiry: cluster fragmentation and client lockout
A node reboots and never rejoins the cluster. Kibana shows connection errors while your data pipeline buffers events. curl returns a TLS handshake failure even though the Elasticsearch process is still listening on port 9200. In Elasticsearch 8.x, security is enabled by default: every node and client relies on TLS certificates. When they expire, failure is abrupt and total. Transport-layer expiry fragments the cluster by rejecting inter-node handshakes. HTTP-layer expiry locks out clients while the cluster internals may still operate. There is no built-in grace period. At the expiry timestamp, connections fail immediately, often with no prior warning in the application logs.
What this means
Elasticsearch uses TLS in two layers. The transport layer secures inter-node communication. In 8.x, nodes present certificates to each other and validation is required for cluster membership. When a transport certificate expires, peer nodes treat the presenting node as untrusted. It is not gracefully decommissioned; it becomes unreachable. The master stops receiving fault detection pings, shards on that node go unassigned, and if multiple nodes are affected the cluster can lose quorum or split.
The HTTP layer secures the REST API. When an HTTP certificate expires, clients including Kibana, Logstash, Beats, and application services receive TLS handshake failures. The cluster may still report green internally and the transport layer may be intact, but the cluster is effectively offline to external users. Transport and HTTP certificates can expire at different times, so one layer can fail while the other works. Certificates generated with elasticsearch-certutil have a configurable validity period, so clusters provisioned at different times, or with different settings, may have different deadlines.
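To see which certificate files each layer uses on a given node, it can help to list the TLS settings from elasticsearch.yml. The sketch below assumes the settings sit under the xpack.security.transport.ssl and xpack.security.http.ssl prefixes, written either as flat keys or as a nested block. The helper name list_tls_settings is hypothetical, and the path in the usage comment is the usual location for package installs, which may differ on your system.

```shell
#!/usr/bin/env bash
# list_tls_settings FILE
# Print the transport- and HTTP-layer TLS settings found in an elasticsearch.yml.
# Matches flat keys (xpack.security.http.ssl.enabled: true) and the indented
# lines that follow a nested block header (xpack.security.http.ssl:).
list_tls_settings() {
  awk '
    /^xpack\.security\.(transport|http)\.ssl/ { show = 1; print; next }
    show && /^[ \t]+[^ \t#]/                  { print; next }
    { show = 0 }
  ' "$1"
}

# Usage (path is an assumption; adjust for your install):
#   list_tls_settings /etc/elasticsearch/elasticsearch.yml
```

If the two layers point at different keystore paths, treat them as two certificates with two independent expiry dates.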
flowchart TD
A[Certificate expiry] --> B{Which layer?}
B -->|Transport| C[Nodes reject peer handshakes]
C --> D[Node leaves cluster]
D --> E[Unassigned shards and master instability]
B -->|HTTP| F[Clients reject server cert]
F --> G[REST API unavailable and Kibana lockout]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Transport certificate expired | Node count drops; unassigned shards rise; logs indicate TLS handshake failures | GET /_ssl/certificates filtered for transport entries |
| HTTP certificate expired | External clients fail with SSL errors; cluster health may still be green | GET /_ssl/certificates filtered for HTTP entries |
| Mismatched renewal dates | Cluster works internally but clients are locked out, or vice versa | Compare expiry timestamps across all certificate paths returned by the API |
Quick checks
# All Elasticsearch API calls require authentication and CA trust.
# Export ES_USER, ES_PASS, and ES_CACERT, or pass -u and --cacert inline.
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_ssl/certificates' | jq '.[] | {path, expiry, has_private_key}'
# If the HTTP API is unreachable, inspect the certificate directly with openssl
echo | openssl s_client -connect localhost:9200 -servername localhost 2>/dev/null | openssl x509 -noout -dates
# Same approach for the transport layer on the configured transport port (default 9300).
# The handshake will fail without a client certificate, but the server certificate is emitted first.
echo | openssl s_client -connect localhost:9300 2>/dev/null | openssl x509 -noout -dates
# Cluster node count and health
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cluster/health?filter_path=status,number_of_nodes,unassigned_shards'
# Nodes visible to the master
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent'
# Elected master
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cat/master?v'
# Master backlog caused by node departures
curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" 'https://localhost:9200/_cluster/pending_tasks?pretty'
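The openssl checks above print a notAfter date that still has to be compared with the current time by eye. A minimal sketch of that arithmetic follows, assuming GNU date (for the -d option) is available. The helper name days_until is hypothetical, and the host and port in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# days_until NOT_AFTER [NOW_EPOCH]
# NOT_AFTER: the date printed by `openssl x509 -noout -enddate`, without the
#            "notAfter=" prefix, e.g. "Mar 15 12:00:00 2026 GMT".
# NOW_EPOCH: optional reference time in epoch seconds (defaults to now).
# Prints whole days remaining. Integer division truncates toward zero, so the
# value is 0 for anything under one day and negative once expiry is at least
# a day in the past. Requires GNU date for `-d`.
days_until() {
  local end now
  end=$(date -u -d "$1" +%s) || return 1
  now=${2:-$(date -u +%s)}
  echo $(( (end - now) / 86400 ))
}

# Usage against a live endpoint (host and port are assumptions):
#   not_after=$(echo | openssl s_client -connect localhost:9200 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until "$not_after"
```

The same call works for the transport port, which makes it easy to compare the two layers side by side.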
How to diagnose it
- Determine the failure scope. If internal cluster APIs respond locally but external clients cannot connect, suspect HTTP-layer expiry. If the master shows fewer nodes than expected or shards are unassigned, suspect transport-layer expiry.
- Query GET /_ssl/certificates from a node that still answers. Review every entry. The expiry field is an ISO-8601 timestamp. Do not assume transport and HTTP certificates share the same date.
- If the REST API is blocked by an expired HTTP certificate, use openssl s_client directly on the node to read the notAfter date without relying on the Elasticsearch API.
- Check Elasticsearch logs for SSLHandshakeException, CertPathValidatorException, or certificate has expired near the incident timestamp. These messages confirm TLS rejection rather than network partition.
- Correlate the notAfter timestamp with the incident start time. Certificate expiry failures are typically instantaneous at the expiry boundary.
- Check cluster health and node count. Transport fragmentation shows a reduced number_of_nodes and possibly unassigned shards. Use GET /_cluster/allocation/explain to confirm that shards are unassigned because the node is unavailable, not because of disk or shard limits.
- Verify master stability. If master-eligible nodes lost transport connectivity, check GET /_cat/master and GET /_cluster/pending_tasks for election stalls or allocation backlogs.
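The log check in the steps above can be sketched as a single grep. The search strings are the ones named in this guide; the exact wording in your logs may vary by JDK and Elasticsearch version, so treat the pattern as a starting point. The helper name tls_log_hits is hypothetical and the log path in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# tls_log_hits FILE...
# Print log lines that mention the TLS failure markers listed in the
# diagnosis steps, prefixed with file name and line number so they can be
# matched against the incident start time.
# Returns non-zero when nothing matches (grep's own exit status).
tls_log_hits() {
  grep -HnE 'SSLHandshakeException|CertPathValidatorException|certificate has expired' "$@"
}

# Usage (log location is an assumption; adjust for your install):
#   tls_log_hits /var/log/elasticsearch/*.log | tail -n 20
```

If the first hit lines up with the notAfter timestamp, expiry is a far more likely cause than a network partition.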
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| GET /_ssl/certificates expiry | Built-in early warning before hard failure | Any certificate under 7 days |
| Node count | Transport expiry removes nodes from cluster membership | Unplanned drop in number_of_nodes |
| Unassigned shard count | Fragmentation leaves shards without active copies | Sustained rise without rolling restarts |
| Cluster health status | Composite view of fragmentation impact | Red or yellow with no disk or heap pressure |
| Pending cluster tasks | Node loss creates allocation and state publication work for the master | Backlog growing while nodes are missing |
| TLS handshake exceptions in logs | Distinguishes cert expiry from network partition | SSLHandshakeException or certificate expired at incident start |
Fixes
Transport certificate expired
Generate replacement transport certificates using your existing PKI or elasticsearch-certutil. Distribute the new certificate material to every node. Restart Elasticsearch on each node in a rolling fashion, verifying that each node rejoins with GET /_cat/nodes before proceeding. If the cluster has already fragmented, start with the current master or a master-eligible node to preserve cluster state, then bring data nodes back online one by one.
Warning: Do not remove an old CA from truststores until every node presents a certificate signed by the new or retained authority. Premature removal partitions the cluster.
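The "verify that each node rejoins before proceeding" step of the rolling restart can be sketched as a polling loop. This is an illustration under assumptions, not a complete restart procedure: fetch_node_count and wait_for_nodes are hypothetical helper names, the endpoint and the ES_USER, ES_PASS, and ES_CACERT variables follow the quick checks in this guide, and the retry defaults are arbitrary.

```shell
#!/usr/bin/env bash
# fetch_node_count
# Print the number_of_nodes value reported by the cluster health API.
# Relies on ES_USER, ES_PASS and ES_CACERT being exported.
fetch_node_count() {
  curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" \
    'https://localhost:9200/_cluster/health?filter_path=number_of_nodes' \
    | sed -n 's/.*"number_of_nodes"[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# wait_for_nodes EXPECTED [ATTEMPTS] [PAUSE_SECONDS]
# Poll until at least EXPECTED nodes are visible, or give up after ATTEMPTS.
wait_for_nodes() {
  local expected="$1" attempts="${2:-30}" pause="${3:-10}" i=1 count=""
  while [ "$i" -le "$attempts" ]; do
    count=$(fetch_node_count)
    # An empty count (API unreachable) is treated as zero nodes.
    if [ "${count:-0}" -ge "$expected" ]; then
      echo "cluster has $count nodes"
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  echo "timed out waiting for $expected nodes (last seen: ${count:-none})" >&2
  return 1
}

# Usage after restarting one node of a five-node cluster:
#   wait_for_nodes 5 || echo "stop and investigate before touching the next node"
```

Gating each restart on the node count keeps a failed certificate swap from spreading to the next node.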
HTTP certificate expired
Generate replacement HTTP certificates, install them on each node, and restart. If the CA has changed, update client truststores and Kibana configuration. Verify client connectivity with a direct curl before declaring the incident resolved.
If both layers expired
Treat transport first. Restore cluster membership and master stability so that shard allocation and state publication work correctly. Then restore HTTP client access. Attempting to fix HTTP while the cluster is split risks conflicting state across nodes and makes diagnosis harder.
Prevention
- Poll GET /_ssl/certificates at least daily and alert when any certificate is within 7 days of expiry.
- Maintain separate tracking for transport and HTTP expiry dates.
- Renew certificates before they expire. Elasticsearch does not support extending an existing certificate’s validity period.
- Test the renewal and restart procedure in a staging cluster.
- Keep a fallback access channel that does not depend on the HTTP TLS path, such as direct host access or an unproxied localhost connection, for diagnosis during lockout.
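The daily poll could be sketched as a small filter suitable for cron. It assumes GNU date and that each expiry value is an ISO-8601 timestamp, as described for the API earlier in this guide. The helper name flag_expiring is hypothetical; the jq filter in the usage comment converts the API response into tab-separated path and expiry pairs.

```shell
#!/usr/bin/env bash
# flag_expiring THRESHOLD_DAYS [NOW_EPOCH]
# Reads "<path><TAB><expiry>" lines on stdin and prints every certificate
# with fewer than THRESHOLD_DAYS whole days left. Returns 1 if any
# certificate was flagged, so a scheduler can alert on the exit status.
# Requires GNU date for `-d`.
flag_expiring() {
  local threshold="$1" now="${2:-$(date -u +%s)}" tab path expiry end left rc=0
  tab=$(printf '\t')
  while IFS="$tab" read -r path expiry; do
    # Skip lines whose expiry cannot be parsed.
    end=$(date -u -d "$expiry" +%s 2>/dev/null) || continue
    left=$(( (end - now) / 86400 ))
    if [ "$left" -lt "$threshold" ]; then
      printf 'EXPIRING %s %s (%d days left)\n' "$path" "$expiry" "$left"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (endpoint and credentials follow the quick checks in this guide):
#   curl -s -u "$ES_USER:$ES_PASS" --cacert "$ES_CACERT" \
#     'https://localhost:9200/_ssl/certificates' \
#     | jq -r '.[] | [.path, .expiry] | @tsv' \
#     | flag_expiring 7
```

Because the path is printed with each hit, transport and HTTP certificates stay distinguishable in the alert text.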
How Netdata helps
Netdata health alerts for node reachability and cluster health surface node-count drops caused by transport-layer fragmentation. Per-node HTTP check failures on port 9200 distinguish HTTP certificate expiry from application crashes. JVM heap and disk metrics help rule out resource-pressure cascades that mimic fragmentation. Netdata can also monitor the TLS certificate notAfter date on the HTTP endpoint directly, which works even when the REST API is locked out. Set alert thresholds at 7 days and 24 hours.
Related guides
- Elasticsearch all shards failed: diagnosing search_phase_execution_exception
- Elasticsearch CircuitBreakingException: [parent] Data too large - causes and fixes
- Elasticsearch cluster_block_exception: blocked by, the read-only blocks explained
- Elasticsearch cluster health red: unassigned primaries and how to recover
- Elasticsearch cluster health yellow: unassigned replicas vs real allocation blocks
- Elasticsearch cluster state too large: field count, index count, and per-node heap
- Elasticsearch disk full: emergency recovery and freeing space safely
- Elasticsearch disk watermark cascade: from low watermark to cluster-wide read-only
- Elasticsearch document indexing failures: index_failed, bulk item errors, and version conflicts
- Elasticsearch EsRejectedExecutionException: write thread pool rejections and HTTP 429
- Elasticsearch fielddata circuit breaker tripped: text-field aggregations and the keyword fix
- Elasticsearch FORBIDDEN/12/index read-only / allow delete (api) — flood stage recovery







