The alert notification chimes, and your dashboard lights up with an ELB 5xx error rate spike. Your application, running behind an AWS Application Load Balancer (ALB), is suddenly returning 502, 503, or 504 errors. This is a high-stakes scenario where every second of downtime impacts users. The load balancer is often just the messenger; the real culprit lies somewhere in your backend infrastructure. Is a target group unhealthy? Are your connections timing out? Is a misconfigured deployment process wreaking havoc?
Effective ALB troubleshooting requires a systematic approach. You need to quickly move past the symptom—the 5xx error—and diagnose the root cause. This guide provides a structured methodology for investigating these critical errors, focusing on the three most common areas of failure: target health, connection management, and session configuration. By understanding how these components interact, you can dramatically reduce your mean time to resolution (MTTR) and build more resilient systems.
Decoding Target Group Health: The Root of Most 5xx Errors
The most common reason for an ALB to return a 5xx error is that it has no healthy targets to send traffic to. When a target fails its health checks, the ALB marks it as unhealthy and stops routing new requests to it. If all targets in a group become unhealthy, the load balancer has no choice but to return an HTTP 503 Service Unavailable error. An unhealthy target count is your first and most important clue.
Why Do Health Checks Fail?
A failed target health check can happen for several reasons. When investigating, you need to verify that the ALB can successfully communicate with your backend instances on the specified health check port and path (see the API sketch after the list below).
Here are the primary culprits:
- Network and Security Misconfigurations: This is the number one cause. Ensure that the security group attached to your target instances explicitly allows traffic from the security group of your ALB on the health check port. Likewise, check that your Network Access Control Lists (NACLs) allow this traffic in both directions. The load balancer uses its private IP addresses for health checks, not its public-facing ones.
- Application Not Responding Correctly:
  - The application server (e.g., Nginx, Apache, Tomcat) is not running or has crashed.
  - The application is not listening on the port defined in the target group’s health check configuration.
  - The health check path (e.g., /health) is incorrect or does not return a 200 OK status code within the configured timeout. A slow target response time on this endpoint can cause the check to fail.
- Resource Exhaustion: The target instance may be overloaded. High CPU utilization, memory exhaustion, or a lack of available network connections can prevent the application from responding to health checks in time.
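To see which of these failure modes you are dealing with, the Elastic Load Balancing API reports a reason code alongside each unhealthy target. Here is a minimal sketch using boto3, assuming your AWS credentials are already configured; the target group ARN is a placeholder:

```python
# Sketch: list unhealthy targets and the reason the ALB reports for each.
# The target group ARN below is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")

response = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/abcdef0123456789"
)

for desc in response["TargetHealthDescriptions"]:
    health = desc["TargetHealth"]
    if health["State"] != "healthy":
        # Reason codes such as Target.FailedHealthChecks, Target.Timeout, or
        # Target.ResponseCodeMismatch narrow down which culprit above applies.
        print(
            desc["Target"]["Id"],
            health["State"],
            health.get("Reason"),
            health.get("Description"),
        )
```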
Using ELB CloudWatch Metrics and Access Logs for Clues
AWS provides two essential tools for diagnosing target health issues: CloudWatch metrics and access logs.
Your first stop should be the ELB CloudWatch metrics for the specific load balancer. Look for these key indicators:
- UnHealthyHostCount: A non-zero or increasing value here is a direct signal that targets are failing health checks.
- HealthyHostCount: A corresponding drop in this metric confirms that instances are being removed from service.
- TargetConnectionErrorCount: This metric counts connections that were not successfully established between the load balancer and its targets. A spike here often points to network-level issues like security group denials or TCP handshake failures.
- HTTPCode_Target_5XX_Count: This tracks 5xx errors generated by your application targets themselves, as opposed to the load balancer. If this metric is high, the problem is almost certainly within your application code.
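If you prefer to pull these numbers programmatically rather than through the console, a minimal boto3 sketch might look like this; the LoadBalancer and TargetGroup dimension values are placeholders copied from the respective ARNs:

```python
# Sketch: pull the last hour of UnHealthyHostCount for one target group.
# Dimension values are placeholders; use the suffixes of your real ARNs.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"},
        {"Name": "TargetGroup", "Value": "targetgroup/my-targets/abcdef0123456789"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```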
For more granular detail, enable access logging on the load balancer and analyze the logs. Each entry records valuable information about a request, including both the ELB and target status codes.
```
http 2023-10-27T10:20:15.123Z app/my-alb/12345abcdef 192.0.2.1:45678 - -1 -1 -1 502 - 29 35 "GET http://example.com:80/ HTTP/1.1" ...
```
In this log, the elb_status_code is 502 and the target_processing_time is -1. This combination often means the ALB tried to send the request to a target but failed, either because the target was unhealthy, closed the connection unexpectedly, or failed to respond, leading to an AWS ELB 502 Bad Gateway error.
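Once the access logs have been delivered to S3, you can scan them for this exact signature. Here is a minimal sketch, assuming the log file has already been downloaded and gunzipped; the file name is a placeholder and the field positions follow the documented ALB access log layout:

```python
# Sketch: find requests where the ALB returned 502 with no target response.
import shlex

with open("alb-access.log") as log:
    for line in log:
        fields = shlex.split(line)
        if len(fields) < 13:
            continue
        target_processing_time = fields[6]   # -1 means no response was received
        elb_status_code = fields[8]
        if elb_status_code == "502" and target_processing_time == "-1":
            # fields[1] is the timestamp, fields[3] is client:port,
            # fields[12] is the quoted request line.
            print(fields[1], fields[3], fields[12])
```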
The Silent Killers: Connection Management Issues
If your targets appear healthy but you’re still seeing 5xx errors, the next place to look is connection management. How the ALB handles connections to your targets—especially idle connections and during deployments—can be a subtle source of errors, most notably the ALB 504 Gateway Timeout.
A 504 error means the load balancer successfully established a connection with a target but did not receive a response before the connection idle timeout was reached.
The Idle Timeout Trap and Deregistration Delay
Every ALB has a configurable idle timeout, with a default of 60 seconds. If a backend operation, like generating a complex report or processing a large file upload, takes longer than this timeout, the ALB will terminate the connection and return a 504 error to the client.
A related and often overlooked issue is the keep-alive timeout on your application server. Best practice dictates that the application’s keep-alive timeout should be greater than the ALB’s idle timeout. If the application server closes an idle connection before the ALB does, the ALB might try to send a new request over that now-closed connection, resulting in a 502 Bad Gateway error.
Equally important is the deregistration delay, also known as the connection draining timeout. This setting determines how long the ALB will wait for in-flight requests to complete on an instance that is being deregistered (e.g., during a deployment or scale-in event). If this value is too low, the ALB will abruptly cut off long-running requests, causing them to fail with a 5xx error. You should set this value to be slightly longer than the maximum expected request completion time.
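Both timeouts are ordinary attributes that can be adjusted through the API. A minimal boto3 sketch is shown below; the ARNs and the 120-second values are placeholders and should be sized against your slowest legitimate request:

```python
# Sketch: raise the ALB idle timeout and the target group deregistration
# delay so they comfortably cover the longest expected request.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "120"}],
)

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/abcdef0123456789",
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "120"}],
)
```

Remember to raise the application server's keep-alive timeout above whatever idle timeout you choose here.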
Warming Up New Instances with Slow Start Mode
Applications that need to perform initialization tasks, such as loading large caches into memory, can be overwhelmed if they receive a full share of traffic immediately after passing an initial health check. This can cause them to become unresponsive, fail subsequent health checks, and trigger an error spike.
To prevent this, you can enable ALB slow start mode. This feature allows you to define a ramp-up period during which the ALB will gradually send an increasing amount of traffic to the newly registered instance, giving it time to “warm up” and prepare to handle its full load gracefully.
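Slow start is also a target group attribute. A minimal sketch enabling a two-minute ramp-up follows; the ARN is a placeholder, and valid durations are 30 to 900 seconds (0 disables the feature):

```python
# Sketch: give newly registered targets a 2-minute ramp-up period.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/abcdef0123456789",
    Attributes=[{"Key": "slow_start.duration_seconds", "Value": "120"}],
)
```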
The Perils of Misconfigured Sticky Sessions
Stateful applications sometimes require that a user’s session remains on a single target instance. ELB sticky sessions (target group stickiness) achieve this by using a cookie to route all requests from a specific client to the same target.
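Stickiness itself is configured as a set of target group attributes. Here is a minimal sketch enabling load-balancer-generated cookies with a one-hour lifetime; the ARN and the duration are placeholders:

```python
# Sketch: enable lb_cookie stickiness with a 1-hour cookie lifetime.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/abcdef0123456789",
    Attributes=[
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "3600"},
    ],
)
```

A shorter cookie duration limits how long any single client can stay pinned to a failing target.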
While useful, stickiness introduces a new failure mode. If a client is “stuck” to a target instance that subsequently becomes unhealthy or is deregistered, the ALB will continue to route that client’s requests to the failing instance. The result is a persistent stream of 5xx errors for that specific user, which continues until the stickiness cookie expires. This can be difficult to diagnose with aggregate metrics, as it may only affect a small percentage of your users.
When troubleshooting, check your access logs for patterns where the same client IP is repeatedly routed to a known unhealthy target. The long-term architectural solution is often to move session state out of the application instances and into an external distributed store like Amazon ElastiCache, making your targets truly stateless.
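One way to surface that pattern is to count 5xx responses per client/target pair. A rough sketch, again assuming the access log has already been downloaded locally (the file name is a placeholder):

```python
# Sketch: count 5xx responses per (client IP, target) pair to spot clients
# stuck to a failing target.
import shlex
from collections import Counter

stuck = Counter()
with open("alb-access.log") as log:
    for line in log:
        fields = shlex.split(line)
        if len(fields) < 10:
            continue
        client = fields[3].rsplit(":", 1)[0]  # strip the client port
        target = fields[4]
        elb_status = fields[8]
        if elb_status.startswith("5") and target != "-":
            stuck[(client, target)] += 1

for (client, target), count in stuck.most_common(10):
    print(f"{count:5d}  {client} -> {target}")
```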
Proactive Troubleshooting with Netdata
While CloudWatch metrics and access logs are indispensable for post-mortem ALB troubleshooting, they are inherently reactive. You’re often analyzing what happened minutes or hours ago. To truly get ahead of these issues, you need real-time, high-granularity visibility into the targets themselves.
Netdata provides an unparalleled level of insight by automatically collecting thousands of per-second metrics from your target instances. This allows you to correlate an ELB 5xx surge directly with the performance of the underlying systems in real-time.
Imagine seeing an UnHealthyHostCount spike in CloudWatch. With Netdata, you can immediately pivot to the exact timeframe on your instance dashboards and answer critical questions:
- Did the application process crash? Netdata’s application monitoring can show you process uptime, CPU and memory usage, and open file descriptors.
- Was the instance CPU-bound? You can see a per-core breakdown of CPU usage (user, system, iowait) to identify bottlenecks.
- Did the instance run out of memory? Netdata provides detailed memory analysis, including application-specific usage, helping you spot memory leaks before they trigger the OOM killer.
- Is there a network issue? Monitor TCP connection states, packet drops, and network interface errors in real-time.
Netdata’s powerful correlation capabilities mean you stop guessing and start seeing. Instead of manually parsing logs, you have a unified, real-time view of your entire stack, from the load balancer’s perspective down to the kernel-level activity on each target. This transforms troubleshooting from a reactive exercise into a proactive discipline, enabling you to fix issues before they impact your users.
To move from reacting to 5xx errors to preventing them, explore what Netdata can do for your AWS infrastructure.