You’ve deployed your application to Kubernetes, and everything seems fine until a user reports the dreaded 502 Bad Gateway or 504 Gateway Timeout. In a traditional VM setup, your first step would be to SSH into the server and check the NGINX configuration and logs. But in the ephemeral, abstracted world of Kubernetes, the problem is rarely that simple. A gateway error from the NGINX Ingress Controller is often a symptom of a deeper issue within the cluster’s networking or health management systems.
These errors signal that the NGINX Ingress Controller, which acts as the front door to your cluster, could not get a valid or timely response from the service it was trying to reach. This “upstream” isn’t a single IP address anymore; it’s a dynamic set of Pods orchestrated by Kubernetes. Understanding how to debug this complex interaction between the Ingress, Services, and Pod health probes is essential for any engineer running applications on Kubernetes.
The Journey of a Request and Where It Fails
To effectively troubleshoot a Kubernetes 502 or 504 error, you must first visualize the path a request takes to reach your application.
1. User to Load Balancer: The user’s request hits an external Load Balancer (provided by your cloud, like an AWS ELB/ALB or a GKE Load Balancer).
2. Load Balancer to Ingress: The Load Balancer forwards traffic to one of the NGINX Ingress Controller pods running on your cluster nodes.
3. Ingress to Service: The NGINX pod inspects the request’s host and path, consults the Ingress resource rules, and determines which Kubernetes Service to route the request to.
4. Service to Pod: The Kubernetes Service, which acts as an internal load balancer, selects a healthy, ready Pod from its list of endpoints and forwards the traffic.
5. Pod Processes Request: The container inside the selected Pod receives the request and processes it.
A 502 or 504 error occurs when there’s a breakdown at step 4 or 5. NGINX tried to complete its job but the upstream—the combination of the Service and its backing Pods—failed. Let’s examine the three main culprits.
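If you want to see that chain as actual cluster objects, kubectl can list each hop side by side. A minimal sketch, assuming a hypothetical app labeled app=my-app in a my-namespace namespace and a default ingress-nginx install:

```bash
# List the objects involved in the request path for a hypothetical app.
# Namespace and label (app=my-app) are placeholders -- substitute your own.
kubectl get ingress,service,endpoints -n my-namespace
kubectl get pods -n my-namespace -l app=my-app -o wide

# The Ingress Controller itself usually runs in its own namespace.
kubectl get pods -n ingress-nginx
```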
Culprit #1: The NGINX Ingress Controller and Its Configuration
While the Ingress Controller itself might be running fine, its configuration can be a primary source of timeouts and bad gateway errors.
Misconfigured Ingress Rules
The most basic error is a typo in your Ingress resource definition. If you specify the wrong serviceName or servicePort, NGINX won’t be able to find a valid destination. You can inspect your Ingress resource definition to check the Backends section. If it shows an error or indicates that endpoints for the service were not found, you have a direct clue that the service name is wrong or the service itself has no ready pods.
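A quick way to check is kubectl describe, which prints the backends the controller has resolved for each rule. A sketch, using a hypothetical Ingress called my-app in my-namespace:

```bash
# Show the rules and resolved backends for a hypothetical Ingress.
# An error or "<none>" in the Backends column is the clue you are looking for.
kubectl describe ingress my-app -n my-namespace

# Cross-check the raw spec: does the backend service name/port actually exist?
kubectl get ingress my-app -n my-namespace -o yaml
kubectl get service -n my-namespace
```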
Ingress Controller Timeouts
Just like a standard NGINX setup, the Ingress Controller has timeout settings. If your application needs more than the default 60 seconds to process a request, you’ll see a 504 Gateway Timeout. These are often configured globally in the NGINX Ingress Controller’s ConfigMap. A better practice for specific long-running endpoints is to set them on a per-Ingress basis using annotations, such as nginx.ingress.kubernetes.io/proxy-read-timeout. This is a common fix for 504 errors, but it should be applied judiciously.
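As a rough sketch, the per-Ingress override can live in the manifest’s annotations or be applied with kubectl annotate; the Ingress name below is hypothetical and the values are seconds:

```bash
# Raise the proxy timeouts for one long-running endpoint's Ingress only,
# instead of changing the controller-wide ConfigMap. Values are in seconds.
kubectl annotate ingress my-slow-api -n my-namespace --overwrite \
  nginx.ingress.kubernetes.io/proxy-read-timeout="120" \
  nginx.ingress.kubernetes.io/proxy-send-timeout="120"

# Verify the annotations were picked up.
kubectl get ingress my-slow-api -n my-namespace \
  -o jsonpath='{.metadata.annotations}'
```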
Checking Ingress Logs
The logs from the Ingress Controller pod are your best friend. They will tell you exactly why NGINX failed. After finding your Ingress Controller pod, you can stream its logs and look for error messages like:
- connect() failed (111: Connection refused) while connecting to upstream: This often means the targetPort on the Service doesn’t match the containerPort on the Pod, or the application inside the container isn’t listening on that port.
- upstream timed out (110: Connection timed out) while reading response header from upstream: This is the classic 504 error. The application received the request but took too long to respond.
- no endpoints available for <namespace>/<service-name>: This is a critical message. It means NGINX asked Kubernetes for the list of IPs for a Service and got an empty list back. This leads us to our next culprit.
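Something like the following locates the controller pod and streams its logs; the ingress-nginx namespace, label, and deployment name match a typical default install and may differ in yours:

```bash
# Find the Ingress Controller pod(s). The namespace and label below are the
# defaults for a standard ingress-nginx install; adjust for your setup.
kubectl get pods -n ingress-nginx \
  -l app.kubernetes.io/name=ingress-nginx

# Stream the controller logs and filter for the failure signatures above.
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller -f \
  | grep -E 'connect\(\) failed|upstream timed out|no endpoints'
```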
Culprit #2: The Kubernetes Service
A Kubernetes Service provides a stable network identity for a group of Pods. If it’s not configured correctly, it creates a dead end for traffic.
The most common failure mode is a selector mismatch. The Service uses a label selector to identify which Pods belong to it. If no Pods have the labels defined in the Service’s selector, its endpoint list will be empty.
To debug this, first examine the Service definition. Pay close attention to two fields: Selector (the labels the Service is looking for) and Endpoints (the list of IP:Port combinations for Pods that match the selector and are passing readiness probes). If the Endpoints field is empty or shows <none>, you’ve found your problem. You should then check the labels on the Pods you expect to be part of the service and compare them to the Service’s selector. A single typo is all it takes to break the connection.
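A sketch of that comparison, with hypothetical names (my-service, app=my-app):

```bash
# Compare the Service's selector with the labels actually on your Pods.
kubectl describe service my-service -n my-namespace   # check Selector and Endpoints
kubectl get endpoints my-service -n my-namespace      # <none> means no matches
kubectl get pods -n my-namespace --show-labels        # what labels do the Pods carry?

# List only the Pods the selector would actually match; an empty result
# confirms a selector/label mismatch.
kubectl get pods -n my-namespace -l app=my-app
```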
Another common issue is a mismatch between the port and targetPort in the Service definition. The port is what the Service exposes internally, while targetPort is the actual port the container is listening on. If targetPort is wrong, you’ll see Connection refused errors in the NGINX logs.
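One way to confirm the numbers line up, again with hypothetical names, is to pull both sides with jsonpath and compare:

```bash
# What the Service exposes (port) and where it forwards traffic (targetPort).
kubectl get service my-service -n my-namespace \
  -o jsonpath='{range .spec.ports[*]}{.port} -> {.targetPort}{"\n"}{end}'

# What the container declares it listens on. containerPort is informational,
# so also confirm the app inside really binds to the targetPort.
kubectl get pods -n my-namespace -l app=my-app \
  -o jsonpath='{.items[0].spec.containers[*].ports[*].containerPort}{"\n"}'
```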
Culprit #3: Liveness and Readiness Probes
Health probes are Kubernetes’s way of managing pod health, but when misconfigured, they become a silent killer of services, directly causing 502 errors.
The Role of Readiness Probes
The kubelet uses readiness probes to know when a container is ready to accept traffic. The most important thing to understand is this: If a Pod’s readiness probe is failing, Kubernetes removes its IP address from the Service’s list of endpoints. From the NGINX Ingress Controller’s perspective, the Pod has simply vanished. If all Pods for a service fail their readiness probes, the endpoint list becomes empty. When NGINX tries to proxy a request to that Service, it finds no available backends and returns a 502 Bad Gateway. This is one of the most common and confusing sources of 502 errors in Kubernetes. Your Pods might be Running, but if they aren’t Ready, they can’t serve traffic.
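You can watch this happen directly: when readiness fails, the READY column drops to 0/1 while STATUS stays Running, and the Service’s endpoint list empties out. A sketch with hypothetical names:

```bash
# Running but not Ready: READY shows 0/1 even though STATUS is Running.
kubectl get pods -n my-namespace -l app=my-app

# The same failure from the Service's point of view: an empty endpoint list,
# so NGINX has nothing to proxy to and answers 502.
kubectl get endpoints my-service -n my-namespace
```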
The Role of Liveness Probes
The kubelet uses liveness probes to know when to restart a container. If a liveness probe fails, Kubernetes kills the container and attempts to restart it. While the container is restarting, it’s not ready, which again removes it from the Service’s endpoints. If this happens across all your replicas, you’ll experience a brief outage and see 502s. Aggressive liveness probes on an application that is under heavy load or has slow startup times can lead to CrashLoopBackOff scenarios, creating a sustained outage.
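Restart counts and the container’s last terminated state usually reveal whether liveness kills are the trigger; a sketch, with a hypothetical pod name:

```bash
# A climbing RESTARTS count or CrashLoopBackOff status points at liveness kills.
kubectl get pods -n my-namespace -l app=my-app

# Inspect the last terminated state of a specific pod (name is hypothetical).
# Exit code 137 means the container was SIGKILLed, e.g. after liveness failures
# or an out-of-memory kill.
kubectl get pod my-app-7d4b9c-xyz12 -n my-namespace \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'
```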
How to Debug Probe Failures
You can investigate probe issues by describing the pod’s status. Look at the Events section in the output. You’ll see explicit messages if probes are failing, such as Warning Unhealthy Readiness probe failed or Warning Unhealthy Liveness probe failed. This tells you that the problem isn’t with NGINX or the Service, but with the application’s health check itself. The next step is to check the application’s logs to understand why it’s failing the health check. Is it overloaded? Can’t connect to the database? These are application-level problems revealed by the infrastructure.
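A sketch of that investigation, again with a hypothetical pod name:

```bash
# The Events section names the failing probe and the reason (timeout, HTTP 500, etc.).
kubectl describe pod my-app-7d4b9c-xyz12 -n my-namespace

# Or pull only the warning events for the namespace.
kubectl get events -n my-namespace --field-selector type=Warning

# Then read the application's own logs; --previous shows the crashed container
# if it has already been restarted.
kubectl logs my-app-7d4b9c-xyz12 -n my-namespace --previous
```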
A Systematic Troubleshooting Workflow
When faced with a 502 or 504, follow this process from the outside in:
1. Check Ingress Status: Start by describing your Ingress resource. Check for errors, correct backends, and relevant timeout annotations.
2. Check Ingress Controller Logs: Look for logs from the controller pod itself. Filter for error messages related to your service, such as connect(), timed out, or no endpoints available.
3. Check the Service Endpoints: Describe your Service resource. Is the Endpoints list populated or is it empty? Verify the Selector and targetPort.
4. Check Pod Status: Get the pods for your service. Are they in the Running state? Look at their READY status (e.g., 1/1 vs 0/1) and their RESTARTS count.
5. Check Pod Events: If a pod is not ready or is restarting, describe the pod. The Events section will almost always tell you about probe failures or other startup issues.
6. Check Application Logs: Finally, check the application’s own logs to find the root cause of the probe failure or slow response.
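Condensed into commands, the outside-in workflow looks roughly like this; every name (my-app, my-service, the namespaces, the pod name) is a placeholder for your own resources:

```bash
# 1. Ingress: rules, resolved backends, timeout annotations.
kubectl describe ingress my-app -n my-namespace

# 2. Controller logs: connection refused, timeouts, missing endpoints.
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200 \
  | grep my-service

# 3. Service: selector, port/targetPort, and whether Endpoints is populated.
kubectl describe service my-service -n my-namespace

# 4. Pods: Running vs Ready, restart counts.
kubectl get pods -n my-namespace -l app=my-app

# 5. Pod events: probe failures and other startup issues.
kubectl describe pod my-app-7d4b9c-xyz12 -n my-namespace

# 6. Application logs: the root cause behind the probe failure or slow response.
kubectl logs my-app-7d4b9c-xyz12 -n my-namespace
```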
Gateway errors in Kubernetes can be daunting because they span multiple interconnected components. However, by methodically tracing the request path and inspecting each component—Ingress, Service, and Pods (especially their health probes)—you can systematically isolate the source of the failure. These errors are not just NGINX issues; they are signals about the health and configuration of your entire application stack within the cluster.
To accelerate this debugging process, you need visibility across all these layers at once. A powerful monitoring solution can correlate NGINX error rates with Pod resource utilization, probe success rates, and application performance metrics, turning hours of manual checks into a single, intuitive dashboard. Get started with Netdata for free and see how real-time, high-granularity monitoring can transform your Kubernetes troubleshooting workflow.