Debugging a 502 Bad Gateway Error: A Real-Time Scenario and Step-by-Step Fix

Arvind Kumar
Feb 5, 2025


Introduction

Encountering a 502 Bad Gateway error can be frustrating, especially when it disrupts critical APIs in a production environment. Recently, I faced this exact issue while working on a GraphQL API in a microservices architecture. In this article, I’ll walk you through the real-world scenario, the investigation process, and the fix.

The Scenario

While integrating the orderByUuid API in our microservices-based application, I encountered the following error:

Error while calling orderByUuid API for orderUuid: ca0acc78-3197-4152-a11c-cbc10840e5be, 
error: 502 Bad Gateway on POST request for
"http://ecom-order-service.prod.codefarm.services/ecom-order-service/graphql":
"<html><head><title>502 Bad Gateway</title></head><body><center><h1>502 Bad Gateway</h1></center></body></html>"

This error occurred intermittently, which made debugging even trickier.

Step-by-Step Investigation

1. Understanding What a 502 Error Means

A 502 Bad Gateway indicates that the server acting as a gateway (like a load balancer or reverse proxy) received an invalid response from an upstream server. In our case, the API Gateway couldn’t get a valid response from the ecom-order-service.
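To picture the moving parts: the client talks to the gateway, and the gateway talks to the service. In a minimal nginx-style configuration (purely illustrative; our actual gateway config is not shown here), the proxying looks like this:

upstream ecom_order_service {
    server 10.0.1.15:8080;  # hypothetical service instance
}

server {
    listen 80;
    location /ecom-order-service/ {
        # nginx returns 502 to the client when this upstream
        # fails to deliver a valid response
        proxy_pass http://ecom_order_service;
    }
}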

2. Replicating the Issue

To ensure the problem wasn’t on the client side, I tried to replicate the request using cURL:

curl -X POST -H "Content-Type: application/json" \
-d '{"query": "{ orderByUuid(uuid: \"ca0acc78-3197-4152-a11c-cbc10840e5be\") { id, title, content } }"}' \
http://ecom-order-service.prod.codefarm.services/ecom-order-service/graphql

The curl above succeeded every single time, which only added to the confusion and frustration.

Eventually, we reproduced the error in a test environment by running a load test with production-like traffic on a production-like setup.
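For reference, a load test along those lines can be driven with a tool like ApacheBench; the payload file, URL, and numbers below are illustrative rather than our actual test plan:

# query.json contains the same GraphQL body used in the curl above
ab -n 10000 -c 100 -p query.json -T 'application/json' \
  http://ecom-order-service.test.codefarm.services/ecom-order-service/graphql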

3. Checking API Gateway Logs

Next, I dived into the API Gateway logs. The logs showed the following:

upstream timed out (110: Connection timed out) while connecting to upstream, 
client: 10.20.30.40, server: ecom-order-service,
request: "POST /ecom-order-service/graphql HTTP/1.1"

This indicated that the gateway request timed out while waiting for a response from the upstream service.
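If your gateway is nginx-based, a quick way to gauge how frequently this is happening is to count the timeout entries (the log path below is a common default and may differ in your setup):

grep -c "upstream timed out" /var/log/nginx/error.log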

4. Verifying the Health of the Upstream Service

I checked the health status of the ecom-order-service:

curl http://ecom-order-service.prod.codefarm.services/actuator/health

Surprisingly, it returned “UP”. This suggested that the service was running, but something inside was causing delays.
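A passing health check only proves the service is alive, not that it is fast. If the Spring Boot Actuator metrics endpoint happens to be exposed (an assumption about this setup), per-endpoint latency is a much better signal:

# Shows count, total time, and max for handled HTTP requests
curl http://ecom-order-service.prod.codefarm.services/actuator/metrics/http.server.requests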

5. Diving into Service Logs

I examined the service logs for errors or exceptions:

tail -f /var/log/ecom-order-service/application.log

I found repeated instances of this log:

WARN: Slow query detected - execution time: 12s 
Query: SELECT * FROM orders WHERE uuid = 'ca0acc78-3197-4152-a11c-cbc10840e5be'

Bingo! The service was taking too long to fetch data from the database, exceeding the API Gateway’s timeout threshold.

6. Analyzing Database Performance

Connecting to the database, I ran an EXPLAIN ANALYZE on the problematic query:

EXPLAIN ANALYZE SELECT * FROM orders WHERE uuid = 'ca0acc78-3197-4152-a11c-cbc10840e5be';

The output revealed a full table scan, even though uuid was supposed to be the primary key (which would normally carry an index automatically). Digging further, I realized the uuid column had lost its index during a recent schema migration.
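If you are on PostgreSQL (an assumption here; the catalog views differ on other databases), you can confirm which indexes actually exist on the table:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';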

The Fix

✅ Step 1: Add Missing Index

CREATE INDEX idx_order_uuid ON orders(uuid);

This drastically reduced query execution time from 12 seconds to under 50 ms.
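A side note: since uuid is meant to uniquely identify an order, a unique index is the better fit, and on PostgreSQL it can be built without blocking writes to a live table. A sketch, assuming no duplicate uuids exist:

-- CONCURRENTLY avoids locking out writes during the build (PostgreSQL)
CREATE UNIQUE INDEX CONCURRENTLY idx_order_uuid ON orders (uuid);

Re-running the EXPLAIN ANALYZE afterwards should show an index scan in place of the full table scan.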

✅ Step 2: Adjust API Gateway Timeout (Temporary Fix)

While the database fix was being deployed, I increased the API Gateway timeout as a temporary measure:

proxy_read_timeout 30s;
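For context, that directive sits inside the gateway’s proxied location block. Here is a minimal, hypothetical example (values are illustrative):

location /ecom-order-service/ {
    proxy_pass            http://ecom_order_service;
    proxy_connect_timeout 5s;    # time allowed to establish the upstream connection
    proxy_read_timeout    30s;   # time allowed between reads of the upstream response
}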

✅ Step 3: Improve Monitoring

To avoid blind spots in the future, I:

  • Added custom metrics for database query performance using Prometheus (see the sketch after this list).
  • Set up Grafana dashboards to monitor response times.
  • Created alerts for slow queries and API timeouts.
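As an example of the first point, here is a minimal sketch of timing a database call with Micrometer, the metrics library behind Spring Boot Actuator; the class, metric name, and tag are hypothetical, not our production code:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;

public class QueryTimer {

    private final Timer timer;

    public QueryTimer(MeterRegistry registry) {
        // Exposed to Prometheus via /actuator/prometheus when that endpoint is enabled
        this.timer = Timer.builder("db.query.duration")
                .tag("query", "orderByUuid")  // lets Grafana slice latency per query
                .register(registry);
    }

    // Wrap any database call so its latency is recorded
    public <T> T time(Supplier<T> dbCall) {
        return timer.record(dbCall);
    }
}

Wrapping the orderByUuid repository call with a timer like this lets the Grafana dashboard and the slow-query alert work off the same metric.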

Key Learnings

  1. 502 errors are often not client-side issues — start by checking upstream dependencies.
  2. Database performance issues can silently cause API failures.
  3. Proper indexing is critical, especially for frequently queried fields.
  4. Monitoring and alerts can help detect performance degradation before it impacts production.

Conclusion

Debugging 502 Bad Gateway errors requires a methodical approach:

  • Replicate the issue,
  • Check logs at every layer,
  • Analyze service dependencies,
  • And implement long-term monitoring solutions.

I hope this real-world walkthrough helps you tackle similar issues effectively. If you’ve faced any interesting debugging challenges, share them in the comments; I’d love to hear your experiences!

Written by Arvind Kumar

Staff Engineer @Chegg || Passionate about technology || https://youtube.com/@codefarm0