Debugging a 502 Bad Gateway Error: A Real-Time Scenario and Step-by-Step Fix
Introduction
Encountering a 502 Bad Gateway error can be frustrating, especially when it disrupts critical APIs in a production environment. Recently, I faced this exact issue while working on a GraphQL API integrated into a microservices architecture. In this article, I'll walk you through the real-time scenario, the investigation process, and the fix.
The Scenario
While integrating the orderByUuid API in our microservices-based application, I encountered the following error:
Error while calling orderByUuid API for orderUuid: ca0acc78-3197-4152-a11c-cbc10840e5be,
error: 502 Bad Gateway on POST request for
"http://ecom-order-service.prod.codefarm.services/ecom-order-service/graphql":
"<html><head><title>502 Bad Gateway</title></head><body><center><h1>502 Bad Gateway</h1></center></body></html>"
This error occurred intermittently, which made debugging even trickier.
Step-by-Step Investigation
1. Understand What a 502 Error Means
A 502 Bad Gateway indicates that a server acting as a gateway (like a load balancer or reverse proxy) received an invalid response from an upstream server. In our case, the API Gateway couldn't get a valid response from the ecom-order-service.
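To make that relationship concrete, here is a minimal NGINX-style sketch of a gateway fronting the order service. This is a hypothetical, simplified config with made-up backend addresses and timeouts, not our actual gateway configuration:
# Simplified reverse-proxy sketch: the gateway forwards requests to the
# upstream order service and answers 502 itself when that hop fails.
upstream ecom_order_service {
    server ecom-order-service-backend:8080;   # hypothetical backend address
}

server {
    listen 80;

    location /ecom-order-service/ {
        proxy_pass http://ecom_order_service;
        proxy_connect_timeout 5s;    # time allowed to open the upstream connection
        proxy_read_timeout    10s;   # time allowed to wait for the upstream response
    }
}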
2. Replicating the Issue
To ensure the problem wasn’t on the client side, I tried to replicate the request using cURL:
curl -X POST -H "Content-Type: application/json" \
-d '{"query": "{ orderByUuid(uuid: \"ca0acc78-3197-4152-a11c-cbc10840e5be\") { id, title, content } }"}' \
http://ecom-order-service.prod.codefarm.services/ecom-order-service/graphql
The curl above worked every single time, which only added to the confusion and frustration.
Eventually, we reproduced the error in the test environment by running a load test with production-like traffic on a prod-like setup.
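For reference, that kind of load can be approximated with a simple tool like Apache Bench. The payload file name, request count, and concurrency below are illustrative, not our exact production profile:
# payload.json (hypothetical file) contains the same GraphQL query used in the curl above.
# 2000 requests with 100 concurrent clients is an illustrative load, not our prod profile.
ab -n 2000 -c 100 -p payload.json -T "application/json" \
   http://ecom-order-service.prod.codefarm.services/ecom-order-service/graphql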
3. Checking API Gateway Logs
Next, I dug into the API Gateway logs, which showed the following:
upstream timed out (110: Connection timed out) while connecting to upstream,
client: 10.20.30.40, server: ecom-order-service,
request: "POST /ecom-order-service/graphql HTTP/1.1"
This indicated that the gateway request timed out while waiting for a response from the upstream service.
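A quick way to watch for these entries as they happen, assuming an NGINX-based gateway with the default log location (adjust the path for your setup):
# Follow the gateway error log and surface only upstream timeout entries
tail -f /var/log/nginx/error.log | grep "upstream timed out"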
4. Verifying the Health of Upstream Service
I checked the health status of the ecom-order-service:
curl http://ecom-order-service.prod.codefarm.services/actuator/health
Surprisingly, it returned “UP”. This suggested that the service was running, but something inside was causing delays.
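The /actuator/health path suggests Spring Boot Actuator; with its default configuration the endpoint returns a minimal body like the one below, which confirms liveness but says nothing about how slowly requests are actually being served:
{"status":"UP"}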
5. Diving into Service Logs
I examined the service logs for errors or exceptions:
tail -f /var/log/ecom-order-service/application.log
I found repeated instances of this log:
WARN: Slow query detected - execution time: 12s
Query: SELECT * FROM orders WHERE uuid = 'ca0acc78-3197-4152-a11c-cbc10840e5be'
Bingo! The service was taking too long to fetch data from the database, exceeding the API Gateway’s timeout threshold.
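To gauge how frequent the problem was (a one-off versus a pattern), a quick count of those warnings in the same log file helps:
# Count slow-query warnings and peek at the most recent ones
grep -c "Slow query detected" /var/log/ecom-order-service/application.log
grep "Slow query detected" /var/log/ecom-order-service/application.log | tail -n 20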
6. Analyzing Database Performance
Connecting to the database, I ran an EXPLAIN ANALYZE on the problematic query:
EXPLAIN ANALYZE SELECT * FROM orders WHERE uuid = 'ca0acc78-3197-4152-a11c-cbc10840e5be';
The output revealed a full table scan, even though uuid was supposed to be a primary key. Digging further, I found that the uuid column had lost its index (along with the key constraint) in a recent schema migration.
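Assuming a PostgreSQL database (the EXPLAIN ANALYZE output format pointed that way), the missing index can be confirmed straight from the catalog:
-- List every index on the orders table; in this scenario no index covering uuid
-- would show up, matching the full-table-scan behaviour.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';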
The Fix
✅ Step 1: Add Missing Index
CREATE INDEX idx_order_uuid ON orders(uuid);
This drastically reduced query execution time from 12 seconds to under 50 ms.
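If you need to apply the same fix on a busy production table, a PostgreSQL-flavoured variant builds the index without blocking writes and also enforces uniqueness of the uuid. This is a sketch under the same PostgreSQL assumption as above; note that CONCURRENTLY cannot run inside a transaction block:
-- Build the index without taking a long write lock on the table;
-- UNIQUE also prevents duplicate order UUIDs from creeping in.
CREATE UNIQUE INDEX CONCURRENTLY idx_order_uuid ON orders (uuid);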
✅ Step 2: Adjust API Gateway Timeout (Temporary Fix)
While the database fix was being deployed, I increased the API Gateway timeout as a temporary measure:
proxy_read_timeout 30s;
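For an NGINX-based gateway, the change takes effect after a config check and a graceful reload (other proxies have their own equivalents). Keep in mind this only buys headroom; the index above is the real fix.
# Validate the updated configuration, then reload workers without dropping traffic
nginx -t && nginx -s reload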
✅ Step 3: Improve Monitoring
To avoid blind spots in the future, I:
- Added custom metrics for database query performance using Prometheus.
- Set up Grafana dashboards to monitor response times.
- Created alerts for slow queries and API timeouts (a sample alert rule is sketched below).
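As an illustration of the alerting piece, a Prometheus rule along these lines flags slow responses before they hit the gateway timeout. The metric name assumes Spring Boot's default Micrometer HTTP metrics, and the uri label value and thresholds are examples you would adapt to your own setup:
groups:
  - name: ecom-order-service-latency
    rules:
      - alert: SlowGraphqlResponses
        # Average GraphQL latency over the last 5 minutes; the metric name assumes
        # Spring Boot's default Micrometer http_server_requests_seconds metrics.
        expr: |
          rate(http_server_requests_seconds_sum{uri="/graphql"}[5m])
            / rate(http_server_requests_seconds_count{uri="/graphql"}[5m]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ecom-order-service GraphQL latency averaging above 5s"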
Key Learnings
- 502 errors are often not client-side issues — start by checking upstream dependencies.
- Database performance issues can silently cause API failures.
- Proper indexing is critical, especially for frequently queried fields.
- Monitoring and alerts can help detect performance degradation before it impacts production.
Conclusion
Debugging 502 Bad Gateway errors requires a methodical approach:
- Replicate the issue,
- Check logs at every layer,
- Analyze service dependencies,
- And implement long-term monitoring solutions.
I hope this real-time scenario helps you tackle similar issues effectively. If you’ve faced any interesting debugging challenges, share them in the comments — I’d love to hear your experiences!