Troubleshooting and Resolving High Latency Due to Database Contention

Dec 3, 2024

When a customer-facing application experiences high latency, database contention is a common root cause. Here’s a detailed, step-by-step approach to troubleshooting and resolving this issue, along with a diagram for better understanding.

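The flow of steps explored in this article: identify symptoms and collect metrics, analyze the query patterns, apply immediate mitigation, resolve root causes, then validate and monitor.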

Scenario:

Imagine a food delivery app, QuickBite, where customers report delays in placing orders. The root cause is suspected to be contention in the database that stores order and restaurant information.

Step 1: Identify Symptoms and Collect Metrics

Key Areas to Check:

Query Response Times: Use monitoring tools like New Relic, Datadog, or AWS RDS Performance Insights.
Lock Wait Times: Check if transactions are waiting for locks due to concurrent writes.
Deadlocks: Investigate whether transactions are blocking each other in a cycle, which forces the database to abort one of them.
CPU and IOPS: High CPU or insufficient IOPS can exacerbate contention.

Example:

In QuickBite, queries updating the Orders table show a spike in Lock_wait_time. Multiple users trying to update the same restaurant inventory row cause contention.
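As a rough illustration of this kind of triage, the sketch below ranks tables by their 95th-percentile lock wait. The sample records, their layout, and the numbers are hypothetical; in practice the values would come from whichever monitoring tool or slow query log is in use.

```python
import math
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def rank_tables_by_lock_wait(samples):
    """samples: iterable of (table_name, lock_wait_ms) pairs.
    Returns [(table_name, p95_lock_wait_ms), ...], worst first."""
    by_table = defaultdict(list)
    for table, wait_ms in samples:
        by_table[table].append(wait_ms)
    ranked = [(table, p95(waits)) for table, waits in by_table.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical samples: (table, lock wait in milliseconds)
samples = [
    ("Orders", 12), ("Orders", 950), ("Orders", 1400), ("Orders", 30),
    ("Restaurants", 4), ("Restaurants", 9), ("Restaurants", 6),
]
ranking = rank_tables_by_lock_wait(samples)
print(ranking)
```

Ranking by a high percentile rather than the average keeps a handful of very long waits from being hidden by many fast queries.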

Step 2: Analyze the Query Patterns

Use EXPLAIN in SQL to review query execution plans.
Identify long-running queries or those with a full table scan.

Example:

EXPLAIN SELECT * FROM Orders WHERE restaurant_id = 123 FOR UPDATE;

Finding: FOR UPDATE takes an exclusive lock on every Orders row that matches restaurant_id = 123, so concurrent transactions touching those rows must wait, causing contention.

Step 3: Immediate Mitigation

Increase Database Capacity: Scale vertically (increase CPU/RAM) or horizontally (add read replicas).
Optimize Indices: Ensure proper indexing for frequently queried columns.

Example: Add an index on restaurant_id to speed up filtering.

CREATE INDEX idx_restaurant_id ON Orders(restaurant_id);
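To see how such an index changes the execution plan, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database. SQLite's EXPLAIN QUERY PLAN output looks different from what MySQL or PostgreSQL print, and the status column and row counts are assumed for illustration, but the shift from a full scan to an index lookup is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER PRIMARY KEY, restaurant_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO Orders (restaurant_id, status) VALUES (?, ?)",
    [(i % 500, "placed") for i in range(5000)],
)

def plan(sql):
    """Return the plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the detail text

query = "SELECT * FROM Orders WHERE restaurant_id = 123"

plan_before = plan(query)  # no usable index yet, so the table is scanned
conn.execute("CREATE INDEX idx_restaurant_id ON Orders(restaurant_id)")
plan_after = plan(query)   # the lookup now goes through idx_restaurant_id

print(plan_before)
print(plan_after)
```

Comparing the plan before and after a change like this is a quick way to confirm that the index is actually picked up by the query it was meant for.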

Step 4: Resolve Root Causes

Refactor Queries:

Use techniques like batch updates or chunked processing.
Convert locking queries to non-blocking reads where possible.
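Example: Use SELECT ... FOR SHARE instead of FOR UPDATE for read-only operations. FOR SHARE takes a shared lock, so readers no longer block each other but still block writers. Where no lock is needed at all, a plain SELECT avoids row locks entirely under the default isolation levels of MVCC databases such as PostgreSQL and MySQL (InnoDB).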
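The chunked processing idea can be sketched as follows, assuming a simplified Orders table with a status column (both hypothetical). SQLite is used only so the example runs anywhere; it locks at the database level rather than per row, but the pattern of many short transactions instead of one long one carries over to row-locking databases.

```python
import sqlite3

def update_in_chunks(conn, order_ids, new_status, chunk_size=100):
    """Update orders in small batches, committing after each batch so that
    locks are held only briefly. Returns the number of batches run."""
    batches = 0
    for start in range(0, len(order_ids), chunk_size):
        chunk = order_ids[start:start + chunk_size]
        placeholders = ",".join("?" for _ in chunk)
        conn.execute(
            f"UPDATE Orders SET status = ? WHERE id IN ({placeholders})",
            [new_status, *chunk],
        )
        conn.commit()  # end the transaction and release its locks
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO Orders (id, status) VALUES (?, ?)",
    [(i, "placed") for i in range(1, 251)],
)
conn.commit()

batches = update_in_chunks(conn, list(range(1, 251)), "archived", chunk_size=100)
remaining = conn.execute(
    "SELECT COUNT(*) FROM Orders WHERE status != 'archived'"
).fetchone()[0]
print(batches, remaining)
```

The chunk size is a tuning knob: smaller chunks release locks sooner but add more round trips, so a reasonable value depends on the workload.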

Apply Database Partitioning:

Partition Orders by restaurant_id to reduce contention scope.

Implement a Queue:

For heavy-write scenarios, introduce a queue like Kafka or RabbitMQ to process updates asynchronously.
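A minimal in-process sketch of this pattern is shown below. Python's queue.Queue stands in for Kafka or RabbitMQ, and the inventory dictionary stands in for the contended database row; both are assumptions made for illustration and not how a production broker would be wired up.

```python
import queue
import threading

inventory = {"burger": 10}  # hypothetical stand-in for the inventory row
updates = queue.Queue()     # stand-in for a Kafka topic or RabbitMQ queue
STOP = object()             # sentinel telling the worker to exit

def worker():
    """Single consumer: applies updates one at a time, so writes to the
    same item never compete with each other."""
    while True:
        msg = updates.get()
        if msg is STOP:
            break
        item, delta = msg
        inventory[item] = inventory.get(item, 0) + delta

def place_order(item, quantity):
    """Request path: enqueue the decrement and return immediately."""
    updates.put((item, -quantity))

consumer = threading.Thread(target=worker)
consumer.start()

# Eight concurrent "customers" each order one burger.
customers = [
    threading.Thread(target=place_order, args=("burger", 1)) for _ in range(8)
]
for t in customers:
    t.start()
for t in customers:
    t.join()

updates.put(STOP)  # enqueued after every order, so all orders are processed first
consumer.join()
print(inventory)
```

The trade-off is that the write is no longer applied by the time the request returns, so checks such as "is the item still in stock" would need to happen in the consumer.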

Step 5: Validate and Monitor

Test the solution in a staging environment.
Set alerts for metrics like query execution time and lock wait time.
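The alerting logic itself can be very small. The sketch below checks the latest metric values against thresholds; the metric names and threshold values are hypothetical, and a real setup would normally express these rules in the monitoring tool rather than in application code.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold_ms: float

def evaluate(rules, latest):
    """Return the names of metrics whose latest value exceeds the threshold.
    Metrics missing from `latest` are skipped rather than treated as breaches."""
    return [
        rule.metric
        for rule in rules
        if rule.metric in latest and latest[rule.metric] > rule.threshold_ms
    ]

# Hypothetical rules and readings, in milliseconds
rules = [
    AlertRule("query_execution_time_p95", 200.0),
    AlertRule("lock_wait_time_p95", 50.0),
]
latest = {"query_execution_time_p95": 180.0, "lock_wait_time_p95": 320.0}

firing = evaluate(rules, latest)
print(firing)
```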

Conclusion:

After partitioning the Orders table by restaurant_id and replacing unnecessary locking reads with non-blocking reads, QuickBite significantly reduced order-placement latency, improving customer satisfaction.

This structured approach ensures effective identification and resolution of database contention issues in real-world applications.
