Troubleshooting and Resolving High Latency Due to Database Contention

Arvind Kumar
Dec 3, 2024


When a customer-facing application experiences high latency, database contention is a common root cause. Here is a step-by-step approach to troubleshooting and resolving it.

Below is the flow of steps that we will explore in this article: identify symptoms and collect metrics, analyze the query patterns, apply immediate mitigation, resolve root causes, then validate and monitor.

Scenario:

Imagine a food delivery app, QuickBite, where customers report delays in placing orders. The root cause is suspected to be contention in the database that stores order and restaurant information.

Step 1: Identify Symptoms and Collect Metrics

Key Areas to Check:

  1. Query Response Times: Use monitoring tools like New Relic, Datadog, or AWS RDS Performance Insights.
  2. Lock Wait Times: Check if transactions are waiting for locks due to concurrent writes.
  3. Deadlocks: Investigate whether transactions are waiting on each other's locks in a cycle, which forces the database to abort one of them.
  4. CPU and IOPS: High CPU or insufficient IOPS can exacerbate contention.

Example:

In QuickBite, queries updating the Orders table show a spike in Lock_wait_time. Multiple users trying to update the same restaurant inventory row cause contention.
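To make the idea of lock wait time concrete, here is a minimal simulation, not a measurement of a real database. A `threading.Lock` stands in for the lock on one inventory row, and four writers try to update it at the same moment. The names `row_lock`, `update_inventory`, and `HOLD_SECONDS` are hypothetical and chosen only for this illustration.

```python
import threading
import time

HOLD_SECONDS = 0.02                    # assumed time each "transaction" holds the row lock
row_lock = threading.Lock()            # stands in for the lock on one inventory row
start_together = threading.Barrier(4)  # makes all writers arrive at once
lock_waits = []                        # seconds each writer spent waiting for the lock


def update_inventory():
    start_together.wait()              # all writers request the row at the same moment
    requested = time.perf_counter()
    with row_lock:                     # only one writer may hold the row at a time
        lock_waits.append(time.perf_counter() - requested)
        time.sleep(HOLD_SECONDS)       # simulated work inside the transaction


writers = [threading.Thread(target=update_inventory) for _ in range(4)]
for w in writers:
    w.start()
for w in writers:
    w.join()

print(f"longest lock wait: {max(lock_waits):.3f}s")
```

The writers are served one after another, so the last one waits roughly as long as the others hold the lock combined. This is the pattern a spike in lock wait time usually points to.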

Step 2: Analyze the Query Patterns

  • Use EXPLAIN in SQL to review query execution plans.
  • Identify long-running queries or those with a full table scan.

Example:

EXPLAIN SELECT * FROM Orders WHERE restaurant_id = 123 FOR UPDATE;

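Finding: The query takes an exclusive lock on every row matching restaurant_id = 123 and holds it until the transaction ends, so concurrent orders for the same restaurant wait on each other.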

Step 3: Immediate Mitigation

  1. Increase Database Capacity: Scale vertically (increase CPU/RAM) or horizontally (add read replicas). Read replicas offload read traffic only; they do not shorten waits on write locks.
  2. Optimize Indices: Ensure proper indexing for frequently queried columns.

Example: Add an index on restaurant_id to speed up filtering.

CREATE INDEX idx_restaurant_id ON Orders(restaurant_id);
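The effect of the index on the execution plan can be checked before and after creating it. The sketch below uses Python's built-in sqlite3 module purely as a stand-in. SQLite has no FOR UPDATE clause and its plan format (EXPLAIN QUERY PLAN) differs from MySQL's or PostgreSQL's EXPLAIN, so treat the output as illustrative only. The `status` column and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER PRIMARY KEY, restaurant_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO Orders (restaurant_id, status) VALUES (?, ?)",
    [(i % 50, "placed") for i in range(1000)],  # made-up sample data
)

QUERY = "SELECT * FROM Orders WHERE restaurant_id = 123"


def plan(connection):
    """Return the human-readable plan steps SQLite reports for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the plan detail text


plan_before = plan(conn)  # expected to report a scan of the whole table
conn.execute("CREATE INDEX idx_restaurant_id ON Orders(restaurant_id)")
plan_after = plan(conn)   # expected to report a search using the new index

print("before:", plan_before)
print("after: ", plan_after)
```

A plan that changes from a table scan to an index search means the query touches far fewer rows, which generally also means fewer rows locked and for less time.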

Step 4: Resolve Root Causes

Refactor Queries:

  • Use techniques like batch updates or chunked processing.
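  • Convert locking queries to non-blocking reads where possible.
    Example: For read-only operations, use a plain SELECT instead of SELECT ... FOR UPDATE, or SELECT ... FOR SHARE if the rows must stay unchanged while you read them. FOR SHARE lets other readers through but still blocks writers.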
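Chunked processing can be sketched as follows. Instead of one large UPDATE that holds its locks until everything is done, the work is split into small transactions that each commit quickly. This is a minimal sketch that uses sqlite3 as a stand-in database. The helper names `chunked` and `mark_delivered`, the `status` column, and the chunk size of 100 are assumptions for illustration.

```python
import sqlite3


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def mark_delivered(conn, order_ids, chunk_size=100):
    """Update orders in small transactions so each one holds locks only briefly."""
    commits = 0
    for chunk in chunked(order_ids, chunk_size):
        placeholders = ",".join("?" for _ in chunk)
        with conn:  # commits (or rolls back) one short transaction per chunk
            conn.execute(
                f"UPDATE Orders SET status = 'delivered' WHERE id IN ({placeholders})",
                chunk,
            )
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO Orders (id, status) VALUES (?, 'placed')",
    [(i,) for i in range(1, 251)],
)
conn.commit()

commits = mark_delivered(conn, list(range(1, 251)), chunk_size=100)
delivered = conn.execute(
    "SELECT COUNT(*) FROM Orders WHERE status = 'delivered'"
).fetchone()[0]
print(commits, delivered)
```

The trade-off is that the batch is no longer atomic as a whole: if a later chunk fails, earlier chunks stay committed, so the job should be safe to re-run.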

Apply Database Partitioning:

  • Partition Orders by restaurant_id to reduce contention scope.
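The routing idea behind partitioning by restaurant_id can be illustrated in a few lines. Real databases handle this declaratively (for example with hash partitioning), so the sketch below only shows how rows for different restaurants end up in separate partitions. The partition count, the `orders_pN` naming scheme, and the sample orders are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count for the illustration


def partition_for(restaurant_id, num_partitions=NUM_PARTITIONS):
    """Map a restaurant_id to a partition name using modulo hashing."""
    return f"orders_p{restaurant_id % num_partitions}"


orders = [
    {"order_id": 1, "restaurant_id": 123},
    {"order_id": 2, "restaurant_id": 124},
    {"order_id": 3, "restaurant_id": 123},
    {"order_id": 4, "restaurant_id": 250},
]

# Group order ids by the partition their restaurant maps to.
partitions = defaultdict(list)
for order in orders:
    partitions[partition_for(order["restaurant_id"])].append(order["order_id"])

print(dict(partitions))
```

All orders for one restaurant land in the same partition, so scans and maintenance work for restaurant 123 no longer touch the rows of restaurants stored elsewhere.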

Implement a Queue:

  • For heavy-write scenarios, introduce a queue like Kafka or RabbitMQ to process updates asynchronously.
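The queue pattern can be sketched in-process. Here Python's `queue.Queue` stands in for a broker such as Kafka or RabbitMQ, and a single worker thread applies inventory updates one at a time. The `inventory` dictionary, the `place_order` function, and the message shape are assumptions for illustration, not a real client API.

```python
import queue
import threading

inventory = {"burger": 10}   # hypothetical stock for one restaurant
updates = queue.Queue()      # in-process stand-in for Kafka or RabbitMQ
STOP = object()              # sentinel that tells the worker to exit


def apply_updates():
    """Single consumer: applies updates one at a time, so writers never collide."""
    while True:
        message = updates.get()
        if message is STOP:
            break
        item, delta = message
        inventory[item] += delta


def place_order(item):
    """Producer: enqueue the change and return without waiting on a row lock."""
    updates.put((item, -1))


worker = threading.Thread(target=apply_updates)
worker.start()

producers = [threading.Thread(target=place_order, args=("burger",)) for _ in range(6)]
for p in producers:
    p.start()
for p in producers:
    p.join()

updates.put(STOP)            # all orders are enqueued; let the worker drain and stop
worker.join()
print(inventory)
```

The design choice here is to trade immediacy for throughput: the customer's request returns as soon as the message is enqueued, and the update becomes visible shortly afterwards, so the application must tolerate that delay.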

Step 5: Validate and Monitor

  • Test the solution in a staging environment.
  • Set alerts for metrics like query execution time and lock wait time.
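An alert on lock wait time usually compares a high percentile against a threshold rather than the average, since a few slow transactions are what customers notice. The following is a minimal sketch of that check. The 200 ms threshold, the choice of the 95th percentile, and the sample values are assumptions; in practice the monitoring tool would supply the samples and evaluate the rule.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def lock_wait_alert(lock_waits_ms, threshold_ms=200, pct=95):
    """Return (should_alert, observed) for the chosen percentile."""
    observed = percentile(lock_waits_ms, pct)
    return observed > threshold_ms, observed


samples_ms = [12, 15, 9, 22, 18, 480, 14, 11, 510, 16]  # made-up lock waits in ms
print(lock_wait_alert(samples_ms))
```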

Conclusion:

After partitioning the Orders table by restaurant_id and replacing unnecessary locking reads with non-blocking reads, QuickBite significantly reduced order-placement latency, improving customer satisfaction.

This structured approach ensures effective identification and resolution of database contention issues in real-world applications.
