Why Saving URLs in Reverse Order Improves Database Performance

When designing databases for web applications, efficient data retrieval is a critical factor. One seemingly minor decision—how URLs are stored—can significantly impact performance. Saving URLs in reverse order, such as com.cnn.subdomain, instead of the traditional subdomain.cnn.com, can improve database performance in certain scenarios. This blog explores why this is the case, demonstrates how to reproduce the results using PostgreSQL, explains how to interpret EXPLAIN output, and summarizes the key takeaways.


About

Websites often store URLs in databases to manage resources such as pages, assets, or API endpoints. These URLs can be queried based on patterns like prefixes or suffixes. For example:

  • Prefix query: Retrieve all URLs under com.cnn.
  • Suffix query: Retrieve all URLs ending with cnn.com.

Databases are optimized to handle prefix-based queries efficiently when indexes are used. However, suffix-based queries often require full table scans, which degrade performance as the dataset grows. By reversing URLs and leveraging indexed prefix queries, we can mitigate these performance challenges.


How to Reproduce on PostgreSQL

1. Set Up the Database

Create a websites table to store URLs:

CREATE TABLE websites (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL
);

2. Populate the Table

Insert sample data to simulate real-world conditions:

DO $$
DECLARE
    prefixes TEXT[] := ARRAY['com.cnn', 'com.google', 'org.wikipedia', 'com.yahoo', 'edu.stanford'];
    suffixes TEXT[] := ARRAY['www', 'api', 'news', 'mail', 'search'];
    i INT;
BEGIN
    FOR i IN 1..100000 LOOP
        INSERT INTO websites (url)
        VALUES (
            prefixes[ceil(random() * array_length(prefixes, 1))] || '.' ||
            suffixes[ceil(random() * array_length(suffixes, 1))] || '.' ||
            ceil(random() * 100000)
        );
    END LOOP;
END $$;

3. Create an Index

Create a B-tree index on the url column to optimize prefix searches:

CREATE INDEX idx_url_col ON websites (url COLLATE "C");

Why Use COLLATE "C"?

The default collation in many PostgreSQL installations can affect index performance for string comparisons. Using COLLATE "C" ensures byte-wise comparison, which is faster and aligns better with the requirements of URL matching patterns.

4. Execute Queries

Run two queries to compare performance:

  • Prefix query:

    EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE 'com.cnn.%';
    
  • Suffix query:

    EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE '%.cnn.com';
    

How to Read EXPLAIN Output

PostgreSQL’s EXPLAIN (ANALYZE) command provides insight into query execution plans. Here’s how to interpret key components:

Example Output for Prefix Query

Bitmap Heap Scan on websites  (cost=219.56..1087.83 rows=14141 width=24) (actual time=1.875..8.080 rows=14215 loops=1)
  Filter: (url ~~ 'com.cnn.%'::text)
  Heap Blocks: exact=695
  ->  Bitmap Index Scan on idx_url_col  (cost=0.00..216.03 rows=13861 width=0) (actual time=1.770..1.770 rows=14215 loops=1)
Planning Time: 0.486 ms
Execution Time: 8.796 ms

Key Insights:

  1. Bitmap Index Scan: The query uses the idx_url_col index to identify matching rows efficiently.
  2. Heap Blocks Accessed: The matching rows are fetched in batches, reducing I/O overhead.
  3. Execution Time: The query completes quickly (8.796 ms) due to index optimization.

Example Output for Suffix Query

Seq Scan on websites  (cost=0.00..1945.00 rows=10 width=24) (actual time=19.182..19.183 rows=0 loops=1)
  Filter: (url ~~ '%.cnn.com'::text)
  Rows Removed by Filter: 100000
Planning Time: 0.350 ms
Execution Time: 19.254 ms

Key Insights:

  1. Sequential Scan: The query scans all rows in the table because the index cannot optimize suffix-based searches.
  2. Rows Removed by Filter: All 100,000 rows are evaluated, but none match the condition.
  3. Execution Time: The query takes significantly longer (19.254 ms).

Why Reverse URLs Improve Performance

By reversing URLs (e.g., com.cnn.www instead of www.cnn.com), we transform suffix queries into prefix queries. This allows the database to utilize indexes effectively for both types of queries.

Here is the comparison of performance:

Aspect Bitmap Heap Scan (LIKE 'com.cnn.%') Seq Scan (LIKE '%.cnn.com')
Index Usage Yes (Index on url) No (Full Table Scan)
Execution Time 8.796 ms 19.254 ms
Rows Scanned 14,215 rows (index lookup) 100,000 rows (all rows scanned)
Rows Filtered Minimal (695 heap blocks) 100,000 rows removed by filter
Query Optimization Efficient (uses index and batches data) Inefficient (scans entire table)

In other words, the query using Bitmap Heap Scan (LIKE 'com.cnn.%') is more performant because it leverages an index to reduce the number of rows scanned, and also minimizes I/O and processing overhead by using batch lookups.


Summary

Storing URLs in reverse order offers significant performance advantages for database queries, particularly for suffix-based searches. By leveraging index optimization for prefix queries, you can avoid costly sequential scans and improve execution times. This blog demonstrated how to reproduce the issue in PostgreSQL, interpret EXPLAIN output, and implement the solution.

Key Takeaways

  • Prefix queries are optimized with indexes, while suffix queries often result in full table scans.
  • Reversing URLs transforms suffix queries into prefix queries, enabling efficient index usage.
  • PostgreSQL’s EXPLAIN (ANALYZE) provides detailed insights into query performance, guiding optimization efforts.

Try this approach in your applications to unlock better database performance and scalability!

2025-01-10