Why Saving URLs in Reverse Order Improves Database Performance
2025-01-10Learn how storing URLs in reverse order (com.example.www instead of www.example.com) significantly improves database query performance by transforming inefficient suffix searches into efficient prefix searches that can leverage database indexes.
When designing databases for web applications, efficient data retrieval is a critical factor. One seemingly minor decision—how URLs are stored—can significantly impact performance. Saving URLs in reverse order, such as com.cnn.subdomain
, instead of the traditional subdomain.cnn.com
, can improve database performance in certain scenarios. This blog explores why this is the case, demonstrates how to reproduce the results using PostgreSQL, explains how to interpret EXPLAIN output, and summarizes the key takeaways.
About
Websites often store URLs in databases to manage resources such as pages, assets, or API endpoints. These URLs can be queried based on patterns like prefixes or suffixes. For example:
- Prefix query: Retrieve all URLs under
com.cnn
. - Suffix query: Retrieve all URLs ending with
cnn.com
.
Databases are optimized to handle prefix-based queries efficiently when indexes are used. However, suffix-based queries often require full table scans, which degrade performance as the dataset grows. By reversing URLs and leveraging indexed prefix queries, we can mitigate these performance challenges.
How to Reproduce on PostgreSQL
1. Set Up the Database
Create a websites
table to store URLs:
CREATE TABLE websites (
id SERIAL PRIMARY KEY,
url TEXT NOT NULL
);
2. Populate the Table
Insert sample data to simulate real-world conditions:
DO $$
DECLARE
prefixes TEXT[] := ARRAY['com.cnn', 'com.google', 'org.wikipedia', 'com.yahoo', 'edu.stanford'];
suffixes TEXT[] := ARRAY['www', 'api', 'news', 'mail', 'search'];
i INT;
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO websites (url)
VALUES (
prefixes[ceil(random() * array_length(prefixes, 1))] || '.' ||
suffixes[ceil(random() * array_length(suffixes, 1))] || '.' ||
ceil(random() * 100000)
);
END LOOP;
END $$;
3. Create an Index
Create a B-tree index on the url
column to optimize prefix searches:
CREATE INDEX idx_url_col ON websites (url COLLATE "C");
Why Use COLLATE "C"
?
The default collation in many PostgreSQL installations can affect index performance for string comparisons. Using COLLATE "C"
ensures byte-wise comparison, which is faster and aligns better with the requirements of URL matching patterns.
4. Execute Queries
Run two queries to compare performance:
-
Prefix query:
EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE 'com.cnn.%';
-
Suffix query:
EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE '%.cnn.com';
How to Read EXPLAIN Output
PostgreSQL’s EXPLAIN (ANALYZE)
command provides insight into query execution plans. Here’s how to interpret key components:
Example Output for Prefix Query
Bitmap Heap Scan on websites (cost=219.56..1087.83 rows=14141 width=24) (actual time=1.875..8.080 rows=14215 loops=1)
Filter: (url ~~ 'com.cnn.%'::text)
Heap Blocks: exact=695
-> Bitmap Index Scan on idx_url_col (cost=0.00..216.03 rows=13861 width=0) (actual time=1.770..1.770 rows=14215 loops=1)
Planning Time: 0.486 ms
Execution Time: 8.796 ms
Key Insights:
- Bitmap Index Scan: The query uses the
idx_url_col
index to identify matching rows efficiently. - Heap Blocks Accessed: The matching rows are fetched in batches, reducing I/O overhead.
- Execution Time: The query completes quickly (8.796 ms) due to index optimization.
Example Output for Suffix Query
Seq Scan on websites (cost=0.00..1945.00 rows=10 width=24) (actual time=19.182..19.183 rows=0 loops=1)
Filter: (url ~~ '%.cnn.com'::text)
Rows Removed by Filter: 100000
Planning Time: 0.350 ms
Execution Time: 19.254 ms
Key Insights:
- Sequential Scan: The query scans all rows in the table because the index cannot optimize suffix-based searches.
- Rows Removed by Filter: All 100,000 rows are evaluated, but none match the condition.
- Execution Time: The query takes significantly longer (19.254 ms).
Why Reverse URLs Improve Performance
By reversing URLs (e.g., com.cnn.www
instead of www.cnn.com
), we transform suffix queries into prefix queries. This allows the database to utilize indexes effectively for both types of queries.
Here is the comparison of performance:
Aspect | Bitmap Heap Scan (LIKE 'com.cnn.%') | Seq Scan (LIKE '%.cnn.com') |
---|---|---|
Index | Usage Yes (Index on url) | No (Full Table Scan) |
Execution Time | 8.796 ms | 19.254 ms |
Rows Scanned | 14,215 rows (index lookup) | 100,000 rows (all rows scanned) |
Rows Filtered | Minimal (695 heap blocks) | 100,000 rows removed by filter |
Query Optimization | Efficient (uses index and batches data) | Inefficient (scans entire table) |
In other words, the query using Bitmap Heap Scan (LIKE 'com.cnn.%'
) is more performant because it leverages an index to reduce the number of rows scanned, and also minimizes I/O and processing overhead by using batch lookups.
Summary
Storing URLs in reverse order offers significant performance advantages for database queries, particularly for suffix-based searches. By leveraging index optimization for prefix queries, you can avoid costly sequential scans and improve execution times. This blog demonstrated how to reproduce the issue in PostgreSQL, interpret EXPLAIN
output, and implement the solution.
Key Takeaways
- Prefix queries are optimized with indexes, while suffix queries often result in full table scans.
- Reversing URLs transforms suffix queries into prefix queries, enabling efficient index usage.
- PostgreSQL’s
EXPLAIN (ANALYZE)
provides detailed insights into query performance, guiding optimization efforts.
Try this approach in your applications to unlock better database performance and scalability!