When designing databases for web applications, efficient data retrieval is a critical factor. One seemingly minor decision—how URLs are stored—can significantly impact performance. Saving URLs in reverse order, such as com.cnn.subdomain
, instead of the traditional subdomain.cnn.com
, can improve database performance in certain scenarios. This blog explores why this is the case, demonstrates how to reproduce the results using PostgreSQL, explains how to interpret EXPLAIN output, and summarizes the key takeaways.
Websites often store URLs in databases to manage resources such as pages, assets, or API endpoints. These URLs can be queried based on patterns like prefixes or suffixes. For example:
com.cnn
.cnn.com
.Databases are optimized to handle prefix-based queries efficiently when indexes are used. However, suffix-based queries often require full table scans, which degrade performance as the dataset grows. By reversing URLs and leveraging indexed prefix queries, we can mitigate these performance challenges.
Create a websites
table to store URLs:
CREATE TABLE websites (
id SERIAL PRIMARY KEY,
url TEXT NOT NULL
);
Insert sample data to simulate real-world conditions:
DO $$
DECLARE
prefixes TEXT[] := ARRAY['com.cnn', 'com.google', 'org.wikipedia', 'com.yahoo', 'edu.stanford'];
suffixes TEXT[] := ARRAY['www', 'api', 'news', 'mail', 'search'];
i INT;
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO websites (url)
VALUES (
prefixes[ceil(random() * array_length(prefixes, 1))] || '.' ||
suffixes[ceil(random() * array_length(suffixes, 1))] || '.' ||
ceil(random() * 100000)
);
END LOOP;
END $$;
Create a B-tree index on the url
column to optimize prefix searches:
CREATE INDEX idx_url_col ON websites (url COLLATE "C");
COLLATE "C"
?The default collation in many PostgreSQL installations can affect index performance for string comparisons. Using COLLATE "C"
ensures byte-wise comparison, which is faster and aligns better with the requirements of URL matching patterns.
Run two queries to compare performance:
Prefix query:
EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE 'com.cnn.%';
Suffix query:
EXPLAIN (ANALYZE) SELECT * FROM websites WHERE url LIKE '%.cnn.com';
PostgreSQL’s EXPLAIN (ANALYZE)
command provides insight into query execution plans. Here’s how to interpret key components:
Bitmap Heap Scan on websites (cost=219.56..1087.83 rows=14141 width=24) (actual time=1.875..8.080 rows=14215 loops=1)
Filter: (url ~~ 'com.cnn.%'::text)
Heap Blocks: exact=695
-> Bitmap Index Scan on idx_url_col (cost=0.00..216.03 rows=13861 width=0) (actual time=1.770..1.770 rows=14215 loops=1)
Planning Time: 0.486 ms
Execution Time: 8.796 ms
idx_url_col
index to identify matching rows efficiently.Seq Scan on websites (cost=0.00..1945.00 rows=10 width=24) (actual time=19.182..19.183 rows=0 loops=1)
Filter: (url ~~ '%.cnn.com'::text)
Rows Removed by Filter: 100000
Planning Time: 0.350 ms
Execution Time: 19.254 ms
By reversing URLs (e.g., com.cnn.www
instead of www.cnn.com
), we transform suffix queries into prefix queries. This allows the database to utilize indexes effectively for both types of queries.
Here is the comparison of performance:
Aspect | Bitmap Heap Scan (LIKE 'com.cnn.%') | Seq Scan (LIKE '%.cnn.com') |
---|---|---|
Index | Usage Yes (Index on url) | No (Full Table Scan) |
Execution Time | 8.796 ms | 19.254 ms |
Rows Scanned | 14,215 rows (index lookup) | 100,000 rows (all rows scanned) |
Rows Filtered | Minimal (695 heap blocks) | 100,000 rows removed by filter |
Query Optimization | Efficient (uses index and batches data) | Inefficient (scans entire table) |
In other words, the query using Bitmap Heap Scan (LIKE 'com.cnn.%'
) is more performant because it leverages an index to reduce the number of rows scanned, and also minimizes I/O and processing overhead by using batch lookups.
Storing URLs in reverse order offers significant performance advantages for database queries, particularly for suffix-based searches. By leveraging index optimization for prefix queries, you can avoid costly sequential scans and improve execution times. This blog demonstrated how to reproduce the issue in PostgreSQL, interpret EXPLAIN
output, and implement the solution.
EXPLAIN (ANALYZE)
provides detailed insights into query performance, guiding optimization efforts.Try this approach in your applications to unlock better database performance and scalability!