How to Interpret EXPLAIN Output from Bloom Filter Index on PostgreSQL

PostgreSQL is known for its robust indexing options, including the Bloom filter index (bloom), which can improve query performance for specific workloads. This blog will discuss how to read the EXPLAIN output, why rechecking is needed, and when to choose a Bloom filter index over a B-tree index.

A Bloom filter is a space-efficient, probabilistic data structure designed to test whether an element is part of a set. It’s highly efficient in terms of memory but allows for a small probability of false positives (i.e., reporting that an element exists when it doesn’t).

I'm going to skip explaining Bloom filter in details, as there are many other good online resources explaining how Bloom filter works with nice diagrams or clean reference implementations.

TLDR, if you can accept lossy predictions by having smaller index sizes, you might want to consider Bloom filter index over creating B-tree indexes for every columns in conditions.

How to Interpret EXPLAIN Output

Let’s analyze a sample EXPLAIN output:

Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)
  Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
  Rows Removed by Index Recheck: 29
  Heap Blocks: exact=28
  ->  Bitmap Index Scan on bloomidx  (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)
        Index Cond: ((i2 = 898732) AND (i5 = 123451))

Key Terms in EXPLAIN

Read my another blog if you want to know more about Bitmap Index Scan and Bitmap Heap Scan.

Bitmap Heap Scan:
- The actual time shows the time taken for this step, with identical start and end times (0.388..0.388), meaning the operation completed in a single step.
Recheck Cond:
- The Recheck Cond specifies conditions applied to verify the accuracy of rows fetched by the index.
Rows Removed by Index Recheck:
- Indicates how many rows retrieved by the Bloom filter index were false positives and were removed after rechecking. In this case, 29 rows were false positives.
Heap Blocks:
- exact=28 means 28 blocks of the table were accessed to fetch the required rows.
Bitmap Index Scan:
- This shows that the Bloom filter index was used to retrieve 29 candidate rows matching the index condition.

Please note that Bloom filters are probabilistic and may return false positives. Rechecking ensures accuracy. The index returns a superset of rows that might match the query, so PostgreSQL must verify each row to confirm whether it meets the conditions. For example, in the output above, 29 rows were identified as candidates but were later discarded. In other words, the Bloom filter reduces the I/O cost of scanning large datasets but introduces a small CPU cost for rechecking.

When to Choose Bloom Filter Index Over B-tree Index?

In practical world, there are some cases you might want to choose

Advantages of Bloom Filter Index

Multi-Column Queries:
- Bloom filters are ideal for queries involving multiple equality conditions on columns.
Space Efficiency:
- They are compact and consume less disk space than traditional indexes.
High Cardinality Columns:
- Efficient for datasets with many unique values per column.

Limitations of Bloom Filter Index

False Positives:
- Rechecking adds overhead, especially for queries with low selectivity.
Read-Intensive Use Cases Only:
- Bloom filters are static and cannot handle updates efficiently.
- They are not suitable for tables with frequent INSERT, UPDATE, or DELETE operations.

B-Tree vs. Bloom Filter

Feature	B-tree Index	Bloom Filter Index
Precision	Exact	Approximate (allows false positives)
Space Usage	Larger	Compact
Updates	Efficient	Inefficient (static)
Query Types	Range and equality conditions	Equality conditions only

Choosing the Right Index

Use B-tree indexes for:
- Frequently updated tables.
- Range queries or precise lookups.
Use Bloom filter indexes for:
- Read-heavy workloads.
- Multi-column equality conditions on static data.

Summary

Bloom filter indexes in PostgreSQL provide a space-efficient solution for read-heavy workloads involving multi-column equality queries. Understanding their behavior, as seen in EXPLAIN output, is essential to leverage their strengths and mitigate their limitations. The trade-offs between precision and efficiency make them ideal for certain scenarios but unsuitable for others.

When designing indexes, always consider the query patterns, update frequency, and data characteristics to choose the best indexing strategy. By doing so, you can ensure optimal database performance and resource utilization.

2025-01-13

Ken W.

Programmer. Generalist.