Recently, while exploring RocksDB, I became fascinated by its Log-Structured Merge Tree (LSM-Tree) architecture. Unlike traditional B+ Tree-based databases like MySQL’s InnoDB or PostgreSQL, LSM-Trees take a completely different approach to handling writes and reads. This raised a fundamental question for me: Why was LSM-Tree invented in the first place, and when should we choose it over a B+ Tree?
To answer this, I decided to dive deep into how LSM-Trees work, their core components (MemTable, SSTable, Compaction), and their impact on performance. This blog post explores these details and compares them with B+ Trees to understand which data structure is best suited for different workloads.
LSM-Tree is a data structure optimized for write-heavy workloads. Unlike B+ Trees, which update records in-place, LSM-Trees follow an append-only design, meaning that writes are always sequential and never overwrite existing data immediately.
LSM-Trees were designed to solve a fundamental problem with B+ Trees: random writes. In traditional databases, frequent updates cause page splits and disk I/O overhead, slowing down performance. LSM-Trees solve this by buffering writes in memory and persisting them in an efficient, append-only manner.
There are two key components of LSM-Trees. The first player is MemTable. You can say that MemTable is an in-memory write buffer. The basic idea here is for incoming writes first go to the MemTable (often implemented as a skip list or balanced tree), because writes can be fast in theory as they are sequential and avoiding expensive disk I/O.
Once the MemTable fills up, it is flushed to disk as an immutable SSTable (Sorted-String Table), which is the second key components of LSM-Trees. SSTables are append-only, meaning they are never modified once written. Yes, that's immutable. Multiple SSTables exist at different levels with different sizes.
How can we make sure to delete obsolete data from that immutable SSTables? This is where compaction comes into play. Compaction merges smaller SSTables into larger ones, keeping data sorted and reducing storage overhead. This process ensures reads don’t have to scan too many SSTables. (I wonder how this is different from VACUUM-ing in PostgreSQL or MVCC-based DBMS. Structurally, they look similar to me).
Here is a basic comparison between LSM-Tree and B+ Tree:
Feature | LSM-Tree | B+ Tree |
---|---|---|
Write Performance | Fast (sequential writes to MemTable, then flushed to SSTables) | Slower (random writes due to in-place updates) |
Read Performance | Slower (may require searching multiple SSTables) | Faster (single lookup in index structure) |
Storage Efficiency | Higher (compaction reduces fragmentation, compression algorithm) | Lower (fragmentation from page splits, but in practical many DBMS implements optimisation in page splitting/merging and compression) |
Deletes/Updates | Marked as tombstones, cleaned up later | Immediate updates in place |
Use Case | Write-heavy, time-series, logs, analytics | Read-heavy, OLTP transactions |
LSM-Trees don't come without a cost. As compaction is essential to merge SSTables, reclaim space, and prevent query slowdowns, you need to take care of compaction processes as they are resource-intensive consuming CPU and disk I/O.
Better write performance is a trade-off between potentially degraded read performance in some cases, too. Since data is stored across multiple SSTables, a query may need to scan multiple files to locate a value. This increases read amplification, making LSM-Trees less suitable for workloads requiring fast point lookups.
Lastly, operating LSM-tree based DBMS required a different mindset and knowledge from operating B+ tree based DBMS. You need to keep learning like the better tuning strategies for compaction, for example.
When to Choose LSM-Tree Over B+ Tree? This is a very basic checklist (you might need to adjust it to your own requirements, but this list can be a starting point):
Choose LSM-Tree if:
✅ Your workload is write-heavy (e.g., logging, event streaming, analytics).
✅ You need high compression and efficient storage utilization.
✅ Reads involve range scans rather than point lookups.
Choose B+ Tree if:
✅ Your workload is read-heavy and requires low-latency queries.
✅ You need fast point lookups, such as retrieving a user profile by ID.
✅ You have transactional workloads (OLTP) with frequent small updates.
LSM-Trees and B+ Trees each excel in different scenarios. LSM-Trees prioritize fast writes and efficient storage, making them ideal for log-structured workloads. B+ Trees offer fast lookups and consistent read performance, making them better suited for transactional databases.
If you’re building a write-heavy system like RocksDB, Cassandra, or LevelDB, LSM-Trees provide a powerful alternative to traditional databases. However, if your workload involves frequent point lookups or transactional consistency, a B+ Tree-based database like MySQL or PostgreSQL is a better fit.