Skip to main content
Back to all posts
DatabasesData StructuresStorageBackend

LSM Trees: The Write-Optimized Data Structure Behind Modern Databases

A deep dive into Log-Structured Merge-Trees — how they work, why they outperform B-Trees for writes, and why RocksDB, Cassandra, and LevelDB rely on them.

Key concepts, architecture diagram, and interactive demo

A Log-Structured Merge-Tree (LSM Tree) converts every write into a sequential append — no read-modify-write cycle, no random I/O. Writes land in an in-memory MemTable, get flushed to immutable on-disk files called SSTables, and are merged in the background by a process called compaction. Bloom filters let reads skip irrelevant SSTables without any disk I/O, keeping latency manageable despite multi-level storage. The tradeoff: reads are slower than B-Trees, but writes are dramatically faster — which is why LSM Trees power Cassandra, RocksDB, LevelDB, HBase, and ClickHouse.

LSM TreeB-Tree
Write I/OSequential append (fast, even on HDD)Random read-modify-write (slow for small records)
Read latencyMulti-level search, mitigated by Bloom filtersO(log n) with low constant — 3–4 levels for billions of keys
Write amplificationModerate — 14x (RocksDB, 128B records)High — 64x (WiredTiger, 128B records)
Space overhead1.1x leveled / up to 2x size-tiered~1.5x (pages average 2/3 full)
Best forLogs, metrics, IoT, event streams, time-seriesOLTP, point queries, mixed read-write
B-Trees Have Adapted Too
Modern B-Tree databases mitigate write amplification through write-ahead logs, write buffering, and architectural variants like fractal tree indexes (opens in new window) and Microsoft's Bw-Tree (opens in new window) (used in Azure Cosmos DB) that avoid in-place writes entirely. A dedicated post on B-Trees will cover these in depth.

See It In Action

Try the simulator below. Write some key-value pairs, watch the WAL and MemTable fill up, then trigger a flush by filling the MemTable to capacity. Use Read to trace the search path across levels, and Delete to write a tombstone and see how deletions work.

LSM Tree Simulator
Enter a key and value, then click Write to see the LSM Tree in action.
Memory (RAM)
WALwrite-ahead log
empty
MemTable0/6 sorted
empty
Disk (SSTables)
Level 0
overlapping · 0/4
no SSTables yet
Level 1non-overlapping key ranges
fill L0 with 4 SSTables to trigger compaction here
WAL — durability logMemTable — sorted buffer (max 6)L0 — flushed SSTables (max 4, may overlap)L1 — compacted, non-overlappingClick SSTable cards to expand entries
Interactive LSM Tree — write key-value pairs and watch data flow through WAL → MemTable → SSTables → Compaction

Anatomy of an LSM Tree

An LSM Tree is not a single data structure — it is a system of components that work together across memory and disk. Understanding each component is essential before we trace the write and read paths.

Loading diagram...

The Three Amplification Factors

Every storage engine makes tradeoffs across three fundamental costs. Understanding these is key to choosing the right engine for your workload.

Amplification FactorDefinitionLSM TreeB-Tree
Write AmplificationRatio of total bytes written to disk vs. bytes written by the applicationModerate (compaction rewrites data, but all I/O is sequential)High for small records (read-modify-write on full pages, random I/O)
Read AmplificationNumber of disk reads required to serve a single point queryHigher (must check multiple levels; mitigated by Bloom filters)Lower (B-Tree height is typically 3-4 levels for billions of keys)
Space AmplificationRatio of total disk space used vs. actual live data sizeVaries by strategy: ~1.1x (leveled) to ~2x (size-tiered)~1.5x (pages are typically 2/3 full on average due to splits)
No Free Lunch
It is impossible to optimize all three amplification factors simultaneously. This is sometimes called the RUM Conjecture (opens in new window) (Read, Update, Memory): you can optimize for at most two of the three. LSM Trees optimize for writes (updates) at the cost of reads. B-Trees optimize for reads at the cost of writes.
The NVMe Factor
These tradeoffs were characterized for spinning disks, where sequential I/O is 100–1,000x faster than random I/O. Modern NVMe SSDs dramatically narrow this gap — random 4 KB writes can nearly saturate NVMe bandwidth, weakening the primary justification for LSM Tree adoption. A 2022 USENIX study (opens in new window) found that B-Trees modified with in-storage compression achieved write amplification of ~28 versus RocksDB's ~38, with 19% higher write throughput on NVMe. On modern hardware, workload specifics — record size, cache-to-dataset ratio, access skew — often matter more than the B-Tree vs. LSM Tree architectural choice.

Real-World Systems Using LSM Trees

The LSM Tree is not an academic curiosity — it is the storage engine behind some of the most heavily used databases in production today.

SystemOriginLSM Tree RoleNotable Users
LevelDBGoogle (2011)Embedded key-value store; the reference implementation of leveled compactionChrome, Bitcoin Core
RocksDBFacebook/Meta (2012)Fork of LevelDB, heavily optimized for SSDs with multi-threaded compaction, column families, and tunable compactionMeta, Netflix, Uber, LinkedIn
Apache CassandraFacebook/Apache (2008)Distributed wide-column store using LSM Trees for local storage with size-tiered compactionNetflix, Apple, Discord
Apache HBaseHadoop ecosystemDistributed wide-column store modeled after Google Bigtable, using LSM Trees per regionFacebook Messenger, Yahoo
ClickHouseYandex (2016)Column-oriented OLAP database using a MergeTree engine inspired by LSM principles for batch insertsCloudflare, eBay, Uber
CockroachDBCockroach Labs (2014)Distributed SQL database using RocksDB (now Pebble) as its storage engineDoorDash, Netflix
ScyllaDBScyllaDB (2015)C++ reimplementation of Cassandra's architecture with per-core LSM TreesDiscord, Comcast, Samsung

When to Use LSM Trees vs. B-Trees

The choice between LSM Trees and B-Trees is not about which is "better" — it is about matching the storage engine to your workload characteristics.

Loading diagram...

Comments

0/2000