LSM Trees: The Write-Optimized Data Structure Behind Modern Databases

Key concepts, architecture diagram, and interactive demo

A Log-Structured Merge-Tree (LSM Tree) converts every write into a sequential append — no read-modify-write cycle, no random I/O. Writes land in an in-memory MemTable, get flushed to immutable on-disk files called SSTables, and are merged in the background by a process called compaction. Bloom filters let reads skip irrelevant SSTables without any disk I/O, keeping latency manageable despite multi-level storage. The tradeoff: reads are slower than B-Trees, but writes are dramatically faster — which is why LSM Trees power Cassandra, RocksDB, LevelDB, HBase, and ClickHouse.

	LSM Tree	B-Tree
Write I/O	Sequential append (fast, even on HDD)	Random read-modify-write (slow for small records)
Read latency	Multi-level search, mitigated by Bloom filters	O(log n) with low constant — 3–4 levels for billions of keys
Write amplification	Moderate — 14x (RocksDB, 128B records)	High — 64x (WiredTiger, 128B records)
Space overhead	1.1x leveled / up to 2x size-tiered	~1.5x (pages average 2/3 full)
Best for	Logs, metrics, IoT, event streams, time-series	OLTP, point queries, mixed read-write

B-Trees Have Adapted Too

Modern B-Tree databases mitigate write amplification through write-ahead logs, write buffering, and architectural variants like fractal tree indexes and Microsoft's Bw-Tree (used in Azure Cosmos DB) that avoid in-place writes entirely. A dedicated post on B-Trees will cover these in depth.

See It In Action

Try the simulator below. Write some key-value pairs, watch the WAL and MemTable fill up, then trigger a flush by filling the MemTable to capacity. Use Read to trace the search path across levels, and Delete to write a tombstone and see how deletions work.

LSM Tree Simulator

○Enter a key and value, then click Write to see the LSM Tree in action.

Memory (RAM)

WALwrite-ahead log

empty

MemTable0/6 sorted

empty

Disk (SSTables)

Level 0

overlapping · 0/4

no SSTables yet

Level 1non-overlapping key ranges

fill L0 with 4 SSTables to trigger compaction here

WAL — durability logMemTable — sorted buffer (max 6)L0 — flushed SSTables (max 4, may overlap)L1 — compacted, non-overlappingClick SSTable cards to expand entries

Interactive LSM Tree — write key-value pairs and watch data flow through WAL → MemTable → SSTables → Compaction

Anatomy of an LSM Tree

An LSM Tree is not a single data structure — it is a system of components that work together across memory and disk. Understanding each component is essential before we trace the write and read paths.

Loading diagram...

The Three Amplification Factors

Every storage engine makes tradeoffs across three fundamental costs. Understanding these is key to choosing the right engine for your workload.

Amplification Factor	Definition	LSM Tree	B-Tree
Write Amplification	Ratio of total bytes written to disk vs. bytes written by the application	Moderate (compaction rewrites data, but all I/O is sequential)	High for small records (read-modify-write on full pages, random I/O)
Read Amplification	Number of disk reads required to serve a single point query	Higher (must check multiple levels; mitigated by Bloom filters)	Lower (B-Tree height is typically 3-4 levels for billions of keys)
Space Amplification	Ratio of total disk space used vs. actual live data size	Varies by strategy: ~1.1x (leveled) to ~2x (size-tiered)	~1.5x (pages are typically 2/3 full on average due to splits)

No Free Lunch

It is impossible to optimize all three amplification factors simultaneously. This is sometimes called the RUM Conjecture (Read, Update, Memory): you can optimize for at most two of the three. LSM Trees optimize for writes (updates) at the cost of reads. B-Trees optimize for reads at the cost of writes.

The NVMe Factor

These tradeoffs were characterized for spinning disks, where sequential I/O is 100–1,000x faster than random I/O. Modern NVMe SSDs dramatically narrow this gap — random 4 KB writes can nearly saturate NVMe bandwidth, weakening the primary justification for LSM Tree adoption. A 2022 USENIX study found that B-Trees modified with in-storage compression achieved write amplification of ~28 versus RocksDB's ~38, with 19% higher write throughput on NVMe. On modern hardware, workload specifics — record size, cache-to-dataset ratio, access skew — often matter more than the B-Tree vs. LSM Tree architectural choice.

Real-World Systems Using LSM Trees

The LSM Tree is not an academic curiosity — it is the storage engine behind some of the most heavily used databases in production today.

System	Origin	LSM Tree Role	Notable Users
LevelDB	Google (2011)	Embedded key-value store; the reference implementation of leveled compaction	Chrome, Bitcoin Core
RocksDB	Facebook/Meta (2012)	Fork of LevelDB, heavily optimized for SSDs with multi-threaded compaction, column families, and tunable compaction	Meta, Netflix, Uber, LinkedIn
Apache Cassandra	Facebook/Apache (2008)	Distributed wide-column store using LSM Trees for local storage with size-tiered compaction	Netflix, Apple, Discord
Apache HBase	Hadoop ecosystem	Distributed wide-column store modeled after Google Bigtable, using LSM Trees per region	Facebook Messenger, Yahoo
ClickHouse	Yandex (2016)	Column-oriented OLAP database using a MergeTree engine inspired by LSM principles for batch inserts	Cloudflare, eBay, Uber
CockroachDB	Cockroach Labs (2014)	Distributed SQL database using RocksDB (now Pebble) as its storage engine	DoorDash, Netflix
ScyllaDB	ScyllaDB (2015)	C++ reimplementation of Cassandra's architecture with per-core LSM Trees	Discord, Comcast, Samsung

When to Use LSM Trees vs. B-Trees

The choice between LSM Trees and B-Trees is not about which is "better" — it is about matching the storage engine to your workload characteristics.