Database Indexes Explained: B-Tree, B+Tree, Clustered, and LSM

The instinct to "add an index" when a query gets slow is correct, but the explanation usually stops at "indexes make queries fast." That hides the structural choices that decide which workloads benefit and which don't. The difference between a B-Tree and a B+Tree is one design decision — about whether leaf nodes link to each other — and that one decision is why range queries on a B+Tree cost almost nothing extra, while the same query on a vanilla B-Tree might require repeatedly climbing back to the root.

This reference covers four index families engineers encounter daily: B-Tree (the textbook foundation), B+Tree (what every major RDBMS actually ships), clustered indexes (where the data and the index are the same structure), and LSM trees (the write-optimized alternative behind RocksDB, Cassandra, ClickHouse). Each section leads with the design decision, not the history.

B-Tree vs B+Tree: the one detail

A B-Tree is a self-balancing search tree where each node holds multiple keys and child pointers. Every node — internal or leaf — stores both keys and associated data values. That last bit is the trap.

A B+Tree changes one thing: internal nodes hold only routing keys; all data lives exclusively in leaf nodes, and the leaf nodes are linked together in a doubly-linked list. That linked-leaf structure is the whole game for range queries. A query like WHERE user_id BETWEEN 1000 AND 2000 on a B-Tree finds 1000 in a leaf, returns it, then climbs back to an internal ancestor to find the next qualifying key. On a B+Tree, the engine finds 1000 in a leaf and walks the leaf-level linked list until it hits 2001. Sequential page reads, cache-friendly, no re-traversal.

Because internal B+Tree nodes carry no data, they fit more keys per page. That increases fanout — the number of children each internal node points to — which keeps the tree shallow.

Fanout, depth, and disk pages

Tree depth determines how many disk seeks a cold point lookup costs. Each level traversed is potentially a random I/O. Keeping depth low isn't aesthetic; it's the difference between 3 and 8 disk seeks per query.

Fanout depends on page size and key size. Postgres uses an 8 KB default block size (per the Postgres docs on Database Page Layout). After the page header (~24 bytes) and per-entry overhead (item identifier + tuple header), an internal B+Tree page storing 4-byte integer keys with 6-byte tuple pointers fits roughly 800 entries per page in practice. That figure varies with fill factor, but it's the right ballpark for a 4-byte key.

Depth at fanout 800: log₈₀₀(100,000,000) ≈ 3 levels (root + 2 internal + leaf). That means 3–4 page reads per cold point lookup, and often just one once the upper levels are cached. Compare to a hypothetical fanout of 100 → log₁₀₀(100M) = 4 — one extra level. On a cold database with 10 ms seek time, that extra level adds 10 ms per query.

Postgres's default fill factor for indexes is 90, leaving 10% of each page free for inserts to land without splitting. Heavy insert workloads gradually fragment the tree, raising effective depth. REINDEX CONCURRENTLY is the repair operation; tracking pg_relation_size('myindex') against table size is the preventive measure.

Clustered vs non-clustered indexes

A clustered index is one where the leaf nodes are the table's data pages. There is no pointer to a separate heap; the data is physically ordered by the clustered key. MySQL InnoDB clusters every table by its primary key by default — see the MySQL 8.0 InnoDB Index Types reference. SQL Server uses clustered indexes extensively. Oracle has Index-Organized Tables (IOT) for the same effect.

Postgres is the notable exception: it uses heap-organized storage by default. Its CLUSTER command physically reorders the heap once but does not maintain that order on future writes — the next sequence of inserts and updates will scatter rows across the heap again.

Three rules follow directly from the definition:

A table can have exactly one clustered index. The data can be physically sorted in only one order. A second clustered index would require duplicating the entire table.
Choose a narrow, unique, monotonically increasing clustered key — typically a surrogate integer primary key. UUID v4 primary keys cause page splits on nearly every insert because UUIDs are random; UUID v7 (sortable, RFC 9562) sidesteps the problem while preserving uniqueness.
Range queries on the clustered key are extremely fast — qualifying rows are physically adjacent on disk.

A non-clustered index stores keys in B+Tree leaf nodes alongside pointers (row IDs, primary key values, or heap pointers depending on the engine). A lookup always requires two steps: find the leaf containing the key, then follow the pointer to fetch the row from the heap. Query plans call this a "key lookup" or "bookmark lookup."

Postgres's HOT (Heap-Only Tuple) update mechanism mitigates this for the common case: when an update doesn't change any indexed column, Postgres chains the new tuple to the old heap slot without rewriting any index entries. When any indexed column changes, every relevant index gets a new entry and the old one becomes a dead tuple. VACUUM ANALYZE marks dead tuples reclaimable; VACUUM FULL rewrites the relation but takes a table lock. A practical signal worth knowing: when the index size exceeds the table size, bloat has accumulated and it's time to reindex.

LSM trees: the write-optimized cousin

B+Trees optimize for reads: low depth, sequential leaf traversal, good cache behavior on hot pages. Writes are also reasonable — until you factor in page splits and the cost of maintaining sorted order on disk for every insert.

LSM trees flip the priority. Writes go into an in-memory MemTable (sorted skip list or red-black tree). When the MemTable hits a size threshold, it flushes to disk as an immutable sorted file (SSTable). Reads check the MemTable first, then SSTable levels in order. Periodic compaction merges lower levels into larger, deduplicated files at higher levels.

The tradeoff is write amplification: a single logical write can trigger compaction work that rewrites the same data across level boundaries multiple times. Typical write amplification in LevelDB and RocksDB is 10–30×, depending on level multiplier and write rate, per the FAST '19 LSM-tree survey paper. SSDs handle this fine because the writes are sequential, but NAND wear accumulates and matters for SSD lifetime in write-heavy deployments.

One subtle read penalty: a point lookup for a key that doesn't exist must check every SSTable level before confirming absence. Bloom filters — probabilistic structures that answer "definitely not present" in O(1) — are the standard mitigation. RocksDB enables them by default.

When LSM wins: write-heavy workloads (event logs, time series, CDC pipelines), key spaces with mostly sequential inserts, workloads that tolerate slightly higher read latency. When B+Tree wins: read-heavy OLTP, workloads needing strong point-lookup latency guarantees, range queries on secondary indexes, cases where operational simplicity matters. Cassandra, ClickHouse, RocksDB, LevelDB, and HBase use LSM variants. Postgres and MySQL InnoDB use B+Trees.

FAQ

Does column order in a composite index matter? Yes, decisively. A composite index on (last_name, first_name) is usable for queries filtering on last_name alone or both columns, but not for queries filtering on first_name alone. The leftmost-prefix rule applies in MySQL, Postgres, and almost every SQL engine.

Why a narrow integer for the InnoDB clustered key, not a UUID? InnoDB stores all secondary indexes with the clustered key value as the row pointer. A 16-byte UUID in every secondary index entry vs a 4-byte int adds up fast. More importantly, random UUIDs cause page splits on most inserts once the table is large, fragmenting the clustered index and bloating it. UUID v7 (or any sortable variant) sidesteps the random-insert problem.

When does the planner skip an index? Several scenarios — none of them bugs:

The predicate matches a large fraction of rows (typically ~5–10% selectivity, depending on cost estimates). A sequential scan can outperform many random index lookups.
Statistics are stale. Run ANALYZE; check EXPLAIN ANALYZE for estimated vs actual row counts.
A function wraps the indexed column: WHERE lower(email) = '...' can't use an index on email. The fix is an expression index: CREATE INDEX ON users (lower(email)).
Implicit type casts on filter predicates can prevent index use in some engines.

Does adding more indexes always help? No. Every index on a table adds overhead to every write. A table with 15 indexes and moderate write throughput can spend more time maintaining indexes than writing data. Postgres tracks usage in pg_stat_user_indexes; idx_scan = 0 after weeks of production traffic means the index is dead weight.

Limitation worth saying out loud

This post covers the four major index families and their canonical questions. Covering indexes (and Postgres 11+ INCLUDE columns), partial indexes with a WHERE clause, expression indexes, and the Postgres-specific GIN/GiST/BRIN/SP-GiST index types each deserve their own treatment — they aren't footnotes, and squeezing them here would produce thin summaries.

For related reading, see where to actually store JWTs for another decision-tree-shaped tradeoff, YAML parsing traps for a similar "the spec hides a footgun" pattern, and the JSON/YAML converter tool for quick mechanical conversion while working through schema decisions.