Advanced12 min readlive prototype

B-Tree

A wide, shallow tree where each node holds many keys. Built so one disk read fetches a whole node — the index structure under most databases and filesystems.

Overview

What this concept solves

A B-tree is a balanced multi-way search tree designed around one constraint: one node must fit in one disk block. On a spinning hard drive or an SSD, the cost of seeking to a page and reading it into RAM dwarfs arithmetic by a factor of 10,000 or more. Binary search trees minimize comparisons — but comparisons are cheap. What's expensive is the number of pages you must fetch. A node with one key forces you to visit O(log₂ n) pages. A node with 200 keys keeps the same sorted order but collapses the tree to O(log₂₀₀ n) pages — fewer by a factor of 7–8 at database scale.

The mechanics flow from a single parameter: *order m**, also called the branching factor. A node can hold at most m − 1 keys and at most m children. Every internal node (except the root) must hold at least ⌈m/2⌉ − 1 keys — this lower bound is what keeps the tree balanced after deletions, preventing any branch from becoming sparse. All leaves sit at the same depth. The result is a wide, shallow* tree: height grows as log_m(n), so a real database index with m ≈ 200 reaches every one of a billion records in four or five page reads.

Rudolf Bayer and Edward McCreight invented the B-tree at Boeing Research Labs in 1970 (published 1972) specifically to manage large sorted files on block-addressed storage. The core insight has never become obsolete. Every major relational database — PostgreSQL, MySQL, Oracle, SQL Server, SQLite — uses B-trees (or the leaf-linked variant, B+ trees) as the default index structure. Filesystems (NTFS, HFS+, ext4's htree) use them for directory lookups. The data structure is now over 50 years old and remains the dominant on-disk search structure because the disk-read bottleneck it was designed to solve is as real today as it was in 1970.

Mechanics

How it works

Searching a B-tree

  1. Start at the root node. Scan its keys left-to-right (or binary-search them) to find the first key ≥ your target.
  2. If the target equals that key, you're done — the record is here (or a pointer to it is).
  3. If the target is less than that key, follow the child pointer to the left of it. If you've passed all keys, follow the rightmost child pointer.
  4. Repeat at the child node. Each level eliminates all but one subtree.
  5. If you reach a leaf and the key is absent, the target is not in the tree.

Insertion and split-on-overflow

  1. Descend to the correct leaf using the same search logic — follow child pointers at each level until you hit a leaf.
  2. Insert the new key into the leaf in sorted order.
  3. If the leaf now has m keys (one too many), split it: the middle key is promoted to the parent; the remaining keys are divided left and right into two new sibling nodes.
  4. The parent has gained a key. If the parent is now also full, split it too — the middle key of the parent promotes to its parent. This can cascade upward.
  5. If the root splits, a new root is created holding just the single promoted key and two children. This is the only way a B-tree grows taller — trees always grow at the root, never at the leaves.
  6. All leaves remain at the same depth after every insertion.

Why fewer levels = fewer disk seeks

Each level of the tree is one page fetch. A binary search tree over 1 million records has depth ~20; an order-200 B-tree has depth ~3. That's the difference between 20 disk reads and 3. On spinning storage at 10 ms per seek, 20 seeks costs 200 ms; 3 seeks costs 30 ms — a 6× wall-clock difference on a single lookup, compounding across every query a database processes.

The minimum-keys rule

Every non-root internal node must hold at least ⌈m/2⌉ − 1 keys. For order 4, that's 1 key minimum, 3 keys maximum. This half-full guarantee bounds the tree's height from both directions: nodes can't be too sparse (tree stays shallow) and can't overflow (splits keep capacity bounded). The root is exempt — it can hold as few as 1 key and 2 children, which is what it looks like right after a root split.

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

About this simulation

An order-4 B-tree — each node holds up to 3 keys and up to 4 children. Type a number and press Insert, or hit Random to pick one, then watch the tree respond. When a node fills up it flashes red, the middle key is promoted to the parent, and the node splits in two — try Sequential 10→80 to see this cascade repeatedly and the root itself split. Use Search to trace a lookup from root to leaf; the stat bar tracks total keys, current height, live nodes, and splits performed. The Speed slider controls animation pace.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Watch repeated splits and root growth

Press Sequential 10→80 and let it run at default speed. Count the red flashes — each one is a node overflowing and splitting. Notice that once the root splits, the tree gains a new level: height jumps from 1 to 2, then later from 2 to 3. In an order-4 tree, the root first splits when the 4th key is inserted; subsequent root splits take progressively longer because more leaf splits must cascade up first. Try pausing with the Speed slider near zero to catch each promotion in slow motion.

try 02

Explore the mixed-insert pattern

Hit Reset then press Mixed inserts. This sequence interleaves small and large values so splits land at different internal nodes rather than cascading straight up the right spine. Watch how the tree's shape differs from the sequential case — the key distribution is more even, splits are more spread out, and the height grows more slowly. When the demo finishes, compare the splits counter to the sequential run: same number of insertions, but fewer root splits because internal nodes absorb more promotions before filling.

try 03

Search and count node visits

After running either demo, type a key that you saw inserted (e.g. 45) into the input and press Search. Watch the prototype highlight each node it visits, root → internal → leaf. Count them — that number is the height of the tree, and it equals the number of disk-page reads a real database would issue. Now search for a key that was never inserted (e.g. 99). The traversal still visits the same number of nodes before concluding the key is absent: B-tree search cost is always O(height), hit or miss.

try 04

Free play — break it yourself

Hit Reset and insert keys in a hand-crafted order. Try to force a root split with as few insertions as possible (hint: 4 insertions suffice in order 4). Then insert a key that lands in the same leaf as an existing key and confirm the split still produces a valid sorted tree. Finally, use Random repeatedly and watch whether the tree ever becomes unbalanced — it won't, and that's the point. The B-tree's invariants guarantee balance regardless of insertion order, which is exactly the guarantee an unbalanced BST cannot give.

In practice

When to use it — and what you give up

When a B-tree is the right structure

  • Database indexes — any SQL CREATE INDEX defaults to a B-tree. Range queries (WHERE age BETWEEN 20 AND 30), equality lookups, and ORDER BY all use the sorted leaf layer efficiently.
  • Filesystem directories — large directories (thousands of entries) in NTFS, HFS+, and ext4's dir_index feature are B-trees over filename keys, giving O(log n) lookup instead of O(n) linear scan.
  • Key-value stores on disk — RocksDB, LevelDB, and WiredTiger all use variants of B-trees or LSM trees where the sorted-page property is central.
  • Any sorted dataset larger than RAM — if the data fits in memory, a red-black tree or skip list is simpler; once the dataset exceeds RAM, the page-aligned B-tree wins on I/O.
  • Ordered iteration — B-trees (especially B+ trees with linked leaves) support efficient SELECT ... ORDER BY and range scans without a full sort.

When to reach for something else

  • In-memory data, small dataset — a red-black tree (used in std::map, Java TreeMap) or AVL tree has lower per-operation constant overhead when every node is already in cache.
  • Hash-only lookups, no ranges needed — a hash table gives O(1) exact-match lookup vs O(log_m n) for B-tree; use it when you never need range queries or sorted order.
  • Write-heavy workloads needing maximum throughput — LSM trees (Log-Structured Merge trees, as in RocksDB) batch writes in memory and flush sequentially, avoiding random-write amplification that B-trees can suffer on SSDs.
  • Multidimensional spatial queries — a B-tree indexes one dimension. Spatial queries (points within a rectangle) need an R-tree, k-d tree, or GiST index instead.

Pros

  • Logarithmic height on the branching factor — O(log_m n) pages per lookup; at m = 200, a billion-record index is only 4–5 levels deep.
  • Guaranteed balance — every leaf is at the same depth after every operation. No rebalancing heuristics, no worst-case degeneration like an unbalanced BST.
  • Efficient range scans — keys within a node are sorted; in a B+ tree, leaves are linked so a range query becomes one root-to-leaf descent then a sequential leaf scan.
  • Tunable to hardware — setting the order so each node fills exactly one disk block (typically 4 KB or 16 KB) aligns I/O to the storage unit, eliminating partial-page waste.
  • Half-century of production hardening — the structure and its deletion/merge algorithms are implemented, tested, and understood in every major database engine.

Cons

  • Split cascades add write amplification — a single insert can cause splits at every level up to the root, writing O(log_m n) pages instead of one.
  • Complex implementation — split, promote, merge-on-underflow, and borrow-from-sibling cover many edge cases. A correct, concurrent B-tree (with page latching) is a weeks-long engineering effort.
  • Poor locality for random writes on SSDs — SSD pages must be erased in large blocks; random in-place overwrites during splits cause write amplification at the firmware level, which LSM trees avoid.
  • Space overhead — the minimum-keys rule means pages are at most half-empty after a sequence of deletions, wasting up to 50 % of allocated space until a vacuum/reorganize pass.
  • Not cache-friendly for tiny datasets — for data that fits in L2/L3 cache, a sorted array with binary search often outperforms a B-tree due to lower pointer-chasing overhead.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

b-tree.ts
// Minimal B-tree: search + insert-with-split. Order is configurable.
// "Order m" means: max m children, max m-1 keys, min ⌈m/2⌉-1 keys (non-root).

const ORDER = 4; // prototype uses order 4 (max 3 keys/node)
const MAX_KEYS = ORDER - 1;       // 3
const MIN_KEYS = Math.ceil(ORDER / 2) - 1; // 1

interface BNode {
  keys: number[];
  children: BNode[];
  isLeaf: boolean;
}

function makeNode(isLeaf: boolean): BNode {
  return { keys: [], children: [], isLeaf };
}

// --- Search ---
// Returns true if key exists anywhere in the subtree rooted at node.
function search(node: BNode, key: number): boolean {
  let i = 0;
  while (i < node.keys.length && key > node.keys[i]) i++;

  if (i < node.keys.length && key === node.keys[i]) return true; // found
  if (node.isLeaf) return false;                                  // not here
  return search(node.children[i], key);                           // descend
}

// --- Insert (non-full root entry point) ---
let root: BNode = makeNode(true);

function insert(key: number): void {
  if (root.keys.length === MAX_KEYS) {
    // Root is full — split it and grow the tree upward.
    const oldRoot = root;
    root = makeNode(false);
    root.children.push(oldRoot);
    splitChild(root, 0); // split the only child (oldRoot)
  }
  insertNonFull(root, key);
}

// Precondition: node has room for at least one more key.
function insertNonFull(node: BNode, key: number): void {
  let i = node.keys.length - 1;
  if (node.isLeaf) {
    node.keys.push(0); // make room
    while (i >= 0 && key < node.keys[i]) {
      node.keys[i + 1] = node.keys[i];
      i--;
    }
    node.keys[i + 1] = key;
  } else {
    while (i >= 0 && key < node.keys[i]) i--;
    i++;
    if (node.children[i].keys.length === MAX_KEYS) {
      splitChild(node, i);
      if (key > node.keys[i]) i++;
    }
    insertNonFull(node.children[i], key);
  }
}

// Split the i-th child of parent (which must be full).
// Middle key is promoted into parent; two half-nodes remain.
function splitChild(parent: BNode, i: number): void {
  const full = parent.children[i];
  const mid = Math.floor(MAX_KEYS / 2); // index of the promoted key
  const right = makeNode(full.isLeaf);

  right.keys = full.keys.splice(mid + 1);       // right half
  const promoted = full.keys.splice(mid)[0];     // middle key

  if (!full.isLeaf) {
    right.children = full.children.splice(mid + 1);
  }

  parent.keys.splice(i, 0, promoted);            // promote
  parent.children.splice(i + 1, 0, right);       // link right sibling
}

References & further reading

7 sources

Knowledge check

Did the prototype land?

Quick questions, answers revealed on submit. No scoring saved.

question 01 / 03

In an order-4 B-tree, a leaf node currently holds the keys [10, 20, 30]. You insert 25. What happens?

question 02 / 03

A B-tree with order m over n records guarantees a search visits at most how many nodes (disk pages)?

question 03 / 03

B-trees always grow upward — the root splits rather than leaves being pushed down. What consequence does this have?

0/3 answered

Was this concept helpful?

Tell us what worked, or what to improve. We read every note.