Bloom Filter — Bloom & Cuckoo Filters

Overview

What this concept solves

A Bloom filter is the smallest useful 'have I seen this before?' data structure. It's a bit array, plus a handful of hash functions, plus a deliberate willingness to be wrong in one direction. Burton Bloom proposed it in 1970 to keep a dictionary of hyphenation rules in 24 KB instead of the megabytes a full hash table needed — and the same trick has been carrying weight ever since.

The setup is two lines of code: an array of m bits all starting at zero, and k independent hash functions that each map an item to one of those bits. Add sets the k bits to 1. Check asks: are all k of this item's bits set? If even one is zero, the item was definitely never added. If all are set, the item is maybe in the set — those bits could have been turned on by other items.

That's the whole structure. No keys are stored anywhere. You can't enumerate the set, you can't remove an item (we'll fix that in the next concept), and you can't be sure of a positive answer. In exchange you get constant-time lookups, ~10 bits per item for a 1% false-positive rate, and a filter so small it fits in L1 cache for millions of items.

Mechanics

How it works

Adding an item

Compute k hash values for the item — h₁(x), h₂(x), ..., hₖ(x) — each modulo m to get a bit index.
Set those k bits in the array to 1. Bits that were already 1 stay 1.
That's it. The item itself is never stored. You can't even tell which bits belong to which item.

Checking an item

Compute the same k hash values for the query item.
Look at those k bits. If any is 0, return definitely-not — no item that hashes to that bit was ever added.
If all are 1, return maybe-yes — either the item was added, or other items happen to have collectively set every one of its bits (a false positive).

The false-positive formula

After inserting n items into a Bloom filter of m bits with k hash functions, the probability of a false positive is roughly (1 - e^(-kn/m))^k. The optimal k for a given m/n is (m/n) · ln 2, which makes the formula collapse to (1/2)^k. The practical takeaway: aim for ~10 bits per item and k ≈ 7 to land near a 1% false-positive rate.

It only gets worse, not better

A Bloom filter's false-positive rate climbs monotonically as you add items — once a bit is set, nothing in a vanilla Bloom filter can ever turn it off. That's why a filter is sized for the maximum expected n upfront and rebuilt from scratch when it gets too full, not edited in place.

Interactive prototype

Run it. Break it. Tune it.

Sandboxed simulation embedded right in the page. No setup, no install.

simulation › Bloom Filter

About this simulation

A 24-cell bit array with k = 3 hash functions. Pick a scenario — Add & check, False positive, or Saturation — and step through with Prev / Next, or jump into Free play to add and check words yourself. Watch the bits light up, the stats update, and the narration explain why each answer is definitely-no or maybe-yes.

Hands-on

Try these on your own

Open the prototype above, run each experiment, predict the answer, then verify.

try 01

Walk Add & check

Open the Add & check scenario and step through with Next. Watch the bits for apple light up, then a check for apple highlight the same three bits (all on → maybe-yes), then a check for banana find an empty bit (definitely-no). Notice that the definitely-no answer is exact: no probability involved.

try 02

See a false positive happen

Switch to the False positive scenario. A handful of words is added until their bits cover most of the array, then a word that was never added is checked and all three of its bits happen to be lit by other words. Maybe-yes — but it's wrong. This is the rate the formula predicts; it isn't a bug.

try 03

Watch saturation kill the filter

Step through Saturation. As more words are added, the Bits set stat climbs and the Estimated FPR climbs with it. After enough adds, almost every check is a false positive. The practical lesson: size m for the worst-case n, and rebuild before saturation if the data keeps growing.

try 04

Free play — break it yourself

Open Free play. Add half a dozen common words, then check ones you never added — try short words first (more likely to hash-collide). Note the Estimated FPR in the stats and try to find a false positive yourself. Use Reset to start over with an empty filter.

In practice

When to use it — and what you give up

When it's the right tool

You need to skip an expensive lookup — disk I/O, network call, a cache miss that costs you 10ms. The filter rules out the definite-misses in nanoseconds.
Your set is mostly append-only — a crawl frontier of visited URLs, a denylist that rebuilds nightly, an SSTable's key summary that's immutable once written.
Some false positives are acceptable because you have a real source of truth behind the filter that confirms the maybe-hits.
Memory matters more than perfection — billions of items in a few GB instead of 50× that.

When to reach for something else

You need to remove items — use a [counting Bloom filter](/topics/bloom-filters/counting-bloom) or a [cuckoo filter](/topics/bloom-filters/cuckoo).
You need to enumerate or iterate the set — Bloom filters can't list what's in them. Use a hash set.
You need exact answers — payment systems, primary-key indexes, uniqueness constraints. The 1% wrong-answer rate is a bug, not a feature.
Your set is tiny (a few hundred items) — the overhead of k hashes per op isn't worth it vs. a plain hash set that already fits in cache.

Pros

Tiny memory — ~9.6 bits per item for a 1% false-positive rate, independent of item size. A million 100-byte URLs fit in 1.2 MB.
O(k) add and check — constant time, no allocations, no rebalancing, no GC pressure.
No-false-negative guarantee — a 'no' answer is always correct, which lets you use the filter as a definite skip.
Mergeable — two Bloom filters of the same shape can be OR'd together to get the union of their sets. Maps cleanly onto distributed systems.
Cache-friendly — k hash lookups touch at most k cache lines, often fewer.

Cons

Can't delete — once a bit is set, it stays set. Removing an item would require turning off bits that other items might still need.
FPR grows with fill — overestimate n and you waste memory; underestimate and the false-positive rate spirals.
Fixed size — resizing means rehashing every item; in practice you rebuild from the source of truth instead.
Not key-recoverable — there's no way to retrieve what was inserted; the filter only answers yes/no.
k hashes per op — for k=7 that's seven cache touches; cuckoo filters do the same job in just two.

Reference

Code & further reading

A minimal reference implementation and pointers worth bookmarking.

bloom-filter.ts

// A teaching Bloom filter — k hashes, m bits, no deletion.
class BloomFilter {
  private bits: Uint8Array;

  constructor(
    private readonly m: number,   // bit-array size
    private readonly k: number,   // number of hash functions
  ) {
    this.bits = new Uint8Array(m);
  }

  add(item: string): void {
    for (const i of this.indices(item)) this.bits[i] = 1;
  }

  has(item: string): boolean {
    for (const i of this.indices(item)) {
      if (this.bits[i] === 0) return false;   // definitely-not
    }
    return true;                              // maybe-yes
  }

  // Double-hashing: produce k indices from two base hashes,
  // following Kirsch & Mitzenmacher's "Less Hashing, Same Performance."
  private *indices(item: string) {
    const h1 = fnv1a(item);
    const h2 = murmur3(item);
    for (let i = 0; i < this.k; i++) {
      yield Math.abs((h1 + i * h2) % this.m);
    }
  }

  // Optimal sizing for n items at false-positive rate p.
  static optimalSize(n: number, p: number) {
    const m = Math.ceil(-(n * Math.log(p)) / Math.LN2 ** 2);
    const k = Math.max(1, Math.round((m / n) * Math.LN2));
    return { m, k };
  }
}

// Example: 1M items at 1% FPR → m ≈ 9.6 MB of bits, k = 7
const { m, k } = BloomFilter.optimalSize(1_000_000, 0.01);
const filter = new BloomFilter(m, k);
filter.add("alice@example.com");
filter.has("alice@example.com");  // true (was added)
filter.has("bob@example.com");    // almost certainly false

References & further reading

6 sources

Knowledge check

Did the prototype land?

Quick questions, answers revealed on submit. Sign in to save your best score.

question 01 / 03

A Bloom filter says a key is 'present.' Which of these is the only safe conclusion?

question 02 / 03

Why can't a standard Bloom filter support deletion?

question 03 / 03

You're designing a Bloom filter for 10 million URLs at a target false-positive rate of 1%. Roughly how many bits do you need?

0/3 answered