Bloom & Cuckoo Filters
Three probabilistic set-membership structures that answer 'have I seen this before?' in a few bytes per item. From the classic bit-array Bloom filter to the counting variant that can delete, to the cuckoo filter that does it all with a smaller memory footprint.
What is Bloom & Cuckoo Filters?
The 60-second primer
A probabilistic filter answers 'have I seen this before?' in a handful of bytes per item. A hash set works too, but it has to store every key — fine for thousands, painful for billions. A Bloom filter, a counting Bloom filter, or a cuckoo filter give up certainty in one direction and get back orders of magnitude less memory. You'll see them in browsers (Safari's bad-URL list), databases (Cassandra/HBase/RocksDB skip disk reads), CDNs (cached-or-not), and every CRDB-style system that has to ask 'do I have this row?' over the network.
The trick is the kind of error they allow. All three are one-sided: a positive answer is maybe, but a negative answer is definitely not. That asymmetry is the whole product. You use the filter to skip an expensive lookup that would have failed; the rare false-positive just means you pay for the lookup anyway. You never miss a real hit.
The three variants here build on each other. Bloom is the original — bits, hashes, beautifully small, but you can never delete. Counting Bloom turns each bit into a small counter so you can delete by decrementing, at the cost of more memory. Cuckoo is the modern alternative: stores short fingerprints in two candidate buckets and uses a 'cuckoo' eviction to make room — it deletes safely and uses less space than Bloom at the same false-positive rate.
Where this shows up
- Database storage engines — Cassandra, HBase, LevelDB, RocksDB, ScyllaDB attach a Bloom filter to every SSTable so a key lookup can skip files that definitely don't hold it instead of opening every file on disk.
- CDNs and caches — 'do I have this object?' before a multi-hop fetch. A 0.1%-false-positive filter the size of a thumbnail can rule out a million URLs.
- Web browsers — Chrome's Safe Browsing and old Safari builds shipped a Bloom filter of malicious URLs locally; only the maybe-hits actually call the remote checker.
- Crypto wallets & Bitcoin SPV — wallets register a Bloom filter of their addresses so the node sends back only matching transactions without learning the wallet's full set.
- Networking & dedup — TCP/IP duplicate-packet detection, CDN URL deduplication, web-crawler 'already-visited' set, Kafka exactly-once log compaction. Anywhere a 'have I seen this?' check fronts an expensive op.
- Privacy-preserving sketches — sending a filter is cheaper and leaks less than sending the raw set, which is why so many distributed systems use them at the protocol level.
Why give up exactness?
A hash set storing 1 billion 64-byte URLs needs ~60 GB. A Bloom filter with the same membership question, tuned for a 1% false-positive rate, needs ~1.2 GB — fifty times less. You trade rare extra lookups for fitting the whole thing in RAM. For most 'do I have this?' questions, that's the right trade.
Side-by-side
How they compare
The same concepts, on the same axes. Use this as a map; the individual pages are the territory.
| Filter | Delete? | Bits per item (1% FPR) | False-positive shape | Use when |
|---|---|---|---|---|
01Bloom Filter | No | ~9.6 bits | Grows with fill; can't shrink | Append-only sets — crawl frontier, SSTable summaries, denylists. |
02Counting Bloom | Yes (decrement) | ~38 bits (4-bit counters) | Same shape as Bloom | Membership with churn — sessions, sliding-window dedup, cache admission. |
03Cuckoo Filter | Yes (clean remove) | ~7 bits (typical) | Stable; insert can fail when very full | Modern default — smaller than Bloom, supports delete, predictable lookups. |
- Delete?
- No
- Bits per item (1% FPR)
- ~9.6 bits
- False-positive shape
Grows with fill; can't shrink- Use when
- Append-only sets — crawl frontier, SSTable summaries, denylists.
- Delete?
- Yes (decrement)
- Bits per item (1% FPR)
- ~38 bits (4-bit counters)
- False-positive shape
Same shape as Bloom- Use when
- Membership with churn — sessions, sliding-window dedup, cache admission.
- Delete?
- Yes (clean remove)
- Bits per item (1% FPR)
- ~7 bits (typical)
- False-positive shape
Stable; insert can fail when very full- Use when
- Modern default — smaller than Bloom, supports delete, predictable lookups.
Decision guide
Which one should you use?
A practical tour of when each algorithm wins.
How to pick
- You only ever add (a crawl frontier, an immutable SSTable summary, a denylist that you rebuild instead of patching) → Bloom filter. Smallest and simplest; deletion isn't worth paying for if you don't need it.
- *You need to add and remove, and you don't want to think about 'is the filter full?' edge cases → Counting Bloom*. Mature, easy to reason about, costs ~4× the memory.
- *You need delete and the smallest filter at a given false-positive rate, and you can handle the rare insertion failure → Cuckoo filter*. Slightly more code, noticeably less RAM, faster lookups (only two cache lines touched).
- You're picking a default in 2026 → Cuckoo. The Fan/Andersen/Kaminsky paper changed the landscape; for delete-supporting filters it's almost always the better engineering choice.
Always pair it with the source of truth
A filter never answers a query — it only lets you skip the work for the definite-misses. Behind every Bloom/cuckoo filter in production there is a real store (a hash table, a B-tree, an SSTable on disk) that confirms the maybe-hits. Designing without that backstop is how you turn a 1% false-positive rate into a 1% wrong-result rate, which is a bug.
Concepts in this track
3 concepts, in order
Each links to a concept page with its own explanation, prototype, and quiz.
Bloom Filter
A bit array plus a few hash functions: definitely-not or maybe-yes membership, in a fraction of the memory a hash set needs.
Counting Bloom Filter
Each bit becomes a small counter so items can be removed by decrementing — at roughly 4× the memory of a vanilla Bloom filter.
Cuckoo Filter
Store short fingerprints in two candidate buckets and bump existing entries to their other home on collision. Delete-supporting and usually smaller than Bloom.
Related tracks
If this one clicks, try these next.
Cache Write Policies
Three ways to handle a write when you have a cache in front of the store. Each policy is a different bet about durability, throughput, and how stale your data is allowed to get.
Cache Eviction
When the cache fills up, something has to go — and which one you pick decides your hit rate. Ten classic policies, side-by-side.
Garbage Collection
How a runtime reclaims memory you stopped using — without you ever calling free(). Eight algorithms, from the counter on every object to the collectors that run alongside your program.