The Shape of Duplicate Detection

TL;DR

Duplicate detection looks like a solved problem. You keep a set, skip values you have already seen, and move on. A benchmark suite with 4050 measurement rows across 480 workloads says the reality is more interesting.

For dense 32-bit integers, a bitset can be 94 times faster than std::unordered_set at a million elements. For long strings, sorting with fingerprints can beat a hash set by 2.7 times. For streaming workloads, an in-memory sliding window costs about 68 nanoseconds per event, while a PostgreSQL-backed detector with per-event transactions costs about 6.1 milliseconds. That is a 90,000-fold gap between the fastest and slowest way to answer the same question.

The right structure depends on the shape of your keys, the pattern of your duplicates, the input size, and the durability guarantees you need.

The simple problem that is not simple

The standard answer to duplicate detection is a loop and a hash set:

if (!seen.contains(key)) {
    seen.insert(key);
    output.push_back(key);
}

This works. It is also a placeholder. The machine underneath that abstraction has to answer harder questions. I wrote about this trap in Reading Algorithms Like an Engineer: pseudocode names like L, V, and “remove” are not implementation details to fill in later. They are where the performance and correctness work begins. The same thing happens with duplicate detection. The word “set” hides a choice of representation, and that choice dominates the cost. Are your keys dense integers that could index directly into an array? Are duplicates clustered together, or scattered uniformly? Are your strings short enough to compare quickly, or long enough that every comparison walks cache lines? Is the input a finite batch you can sort, or an infinite stream where you must eventually forget old keys? Do your results need to survive a process restart?

I built a C++ benchmark suite to measure those questions directly. The code is at dubeyKartikay/de-dup-bench. It covers two problem families: finite duplicate removal from a fixed input, and streaming duplicate detection that returns stable idempotency keys for incoming events. The key types are 32-bit integers, 64-bit integers, 16-byte short strings, and 500-byte long strings. The duplicate patterns are uniform random, mostly unique, mostly duplicate, clustered duplicates, and exactly one duplicate. For streams, the backends are an in-memory sliding window, RocksDB with and without batched writes, and PostgreSQL with and without batched commits. The full suite produced 4050 measurement rows.

This post is the story those numbers tell.

Dense integers are an array problem

When your keys are dense 32-bit integers sampled from a bounded range, a hash set is doing unnecessary work. It hashes the key, probes buckets, handles collisions, and chases pointers through memory. A bitset turns the same membership test into a single indexed bit operation.

The charts below show two strategies across two duplicate distributions, from 100 elements to 1 million. The bitset stays flat and fast across every distribution. The hash set climbs steadily.

Dense integer scaling — uniform random

Bitset stays flat while hash and tree sets climb with input size. For dense bounded integers, membership is an indexing problem, not a hashing problem.

Dense integer scaling — clustered duplicates

Clustered duplicates narrow the gap because recent hash buckets stay hot. Locality helps the hash set, yet the bitset still wins.

At one million elements with uniformly distributed integers, the benchmark measured these per-insertion costs:

strategynanoseconds per insert
growable bitset5.1
roaring bitmap165
sort then unique60
std::unordered_set483
std::set1154

The bitset finishes in 5.1 nanoseconds. std::unordered_set takes 483 nanoseconds. That is a 94-fold difference. Both structures answer the same question correctly. One treats the key as an array index. The other treats it as a generic object. I explored this exact gap in more detail in Why You Should Never Use a set, where the bitset outperformed the hash table by roughly 15× on a similar dense-integer workload.

The gap narrows when duplicates are clustered. With clustered duplicates, the bitset still wins but by a smaller margin: 1.1 nanoseconds versus 10.7 nanoseconds for the hash set, about 9.7 times faster. Clustered duplicates mean many early insertions succeed, so the hash set spends less time on collision chains. The bitset is still cheaper because its membership test is a single memory access regardless of the duplicate pattern.

The lesson here is direct: if your key is already an array index, do not turn it into a hash problem.

There is no universal winner

The bitset dominates dense integers, but it does not generalize. Across all finite workloads and stream workloads, the fastest strategy changes with every dimension.

The abbreviations map to the following strategies: BIT is a growable bitset; RBM is a Roaring compressed bitmap; SRT is sort-then-unique over the full key; UNS is std::unordered_set; S+F sorts by a 64-bit XXH3 fingerprint and verifies the full string only on fingerprint collision; SSI is sorted-vector insertion; SIF is sorted-vector insertion with fingerprints; FLT is a bitset pre-filter backed by a hash set; and L+F is a linear fingerprint scan.

The heatmaps below show which strategy wins for each combination of duplicate distribution and input size. BIT dominates 32-bit integers. SRT and RBM trade wins for 64-bit integers. For short strings, S+F and UNS trade places. For long strings, S+F wins most cells. The color blocks change shape as you move across the grid. A strategy that is optimal at one coordinate is mediocre at another.

Fastest finite strategy — 32-bit integer keys

Every cell is BIT. For dense 32-bit IDs, the best strategy is stable across all tested sizes and duplicate patterns.

Fastest finite strategy — 64-bit integer keys

The winner shifts between sorting, Roaring, hash sets, and sorted insertion. Once integers stop being a compact array index, the best choice depends on size and duplicate pattern.

Fastest finite strategy — short text keys

Fingerprinted sorting takes unique-like cells; hash sets take duplicate-heavy ones. Short strings need a strategy chosen from the duplicate shape, not just the key type.

Fastest finite strategy — long text keys

Fingerprinted sorting dominates uniform and mostly-unique workloads; hash sets win when repeats cluster. Long strings reward avoiding full comparisons unless repeated keys stay hot.

For 64-bit integers with a uniform distribution, the winner at one million elements is sort plus unique at 65 nanoseconds per insert, followed by Roaring bitmap at 165 nanoseconds, with std::unordered_set trailing at 518 nanoseconds. For mostly-unique 64-bit integers, the Roaring bitmap takes the lead at 49 nanoseconds, with sort plus unique at 79 nanoseconds and the hash set at 338 nanoseconds. The hash set does relatively better on mostly-duplicate workloads because most lookups hit an existing entry early, but it still loses to a bitset or a sort.

For short strings at one million elements with uniform distribution, std::unordered_set costs about 1234 nanoseconds per insert. Sorting the strings directly costs 991 nanoseconds. Sorting with a 64-bit XXH3 fingerprint first, then verifying the full string only on fingerprint collisions, drops the cost to 430 nanoseconds. The fingerprinted sort is 2.9 times faster than the hash set.

For long strings under the same conditions, the hash set balloons to 3052 nanoseconds. Direct sorting costs 2103 nanoseconds. Fingerprinted sorting drops to 1139 nanoseconds, about 2.7 times faster than the hash set. The fingerprint helps more for long strings because it replaces expensive full-string comparisons with cheap 64-bit integer comparisons during the sort.

Clustered duplicate strings are the one case where the hash set shines. When the same strings appear in dense clusters, the hash set caches recently accessed buckets, and most lookups hit the first probe. On clustered short strings, the hash set can beat sorting because it avoids the O(n log n) overhead for data that is already nearly grouped.

The fastest strategy depends on the key type, the duplicate distribution, and the input size. Any advice that says “just use a set” has not looked at the numbers.

How expensive is the default choice?

The heatmaps below show the ratio of std::unordered_set time to the fastest strategy time for every finite workload. Deep blue means the hash set is close to optimal. Red and orange mean it is falling behind. For dense 32-bit integers, the hash set is up to 94 times slower. For dense integers in a known universe, the gap stretches past 110x. For short strings in the one-duplicate pattern, the hash set is nearly optimal because there is almost nothing to deduplicate. The pattern is not random. It is structural.

Hash set slowdown vs. fastest strategy — 32-bit integer keys

The hash set ranges from about 2× slower to over 200× slower. Dense 32-bit integers are the worst place to spend cycles on generic hashing.

Hash set slowdown vs. fastest strategy — 64-bit integer keys

The hash set usually stays within a few multiples of the winner, with worst cells around 10×. For 64-bit integers, hashing is less disastrous but still often not optimal.

Hash set slowdown vs. fastest strategy — short text keys

Hash sets are nearly optimal in clustered-duplicate cells but fall behind on unique-looking workloads. Short-text performance is controlled by how often repeats appear.

Hash set slowdown vs. fastest strategy — long text keys

The hash set is close to optimal for clustered repeats but lags on mostly-unique long strings. Long-key hashing pays off only when duplicates are hot enough.

The worst case for the hash set is dense integers in a known universe. At one million elements with a known bound of 10 * N, the pre-sized bitset takes 2.1 nanoseconds while the hash set takes 236 nanoseconds. That is roughly 110 times slower. Even with a tight bound of exactly N, the pre-sized bitset takes 4.5 nanoseconds while the hash set takes 147 nanoseconds, about 33 times slower.

The hash set is not always bad. For sparse data or unbounded keys, it is often the only practical exact structure. The point is that it is expensive insurance against structure you may already know. If you can answer the shape questions from the previous section, you can often do better.

Text keys change the cost model

With long strings, the dominant cost is not the hash table structure. It is the work of comparing full keys. Every hash collision forces a string comparison. Every sort comparison walks both strings byte by byte. For 500-byte strings, that is a lot of cache lines.

The bar chart below compares the three approaches for long strings across all five duplicate distributions.

Fingerprint speedup for finite string workloads — long text keys, N=1M

Fingerprinted sorting cuts long-string costs sharply in unique-like workloads, while hash sets stay best for dense repeats. The longer the string, the more valuable it is to avoid full comparisons.

Fingerprints change the math by making the common case cheap. The benchmark uses XXH3 to compute a 64-bit fingerprint for each string. In the fingerprinted sort, strings are sorted by fingerprint first. The full string comparison only runs when two fingerprints collide, which is rare for 64-bit hashes.

The speedup is measurable. For long strings at one million elements, fingerprinted sorting is 1.8 times faster than direct sorting. For short strings, the gain is 2.3 times. The fingerprint helps more for longer strings because the baseline comparison cost is higher, so avoiding it saves more work.

Fingerprints are not magic. They add the cost of hashing every string upfront. For short strings, that overhead can outweigh the savings. For long strings, the upfront hash is a clear win. You also still need full-key verification if correctness matters, because a 64-bit fingerprint can theoretically collide. In practice, the collision probability is negligible for most workloads, but the verification step is what makes the guarantee exact.

With 500-byte long strings, a fingerprinted sliding window costs 344 nanoseconds per event versus 872 nanoseconds for the full-key version. That is a 2.5 times improvement. For short strings, the gap is smaller: 169 nanoseconds versus 199 nanoseconds, about 1.2 times, because the string comparison itself is already cheap.

The question for strings is not just “hash table or sort?” It is “how often do I force the CPU to inspect the whole string?”

Known bounds are a superpower

If your integer keys have a known range, you can pre-size the bitset. You can also reserve capacity in the hash set. Both optimizations turn metadata into performance.

Pre-sized bitsets with a known bound of N or 10 * N sit at the bottom of both panels. Reserved hash sets improve over the unreserved baseline for uniform input, but can actually cost more when duplicates dominate.

Known-universe vs unknown-universe — uniform random

Pre-sized bitsets and reserved hash sets both improve uniform random workloads, yet bitsets remain far below hash sets. A key-range hint is high-leverage when values are spread across the universe.

Known-universe vs unknown-universe — mostly duplicate

The growable bitset is already near the floor, while reserved hash sets can actually cost more than the unreserved version. When duplicates dominate, preallocation is not automatically a win.

The benchmark measures this directly with a known-universe variant. For 32-bit integers at one million elements with a known bound of 10 * N, the pre-sized bitset drops from 5.1 nanoseconds to 2.1 nanoseconds. The same bound with a hash set and reserve(10 * N) drops the cost from 483 nanoseconds to 236 nanoseconds. With a tight bound of exactly N, the pre-sized bitset takes 4.5 nanoseconds. The reserved hash set takes 147 nanoseconds. The ratio is still 33 times in favor of the bitset, but both numbers are better than their unknown-universe counterparts.

If your API can accept a key-range hint, that hint can be worth more than a micro-optimization inside the loop. A caller who knows the universe can pass it down. A library that ignores that information leaves performance on the table.

Streaming duplicate detection is a different beast

Batch duplicate removal and stream duplicate detection are not the same problem. A batch algorithm sees the entire input before it produces output. A stream algorithm must decide, for each incoming event, whether that event is a duplicate of something that arrived earlier. It also has to answer a question that batch algorithms do not: how long do you remember old keys?

Backend cost by duplicate distribution for streaming

Memory and RocksDB stay in the nanosecond-to-microsecond range while PostgreSQL moves into microseconds or milliseconds. Backend guarantees dominate over duplicate distribution.

Memory windows cost tens to hundreds of nanoseconds. RocksDB costs single-digit microseconds. PostgreSQL costs tens to hundreds of microseconds. The duplicate pattern changes the absolute numbers but not the relative ordering.

The benchmark tests three tiers of stream backends. The fastest is a bounded in-memory sliding window. At 100 elements, it costs about 68 nanoseconds per event for integers and about 199 nanoseconds for short strings. The window can forget old keys, which makes it fast but also means duplicates outside the window pass through.

The middle tier is RocksDB on local disk with batched writes. For integer streams, the fastest RocksDB configuration costs about 4400 nanoseconds per event. That is roughly 65 times slower than the memory window, but the state survives process restart.

The slowest tier is PostgreSQL with shared transactional state. Batched commits bring the cost to about 116,000 nanoseconds per event, or 116 microseconds. Per-event transactions, which give full transactional isolation, cost about 6.1 milliseconds per event. That is 1700 times slower than batched commits and roughly 90,000 times slower than the in-memory window.

The fastest duplicate detector is the one allowed to forget. Memory is nanoseconds. RocksDB is microseconds. PostgreSQL is milliseconds. The gap between tiers is not a constant factor. It is orders of magnitude, and each tier buys a different guarantee.

The price of stronger guarantees

Speed cost of stronger stream de-dup guarantee levels

Each stronger guarantee tier shifts right by orders of magnitude. The fastest detector is the one allowed to keep less durable state.

The chart is not “which is fastest?” It is “what guarantee are you buying?” If you need deduplication across process restarts and your application runs on a single machine, RocksDB is a reasonable middle ground. If you need shared state across a distributed system, PostgreSQL is the right tool, but you must accept the latency. If you only need to deduplicate within a bounded recent window, memory is the obvious choice.

Latency is the price tag on your failure semantics. The question is not which backend is fastest in isolation. It is which backend is fast enough for your guarantee requirements.

Batching is the difference between a strategy and a disaster

For durable duplicate detection, batching is not an optimization. It is the design.

SQL and RocksDB batch-size impact

Batching collapses PostgreSQL commit cost from milliseconds toward microseconds; RocksDB sees a smaller but real gain. Durable deduplication should be designed around grouped writes.

The benchmark measures SQL and RocksDB with batch sizes of 1 and 1000. For PostgreSQL, the difference is extreme. With a batch size of 1, batched SQL commits cost about 6.9 milliseconds per event. With a batch size of 1000, the cost drops to about 116 microseconds. That is a 60-fold improvement. The per-event transaction path, which uses no batching at all, costs about 6.1 milliseconds. Batching changes the order of magnitude.

RocksDB also benefits from batching, but less dramatically. With batch size 1, RocksDB batched writes cost about 5093 nanoseconds per event. With batch size 1000, the cost drops to about 4400 nanoseconds. That is only a 1.2 times improvement because RocksDB’s write batch is already efficient at the storage engine level. The win is smaller, but it is still present.

The lesson is consistent across both backends: if you are writing to durable storage, do not commit on every event. Group them. The exact batch size depends on your latency budget and memory constraints, but the difference between batch size 1 and batch size 1000 is the difference between a usable system and a bottleneck.

Window size is a workload parameter

For in-memory streaming windows, larger windows are not automatically faster or slower. The interaction between window size and duplicate pattern matters.

The line chart below shows how the in-memory window cost changes as the window grows from 100 to 5000 elements, for integer keys and each duplicate distribution. For mostly-duplicate streams, the lines stay flat or drop, because a larger window catches more repeats. For mostly-unique streams, the lines climb, because every new key is a cache miss into a larger working set.

Window-size sensitivity — 32-bit integer keys

Larger windows hurt uniform and mostly-unique integer streams but help mostly-duplicate ones. The right window size is the expected duplicate distance, not the largest value you can afford.

Stream deduplication should be tuned from the expected duplicate distance, not from a generic default. If your duplicates arrive within seconds of each other, a small window and frequent eviction work. If duplicates might be hours apart, you need a larger window or a durable backend.

A practical decision table

Here is how the benchmark results map to choices.

Dense bounded integer IDs. Use a pre-sized bitset. At one million elements, it can be 30 to 110 times faster than a hash set depending on how tight the bound is.

Sparse integer IDs. Try a compressed bitmap like Roaring for 64-bit keys, or sort plus unique if the input is a finite batch. Both can beat std::unordered_set by a factor of 3 to 10.

Mostly-unique finite data. Sorting often competes well because memory access is predictable. At large sizes, the O(n log n) cost is hidden by cache-friendly sequential access.

Long strings. Use fingerprints carefully, with full-key verification if correctness matters. Fingerprinted sorting can be 1.8 to 2.7 times faster than a hash set.

Clustered duplicate strings. A hash set can be excellent because recent buckets stay hot in cache.

Streaming with bounded memory. Use an in-memory sliding window. It costs tens to hundreds of nanoseconds per event.

Streaming with local durability. Use RocksDB with batched writes. It costs a few microseconds per event.

Streaming with shared transactional semantics. Use PostgreSQL, but batch commits. Per-event transactions cost milliseconds. Batched commits cost microseconds.

Closing

Duplicate detection is not one algorithm. It is a family of representations, and the right representation depends on the key shape, the duplicate distribution, the input size, the memory locality, and the durability requirements.

A dense integer is an array index. A long string is a comparison problem. A stream is a forgetting problem. A durable store is a batching problem. Each of these shapes has a structure that fits it well, and a default choice that fits it poorly.

The fastest set is often not a set. It is the data structure your key space was trying to be.