Why You Should Never Use a set: C++ Memory Costs

TL;DR

Hash tables have a cost that shows up in hot loops. If your ids are dense integers, std::unordered_set has to hash, allocate, grow buckets, and miss the cache. A bitset does the same membership check with one indexed memory operation.

Deduplicating 1000 vectors of 500,000 dense ids took about 50,302 ms with std::unordered_set, 26,759 ms with sort + unique, and 3,423 ms with boost::dynamic_bitset<>.

When ids are sparse and the universe is huge, std::unordered_set does better because a bitset has to clear a mostly empty universe each time. Before you reach for std::unordered_set, ask whether your set is actually an array problem.

The trap in the word “set”

std::unordered_set is sometimes the right tool. The problem is we reach for it too early, especially inside search loops.

It acts like an expensive memory machine. Buckets get allocated. Keys get hashed. The table grows, rehashes, and scatters data in ways the prefetcher cannot follow. The lookup is average O(1) on a whiteboard. On actual silicon the constant hiding inside that O(1) can be brutal.

I ran into this while implementing Vamana for the post on Reading Algorithms Like an Engineer. The DiskANN pseudocode has a visited set V. C++ cannot allocate an abstract set. It has to pick a representation, and that representation becomes part of the real cost.

In greedy graph search you keep asking the same thing:

Have I already inserted this node?

That sounds like a set, so the first draft usually uses std::unordered_set. But if node ids are dense integers, the better move is a single indexed memory access.

This post measures that decision directly: what happens when the visited set is a hash table instead of a dense bitset.

Benchmark setup

I used a small C++ benchmark compiled with:

g++ -O3 -march=native -std=c++20 \
  -I/home/kartikayd/.lafayette/work/deps/boost_1_85_0 \
  dup_bench_mem.cpp -o dup_bench_mem

For each n, the benchmark deduplicates 1000 generated vectors and reports total latency. Each vector has n random uint32_t ids.

I deliberately wrote the straightforward version. The std::unordered_set uses no reserve(). The output vector also uses no reserve(). Container growth and rehashing are part of the measurement. The boost::dynamic_bitset<> object is allocated once per run and cleared before each vector. Its output vector is rebuilt every time, so it does not reuse capacity across inputs. I saved the raw measurements in dup-bench-results.csv and rendered the charts with Matplotlib.

The hash-table version looks like this:

std::unordered_set<uint32_t> seen;
std::vector<uint32_t> out;

for (uint32_t x : input) {
    if (seen.insert(x).second) out.push_back(x);
}

The dense-id bitset version uses boost::dynamic_bitset<>:

boost::dynamic_bitset<> seen(universe);

for each input vector:
    seen.reset();
    std::vector<uint32_t> out;

    for (uint32_t x : input) {
        if (!seen.test(x)) {
            seen.set(x);
            out.push_back(x);
        }
    }

I also included sort + unique. Contiguous memory can make even O(n log n) surprisingly competitive with a hash table.

Dense node ids

In the first benchmark, ids are sampled from [0, 2n). This is close to the graph-search case: node ids are dense, bounded, and directly indexable.

Dense benchmark

n	unordered_set ms	sort+unique ms	boost bitset ms	unordered vs bitset
128	5	3	1	7.6×
2,048	77	72	8	9.2×
32,768	1,649	1,455	177	9.3×
131,072	7,940	6,384	832	9.5×
500,000	50,302	26,759	3,423	14.7×

At n = 500,000, std::unordered_set took about 50,302 ms for 1000 vectors. boost::dynamic_bitset<> took about 3,423 ms.

Both versions remove duplicates. With a bitset, membership is just an indexed bit test. The hash table has to hash the key, grow buckets, handle collisions, and chase pointers through memory.

sort + unique also beats std::unordered_set at larger sizes. It works over contiguous memory, and CPUs like that.

Sparse ids, huge universe

Now suppose ids are sparse. I sampled only n ids from a fixed universe of 100,000,000 possible values. A direct bitset has to represent that whole universe and clear it before every vector.

Sparse benchmark

n	unordered_set ms	boost bitset ms	faster
8	0.2	138.3	unordered_set
128	6.3	149.7	unordered_set
2,048	91.9	145.5	unordered_set
8,192	379.5	181.1	bitset
32,768	1,877.6	402.4	bitset
65,536	4,169.3	985.4	bitset

std::unordered_set wins for small sparse inputs. At n = 128, the hash set took about 6.3 ms, while the bitset took about 149.7 ms. The bitset pays to clear memory for a massive universe even though the input is tiny. Once n grows enough, that fixed clearing cost gets amortized and the bitset pulls ahead.

So when do you pick what?

Dense ids: use a bitset or another array-indexed structure. Sparse or unbounded ids: a hash set may be the right tool.

std::unordered_set wins when the alternative forces you to pay for a huge mostly empty universe.

Why say “never use it”?

People reach for std::unordered_set because O(1) sounds unbeatable. In performance-sensitive code, especially inside a search loop, Big-O is only the first filter. The CPU still has to execute the thing.

A dense bitset packs memory tight and the operations stay simple. Vectors keep everything contiguous, and if you sort them you get predictable access. A hash table is more general, but you pay for that flexibility with a heavier constant.

The habit that helps:

Never use std::unordered_set until you have asked whether your data is actually an array problem.

Dense ids? Make the duplicate check an indexed load or store. Small sparse input? A hash table may be better. Data you can sort once and scan many times? Try a sorted vector. The point is to pick the container based on the memory access pattern it creates, not just the Big-O.

Practical rule

Use std::unordered_set when:

ids are sparse or unbounded
the universe is much larger than the number of elements
keys are not integer-indexable
convenience matters more than hot-loop speed

Avoid it when:

ids are dense indexes
the code is inside a hot loop
the set is rebuilt again and again
a bitset, boolean array, sorted vector, or plain vector scan fits the data shape

CPUs are fast at arithmetic and slow at waiting for memory. std::unordered_set often makes them wait. That is the same lesson from the Vamana post in smaller form: the paper gives the map; the memory operations are the terrain.