

Small-map-as-sorted-vec

Represent an associative container whose key count is empirically small and bounded as a flat sorted Vec<(key, value)> (a.k.a. flat map, sorted vector map) rather than a balanced-tree map or hash map. Theoretically worse on asymptotic insert (O(n) shift vs O(log n) tree insert), practically faster at small N on modern CPUs because the entire working set fits in a handful of cache lines, per-entry allocation overhead is zero, and pointer-chasing is eliminated. The bound that justifies the choice is empirical, not theoretical — it comes from the product/schema shape, not the container's capacity.
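A minimal sketch of the structure, assuming nothing beyond the standard library; the `FlatMap` name and its method set are illustrative, not from the source:

```rust
/// Minimal flat-map sketch over a sorted Vec<(K, V)>.
pub struct FlatMap<K: Ord, V> {
    entries: Vec<(K, V)>,
}

impl<K: Ord, V> FlatMap<K, V> {
    pub fn new() -> Self {
        FlatMap { entries: Vec::new() }
    }

    /// O(log n) search + O(n) shift; an existing key's value is replaced.
    pub fn insert(&mut self, key: K, value: V) {
        match self.entries.binary_search_by(|(k, _)| k.cmp(&key)) {
            Ok(i) => self.entries[i].1 = value,
            Err(i) => self.entries.insert(i, (key, value)),
        }
    }

    /// O(log n) lookup over one contiguous allocation: no pointer chasing.
    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries
            .binary_search_by(|(k, _)| k.cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}
```

The whole map is one `Vec`, so a lookup touches a single contiguous allocation regardless of entry count.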

When it applies

Three conditions must hold simultaneously:

  1. Small N. Average entry count per instance is small (typically tens; certainly < 1,000). At large N, tree structures' asymptotic wins dominate cache-locality gains.
  2. Bounded by product shape, not runtime data. The upper bound comes from something that evolves slowly — a schema, an enum, a config table. Not runtime-variable user input.
  3. Read-dominated or append-dominated hot path. If the workload inserts/removes randomly in the middle, the O(n) shift cost resurfaces. Common fits: deserialization (entries arrive sorted), configuration (loaded once, read many times), fixed-schema metadata.

If any of those fail, use a tree-based map (BTreeMap) or hash map (HashMap) as usual.
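For the deserialization and load-once fits above, a common construction is sort once, dedup, then binary-search from then on. A sketch, assuming arbitrary input order (the `build` helper is illustrative):

```rust
/// Build a flat map from deserialized (key, value) pairs: one stable sort,
/// one dedup pass, then the vec is ready for binary search.
fn build(mut pairs: Vec<(u16, u64)>) -> Vec<(u16, u64)> {
    // Stable sort preserves input order within equal keys.
    pairs.sort_by_key(|&(k, _)| k);
    // dedup_by drops the later element of each equal-key run,
    // so the first occurrence of a key wins.
    pairs.dedup_by(|a, b| a.0 == b.0);
    pairs
}
```

If the serialized format already emits entries in key order, the sort is a near-free O(n) pass.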

Why it wins at small N

  • Cache locality. A 64-entry Vec<(u16, u64)> occupies ~1 KiB in one contiguous block (each tuple pads to 16 bytes for u64 alignment) — ~16 cache lines, often block-prefetched. A BTreeMap<u16, u64> of the same size scatters across per-node allocations with pointer-chased traversal.
  • Zero per-entry allocation. One contiguous allocation for the whole map, vs one allocation per internal tree node.
  • Branch predictability. Binary search over a dense array is branch-predictable and SIMD-friendly; tree traversal is not.
  • Smaller per-entry footprint. A B-tree node carries child pointers, an entry count, and deliberate slack capacity for future inserts; a Vec<(K, V)> entry is exactly sizeof(K) + sizeof(V) plus alignment padding.
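A quick check of the footprint arithmetic above, assuming a typical 64-bit target where u64 is 8-byte aligned; the helper names are illustrative:

```rust
use std::mem::size_of;

// A (u16, u64) tuple has alignment 8, so it pads from 10 to 16 bytes.
fn entry_bytes() -> usize {
    size_of::<(u16, u64)>()
}

// 64 entries sit in one contiguous ~1 KiB allocation:
// about 16 cache lines of 64 bytes each.
fn map_bytes(n: usize) -> usize {
    n * entry_bytes()
}
```

The padding cost is real but fixed; at this scale it is still far cheaper than per-node allocation overhead.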

Canonical instances

Boundary and failure modes

  • Don't use if inserts/deletes are random and frequent. O(n) shift cost returns. Measure the workload, not just the size.
  • Don't extrapolate from microbenchmarks. The cache-locality win depends on real allocator behaviour and CPU cache hierarchy of the target platform; cross-platform (ARM-Linux vs x86-Linux vs macOS) profiles can differ.
  • Resizing cost. The Vec reallocates and copies all entries when it grows past capacity; a BTreeMap allocates per node and never moves existing entries. Pre-size with Vec::with_capacity if the bound is known.
  • No index stability across mutation. Inserting anywhere shifts all subsequent entries, invalidating any held indices; entries in a BTreeMap never move once inserted.
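The pre-sizing advice can be sketched as follows; `EXPECTED_MAX` stands in for a schema-derived bound and is not from the source:

```rust
// Illustrative schema-derived bound on entry count.
const EXPECTED_MAX: usize = 64;

fn build_presized() -> Vec<(u16, u64)> {
    // One up-front allocation sized to the bound.
    let mut entries: Vec<(u16, u64)> = Vec::with_capacity(EXPECTED_MAX);
    let cap_before = entries.capacity();
    for k in 0..EXPECTED_MAX as u16 {
        // Keys pushed in ascending order keep the vec sorted by construction.
        entries.push((k, u64::from(k) * 2));
    }
    // No reallocation occurred: capacity is unchanged after all pushes.
    assert_eq!(entries.capacity(), cap_before);
    entries
}
```

Because construction is append-in-order, the vec is sorted without a separate sort pass and binary search works immediately.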

Relationship to adjacent techniques

  • Composes with concepts/tagged-pointer. Once you have a flat Vec<(K, V)>, tagged pointers collapse it further to Vec<packed_u64> if sizeof(K) + sizeof(V_pointer) ≤ 64 bits.
  • Structural cousin of struct-of-arrays. Both trade abstraction for contiguous-layout performance; columnar is typically chosen for scan workloads, flat-map for lookup workloads.
  • Not the same as a hash table with open addressing — open-addressed hash tables also exploit cache locality but pay collision-resolution cost on lookup; flat maps pay binary-search cost. Open-addressed hashing wins when the key domain is large + unstructured; flat maps win when the key domain is small + structured.
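The tagged-pointer composition above can be sketched for a u16 key packed with a 48-bit value; the bit layout here is illustrative, not from the source:

```rust
/// Pack a u16 key into the top 16 bits of a u64, the value into the low 48.
fn pack(key: u16, val: u64) -> u64 {
    debug_assert!(val < (1 << 48), "value must fit in 48 bits");
    (u64::from(key) << 48) | val
}

/// Recover the (key, value) pair from a packed entry.
fn unpack(packed: u64) -> (u16, u64) {
    ((packed >> 48) as u16, packed & ((1u64 << 48) - 1))
}
```

With this layout a Vec<u64> of packed entries stays sorted by key automatically, since the key occupies the most significant bits.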

Lesson

Asymptotic analysis lies at small N on modern hardware. The theoretically-worse container can be practically-faster when the entire working set fits in cache and allocator overhead dominates algorithmic overhead. Always measure. The bound that matters is the empirical distribution of the workload, not the container's worst-case capacity.

Sibling in the wiki

systems/trie-hard is the same lesson at a different substrate. Cloudflare's systems/pingora-origin clear_internal_headers replaced HashMap<HeaderName> with a custom trie whose entire node graph lives in one contiguous memory region and whose edges are bit-packed into unsigned integers. Cache-locality over asymptotic-complexity again: 1.53 µs → 0.93 µs, 1.28 % of 40,000-core CPU saved (~550 cores). Both cases hinge on a schema-bounded small N and a known workload shape; both shipped as custom structures because general-purpose containers didn't match the target. See sources/2024-09-10-cloudflare-a-good-day-to-trie-hard.
