Benchmark representativeness¶
Definition¶
Benchmark representativeness is the measurable property that a benchmark's execution behaviour matches the distribution of execution behaviour of the target workload. It is the inverse of concepts/benchmark-methodology-bias: bias describes the failure mode (benchmark skews away from target), representativeness describes the success property (benchmark tracks target).
Critically, representativeness must be measured at the level at which the benchmark's output is consumed. For hardware-evaluation consumers (CPU vendors, capacity planners), that level is microarchitectural — not aggregate score.
Meta's operationalisation (DCPerf, 2024-08-05)¶
Meta validates DCPerf representativeness on two microarchitectural metrics, publishing comparison graphs of each:
1. Instructions-Per-Cycle (IPC)¶
Compare the IPC distribution exhibited by:
1. Meta production applications (running live).
2. DCPerf benchmarks.
3. SPEC CPU workloads.
Meta's published graph clusters (1) and (2) together, with (3) apart. "Red circles highlight that DCPerf more accurately represents IPC of production applications."
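A comparison like this starts from hardware-counter samples. Here is a minimal sketch, assuming per-interval counter readings (e.g. collected with Linux `perf stat -e instructions,cycles`) are already in hand — the `CounterSample` type and helper names are ours, and the values in the usage note are illustrative, not Meta's data:

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    """One measurement interval of hardware counter readings for a workload."""
    instructions: int
    cycles: int

def ipc(sample: CounterSample) -> float:
    # IPC = retired instructions per CPU cycle over the interval.
    return sample.instructions / sample.cycles

def ipc_distribution(samples: list[CounterSample]) -> list[float]:
    # Sorted per-interval IPC values. The point of the DCPerf validation is
    # to compare these *distributions* across populations (production vs.
    # benchmark vs. SPEC), not a single averaged IPC number.
    return sorted(ipc(s) for s in samples)
```

Usage: collect one list of samples per population, then plot or compare the three resulting distributions — e.g. `ipc_distribution([CounterSample(2_000_000, 1_000_000), CounterSample(1_500_000, 1_000_000)])` yields `[1.5, 2.0]`.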
2. Core frequency¶
Same three-population comparison on average core frequency. Same outcome: (1) and (2) cluster, (3) sits apart. "DCPerf more accurately represents the frequency characteristics of production applications."
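Per-core frequency samples of the kind compared here are available on Linux from cpufreq sysfs. A sketch, with the averaging factored out so it runs on any sample source (the sysfs path is the standard Linux cpufreq location; the function names are ours):

```python
from pathlib import Path

CPUFREQ_GLOB = "cpu[0-9]*/cpufreq/scaling_cur_freq"

def read_core_freqs_khz(sysfs_cpu_dir: str = "/sys/devices/system/cpu") -> list[int]:
    # Current frequency of each online core, in kHz (Linux cpufreq sysfs).
    return [int(p.read_text()) for p in sorted(Path(sysfs_cpu_dir).glob(CPUFREQ_GLOB))]

def average_freq_mhz(freqs_khz: list[int]) -> float:
    # Average core frequency in MHz: the population statistic the
    # three-way comparison is drawn over.
    return sum(freqs_khz) / len(freqs_khz) / 1000.0
```

As with IPC, the interesting object is the distribution of this statistic sampled over time per population, not one snapshot.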
The choice of IPC + frequency is not arbitrary: together they quantify "power and performance characteristics" — the axes hyperscale capacity-planning + vendor co-optimization care about.
Why microarchitectural metrics matter¶
- Aggregate scores (SPECrate, SPECspeed) lose the signal. Two benchmarks with equal aggregate score can have wildly different IPC / frequency / cache-behaviour — and therefore behave very differently under a new CPU architecture. Aggregate-only representativeness is not enough for procurement at hyperscale.
- Benchmark-informed hardware choices assume microarchitectural representativeness. When a CPU vendor tunes SoC power management using DCPerf, the assumption is that DCPerf's frequency profile tracks production's; if it doesn't, the optimization won't translate.
- Emerging architectures amplify the gap. ARM64 vs x86 differ in IPC behaviour on many hyperscale workloads. Chiplet architectures introduce new cache-sharing topologies. A benchmark that's representative-on-x86-monolithic may not be representative-on-ARM-chiplet. Multi-ISA + emerging-topology support is load-bearing for sustained representativeness.
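The first point can be made concrete with toy numbers: two suites with identical geometric-mean aggregate scores but very different IPC profiles (all values invented for illustration):

```python
from statistics import geometric_mean, mean

# Two hypothetical benchmark suites: per-workload (score, ipc) pairs.
suite_a = [(10.0, 2.1), (10.0, 2.3)]  # high-IPC, frontend-bound workloads
suite_b = [(5.0, 0.6), (20.0, 0.7)]   # low-IPC, memory-bound workloads

agg_a = geometric_mean(s for s, _ in suite_a)  # 10.0
agg_b = geometric_mean(s for s, _ in suite_b)  # 10.0 as well
ipc_a = mean(i for _, i in suite_a)            # 2.2
ipc_b = mean(i for _, i in suite_b)            # ~0.65
# Identical aggregate scores, wildly different microarchitectural behaviour:
# a wider frontend helps suite_a; a larger cache helps suite_b. An
# aggregate-only comparison cannot tell a procurement team which is which.
```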
How to achieve representativeness¶
- Anchor each benchmark to a real production workload (patterns/workload-representative-benchmark-from-production).
- Measure at the level your consumer cares about. For CPU-vendor consumers, microarchitectural distributions. For capacity-planning consumers, throughput + tail latency distributions. For application-config consumers (e.g. Figma OpenSearch shard-count), query-latency distributions — patterns/custom-benchmarking-harness is the application-layer expression.
- Validate the match empirically. Not "we think it's representative" — publish the comparison graph.
- Evolve with the target. Workloads change; benchmarks decay. DCPerf is a rolling version stream, not a frozen drop.
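"Validate the match empirically" can be as simple as a two-sample distance between the production distribution and the benchmark distribution for each metric. A stdlib-only sketch of the two-sample Kolmogorov-Smirnov statistic (any acceptance threshold on it would be a project-specific choice, not something the sources publish):

```python
from bisect import bisect_right

def ks_statistic(xs: list[float], ys: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:
        f_x = bisect_right(xs, v) / len(xs)   # empirical CDF of xs at v
        f_y = bisect_right(ys, v) / len(ys)   # empirical CDF of ys at v
        gap = max(gap, abs(f_x - f_y))
    return gap

# 0.0 = identical distributions, 1.0 = fully disjoint. A representative
# benchmark should score low against production on IPC, frequency, or
# whatever metric its consumer cares about.
```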
Seen in¶
- sources/2024-08-05-meta-dcperf-open-source-benchmark-suite — canonical microarchitectural-representativeness-with-IPC-and-frequency-comparison instance. Meta's evidence that DCPerf is representative and SPEC CPU is not (for hyperscale).
- sources/2025-10-14-cloudflare-unpacking-cloudflare-workers-cpu-performance-benchmarks — the bias side of the same coin: Cloudflare's six bias classes are the failure modes an un-representative benchmark surfaces.
- sources/2026-04-21-figma-the-search-for-speed-in-figma — application-layer representativeness via a custom Go harness against the server-side took-latency field.
- sources/2026-04-21-planetscale-benchmarking-postgres — database-vendor-comparison representativeness instance. PlanetScale's methodology disclosure for its Telescope benchmarking harness is explicit that "no single benchmark can capture the performance characteristics of all such databases. Data size, hot:cold ratios, QPS variability, schema structure, indexes, and 100 other factors determine the requirements of your relational database setup. You cannot look at a benchmark and know for sure that your workload will perform the same given all other factors are the same." Instead of claiming representativeness-for-your-workload, the post commits to answering four specific questions (latency, typical OLTP, high-pressure IOPS/caching, price-performance) and leaves workload-specific representativeness to the reader. Canonical wiki framing: a multi-vendor comparative benchmark represents the comparison, not your workload.
Related¶
- concepts/benchmark-methodology-bias — the failure-mode sibling concept; bias is lack of representativeness observed in practice.
- concepts/hyperscale-compute-workload — the workload category DCPerf argues SPEC CPU is non-representative for.
- systems/dcperf — canonical instance.
- systems/spec-cpu — the incumbent benchmark DCPerf argues is non-representative for hyperscale.
- patterns/workload-representative-benchmark-from-production — the design rule.
- patterns/custom-benchmarking-harness — the Figma application-layer sibling.