Benchmark representativeness¶
Definition¶
Benchmark representativeness is the measurable property that a benchmark's execution behaviour matches the distribution of execution behaviour of the target workload. It is the inverse of concepts/benchmark-methodology-bias: bias describes the failure mode (benchmark skews away from target), representativeness describes the success property (benchmark tracks target).
Critically, representativeness must be measured at the level at which the benchmark's output is consumed. For hardware-evaluation consumers (CPU vendors, capacity planners), that level is microarchitectural — not aggregate score.
Meta's operationalisation (DCPerf, 2024-08-05)¶
Meta validates DCPerf representativeness on two microarchitectural metrics, publishing comparison graphs of each:
1. Instructions-Per-Cycle (IPC)¶
Compare the IPC distribution exhibited by:
1. Meta production applications (running live).
2. DCPerf benchmarks.
3. SPEC CPU workloads.
Meta's published graph clusters (1) and (2) together, with (3) apart. "Red circles highlight that DCPerf more accurately represents IPC of production applications."
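A comparison like this starts from hardware-counter samples. Here is a minimal sketch, assuming per-interval counter readings (e.g. collected with Linux `perf stat -e instructions,cycles`) are already in hand — the `CounterSample` type and helper names are ours, and the values in the usage note are illustrative, not Meta's data:

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    """One measurement interval of hardware counter readings for a workload."""
    instructions: int
    cycles: int

def ipc(sample: CounterSample) -> float:
    # IPC = retired instructions per CPU cycle over the interval.
    return sample.instructions / sample.cycles

def ipc_distribution(samples: list[CounterSample]) -> list[float]:
    # Sorted per-interval IPC values. The point of the DCPerf validation is
    # to compare these *distributions* across populations (production vs.
    # benchmark vs. SPEC), not a single averaged IPC number.
    return sorted(ipc(s) for s in samples)
```

Usage: collect one list of samples per population, then plot or compare the three resulting distributions — e.g. `ipc_distribution([CounterSample(2_000_000, 1_000_000), CounterSample(1_500_000, 1_000_000)])` yields `[1.5, 2.0]`.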
2. Core frequency¶
Same three-population comparison on average core frequency. Same outcome: (1) and (2) cluster, (3) sits apart. "DCPerf more accurately represents the frequency characteristics of production applications."
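Per-core frequency samples of the kind compared here are available on Linux from cpufreq sysfs. A sketch, with the averaging factored out so it runs on any sample source (the sysfs path is the standard Linux cpufreq location; the function names are ours):

```python
from pathlib import Path

CPUFREQ_GLOB = "cpu[0-9]*/cpufreq/scaling_cur_freq"

def read_core_freqs_khz(sysfs_cpu_dir: str = "/sys/devices/system/cpu") -> list[int]:
    # Current frequency of each online core, in kHz (Linux cpufreq sysfs).
    return [int(p.read_text()) for p in sorted(Path(sysfs_cpu_dir).glob(CPUFREQ_GLOB))]

def average_freq_mhz(freqs_khz: list[int]) -> float:
    # Average core frequency in MHz: the population statistic the
    # three-way comparison is drawn over.
    return sum(freqs_khz) / len(freqs_khz) / 1000.0
```

As with IPC, the interesting object is the distribution of this statistic sampled over time per population, not one snapshot.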
The choice of IPC + frequency is not arbitrary: together they quantify "power and performance characteristics" — the axes hyperscale capacity-planning + vendor co-optimization care about.
Why microarchitectural metrics matter¶
- Aggregate scores (SPECrate, SPECspeed) lose the signal. Two benchmarks with equal aggregate score can have wildly different IPC / frequency / cache-behaviour — and therefore behave very differently under a new CPU architecture. Aggregate-only representativeness is not enough for procurement at hyperscale.
- Benchmark-informed hardware choices assume microarchitectural representativeness. When a CPU vendor tunes SoC power management using DCPerf, the assumption is that DCPerf's frequency profile tracks production's; if it doesn't, the optimization won't translate.
- Emerging architectures amplify the gap. ARM64 vs x86 differ in IPC behaviour on many hyperscale workloads. Chiplet architectures introduce new cache-sharing topologies. A benchmark that's representative-on-x86-monolithic may not be representative-on-ARM-chiplet. Multi-ISA + emerging-topology support is load-bearing for sustained representativeness.
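The first point can be made concrete with toy numbers: two suites with identical geometric-mean aggregate scores but very different IPC profiles (all values invented for illustration):

```python
from statistics import geometric_mean, mean

# Two hypothetical benchmark suites: per-workload (score, ipc) pairs.
suite_a = [(10.0, 2.1), (10.0, 2.3)]  # high-IPC, frontend-bound workloads
suite_b = [(5.0, 0.6), (20.0, 0.7)]   # low-IPC, memory-bound workloads

agg_a = geometric_mean(s for s, _ in suite_a)  # 10.0
agg_b = geometric_mean(s for s, _ in suite_b)  # 10.0 as well
ipc_a = mean(i for _, i in suite_a)            # 2.2
ipc_b = mean(i for _, i in suite_b)            # ~0.65
# Identical aggregate scores, wildly different microarchitectural behaviour:
# a wider frontend helps suite_a; a larger cache helps suite_b. An
# aggregate-only comparison cannot tell a procurement team which is which.
```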
How to achieve representativeness¶
- Anchor each benchmark to a real production workload (patterns/workload-representative-benchmark-from-production).
- Measure at the level your consumer cares about. For CPU-vendor consumers, microarchitectural distributions. For capacity-planning consumers, throughput + tail latency distributions. For application-config consumers (e.g. Figma OpenSearch shard-count), query-latency distributions — patterns/custom-benchmarking-harness is the application-layer expression.
- Validate the match empirically. Not "we think it's representative" — publish the comparison graph.
- Evolve with the target. Workloads change; benchmarks decay. DCPerf is a rolling version stream, not a frozen drop.
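"Validate the match empirically" can be as simple as a two-sample distance between the production distribution and the benchmark distribution for each metric. A stdlib-only sketch of the two-sample Kolmogorov-Smirnov statistic (any acceptance threshold on it would be a project-specific choice, not something the sources publish):

```python
from bisect import bisect_right

def ks_statistic(xs: list[float], ys: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:
        f_x = bisect_right(xs, v) / len(xs)   # empirical CDF of xs at v
        f_y = bisect_right(ys, v) / len(ys)   # empirical CDF of ys at v
        gap = max(gap, abs(f_x - f_y))
    return gap

# 0.0 = identical distributions, 1.0 = fully disjoint. A representative
# benchmark should score low against production on IPC, frequency, or
# whatever metric its consumer cares about.
```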
Seen in¶
- sources/2024-08-05-meta-dcperf-open-source-benchmark-suite — canonical microarchitectural-representativeness-with-IPC-and-frequency-comparison instance. Meta's evidence that DCPerf is representative and SPEC CPU is not (for hyperscale).
- sources/2025-10-14-cloudflare-unpacking-cloudflare-workers-cpu-performance-benchmarks — the bias side of the same coin: Cloudflare's six bias classes are the failure modes an un-representative benchmark surfaces.
- sources/2026-04-21-figma-the-search-for-speed-in-figma — application-layer representativeness via a custom Go harness against the server-side took-latency field.
- sources/2026-04-21-planetscale-benchmarking-postgres — database-vendor-comparison representativeness instance. PlanetScale's methodology disclosure for its Telescope benchmarking harness is explicit that "no single benchmark can capture the performance characteristics of all such databases. Data size, hot:cold ratios, QPS variability, schema structure, indexes, and 100 other factors determine the requirements of your relational database setup. You cannot look at a benchmark and know for sure that your workload will perform the same given all other factors are the same." Instead of claiming representativeness-for-your-workload, the post commits to answering four specific questions (latency, typical OLTP, high-pressure IOPS/caching, price-performance) and leaves workload-specific representativeness to the reader. Canonical wiki framing: a multi-vendor comparative benchmark represents the comparison, not your workload.
Related¶
- concepts/benchmark-methodology-bias — the failure-mode sibling concept; bias is lack of representativeness observed in practice.
- concepts/hyperscale-compute-workload — the workload category DCPerf argues SPEC CPU is non-representative for.
- systems/dcperf — canonical instance.
- systems/spec-cpu — the incumbent benchmark DCPerf argues is non-representative for hyperscale.
- patterns/workload-representative-benchmark-from-production — the design rule.
- patterns/custom-benchmarking-harness — the Figma application-layer sibling.