PATTERN
Workload-representative benchmark from production¶
Intent¶
When existing public benchmarks don't represent your production workload shape — and procurement, capacity planning, or vendor co-optimization decisions depend on that shape — build a benchmark suite where each benchmark is anchored to a real production application and validated against that application at the level at which the benchmark's output will be consumed.
Context¶
Every mature engineering org at scale hits the same wall: industry-standard benchmarks (SPEC CPU, SPECjbb, TPC-H, JMH, sysbench, vendor-supplied harnesses) were designed for a different workload population or a different consumer. Using them to guide decisions about your workload is an instance of concepts/benchmark-methodology-bias: the benchmark's shape systematically skews the signal.
The fix is not to abandon benchmarks. The fix is to build a benchmark suite whose shape matches yours — and to validate the match empirically, not assert it.
Mechanism¶
Each benchmark anchors to a real application¶
"Each benchmark within DCPerf is designed by referencing a large application within Meta's production server fleet." (Source: sources/2024-08-05-meta-dcperf-open-source-benchmark-suite)
Not synthetic. Not a public-benchmark derivative. A real large internal application → a benchmark you can run, ship, and version externally without leaking the application itself.
Capture workload characteristics at multiple levels¶
Meta explicitly spans "low-level hardware microarchitecture features to application and library usage profiles" when analysing production workloads, then "captures the important characteristics of these workloads in DCPerf."
Pick the right level for your consumer:
- Hardware / CPU vendor: microarchitectural distributions (IPC, core frequency, cache-miss rate, branch-miss rate, memory-bandwidth consumption).
- Capacity planning: throughput, tail latency, core-count scaling.
- Application tuning: server-side latency fields, per-query cost.
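For the hardware/CPU-vendor axis, the ratios a consumer cares about are simple derivations from hardware counter totals. A minimal sketch of that derivation — the `CounterSample` type and all counter values are illustrative, not from DCPerf:

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    """One aggregated reading of hardware performance counters
    (field names mirror common perf events; values are invented)."""
    instructions: int
    cycles: int
    cache_references: int
    cache_misses: int
    branches: int
    branch_misses: int

def microarch_profile(s: CounterSample) -> dict:
    """Derive the microarchitectural ratios a CPU vendor would consume."""
    return {
        "ipc": s.instructions / s.cycles,
        "cache_miss_rate": s.cache_misses / s.cache_references,
        "branch_miss_rate": s.branch_misses / s.branches,
    }

sample = CounterSample(
    instructions=8_000_000_000, cycles=10_000_000_000,
    cache_references=400_000_000, cache_misses=60_000_000,
    branches=1_500_000_000, branch_misses=30_000_000,
)
print(microarch_profile(sample))
# {'ipc': 0.8, 'cache_miss_rate': 0.15, 'branch_miss_rate': 0.02}
```

Collect the same profile per host across the production fleet and across the benchmark, and you have the two distributions the representativeness check compares.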
Validate representativeness publicly¶
Meta publishes IPC and core-frequency comparison graphs between production, DCPerf, and SPEC CPU. That comparison is the evidence of representativeness. Without it, "workload-representative" is an assertion consumers cannot audit.
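The distribution comparison can be made auditable with even a crude distance metric. A hedged sketch, assuming equal-sized per-host IPC samples — every number here is invented for illustration, not Meta data:

```python
def distribution_gap(production: list[float], benchmark: list[float]) -> float:
    """Crude 1-D earth-mover-style distance: mean absolute gap between
    sorted samples (assumes equal sample counts). Smaller = more
    representative of the production distribution."""
    p, b = sorted(production), sorted(benchmark)
    return sum(abs(x - y) for x, y in zip(p, b)) / len(p)

# Hypothetical per-host IPC samples.
prod_ipc    = [0.7, 0.8, 0.9, 1.0, 1.1]
dcperf_like = [0.72, 0.79, 0.91, 0.98, 1.12]
spec_like   = [1.6, 1.8, 2.0, 2.2, 2.4]   # a compute-dense suite skews high

print(distribution_gap(prod_ipc, dcperf_like))  # small gap
print(distribution_gap(prod_ipc, spec_like))    # large gap
```

A published version of this comparison is what lets a consumer audit the "workload-representative" claim instead of taking it on faith.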
See concepts/benchmark-representativeness for the full property.
Evolve the suite with the target¶
"Over the past few years, we have continuously enhanced these benchmarks to make them compatible with different instruction set architectures, including x86 and ARM." New ISAs, chiplet topologies, multi-tenancy / increasing core counts — the benchmark suite is a living codebase, not a one-shot drop.
Open-source to align an industry¶
DCPerf is shipped on GitHub with an explicit ambition of becoming "an industry standard method to capture important workload characteristics of compute workloads that run in hyperscale datacenter deployments." Opening the source turns a point solution into a coordination tool between hyperscalers, hardware vendors, and academic researchers.
Canonical instance — Meta DCPerf¶
Meta's five internal use cases (from the 2024-08-05 post):
- Data-center deployment configuration choices.
- Early performance projections for capacity planning.
- Identifying performance bugs in hardware and system software.
- Joint platform optimization with hardware-industry collaborators (see patterns/pre-silicon-validation-partnership).
- Deciding which platforms to deploy in Meta data centers.
Each of these is a decision that was previously made with signal from SPEC CPU and vendor-supplied benchmarks; each is now better informed with DCPerf alongside them.
Anti-patterns¶
- "We use SPEC CPU because everyone does." For most enterprise or HPC workloads this is fine. For hyperscale cloud deployments, Meta's IPC/frequency graphs are direct evidence that the baseline systematically misrepresents the workload.
- Aggregate-score-only representativeness. Two benchmarks with equal SPECrate scores can have wildly different IPC distributions. Aggregate match is necessary but not sufficient.
- One-shot benchmark drop. Workloads evolve; hardware evolves. A benchmark that was representative two years ago may not be now. DCPerf is versioned, maintained, and extended.
- Private-only benchmark. Keeping a workload-representative benchmark internal prevents CPU vendors from pre-silicon-tuning against it (see patterns/pre-silicon-validation-partnership) and prevents academic / peer validation.
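The aggregate-score anti-pattern above is easy to demonstrate: a geometric mean hides distribution shape entirely. A toy illustration (both suites and their per-benchmark IPC values are invented):

```python
import math
import statistics

def geomean(xs: list[float]) -> float:
    """Geometric mean, the aggregation behind SPECrate-style scores."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Two hypothetical suites with identical aggregate scores...
suite_a = [1.0, 1.0, 1.0, 1.0]      # uniform IPC across benchmarks
suite_b = [0.25, 0.5, 2.0, 4.0]     # bimodal IPC, same geomean

print(geomean(suite_a), geomean(suite_b))   # both 1.0
# ...but wildly different distributions:
print(statistics.pstdev(suite_a), statistics.pstdev(suite_b))
```

Matching the aggregate tells you nothing about whether the per-benchmark distribution matches production, which is why the representativeness check has to compare distributions, not scores.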
Relationship to adjacent patterns¶
- patterns/custom-benchmarking-harness — the application-layer sibling. Figma's afternoon-of-Go OpenSearch harness is the same pattern at a different altitude: the vendor's benchmark shape doesn't match, so build one that does and measure at the right field.
- patterns/pre-silicon-validation-partnership — the downstream pattern this enables. A workload-representative benchmark is the artifact vendor-collaboration runs against.
- patterns/measurement-driven-micro-optimization — related discipline: both insist measurement shape must match consumption shape.
Seen in¶
- sources/2024-08-05-meta-dcperf-open-source-benchmark-suite — DCPerf: canonical hyperscale / microarchitectural-axis instance. Open-sourced so industry can standardise on it.
Related¶
- systems/dcperf — the pattern's canonical instance.
- systems/spec-cpu — the non-representative baseline DCPerf supplements.
- concepts/benchmark-representativeness — the property this pattern achieves.
- concepts/benchmark-methodology-bias — the failure this pattern avoids.
- concepts/hyperscale-compute-workload — the workload shape the Meta instance targets.
- patterns/custom-benchmarking-harness — application-layer sibling.
- patterns/pre-silicon-validation-partnership — downstream consumer pattern.