
Hyperscale compute workload

Definition

A hyperscale compute workload is the class of application compute that runs in the datacenters of large-scale internet companies — the workload population served by Meta, Google, AWS, Microsoft, Tencent and similar. Meta's canonical statement:

"Workloads developed by large-scale internet companies running in their datacenters have very different characteristics than those in high performance computing (HPC) or traditional enterprise market segments." (Source: sources/2024-08-05-meta-dcperf-open-source-benchmark-suite)

This is a distinct market segment with distinct characteristics at the microarchitectural level — not just a large quantity of enterprise workloads.

Why it's distinct

Hyperscale workloads differ from HPC + enterprise on several axes that shape hardware + software co-design:

  • Application diversity in one deployment. A single Meta data center runs feed ranking, timeline generation, WhatsApp media processing, ads serving, newsfeed fan-out, Presto SQL, training jobs — each stressing cache / memory-bandwidth / branch-prediction differently. HPC deployments are narrow (a few applications run for weeks). Enterprise deployments are narrow along a different axis.
  • Low IPC + modest frequency dominance. Hyperscale workloads tend to have lower IPC than SPEC CPU workloads because they stall on memory (long pointer chases, large working sets beyond LLC). Operating frequency distribution also differs from synthetic benchmarks. Meta's 2024-08-05 DCPerf post shows production apps + DCPerf clustering together on IPC and frequency, with SPEC CPU apart from them.
  • Multi-tenant, rapidly-increasing core counts. Meta named adding multi-tenancy support to DCPerf to "scale and make use of rapidly increasing core counts on modern server platforms." Co-tenancy + contention for cache / memory-bandwidth / power budget is part of the workload, not a nuisance on top.
  • Production-driven hardware selection. Hyperscalers procure hardware informed by their own workloads, not by public benchmark scores. SPEC CPU ranking does not cleanly translate to hyperscale fleet TCO.
  • Constraint-shaped. Cooling, power density, rack space, ISA diversity (ARM64 + x86), and chiplet trends are all material constraints on platform selection. Compare systems/grand-teton's decision to stay air-cooled at 700 W TDP because the facility cooling infrastructure could not be changed in time.
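The low-IPC point above comes from dependent memory accesses: each load's address is the result of the previous load, so the core cannot overlap memory latency the way it can in a SPEC-style arithmetic kernel. A minimal sketch of that access pattern (illustrative only; in Python the interpreter dominates timing, but the dependency chain is the same one that stalls hardware in a compiled service):

```python
import random

def build_chase(n, seed=0):
    """Build a random cyclic permutation: nxt[i] is the next index to visit.
    Traversing it makes n dependent loads, each address unknown until the
    previous load completes -- the structure behind pointer-chase stalls."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b  # link each slot to the next, forming one big cycle
    return nxt

def chase(nxt, steps):
    """Follow the chain for `steps` hops; returns the final index."""
    i = 0
    for _ in range(steps):
        i = nxt[i]  # next address depends on this load's result
    return i

n = 1 << 10
nxt = build_chase(n)
assert chase(nxt, n) == 0  # single cycle: n hops returns to the start
```

With a working set larger than the last-level cache, each hop in a compiled version of this loop is a full memory-latency stall, which is why such workloads cluster at IPC well below SPEC CPU kernels.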

Implications for benchmarking

  • Standard industry benchmarks (e.g. SPEC CPU 2017) don't capture these workloads' microarchitectural behaviour. Using them as the sole decision signal is an instance of concepts/benchmark-methodology-bias — the benchmark's construction systematically misrepresents the workloads the hardware is actually being procured for.
  • Hyperscalers build their own workload-representative benchmark suites. Meta's answer is DCPerf, open-sourced in 2024 with the ambition of becoming the industry-standard hyperscale-workload benchmark.
  • The relevant property to measure is concepts/benchmark-representativeness at microarchitectural level (IPC distribution, core-frequency distribution, power, cache behaviour) — not just an aggregate score.
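Measuring representativeness as distribution similarity rather than an aggregate score can be sketched very simply: compare the shape of per-sample IPC readings between a candidate benchmark and production. The metric and all data below are hypothetical, purely to illustrate the idea of the clustering Meta shows in the DCPerf post:

```python
def hist_overlap(a, b, bins=10, lo=0.0, hi=4.0):
    """Overlap coefficient of two sample distributions (1.0 = identical
    binned shape, 0.0 = disjoint). A crude stand-in for the distributional
    comparison; sample values are illustrative, not Meta's data."""
    def hist(xs):
        h = [0] * bins
        for x in xs:
            k = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            h[k] += 1
        return [c / len(xs) for c in h]  # normalise to fractions
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))

# Hypothetical per-sample IPC readings:
production = [0.6, 0.7, 0.8, 0.7, 0.9, 0.6, 0.8]   # memory-stalled services
dcperf_like = [0.7, 0.6, 0.8, 0.8, 0.7, 0.9, 0.6]  # representative suite
spec_like = [2.1, 2.4, 2.6, 2.3, 2.5, 2.2, 2.4]    # compute-dense kernels

# The representative suite overlaps production; the synthetic one doesn't.
assert hist_overlap(production, dcperf_like) > hist_overlap(production, spec_like)
```

The same shape-comparison idea extends to core-frequency, power, and cache-behaviour distributions.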

Implications for hardware reliability + operations

Hyperscale workloads also expose distinct reliability patterns at scale — see concepts/hardware-reliability-at-scale and concepts/gpu-training-failure-modes for the GPU-training-cluster version of the same general point: at hyperscale, per-component failure rates that are ignorable in enterprise become load-bearing operational problems.
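The arithmetic behind that point is worth making concrete. Under a simple independent-failure model (rates here are illustrative, not fleet data), expected failures scale linearly with component count, so a rate that rounds to zero at enterprise scale becomes a daily fact at hyperscale:

```python
def expected_failures_per_day(n_components, annual_failure_rate):
    """Expected component failures per day, assuming independent failures
    at a constant rate (illustrative model, not measured fleet data)."""
    return n_components * annual_failure_rate / 365.0

# A 1% annual failure rate across 100 components: roughly one failure a year.
assert expected_failures_per_day(100, 0.01) < 0.01
# The same rate across a million components: tens of failures every day.
assert expected_failures_per_day(1_000_000, 0.01) > 25
```

At fleet sizes in the millions of components, repair, drain, and replacement have to be routine automated operations rather than incidents.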
