
Hyperscale compute workload

Definition

A hyperscale compute workload is the class of application compute that runs in the datacenters of large-scale internet companies — the workload population served by Meta, Google, AWS, Microsoft, Tencent and similar. Meta's canonical statement:

"Workloads developed by large-scale internet companies running in their datacenters have very different characteristics than those in high performance computing (HPC) or traditional enterprise market segments." (Source: sources/2024-08-05-meta-dcperf-open-source-benchmark-suite)

This is a distinct market segment with distinct characteristics at the microarchitectural level — not just a large quantity of enterprise workloads.

Why it's distinct

Hyperscale workloads differ from HPC + enterprise on several axes that shape hardware + software co-design:

  • Application diversity in one deployment. A single Meta data center runs feed ranking, timeline generation, WhatsApp media processing, ads serving, newsfeed fan-out, Presto SQL, training jobs — each stressing cache / memory-bandwidth / branch-prediction differently. HPC deployments are narrow (a few applications run for weeks). Enterprise deployments are narrow along a different axis.
  • Low IPC + modest frequency dominance. Hyperscale workloads tend to have lower IPC than SPEC CPU workloads because they stall on memory (long pointer chases, large working sets beyond LLC). Operating frequency distribution also differs from synthetic benchmarks. Meta's 2024-08-05 DCPerf post shows production apps + DCPerf clustering together on IPC and frequency, with SPEC CPU apart from them.
  • Multi-tenant, rapidly-increasing core counts. Meta named adding multi-tenancy support to DCPerf to "scale and make use of rapidly increasing core counts on modern server platforms." Co-tenancy + contention for cache / memory-bandwidth / power budget is part of the workload, not a nuisance on top.
  • Production-driven hardware selection. Hyperscalers procure hardware informed by their own workloads, not by public benchmark scores. SPEC CPU ranking does not cleanly translate to hyperscale fleet TCO.
  • Constraint-shaped. Cooling, power density, rack space, ISA diversity (ARM64 + x86), and chiplet trends are all material constraints on platform selection. Compare systems/grand-teton's decision to stay air-cooled at 700 W TDP because the facility cooling infrastructure could not be changed in time.
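The low-IPC point above comes from dependent memory accesses: each load's address is the result of the previous load, so the core cannot overlap memory latency the way it can in a SPEC-style arithmetic kernel. A minimal sketch of that access pattern (illustrative only; in Python the interpreter dominates timing, but the dependency chain is the same one that stalls hardware in a compiled service):

```python
import random

def build_chase(n, seed=0):
    """Build a random cyclic permutation: nxt[i] is the next index to visit.
    Traversing it makes n dependent loads, each address unknown until the
    previous load completes -- the structure behind pointer-chase stalls."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b  # link each slot to the next, forming one big cycle
    return nxt

def chase(nxt, steps):
    """Follow the chain for `steps` hops; returns the final index."""
    i = 0
    for _ in range(steps):
        i = nxt[i]  # next address depends on this load's result
    return i

n = 1 << 10
nxt = build_chase(n)
assert chase(nxt, n) == 0  # single cycle: n hops returns to the start
```

With a working set larger than the last-level cache, each hop in a compiled version of this loop is a full memory-latency stall, which is why such workloads cluster at IPC well below SPEC CPU kernels.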

Implications for benchmarking

  • Standard industry benchmarks (e.g. SPEC CPU 2017) don't capture these workloads' microarchitectural behaviour. Using them as the sole decision signal is an instance of concepts/benchmark-methodology-bias — the benchmark's construction systematically misrepresents the workloads the hardware is actually being procured for.
  • Hyperscalers build their own workload-representative benchmark suites. Meta's answer is DCPerf, open-sourced in 2024 with the ambition of becoming the industry-standard hyperscale-workload benchmark.
  • The relevant property to measure is concepts/benchmark-representativeness at microarchitectural level (IPC distribution, core-frequency distribution, power, cache behaviour) — not just an aggregate score.
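Measuring representativeness as distribution similarity rather than an aggregate score can be sketched very simply: compare the shape of per-sample IPC readings between a candidate benchmark and production. The metric and all data below are hypothetical, purely to illustrate the idea of the clustering Meta shows in the DCPerf post:

```python
def hist_overlap(a, b, bins=10, lo=0.0, hi=4.0):
    """Overlap coefficient of two sample distributions (1.0 = identical
    binned shape, 0.0 = disjoint). A crude stand-in for the distributional
    comparison; sample values are illustrative, not Meta's data."""
    def hist(xs):
        h = [0] * bins
        for x in xs:
            k = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            h[k] += 1
        return [c / len(xs) for c in h]  # normalise to fractions
    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))

# Hypothetical per-sample IPC readings:
production = [0.6, 0.7, 0.8, 0.7, 0.9, 0.6, 0.8]   # memory-stalled services
dcperf_like = [0.7, 0.6, 0.8, 0.8, 0.7, 0.9, 0.6]  # representative suite
spec_like = [2.1, 2.4, 2.6, 2.3, 2.5, 2.2, 2.4]    # compute-dense kernels

# The representative suite overlaps production; the synthetic one doesn't.
assert hist_overlap(production, dcperf_like) > hist_overlap(production, spec_like)
```

The same shape-comparison idea extends to core-frequency, power, and cache-behaviour distributions.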

Implications for hardware reliability + operations

Hyperscale workloads also expose distinct reliability patterns at scale — see concepts/hardware-reliability-at-scale and concepts/gpu-training-failure-modes for the GPU-training-cluster version of the same general point: at hyperscale, per-component failure rates that are ignorable in enterprise become load-bearing operational problems.
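The arithmetic behind that point is worth making concrete. Under a simple independent-failure model (rates here are illustrative, not fleet data), expected failures scale linearly with component count, so a rate that rounds to zero at enterprise scale becomes a daily fact at hyperscale:

```python
def expected_failures_per_day(n_components, annual_failure_rate):
    """Expected component failures per day, assuming independent failures
    at a constant rate (illustrative model, not measured fleet data)."""
    return n_components * annual_failure_rate / 365.0

# A 1% annual failure rate across 100 components: roughly one failure a year.
assert expected_failures_per_day(100, 0.01) < 0.01
# The same rate across a million components: tens of failures every day.
assert expected_failures_per_day(1_000_000, 0.01) > 25
```

At fleet sizes in the millions of components, repair, drain, and replacement have to be routine automated operations rather than incidents.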
