
PATTERN Cited by 2 sources

Load-test at scale (before real workloads)

Load-test at scale is the practice of running a synthetic workload on a new platform sized to match the largest real workloads you plan to host there, before those real workloads are migrated in. The goal is to force-discover the platform-sizing problems that only manifest at production cardinality.

The failure mode it prevents

New-platform load tests commonly run "enough to exercise the code path," not "enough to exercise the control-plane at real fan-out." The result: the migration's first production workload becomes the load test, and scaling bugs blow up with a customer on the other end.

Typical scale-only failure modes:

  • Control-plane-component sizing (API server, controller, policy engine, cluster-DNS) is too small and slows down pod scheduling / startup.
  • Metric-cardinality explosion surfaces only at N_services × N_pods.
  • EDS / xDS push volume overwhelms clients under real topology size.
  • Networking limits (conntrack table size, ephemeral-port exhaustion) trip at production QPS.
  • Scheduler scoring becomes a bottleneck at N_nodes × N_pods.

None of these show up in a 10-pod test.
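The cardinality arithmetic behind several of these failure modes is easy to make concrete. A minimal sketch with illustrative numbers (the `series_per_pod` figure is an assumption, not a measurement; the port range is the Linux default `net.ipv4.ip_local_port_range`):

```python
def metric_series(n_services: int, pods_per_service: int, series_per_pod: int) -> int:
    """Active time series the metrics backend must ingest:
    N_services x N_pods x per-pod series."""
    return n_services * pods_per_service * series_per_pod

def ephemeral_ports(port_lo: int = 32768, port_hi: int = 60999) -> int:
    """Outbound connections available per (src IP, dst IP, dst port) tuple,
    using the Linux default net.ipv4.ip_local_port_range."""
    return port_hi - port_lo + 1

print(metric_series(1, 10, 200))    # 2000 -- a 10-pod smoke test; any backend copes
print(metric_series(300, 50, 200))  # 3000000 -- production fan-out; a sizing problem
print(ephemeral_ports())            # 28232 -- ceiling before port exhaustion per tuple
```

The point is not the specific numbers but the shape: the smoke test and the production topology differ by three orders of magnitude, and nothing in the code path reveals that gap.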

The move

  • Define a "Hello, World" service — minimal logic, just enough to exercise the end-to-end platform path (deploy → schedule → health → serve → observe).
  • Scale it up to the pod count of your biggest real service, not to the test-cluster's node capacity.
  • Observe what breaks or slows down — each one is a platform tuning issue to fix before onboarding real workloads.
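The "observe" step needs a pass/fail signal, not eyeballing dashboards. A minimal sketch, assuming you can export per-pod (created, ready) timestamps from the test run (for example from the events API or kube-state-metrics); the baseline comparison is your own pre-migration number:

```python
def startup_p95(events):
    """events: iterable of (created_ts, ready_ts) pairs in seconds.
    Returns the p95 pod-startup latency across the scale-up."""
    latencies = sorted(ready - created for created, ready in events)
    return latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

# 100 pods whose startup latencies happen to be 1..100 seconds:
events = [(0.0, float(i)) for i in range(1, 101)]
assert startup_p95(events) == 96.0  # regress this against your baseline, not a gut feel
```

Tracking the tail rather than the mean matters here: control-plane bottlenecks (admission, scheduling, DNS) show up first as a fat tail on startup latency while the median still looks healthy.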

Figma's instantiation

Scaled a Hello-World service to the same pod count as their largest services pre-migration. Outcome: had to tune the size and scale of core compute services that support the platform. One named example: Kyverno (cluster security assertions). If Kyverno is undersized, new-pod startup slows because every admission check passes through it.

Without this load test, the discovery would have happened when Figma's first real service migrated in — and the slow pod startup would have manifested as service-degradation symptoms rather than a cleanly attributable platform-layer issue.
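The Kyverno effect is simple queueing arithmetic. A back-of-envelope sketch (the throughput figure is an assumed number for illustration, not a measured Kyverno rate):

```python
def admission_tail_delay(burst_pods: int, webhook_rps: float) -> float:
    """Worst-case added startup latency for the last pod in a scale-up burst,
    assuming every pod create passes serially through one admission webhook."""
    return burst_pods / webhook_rps

# Scaling the Hello-World service to 2,000 pods through an undersized
# webhook that sustains 50 checks/s (assumed figure):
print(admission_tail_delay(2000, 50))  # 40.0 -- seconds of tail startup delay
```

The synthetic test makes that 40-second tail visible as "the admission webhook needs more replicas / CPU," which is exactly the kind of platform tuning the pattern is designed to flush out pre-migration.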

Contrast with shadow migration

  • patterns/shadow-migration runs real production inputs through the new system alongside the old to validate correctness-at-scale for data workloads.
  • Load-test-at-scale uses synthetic workloads to validate control-plane and orchestration behavior for compute workloads. The shape of the workload doesn't matter; the cardinality does.

Complementary: large-scale platforms often use both at different phases.

Figma's other note: "We even migrated one of our services over before we had finished building the staging environment, and it turned out to be well worth it; it quickly derisked the end to end ability to effectively run workloads and helped us identify bottlenecks and bugs." Combined with the Hello-World test, this applies the real-data-over-staged-data principle at the migration-validation tier.
