PATTERN
Load-test at scale (before real workloads)¶
Load-testing at scale is the practice of running a synthetic workload on a new platform, sized to match the largest real workloads you plan to host there, before those real workloads are migrated in. The goal is to force-discover the platform-sizing problems that only manifest at production cardinality.
The failure mode it prevents¶
New-platform load tests commonly run "enough to exercise the code path," not "enough to exercise the control-plane at real fan-out." The result: the migration's first production workload becomes the load test, and scaling bugs blow up with a customer on the other end.
Typical scale-only failure modes:
- Control-plane components (API server, controllers, policy engine, cluster DNS) are undersized, slowing pod scheduling and startup.
- Metric-cardinality explosion surfaces only at N_services × N_pods.
- EDS / xDS push volume overwhelms clients under real topology size.
- Networking limits (conntrack table size, ephemeral-port exhaustion) trip at production QPS.
- Scheduler scoring becomes a bottleneck at N_nodes × N_pods.
None of these show up in a 10-pod test.
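The cardinality point is easy to make concrete with back-of-envelope arithmetic (all numbers below are hypothetical, not from the source):

```python
# Back-of-envelope: active time series grow multiplicatively with pod count.
# metrics_per_pod and label_combos are illustrative assumptions.

def series_count(n_pods: int, metrics_per_pod: int = 200, label_combos: int = 5) -> int:
    """Rough count of active time series emitted by one service's pods."""
    return n_pods * metrics_per_pod * label_combos

print(series_count(10))    # 10000 — a 10-pod smoke test; any metrics backend is fine
print(series_count(3000))  # 3000000 — production pod count; ingestion/cardinality limits trip here
```

Same code path both times; only the multiplier changes, which is exactly why the small test passes.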
The move¶
- Define a "Hello, World" service — minimal logic, just enough to exercise the end-to-end platform path (deploy → schedule → health → serve → observe).
- Scale it up to the pod count of your biggest real service, not to the test-cluster's node capacity.
- Observe what breaks or slows down — each one is a platform tuning issue to fix before onboarding real workloads.
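In Kubernetes terms, the move can be sketched as a single trivial Deployment whose replica count is copied from the largest real workload, not from what the test cluster comfortably holds (names, image, and replica count below are illustrative, not from the source):

```yaml
# Hypothetical hello-world Deployment; replicas matches the biggest
# real service's pod count, not the test cluster's capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 3000            # pod count of the largest real service
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: hello
          image: nginxdemos/hello   # any trivial HTTP server works
          readinessProbe:           # exercises the health-check path
            httpGet:
              path: /
              port: 80
```

While it scales, watch scheduling latency, admission latency, DNS, and the metrics pipeline; each slowdown is a platform tuning item.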
Figma's instantiation¶
Figma scaled a Hello-World service to the same pod count as its largest services pre-migration. Outcome: they had to tune the size and scale of the core compute services that support the platform. One named example: Kyverno (cluster security assertions). If Kyverno is undersized, new-pod startup slows, because every admission check passes through it.
Without this load test, the discovery would have happened when Figma's first real service migrated in — and the slow pod startup would have manifested as service-degradation symptoms rather than a cleanly-attributable platform-layer issue.
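The Kyverno effect reduces to simple queueing arithmetic (throughput numbers are hypothetical): if every pod creation passes through an admission webhook, webhook capacity sets a floor on rollout time once pod count exceeds what a small test ever reaches.

```python
# Hypothetical arithmetic: admission-webhook throughput bounds rollout speed.
# webhook_rps is an assumed figure, not Figma's measured number.

def admission_floor_seconds(n_pods: int, webhook_rps: float) -> float:
    """Lower bound on time to admit n_pods when every pod creation
    passes through a webhook handling webhook_rps reviews per second."""
    return n_pods / webhook_rps

print(admission_floor_seconds(10, webhook_rps=50))    # 0.2 — invisible in a small test
print(admission_floor_seconds(3000, webhook_rps=50))  # 60.0 — the webhook becomes the rollout clock
```

The bug isn't in the admission logic; it's in the sizing, which is why only the at-scale run finds it.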
Contrast with shadow migration¶
- patterns/shadow-migration runs real production inputs through the new system alongside the old to validate correctness-at-scale for data workloads.
- Load-test-at-scale uses synthetic workloads to validate control-plane and orchestration behavior for compute workloads. The shape of the workload doesn't matter; the cardinality does.
Complementary: large-scale platforms often use both at different phases.
Related practice: migrate one real service to prod before staging is fully built¶
Figma's other note: "We even migrated one of our services over before we had finished building the staging environment, and it turned out to be well worth it; it quickly derisked the end to end ability to effectively run workloads and helped us identify bottlenecks and bugs." Combined with the Hello-World test, this is a real data over staged data principle at the migration-validation tier.
Seen in¶
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — Hello-World scaled to largest-service-pod-count; Kyverno sizing regression surfaced this way.
- sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Dropbox load-tested the Robinhood PID-control LB at production fanout scale before rolling it out; similar "platform-sizing first, tenants second" ordering.
Related¶
- patterns/scoped-migration-with-fast-follows — load-testing is one of the migration-execution disciplines this pattern depends on
- patterns/shadow-migration — correctness-validation counterpart