Reserved-host repro environment

Reserve a small fleet of hosts, taint them so normal workloads skip them, run synthetic but realistic load with a constant resource footprint, and attach heavyweight profiling that you cannot afford to run fleet-wide. This is the canonical way to turn a sporadic, low-repro-rate production incident into a one-overnight diagnosis exercise.

Pattern

Three moves:

  1. Reserve. Use an orchestrator-level mechanism to guarantee that no normal workload lands on the hosts. On Kubernetes this is taints with NoExecute or NoSchedule effects; only your debug jobs carry the matching toleration (see the sketch after this list). On ECS this is custom instance attributes + task-definition constraints. On raw EC2 it's simply pulling instances out of the Auto Scaling group.
  2. Repro load. Run a synthetic workload that approximates the real one with a constant resource footprint. Pinterest "repurposed our in-house Hyper-parameter tuning to orchestrate identical model training across reserved machines, allowing each training run's resource footprint to remain fairly constant." (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks) Constancy matters: if the load itself fluctuates, you can't tell whether a change in the profile is the bug or just input noise.
  3. Heavy profiling attached. Run the profiler chain you can't run fleet-wide — continuous perf record generating GBs of profile data per host per day, eBPF probes with high overhead, kernel tracepoints, etc. On reserved hosts the performance hit is irrelevant because there is no customer traffic to protect.
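
A minimal sketch of the Kubernetes flavour of move 1, assuming hypothetical node names, a `debug=repro` taint key/value, a `pool=repro` label, and a placeholder image; none of these details come from the source. The debug job carries the matching toleration (here attached via `kubectl run --overrides`), so it is the only workload that can land on the tainted hosts.

```bash
# Sketch only: node names, the debug=repro taint, the pool label, and the
# image are placeholders, not details from the Pinterest writeup.

# 1. Reserve: taint the hosts so normal workloads are evicted / kept off.
for node in repro-node-1 repro-node-2 repro-node-3; do
  kubectl taint nodes "$node" debug=repro:NoExecute
  kubectl label nodes "$node" pool=repro
done

# 2. Only the debug job tolerates the taint (and pins itself to the pool).
kubectl run repro-load \
  --image=registry.example.com/synthetic-trainer:latest \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "nodeSelector": {"pool": "repro"},
      "tolerations": [
        {"key": "debug", "operator": "Equal", "value": "repro", "effect": "NoExecute"}
      ]
    }
  }'
```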

When to use it

  • Sporadic incidents where a random production sample won't catch the event (temporal profiling already says "profile continuously"; this pattern says "plus on a closed fleet where you can afford to").
  • Tail-of-distribution bugs that only occur under specific load shapes you need to reproduce on demand.
  • Invasive instrumentation trials — patched kernels, custom kernel modules, bleeding-edge profilers that might crash the host.

When it fails to reproduce

The uncomfortable failure mode: the bug is environment-specific in a way your repro environment didn't inherit. Pinterest's ENA resets did reproduce on reserved hosts because the root cause (crash-looping ecs-agent from the Deep Learning AMI) was a per-host state accumulation, not a cross-host interaction. If the bug had depended on network topology or multi-tenant behaviour, the reserved-host repro would have looked clean while production kept breaking.

Counter-moves:

  • Use the same base image as production (Pinterest did).
  • Leave the host up for the incubation period. Pinterest's reset recurrence needed ~1 week of uptime to accumulate enough zombie memcgs; a freshly rebooted reserved host would have been clean. Lesson: either let reserved hosts age naturally (and verify the state is actually building up; see the sketch after this list), or artificially pre-seed the state.
  • Start one reserved host from a cold snapshot + one from a hot snapshot when you don't yet know whether state accumulation is the variable.
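
If you go the age-naturally route, it helps to confirm the suspect state is accumulating rather than waiting a week on faith. A minimal sketch, assuming cgroup v2 (where the root `cgroup.stat` exposes `nr_dying_descendants`, i.e. deleted-but-not-freed cgroups) and a hypothetical log path; the Pinterest hosts are not known to have used this exact signal.

```bash
# Sketch only: verify that per-host state is actually accumulating while a
# reserved host ages. Assumes cgroup v2; on cgroup v1 hosts a different
# signal would be needed. Log path and sampling interval are placeholders.
log=/var/log/reserved-host-aging.log
while true; do
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  live=$(awk '$1 == "nr_descendants" {print $2}' /sys/fs/cgroup/cgroup.stat)
  dying=$(awk '$1 == "nr_dying_descendants" {print $2}' /sys/fs/cgroup/cgroup.stat)
  echo "$ts live_cgroups=$live dying_cgroups=$dying" >> "$log"
  sleep 3600   # one sample per hour is plenty for week-scale incubation
done
```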

Composed with continuous profiling

Pinterest stacked this pattern with patterns/continuous-perf-record-for-time-travel: the reserved hosts ran the 12-hour continuous perf-record bash loop overnight. Either pattern alone is weaker — reserved hosts without continuous profiling give you nothing to analyse post-incident; continuous profiling without reserved hosts means you can't afford the per-host storage cost fleet-wide. Together they turn 3-month investigations into one-overnight ones.
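
A minimal sketch of what such a loop can look like. The only details taken from the source are "perf in 2 minute increments" and the overnight/12-hour window; the output directory, sampling frequency, and retention are placeholders, and this should only run on reserved hosts where the overhead and GB-scale output are acceptable.

```bash
# Sketch only, modeled on the quoted description, not Pinterest's script.
out=/data/perf-profiles
mkdir -p "$out"
while true; do
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  # System-wide, 99 Hz, with call graphs; each file covers a 2-minute window,
  # so any incident timestamp maps to exactly one profile to open later.
  perf record -a -F 99 -g -o "$out/perf-$ts.data" -- sleep 120
  # Crude retention (placeholder window) so days of aging don't fill the disk.
  find "$out" -name 'perf-*.data' -mmin +720 -delete
done
```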

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical production usage. "Reserved a small number of machines (via Kubernetes taints) for analysis. Kicked off a series of training jobs in parallel on these machines. Kicked off a script that ran perf in 2 minute increments." The first overnight run caught an ENA reset with full perf data available for post-hoc Flamescope time-travel — the step that broke the incident open.