PATTERN Cited by 1 source
Split environment per AZ for blast-radius reduction¶
Problem¶
A shared configuration-management environment (a Chef environment, a Puppet environment, a SaltStack environment, etc.) makes the entire fleet the blast radius of any bad configuration promotion. Even with staggered per-node runs, newly provisioned nodes, which may arrive by the dozens or hundreds during scale-out events, immediately pick up the latest version from the shared environment, so a bad promotion that isn't caught before scale-out hits every new node in the fleet.
Per-node cron staggering bounds the time axis; it does not bound the newly-provisioned-nodes axis. A scale-out picks up the latest (possibly bad) version from whichever single environment exists.
Solution¶
Split the single environment into N parallel environments keyed by availability zone, and pin each new instance to exactly one environment at boot time via the cloud-init phase of the AMI. A bad promotion to one environment impacts only the AZ(s) mapped to that environment; other AZs continue to provision and operate from their own environment's configuration.
Rollouts become staged across AZs via staggered promotion to each environment (see patterns/release-train-rollout-with-canary for Slack's specific release-train-with-canary variant).
The three required components¶
1. N parallel environments¶
Each environment has identical semantic meaning — all of
prod-1 … prod-N represent "production" — but each is
independently version-pinned. The environments share no
configuration state by construction.
Slack's instantiation: six production environments
prod-1, prod-2, ..., prod-6 (Source:
sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption).
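A minimal sketch of what "N parallel, independently pinned environments" means as data. The shape mirrors a Chef environment object's cookbook_versions field; the environment names come from the source, but the pin values and helper function below are illustrative, not Slack's code.

```python
# Minimal sketch (not Slack's code): N semantically identical production
# environments, each carrying its own independent cookbook version pins.
N_ENVIRONMENTS = 6

def make_environments(n=N_ENVIRONMENTS):
    """Build prod-1 … prod-N, each with its own version-pin map."""
    return {
        f"prod-{i}": {"cookbook_versions": {}}  # pins diverge independently
        for i in range(1, n + 1)
    }

envs = make_environments()
# Promoting to one environment touches no other: the environments
# share no configuration state by construction.
envs["prod-1"]["cookbook_versions"]["base"] = "= 2.0.0"
```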
2. Boot-time mapping from AZ to environment¶
A cloud-init-phase tool in the AMI reads the instance's AZ ID and assigns the node to one of the N environments before the first configuration-management run. If mapping were lazy (on the first Chef run, say), the new instance would already have pulled from whatever default environment was configured — defeating the point.
Slack's instantiation: Poptart Bootstrap is extended to "include logic that inspects the node's AZ ID and assigns it to one of the numbered production Chef environments."
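A sketch of such a boot-time mapping, in the spirit of the Poptart Bootstrap extension described above. The instance-metadata path for the AZ ID is real; the deterministic hash mapping is an assumption, since Slack does not disclose its mapping strategy (see Costs).

```python
# Hedged sketch of a cloud-init-phase AZ-to-environment mapping.
# Slack's actual Poptart Bootstrap logic is not published.
import hashlib
import urllib.request

N_ENVIRONMENTS = 6

def fetch_az_id(imds="http://169.254.169.254"):
    """Read the instance's AZ ID (e.g. 'use1-az1') from instance metadata."""
    url = f"{imds}/latest/meta-data/placement/availability-zone-id"
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read().decode()

def env_for_az(az_id, n=N_ENVIRONMENTS):
    """Deterministically map an AZ ID to one of prod-1 … prod-N.

    Note: with more environments than AZs (6 envs, 3 AZs), the real
    mapping presumably also keys on something per-instance; this sketch
    keys on the AZ ID alone.
    """
    bucket = int(hashlib.sha256(az_id.encode()).hexdigest(), 16) % n
    return f"prod-{bucket + 1}"

# At cloud-init time, before the first chef-client run, the result would
# be written into the node's Chef config (e.g. the environment setting
# in client.rb), so the first run already uses the right version pin.
```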
3. Per-environment independent promotion¶
The cookbook-promotion substrate must support promoting a version to each environment independently. Slack's instantiation: Chef Librarian already exposed a "promote a specific version of an artifact to a given environment" API; the split just multiplies the number of targets.
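The fan-out can be sketched as follows. The "promote a specific version to a given environment" operation mirrors the Chef Librarian capability quoted above; the PromotionClient class and its method names are hypothetical stand-ins.

```python
# Sketch of per-environment independent promotion: one logical
# "promote to production" becomes N independent promotion calls.
class PromotionClient:
    """Records promotions per environment; stands in for Chef Librarian."""
    def __init__(self):
        self.pins = {}  # env name -> {artifact: version}

    def promote(self, artifact, version, environment):
        """Pin one artifact version in exactly one environment."""
        self.pins.setdefault(environment, {})[artifact] = version

client = PromotionClient()
# The environment split multiplies the number of promotion targets:
for env in [f"prod-{i}" for i in range(1, 7)]:
    client.promote("base", "2.0.0", env)
```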
Slack's operational shape¶
- Default in phase 2: six production environments (prod-1 … prod-6).
- Maps 1:N from AZ to environment (Slack uses 6 environments in a typical 3-AZ region; the AZ-to-env ratio is not disclosed, likely 2 environments per AZ).
- prod-1 becomes the "canary" environment, receiving every new cookbook version hourly; prod-2 through prod-6 advance via a release train (see patterns/release-train-rollout-with-canary).
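The ordering above can be sketched as a staged schedule. The canary-first sequence comes from the source; the one-environment-per-stage grouping is an assumption, since the source defers timing details to the release-train pattern.

```python
# Sketch of the release-train ordering: prod-1 takes every new version
# first (the hourly canary), then prod-2 … prod-6 advance in later stages.
def train_stages(n=6):
    """Return the promotion order: canary stage first, then the train."""
    canary = ["prod-1"]
    train = [f"prod-{i}" for i in range(2, n + 1)]
    return [canary] + [[env] for env in train]

# Each inner list is one stage; observation happens between stages, so a
# bad version caught at prod-1 never reaches prod-2 … prod-6.
stages = train_stages()
```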
Why boot-time vs lazy mapping is load-bearing¶
new instance ──▶ ┌──────────────────────┐
     (boot)      │ cloud-init phase     │──── assign env based on AZ ID
                 │ (Poptart Bootstrap)  │
                 └──────────────────────┘
                            │
                            ▼
                  Chef env = prod-3 (say)
                            │
                            ▼
                  first chef-client run uses prod-3's
                  version pin from the start

vs. lazy mapping:

new instance ──▶ first chef-client run ──▶ pulls from default env
                                           (possibly bad version)
                            │
                            ▼
                  (later) assigned to prod-3
The lazy path has a window in which the new instance runs the default environment's cookbook version, which is exactly the scale-out-picks-up-bad-config failure mode.
Benefits¶
- Scale-out safety. Newly provisioned nodes are pinned to their AZ-bucket environment from their very first configuration run.
- Operational blast-radius cap. A bad promotion is confined to the AZ mapped to the target environment.
- Composes with release trains. The per-environment promotion pattern enables staggered rollout across AZs with observation-between-stages.
- Composes with signal-driven triggers. The per-environment signal (see patterns/signal-triggered-fleet-config-apply) means each AZ-bucket applies its version independently; no cross-AZ coordination.
Costs¶
- N× promotion overhead. Each promotion cycle must fan out to N environments. If promotions are sequenced through the N environments, rollout latency grows linearly with N.
- Non-uniform progression. Not all AZs are on the same version at all times, which makes incident-response diagnosis (and inter-AZ debugging) more complex.
- Mapping strategy is a design choice. Round-robin, hash of AZ ID, explicit pin — each has trade-offs. Slack does not disclose which.
- Doesn't provide per-service isolation. The architectural ceiling (explicitly named in the Slack post): "with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale". Per-service environments would require N × M environments. Shipyard is built to provide per-service isolation without the N × M explosion.
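The three mapping strategies named in the Costs list can be sketched side by side. Slack does not disclose which it uses; all three implementations below, including the pin table, are illustrative.

```python
# Illustrative sketches of the three AZ-to-environment mapping strategies.
import hashlib
import itertools

ENVS = [f"prod-{i}" for i in range(1, 7)]

# 1. Round-robin: even spread, but assignment depends on launch order
#    and requires shared state (the cycle position).
_rr = itertools.cycle(ENVS)
def round_robin(_az_id):
    return next(_rr)

# 2. Hash of AZ ID: stable and stateless, but a given AZ always lands in
#    the same environment, so spread depends on the AZ-ID distribution.
def hash_of_az(az_id):
    return ENVS[int(hashlib.sha256(az_id.encode()).hexdigest(), 16) % len(ENVS)]

# 3. Explicit pin: fully controlled, but the table must be maintained by
#    hand (this table is hypothetical).
PIN_TABLE = {"use1-az1": "prod-1", "use1-az2": "prod-3", "use1-az4": "prod-5"}
def explicit_pin(az_id):
    return PIN_TABLE[az_id]
```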
Relation to cell-based architecture¶
This pattern is a specific instantiation of cell-based architecture at the fleet-configuration altitude. The N environments are the cells; the AZ is the cell boundary; Poptart Bootstrap's AZ-to-env mapping is the cell router; the release train's per-env promotion is the per-cell deployment.
Sibling patterns¶
- patterns/cell-based-architecture-for-blast-radius-reduction — the general form.
- patterns/release-train-rollout-with-canary — the cross-AZ rollout strategy that composes with this pattern.
- patterns/staged-rollout — the general phased-deployment pattern; this specialises to per-AZ phasing.
- patterns/progressive-cluster-rollout — a sibling pattern at the cluster-of-services altitude rather than the AZ-of-instances altitude.
Caveats¶
- AZ ≠ fault boundary for logical failures. AWS's AZ boundary bounds hardware failures; it doesn't, on its own, prevent a logical failure (bad cookbook) from impacting the nodes in the AZ. The environment boundary does; AZ is just the axis along which environments are partitioned.
- Requires the cloud-init-phase tool to be robust. If Poptart Bootstrap (or equivalent) fails to map an instance, the instance has no environment and the first Chef run fails, possibly leaving the instance in a stuck state. Slack's visibility mechanism: the tool "posts a success or failure message to Slack."
- Changing the mapping strategy is expensive. Once instances are mapped, re-mapping requires re-launching them.
- The pattern doesn't isolate cross-AZ operations. An operation that affects all AZs simultaneously (e.g., a DNS change, a security-group change) still has full fleet blast-radius.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption
— canonical: Slack's phase-2 Chef design splits the single
prod environment into six AZ-bucketed environments (prod-1 … prod-6) with boot-time assignment via Poptart Bootstrap and release-train rollout across them.
Related¶
- concepts/az-bucketed-environment-split
- concepts/blast-radius
- concepts/cell-based-architecture
- patterns/cell-based-architecture-for-blast-radius-reduction
- patterns/release-train-rollout-with-canary
- patterns/staged-rollout
- patterns/signal-triggered-fleet-config-apply
- systems/chef
- systems/chef-librarian
- systems/chef-summoner
- systems/poptart-bootstrap