PATTERN

Split environment per AZ for blast-radius

Problem

A shared configuration-management environment (a Chef environment, a Puppet env, a SaltStack env, etc.) is the blast-radius target of any bad configuration promotion. Even with staggered per-node runs, newly provisioned nodes — which may arrive by the dozens or hundreds during scale-out events — immediately pick up the latest version from the shared environment, so a bad promotion that isn't caught before scale-out hits every new node in the fleet.

Per-node cron staggering bounds the time axis; it does not bound the newly-provisioned-nodes axis. Scale-out picks up the latest (possibly bad) version from whichever single environment exists.

Solution

Split the single environment into N parallel environments keyed by availability zone, and pin each new instance to exactly one environment at boot time via the cloud-init phase of the AMI. A bad promotion to one environment impacts only the AZ(s) mapped to that environment; other AZs continue to provision and operate from their own environment's configuration.

Rollouts become staged across AZs via staggered promotion to each environment (see patterns/release-train-rollout-with-canary for Slack's specific release-train-with-canary variant).

The three required components

1. N parallel environments

Each environment has identical semantic meaning (all of prod-1 through prod-N represent "production"), but each is independently version-pinned. The environments share no configuration state by construction.

Slack's instantiation: six production environments prod-1, prod-2, ..., prod-6 (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption).
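The shape of this component can be sketched as data: N records with the same meaning but independent version pins. The `Environment` type and the seed pin below are illustrative, not Slack's schema.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    name: str          # all of prod-1..prod-N mean "production"
    version_pin: str   # promoted independently; no shared state

# Slack's phase-2 shape: six parallel production environments.
envs = [Environment(name=f"prod-{i}", version_pin="0.0.0") for i in range(1, 7)]
```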

2. Boot-time mapping from AZ to environment

A cloud-init-phase tool in the AMI reads the instance's AZ ID and assigns the node to one of the N environments before the first configuration-management run. If mapping were lazy (on the first Chef run, say), the new instance would already have pulled from whatever default environment was configured — defeating the point.

Slack's instantiation: Poptart Bootstrap is extended to "include logic that inspects the node's AZ ID and assigns it to one of the numbered production Chef environments."
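A hedged sketch of the boot-time assignment step. The IMDSv2 token exchange and the AZ-ID metadata path are standard AWS mechanics; the mapping rule and the config shape are illustrative assumptions, not Poptart Bootstrap's actual logic.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"

def fetch_az_id() -> str:
    """Read the instance's AZ ID (e.g. 'use1-az2') via IMDSv2."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_req).read().decode()
    az_req = urllib.request.Request(
        f"{IMDS}/meta-data/placement/availability-zone-id",
        headers={"X-aws-ec2-metadata-token": token})
    return urllib.request.urlopen(az_req).read().decode()

def first_boot_config(az_id: str, n_envs: int = 6) -> dict:
    """Pin the node's Chef environment before the first chef-client run.

    The bucketing rule (numeric suffix of the AZ ID, mod N) is an
    illustrative assumption; Slack does not disclose theirs.
    """
    bucket = int(az_id.rsplit("az", 1)[-1]) % n_envs
    return {"chef_environment": f"prod-{bucket + 1}"}
```

The essential property is only that `first_boot_config` runs during cloud-init, before the first chef-client invocation, so the pin exists before any cookbook is pulled.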

3. Per-environment independent promotion

The cookbook-promotion substrate must support promoting a version to each environment independently. Slack's instantiation: Chef Librarian already exposed a "promote a specific version of an artifact to a given environment" API; the split just multiplies the number of targets.
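Assuming a promote-version-to-environment primitive like the one the post attributes to Chef Librarian (the callable here is a stand-in, not its real API), the multiplication of targets is just a loop over independent environments:

```python
def promote_to_all(promote, artifact: str, version: str, envs: list[str]) -> list[str]:
    """Fan one promotion out to N independent environment targets.

    `promote` is a stand-in for a "promote a specific version of an
    artifact to a given environment" API; each call pins one environment
    without touching the others.
    """
    promoted = []
    for env in envs:
        promote(artifact, version, env)
        promoted.append(env)
    return promoted
```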

Slack's operational shape

  • Default in phase 2: 6 production environments (prod-1 through prod-6).
  • One-to-many mapping from AZ to environment (Slack uses 6 environments in a typical 3-AZ region; the exact AZ-to-env ratio is not disclosed, likely 2 environments per AZ).
  • prod-1 becomes the "canary" environment receiving every new cookbook version hourly.
  • prod-2 through prod-6 advance via a release train (see patterns/release-train-rollout-with-canary).
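The train above can be sketched as a loop with an observation gate between stages. The `promote` and `observe` hooks are illustrative stand-ins, and stopping the train on a bad signal is an assumption about how such a train would typically halt, not a disclosed detail.

```python
def run_release_train(promote, observe, version: str, n_envs: int = 6) -> list[str]:
    """Promote a version through prod-1..prod-N one environment at a time.

    prod-1 is the canary stage; each later stage only runs if the
    observation gate on the previous environment passed.
    """
    order = [f"prod-{i}" for i in range(1, n_envs + 1)]
    for env in order:
        promote(version, env)
        if not observe(env):                  # stop the train on a bad signal
            return order[: order.index(env) + 1]
    return order
```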

Why boot-time vs lazy mapping is load-bearing

                  ┌───────────────────┐
new instance ──▶  │  cloud-init       │
     boot         │  (Poptart Boot.)  │──── assign env based on AZ
                  └───────────────────┘              │
                                                     ▼
                                        Chef env = prod-3 (say);
                                        first chef-client run uses prod-3's
                                        version pin from the start

vs. lazy mapping:

new instance ──▶ first chef-client run ──▶ pulls from default env
                                            (possibly bad version)
                                     (later) assigned to prod-3

The lazy path has a window where the new instance runs cookbook-from-default-env, which is exactly the scale-out-picks-up-bad-config failure mode.

Benefits

  • Scale-out safety. Newly provisioned nodes are inside the AZ-bucket boundary from their very first boot.
  • Operational blast-radius cap. A bad promotion is confined to the AZ mapped to the target environment.
  • Composes with release trains. The per-environment promotion pattern enables staggered rollout across AZs with observation-between-stages.
  • Composes with signal-driven triggers. The per-environment signal (see patterns/signal-triggered-fleet-config-apply) means each AZ-bucket applies its version independently; no cross-AZ coordination.

Costs

  • N× promotion overhead. Each promotion cycle must fan out to N environments. If promotions are sequenced through the N, rollout latency grows linearly with N.
  • Non-uniform progression. Not all AZs are on the same version at all times, which makes incident-response diagnosis (and inter-AZ debugging) more complex.
  • Mapping strategy is a design choice. Round-robin, hash of AZ ID, explicit pin — each has trade-offs. Slack does not disclose which.
  • Doesn't provide per-service isolation. The architectural ceiling (explicitly named in the Slack post): "with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale". Per-service environments would require N × M environments. Shipyard is built to provide per-service isolation without the N × M explosion.
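The three mapping strategies named in the bullet above, sketched side by side (which one Slack uses is not disclosed). All are deterministic per AZ; they differ in how evenly they spread AZs across environments and how easy re-mapping is.

```python
import hashlib

ENVS = [f"prod-{i}" for i in range(1, 7)]

def round_robin(az_ids: list[str]) -> dict[str, str]:
    """Assign AZs to environments in enumeration order.

    Even spread, but the assignment shifts if the AZ list is reordered.
    """
    return {az: ENVS[i % len(ENVS)] for i, az in enumerate(az_ids)}

def hash_of_az(az_id: str) -> str:
    """Stable assignment independent of enumeration order; may bucket unevenly."""
    digest = hashlib.sha256(az_id.encode()).hexdigest()
    return ENVS[int(digest, 16) % len(ENVS)]

# Explicit pin: operator-chosen, fully controllable, manual to maintain.
EXPLICIT_PIN = {"use1-az1": "prod-1", "use1-az2": "prod-3"}
```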

Relation to cell-based architecture

This pattern is a specific instantiation of cell-based architecture at the fleet-configuration altitude. The N environments are the cells; the AZ is the cell boundary; Poptart Bootstrap's AZ-to-env mapping is the cell router; the release train's per-env promotion is the per-cell deployment.

Caveats

  • AZ ≠ fault boundary for logical failures. AWS's AZ boundary bounds hardware failures; it doesn't, on its own, prevent a logical failure (bad cookbook) from impacting the nodes in the AZ. The environment boundary does; AZ is just the axis along which environments are partitioned.
  • Requires the cloud-init-phase tool to be robust. If Poptart Bootstrap (or equivalent) fails to map an instance, the instance has no environment and the first Chef run fails, possibly leaving the instance in a stuck state. Slack's mitigation is observability rather than automated recovery: the tool "posts a success or failure message to Slack."
  • Changing the mapping strategy is expensive. Once instances are mapped, re-mapping requires re-launching them.
  • The pattern doesn't isolate cross-AZ operations. An operation that affects all AZs simultaneously (e.g., a DNS change, a security-group change) still has full fleet blast-radius.
