CONCEPT Cited by 1 source

AZ-bucketed environment split

Definition

AZ-bucketed environment split is the engineering choice of partitioning a single fleet-configuration environment (e.g., a Chef prod environment) into N parallel environments keyed by availability zone, and pinning each node to exactly one bucket at boot time. A bad configuration promotion is contained to the AZ-bucket it reached; the other AZs continue to provision and operate on their own bucket's configuration.

The canonical instance is Slack's 2025-10-23 split of prod into prod-1 through prod-6, with Poptart Bootstrap (Slack's cloud-init-phase AMI tool) deciding the AZ-to-environment assignment at instance boot (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption).

The load-bearing insight: new-node scale-out is the worst-case axis

Before the split, Slack staggered the per-node Chef-run crons across AZs, so that a bad change would reach existing nodes in only a subset of AZs at any given minute. Staggering does not protect new nodes, however. In the post's words: "even if Chef wasn't running everywhere at once, any newly provisioned nodes would immediately pick up the latest (possibly bad) changes from that shared environment. This became a significant reliability risk, especially during large scale-out events, where dozens or hundreds of nodes could start up with a broken configuration."

AZ-bucketed environment split closes the gap by moving the blast-radius boundary upstream of the provisioning path. New nodes outside the affected bucket never see the bad configuration, because their own environment hasn't been promoted to it yet.
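To make the scale-out argument concrete, here is a minimal sketch that counts how many newly provisioned nodes would boot with a bad version under a shared environment versus a bucketed one. The AZ names, node counts, and the `nodes_exposed` helper are hypothetical and only illustrate the reasoning; none of them come from the Slack post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    az: str


def nodes_exposed(new_nodes, env_for_az, promoted_envs):
    """Return the new nodes whose environment already carries the bad version."""
    return [n for n in new_nodes if env_for_az[n.az] in promoted_envs]


# Hypothetical scale-out event: 100 new nodes in each of three AZs.
azs = ["az-a", "az-b", "az-c"]
new_nodes = [Node(f"{az}-node-{i}", az) for az in azs for i in range(100)]

# Before the split: every AZ pulls from the single shared environment.
shared = {az: "prod" for az in azs}

# After the split: each AZ pulls from its own bucket (assumed mapping).
bucketed = {"az-a": "prod-1", "az-b": "prod-2", "az-c": "prod-3"}

# The bad version has been promoted to "prod" (shared) or only to "prod-1".
exposed_shared = nodes_exposed(new_nodes, shared, {"prod"})
exposed_bucketed = nodes_exposed(new_nodes, bucketed, {"prod-1"})

print(len(exposed_shared), len(exposed_bucketed))  # 300 100
```

Under these assumptions the shared environment exposes every new node, while the bucketed layout exposes only the nodes in the one AZ whose bucket was promoted.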

The three required components

  1. N parallel environments with identical semantic meaning (e.g., prod-1 through prod-N all represent "production"). They share no configuration state by construction; version-pins are per-environment.
  2. Boot-time mapping. A mechanism that runs before the first configuration-management pass and assigns the node to exactly one environment based on its AZ. If the mapping were lazy, the node would pull from the shared environment before being bucketed — defeating the purpose.
  3. Separate promotion per environment. The promotion substrate (Slack's Chef Librarian) can promote a version to each environment independently. This lets the rollout strategy choose how to stagger across buckets — see patterns/release-train-rollout-with-canary for the release-train-with-canary choice.
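The three components can be sketched together as a small in-memory model. This is an illustrative sketch only: `EnvironmentSet`, `assign_environment`, the AZ names, and the version strings are hypothetical, and it does not represent how Chef Librarian or Poptart Bootstrap are actually implemented.

```python
class EnvironmentSet:
    """Component 1: N parallel environments, each with its own version pin."""

    def __init__(self, names, initial_version):
        self.pins = {name: initial_version for name in names}

    def promote(self, env, version):
        """Component 3: a promotion touches exactly one environment."""
        if env not in self.pins:
            raise KeyError(f"unknown environment: {env}")
        self.pins[env] = version


def assign_environment(az, az_to_env):
    """Component 2: boot-time mapping from AZ to exactly one environment.

    Fails loudly for an unmapped AZ instead of falling back to a shared
    environment, since a fallback would reintroduce the shared blast radius.
    """
    try:
        return az_to_env[az]
    except KeyError:
        raise RuntimeError(f"no environment mapped for AZ {az!r}") from None


envs = EnvironmentSet([f"prod-{i}" for i in range(1, 7)], "1.0.0")
az_to_env = {"az-a": "prod-1", "az-b": "prod-2"}  # hypothetical mapping

envs.promote("prod-1", "1.1.0")  # only prod-1 moves to the new version

node_env = assign_environment("az-b", az_to_env)
print(node_env, envs.pins[node_env])  # prod-2 1.0.0
```

A node booting in `az-b` resolves to `prod-2` before its first configuration pass and therefore still sees the old pin, even though `prod-1` has already been promoted.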

Relation to cell-based architecture

AZ-bucketed environment split is cell-based architecture at the fleet-configuration altitude. The N environments are the cells; the AZ is the cell boundary; the boot-time mapping is the cell router. See concepts/cell-based-architecture and patterns/cell-based-architecture-for-blast-radius-reduction for the general shape; this is the specific instantiation at the "one Chef environment per AZ" granularity.

Design trade-offs

  • N vs AZ count. Slack uses 6 environments; most AWS regions offer 3 AZs, though some offer more. If N is smaller than the AZ count, multiple AZs share an environment; if N is larger, a single AZ spans multiple environments (arguably less useful). Slack does not explain its choice of 6; if its regions have fewer than 6 AZs, the number may reflect additional dimensions (primary-AZ × sub-region, or a per-AZ subdivision sized by instance count).
  • Mapping strategy. Deterministic hash of AZ ID gives idempotent mapping (same AZ always gets same bucket, easy to reason about, easy to debug). Round-robin is simpler but can create imbalance. Explicit per-AZ pin is most flexible but most manual. Slack does not disclose its choice.
  • Global vs per-service bucketing. Slack's buckets are global (all instances in a given AZ pull from the same environment). Per-service bucketing would add another axis but — as the Slack post itself notes — "with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale." That architectural ceiling is why Shipyard is being built.
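The three mapping strategies named above can be compared in a short sketch. Slack does not disclose its choice, so none of this reflects its implementation; the function names, the environment list, and the AZ IDs are assumptions made for illustration. The hash variant uses `hashlib` rather than Python's built-in `hash()`, because the latter is randomised per process for strings and would not give a stable bucket.

```python
import hashlib
from itertools import cycle

ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]


def hash_bucket(az_id, environments=ENVIRONMENTS):
    """Deterministic hash: the same AZ ID always lands in the same bucket."""
    digest = hashlib.sha256(az_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(environments)
    return environments[index]


def round_robin_buckets(az_ids, environments=ENVIRONMENTS):
    """Round-robin: buckets are handed out in the order the AZs are listed."""
    return dict(zip(az_ids, cycle(environments)))


# Explicit pin: a hand-maintained table (hypothetical AZ IDs).
EXPLICIT_PINS = {"use1-az1": "prod-1", "use1-az2": "prod-2"}


def pinned_bucket(az_id, pins=EXPLICIT_PINS):
    """Explicit pin: look the AZ up; an unmapped AZ raises KeyError."""
    return pins[az_id]


# The hash mapping is stable across calls and processes.
assert hash_bucket("use1-az1") == hash_bucket("use1-az1")

# Round-robin depends on enumeration order: reordering the input
# changes which bucket an AZ receives.
print(round_robin_buckets(["use1-az1", "use1-az2"])["use1-az2"])  # prod-2
print(round_robin_buckets(["use1-az2", "use1-az1"])["use1-az2"])  # prod-1
```

One consequence of the hash approach that the sketch makes visible: two AZ IDs can hash to the same bucket, so a pure hash does not guarantee one AZ per environment. An explicit pin table avoids that at the cost of manual upkeep.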

Canonical-wiki siblings

Other cell-based-isolation instances at different altitudes:

Caveats

  • AZ is a structural failure domain, not a structural blast-radius boundary. AWS's AZ boundary is designed to isolate hardware failures; it doesn't, on its own, isolate logical failures (bad config). AZ-bucketed environment split uses the AZ boundary as the blast-radius axis for the fleet-configuration substrate, but the AZ boundary itself doesn't enforce the isolation — the Chef environment boundary does.
  • Scale-out still pulls from an environment. The split determines which environment a new node pulls from, but it does not eliminate the scale-out-can-pick-up-bad-config failure entirely; it confines that failure to one AZ-bucket.
  • Rollout speed is a cost. N environments means N promotion steps per rollout; the full-fleet-applied latency grows with N if promotions are serialised (as in Slack's release train). Slack treats this latency as a feature, not a cost.
  • Per-service isolation is not provided. If two services share an environment (they do, in Slack's AZ-bucketed world), a cookbook change that affects both will promote to both simultaneously. Per-service environments were rejected on operational-overhead grounds.
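The rollout-speed caveat is simple arithmetic when promotions are strictly serialised. The sketch below assumes a fixed soak interval between consecutive promotions; the one-hour figure is hypothetical and is not Slack's actual schedule.

```python
from datetime import timedelta


def full_fleet_latency(n_environments, soak):
    """Time from the first promotion until the last environment is promoted.

    Assumes strictly serialised promotions with a fixed soak interval
    between consecutive environments.
    """
    return (n_environments - 1) * soak


# Six environments with an assumed one-hour soak between promotions.
print(full_fleet_latency(6, timedelta(hours=1)))  # 5:00:00
```

Under this model the latency grows linearly with N, which is the cost the caveat describes; a single shared environment (N = 1) has zero added latency and no containment.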

Seen in
