CONCEPT Cited by 1 source

AZ-bucketed environment split

Definition

AZ-bucketed environment split is the engineering choice of partitioning a single fleet-configuration environment (e.g., a Chef prod environment) into N parallel environments keyed by availability zone, and pinning each node to exactly one bucket at boot time. A bad configuration promotion is contained to the AZ-bucket it reached; the other AZs continue to provision and operate on their own bucket's configuration.

The canonical instance is Slack's split of prod into prod-1 … prod-6, described in its 2025-10-23 post, with Poptart Bootstrap (Slack's cloud-init-phase AMI tool) deciding the AZ-to-environment assignment at instance boot (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption).

The load-bearing insight: new-node scale-out is the worst-case axis

Before the split, Slack staggered per-node Chef-run crons across AZs so that a bad change would reach existing nodes in only a subset of AZs at any given minute. That staggering does not protect new nodes. In the post's words: "even if Chef wasn't running everywhere at once, any newly provisioned nodes would immediately pick up the latest (possibly bad) changes from that shared environment. This became a significant reliability risk, especially during large scale-out events, where dozens or hundreds of nodes could start up with a broken configuration."

AZ-bucketed environment split closes the gap by moving the blast-radius boundary upstream of the provisioning path. New nodes outside the affected bucket never see the bad configuration, because their AZ's environment hasn't been promoted to it yet.
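
The difference can be illustrated with a toy model of a scale-out event. This is a minimal sketch, not Slack's tooling: the AZ names, version labels, node counts, and the `versions_seen_at_boot` helper are all hypothetical and chosen only to show where the boundary sits.

```python
from collections import Counter


def versions_seen_at_boot(new_node_azs, env_of_az, pinned_version):
    """Count which config version each newly booted node would pick up.

    new_node_azs:   one AZ name per new node (hypothetical names)
    env_of_az:      AZ name -> environment name
    pinned_version: environment name -> currently promoted version
    """
    return Counter(pinned_version[env_of_az[az]] for az in new_node_azs)


# 300 nodes scale out evenly across three hypothetical AZs.
scale_out = ["az-a", "az-b", "az-c"] * 100

# Shared environment: every AZ resolves to "prod", which already holds the bad version.
shared = versions_seen_at_boot(
    scale_out,
    env_of_az={"az-a": "prod", "az-b": "prod", "az-c": "prod"},
    pinned_version={"prod": "v2-bad"},
)

# Bucketed environments: only prod-1 has been promoted to the bad version so far.
bucketed = versions_seen_at_boot(
    scale_out,
    env_of_az={"az-a": "prod-1", "az-b": "prod-2", "az-c": "prod-3"},
    pinned_version={"prod-1": "v2-bad", "prod-2": "v1-good", "prod-3": "v1-good"},
)

print(shared)    # all 300 new nodes boot with v2-bad
print(bucketed)  # 100 nodes boot with v2-bad, 200 with v1-good
```

In the shared case every new node boots broken. In the bucketed case the damage is limited to the nodes that land in the one promoted bucket.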

The three required components

N parallel environments with identical semantic meaning (e.g., prod-1 … prod-N all represent "production"). They share no configuration state by construction; version-pins are per-environment.
Boot-time mapping. A mechanism that runs before the first configuration-management pass and assigns the node to exactly one environment based on its AZ. If the mapping were lazy, the node would pull from the shared environment before being bucketed — defeating the purpose.
Separate promotion per environment. The promotion substrate (Slack's Chef Librarian) can promote a version to each environment independently. This lets the rollout strategy choose how to stagger across buckets — see patterns/release-train-rollout-with-canary for the release-train-with-canary choice.
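
The three components above can be sketched as a small in-memory model. This is an illustrative sketch under stated assumptions: `Environment`, `Fleet`, `promote`, and `boot` are hypothetical names, not the APIs of Chef, Chef Librarian, or Poptart Bootstrap, and the cookbook name and versions are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """One of N parallel environments; holds its own version pins."""
    name: str
    pins: dict = field(default_factory=dict)  # cookbook name -> pinned version


class Fleet:
    def __init__(self, az_to_env):
        # Component 2: the AZ -> environment mapping, consulted at boot.
        self.az_to_env = dict(az_to_env)
        # Component 1: N environments that share no pin state.
        self.envs = {name: Environment(name) for name in set(az_to_env.values())}

    def promote(self, env_name, cookbook, version):
        # Component 3: promotion touches exactly one environment.
        self.envs[env_name].pins[cookbook] = version

    def boot(self, az):
        # Resolve the environment before any configuration is pulled.
        env = self.envs[self.az_to_env[az]]
        return env.name, dict(env.pins)


fleet = Fleet({"az-a": "prod-1", "az-b": "prod-2"})
for env_name in ("prod-1", "prod-2"):
    fleet.promote(env_name, "base", "1.0.0")

# A new version reaches prod-1 only; prod-2 keeps its existing pin.
fleet.promote("prod-1", "base", "1.1.0")

print(fleet.boot("az-a"))  # ('prod-1', {'base': '1.1.0'})
print(fleet.boot("az-b"))  # ('prod-2', {'base': '1.0.0'})
```

Because `boot` resolves the environment first, a node never reads pins from any environment other than its own bucket.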

Relation to cell-based architecture

AZ-bucketed environment split is cell-based architecture at the fleet-configuration altitude. The N environments are the cells; the AZ is the cell boundary; the boot-time mapping is the cell router. See concepts/cell-based-architecture and patterns/cell-based-architecture-for-blast-radius-reduction for the general shape; this is the specific instantiation at the "one Chef environment per AZ" granularity.

Design trade-offs

N vs M, where M = AZ count. Slack uses 6 environments; AWS regions generally offer three or more AZs. If N < M, multiple AZs share an environment; if N > M, some AZs span multiple environments (arguably less useful). The post does not explain how its 6 environments relate to the AZ count of Slack's regions; the number may reflect additional dimensions (primary-AZ × sub-region, or a per-AZ subdivision sized by instance count).
Mapping strategy. Deterministic hash of AZ ID gives idempotent mapping (same AZ always gets same bucket, easy to reason about, easy to debug). Round-robin is simpler but can create imbalance. Explicit per-AZ pin is most flexible but most manual. Slack does not disclose its choice.
Global vs per-service bucketing. Slack's buckets are global (all instances in a given AZ pull from the same environment). Per-service bucketing would add another axis but — as the Slack post itself notes — "with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale." That architectural ceiling is why Shipyard is being built.
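
Two of the mapping strategies above, the deterministic hash and the explicit per-AZ pin, can be sketched as follows. Slack does not disclose its mapping, so this is purely illustrative: the function names, the `prod-<n>` naming scheme as generated here, and the AZ IDs are assumptions.

```python
import hashlib


def bucket_by_hash(az_id: str, n: int) -> str:
    """Deterministically map an AZ ID to one of n buckets.

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process for strings and so is not stable across boots.
    """
    digest = hashlib.sha256(az_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % n
    return f"prod-{index + 1}"


def bucket_by_pin(az_id: str, pins: dict) -> str:
    """Look up an operator-maintained AZ -> environment table.

    Raises KeyError for an unmapped AZ so a new AZ fails loudly
    instead of silently landing in a default environment.
    """
    return pins[az_id]


# Hypothetical AZ IDs and pin table.
pins = {"use1-az1": "prod-1", "use1-az2": "prod-2"}

print(bucket_by_hash("use1-az1", 6))  # same bucket on every call and every host
print(bucket_by_pin("use1-az2", pins))
```

One design consequence worth noting: a hash modulo N can place two AZs in the same bucket while leaving another bucket empty, whereas the explicit table guarantees whatever layout the operator writes down, at the cost of maintaining it by hand.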

Canonical-wiki siblings

Other cell-based-isolation instances at different altitudes:

concepts/sharded-failure-domain-isolation — database shards as the blast-radius boundary (customer routes to 1/N shards).
concepts/active-multi-cluster-blast-radius — service running across multiple active clusters so no single cluster failure takes the whole service down.
concepts/cell-based-architecture — the general concept; cells can be at any altitude.
Regional failover — AZ level, but fully-independent fleets rather than shared-service-with-bucketed-config.

Caveats

AZ is a structural failure domain, not a structural blast-radius boundary. AWS's AZ boundary is designed to isolate hardware failures; it doesn't, on its own, isolate logical failures (bad config). AZ-bucketed environment split uses the AZ boundary as the blast-radius axis for the fleet-configuration substrate, but the AZ boundary itself doesn't enforce the isolation — the Chef environment boundary does.
Scale-out still pulls from an environment. The split determines which environment a new node pulls from; it does not eliminate the scale-out-can-pick-up-bad-config failure, only confines it to one AZ-bucket.
Rollout speed is a cost. N environments means N promotion steps per rollout; the full-fleet-applied latency grows with N if promotions are serialised (as in Slack's release train). Slack treats this latency as a feature, not a cost.
Per-service isolation is not provided. If two services share an environment (they do, in Slack's AZ-bucketed world), a cookbook change that affects both will promote to both simultaneously. Per-service environments were rejected on operational-overhead grounds.
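
The rollout-speed caveat above is simple arithmetic, sketched here under explicit assumptions: promotions are strictly serialised, the first environment is promoted at time zero, and a fixed interval separates consecutive promotions. The one-hour interval is a hypothetical value, not Slack's actual cadence, and `time_until_last_promotion` is an illustrative helper.

```python
from datetime import timedelta


def time_until_last_promotion(n_envs: int, interval: timedelta) -> timedelta:
    """Elapsed time from the first promotion to the last one.

    Assumes serialised promotions with a fixed gap between them, so
    n_envs environments need (n_envs - 1) gaps.
    """
    if n_envs < 1:
        raise ValueError("need at least one environment")
    return (n_envs - 1) * interval


interval = timedelta(hours=1)  # hypothetical gap between promotions

print(time_until_last_promotion(1, interval))  # 0:00:00
print(time_until_last_promotion(6, interval))  # 5:00:00
```

Under these assumptions the latency grows linearly with the number of environments, which is the cost the caveat describes.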

Seen in

sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Slack's 2025 split of its single prod Chef environment into six AZ-bucketed environments (prod-1 … prod-6), with boot-time mapping via Poptart Bootstrap and release-train rollout via Chef Librarian.