CONCEPT Cited by 1 source
AZ-bucketed environment split¶
Definition¶
AZ-bucketed environment split is the engineering choice of
partitioning a single fleet-configuration environment (e.g., a
Chef prod environment) into N parallel environments keyed by
availability zone, and pinning each node to exactly one bucket
at boot time. A bad configuration promotion is contained to
the AZ-bucket it reached; the other AZs continue to provision
and operate on their own bucket's configuration.
The canonical instance is Slack's 2025-10-23 split of prod
into prod-1 … prod-6, with
Poptart Bootstrap (Slack's cloud-init-phase AMI tool) deciding
the AZ-to-environment assignment at instance boot (Source:
sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption).
The load-bearing insight: new-node scale-out is the worst-case axis¶
Before the split, Slack staggered per-node Chef-run crons across AZs so a bad change on an existing node would only propagate to a subset of AZs at any given minute. But this staggering doesn't protect new nodes: verbatim, "even if Chef wasn't running everywhere at once, any newly provisioned nodes would immediately pick up the latest (possibly bad) changes from that shared environment. This became a significant reliability risk, especially during large scale-out events, where dozens or hundreds of nodes could start up with a broken configuration."
AZ-bucketed environment split closes the gap by moving the blast-radius boundary upstream of the provisioning path. New nodes never see the bad configuration because their AZ hasn't been promoted to it yet.
The three required components¶
- N parallel environments with identical semantic meaning (e.g., prod-1…prod-N all represent "production"). They share no configuration state by construction; version-pins are per-environment.
- Boot-time mapping. A mechanism that runs before the first configuration-management pass and assigns the node to exactly one environment based on its AZ. If the mapping were lazy, the node would pull from the shared environment before being bucketed — defeating the purpose.
- Separate promotion per environment. The promotion substrate (Slack's Chef Librarian) can promote a version to each environment independently. This lets the rollout strategy choose how to stagger across buckets — see patterns/release-train-rollout-with-canary for the release-train-with-canary choice.
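The three components above can be wired together in a minimal sketch. This is illustrative only — the environment names, AZ IDs, and pin table are assumptions, not Slack's actual Poptart Bootstrap or Chef Librarian code (the one real detail used is that `environment` is a valid chef-client client.rb setting):

```python
# 1. N parallel environments with identical semantic meaning.
ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]

# 2. Boot-time mapping: here an explicit per-AZ pin (hypothetical AZ IDs),
#    resolved before the first chef-client run so the node never reads a
#    shared environment.
AZ_PINS = {
    "use1-az1": "prod-1",
    "use1-az2": "prod-2",
    "use1-az4": "prod-3",
}

def pin_environment(az_id: str, client_rb_lines: list) -> str:
    """Append the environment pin that an AMI-phase bootstrapper would
    write to /etc/chef/client.rb before chef-client's first pass."""
    env = AZ_PINS[az_id]
    client_rb_lines.append(f'environment "{env}"')
    return env

# 3. Separate promotion per environment: each bucket carries its own
#    version pin, so a promoted version reaches only its own bucket.
pins = {env: "cookbook-v41" for env in ENVIRONMENTS}
pins["prod-1"] = "cookbook-v42"   # promote v42 to one bucket only
```

A node booting in `use1-az2` would be pinned to `prod-2` and keep running `cookbook-v41` even while `prod-1` carries the newly promoted version.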
Relation to cell-based architecture¶
AZ-bucketed environment split is cell-based architecture at the fleet-configuration altitude. The N environments are the cells; the AZ is the cell boundary; the boot-time mapping is the cell router. See concepts/cell-based-architecture and patterns/cell-based-architecture-for-blast-radius-reduction for the general shape; this is the specific instantiation at the "one Chef environment per AZ" granularity.
Design trade-offs¶
- N vs M where M = AZ count. Slack uses 6 environments; AWS typically offers 3 AZs per region. If N < AZ count, multiple AZs share an environment; if N > AZ count, AZs have multiple environments (arguably less useful). Slack's choice of 6 for a likely-3-AZ region is not explained, but may reflect additional dimensions (primary-AZ × sub-region, or a per-AZ-subdivision sized by instance count).
- Mapping strategy. Deterministic hash of AZ ID gives idempotent mapping (same AZ always gets same bucket, easy to reason about, easy to debug). Round-robin is simpler but can create imbalance. Explicit per-AZ pin is most flexible but most manual. Slack does not disclose its choice.
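The idempotence difference between the first two mapping strategies is easy to demonstrate. A small sketch (AZ IDs and bucket count are assumptions; Slack does not disclose its mapping):

```python
import hashlib
from itertools import count

N = 6  # number of environment buckets

def hash_bucket(az_id: str) -> str:
    # Deterministic: the same AZ always maps to the same bucket, so the
    # assignment survives reboots and is easy to reproduce when debugging.
    digest = int(hashlib.sha256(az_id.encode()).hexdigest(), 16)
    return f"prod-{digest % N + 1}"

_rr = count()
def round_robin_bucket(az_id: str) -> str:
    # Simpler, but the result depends on assignment order: the same AZ can
    # land in different buckets over time, which can skew bucket sizes.
    return f"prod-{next(_rr) % N + 1}"
```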
- Global vs per-service bucketing. Slack's buckets are global (all instances in a given AZ pull from the same environment). Per-service bucketing would add another axis but — as the Slack post itself notes — "with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale." That architectural ceiling is why Shipyard is being built.
Canonical-wiki siblings¶
Other cell-based-isolation instances at different altitudes:
- concepts/sharded-failure-domain-isolation — database shards as the blast-radius boundary (customer routes to 1/N shards).
- concepts/active-multi-cluster-blast-radius — service running across multiple active clusters so no single cluster failure takes the whole service down.
- concepts/cell-based-architecture — the general concept; cells can be at any altitude.
- Regional failover — region level, and fully-independent fleets rather than shared-service-with-bucketed-config.
Caveats¶
- AZ is a structural failure domain, not a structural blast-radius boundary. AWS's AZ boundary is designed to isolate hardware failures; it doesn't, on its own, isolate logical failures (bad config). AZ-bucketed environment split uses the AZ boundary as the blast-radius axis for the fleet-configuration substrate, but the AZ boundary itself doesn't enforce the isolation — the Chef environment boundary does.
- Scale-out still pulls from an environment. The boundary bounds which environment is pulled, but doesn't eliminate the scale-out-can-pick-up-bad-config failure entirely — it just confines it to one AZ-bucket.
- Rollout speed is a cost. N environments means N promotion steps per rollout; the full-fleet-applied latency grows with N if promotions are serialised (as in Slack's release train). Slack treats this latency as a feature, not a cost.
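A serialised promotion across N buckets, halting at the first unhealthy one, can be sketched as follows (a stand-in for the real promotion substrate; function and environment names are assumptions, not Chef Librarian's API):

```python
ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]

def promote_serially(version: str, healthy=lambda env: True):
    """Promote `version` bucket by bucket; halt on the first unhealthy
    bucket so the bad version never reaches the remaining AZ-buckets."""
    promoted = []
    for env in ENVIRONMENTS:
        promoted.append(env)          # stand-in for the real promotion call
        if not healthy(env):          # bake/health check between buckets
            return promoted, "halted"
    return promoted, "complete"
```

Full-fleet latency here grows linearly with N (one promotion-plus-bake step per bucket), which is exactly the latency the text describes Slack treating as a feature.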
- Per-service isolation is not provided. If two services share an environment (they do, in Slack's AZ-bucketed world), a cookbook change that affects both will promote to both simultaneously. Per-service environments were rejected on operational-overhead grounds.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption
— canonical: Slack's 2025 split of the single prod Chef environment into six AZ-bucketed environments (prod-1…prod-6) with boot-time mapping via Poptart Bootstrap and release-train rollout via Chef Librarian.
Related¶
- concepts/blast-radius
- concepts/cell-based-architecture
- concepts/signal-driven-chef-trigger
- concepts/cookbook-artifact-versioning
- patterns/split-environment-per-az-for-blast-radius
- patterns/cell-based-architecture-for-blast-radius-reduction
- patterns/release-train-rollout-with-canary
- systems/chef
- systems/chef-librarian
- systems/chef-summoner
- systems/poptart-bootstrap