Slack — Advancing Our Chef Infrastructure: Safety Without Disruption¶
Summary¶
Archie Gunasekara's 2025-10-23 follow-up to Slack's 2024
Advancing Our Chef Infrastructure
post. Describes phase two of Slack's EC2 / Chef deploy-safety
work: instead of migrating to Chef
Policyfiles (which would have required every service team to
rewrite roles, environments, and cookbooks), Slack extended the
existing EC2 framework with two load-bearing changes — splitting
one production Chef environment into six AZ-bucketed environments
(prod-1 … prod-6) rolled out via a release train with
prod-1 as canary, and replacing cron-driven Chef runs with a
signal-driven pull model via a new service called
Chef Summoner that watches an S3 bucket
populated by the existing Chef Librarian
artifact-promotion service. The post closes by marking the legacy
EC2 platform feature-complete + maintenance-mode and previews
a brand-new EC2 successor called
Shipyard (service-level deployments + metric-driven rollouts +
automated rollbacks) for teams that can't yet move to
Bedrock.
Decisively on-scope at Tier-2: real distributed-systems internals (fleet-wide config-management substrate, new service with explicit architectural role, S3-as-fanout primitive, release-train mechanism, fallback-cron safety net), explicit scaling-trade-off rationale ("a huge effort and added more risk than it solved" for the Policyfiles alternative, chosen instead for an incremental path), and concrete operational disclosures (6 production environments, 12-hour compliance SLA, hourly promotions). Natural companion to the 2025-10-07 Deploy Safety retrospective — where that post described the program and the Webapp-backend mechanism at altitude, this post drills into the EC2 / Chef substrate mechanism one level below.
Key takeaways¶
- The Policyfiles migration was rejected on blast-radius-of-change grounds. Verbatim: "That would have meant replacing roles and environments and asking dozens of teams to change their cookbooks. In the long run, it might have made things safer, but in the short term it would have been a huge effort and added more risk than it solved." A clean-slate architectural improvement was passed over in favour of an incremental improvement of the existing framework that did not force downstream cookbook changes. This is a canonical instance of the incremental-improvement-over-greenfield trade-off the 2025-10-07 Deploy Safety retrospective called out as a program-level discipline (invest widely, bias for action, avoid migrations that slow innovation). (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption)
- One shared `prod` environment → six AZ-bucketed production environments (`prod-1`…`prod-6`). Before the change, every instance — including newly provisioned nodes during scale-out events — pulled cookbook versions from the single `prod` environment. A bad cookbook version poisoned the whole fleet at next-Chef-run, plus any instance that came online mid-rollout. After the change, each instance is mapped at boot time to one of `prod-1`…`prod-6` based on its AZ ID by Poptart Bootstrap (Slack's cloud-init-phase AMI tool). Canonicalised as concepts/az-bucketed-environment-split and patterns/split-environment-per-az-for-blast-radius. A bad cookbook version now only impacts the AZ-bucket it reached; the other AZs continue to provision safely from their own environment.
- The release train puts the canary before the train, not at the head of it. `prod-1` receives the latest cookbook version hourly (the moment the dev environment does), acting as a canary production environment. `prod-2` through `prod-6` progress through a release train — a new version only advances into `prod-2` once the previous version has made it all the way through `prod-6`. The reasoning is stated explicitly: "If we waited for a version to pass through all environments before updating `prod-1` … we'd end up testing artifacts with large, cumulative changes. By contrast, updating `prod-1` frequently with the latest version allows us to detect issues closer to when they were introduced." Canonicalised as patterns/release-train-rollout-with-canary.
- Chef runs moved from fixed cron to signal-driven pull. Because each of the six production environments now receives updates at different times, a fixed cron schedule is no longer predictable — "we can't reliably predict when a given Chef environment will receive new changes." Slack replaced the per-instance cron with a new service called Chef Summoner that runs on every node and watches an S3 key corresponding to the node's Chef stack and environment. Canonicalised as concepts/signal-driven-chef-trigger and patterns/signal-triggered-fleet-config-apply.
- The S3 bucket is the fanout substrate. Chef Librarian (the existing Slack service that uploads cookbook artifacts to every Chef stack, originally described in the 2024 post) was extended to write a JSON message to an S3 bucket whenever it promotes a cookbook version to an environment. The bucket layout is two-level: `chef-run-triggers/<stack>/<env>` (two stacks disclosed: `basalt` and `ironstone`; eleven env subkeys per stack: `ami-dev`, `ami-prod`, `dev`, `prod`, `prod-1`…`prod-6`, `sandbox`). Each environment key contains a JSON object carrying the `Splay`, the `Timestamp`, and the full `ManifestRecord` (artifact version, Chef shard, commit hash, cookbook-versions map, site-cookbook-versions map, S3 bucket + key for the `.tar.gz`, TTL, upload-complete flag). Canonicalised as concepts/s3-signal-bucket-as-config-fanout. This makes S3 the canonical instance of a shared-storage-as-fanout primitive at the fleet-configuration altitude.
- `Splay` is the randomised staggering primitive. The S3 signal's `Splay` field (example value 15, presumably seconds or minutes — the post does not disclose units) is consumed by Chef Summoner to randomise when the actual Chef run fires. "This helps avoid spikes in load and resource contention." Explicit: "We can also customize the splay depending on our needs — for example, when we trigger a Chef run using a custom signal from Librarian and want to spread the runs out more intentionally." Canonicalised as concepts/splay-randomised-run-jitter. A cousin of patterns/jittered-job-scheduling at the fleet-config altitude.
- Chef Summoner is load-bearing enough to need its own fallback. "Now that Chef Summoner is the primary mechanism we rely on to trigger Chef runs, it becomes a critical piece of infrastructure. After a node is provisioned, subsequent Chef runs are responsible for keeping Chef Summoner itself up to date with the latest changes. But if we accidentally roll out a broken version of Chef Summoner, it may stop triggering Chef runs altogether — making it impossible to roll out a fixed version using our normal deployment flow." Mitigation: a fallback cron job baked onto every node that checks the local state Chef Summoner stores (last-run-time + artifact-version) and triggers a Chef run directly if Summoner hasn't run Chef in the last 12 hours. Canonicalised as concepts/fallback-cron-for-self-update-safety and patterns/self-update-with-independent-fallback-cron. A canonical self-update-paradox pattern: the tool that keeps itself up to date cannot be the sole mechanism that keeps itself up to date.
- 12-hour compliance SLA is the floor, not the target. Chef runs exist for two reasons: apply the latest cookbook changes (hot path, optimised for fast rollout) and ensure the fleet remains in its defined configuration state (compliance path, must run at least once every 12 hours per Slack policy). Chef Summoner's signal path handles the first; a 12-hour max-interval check inside Summoner and the fallback cron's 12-hour window handle the second. This is a canonical instance of a trailing-metric compliance window coexisting with a fast-signal operational window.
- Version staggering trades deploy speed for safety. Verbatim: "changes now take a bit longer to roll out across all our production nodes. However, this delay provides valuable time between deployments across different availability zones, allowing us to catch and address any issues before problematic changes are fully propagated." The release-train delay is a feature, not a cost. It is the mechanism by which an AZ-level blast-radius boundary becomes a temporal blast-radius boundary.
- The legacy EC2 platform is now feature-complete; Shipyard is the successor. Verbatim: "we've decided to mark our legacy EC2 platform as feature complete and move it into maintenance mode. In its place, we're building a brand-new EC2 ecosystem called Shipyard, designed specifically for teams that can't yet move to our container-based platform, Bedrock." Shipyard is "a complete reimagining of how EC2-based services should work. It introduces concepts like service-level deployments, metric-driven rollouts, and fully automated rollbacks when things go wrong." Soft launch was targeted for Q4 2025 ("this quarter" at the post's publish date of 2025-10-23) with two pilot teams. Canonicalised as systems/slack-shipyard (stub — post is preview-only).
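The takeaways above fix Chef Summoner's trigger rule precisely: run Chef when a signal carries a new artifact version, or when the 12-hour compliance floor is exceeded, with a splay-randomised start. A minimal sketch of that decision logic, with all field and function names assumed (the post does not disclose Summoner's internals):

```python
import random

TWELVE_HOURS = 12 * 3600  # Slack's disclosed compliance floor, in seconds

def should_trigger(signal: dict, local_state: dict, now: float) -> bool:
    """Run Chef if the signal carries a new artifact version (hot path),
    or if the node has gone 12 hours without a run (compliance path).
    Field names here are assumptions, not disclosed schema."""
    new_version = (signal["ManifestRecord"]["ArtifactVersion"]
                   != local_state.get("artifact_version"))
    overdue = now - local_state.get("last_run_time", 0.0) >= TWELVE_HOURS
    return new_version or overdue

def splay_delay(signal: dict) -> float:
    """Randomise the run start inside [0, Splay] so a whole environment
    receiving the same signal doesn't run Chef simultaneously.
    Splay units are not disclosed; seconds are assumed here."""
    return random.uniform(0.0, signal["Splay"])
```

The dedupe-against-local-state check is what makes repeated polls of the same S3 key idempotent: an already-applied signal produces no new run until the compliance window forces one.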
Systems / concepts / patterns extracted¶
Systems¶
- systems/chef (new stub) — the configuration-management substrate. Ruby DSL with "cookbooks" of "recipes" run by the `chef-client` agent on every node; "environments" pin cookbook versions; "roles" group recipes. Canonicalised here as the substrate, not as a proper wiki system page; extended with Slack's specific usage.
- systems/chef-policyfiles (new stub) — the newer Chef feature bundle that replaces roles + environments with a single policy file; the architectural alternative Slack explicitly rejected on blast-radius-of-change grounds.
- systems/chef-librarian (new) — Slack's existing service (originally described in the 2024 post) that uploads cookbook artifacts to every Chef stack and exposes a promote-version-to-environment API. Extended in this 2025-10 post to write a JSON signal to `chef-run-triggers/<stack>/<env>` in an S3 bucket on every promote. The enhancement is the producer side of the signal-driven fanout.
- systems/chef-summoner (new) — Slack's new service that runs on every node, watches the S3 signal bucket at `chef-run-triggers/<stack>/<env>` for a new artifact version, reads the `Splay` from the signal JSON, and schedules a Chef run. Also enforces the 12-hour max-interval compliance floor. Keeps its own local state (last-run-time + artifact-version) to dedupe against already-applied signals. The consumer side of the signal-driven fanout.
- systems/poptart-bootstrap (new) — Slack's existing cloud-init-phase tool baked into all Slack AMIs. Runs at instance boot to create the Chef node object, set up DNS entries, post boot-status messages to Slack (customisable channel per team), and — extended in this post — map the new instance's AZ ID to one of `prod-1`…`prod-6`. This AZ-to-environment mapping is the structural boundary that makes the environment split possible.
- systems/slack-shipyard (new stub) — Slack's upcoming EC2 platform successor; service-level deployments + metric-driven rollouts + automated rollbacks. Preview-only in this post.
- systems/slack-bedrock (existing, extended) — the container-based platform Shipyard is positioned against as the preferred target; extends the existing page's lineage section.
- systems/aws-s3 (existing, extended) — canonical first wiki instance of S3 as a fleet-configuration signal-fanout bus (distinct from S3-as-object-store, S3-as-CDC-log-store, S3-as-tiered-backing-store). The `chef-run-triggers/<stack>/<env>` layout + per-env JSON manifest + TTL is the pattern shape.
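Poptart Bootstrap's role is the boot-time AZ-to-environment mapping. The post does not disclose the mapping scheme (round-robin, hash, or explicit pin), so the following is only one plausible implementation — a stable hash of the AZ ID, which guarantees that every instance booting in the same AZ lands in the same environment:

```python
import hashlib

def env_for_az(az_id: str, n_envs: int = 6) -> str:
    """Deterministically bucket an AZ ID into prod-1..prod-N.
    Hypothetical scheme: Slack's actual mapping is undisclosed; a stable
    hash is one way to get a boot-time decision with no shared state."""
    digest = hashlib.sha256(az_id.encode("utf-8")).hexdigest()
    return f"prod-{int(digest, 16) % n_envs + 1}"
```

Whatever the real scheme, the load-bearing property is determinism at boot: the mapping needs no later Chef run or central coordinator, so new instances inherit the blast-radius boundary from day zero.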
Concepts¶
- concepts/az-bucketed-environment-split (new) — the axis choice of splitting one production environment into N buckets keyed by AZ, so that a bad config promotion in one bucket is contained to that AZ's nodes (both existing nodes at next-Chef-run and new nodes that provision mid-incident). The specific load-bearing insight is that newly provisioned nodes were the worst-case blast-radius axis prior to the split — they'd unavoidably pull the latest-possibly-bad config from the shared environment. AZ-bucketing moves the blast-radius boundary upstream of the provisioning path.
- concepts/splay-randomised-run-jitter (new) — the per-node randomised delay (Chef calls it `Splay`) between signal-received and Chef-run-starts. Prevents thundering-herd load spikes when a new cookbook version is promoted to an environment. Sibling to patterns/jittered-job-scheduling at the fleet-config altitude; the delay is explicitly configurable per-signal for operational tuning.
- concepts/signal-driven-chef-trigger (new) — the architectural shift from fixed cron ("Chef runs every N hours on a schedule") to signal-driven ("Chef runs when Librarian emits a new-version signal"). Load-bearing property: Chef runs when there is actual work to do, eliminating the between-promotion no-op runs that contribute load without reducing config drift.
- concepts/s3-signal-bucket-as-config-fanout (new) — S3 (object storage) as the fanout substrate for a configuration-management signal bus. The two-level prefix layout (`chef-run-triggers/<stack>/<env>`) is the routing shape; per-env JSON manifests are the payload; TTL fields inside the manifest handle signal expiry. Why S3 works: N producers (Librarian instances) write to a small number of keys; M consumers (every node running Chef Summoner) poll their specific key; because producers write all keys and each consumer reads one key, the fanout cost is O(N+M), not O(N·M).
- concepts/fallback-cron-for-self-update-safety (new) — the general pattern: when system X is responsible for updating itself ("subsequent Chef runs are responsible for keeping Chef Summoner itself up to date"), a broken X cannot fix itself. The safety net is a second, independent trigger mechanism with a looser SLA — here, a fallback cron running every 12 hours that checks whether X has run recently and triggers X's core action if it hasn't. The key architectural constraint: the fallback must not depend on X at all.
- concepts/cookbook-artifact-versioning (new) — the unit of rollout in the Chef ecosystem: a pinned-version cookbook artifact is promoted per-environment; rollback is re-promoting the previous version. Canonical instance of artifact-as-promotion-unit at the fleet-configuration altitude (contrast with Kubernetes-Deployment-rev-as-unit in container deploys, ECS-task-def-rev in ECS deploys, Lambda-version-alias in Lambda deploys).
- concepts/blast-radius (existing, extended) — adds the fleet-configuration substrate altitude (AZ-bucketed Chef environment as the boundary). Sits one step below availability-zone (AWS-provided) and one step above per-instance cron schedule (pre-existing Slack staggering).
Patterns¶
- patterns/split-environment-per-az-for-blast-radius (new) — engineering-activity pattern: instead of one fleet-wide `prod` environment, bucket instances by AZ into N parallel `prod-1`…`prod-N` environments and map at boot time. The load-bearing decision is that the boot path is where the mapping happens (so new instances inherit the boundary from day zero, without needing a later Chef run to move them into a bucket).
- patterns/release-train-rollout-with-canary (new) — engineering-activity pattern: one bucket (`prod-1`) runs as a hot canary receiving every new version immediately; the remaining buckets run as a release train advancing one version at a time through `prod-2` → `prod-N`, with the next version gated on the previous version completing the train. The asymmetry is load-bearing — if `prod-1` were also on the train, Slack would be testing cumulative-change artifacts at the canary position, not incremental changes.
- patterns/signal-triggered-fleet-config-apply (new) — engineering-activity pattern: replace fixed-cron config-management triggers with signal-driven pulls from a shared signal bus. The three required components: (1) a signal-producing service that knows when there's new work (Librarian writes to S3 on promote); (2) a signal bus that the fleet can efficiently poll (S3 with per-env keys); (3) a signal-consuming agent on every node that polls, deduplicates against local state, and applies (Chef Summoner). Sibling to patterns/read-through-object-store-volume at a different altitude (config management vs file system).
- patterns/self-update-with-independent-fallback-cron (new) — engineering-activity pattern: when a system is responsible for its own updates, pair it with a fallback trigger whose code path does not depend on the self-updating system's correctness. Chef Summoner + 12-hour fallback cron is the canonical instance. General shape: primary self-update path (fast, signal-driven) + fallback path (slow, time-driven, independent).
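The release-train gating rule in the patterns above can be sketched as a pure state transition. The actual orchestration (a Kubernetes cron job per the caveats) is undisclosed, so this is only a model of the stated rule — `prod-1` always pins the latest version, while `prod-2`…`prod-6` carry one version at a time and a new version boards only after the previous one clears `prod-6`:

```python
def train_tick(envs: dict[str, str], latest: str) -> dict[str, str]:
    """One promotion cycle of the release train (hypothetical model of
    the disclosed gating rule, not Slack's actual cron-job logic)."""
    envs = dict(envs)
    envs["prod-1"] = latest          # hot canary: always the latest version
    in_transit = envs["prod-2"]      # the version currently riding the train
    for i in range(3, 7):
        if envs[f"prod-{i}"] != in_transit:
            envs[f"prod-{i}"] = in_transit  # advance one environment per tick
            return envs
    if latest != in_transit:
        envs["prod-2"] = latest      # train is clear: board the next version
    return envs
```

The model makes the asymmetry visible: `prod-1` moves every tick regardless of train state, which is exactly why it tests incremental rather than cumulative changes.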
Numbers and scale¶
- 6 production environments (`prod-1`…`prod-6`) after the split; was 1.
- 2 Chef stacks disclosed: `basalt` and `ironstone`.
- 11 environment subkeys per stack: `ami-dev`, `ami-prod`, `dev`, `prod`, `prod-1` through `prod-6`, `sandbox`.
- Hourly promotion cadence — Librarian promotes the latest cookbook to Sandbox/Dev on the hour, and to `prod-1` at :30. Remaining production envs advance at :30 during release-train cycles.
- 12-hour compliance SLA — Chef must run at least once every 12 hours on every node. Enforced by both Chef Summoner and the fallback cron.
- Splay = 15 in the example JSON; unit (seconds / minutes) not disclosed.
Numbers not disclosed¶
- Fleet size (# of EC2 nodes under Chef management).
- Cookbook count, cookbook version count, promotion rate.
- S3 bucket poll interval (how often Summoner checks the key).
- Concrete Splay units.
- Whether Summoner uses S3 event notifications (SNS/SQS/EventBridge) or polling.
- Historical incident counts pre- and post-split.
- % of fleet on `prod-1` vs each of `prod-2`…`prod-6`.
- How AZ-to-environment mapping is decided (round-robin by AZ? hash of AZ ID? explicit per-AZ pin?).
- Failure semantics: what happens if Librarian writes to S3 but Summoner never reads? What if Summoner reads but Chef run fails? No retry/backoff disclosed.
Caveats¶
- Feature-complete, not architecturally complete. The post itself acknowledges that service-level deploys still aren't possible on the legacy EC2 platform — "One major limitation is that we still can't easily support service-level deployments. In theory, we could create a dedicated set of Chef environments for each service and promote artifacts individually — but with the hundreds of services we operate at Slack, this quickly becomes unmanageable at scale." The architectural ceiling of the AZ-bucketed-environment approach is per-service isolation, which is why Shipyard is being built.
- No production-incident retrospective. The post does not disclose specific incidents caught by the `prod-1` canary or prevented by the AZ bucketing — the value claim is mechanism-level ("significantly reducing risk and blast radius during deployments"), not numbers-level.
- No Shipyard architecture disclosure. Shipyard is preview-only; the next post will disclose the mechanisms. systems/slack-shipyard is a stub.
- Chef Summoner's S3 access pattern not disclosed. Whether Summoner polls or subscribes to events, and at what cadence, affects cost and latency. Not stated.
- Release-train progression automation. The post says "a Kubernetes cron job" rolls out cookbook changes, but does not disclose what the cron job actually does (compute the next version? call Librarian's promote API? make a git commit?). This promotion-orchestration layer is the push-side counterpart of the pull-model substrate described above, and is less well-documented.
- The 2024 predecessor post is not ingested on the wiki. The 2024-era Slack post introduced Chef Librarian and the Poptart Bootstrap tool; this 2025 post assumes that context. A future ingest of the 2024 post would extend coverage of those primitives.
Source¶
- Original: https://slack.engineering/advancing-our-chef-infrastructure-safety-without-disruption/
- Predecessor (not yet ingested): Advancing Our Chef Infrastructure (2024)
- Raw markdown: raw/slack/2025-10-23-advancing-our-chef-infrastructure-safety-without-disruption-481e9f78.md
Related¶
- companies/slack
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — program-level companion
- systems/chef / systems/chef-policyfiles / systems/chef-librarian / systems/chef-summoner / systems/poptart-bootstrap
- systems/slack-shipyard / systems/slack-bedrock / systems/slack-releasebot / systems/slack-deploy-safety-program
- concepts/az-bucketed-environment-split / concepts/splay-randomised-run-jitter / concepts/signal-driven-chef-trigger / concepts/s3-signal-bucket-as-config-fanout / concepts/fallback-cron-for-self-update-safety / concepts/cookbook-artifact-versioning / concepts/blast-radius
- patterns/split-environment-per-az-for-blast-radius / patterns/release-train-rollout-with-canary / patterns/signal-triggered-fleet-config-apply / patterns/self-update-with-independent-fallback-cron / patterns/staged-rollout / patterns/fast-rollback / patterns/cell-based-architecture-for-blast-radius-reduction
- concepts/feedback-control-loop-for-rollouts / patterns/jittered-job-scheduling