SYSTEM Cited by 1 source
Chef
What it is
Chef is a configuration-management substrate for declaring
and enforcing the desired state of servers (files, packages,
users, services, network configuration, etc.). A node runs the
chef-client agent, which fetches a pinned set of cookbooks
(Ruby DSL definitions of how to reach a state) from a Chef
server, compiles them against the node's attributes, and
converges the node to the declared state. An environment
pins cookbook versions; roles group recipes; data bags
hold shared data.
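The converge step follows Chef's test-and-repair model. Each resource compares the node's current state with the declared state and acts only when they differ, so repeated chef-client runs on an already-converged node change nothing. The Python sketch below illustrates that idea under toy assumptions: node state is held in a dict, and each resource is reduced to a kind/name/desired-value triple. The Resource class and converge function are hypothetical and are not Chef's Ruby DSL or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a Chef resource (not Chef's API)."""
    kind: str     # e.g. "package", "file", "service"
    name: str     # e.g. "nginx"
    desired: str  # declared state, e.g. "installed", file content, "running"


def converge(node_state: dict, resources: list) -> list:
    """Test-and-repair pass over a compiled resource list.

    For each resource, test whether the node already matches the declared
    state, and repair it only if not. Returns the resources that changed.
    """
    updated = []
    for r in resources:
        key = (r.kind, r.name)
        if node_state.get(key) != r.desired:  # test
            node_state[key] = r.desired       # repair
            updated.append(r)
    return updated


# Toy "resource collection", as if compiled from a node's recipes.
resources = [
    Resource("package", "nginx", "installed"),
    Resource("file", "/etc/nginx/nginx.conf", "worker_processes 4;"),
    Resource("service", "nginx", "running"),
]

# A drifted node: package present, config missing, service not running.
node = {("package", "nginx"): "installed"}

first = converge(node, resources)
second = converge(node, resources)
print([r.name for r in first])  # ['/etc/nginx/nginx.conf', 'nginx']
print(len(second))              # 0: already converged, nothing to repair
```

The second pass is a no-op, which is why running chef-client repeatedly, on a schedule or on demand, is safe.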
This page is a stub anchoring Chef as the named substrate in Slack's EC2 / configuration-management stack; it is not a full canonicalisation of the Chef ecosystem.
Named role in Slack's Chef stack
Chef is the legacy EC2 configuration-management substrate at
Slack. The source posts are written in the first-person Slack
Engineering voice ("at Slack, keeping our service reliable is
always the top priority"). Slack's historical setup was one shared
prod Chef environment, with cron-driven chef-client runs every few
hours per node and cron timing staggered across AZs to limit blast
radius. Slack's 2024 and 2025 Chef posts changed this setup in two
phases:
- Phase 1 (2024): migrated from a single Chef stack to multiple stacks, reworked cookbook uploads, and worked around Chef search limitations. Post: Advancing Our Chef Infrastructure (not yet ingested).
- Phase 2 (2025-10-23): split the prod environment into six AZ-bucketed environments and replaced cron with signal-driven pull via Chef Summoner. Post: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption.
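The phase-2 environment split can be sketched as follows. This is a minimal Python illustration, not Slack's implementation. The AZ labels, the AZ-to-environment mapping, the "base" cookbook, and the promote helper are all hypothetical. The sketch also assumes that a new cookbook version is pinned into prod-1 first and then advanced one environment at a time, so a bad version reaches one AZ bucket before the rest of the fleet.

```python
from typing import Optional

# Six AZ-bucketed environments, in an assumed rollout order.
ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]

# Hypothetical AZ -> environment mapping; the source does not list the real one.
AZ_TO_ENV = {
    "az-a": "prod-1",
    "az-b": "prod-2",
    "az-c": "prod-3",
    "az-d": "prod-4",
    "az-e": "prod-5",
    "az-f": "prod-6",
}

# Each environment pins one version per cookbook ("base" is a made-up cookbook).
pins = {env: {"base": "1.4.0"} for env in ENVIRONMENTS}


def environment_for(az: str) -> str:
    """Return the Chef environment that a node in this AZ belongs to."""
    return AZ_TO_ENV[az]


def promote(cookbook: str, version: str) -> Optional[str]:
    """Pin `version` into the first environment that does not have it yet.

    Returns the environment that was updated, or None once every environment
    pins `version`. Advancing one environment per call contains a bad version
    to a single AZ bucket until it is promoted further.
    """
    for env in ENVIRONMENTS:
        if pins[env].get(cookbook) != version:
            pins[env][cookbook] = version
            return env
    return None


print(promote("base", "1.5.0"))                  # prod-1
print(pins[environment_for("az-b")]["base"])     # 1.4.0: prod-2 not promoted yet
```

Until a pin moves, nodes in later buckets keep converging against the previous version. That is the blast-radius bound this kind of split is meant to provide.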
As of the 2025-10-23 post's publication, the legacy Chef-based EC2 platform is feature-complete and in maintenance mode, with Shipyard as the upcoming EC2 successor for teams that can't yet move to Bedrock.
Key primitives named in Slack's usage
- Cookbook: a versioned artifact pinned per environment; see concepts/cookbook-artifact-versioning.
- Environment: a set of cookbook version pins. Slack split the single prod environment into prod-1…prod-6 in phase 2; see concepts/az-bucketed-environment-split.
- Role: a named bundle of recipes. Slack chose not to migrate to Policyfiles, which would have replaced roles and environments and required dozens of teams to change their cookbooks, on blast-radius-of-change grounds.
- chef-client run: the agent invocation that converges a node. Slack switched from fixed-cron-triggered runs to signal-triggered runs via systems/chef-summoner, keeping a 12-hour fallback cron for compliance.
- Splay: Chef's native per-run randomised jitter. Slack exposes it explicitly in the signal payload for operational tuning; see concepts/splay-randomised-run-jitter.
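Signal-triggered runs with splay and a fallback can be sketched as a small decision function. Everything here is a hypothetical Python illustration: the Signal payload, its splay_s field, and next_run_delay are not Chef Summoner's actual format or code. The 12-hour fallback is modelled as a staleness check (run if nothing has run in the last 12 hours). This approximates a cron entry that fires every 12 hours but is not identical to it.

```python
import random
from dataclasses import dataclass
from typing import Optional

# 12-hour fallback interval from the page; the staleness-check model is an assumption.
FALLBACK_INTERVAL_S = 12 * 60 * 60


@dataclass(frozen=True)
class Signal:
    """Hypothetical run signal; the field name is illustrative only."""
    splay_s: float  # upper bound, in seconds, on the random delay before running


def next_run_delay(signal: Optional[Signal], last_run_ts: float, now_ts: float,
                   rng: random.Random) -> Optional[float]:
    """Return seconds to wait before starting chef-client, or None for no run.

    - Signal received: wait a uniform random delay in [0, splay_s], so nodes that
      receive the same signal do not all converge at the same instant.
    - No signal: run immediately only if the fallback interval has elapsed.
    """
    if signal is not None:
        return rng.uniform(0.0, signal.splay_s)
    if now_ts - last_run_ts >= FALLBACK_INTERVAL_S:
        return 0.0
    return None


rng = random.Random(42)  # seeded so the example is reproducible
delays = [next_run_delay(Signal(splay_s=300), 0, 60, rng) for _ in range(5)]
print(all(0 <= d <= 300 for d in delays))          # True
print(next_run_delay(None, 0, 60 * 60, rng))       # None: no signal, under 12 h
print(next_run_delay(None, 0, 13 * 60 * 60, rng))  # 0.0: fallback fires
```

Because splay travels in the signal, the spread can be tuned per signal, which is the operational lever the page describes.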
Architectural alternative rejected
Slack considered migrating to Chef Policyfiles, in which a single policy file per node replaces roles and environments. That would have made many of the phase-2 improvements easier, but Slack rejected it because "it would have meant replacing roles and environments and asking dozens of teams to change their cookbooks. In the long run, it might have made things safer, but in the short term it would have been a huge effort and added more risk than it solved." This is a canonical incremental-over-greenfield trade-off at the fleet-configuration-management altitude.
Caveats
- Stub-level. Chef's own architecture (server, client, attribute system, compile phase, converge phase, resource model, handler model, Knife CLI, Ohai node-attribute collector, etc.) is not canonicalised here.
- Slack-specific lens. This page documents Chef through Slack's usage pattern; generic Chef usage will differ.
- Vendor context. Chef, the company, was acquired by Progress
Software in 2020; the chef-client is open source.
Seen in
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — the named configuration-management substrate under Slack's EC2 deploy-safety phase-2 work.