SYSTEM Cited by 1 source

Chef

What it is

Chef is a configuration-management substrate for declaring and enforcing the desired state of servers (files, packages, users, services, network configuration, etc.). A node runs the chef-client agent, which fetches a pinned set of cookbooks (Ruby DSL definitions of how to reach a state) from a Chef server, compiles them against the node's attributes, and converges the node to the declared state. Environments pin cookbook versions, roles group recipes, and data bags hold shared data.
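
The converge step described above can be illustrated with a minimal test-and-repair sketch. This is not Chef's implementation or API: `Node`, `converge`, and the package and file values are hypothetical names chosen only to show the idea that a run changes what differs from the declared state and that a second run on a converged node changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a server's current state (not a Chef type)."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)  # path -> content


def converge(node, desired_packages, desired_files):
    """Bring `node` to the declared state and return the changes made."""
    changes = []
    # Test-and-repair: act only where current state differs from desired state.
    for pkg in sorted(desired_packages - node.packages):
        node.packages.add(pkg)
        changes.append(f"install {pkg}")
    for path, content in sorted(desired_files.items()):
        if node.files.get(path) != content:
            node.files[path] = content
            changes.append(f"write {path}")
    return changes


# Illustrative values only.
node = Node(packages={"openssl"})
desired_packages = {"openssl", "nginx"}
desired_files = {"/etc/motd": "managed by config management\n"}

first = converge(node, desired_packages, desired_files)
second = converge(node, desired_packages, desired_files)
print(first)   # changes applied on the first run
print(second)  # a converged node needs no further changes
```

Returning the list of changes makes the idempotence visible: the first call reports work done, the second reports none.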

This page is a stub anchoring Chef as the named substrate in Slack's EC2 / configuration-management stack; it is not a full canonicalisation of the Chef ecosystem.

Named role in Slack's Chef stack

Chef is the legacy EC2 configuration-management substrate at Slack (in Slack Engineering's own words, "at Slack, keeping our service reliable is always the top priority"). Slack's historical shape was one shared prod Chef environment with cron-driven chef-client runs every few hours per node, the cron timing staggered across AZs to limit blast radius. Slack's 2024 and 2025 Chef posts extended this shape in two phases.

As of the publish date of the 2025-10-23 post, the legacy Chef-based EC2 platform is marked feature-complete and in maintenance mode, with Shipyard as the upcoming EC2 successor for teams that can't yet move to Bedrock.

Key primitives named in Slack's usage

  • Cookbook — a versioned artifact pinned per-environment; see concepts/cookbook-artifact-versioning.
  • Environment — a version-pin set. Slack split the single prod environment into prod-1 through prod-6 in phase 2; see concepts/az-bucketed-environment-split.
  • Role — a named bundle of recipes; Slack chose not to migrate to Policyfiles (which would have required all service teams to rewrite their roles), on blast-radius-of-change grounds.
  • chef-client run — the agent invocation that converges a node. Slack switched from fixed-cron-triggered runs to signal-triggered runs via systems/chef-summoner, keeping a 12-hour fallback cron for compliance.
  • Splay — Chef's native per-run randomised jitter; Slack exposes it explicitly in the signal payload for operational tuning; see concepts/splay-randomised-run-jitter.
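
The run-trigger and splay primitives in the list above can be sketched as follows. This is an illustrative model, not Slack's chef-summoner code or Chef's internals: `splay_delay`, `should_run`, and the 300-second maximum are hypothetical, and the sketch assumes splay is a uniform random delay between zero and a configured maximum. Only the 12-hour fallback interval comes from the text.

```python
import random

FALLBACK_INTERVAL_S = 12 * 60 * 60  # the 12-hour fallback cron named above


def splay_delay(max_splay_s, rng):
    """Pick a uniformly random delay in [0, max_splay_s] seconds."""
    return rng.uniform(0, max_splay_s)


def should_run(now_s, last_run_s, signal_pending):
    """Run when a signal is waiting or the fallback interval has elapsed."""
    if signal_pending:
        return True
    return now_s - last_run_s >= FALLBACK_INTERVAL_S


# Hypothetical fleet of 1000 nodes, each drawing its own delay
# from an assumed 300-second splay window.
rng = random.Random(0)
delays = [splay_delay(300, rng) for _ in range(1000)]
print(min(delays), max(delays))
```

Spreading the start times across the window is what keeps a single signal from making every node hit the Chef server at the same instant, while the fallback check guarantees a run even if no signal ever arrives.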

Architectural alternative rejected

Slack considered migrating to Chef Policyfiles (roles and environments replaced with a single policy file per node), which would have made many of the phase-2 improvements easier, but rejected it because "it would have meant replacing roles and environments and asking dozens of teams to change their cookbooks. In the long run, it might have made things safer, but in the short term it would have been a huge effort and added more risk than it solved." This is a canonical incremental-over-greenfield trade-off at the fleet-configuration-management altitude.

Caveats

  • Stub-level. Chef's own architecture (server, client, attribute system, compile phase, converge phase, resource model, handler model, Knife CLI, Ohai node-attribute collector, etc.) is not canonicalised here.
  • Slack-specific lens. This page documents Chef through Slack's usage pattern; generic Chef usage will differ.
  • Vendor context. Chef the company was acquired by Progress Software in 2020; the chef-client is open source.

Seen in
