Chef Summoner (Slack)¶
What it is¶
Chef Summoner is Slack's per-node service (introduced in
the 2025-10-23 phase-2 rollout) that replaces the fixed-cron
Chef trigger with a signal-driven pull from an S3 signal bucket.
It runs on every Slack-EC2 node, watches the S3 key
corresponding to the node's Chef stack + Chef environment, and
schedules a chef-client run when a new artifact version
appears. It keeps its own local state (last-run-time + artifact
version) for deduplication.
Summoner is the consumer side of Slack's signal-driven fleet-configuration fanout; the producer side is systems/chef-librarian.
Core responsibilities¶
- Watch one S3 key. The key is `chef-run-triggers/<stack>/<env>`, where `<stack>` and `<env>` are attributes of the node (set at boot by Poptart Bootstrap — e.g. `basalt/prod-3`).
- Deduplicate against local state. If the signal's `ManifestRecord.version` matches what's already been run on this node (stored locally by Summoner), skip. Otherwise, proceed.
- Apply splay. Read the `Splay` field from the signal and wait a randomised delay up to that value before starting the Chef run. This prevents thundering-herd load on the Chef server when a promotion lands and all nodes in an environment wake up simultaneously. See concepts/splay-randomised-run-jitter.
- Trigger the Chef run. Invoke `chef-client` against the node's configured stack + environment; chef-client resolves the version pin from the promoted environment on the Chef server.
- Enforce the 12-hour compliance SLA. Even if no new signals arrive, Summoner triggers a Chef run if the last successful run was more than 12 hours ago — this is the compliance floor (nodes must stay in their defined configuration state). Signal-driven + time-driven coexist.
- Update local state on success. Record the new last-run-time + artifact version so subsequent cycles can deduplicate.
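The responsibilities above reduce to one repeating cycle. Below is a minimal sketch in Python — Slack does not disclose Summoner's implementation language, and every name here is hypothetical. The S3 read and the chef-client invocation are passed in as callables so the control flow (watch, dedupe, splay, run, record) stands on its own:

```python
import random
import time
from typing import Callable

COMPLIANCE_FLOOR_S = 12 * 3600  # the 12-hour compliance SLA

def summoner_cycle(
    stack: str,
    env: str,
    fetch_signal: Callable[[str], dict],  # reads the signal at an S3 key (injected)
    run_chef: Callable[[], None],         # invokes chef-client (injected)
    state: dict,                          # local state: artifact_version + last_run_time
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> dict:
    """One watch -> dedupe -> splay -> run -> record-state cycle."""
    key = f"chef-run-triggers/{stack}/{env}"  # per-(stack, env) signal key
    signal = fetch_signal(key)
    version = signal["ManifestRecord"]["version"]

    stale = now() - state.get("last_run_time", 0.0) > COMPLIANCE_FLOOR_S
    if version == state.get("artifact_version") and not stale:
        return state  # dedupe: version already applied and within the SLA

    # Splay: randomised delay up to the signal's Splay value, so all nodes
    # in an environment don't hit the Chef server at once.
    sleep(random.uniform(0, signal.get("Splay", 0)))

    run_chef()  # chef-client resolves the version pin from the promoted environment
    return {"artifact_version": version, "last_run_time": now()}
```

Injecting `sleep` and `now` also makes the dedupe and compliance-floor behaviour trivially testable without real delays.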
Position in the pipeline¶
┌─────────────┐ promote ┌──────────────┐ write signal ┌────────┐
│ release │ ───────────▶ │ Chef Librarian│ ──────────────▶ │ S3 │
│ train cron │ (API call) │ │ │ bucket │
└─────────────┘ └──────────────┘ └───┬────┘
│ poll /
│ watch
▼
┌──────────────┐ dedup + splay ┌──────────────┐
│ Chef Summoner│ ──────────────▶ │ chef-client │
│ (on every │ │ run │
│ node) │ │ │
└──────┬───────┘ └──────────────┘
│
▼ (fallback)
┌──────────────┐
│ 12h fallback │ ──── triggers chef-client directly
│ cron (per │ if Summoner hasn't run in 12h
│ node, baked │
│ into AMI) │
└──────────────┘
Why Summoner is architecturally load-bearing¶
Why signal-driven replaced fixed cron¶
Phase 1 of Slack's Chef work (2024) already staggered per-node
cron-driven Chef runs across AZs to bound blast-radius. Phase 2
split the shared prod environment into six AZ-bucketed
environments (prod-1 … prod-6) that now receive updates at
different times (via
release train). A fixed cron is no longer operationally
meaningful when the promotion cadence varies per environment:
- `prod-1` gets new versions every hour (it's the canary).
- `prod-2` through `prod-6` get new versions only when the release train advances, which depends on whether the previous version has completed the cycle.
- No promotions → no new work for Chef to do on most nodes → cron runs become no-ops that consume Chef-server load without reducing drift.
Signal-driven runs only when there's actual new work, plus a 12-hour compliance floor to guarantee drift detection even when no promotions occur.
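That coexistence of the two triggers reduces to a small predicate. A sketch under assumed names — neither the function nor its parameters come from Slack's post, it just encodes the two disclosed conditions:

```python
def run_needed(signal_version: str, applied_version: str,
               seconds_since_last_run: float,
               compliance_floor_s: float = 12 * 3600) -> bool:
    """Signal-driven and time-driven triggers coexist: run chef-client when
    a new artifact version appears, or when the 12-hour compliance floor
    elapses with no promotions at all."""
    return (signal_version != applied_version
            or seconds_since_last_run > compliance_floor_s)
```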
Why Summoner needs a fallback cron¶
From the post (verbatim): "Now that Chef Summoner is the primary mechanism we rely on to trigger Chef runs, it becomes a critical piece of infrastructure. After a node is provisioned, subsequent Chef runs are responsible for keeping Chef Summoner itself up to date with the latest changes. But if we accidentally roll out a broken version of Chef Summoner, it may stop triggering Chef runs altogether — making it impossible to roll out a fixed version using our normal deployment flow."
Slack's mitigation is a fallback cron baked into every AMI that runs independently of Summoner, checks the local state (last-run-time + artifact-version), and triggers a Chef run directly if Summoner hasn't run Chef in the last 12 hours. This is the independent-of-self-updating-subsystem path that gives Slack a recovery route if Summoner itself breaks. See concepts/fallback-cron-for-self-update-safety and patterns/self-update-with-independent-fallback-cron.
Why local state matters¶
Summoner keeps track of its own state locally — last-run-time + artifact-version — for two reasons:
- Deduplication of already-applied signals. If Summoner restarts or polls the same signal twice, it must not run Chef twice for the same version.
- Fallback-cron observability. The fallback cron reads Summoner's local state to decide whether to trigger a direct Chef run; if Summoner's state isn't observable from outside Summoner, the fallback can't decide safely.
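One way to make that state externally observable is a small on-disk file. A sketch under assumed names — Slack discloses the two fields, not the storage format or path:

```python
import json
import os
from pathlib import Path

def write_state(path: Path, artifact_version: str, last_run_time: float) -> None:
    """Persist the two disclosed fields. Written via rename so a reader
    outside Summoner (e.g. the fallback cron) never sees a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"artifact_version": artifact_version,
                               "last_run_time": last_run_time}))
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def read_state(path: Path) -> dict:
    """Empty state (Summoner has never run Chef here) reads as an empty dict."""
    return json.loads(path.read_text()) if path.exists() else {}
```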
Numbers and scale disclosed¶
- Runs on every Slack EC2 node (total count not disclosed).
- Watches one S3 key per node (stack + env).
- Triggers `chef-client` at least every 12 hours.
- Example `Splay` = 15 (units not disclosed).
Numbers not disclosed¶
- S3 poll interval — whether Summoner polls (and at what cadence) or subscribes to S3 event notifications.
- Concrete Splay units (seconds? minutes?).
- Fallback cron's own interval (only the 12-hour threshold for triggering action is stated).
- Summoner's local-state storage format / location.
- Failure semantics — retry cadence, backoff, error escalation, alerting.
- Resource footprint (CPU / memory / network) per node.
Design trade-offs¶
- Pull model (Summoner polls S3) over push model (S3 notifies Summoner via SNS/SQS). Pull scales trivially to arbitrary numbers of nodes without per-node subscription management; push has lower steady-state latency. Slack's disclosure does not confirm which model Summoner uses — the post says "checking the S3 key corresponding to the node's Chef stack and environment" which suggests polling, but doesn't exclude event-driven.
- Per-environment key vs per-node key. Per-env key means all nodes in the same environment read the same signal — cheap fanout (O(env_count) keys, one read per node per poll). Per-node key would allow node-specific targeting but multiply the producer-side cost by fleet size.
- 12-hour compliance floor. Chosen to match Slack's compliance policy (Chef must run at least once every 12 hours); shorter floor would increase Chef-server load without proportional compliance benefit, longer floor would violate policy.
Caveats¶
- Only disclosed in the 2025-10-23 phase-2 post. Before that post, the fixed cron was the canonical trigger; Summoner is new as of that disclosure.
- S3 access pattern undocumented. Whether Summoner polls or uses S3 event notifications affects cost (list/get requests) and latency (minutes vs seconds).
- Relationship with Librarian is producer-consumer, not bidirectional. Summoner doesn't report back to Librarian; Librarian has no awareness of how many nodes have actually consumed a given signal.
- No integration with the separate ReleaseBot / Webapp-backend deploy path. Summoner is specifically for the EC2 / Chef substrate; ReleaseBot is specifically for Webapp backend. Both feed the broader Deploy Safety Program's reliability targets.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Chef Summoner introduced as the signal-driven Chef trigger that replaced fixed cron.
Related¶
- companies/slack
- systems/chef
- systems/chef-librarian
- systems/aws-s3
- systems/poptart-bootstrap
- systems/slack-deploy-safety-program
- systems/slack-releasebot
- concepts/signal-driven-chef-trigger
- concepts/splay-randomised-run-jitter
- concepts/fallback-cron-for-self-update-safety
- concepts/s3-signal-bucket-as-config-fanout
- patterns/signal-triggered-fleet-config-apply
- patterns/self-update-with-independent-fallback-cron