
Chef Summoner (Slack)

What it is

Chef Summoner is Slack's per-node service (introduced in the 2025-10-23 phase-2 rollout) that replaces the fixed-cron Chef trigger with a signal-driven pull from an S3 signal bucket. It runs on every Slack EC2 node, watches the S3 key corresponding to the node's Chef stack + Chef environment, and schedules a chef-client run when a new artifact version appears. It keeps its own local state (last-run time + artifact version) for deduplication.

Summoner is the consumer side of Slack's signal-driven fleet-configuration fanout; the producer side is systems/chef-librarian.

Core responsibilities

  1. Watch one S3 key. The key is chef-run-triggers/<stack>/<env> where <stack> and <env> are attributes of the node (set at boot by Poptart Bootstrap — e.g. basalt/prod-3).
  2. Deduplicate against local state. If the signal's ManifestRecord.version matches what's already been run on this node (stored locally by Summoner), skip. Otherwise, proceed.
  3. Apply Splay. Read the Splay field from the signal and wait a randomised delay up to that value before starting the Chef run. Prevents thundering-herd load on the Chef server when a promotion lands and all nodes in an environment wake up simultaneously. See concepts/splay-randomised-run-jitter.
  4. Trigger the Chef run. Invoke chef-client against the node's configured stack + environment; the chef-client resolves the version pin from the promoted environment on the Chef server.
  5. Enforce the 12-hour compliance SLA. Even if no new signals arrive, Summoner triggers a Chef run if the last successful run was more than 12 hours ago — this is the compliance floor (nodes must stay in their defined configuration state). Signal-driven + time-driven coexist.
  6. Update local state on success. Record the new last-run-time + artifact version so subsequent cycles can deduplicate.
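
The per-cycle decision in steps 2, 3, and 5 can be sketched as follows (a minimal Python sketch; the state-file location and format, and all function names, are assumptions, since Slack has not disclosed Summoner's implementation):

```python
import json
import random
from pathlib import Path

STATE_PATH = Path("/var/lib/chef-summoner/state.json")  # hypothetical location
COMPLIANCE_FLOOR_S = 12 * 3600  # 12-hour compliance SLA from the post

def load_state(path=STATE_PATH):
    """Local state: last successful run time + last applied artifact version."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {"last_run_time": 0.0, "artifact_version": None}

def should_run(signal_version, state, now):
    """Signal-driven: run when a new version appears (step 2's dedup check).
    Time-driven: run anyway once the 12-hour compliance floor expires (step 5)."""
    if signal_version is not None and signal_version != state["artifact_version"]:
        return True
    return now - state["last_run_time"] >= COMPLIANCE_FLOOR_S

def splay_delay(splay):
    """Step 3: randomised wait in [0, splay] so nodes in an environment
    don't all hit the Chef server simultaneously after a promotion."""
    return random.uniform(0, splay)
```

Note that the two triggers compose in one predicate: a node with a fresh signal runs regardless of elapsed time, and a node with no new signal still runs once the floor expires.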

Position in the pipeline

┌─────────────┐  promote     ┌────────────────┐  write signal   ┌────────┐
│ release     │ ───────────▶ │ Chef Librarian │ ──────────────▶ │  S3    │
│ train cron  │  (API call)  │                │                 │ bucket │
└─────────────┘              └────────────────┘                 └───┬────┘
                                                                    │ poll /
                                                                    │ watch
                          ┌──────────────┐  dedup + splay  ┌──────────────┐
                          │ Chef Summoner│ ──────────────▶ │ chef-client  │
                          │  (on every   │                 │  run         │
                          │   node)      │                 │              │
                          └──────┬───────┘                 └──────────────┘
                                 ▼ (fallback)
                          ┌──────────────┐
                          │ 12h fallback │ ──── triggers chef-client directly
                          │ cron (per    │      if Summoner hasn't run in 12h
                          │ node, baked  │
                          │ into AMI)    │
                          └──────────────┘

Why Summoner is architecturally load-bearing

Why signal-driven replaced fixed cron

Phase 1 of Slack's Chef work (2024) already staggered per-node cron-driven Chef runs across AZs to bound blast radius. Phase 2 split the shared prod environment into six AZ-bucketed environments (prod-1 through prod-6) that now receive updates at different times (via the release train). A fixed cron is no longer operationally meaningful when the promotion cadence varies per environment:

  • prod-1 gets new versions every hour (it's the canary).
  • prod-2 through prod-6 get new versions only when the release train advances, which depends on whether the previous version has completed the cycle.
  • No promotions → no new work for Chef to do on most nodes → cron runs become no-ops that consume Chef-server load without reducing drift.

With signal-driven triggering, Chef runs only when there is actual new work, plus a 12-hour compliance floor that guarantees drift detection even when no promotions occur.

Why Summoner needs a fallback cron

From the post (verbatim): "Now that Chef Summoner is the primary mechanism we rely on to trigger Chef runs, it becomes a critical piece of infrastructure. After a node is provisioned, subsequent Chef runs are responsible for keeping Chef Summoner itself up to date with the latest changes. But if we accidentally roll out a broken version of Chef Summoner, it may stop triggering Chef runs altogether — making it impossible to roll out a fixed version using our normal deployment flow."

Slack's mitigation is a fallback cron baked into every AMI that runs independently of Summoner, checks the local state (last-run-time + artifact-version), and triggers a Chef run directly if Summoner hasn't run Chef in the last 12 hours. This is the independent-of-self-updating-subsystem path that gives Slack a recovery route if Summoner itself breaks. See concepts/fallback-cron-for-self-update-safety and patterns/self-update-with-independent-fallback-cron.
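
That fallback path can be sketched as follows (a minimal Python sketch; the state-file path and JSON layout are assumptions, since the post discloses only the 12-hour threshold and the fact that the cron reads Summoner's local state):

```python
import json
import subprocess
import time
from pathlib import Path

STATE_PATH = Path("/var/lib/chef-summoner/state.json")  # assumed location/format
THRESHOLD_S = 12 * 3600  # only the 12-hour threshold is disclosed

def summoner_is_stale(now, path=STATE_PATH):
    """True if Summoner's own record shows no Chef run in the last 12 hours."""
    try:
        state = json.loads(path.read_text())
    except (FileNotFoundError, ValueError):
        return True  # missing or corrupt state: assume Summoner is broken
    return now - state.get("last_run_time", 0.0) > THRESHOLD_S

def fallback_check(now=None):
    """Invoked from a cron baked into the AMI, independent of Summoner."""
    now = time.time() if now is None else now
    if summoner_is_stale(now):
        subprocess.run(["chef-client"], check=False)  # bypass Summoner entirely
```

The key property is independence: the cron shares nothing with Summoner except the read-only state file, so a broken Summoner cannot take the recovery path down with it.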

Why local state matters

Summoner keeps track of its own state locally — last-run-time + artifact-version — for two reasons:

  1. Deduplication of already-applied signals. If Summoner restarts or polls the same signal twice, it must not run Chef twice for the same version.
  2. Fallback-cron observability. The fallback cron reads Summoner's local state to decide whether to trigger a direct Chef run; if Summoner's state isn't observable from outside Summoner, the fallback can't decide safely.
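
Because the fallback cron reads this state from outside the Summoner process, updates should be atomic so a concurrent reader never sees a partial write. A sketch of such an update, with the path and JSON layout again assumed (Slack does not disclose the storage format):

```python
import json
import os
import tempfile
from pathlib import Path

STATE_PATH = Path("/var/lib/chef-summoner/state.json")  # assumed; not disclosed

def record_success(version, run_time, path=STATE_PATH):
    """Persist last-run time + artifact version so both Summoner (dedup)
    and the fallback cron (staleness check) can read a consistent snapshot."""
    state = {"last_run_time": run_time, "artifact_version": version}
    fd, tmp = tempfile.mkstemp(dir=str(path.parent))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename: readers never see partial writes
    except BaseException:
        os.unlink(tmp)
        raise
```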

Numbers and scale disclosed

  • Runs on every Slack EC2 node (total count not disclosed).
  • Watches one S3 key per node (stack + env).
  • Triggers chef-client at least every 12 hours.
  • Example Splay = 15 (units not disclosed).

Numbers not disclosed

  • S3 poll interval — whether Summoner polls (and at what cadence) or subscribes to S3 event notifications.
  • Concrete Splay units (seconds? minutes?).
  • Fallback cron's own interval (only the 12-hour threshold for triggering action is stated).
  • Summoner's local-state storage format / location.
  • Failure semantics — retry cadence, backoff, error escalation, alerting.
  • Resource footprint (CPU / memory / network) per node.

Design trade-offs

  • Pull model (Summoner polls S3) over push model (S3 notifies Summoner via SNS/SQS). Pull scales trivially to arbitrary numbers of nodes without per-node subscription management; push has lower steady-state latency. Slack's disclosure does not confirm which model Summoner uses — the post says "checking the S3 key corresponding to the node's Chef stack and environment" which suggests polling, but doesn't exclude event-driven.
  • Per-environment key vs per-node key. Per-env key means all nodes in the same environment read the same signal — cheap fanout (O(env_count) keys, one read per node per poll). Per-node key would allow node-specific targeting but multiply the producer-side cost by fleet size.
  • 12-hour compliance floor. Chosen to match Slack's compliance policy (Chef must run at least once every 12 hours); shorter floor would increase Chef-server load without proportional compliance benefit, longer floor would violate policy.
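
If Summoner does poll (the post's wording suggests this but does not confirm it), a cheap pattern is one HEAD request per cycle, paying for a GET only when the object has changed. A sketch against a boto3-style client (the key layout comes from the post; the client, bucket, and ETag-comparison technique are assumptions):

```python
def check_signal(s3, bucket, stack, env, last_etag):
    """One poll of the per-environment signal key chef-run-triggers/<stack>/<env>.

    A HEAD request suffices to detect a new artifact; only when the ETag
    differs do we issue a GET to read the full signal (version + Splay).
    """
    key = f"chef-run-triggers/{stack}/{env}"
    head = s3.head_object(Bucket=bucket, Key=key)
    if head["ETag"] == last_etag:
        return None, last_etag  # unchanged: nothing to do this cycle
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return body, head["ETag"]
```

Under this pattern the steady-state cost is O(nodes) HEAD requests per poll interval against O(env_count) keys, which is what makes the per-environment key cheap at fleet scale.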

Caveats

  • Only disclosed in the 2025-10-23 phase-2 post. Prior to that post, cron was canonical; Summoner is new as of this post.
  • S3 access pattern undocumented. Whether Summoner polls or uses S3 event notifications affects cost (list/get requests) and latency (minutes vs seconds).
  • Relationship with Librarian is producer-consumer, not bidirectional. Summoner doesn't report back to Librarian; Librarian has no awareness of how many nodes have actually consumed a given signal.
  • No integration with the separate ReleaseBot / Webapp-backend deploy path. Summoner is specifically for the EC2 / Chef substrate; ReleaseBot is specifically for Webapp backend. Both feed the broader Deploy Safety Program's reliability targets.
