CONCEPT Cited by 1 source

Signal-driven Chef trigger

Definition

Signal-driven Chef trigger is the architectural choice of running chef-client only when a new cookbook version is available (via an out-of-band signal), rather than on a fixed time schedule. It replaces the classic cron-driven model ("run Chef every N hours on every node") with a pull from a shared signal bus that tells each node whether there is new work to do.

The canonical instance is Slack's 2025-10-23 phase-2 design (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): Chef Librarian writes a JSON signal to an S3 bucket on every cookbook-version promotion; the per-node service Chef Summoner watches the signal key for its (stack, env) and runs chef-client when a new version appears.
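As a rough illustration of the producer side, the sketch below builds an object key and a JSON payload for one (stack, env) pair. The key layout, the field names (`artifact_version`, `promoted_at`, `splay_seconds`), the stack name `webapp`, and both helper functions are assumptions made for this example; the source post does not publish the actual signal schema.

```python
import json
from datetime import datetime, timezone


def signal_key(stack: str, env: str) -> str:
    """Return a hypothetical S3 object key for one (stack, env) signal."""
    return f"signals/{stack}/{env}/latest.json"


def build_signal(version: str, splay_seconds: int, promoted_at: datetime) -> str:
    """Serialize a hypothetical promotion signal as JSON."""
    payload = {
        "artifact_version": version,
        "promoted_at": promoted_at.astimezone(timezone.utc).isoformat(),
        "splay_seconds": splay_seconds,
    }
    return json.dumps(payload, sort_keys=True)


# Example values are illustrative only.
key = signal_key("webapp", "prod-1")
body = build_signal(
    "2025.10.23-1", 300, datetime(2025, 10, 23, 12, 0, tzinfo=timezone.utc)
)
```

One key per (stack, env) means each node only ever reads the single object that applies to it.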

Why replace cron

The cron-driven model was operationally predictable

Historically, Chef users ran chef-client on a cron schedule (every 30 minutes / every hour / every few hours). Staggering per-AZ or per-host prevented simultaneous fleet-wide load. Slack had done this: "the timing of these cron jobs was staggered across availability zones, helping us avoid running Chef on all nodes simultaneously."

The model had two load-bearing properties:

  1. Every node runs Chef eventually, which is compliance-friendly.
  2. When a promotion lands, Chef eventually picks it up: cookbook changes propagate within the cron interval.

Fixed cron breaks when promotion cadence varies

When Slack split the single prod environment into six AZ-bucketed environments, prod-1 through prod-6 (see concepts/az-bucketed-environment-split), each environment began receiving promotions at different rates through a release-train rollout:

  • prod-1 gets a new version hourly (it's the canary).
  • prod-2 through prod-6 get new versions only when the release train advances, which is gated on the previous version making it through the whole train.

Under this regime, a fixed cron schedule is no longer operationally meaningful. In the post's words: "we can't reliably predict when a given Chef environment will receive new changes. As a result, we've moved away from scheduled runs and instead built a new service that triggers Chef runs on nodes based on signals."

The load-bearing property: run when there's work

Signal-driven triggers change the contract from "run Chef on a schedule, probably do nothing, sometimes do new work" to "run Chef only when there's new work, plus a compliance floor to catch drift."

This has three second-order benefits:

  • Less Chef-server load. No more cron-driven no-op runs.
  • Lower per-node overhead. Nodes don't spin up the Chef subsystem to do nothing.
  • Shorter propagation latency when work is there. The signal fires immediately on promotion; the node picks it up at its next poll (or subscription event), not at the next cron boundary.

The signal-driven trigger is composite

Slack's Chef Summoner combines two triggers, not just one:

  1. Signal-driven (fast path): when a new artifact version appears on the S3 signal bucket, run Chef with the signal's configured Splay.
  2. Time-floored (compliance path): if the last Chef run was more than 12 hours ago, run Chef anyway to enforce compliance even when no promotions have occurred.

Both are needed. Signal-only would violate the 12-hour compliance SLA during a long promotion-drought; time-only reverts to the classic cron model.
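The two triggers can be sketched as a single decision function. This is a minimal illustration, not Summoner's actual logic: the function name `run_reason`, its parameters, and the string labels are hypothetical, and only the 12-hour floor comes from the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COMPLIANCE_FLOOR = timedelta(hours=12)  # the 12-hour floor described above


def run_reason(
    now: datetime,
    last_run: Optional[datetime],
    last_applied_version: Optional[str],
    signal_version: Optional[str],
) -> Optional[str]:
    """Return why Chef should run now, or None if there is nothing to do."""
    # Fast path: the signal names a version this node has not applied yet.
    if signal_version is not None and signal_version != last_applied_version:
        return "signal"
    # Compliance path: no run on record, or the last run is older than the floor.
    if last_run is None or now - last_run > COMPLIANCE_FLOOR:
        return "compliance"
    return None


now = datetime(2025, 10, 23, 12, 0, tzinfo=timezone.utc)
fresh = run_reason(now, now - timedelta(hours=1), "v1", "v2")   # new version
stale = run_reason(now, now - timedelta(hours=13), "v1", "v1")  # floor exceeded
idle = run_reason(now, now - timedelta(hours=1), "v1", "v1")    # nothing to do
```

Passing `signal_version=None` models a failed or empty signal read; in that case the function falls through to the time floor, so a node with no readable signal still runs once the floor is exceeded.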

Local state enables deduplication

Chef Summoner keeps local state on every node (last run time and last applied artifact version) so that polling the signal bus doesn't re-run Chef for versions that have already been applied. The signal bus therefore only needs at-least-once delivery: duplicates are filtered by consumer-side dedup rather than by forcing exactly-once semantics on the producer.
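A minimal sketch of consumer-side dedup backed by a local state file follows. The file format, the `NodeState` field names, and the helper functions are assumptions for illustration; the source only says that the last run time and the last applied artifact version are tracked locally.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NodeState:
    last_run_time: Optional[float] = None      # Unix seconds
    last_applied_version: Optional[str] = None


def load_state(path: str) -> NodeState:
    """Read state from disk; a missing file means the node has never run."""
    try:
        with open(path, encoding="utf-8") as f:
            return NodeState(**json.load(f))
    except FileNotFoundError:
        return NodeState()


def save_state(path: str, state: NodeState) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(asdict(state), f)
    os.replace(tmp, path)


def is_duplicate(state: NodeState, signal_version: str) -> bool:
    """True when the signalled version was already applied on this node."""
    return state.last_applied_version == signal_version
```

Because the check lives on the consumer, the producer can rewrite or re-deliver the same signal any number of times without causing extra Chef runs.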

Design trade-offs vs cron

  • Signal bus cost. Chef Server load goes down, but an S3 signal bus adds operational cost (bucket, IAM policy, poll traffic from every node).
  • New critical path. The signal bus + Chef Summoner is a new must-work subsystem. Breakage is mitigated by the fallback cron.
  • New latency surface. If Summoner is slow to poll, nodes converge slower than with fixed cron. Polling cadence must be tuned to balance cost against latency.
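The cost-versus-latency tension can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes Summoner polls on a fixed interval (which the source does not confirm), that each poll is one read request, and that promotions land at a uniformly random point within the interval; the fleet size of 10,000 nodes is a made-up figure.

```python
SECONDS_PER_DAY = 86_400


def polls_per_day(nodes: int, poll_interval_s: float) -> float:
    """Total signal reads per day across the fleet, one read per poll."""
    return nodes * SECONDS_PER_DAY / poll_interval_s


def mean_pickup_delay_s(poll_interval_s: float) -> float:
    """Average wait until a node's next poll under the uniform-arrival assumption."""
    return poll_interval_s / 2


# Hypothetical fleet of 10,000 nodes at three candidate poll intervals.
table = {
    interval: (polls_per_day(10_000, interval), mean_pickup_delay_s(interval))
    for interval in (30, 60, 300)
}
```

Halving the interval halves the average pickup delay but doubles the request volume, which is the trade-off the polling cadence has to balance. Any configured splay adds to the pickup delay shown here.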

Caveats

  • Polling vs push undocumented. Slack's post does not specify whether Summoner polls S3 on a timer or subscribes to S3 event notifications. Each has different cost / latency / reliability trade-offs.
  • Signal schema evolution is a concern. Adding new fields to the Librarian-written JSON signal is straightforward, but removing or renaming fields requires careful rollout — the signal bus is a cross-version contract between Librarian and Summoner.
  • Pull model assumes S3 availability at the signal-read path. An S3 regional outage means no new Chef runs until S3 recovers; the fallback cron catches this after 12 hours (compliance floor) but not before.
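One common way to soften the schema-evolution concern is a tolerant reader on the consumer side: ignore unknown fields, default optional ones, and fail loudly only when a required field is missing. The sketch below illustrates that pattern with hypothetical field names (`artifact_version`, `splay_seconds`); it is not taken from Slack's implementation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    artifact_version: str
    splay_seconds: int = 0  # optional field with a safe default


def parse_signal(raw: str) -> Signal:
    """Parse a signal, ignoring unknown fields and defaulting optional ones."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "artifact_version" not in data:
        raise ValueError("signal is missing required field 'artifact_version'")
    return Signal(
        artifact_version=str(data["artifact_version"]),
        splay_seconds=int(data.get("splay_seconds", 0)),
    )
```

With this shape, a newer producer can add fields without breaking older consumers, while a rename of `artifact_version` surfaces as an explicit error instead of a silent no-op.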

Seen in
