CONCEPT Cited by 1 source
Signal-driven Chef trigger¶
Definition¶
Signal-driven Chef trigger is the architectural choice of
running chef-client only when a new cookbook version is
available (via an out-of-band signal), rather than on a fixed
time schedule. It replaces the classic cron-driven model ("run
Chef every N hours on every node") with a pull from a shared
signal bus that tells each node whether there is new work to
do.
The canonical instance is Slack's 2025-10-23 phase-2 design
(Source:
sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption):
Chef Librarian writes a JSON signal
to an S3 bucket on every cookbook-version promotion; the per-node
service Chef Summoner watches the
signal key for its (stack, env) and runs chef-client when a
new version appears.
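The producer/consumer contract described above can be sketched as follows. This is a minimal illustration, not Slack's implementation: the key layout `chef-signals/{stack}/{env}.json`, the field names `version` and `splay_seconds`, and the use of a plain dict as a stand-in for the S3 bucket are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Signal:
    version: str        # cookbook artifact version that was promoted
    splay_seconds: int  # jitter budget the consumer applies before running


def signal_key(stack: str, env: str) -> str:
    # Hypothetical key layout: one signal object per (stack, env) pair.
    return f"chef-signals/{stack}/{env}.json"


def write_signal(bucket: Dict[str, str], stack: str, env: str, signal: Signal) -> None:
    # Producer side (the Librarian role): overwrite the key on each promotion.
    bucket[signal_key(stack, env)] = json.dumps(
        {"version": signal.version, "splay_seconds": signal.splay_seconds}
    )


def read_signal(bucket: Dict[str, str], stack: str, env: str) -> Optional[Signal]:
    # Consumer side (the Summoner role): read only the key for this node's pair.
    raw = bucket.get(signal_key(stack, env))
    if raw is None:
        return None  # nothing has been promoted to this (stack, env) yet
    doc = json.loads(raw)
    return Signal(version=doc["version"], splay_seconds=int(doc["splay_seconds"]))
```

Because each node reads a single well-known key, the producer never needs to know which nodes exist; fan-out is handled entirely by the readers.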
Why replace cron¶
The cron-driven model was operationally predictable¶
Historically, Chef users ran chef-client on a cron schedule
(every 30 minutes / every hour / every few hours). Staggering
per-AZ or per-host prevented simultaneous fleet-wide load. Slack
had done this: "the timing of these cron jobs was staggered
across availability zones, helping us avoid running Chef on all
nodes simultaneously."
The model had two load-bearing properties:
1. Every node runs Chef eventually, which is compliance-friendly.
2. When a promotion lands, Chef eventually picks it up: cookbook changes propagate within the cron interval.
Fixed cron breaks when promotion cadence varies¶
When Slack split the single prod environment into six
AZ-bucketed prod-1 … prod-6 environments (see
concepts/az-bucketed-environment-split), each environment
began receiving promotions at different rates through a
release-train rollout:
- prod-1 gets a new version hourly (it's the canary).
- prod-2 through prod-6 get new versions only when the release train advances, which is gated on the previous version making it through the whole train.
Under this regime, a fixed cron schedule is no longer operationally meaningful. In the post's words: "we can't reliably predict when a given Chef environment will receive new changes. As a result, we've moved away from scheduled runs and instead built a new service that triggers Chef runs on nodes based on signals."
The load-bearing property: run when there's work¶
Signal-driven triggers change the contract from "run Chef on a schedule, probably do nothing, sometimes do new work" to "run Chef only when there's new work, plus a compliance floor to catch drift."
This has three second-order benefits:
- Less Chef-server load. No more cron-driven no-op runs.
- Lower per-node overhead. Nodes don't spin up the Chef subsystem to do nothing.
- Shorter propagation latency when work is there. The signal fires immediately on promotion; the node picks it up at its next poll (or subscription event), not at the next cron boundary.
The signal-driven trigger is composite¶
Slack's Chef Summoner combines two triggers, not just one:
- Signal-driven (fast path): when a new artifact version appears on the S3 signal bucket, run Chef with the signal's configured Splay.
- Time-floored (compliance path): if the last Chef run was at least 12 hours ago, run Chef anyway to enforce compliance even when no promotions have occurred.
Both are needed. Signal-only would violate the 12-hour compliance SLA during a long promotion-drought; time-only reverts to the classic cron model.
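The two triggers can be combined in a single decision function. The sketch below is an illustration of the composite rule only; the function name `run_reason` and its parameters are hypothetical, and the only value taken from the text is the 12-hour floor.

```python
from datetime import datetime, timedelta
from typing import Optional

COMPLIANCE_FLOOR = timedelta(hours=12)  # the compliance window named in the text


def run_reason(
    now: datetime,
    last_run: Optional[datetime],
    applied_version: Optional[str],
    signal_version: Optional[str],
) -> Optional[str]:
    """Return why Chef should run now, or None if it should not run."""
    # Fast path: the signal advertises a version this node has not applied.
    if signal_version is not None and signal_version != applied_version:
        return "signal"
    # Compliance path: no new work, but 12 hours or more since the last run
    # (or the node has never run Chef at all).
    if last_run is None or now - last_run >= COMPLIANCE_FLOOR:
        return "compliance"
    return None
```

Note that the compliance branch does not depend on the signal being readable: passing `signal_version=None` (for example, when the read fails) still produces a run once the floor is reached.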
Local state enables deduplication¶
Chef Summoner keeps local state on every node (last-run-time and last-applied-artifact-version), so polling the signal bus doesn't re-run Chef for versions that have already been applied. This lets the signal bus offer at-least-once delivery, with deduplication on the consumer side, rather than forcing exactly-once semantics on the producer.
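Consumer-side deduplication of this kind might look like the sketch below. The state file format, the `LocalState` class, and the `handle_signal` helper are assumptions for illustration; the source only says that last-run time and last-applied version are tracked locally.

```python
import json
import os
from pathlib import Path
from typing import Callable, Optional


class LocalState:
    """Per-node record of the last run time and last applied version."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        if not self.path.exists():
            return {"last_run_time": None, "last_applied_version": None}
        return json.loads(self.path.read_text())

    def record_run(self, version: Optional[str], ran_at: float) -> None:
        # Write to a temp file, then rename, so readers do not see a
        # half-written state file.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(
            json.dumps({"last_run_time": ran_at, "last_applied_version": version})
        )
        os.replace(tmp, self.path)


def handle_signal(
    state: LocalState, version: str, run_chef: Callable[[str], None], now: float
) -> bool:
    """Run Chef for `version` unless already applied; return True if it ran."""
    if state.load()["last_applied_version"] == version:
        return False  # duplicate delivery of a signal already acted on
    run_chef(version)
    state.record_run(version, now)  # recorded only after a successful run
    return True
```

Since the state is written only after `run_chef` returns, a run that raises leaves the old version recorded, and the next read of the same signal retries it.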
Design trade-offs vs cron¶
- Signal bus cost. Chef Server load goes down, but an S3 signal bus adds operational cost (bucket, IAM policy, poll traffic from every node).
- New critical path. The signal bus + Chef Summoner is a new must-work subsystem. Breakage is mitigated by the fallback cron.
- New latency surface. If Summoner is slow to poll, nodes converge more slowly than with a fixed cron. Polling cadence must be tuned to balance cost against latency.
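The cost/latency balance in the last two points can be estimated with back-of-envelope arithmetic. The sketch below assumes a polling consumer, and the fleet size, poll intervals, and splay value are invented for illustration; the source states none of them.

```python
SECONDS_PER_DAY = 86_400


def poll_requests_per_day(nodes: int, poll_interval_s: int) -> int:
    # Each node issues one signal read per poll interval.
    return nodes * (SECONDS_PER_DAY // poll_interval_s)


def worst_case_pickup_s(poll_interval_s: int, splay_s: int) -> int:
    # A signal written just after a poll waits one full interval to be seen,
    # then up to the full splay before Chef actually starts.
    return poll_interval_s + splay_s


# Hypothetical fleet of 10,000 nodes with a 300-second splay.
for interval in (30, 60, 300):
    print(
        interval,
        poll_requests_per_day(10_000, interval),
        worst_case_pickup_s(interval, 300),
    )
```

Halving the poll interval doubles read traffic but, once the splay dominates, shaves only a small fraction off the worst-case pickup time.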
Caveats¶
- Polling vs push undocumented. Slack's post does not specify whether Summoner polls S3 on a timer or subscribes to S3 event notifications. Each has different cost / latency / reliability trade-offs.
- Signal schema evolution is a concern. Adding new fields to the Librarian-written JSON signal is straightforward, but removing or renaming fields requires careful rollout — the signal bus is a cross-version contract between Librarian and Summoner.
- Pull model assumes S3 availability at the signal-read path. An S3 regional outage means no new Chef runs until S3 recovers; the fallback cron catches this after 12 hours (compliance floor) but not before.
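For the schema-evolution caveat, one common mitigation is a tolerant reader: the consumer ignores fields it does not know and supplies defaults for optional ones, so the producer can add fields without a coordinated rollout. The sketch below is a generic illustration; the field names and the split into required and optional fields are hypothetical, not taken from Slack's signal format.

```python
import json

REQUIRED_FIELDS = ("version",)            # hypothetical: every reader needs these
OPTIONAL_DEFAULTS = {"splay_seconds": 0}  # hypothetical: safe fallbacks


def parse_signal(raw: str) -> dict:
    doc = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        # Removing or renaming a required field breaks old readers, which is
        # why such changes need a staged rollout.
        raise ValueError(f"signal is missing required fields: {missing}")
    parsed = {name: doc[name] for name in REQUIRED_FIELDS}
    for name, default in OPTIONAL_DEFAULTS.items():
        parsed[name] = doc.get(name, default)
    # Any other keys in `doc` are ignored, so newly added fields are harmless.
    return parsed
```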
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Slack's phase-2 Chef design replaces fixed per-node cron with Chef Summoner reading a signal from an S3 bucket populated by Chef Librarian on every promotion.
Related¶
- concepts/s3-signal-bucket-as-config-fanout
- concepts/splay-randomised-run-jitter
- concepts/az-bucketed-environment-split
- concepts/fallback-cron-for-self-update-safety
- patterns/signal-triggered-fleet-config-apply
- patterns/release-train-rollout-with-canary
- systems/chef
- systems/chef-summoner
- systems/chef-librarian