CONCEPT Cited by 1 source

Signal-driven Chef trigger¶

Definition¶

Signal-driven Chef trigger is the architectural choice of running chef-client only when a new cookbook version is available, as announced by an out-of-band signal, rather than on a fixed time schedule. It replaces the classic cron-driven model ("run Chef every N hours on every node") with a pull from a shared signal bus that tells each node whether there is new work to do.

The canonical instance is Slack's 2025-10-23 phase-2 design (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): Chef Librarian writes a JSON signal to an S3 bucket on every cookbook-version promotion; the per-node service Chef Summoner watches the signal key for its (stack, env) and runs chef-client when a new version appears.
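As a rough illustration of that handshake, the sketch below builds the kind of JSON document Librarian might write and the object key a Summoner instance might watch. The key layout (`signals/<stack>/<env>.json`) and the field names (`version`, `splay_seconds`, `promoted_at`) are assumptions made for this example; the source does not publish the actual schema or bucket layout, and no real S3 call is made here.

```python
import json
from datetime import datetime, timezone


def signal_key(stack: str, env: str) -> str:
    """Hypothetical S3 object key for one (stack, env) pair."""
    return f"signals/{stack}/{env}.json"


def build_signal(version: str, splay_seconds: int, promoted_at: datetime) -> str:
    """Serialize a promotion signal; field names are illustrative only."""
    payload = {
        "version": version,
        "splay_seconds": splay_seconds,
        "promoted_at": promoted_at.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(payload, sort_keys=True)


def read_signal(body: str) -> dict:
    """Parse a signal document as a node-side watcher would."""
    return json.loads(body)
```

Keeping one key per (stack, env) means each node reads a single small object instead of listing the bucket, which is one plausible reason to scope the signal that way.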

Why replace cron¶

The cron-driven model was operationally predictable¶

Historically, Chef users ran chef-client on a cron schedule (every 30 minutes / every hour / every few hours). Staggering per-AZ or per-host prevented simultaneous fleet-wide load. Slack had done this: "the timing of these cron jobs was staggered across availability zones, helping us avoid running Chef on all nodes simultaneously."

The model had two load-bearing properties:

1. Every node runs Chef eventually, which is compliance-friendly.
2. When a promotion lands, Chef eventually picks it up: cookbook changes propagate within the cron interval.

Fixed cron breaks when promotion cadence varies¶

When Slack split the single prod environment into six AZ-bucketed prod-1 … prod-6 environments (see concepts/az-bucketed-environment-split), each environment began receiving promotions at different rates through a release-train rollout:

prod-1 gets a new version hourly (it's the canary).
prod-2 through prod-6 get new versions only when the release train advances, which is gated on the previous version making it through the whole train.

Under this regime, a fixed cron is no longer operationally meaningful: verbatim, "we can't reliably predict when a given Chef environment will receive new changes. As a result, we've moved away from scheduled runs and instead built a new service that triggers Chef runs on nodes based on signals."

The load-bearing property: run when there's work¶

Signal-driven triggers change the contract from "run Chef on a schedule, probably do nothing, sometimes do new work" to "run Chef only when there's new work, plus a compliance floor to catch drift."

This has three second-order benefits:

- Less Chef-server load. No more cron-driven no-op runs.
- Lower per-node overhead. Nodes don't spin up the Chef subsystem to do nothing.
- Shorter propagation latency when work is there. The signal fires immediately on promotion; the node picks it up at its next poll (or subscription event), not at the next cron boundary.

The signal-driven trigger is composite¶

Slack's Chef Summoner combines two triggers, not just one:

Signal-driven (fast path): when a new artifact version appears on the S3 signal bucket, run Chef with the signal's configured Splay.
Time-floored (compliance path): if the last Chef run was more than 12 hours ago, run Chef anyway to enforce compliance even when no promotions have occurred.

Both are needed. Signal-only would violate the 12-hour compliance SLA during a long promotion-drought; time-only reverts to the classic cron model.
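The two paths can be sketched as a single decision function. This is a minimal illustration, not Summoner's actual code: the function name, its parameters, and the returned reason strings are hypothetical, and only the 12-hour floor comes from the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COMPLIANCE_FLOOR = timedelta(hours=12)  # the 12-hour floor described above


def decide_run(
    now: datetime,
    last_run: Optional[datetime],
    last_applied_version: Optional[str],
    signal_version: Optional[str],
) -> Optional[str]:
    """Return the reason to run chef-client, or None to stay idle."""
    # Fast path: the signal names a version this node has not applied yet.
    if signal_version is not None and signal_version != last_applied_version:
        return "signal"
    # Compliance path: no run on record, or the last one is older than the floor.
    if last_run is None or now - last_run > COMPLIANCE_FLOOR:
        return "compliance"
    return None
```

Passing `signal_version=None` models a node that could not read the signal at all; in that case only the compliance path can fire, which matches the behaviour described for a promotion drought.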

Local state enables deduplication¶

Chef Summoner keeps local state on every node (last run time and last applied artifact version), so polling the signal bus doesn't re-run Chef for versions that have already been applied. The signal bus therefore only needs at-least-once semantics: duplicates are dropped by consumer-side dedup rather than by forcing exactly-once semantics on the producer.
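A minimal sketch of consumer-side dedup follows, assuming the node-local state is a small JSON file. The file format, the field names, and the helper names are hypothetical; the source only says that the last run time and last applied artifact version are tracked locally.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read node-local state; a missing file means nothing has been applied."""
    if not path.exists():
        return {"last_run_time": None, "last_applied_version": None}
    return json.loads(path.read_text())


def save_state(path: Path, last_run_time: float, version: str) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump({"last_run_time": last_run_time, "last_applied_version": version}, f)
    os.replace(tmp, path)


def is_duplicate(state: dict, signal_version: str) -> bool:
    """True when the signalled version was already applied on this node."""
    return state["last_applied_version"] == signal_version
```

Because the check is a plain equality on the version, reading the same signal any number of times is harmless, which is what lets the producer stay simple.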

Design trade-offs vs cron¶

Signal bus cost. Chef Server load goes down, but an S3 signal bus adds operational cost (bucket, IAM policy, poll traffic from every node).
New critical path. The signal bus + Chef Summoner is a new must-work subsystem. Breakage is mitigated by the fallback cron.
New latency surface. If Summoner is slow to poll, nodes converge more slowly than with fixed cron. Polling cadence must be tuned to balance cost against latency.
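The cost-versus-latency balance can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a polling design (the source does not confirm polling over push) and uses made-up fleet sizes and intervals, not Slack's figures.

```python
def poll_tradeoff(nodes: int, poll_interval_s: float) -> dict:
    """Estimate signal-read volume and detection delay for a polling design."""
    polls_per_node_per_day = 86_400 / poll_interval_s
    return {
        "reads_per_day": nodes * polls_per_node_per_day,
        # A promotion lands at a uniformly random point in the poll cycle,
        # so the mean wait is half the interval and the worst case is the
        # full interval (splay and the Chef run itself are excluded).
        "mean_detect_delay_s": poll_interval_s / 2,
        "max_detect_delay_s": poll_interval_s,
    }
```

For a hypothetical fleet of 10,000 nodes polling every 60 seconds, this gives 14.4 million reads per day and a mean detection delay of 30 seconds; halving the interval doubles the read volume while halving the delay.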

Caveats¶

Polling vs push undocumented. Slack's post does not specify whether Summoner polls S3 on a timer or subscribes to S3 event notifications. Each has different cost / latency / reliability trade-offs.
Signal schema evolution is a concern. Adding new fields to the Librarian-written JSON signal is straightforward, but removing or renaming fields requires careful rollout — the signal bus is a cross-version contract between Librarian and Summoner.
Pull model assumes S3 availability at the signal-read path. An S3 regional outage means no new Chef runs until S3 recovers; the fallback cron catches this after 12 hours (compliance floor) but not before.
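On the schema-evolution caveat, one common way to keep added fields harmless is a tolerant reader on the consumer side. The sketch below is an assumption about how such a reader could look, not Summoner's implementation; the field names `version` and `splay_seconds` are hypothetical.

```python
import json

REQUIRED_FIELDS = {"version"}      # hypothetical: must always be present
DEFAULTS = {"splay_seconds": 0}    # hypothetical: optional, with a fallback


def parse_signal(body: str) -> dict:
    """Parse a signal, tolerating added fields and defaulting optional ones."""
    raw = json.loads(body)
    missing = REQUIRED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"signal missing required fields: {sorted(missing)}")
    known = REQUIRED_FIELDS | set(DEFAULTS)
    # Unknown keys are dropped rather than rejected, so a newer producer
    # can add fields without breaking older consumers.
    parsed = {k: v for k, v in raw.items() if k in known}
    return {**DEFAULTS, **parsed}
```

Removing or renaming a required field still fails loudly here, which is the case that needs a coordinated rollout between producer and consumer.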

Seen in¶

sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Slack's phase-2 Chef design replaces fixed per-node cron with Chef Summoner reading a signal from an S3 bucket populated by Chef Librarian on every promotion.