PATTERN
Signal-triggered fleet config apply¶
Problem¶
Fleet-wide configuration-management substrates (Chef / Puppet / Salt / Ansible-pull) traditionally run on a fixed cron — every node runs the config agent every N hours, regardless of whether there is new work. The model has three baked-in limitations:
- Predictable but inefficient. Most cron-triggered runs find nothing to do (no new cookbook version since the last run), yet still consume node CPU, config-server bandwidth, and config-server CPU to reach that conclusion.
- Loose coupling to rollout cadence. When a new version is promoted, all nodes pick it up at the next cron boundary — latency is bounded above by the cron interval.
- Breaks when promotion cadence varies per environment. If different environments receive promotions at different rates (e.g., one hourly canary environment + a multi-hour release train for the rest — see patterns/release-train-rollout-with-canary), a fixed cron is no longer operationally meaningful.
Solution¶
Replace the per-node cron with a signal-driven pull architecture where three components collaborate:
- Signal-producing service — knows when new work is available (a new version has been promoted to an environment) and writes a signal to a shared signal bus.
- Signal bus — a substrate the fleet can efficiently poll. Object storage (S3, GCS) is a good fit for the write-few-read-many shape.
- Signal-consuming agent on every node — polls the specific key matching the node's environment, deduplicates against local state, applies Splay jitter, and triggers the config-management run.
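The three-component loop can be sketched minimally. This is an illustrative sketch, not Slack's implementation: `InMemoryBus` stands in for S3, and the names `consume_once`, `version`, and `splay_seconds` are hypothetical payload/API choices.

```python
import json
import random
import time

class InMemoryBus:
    """Stand-in for the object-store signal bus (S3 in Slack's design)."""
    def __init__(self):
        self.keys = {}
    def put(self, key, payload):          # producer side
        self.keys[key] = json.dumps(payload)
    def get(self, key):                   # consumer side
        return json.loads(self.keys[key]) if key in self.keys else None

def consume_once(bus, key, last_applied, apply_fn, sleep_fn=time.sleep):
    """One poll cycle: fetch signal, dedup against local state, splay, apply."""
    signal = bus.get(key)
    if signal is None or signal["version"] == last_applied:
        return last_applied                              # nothing new: no run
    sleep_fn(random.uniform(0, signal.get("splay_seconds", 0)))
    apply_fn(signal)                                     # e.g. invoke chef-client
    return signal["version"]

# Producer writes one signal per (stack, env) key on promotion.
bus = InMemoryBus()
bus.put("chef-run-triggers/web/prod", {"version": "1.4.2", "splay_seconds": 0})

applied = []
v = consume_once(bus, "chef-run-triggers/web/prod", None, applied.append,
                 sleep_fn=lambda s: None)
v = consume_once(bus, "chef-run-triggers/web/prod", v, applied.append,
                 sleep_fn=lambda s: None)   # duplicate signal: no second run
```

The dedup return value is the consumer's local state; persisting it across restarts is what makes the signal bus safe to re-read indefinitely.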
Slack's Chef instantiation¶
The canonical instance (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption):
┌──────────────┐ ┌──────────────┐
│ Release train│ promote via API │ Chef │
│ k8s cron │ ──────────────▶ │ Librarian │
│ (producer) │ │ (producer) │
└──────────────┘ └──────┬───────┘
│ write JSON signal to
│ chef-run-triggers/<stack>/<env>
▼
┌──────────────────┐
│ S3 bucket │ ◀── signal bus
│ (durable, cheap) │
└───────┬──────────┘
│ poll matching key
▼
┌──────────────────┐
│ Chef Summoner │
│ (on every │ ◀── consumer agent
│ node) │
└───────┬──────────┘
│ dedup + splay
▼
┌──────────────────┐
│ chef-client │ ◀── config apply
│ (runs cookbook) │
└──────────────────┘
Four design decisions¶
1. Signal granularity¶
Slack chose per-(stack, env) keys — all nodes in the same stack + environment poll the same signal. This scales to large fleets without multiplying producer cost per node, at the cost that signals can't target individual nodes.
Alternatives:
- Per-node keys. Enables per-node targeting but multiplies producer cost by fleet size.
- Per-service keys. Enables per-service targeting but requires the signal producer to know the full service-to-node mapping.
Slack's choice is consistent with "configuration applies uniformly within an environment" — the load-bearing property of environments themselves.
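The producer-cost asymmetry is easy to make concrete. The key shape below matches the diagram's `chef-run-triggers/<stack>/<env>`; the fleet numbers are illustrative, not from the post.

```python
def signal_key(stack, env):
    """Per-(stack, env) key: every node in one stack+env polls the same object."""
    return f"chef-run-triggers/{stack}/{env}"

# Producer writes per promotion under each granularity (illustrative fleet):
stacks, envs, nodes, services = 5, 4, 20_000, 300
writes_per_promotion = {
    "per-(stack, env)": stacks * envs,  # 20 objects; constant in fleet size
    "per-node": nodes,                  # 20,000 objects; scales with the fleet
    "per-service": services,            # 300 objects, plus a service->node map
}
```

The per-(stack, env) write count depends only on the environment topology, which is why it scales to large fleets.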
2. Polling vs push¶
Slack's post doesn't disclose whether Chef Summoner polls S3 on a timer or subscribes to S3 event notifications.
- Polling: simpler, works with any object storage, per-node poll rate determines fanout cost. Easy failure model (at most the poll interval's worth of staleness).
- Event notifications: lower latency but more operational surface (SNS topics or SQS queues, one per consumer or shared with filtering), and delivery guarantees vary by notification mechanism.
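The polling failure model ("at most one interval of staleness") falls out of the loop shape. A sketch with a stubbed change-detector; `fetch_etag` stands in for a cheap metadata check (e.g. an object-store HEAD), and all names are hypothetical.

```python
def poll_loop(fetch_etag, on_change, cycles):
    """Timer-based polling: a new signal is seen at most one interval late.
    fetch_etag() returns the signal object's current version marker."""
    seen = None
    for _ in range(cycles):
        etag = fetch_etag()
        if etag != seen:        # change detected since last cycle
            seen = etag
            on_change()
        # time.sleep(poll_interval)  # real agent sleeps here between cycles

# Simulate four poll cycles over a signal that changes once.
etags = iter(["v1", "v1", "v2", "v2"])
changes = []
poll_loop(lambda: next(etags), lambda: changes.append(1), cycles=4)
```

Two changes fire (the initial `v1` pickup, then the `v2` promotion); the duplicate reads cost a request each but trigger nothing.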
3. Deduplication location¶
Consumer-side deduplication via local state (last-applied version) is the standard shape. The alternative — exactly-once delivery from the bus — is expensive and unnecessary.
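A sketch of consumer-side dedup via a local state file; the file name and JSON shape are hypothetical, not Summoner's actual state format.

```python
import json
import pathlib
import tempfile

def should_run(state_path, signal_version):
    """Run only if the signal differs from the last-applied version on disk."""
    p = pathlib.Path(state_path)
    last = json.loads(p.read_text())["version"] if p.exists() else None
    return signal_version != last

def record_run(state_path, signal_version):
    """Persist last-applied version after a successful config run."""
    pathlib.Path(state_path).write_text(json.dumps({"version": signal_version}))

state = pathlib.Path(tempfile.mkdtemp()) / "summoner-state.json"  # hypothetical path
first = should_run(state, "1.4.2")   # never applied: run
record_run(state, "1.4.2")
dup = should_run(state, "1.4.2")     # same signal re-read: skip
```

Because dedup lives on the consumer, the bus only needs at-least-once visibility, which any object store provides for free.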
4. Compliance floor alongside signal path¶
The signal path is the fast path: run when there's work. But the substrate still needs a compliance floor to catch configuration drift on nodes that receive no new work for extended periods. Slack's design: if Summoner hasn't run Chef in 12 hours, it triggers a run regardless of signal state.
This coexists with the fallback cron (see concepts/fallback-cron-for-self-update-safety) — the in-Summoner compliance floor handles "no promotion in 12 hours", the baked-in-AMI fallback cron handles "Summoner itself is broken."
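The fast path and the compliance floor combine into one trigger predicate. A minimal sketch; the function name and argument shapes are illustrative, but the 12-hour floor is from Slack's design.

```python
import time

TWELVE_HOURS = 12 * 3600

def needs_run(last_run_ts, has_new_signal, now=None):
    """Fast path: run when there's a new signal.
    Compliance floor: run anyway if 12h have passed since the last run."""
    now = time.time() if now is None else now
    return has_new_signal or (now - last_run_ts) >= TWELVE_HOURS

floor_fires = needs_run(last_run_ts=0, has_new_signal=False, now=TWELVE_HOURS)
signal_fires = needs_run(last_run_ts=0, has_new_signal=True, now=60)
idle = needs_run(last_run_ts=0, has_new_signal=False, now=60)
```

Note this predicate lives inside Summoner; the baked-in-AMI fallback cron sits entirely outside it, so a bug here cannot disable both paths at once.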
Payload shape¶
Slack's signal payload (verbatim from the post) includes:
- Splay — per-run randomised jitter (see
concepts/splay-randomised-run-jitter).
- Timestamp — when the signal was written.
- ManifestRecord — full artifact manifest with version,
cookbook-version map, S3 artifact pointer, upload-complete
flag (producer-side ordering barrier).
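Putting those fields together, a signal object might look like the following. The post names Splay, Timestamp, and ManifestRecord; the nesting, sub-field names, and example values here are illustrative, not verbatim from Slack.

```json
{
  "Splay": 900,
  "Timestamp": "2025-10-23T17:04:00Z",
  "ManifestRecord": {
    "Version": "1.4.2",
    "CookbookVersions": {"base": "3.1.0", "nginx": "2.7.4"},
    "ArtifactS3Path": "s3://chef-artifacts/web/prod/1.4.2.tar.gz",
    "UploadComplete": true
  }
}
```

The `UploadComplete` flag is the producer-side ordering barrier: the producer sets it only after the artifact upload finishes, so a consumer that reads the signal mid-upload knows to wait.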
Two signal-design disciplines are visible:
1. Include everything the consumer needs to decide. Summoner doesn't need to make an API call back to Librarian — the signal is self-contained.
2. Include operational tuning knobs in the signal. Splay isn't hard-coded in Summoner; it's specified per-signal so operators can increase it for custom operations.
Composes with¶
- patterns/split-environment-per-az-for-blast-radius — the environments that define the signal key space.
- patterns/release-train-rollout-with-canary — the upstream rollout strategy that drives the producer side.
- patterns/self-update-with-independent-fallback-cron — the fallback path that protects against the signal-driven primary breaking.
- concepts/splay-randomised-run-jitter — the thundering-herd mitigation.
- concepts/s3-signal-bucket-as-config-fanout — the signal-bus substrate.
Caveats¶
- New substrate, new failure modes. Previously, a broken Chef server was the single failure point; now the S3 signal bus is also critical. Mitigation: fallback cron.
- Signal schema evolution. Adding fields is easy; removing or renaming is a cross-version contract change. Producer and consumer must coordinate.
- Cost implications depend on poll cadence. At fleet size M and poll interval T seconds, the aggregate S3 request rate on a single key is M/T requests per second. Tune T to balance request cost against propagation latency.
- Doesn't handle all Chef run scenarios. Manual operator runs (e.g., ad-hoc fleet-wide Chef) still need a separate path. Slack's post: "In addition to this safety net, we also have tooling that allows us to trigger ad hoc Chef runs across the fleet or a subset of nodes when needed."
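The cost caveat above can be made concrete. The fleet size and intervals below are illustrative, not Slack's numbers.

```python
def s3_get_rate(fleet_size, poll_interval_s):
    """Aggregate request rate on one hot signal key: M / T requests per second."""
    return fleet_size / poll_interval_s

rate_fast = s3_get_rate(20_000, 60)    # 20k nodes at 60 s polls: ~333 GETs/s
rate_slow = s3_get_rate(20_000, 300)   # stretching to 300 s: ~67 GETs/s,
                                       # at up to 4 extra minutes of staleness
```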
Sibling patterns¶
- patterns/central-proxy-choke-point — a sibling pattern at the request-flow altitude; similar "single choke point" shape but different consumer model.
- patterns/bootstrap-then-auto-follow — a related pattern for one-shot agent-join-then-passive-follow.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Slack's phase-2 Chef design replaces fixed per-node cron with a Librarian → S3 → Summoner signal-driven pipeline, plus in-Summoner compliance floor and a baked-in-AMI fallback cron.
Related¶
- concepts/signal-driven-chef-trigger
- concepts/s3-signal-bucket-as-config-fanout
- concepts/splay-randomised-run-jitter
- concepts/fallback-cron-for-self-update-safety
- patterns/self-update-with-independent-fallback-cron
- patterns/release-train-rollout-with-canary
- patterns/split-environment-per-az-for-blast-radius
- systems/chef
- systems/chef-librarian
- systems/chef-summoner