PATTERN

Signal-triggered fleet config apply

Problem

Fleet-wide configuration-management substrates (Chef / Puppet / Salt / Ansible-pull) traditionally run on a fixed cron — every node runs the config agent every N hours, regardless of whether there is new work. The model has three baked-in limitations:

  1. Predictable but inefficient. Most cron-triggered runs do nothing (no new cookbook version since last run), but consume node CPU, Chef-server bandwidth, and Chef-server CPU to reach that conclusion.
  2. Loose coupling to rollout cadence. When a new version is promoted, all nodes pick it up at the next cron boundary — latency is bounded above by the cron interval.
  3. Breaks when promotion cadence varies per environment. If different environments receive promotions at different rates (e.g., one hourly canary environment + a multi-hour release train for the rest — see patterns/release-train-rollout-with-canary), a fixed cron is no longer operationally meaningful.

Solution

Replace the per-node cron with a signal-driven pull architecture where three components collaborate:

  1. Signal-producing service — knows when new work is available (a new version has been promoted to an environment) and writes a signal to a shared signal bus.
  2. Signal bus — a substrate the fleet can efficiently poll. Object storage (S3, GCS) is a good fit for the write-few-read-many shape.
  3. Signal-consuming agent on every node — polls the specific key matching the node's environment, deduplicates against local state, applies splay jitter, and triggers the config-management run.

Slack's Chef instantiation

The canonical instance of this pattern (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption):

┌──────────────┐                 ┌──────────────┐
│ Release train│ promote via API │ Chef         │
│  k8s cron    │ ──────────────▶ │ Librarian    │
│ (producer)   │                 │ (producer)   │
└──────────────┘                 └──────┬───────┘
                                        │ write JSON signal to
                                        │ chef-run-triggers/<stack>/<env>
                              ┌──────────────────┐
                              │     S3 bucket    │ ◀── signal bus
                              │ (durable, cheap) │
                              └───────┬──────────┘
                                      │ poll matching key
                              ┌──────────────────┐
                              │  Chef Summoner   │
                              │   (on every      │ ◀── consumer agent
                              │    node)         │
                              └───────┬──────────┘
                                      │ dedup + splay
                              ┌──────────────────┐
                              │  chef-client     │ ◀── config apply
                              │  (runs cookbook) │
                              └──────────────────┘

Four design decisions

1. Signal granularity

Slack chose per-(stack, env) keys — all nodes in the same stack + environment poll the same signal. This scales to large fleets without multiplying producer cost per node, at the cost that signals can't target individual nodes.

Alternatives:

  • Per-node keys. Enables per-node targeting but multiplies producer cost by fleet size.
  • Per-service keys. Enables per-service targeting but requires the signal producer to know the full service-to-node mapping.

Slack's choice is consistent with "configuration applies uniformly within an environment" — the load-bearing property of environments themselves.
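Under the per-(stack, env) choice, the key a node polls is a pure function of its stack and environment. A trivial sketch, assuming the `chef-run-triggers/<stack>/<env>` layout shown in the diagram above:

```python
def signal_key(stack: str, env: str) -> str:
    # Per-(stack, env) granularity: every node in the same stack and
    # environment polls this one shared key, so producer cost is
    # O(stacks x envs), not O(fleet size).
    return f"chef-run-triggers/{stack}/{env}"
```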

2. Polling vs push

Slack's post doesn't disclose whether Chef Summoner polls S3 on a timer or subscribes to S3 event notifications.

  • Polling: simpler, works with any object storage, per-node poll rate determines fanout cost. Easy failure model (at most the poll interval's worth of staleness).
  • Event notifications: lower latency but more operational surface (SNS topics or SQS queues, one per consumer or shared with filtering). Per-notification delivery guarantees vary.

3. Deduplication location

Consumer-side deduplication via local state (last-applied version) is the standard shape. The alternative — exactly-once delivery from the bus — is expensive and unnecessary here: the consumer can cheaply compare the signalled version against its last-applied state and skip duplicates.

4. Compliance floor alongside signal path

The signal path is the fast path: run when there's work. But the substrate still needs a compliance floor to catch configuration drift on nodes that receive no new work for extended periods. Slack's design: if Summoner hasn't run Chef in 12 hours, it triggers a run regardless of signal state.

This coexists with the fallback cron (see concepts/fallback-cron-for-self-update-safety) — the in-Summoner compliance floor handles "no promotion in 12 hours", the baked-in-AMI fallback cron handles "Summoner itself is broken."

Payload shape

Slack's signal payload (verbatim from the post) includes:

  • Splay — per-run randomised jitter (see concepts/splay-randomised-run-jitter).
  • Timestamp — when the signal was written.
  • ManifestRecord — full artifact manifest with version, cookbook-version map, S3 artifact pointer, upload-complete flag (producer-side ordering barrier).

Two signal-design disciplines are visible:

  1. Include everything the consumer needs to decide. Summoner doesn't need to make an API call back to Librarian — the signal is self-contained.
  2. Include operational tuning knobs in the signal. Splay isn't hard-coded in Summoner; it's specified per-signal so operators can increase it for custom operations.
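A hypothetical concrete payload matching that field list. The top-level names come from the post; the nested ManifestRecord shape and all values are invented for illustration:

```python
import json

signal_json = """
{
  "Splay": 600,
  "Timestamp": "2025-10-23T17:04:00Z",
  "ManifestRecord": {
    "Version": "2025.10.23-1",
    "CookbookVersions": {"base": "4.1.2", "nginx": "2.0.0"},
    "ArtifactS3Path": "s3://chef-artifacts/manifests/2025.10.23-1.tar.gz",
    "UploadComplete": true
  }
}
"""

signal = json.loads(signal_json)

# Producer-side ordering barrier: a consumer should never act on a
# manifest whose artifacts may still be mid-upload.
ready = signal["ManifestRecord"]["UploadComplete"]
```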

Caveats

  • New substrate, new failure modes. Previously, a broken Chef server was the single failure point; now the S3 signal bus is also critical. Mitigation: fallback cron.
  • Signal schema evolution. Adding fields is easy; removing or renaming is a cross-version contract change. Producer and consumer must coordinate.
  • Cost implications depend on poll cadence. At fleet size M and poll interval T, S3 request rate is M/T per second on a single key. Tune T to balance request cost vs propagation latency.
  • Doesn't handle all Chef run scenarios. Manual operator runs (e.g., ad-hoc fleet-wide Chef) still need a separate path. Slack's post: "In addition to this safety net, we also have tooling that allows us to trigger ad hoc Chef runs across the fleet or a subset of nodes when needed."
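The cost and latency arithmetic in the caveats above can be made concrete. Fleet size, poll interval, and splay below are illustrative numbers, not Slack's:

```python
def s3_requests_per_second(fleet_size: int, poll_interval_s: float) -> float:
    # M nodes each issuing one GET per interval T against the shared key.
    return fleet_size / poll_interval_s

def worst_case_staleness_s(poll_interval_s: float, max_splay_s: float) -> float:
    # A node sees a new signal at worst one full interval after it lands,
    # then waits up to the full splay before running.
    return poll_interval_s + max_splay_s

# e.g. 20,000 nodes polling every 60 s, with up to 10 min of splay
rate = s3_requests_per_second(20_000, 60)        # GET/s on the hot key
stale = worst_case_staleness_s(60, 600)          # seconds until last node runs
```

Raising T cuts request cost linearly but widens the staleness bound by the same amount, which is the tuning trade the caveat describes.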
