CONCEPT Cited by 1 source
S3 signal bucket as config fanout
Definition
S3 (object storage) as the fanout substrate for configuration-management signals — a small set of well-known keys, each representing "is there new work for consumers in category X?", with the payload carrying the work identifier and any per-trigger operational knobs. N producers write to the keys; M consumers on every fleet node poll the specific key that matches their category.
The canonical instance is Slack's Chef stack (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption). Chef Librarian writes to a bucket under chef-run-triggers/<stack>/<env> on every cookbook-version promotion; Chef Summoner on every node polls its (stack, env) key and consumes the signal to trigger Chef runs.
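As a sketch of that routing (the helper name is mine, not Slack's): producer and consumer independently derive the same well-known key from the (stack, env) pair, which is what lets them avoid enumerating each other.

```python
# Illustrative sketch of the two-level key routing; `signal_key` is a
# hypothetical helper, not from Slack's post.

def signal_key(stack: str, env: str) -> str:
    """Well-known S3 key for one (stack, env) signal category."""
    return f"chef-run-triggers/{stack}/{env}"

# Chef Librarian (producer) writes the signal here on promotion...
producer_key = signal_key("main", "prod")
# ...and Chef Summoner on every main/prod node polls the same key.
consumer_key = signal_key("main", "prod")

assert producer_key == consumer_key == "chef-run-triggers/main/prod"
```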
Why S3 works here
The fanout shape
Producers (N ≈ small)                    Consumers (M ≈ fleet-size)
─────────────────────                    ──────────────────────────
Chef Librarian ──write signal──▶ S3 key ◀──poll──── Node-A
                                        ◀──poll──── Node-B
                                        ◀──poll──── Node-C
                                        ◀──poll──── (...)
- Producer cost: O(N × env_count) — each producer writes to a small number of well-known keys.
- Consumer cost: O(M × poll_rate) — each consumer reads its one key at whatever cadence it polls.
- Fanout cost: O(N + M), not O(N × M) — producers don't enumerate consumers and vice versa.
This is cheaper than pub/sub systems for the write-few-read-many shape, because there's no per-consumer subscription state to manage on the producer side.
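Back-of-envelope, with invented counts (Slack's post gives no fleet numbers):

```python
# Fanout cost sketch with invented numbers; none of these counts
# come from Slack's post.
N = 3            # producers
envs = 4         # well-known (stack, env) keys each producer writes
M = 10_000       # consumer nodes in the fleet

writes_per_promotion = N * envs    # producer side: O(N × env_count)
reads_per_poll_cycle = M           # consumer side: O(M), one GET each

# A push model in which every producer delivered to every consumer
# would instead cost O(N × M) deliveries:
naive_deliveries = N * M

assert writes_per_promotion + reads_per_poll_cycle < naive_deliveries
```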
S3-specific properties that matter
- Durability and availability. S3's 11-nines durability and 4-nines availability are higher than most self-operated substrates, and the operational overhead is externalised to AWS.
- ACLs map to IAM. Access control is federated into the existing identity substrate — Librarian has write perms, Summoner has read perms.
- Versioning available. If you need signal history, S3 bucket versioning records every overwrite; useful for audit and replay debugging.
- Cheap. Small JSON payloads; list+get requests; very low cost at fleet scale.
- Naturally multi-tenant. Prefix-based key layout scopes signal isolation.
S3 limitations to work around
- Not a push substrate by default — consumers must poll, or the bucket must be configured to emit S3 Event Notifications (SNS/SQS/EventBridge) that push to consumers.
- Consistency caveats. S3 has offered strong read-after-write consistency for GET, PUT, and LIST since December 2020; before that, overwrite GETs and LISTs were eventually consistent. Consumers should still poll by GET on known keys rather than LIST — a GET on a well-known key is cheaper and avoids enumerating the prefix.
- At-least-once delivery at best. Consumers must deduplicate against local state.
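Since the signal can be read many times between overwrites, the consumer needs local dedup state. A minimal sketch (class and field names are mine; the post doesn't describe Summoner's internals):

```python
import json

class SignalConsumer:
    """Dedup sketch for an at-least-once polled signal.

    Hypothetical: `fetch` stands in for an S3 GET on the node's
    well-known key; `last_seen` would persist in local node state.
    """

    def __init__(self, fetch):
        self.fetch = fetch        # callable returning raw JSON bytes
        self.last_seen = None     # Timestamp of the last signal acted on

    def poll(self):
        """Return the payload if it is new work, else None."""
        signal = json.loads(self.fetch())
        if signal["Timestamp"] == self.last_seen:
            return None           # already handled this signal
        self.last_seen = signal["Timestamp"]
        return signal

store = {"k": b'{"Timestamp": "2025-10-23T00:00:00Z", "version": "1.2.3"}'}
consumer = SignalConsumer(lambda: store["k"])
first = consumer.poll()           # new signal: acted on
again = consumer.poll()           # same signal re-read: deduplicated
assert first["version"] == "1.2.3" and again is None
```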
The Slack signal schema
The key layout is two-level — chef-run-triggers/<stack>/<env> — with the producer writing a JSON payload that carries:
- `Splay` — per-signal jitter value (see concepts/splay-randomised-run-jitter).
- `Timestamp` — RFC 3339 time of the promotion.
- `ManifestRecord` — the full artifact manifest:
  - `version` — the version promoted.
  - `chef_shard` — the Chef stack this signal is for (duplicated in the key path for routing convenience).
  - `datetime` — Unix time.
  - `latest_commit_hash` — git commit the cookbook was built from.
  - `manifest_content` — nested:
    - `base_version`
    - `latest_commit_hash` (dup)
    - `author`
    - `cookbook_versions` — map of cookbook name → version
    - `site_cookbook_versions` — site-specific cookbooks (apache2, squid, etc.)
- `s3_bucket` + `s3_key` — where the `.tar.gz` artifact itself lives (a separate S3 object, pointed to from the signal).
- `ttl` — expiry for the signal.
- `upload_complete` — boolean confirming the artifact was fully uploaded before the signal was written (a producer-side ordering guarantee).
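As a concrete illustration, a payload matching that schema might look like the following — the field names come from the post, but every value here is invented:

```python
import json

# Hypothetical signal payload for chef-run-triggers/main/prod.
# Field names follow the schema above; all values are invented.
signal = {
    "Splay": 300,                            # seconds of per-node jitter
    "Timestamp": "2025-10-23T12:00:00Z",     # RFC 3339 promotion time
    "ManifestRecord": {
        "version": "1.42.0",
        "chef_shard": "main",                # duplicated in the key path
        "datetime": 1761220800,              # Unix time
        "latest_commit_hash": "abc123",
        "manifest_content": {
            "base_version": "1.41.0",
            "latest_commit_hash": "abc123",  # duplicate of the outer field
            "author": "jane",
            "cookbook_versions": {"nginx": "2.0.1"},
            "site_cookbook_versions": {"apache2": "0.9.4"},
        },
    },
    "s3_bucket": "chef-artifacts",             # the artifact lives elsewhere...
    "s3_key": "artifacts/main/1.42.0.tar.gz",  # ...pointed to from the signal
    "ttl": 86400,                            # signal expiry, seconds
    "upload_complete": True,                 # artifact fully uploaded first
}

assert json.loads(json.dumps(signal))["upload_complete"] is True
```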
The `upload_complete` flag is the producer-side ordering discipline — write the artifact to S3 first, then write the signal with `upload_complete: true`. Without this, a consumer could read the signal, pull the artifact, and find it partially uploaded.
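The discipline can be sketched against an in-memory stand-in for S3 (the dict and all names are illustrative; the point is the write order):

```python
# Producer-side ordering: artifact first, signal second. The dict
# stands in for S3; key names and payloads are invented.
bucket = {}

def promote(version: str, artifact: bytes) -> None:
    artifact_key = f"artifacts/main/{version}.tar.gz"
    bucket[artifact_key] = artifact             # 1. write the artifact
    bucket["chef-run-triggers/main/prod"] = {   # 2. then write the signal
        "version": version,
        "s3_key": artifact_key,
        "upload_complete": True,                # only set after step 1
    }

promote("1.42.0", b"tarball-bytes")
sig = bucket["chef-run-triggers/main/prod"]
# A consumer that honours upload_complete never sees a partial artifact:
assert sig["upload_complete"] and sig["s3_key"] in bucket
```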
Design trade-offs
- Per-category key (Slack's choice) vs per-consumer key. Per-category keys scale to M consumers at O(1) producer cost, but all consumers in the same category read the same signal. Per-consumer keys permit node-specific targeting but multiply producer cost by fleet size. Slack uses per-category because all nodes in an environment are supposed to apply the same cookbook version.
- Key naming reflects routing. Slack's two-level layout `<stack>/<env>` encodes both axes a consumer uses to pick its signal. Deeper hierarchies (e.g., `<stack>/<env>/<service>`) would permit finer-grain signals but increase producer complexity.
- JSON vs binary payload. JSON is human-readable, self-describing, and easy to extend; binary (protobuf / MessagePack) is smaller but needs a schema registry. For small fleet-config signals, JSON's cost is negligible.
- Polling vs S3 Event Notifications. Polling is simpler (Summoner on every node just GETs the key on a timer) but adds up to one poll interval of latency. Event notifications fan out through SNS or SQS to subscribed consumers at lower latency, but with more operational surface. Slack's post does not disclose which they use.
Sibling instances on the wiki at other altitudes
S3 at other altitudes of the wiki's corpus:
- Primary CDC log store (Segment's change-data-capture for DynamoDB pipeline) — see patterns/object-store-as-cdc-log-store. S3 is the authoritative changelog, not a cold tier.
- Cold tier under streaming brokers (Redpanda Tiered Storage) — patterns/tiered-storage-to-object-store. S3 is a cheap-and-slow backing store below the local-disk hot tier.
- Primary data store for object uploads — the canonical S3 use case at infinite altitudes.
Slack's usage canonicalises S3 as a config-management signal bus — a new altitude for the wiki's S3 coverage: fleet-wide pull-model fanout at low write rate with small JSON payloads.
Caveats
- Not a general pub/sub replacement. S3-as-signal-bus works for write-few-read-many, eventual-consistency-tolerant, small-payload workloads. High-throughput or low-latency pub/sub needs a dedicated substrate (Kafka, SNS, etc.).
- Polling cost can surprise. At fleet scale M, polling interval T, the GET-request rate is M/T per second on the signal key. Depending on T and M, S3 request costs can become material.
- Signal idempotency is a consumer responsibility. S3 doesn't provide exactly-once semantics; consumers must deduplicate using local state.
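To put numbers on the polling caveat, a hedged sketch — fleet size, interval, and the GET price are assumptions, not figures from the post:

```python
# Polling-cost back-of-envelope. Fleet size and interval are invented;
# the ~$0.0004 per 1,000 GETs figure is AWS's us-east-1 list price at
# time of writing — check current pricing.
M = 10_000                      # nodes polling
T = 30                          # poll interval, seconds
price_per_get = 0.0004 / 1000   # USD per single GET request

gets_per_second = M / T
seconds_per_month = 60 * 60 * 24 * 30
gets_per_month = M * seconds_per_month / T
monthly_cost = gets_per_month * price_per_get

assert round(gets_per_second) == 333
assert round(monthly_cost) == 346    # ~USD/month at these numbers
```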
Seen in
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Chef Librarian writes per-env signals to chef-run-triggers/<stack>/<env>; Chef Summoner on every node consumes its specific key.