Chef Librarian (Slack)
What it is
Chef Librarian is Slack's internal service responsible for uploading cookbook artifacts to every Chef stack and promoting specific cookbook versions to specific environments. It was originally described in Slack Engineering's 2024 Advancing Our Chef Infrastructure post (not yet ingested). The 2025-10-23 phase-2 post describes an extension: Librarian now writes a JSON signal to an S3 bucket on every environment promotion, and Chef Summoner, running on every node, consumes that signal to trigger Chef runs.
Librarian is the producer side of Slack's signal-driven fleet-configuration fanout.
Core capabilities
Original (2024-era, pre-phase-2)
- Watches for new cookbook artifacts produced by the cookbook CI/build pipeline.
- Uploads artifacts to every Chef stack. Slack runs multiple Chef stacks (at minimum two disclosed: basalt and ironstone); Librarian handles the N-stack fanout so operators don't upload per-stack.
- Exposes a promote-version-to-environment API. Operators (and automation, including a Kubernetes cron job rolling out cookbook changes through the release train) call this to pin a specific cookbook version to a specific environment on a specific stack.
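The upload fanout and the promote call can be pictured with a small in-memory model. This is a minimal sketch, not Slack's implementation: the `Librarian` and `ChefStack` classes, their method names, and the version string `1.2.3` are all hypothetical, and only the stack names basalt and ironstone come from the source.

```python
from dataclasses import dataclass, field


@dataclass
class ChefStack:
    """Hypothetical stand-in for one Chef stack (for example, basalt)."""
    name: str
    artifacts: set = field(default_factory=set)  # uploaded cookbook versions
    pins: dict = field(default_factory=dict)     # environment -> pinned version


class Librarian:
    """Illustrative model only; the real service's API is undisclosed."""

    def __init__(self, stack_names):
        self.stacks = {name: ChefStack(name) for name in stack_names}

    def upload(self, version):
        # N-stack fanout: a single call uploads the artifact to every stack.
        for stack in self.stacks.values():
            stack.artifacts.add(version)

    def promote(self, stack_name, env, version):
        # Pin one version to one environment on one stack.
        stack = self.stacks[stack_name]
        if version not in stack.artifacts:
            raise ValueError(f"{version} was never uploaded to {stack_name}")
        stack.pins[env] = version


librarian = Librarian(["basalt", "ironstone"])
librarian.upload("1.2.3")                        # lands on both stacks
librarian.promote("basalt", "sandbox", "1.2.3")  # pins on basalt only
```

In this model the upload is stack-agnostic while the promotion is scoped to a (stack, environment) pair, which mirrors the division of labour in the list above.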
Extended (2025-10-23, phase-2)
- Writes a JSON signal to S3 on every promotion. The bucket layout is two-level: chef-run-triggers/<stack>/<env>, with a per-env JSON payload. See concepts/s3-signal-bucket-as-config-fanout.
- Signal payload includes:
  - Splay: per-run jitter value (example: 15; units not disclosed). See concepts/splay-randomised-run-jitter.
  - Timestamp: promotion time in RFC 3339 format.
  - ManifestRecord: full artifact metadata, comprising version, chef_shard, datetime, latest_commit_hash, manifest_content (base version, commit hash, author, cookbook-versions map, site-cookbook-versions map), s3_bucket, s3_key for the .tar.gz artifact, ttl, and an upload_complete flag.
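To make the payload shape concrete, the following sketch assembles a signal document with the three disclosed top-level fields. It is an illustration under assumptions: the function name `build_signal` is hypothetical, every value is a placeholder, the value types (for example of ttl) are not disclosed, and the key names inside manifest_content are guesses derived from the prose description rather than a documented schema.

```python
import json
from datetime import datetime, timezone


def build_signal(manifest_record, splay=15):
    """Assemble a signal dict with the three disclosed top-level fields."""
    return {
        "Splay": splay,  # 15 is the source's example; units not disclosed
        # An aware UTC datetime serialises to an RFC 3339-compatible string.
        "Timestamp": datetime.now(timezone.utc).isoformat(),
        "ManifestRecord": manifest_record,
    }


# Placeholder values throughout; only the top-level field names of the
# record are taken from the source.
manifest_record = {
    "version": "<version>",
    "chef_shard": "<chef-shard>",
    "datetime": "<datetime>",
    "latest_commit_hash": "<commit-hash>",
    "manifest_content": {  # inner key names are assumed, not disclosed
        "base_version": "<base-version>",
        "commit_hash": "<commit-hash>",
        "author": "<author>",
        "cookbook_versions": {},
        "site_cookbook_versions": {},
    },
    "s3_bucket": "<bucket>",
    "s3_key": "<path-to-artifact>.tar.gz",
    "ttl": "<ttl>",
    "upload_complete": True,
}

signal = build_signal(manifest_record)
print(json.dumps(signal, indent=2))
```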
Bucket layout (disclosed)
chef-run-triggers/
├── basalt/
│   ├── ami-dev
│   ├── ami-prod
│   ├── dev
│   ├── prod
│   ├── prod-1 ... prod-6
│   └── sandbox
└── ironstone/
    ├── ami-dev
    ├── ami-prod
    ├── dev
    ├── prod
    ├── prod-1 ... prod-6
    └── sandbox
Two stacks × eleven environments = 22 keys total at the disclosed scale.
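The key count can be checked by enumerating the layout. This is a small sketch: the stack and environment names are the disclosed ones, while the helper name `signal_keys` is hypothetical.

```python
STACKS = ["basalt", "ironstone"]
ENVIRONMENTS = (
    ["ami-dev", "ami-prod", "dev", "prod"]
    + [f"prod-{n}" for n in range(1, 7)]  # prod-1 through prod-6
    + ["sandbox"]
)


def signal_keys(prefix="chef-run-triggers"):
    """Return every <prefix>/<stack>/<env> key in the two-level layout."""
    return [f"{prefix}/{stack}/{env}" for stack in STACKS for env in ENVIRONMENTS]


keys = signal_keys()
print(len(ENVIRONMENTS), len(keys))  # prints: 11 22
```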
Role in release-train rollout
The Kubernetes cron job that drives Slack's phase-2 release train calls Librarian's promote API on each tick:
- Top-of-hour: promote latest version to Sandbox + Dev.
- :30: promote to prod-1 (canary) unconditionally, and to prod-N+1 iff the release-train conditions are met (see patterns/release-train-rollout-with-canary).
Each promotion triggers a signal write to the corresponding
chef-run-triggers/<stack>/<env> S3 key, which Chef Summoner
on all matching nodes will eventually pick up and act on.
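The tick schedule can be summarised as a pure function from the clock minute to the environments promoted on that tick. This is a sketch under assumptions: the function name and parameters are hypothetical, the release-train conditions are not specified in this page and are reduced to a boolean here, and the index of the next prod environment is supplied by the caller rather than derived.

```python
def envs_to_promote(minute, train_conditions_met, next_prod_index):
    """Return the environments promoted on a given tick.

    minute               -- minute of the hour at which the cron job fires
    train_conditions_met -- stand-in for the undisclosed release-train checks
    next_prod_index      -- N+1, the next prod environment on the train (2..6)
    """
    if minute == 0:
        # Top of the hour: latest version goes to sandbox and dev.
        return ["sandbox", "dev"]
    if minute == 30:
        # The canary is promoted unconditionally.
        targets = ["prod-1"]
        # The train advances only if its conditions hold.
        if train_conditions_met and 2 <= next_prod_index <= 6:
            targets.append(f"prod-{next_prod_index}")
        return targets
    return []  # no promotion on any other minute


print(envs_to_promote(0, False, 2))   # ['sandbox', 'dev']
print(envs_to_promote(30, False, 2))  # ['prod-1']
print(envs_to_promote(30, True, 3))   # ['prod-1', 'prod-3']
```

Each returned environment corresponds to one promote call and therefore one signal write.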
Why this matters
Librarian is the decoupling point between cookbook-promotion orchestration and Chef-run execution. Without Librarian, each node would need to know about the release-train schedule directly (or, historically, run Chef on a fixed cron regardless of whether there was anything to apply). With Librarian writing signals to S3, the promotion orchestration is one concern (drive it from Kubernetes cron, a human operator, or a GitHub Action — Librarian doesn't care) and the fleet execution is a separate concern (Chef Summoner on every node).
Caveats
- Stub-level coverage. Librarian's own architecture (language, framework, deployment, scaling, failure model) is not canonicalised — the 2024 introductory post is not yet ingested on the wiki.
- Only the producer side of the signal protocol is disclosed. Error semantics (what if the S3 write fails after the promotion succeeds? what if Librarian crashes mid-promotion? what if two promotes race on the same env?) are not disclosed.
- API surface undisclosed. The promote endpoint is named but not specified (HTTP? gRPC? command-line?); the Kubernetes cron job that calls it is mentioned but not described.
Seen in
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — phase-2 extension: Librarian writes to S3 on promotion, feeding the signal bus Chef Summoner consumes.