Splay — randomised run jitter¶
Definition¶
Splay (named by Chef) is a per-node randomised delay inserted between "I received a new config signal" and "I start the config-management run". Each node draws a random value in [0, splay) and sleeps for that long before starting, staggering N nodes' runs across a time window instead of firing them simultaneously when a new cookbook version lands.

The canonical instance on the wiki is Slack's Chef stack (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): the signal payload written by Chef Librarian to S3 includes an explicit `"Splay": 15` field, which Chef Summoner reads on every node and applies before triggering the chef-client run.
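The per-node behaviour can be sketched in a few lines of Python. This is an illustrative sketch, not Slack's implementation: it assumes a JSON payload and seconds as the unit (neither is confirmed by the post), and the chef-client invocation is a placeholder.

```python
import json
import random
import subprocess
import time

def draw_splay_delay(signal_payload: str) -> float:
    """Read the Splay field from a signal payload and draw a uniform
    random delay in [0, splay). Only the "Splay" field name follows
    the Slack post; the payload shape is assumed."""
    signal = json.loads(signal_payload)
    splay = signal.get("Splay", 0)  # units undisclosed; assumed seconds
    return random.uniform(0, splay)

def summon(signal_payload: str) -> None:
    """Sketch of a Summoner-style trigger: sleep the splay, then run."""
    time.sleep(draw_splay_delay(signal_payload))
    subprocess.run(["chef-client"], check=True)  # placeholder run command
```

With 1,000 nodes all reading the same payload, each node draws its own delay independently, so run starts land roughly uniformly across the window.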
Why it matters¶
Without splay, N nodes that all poll the same signal bus at (near) the same time will all trigger config runs at the same time. This creates thundering-herd load on any shared resource the run touches — the Chef server, the package repository, the artifact store, the cookbook storage — and risks amplifying a routine promotion into an outage of the very substrate the promotion depends on.
Verbatim (from the Slack post): "The splay is used to stagger Chef runs so that not all nodes in a given environment and stack try to run Chef at the same time. This helps avoid spikes in load and resource contention."
Two distinct uses of splay in Slack's design¶
1. Steady-state jitter for every promotion¶
A standard splay (example value: 15) is applied whenever a new cookbook version is signalled via the normal release-train path. Nodes that wake up to the same new-version signal stagger their runs across the splay window.
2. Operational tuning for custom signals¶
Verbatim: "We can also customize the splay depending on our needs — for example, when we trigger a Chef run using a custom signal from Librarian and want to spread the runs out more intentionally."
Operators can explicitly set a larger splay when issuing a custom signal — e.g., for a fleet-wide forced run where the work per node is heavier than usual, or the target system is known to be fragile. The field is first-class in the signal payload, not hard-coded.
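A hypothetical payload builder illustrates the splay being carried in-band with the signal rather than hard-coded on the node. Only the `"Splay"` field name is attested by the post; the other field name and the values are invented for illustration.

```python
import json

def build_signal(cookbook_version: str, splay: int) -> str:
    # Hypothetical Librarian-style payload; only "Splay" is attested.
    return json.dumps({"CookbookVersion": cookbook_version, "Splay": splay})

routine = build_signal("1.4.2", 15)   # normal release-train promotion
forced = build_signal("1.4.2", 300)   # fleet-wide forced run, spread wider
```

Because the field travels with the signal, the operator issuing a custom signal chooses the window per event, with no node-side config change required.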
Relation to related concepts¶
- concepts/thundering-herd — splay is the general mitigation for the per-node "wake on the same signal" variant of the thundering herd problem.
- patterns/jittered-job-scheduling — splay is jittered-job-scheduling applied to per-node config runs specifically, with the delay carried in-band with the trigger signal.
- Chef's native `splay` parameter — Chef has had a `chef-client --splay` flag and equivalent config since its earliest releases; Slack's signal-carried `Splay` is a first-class exposure of that primitive for per-signal operational tuning.
- AWS SDK exponential-backoff jitter — the same mathematical primitive (uniform jitter in `[0, window)`) applied at the retry-backoff axis; the Chef `Splay` is applied at the trigger-delay axis.
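The two axes can be contrasted directly. Both draw uniformly from a window; only what bounds the window differs. The full-jitter backoff formula below is the commonly described AWS pattern, not something taken from the Slack post.

```python
import random

def trigger_delay(splay: float) -> float:
    # Splay axis: jitter before a run starts, uniform in [0, splay)
    return random.uniform(0, splay)

def retry_backoff(base: float, attempt: int, cap: float) -> float:
    # Retry axis: AWS-style "full jitter" backoff,
    # uniform in [0, min(cap, base * 2**attempt))
    return random.uniform(0, min(cap, base * 2 ** attempt))
```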
Design notes¶
- Splay value is a trade-off between detection speed and load smoothing. Small splay (seconds) → all nodes have converged quickly, but the Chef server sees a bigger spike. Large splay (minutes or more) → smoother server load, but some nodes remain on the old version for the splay duration.
- Splay matters less on pull-only systems with small read footprints. If the only shared resource is a CDN-cached S3 object, splay matters mainly for the downstream action (Chef server, package repo). For pure polling, splay mitigates list-call load.
- Splay should be bounded by operational SLAs. If the compliance-run SLA is 12 hours, the splay window must be much smaller than 12 hours to avoid missing the deadline. Slack's example (15) is consistent with a sub-minute or sub-hour window; the post does not disclose units.
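The trade-off in the first note can be made concrete with back-of-the-envelope arithmetic (illustrative numbers, not from the post): with uniform splay, N nodes over a window of W seconds average N/W run starts per second.

```python
def avg_start_rate(nodes: int, splay_seconds: float) -> float:
    """Average run starts per second when N nodes draw uniformly
    from a splay window of W seconds (run duration excluded)."""
    return nodes / splay_seconds

# 10,000 nodes with a 1-second window: barely smoothed at all
assert avg_start_rate(10_000, 1) == 10_000.0
# the same fleet over 15 minutes: roughly 11 starts per second
assert round(avg_start_rate(10_000, 900)) == 11
```

Choosing the window is then a matter of picking the largest W that still satisfies the convergence SLA.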
Caveats¶
- Units not disclosed. Slack's example `"Splay": 15` doesn't specify seconds vs minutes. Chef's native usage is typically seconds, but signal-carried values may vary.
- Distribution not disclosed. The post implies uniform randomisation in `[0, splay)` (Chef's default), but doesn't explicitly state it.
- Splay alone doesn't prevent all load spikes. A 1-second splay window with 10,000 nodes is still a 10,000-request/sec spike amortised over 1 second. Splay is one of several load-smoothing primitives.
Seen in¶
- sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption — canonical: Slack's Chef Summoner reads Splay from the S3 signal payload written by Chef Librarian; the field is first-class in the signal protocol for per-signal operational tuning.