CONCEPT Cited by 1 source

Splay — randomised run jitter

Definition

Splay (named by Chef) is a per-node randomised delay inserted between "I received a new config signal" and "I start the config-management run". Each node draws a random value in [0, splay) and sleeps for that long before starting, staggering N nodes' runs across a time window instead of firing them simultaneously when a new cookbook version lands.
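
A minimal Python sketch of the draw described above, assuming the splay is a non-negative number and the distribution is uniform. The name splay_delay and the five-node example are illustrative; they do not come from Chef or from Slack's post.

```python
import random
from typing import Optional


def splay_delay(splay: float, rng: Optional[random.Random] = None) -> float:
    """Return a delay drawn uniformly from [0, splay); 0.0 if splay <= 0."""
    if splay <= 0:
        return 0.0
    rng = rng or random.Random()
    # random() returns a float in [0.0, 1.0), so the product stays below splay.
    return rng.random() * splay


# Five hypothetical nodes reacting to the same signal with splay = 15.
rng = random.Random(42)  # seeded only to make the sketch repeatable
delays = sorted(splay_delay(15, rng) for _ in range(5))
print([round(d, 2) for d in delays])
```

A real node would pass the result to time.sleep before starting its run; the sketch only prints the draws so that it finishes instantly.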

The canonical instance on the wiki is Slack's Chef stack (Source: sources/2025-10-23-slack-advancing-our-chef-infrastructure-safety-without-disruption): the signal payload written by Chef Librarian to S3 includes an explicit "Splay": 15 field, which Chef Summoner reads on every node and applies before triggering the chef-client run.
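
The read-then-delay flow can be sketched as follows. Only the "Splay" field name comes from the post; the function handle_signal, its parameters, the treatment of a missing field as zero, and the payload containing nothing but that field are assumptions made for illustration. This is not Chef Summoner's actual code.

```python
import json
import random
import time
from typing import Callable


def handle_signal(
    raw_payload: str,
    start_run: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    draw: Callable[[], float] = random.random,
) -> float:
    """Parse a signal, wait a random share of its splay, then start the run."""
    signal = json.loads(raw_payload)
    # "Splay" is the only field name confirmed by the post; treating a
    # missing field as "no splay" is an assumption of this sketch.
    splay = max(0.0, float(signal.get("Splay", 0)))
    delay = draw() * splay  # draw() is expected to return a value in [0, 1)
    sleep(delay)
    start_run()
    return delay


# Usage with stand-ins for sleeping and for the chef-client run,
# so the sketch completes instantly.
events = []
waited = handle_signal(
    '{"Splay": 15}',
    start_run=lambda: events.append("run"),
    sleep=lambda seconds: events.append(("sleep", seconds)),
    draw=lambda: 0.5,
)
print(waited, events)
```

Injecting sleep and draw keeps the ordering (delay first, run second) easy to check without waiting in real time.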

Why it matters

Without splay, N nodes that poll the same signal bus at (nearly) the same time all trigger config runs at once. The result is thundering-herd load on every shared resource the run touches (the Chef server, the package repository, the artifact store, the cookbook storage), which risks turning a routine promotion into an outage of the very substrate the promotion depends on.

Verbatim (from the Slack post): "The splay is used to stagger Chef runs so that not all nodes in a given environment and stack try to run Chef at the same time. This helps avoid spikes in load and resource contention."

Two distinct uses of splay in Slack's design

1. Steady-state jitter for every promotion

A standard splay (example value: 15) is applied whenever a new cookbook version is signalled via the normal release-train path. Nodes that wake up to the same new-version signal stagger their runs across the splay window.

2. Operational tuning for custom signals

Verbatim: "We can also customize the splay depending on our needs — for example, when we trigger a Chef run using a custom signal from Librarian and want to spread the runs out more intentionally."

Operators can explicitly set a larger splay when issuing a custom signal — e.g., for a fleet-wide forced run where the work per node is heavier than usual, or the target system is known to be fragile. The field is first-class in the signal payload, not hard-coded.
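
One way an operator might size a larger splay is to work backwards from the average start rate the fragile system can tolerate. The helper below is a hypothetical sketch: splay_for_target_rate and the example numbers are not from the post, and the result is in whatever unit the Splay field uses, which the post does not disclose.

```python
import math


def splay_for_target_rate(nodes: int, max_starts_per_unit: float) -> int:
    """Smallest whole splay window whose average start rate is at or below the target.

    With uniform draws, the average start rate is nodes / splay, so the
    window must be at least nodes / max_starts_per_unit.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    if max_starts_per_unit <= 0:
        raise ValueError("max_starts_per_unit must be positive")
    return math.ceil(nodes / max_starts_per_unit)


# Hypothetical fleet-wide forced run: 3,000 nodes, at most 20 starts per time unit.
print(splay_for_target_rate(3_000, 20))
```

The figure is an average only. Random draws clump, so short-lived peaks above the target are still expected, and a cautious operator would round the window up further.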

  • concepts/thundering-herd — splay is the general mitigation for the per-node "wake on the same signal" variant of the thundering herd problem.
  • patterns/jittered-job-scheduling — splay is jittered-job-scheduling applied to per-node config runs specifically, with the delay carried in-band with the trigger signal.
  • Chef's native splay parameter — Chef has had a chef-client --splay flag and equivalent config since its earliest releases; Slack's signal-carried Splay is a first-class exposure of that primitive for per-signal operational tuning.
  • AWS SDK exponential-backoff jitter — the same mathematical primitive (uniform jitter in [0, window)) applied at the retry-backoff axis; the Chef Splay is applied at the trigger-delay axis.
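
The shared primitive behind the last comparison can be sketched in a few lines. The retry formula is the variant commonly called "full jitter" (a uniform draw over an exponentially growing, capped window); whether any particular AWS SDK uses exactly this formula is not claimed here, and all function names are illustrative.

```python
import random


def uniform_jitter(window: float, rng: random.Random) -> float:
    """Shared primitive: a uniform draw from [0, window), or 0.0 for an empty window."""
    return rng.random() * window if window > 0 else 0.0


def trigger_delay(splay: float, rng: random.Random) -> float:
    """Trigger-delay axis: one draw per signal, window fixed by the splay."""
    return uniform_jitter(splay, rng)


def retry_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Retry-backoff axis: the window doubles with each attempt, up to cap."""
    return uniform_jitter(min(cap, base * 2 ** attempt), rng)


rng = random.Random(7)  # seeded only to make the sketch repeatable
print(round(trigger_delay(15, rng), 2))
print([round(retry_delay(n, base=0.1, cap=20.0, rng=rng), 3) for n in range(5)])
```

The only difference between the two axes is how the window is chosen: splay fixes it per signal, while backoff widens it per failed attempt.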

Design notes

  • Splay value is a trade-off between convergence speed and load smoothing. Small splay (seconds) → all nodes converge quickly, but the Chef server sees a bigger spike. Large splay (minutes or more) → smoother server load, but some nodes remain on the old version for up to the splay duration.
  • Splay matters less on pull-only systems with small read footprints. If the only shared resource the poll touches is a CDN-cached S3 object, the poll itself is cheap, and splay matters mainly for the downstream action (Chef server, package repo). For pure polling, its benefit is limited to smoothing list-call load.
  • Splay should be bounded by operational SLAs. If the compliance-run SLA is 12 hours, the splay window must be much smaller than 12 hours to avoid missing the deadline. Slack's example (15) is consistent with a sub-minute or sub-hour window; the post does not disclose units.
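
The SLA bound in the last note can be checked with simple arithmetic. The sketch below is hypothetical: the poll interval, the run duration, the 50% headroom factor, and the function names are assumptions chosen for illustration, and all values are taken to be in seconds.

```python
def worst_case_lag(splay: float, run_duration: float, poll_interval: float = 0.0) -> float:
    """Longest time from signal publication to a finished run on the unluckiest node.

    That node notices the signal a full poll interval late, draws a delay
    just under the whole splay window, and then still has to complete its run.
    """
    return poll_interval + splay + run_duration


def fits_sla(sla: float, splay: float, run_duration: float,
             poll_interval: float = 0.0, headroom: float = 0.5) -> bool:
    """True if the worst-case lag uses no more than `headroom` of the SLA."""
    return worst_case_lag(splay, run_duration, poll_interval) <= sla * headroom


TWELVE_HOURS = 12 * 60 * 60  # the 12-hour compliance SLA from the note above

# A 15-minute splay leaves ample room; a splay equal to the SLA cannot fit.
print(fits_sla(TWELVE_HOURS, splay=900, run_duration=600, poll_interval=60))
print(fits_sla(TWELVE_HOURS, splay=TWELVE_HOURS, run_duration=600))
```

The headroom factor encodes "much smaller than the SLA": it reserves part of the deadline for retries and failed runs rather than spending it all on jitter.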

Caveats

  • Units not disclosed. Slack's example "Splay": 15 doesn't specify seconds vs minutes. Chef's native usage is typically seconds, but signal-carried values may vary.
  • Distribution not disclosed. The post implies uniform randomisation in [0, splay) (Chef's default), but doesn't explicitly state it.
  • Splay alone doesn't prevent all load spikes. A 1-second splay window across 10,000 nodes still means roughly 10,000 run starts within that one second. Splay is one of several load-smoothing primitives.
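
The last caveat can be illustrated with a small simulation that counts run starts per one-second bucket. It assumes uniform draws and a splay measured in seconds, neither of which the post confirms; the function name and the fleet size are illustrative.

```python
import random
from collections import Counter


def peak_starts_per_second(nodes: int, splay_seconds: float, seed: int = 0) -> int:
    """Simulate uniform splay draws and return the busiest one-second bucket."""
    rng = random.Random(seed)  # seeded so repeated calls give the same answer
    buckets = Counter(int(rng.random() * splay_seconds) for _ in range(nodes))
    return max(buckets.values(), default=0)


for splay in (1, 60, 600):
    peak = peak_starts_per_second(10_000, splay)
    print(f"splay={splay:>3}s  average={10_000 / splay:8.1f}/s  peak={peak}/s")
```

With a 1-second window every draw lands in the same bucket, so the peak equals the fleet size. With wider windows the peak drops sharply but stays above the average, because random draws clump rather than spacing themselves evenly.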

Seen in
