
PATTERN Cited by 1 source

Skip on missing allowlist for safety

Problem

A producer in a Send-What-You-Use serving path needs an allowlist (per consumer, per version) to trim its payload. What should happen when the allowlist is missing or unknown at request time? A miss can arise in several ways:

  • The model is newly deployed and its allowlist hasn't propagated yet (rollout gap).
  • The allowlist artefact was corrupt and the per-bundle map fell back to a stale version that doesn't know about this model.
  • The request names a model / version combination the producer has no record of.
  • A bug in the trimmer module itself failed to find the allowlist it should have found.

Two extreme options:

  1. Fail closed: reject or drop the request. This guards against serving with the wrong features, but turns a trimmer-plane issue into a serving-plane outage.
  2. Fail open: pass the untrimmed payload through. This preserves serving-plane availability, but forfeits the trim optimisation for that request (returning to the pre-trimmer, network-bound state).

Solution

Pass the untrimmed payload through, unchanged, when the allowlist lookup misses. The trimmer is an optimisation on top of a working serving path — an optimisation cannot be allowed to break the underlying serving path.
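
A minimal sketch of the lookup-then-passthrough behaviour described above. The function name `trim_features`, the `(model, version)` tuple key, and the dict-based payload are assumptions made for illustration; the Pinterest post does not show the trimmer's actual interface.

```python
from typing import Dict, FrozenSet, Mapping, Tuple

# Hypothetical shapes: allowlists are keyed by (model name, model version).
AllowlistKey = Tuple[str, str]
Allowlists = Dict[AllowlistKey, FrozenSet[str]]


def trim_features(
    features: Mapping[str, object],
    model: str,
    version: str,
    allowlists: Allowlists,
) -> Mapping[str, object]:
    """Return only allowlisted features, or the input unchanged on a miss."""
    allowlist = allowlists.get((model, version))
    if allowlist is None:
        # Lookup miss: skip trimming and pass the payload through untouched.
        return features
    return {name: value for name, value in features.items() if name in allowlist}
```

On a hit the payload shrinks to the allowlisted features; on a miss the caller gets back the same object it passed in, so the request proceeds exactly as it would have without a trimmer.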

Pinterest's articulation

From the 2026-05-01 Feature Trimmer post (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer):

"If no feature allowlist exists for the model, the request proceeds untrimmed."

Later, on the specific rollout-gap case:

"We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs, which can cause a temporary rollout gap. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions."

Two layers of safety here:

  1. Operationally: ship both current and pending allowlists during rolling deploys, so version-specific misses are rare.
  2. Structurally: when a miss still happens, skip trimming rather than reject. The serving path stays up, just temporarily carrying extra bytes.
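
The operational layer can be sketched as a merge step that produces one root artifact covering both versions. This is an illustration only: `build_root_artifact` and the in-memory map shape are hypothetical, and the post does not describe how Pinterest assembles the artifact.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical shapes: allowlists are keyed by (model name, model version).
AllowlistKey = Tuple[str, str]
Allowlists = Dict[AllowlistKey, FrozenSet[str]]


def build_root_artifact(current: Allowlists, pending: Allowlists) -> Allowlists:
    """Merge current and pending allowlists into one backwards-compatible map."""
    artifact = dict(current)  # copy, so the live map is not mutated
    for key, allowlist in pending.items():
        # Add entries for pending versions; never overwrite a live version's entry.
        artifact.setdefault(key, allowlist)
    return artifact
```

Because the root carries entries for both versions before any leaf rolls forward, requests for either version find an allowlist, and the skip-on-miss path is only exercised when something has actually gone wrong.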

Why it's the right default for Send-What-You-Use

The trimmer sits on the critical request path. Pinterest's explicit framing:

"The adoption of the feature trimmer is expected to reduce network bandwidth consumption for root-leaf connections. This places the trimmer on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures. Therefore, robust handling of artifact corruption or deployment failures is essential."

A fail-closed trimmer that rejects requests it can't understand turns any config propagation failure into a partial serving outage. A fail-open trimmer degrades to the pre-trimmer network state for affected requests — painful but not broken. Pinterest picks the second trade-off because the trimmer exists to save cost / bandwidth, not to gate correctness.

This is structurally the same choice as Cloudflare's (concepts/fail-stale) ladder: use-last-known-good > fail-open > fail-closed when the system can be made to tolerate a temporary degradation. Feature Trimmer uses fail-open (untrimmed passthrough) because there's no "last known good allowlist for a model we've never heard of" to fall back to.

When it fits

  • The optimisation is non-essential to correctness. Trimming is about efficiency; inference correctness is determined by the leaf model's feature converter, not the root's trimmer.
  • The serving path can tolerate temporary degradation. Pinterest can absorb a rollout-gap minute of untrimmed requests without latency blowup, especially with the fbthrift lz4 compression lever still active.
  • Misses are rare and bounded. Well-sequenced staged deploys (patterns/artifact-rides-model-deploy-pipeline) keep genuine misses short-lived.
  • Observability catches persistent misses. On-call alerting on init-time parse failures + runtime miss-rate metrics would detect a stuck miss.

When it doesn't fit

  • The filtering is a security boundary. Skip-on-miss would then be skip-on-unknown-user, which is the wrong fail direction. Security filters should fail closed (concepts/fail-open-vs-fail-closed at security altitude).
  • The "untrimmed" fallback doesn't exist. If the trimming is structural (e.g. "extract the query predicate from this gRPC request"), there's no untrimmed form to fall back to.
  • The downstream tier can't handle the untrimmed form. If leaves were strictly-typed to the trimmed allowlist and would reject unknown fields, passthrough would make things worse.

Complementary safeguards

The Feature Trimmer design pairs skip-on-miss with three complementary safeguards:

  1. Init-failure guardrail: parse failures on boot alert on-call but don't block host launch. "This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself." Same fail-open posture at the module-lifecycle altitude.
  2. Per-bundle failure isolation — a bad bundle's stale map only affects its bundle, not the whole consolidated map. (See patterns/file-watcher-atomic-swap-consolidated-map.)
  3. Backward-compatible rolling-deploy artefact — current + pending-version allowlists both shipped to avoid the window where a pending leaf is receiving requests the root hasn't been told about.

Together these three turn a "trimmer blast radius" into a "trimmer availability story": each failure is localised, each failure is observable, and the worst case is lost trim savings, not lost requests.
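
The first two safeguards can be sketched together as a reload routine that isolates failures per bundle and never raises. Everything here is an assumption for illustration: the JSON encoding, the `"model:version"` key strings, and the names `parse_bundle` and `reload_bundles` are hypothetical, and the logger call stands in for whatever alerting the real system uses.

```python
import json
import logging
from typing import Dict, FrozenSet, List, Tuple

logger = logging.getLogger("feature_trimmer")

# Hypothetical shape: one bundle maps "model:version" -> allowed feature names.
BundleMap = Dict[str, FrozenSet[str]]


def parse_bundle(text: str) -> BundleMap:
    """Parse one bundle's JSON text; raise ValueError/TypeError if malformed."""
    parsed = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(parsed, dict):
        raise ValueError("bundle must be a JSON object")
    return {key: frozenset(features) for key, features in parsed.items()}


def reload_bundles(
    raw_bundles: Dict[str, str],
    previous: Dict[str, BundleMap],
) -> Tuple[Dict[str, BundleMap], List[str]]:
    """Rebuild the consolidated map, keeping stale data for bundles that fail."""
    consolidated: Dict[str, BundleMap] = {}
    failed: List[str] = []
    for bundle, text in raw_bundles.items():
        try:
            consolidated[bundle] = parse_bundle(text)
        except (ValueError, TypeError) as exc:
            # Report the failure, but do not raise: the host must still launch.
            failed.append(bundle)
            logger.error("allowlist bundle %s failed to parse: %s", bundle, exc)
            if bundle in previous:
                # Only this bundle falls back to its last good map.
                consolidated[bundle] = previous[bundle]
    return consolidated, failed
```

At boot `previous` is empty, so a corrupt bundle simply contributes no allowlists and its models take the untrimmed passthrough path; on later reloads the same bundle keeps serving its stale map while every other bundle picks up fresh data.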

Failure modes

  • Silent allowlist staleness masks a real bug. If a model's allowlist is present but mis-derived (wrong rather than missing), skip-on-miss won't trigger and the leaf will see trimmed payloads missing features it actually needs. This is a deeper correctness failure that the pattern does not catch. It is mitigated by the "signatures are stable across versions" invariant and by the leaf's own feature converter being the authoritative feature selector.
  • Persistent miss goes unalerted. If a model is rolled out but its allowlist never shows up, the trimmer silently passes untrimmed payloads forever. A runtime metric (trim_miss_rate per model) would catch this; Pinterest's post doesn't mention such a metric explicitly.
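
One way a per-model `trim_miss_rate` could be tracked is sketched below. The `TrimMetrics` class, its method names, and the threshold values are hypothetical; as noted above, the post does not describe such a metric, so this is an assumption about what a detector might look like rather than Pinterest's implementation.

```python
from collections import Counter
from typing import List


class TrimMetrics:
    """In-process counters for allowlist lookups, tracked per model."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.misses: Counter = Counter()

    def record(self, model: str, hit: bool) -> None:
        """Count one lookup for `model`; `hit` is False when trimming was skipped."""
        self.requests[model] += 1
        if not hit:
            self.misses[model] += 1

    def miss_rate(self, model: str) -> float:
        """Fraction of lookups that missed; 0.0 for a model with no traffic."""
        total = self.requests[model]
        return self.misses[model] / total if total else 0.0

    def stuck_models(self, threshold: float, min_requests: int) -> List[str]:
        """Models with enough traffic whose miss rate is at or above `threshold`."""
        return sorted(
            model
            for model, total in self.requests.items()
            if total >= min_requests and self.miss_rate(model) >= threshold
        )
```

A short rollout gap shows up as a brief bump in the miss rate, whereas a model whose allowlist never arrived stays near 1.0; alerting on `stuck_models` over a sustained window would separate the two cases.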

Sibling patterns

  • concepts/fail-open-vs-fail-closed — the general framing this pattern's choice lives inside.
  • concepts/fail-stale — Cloudflare's preferred posture: prefer last-known-good config to fail-open; here there's no "last known good" for an unknown model so fail-open is the next rung.
  • Seccomp / syscall allowlist with permissive mode — Linux-kernel altitude analogue; permissive mode logs violations without blocking.
  • Web-application-firewall in detection mode — same fail-open posture at WAF altitude during rollout.

Seen in
