PATTERN Cited by 1 source
Skip on missing allowlist for safety¶
Problem¶
A producer in a Send-What-You-Use serving path needs an allowlist (per consumer, per version) to trim its payload. What should happen when the allowlist is missing or unknown at request time? A miss can occur because:
- The model is newly deployed and its allowlist hasn't propagated yet (rollout gap).
- The allowlist artefact was corrupt and the per-bundle map fell back to a stale version that doesn't know about this model.
- The request names a model / version combination the producer has no record of.
- A bug in the trimmer module itself failed to find the allowlist it should have found.
Two extreme options:
- Fail closed: reject / drop the request. Safe against sending the wrong features, but turns a trimmer-plane issue into a serving-plane outage.
- Fail open: pass the untrimmed payload through. Safe against serving-plane availability loss, but undoes the trim optimisation for that request (returning to the pre-trimmer network-bound state).
Solution¶
Pass the untrimmed payload through, unchanged, when the allowlist lookup misses. The trimmer is an optimisation on top of a working serving path — an optimisation cannot be allowed to break the underlying serving path.
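A minimal sketch of the passthrough-on-miss rule, assuming a hypothetical in-memory map keyed by (model name, version) and a payload modelled as a plain feature dict. The names `trim_features` and `Allowlists` are illustrative and do not come from Pinterest's post.

```python
from typing import Dict, FrozenSet, Mapping, Tuple

# Hypothetical shape: (model_name, version) -> allowed feature names.
Allowlists = Mapping[Tuple[str, str], FrozenSet[str]]


def trim_features(
    features: Dict[str, object],
    model: str,
    version: str,
    allowlists: Allowlists,
) -> Dict[str, object]:
    """Return only allowlisted features, or the payload unchanged on a miss."""
    allowed = allowlists.get((model, version))
    if allowed is None:
        # Lookup miss: skip trimming rather than reject the request.
        return features
    return {name: value for name, value in features.items() if name in allowed}


allowlists = {("ranker", "v1"): frozenset({"user_id", "pin_id"})}
payload = {"user_id": 1, "pin_id": 2, "unused_embedding": [0.1, 0.2]}

trimmed = trim_features(payload, "ranker", "v1", allowlists)    # known version
untrimmed = trim_features(payload, "ranker", "v2", allowlists)  # unknown version
```

The miss check uses `is None` rather than truthiness, so an allowlist that exists but is empty still trims; only a genuinely absent entry triggers passthrough.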
Pinterest's articulation¶
From the 2026-05-01 Feature Trimmer post (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer):
"If no feature allowlist exists for the model, the request proceeds untrimmed."
Later, on the specific rollout-gap case:
"We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs, which can cause a temporary rollout gap. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions."
Two layers of safety here:
- Operationally: ship both current and pending allowlists during rolling deploys, so version-specific misses are rare.
- Structurally: when a miss still happens, skip trimming rather than reject. The serving path stays up, just temporarily carrying extra bytes.
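The operational layer can be sketched as an artefact that carries allowlists for both versions at once. This is an illustration under assumptions: the function `build_root_artifact` and the dict layout are hypothetical, and the post does not describe the real artefact format.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

AllowlistKey = Tuple[str, str]  # (model_name, version)


def build_root_artifact(
    model: str,
    current: Tuple[str, Iterable[str]],
    pending: Tuple[str, Iterable[str]],
) -> Dict[AllowlistKey, FrozenSet[str]]:
    """Bundle allowlists for both the current and the pending version."""
    artifact: Dict[AllowlistKey, FrozenSet[str]] = {}
    for version, feature_names in (current, pending):
        artifact[(model, version)] = frozenset(feature_names)
    return artifact


artifact = build_root_artifact(
    "ranker",
    current=("v1", ["user_id", "pin_id"]),
    pending=("v2", ["user_id", "pin_id", "board_id"]),
)
```

With both keys present before the leaf rollout begins, requests for either version hit an allowlist; a version outside the pair still misses and falls through to the structural layer.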
Why it's the right default for Send-What-You-Use¶
The trimmer sits on the critical request path. Pinterest's explicit framing:
"The adoption of the feature trimmer is expected to reduce network bandwidth consumption for root-leaf connections. This places the trimmer on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures. Therefore, robust handling of artifact corruption or deployment failures is essential."
A fail-closed trimmer that rejects requests it can't understand turns any config propagation failure into a partial serving outage. A fail-open trimmer degrades to the pre-trimmer network state for affected requests — painful but not broken. Pinterest picks the second trade-off because the trimmer exists to save cost / bandwidth, not to gate correctness.
This is structurally the same choice as Cloudflare's (concepts/fail-stale) ladder: use-last-known-good > fail-open > fail-closed when the system can be made to tolerate a temporary degradation. Feature Trimmer uses fail-open (untrimmed passthrough) because there's no "last known good allowlist for a model we've never heard of" to fall back to.
When it fits¶
- The optimisation is non-essential to correctness. Trimming is about efficiency; inference correctness is determined by the leaf model's feature converter, not the root's trimmer.
- The serving path can tolerate temporary degradation. Pinterest can absorb a rollout-gap minute of untrimmed requests without latency blowup, especially with the fbthrift lz4 compression lever still active.
- Misses are rare and bounded. Well-sequenced staged deploys (patterns/artifact-rides-model-deploy-pipeline) keep genuine misses short-lived.
- Observability catches persistent misses. On-call alerting on init-time parse failures + runtime miss-rate metrics would detect a stuck miss.
When it doesn't fit¶
- The filtering is a security boundary. Skip-on-miss would then be skip-on-unknown-user, which is the wrong fail direction. Security filters should fail closed (concepts/fail-open-vs-fail-closed at security altitude).
- The "untrimmed" fallback doesn't exist. If the trimming is structural (e.g. "extract the query predicate from this gRPC request"), there's no untrimmed form to fall back to.
- The downstream tier can't handle the untrimmed form. If leaves were strictly-typed to the trimmed allowlist and would reject unknown fields, passthrough would make things worse.
Related invariants at Pinterest¶
Three complementary safeguards in the Feature Trimmer design:
- Init-failure guardrail: parse failures on boot alert on-call but don't block host launch. "This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself." Same fail-open posture at the module-lifecycle altitude.
- Per-bundle failure isolation: a bad bundle's stale map only affects its bundle, not the whole consolidated map. (See patterns/file-watcher-atomic-swap-consolidated-map.)
- Backward-compatible rolling-deploy artefact: current + pending-version allowlists both shipped to avoid the window where a pending leaf is receiving requests the root hasn't been told about.
Together these three turn a "trimmer blast radius" into a "trimmer availability story": each failure is localised, each failure is observable, and the worst case is lost trim savings, not lost requests.
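A rough sketch of how the first two safeguards might combine, assuming bundles arrive as JSON strings and that a logged error stands in for the on-call alert. The JSON layout and the names `parse_bundle` and `reload_bundles` are hypothetical; the post does not specify the artefact encoding.

```python
import json
import logging
from typing import Dict, FrozenSet, Tuple

logger = logging.getLogger("feature_trimmer_sketch")

BundleMap = Dict[Tuple[str, str], FrozenSet[str]]


def parse_bundle(raw: str) -> BundleMap:
    """Parse one bundle; raises on corrupt or malformed input."""
    parsed = json.loads(raw)
    return {
        (entry["model"], entry["version"]): frozenset(entry["features"])
        for entry in parsed
    }


def reload_bundles(
    raw_bundles: Dict[str, str],
    last_good: Dict[str, BundleMap],
) -> Dict[str, BundleMap]:
    """Reparse every bundle; a corrupt bundle keeps only its own last good map."""
    result: Dict[str, BundleMap] = {}
    for bundle_name, raw in raw_bundles.items():
        try:
            result[bundle_name] = parse_bundle(raw)
        except (ValueError, KeyError, TypeError):
            # Alert, but do not raise: the host keeps launching and serving.
            logger.error("allowlist bundle %s failed to parse", bundle_name)
            result[bundle_name] = last_good.get(bundle_name, {})
    return result


good = '[{"model": "ranker", "version": "v1", "features": ["user_id"]}]'
state = reload_bundles({"home": good, "search": "{corrupt"}, last_good={})
```

On first boot the corrupt bundle has no last good map, so it resolves to an empty map and every lookup for its models misses, which the skip-on-miss rule turns into untrimmed passthrough rather than an error.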
Failure modes¶
- Silent allowlist staleness masks a real bug. If a model's allowlist is wrong (not missing — present but mis-derived), skip-on-miss won't trigger and the leaf will see trimmed payloads missing features it actually needs. This is a deeper correctness failure the pattern does not catch. Mitigated by the "signatures are stable across versions" invariant + the leaf's own feature converter being the authoritative feature-selector.
- Persistent miss not alerted. If a model is rolled out but its allowlist never shows up, the trimmer silently passes untrimmed forever. A runtime metric (trim_miss_rate per model) would catch this; Pinterest's post doesn't mention such a metric explicitly.
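One way such a per-model miss-rate metric could look, as a sketch only: the class `TrimMissTracker`, its thresholds, and the in-process counters are assumptions for illustration, and the post describes no such component.

```python
from collections import Counter
from typing import Dict, List


class TrimMissTracker:
    """Counts allowlist lookups and misses per model (hypothetical helper)."""

    def __init__(self) -> None:
        self.lookups: Counter = Counter()
        self.misses: Counter = Counter()

    def record(self, model: str, hit: bool) -> None:
        """Record one allowlist lookup and whether it found an entry."""
        self.lookups[model] += 1
        if not hit:
            self.misses[model] += 1

    def miss_rates(self) -> Dict[str, float]:
        """Fraction of lookups that missed, per model seen so far."""
        return {m: self.misses[m] / total for m, total in self.lookups.items()}

    def stuck_models(self, threshold: float, min_lookups: int) -> List[str]:
        """Models at or above the miss-rate threshold with enough traffic."""
        return sorted(
            model
            for model, total in self.lookups.items()
            if total >= min_lookups and self.misses[model] / total >= threshold
        )


tracker = TrimMissTracker()
for _ in range(4):
    tracker.record("new_model", hit=False)  # allowlist never arrived
for hit in (True, True, True, False):
    tracker.record("ranker", hit=hit)       # one transient rollout-gap miss

alerts = tracker.stuck_models(threshold=0.9, min_lookups=3)
```

The `min_lookups` floor keeps a single early miss on a low-traffic model from paging anyone; a model that misses on nearly every request over a sustained window is the signal worth alerting on.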
Sibling patterns¶
- concepts/fail-open-vs-fail-closed — the general framing this pattern's choice lives inside.
- concepts/fail-stale — Cloudflare's preferred posture: prefer last-known-good config to fail-open; here there's no "last known good" for an unknown model so fail-open is the next rung.
- Seccomp / syscall allowlist with permissive mode — Linux-kernel altitude analogue; permissive mode logs violations without blocking.
- Web-application-firewall in detection mode — same fail-open posture at WAF altitude during rollout.
Seen in¶
- 2026-05-01 Pinterest — Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer (sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer) — canonical; three layers of skip-on-miss (no model allowlist → untrimmed; version unknown → latest-version fallback; init parse failure → alert but don't block launch); explicit reasoning about keeping the trimmer off the failure-availability path.
Related¶
- concepts/send-what-you-use — the optimisation this pattern protects the availability of.
- concepts/fail-open-vs-fail-closed — the general framing.
- concepts/fail-stale — sibling preferred-posture where last-known-good is available.
- systems/pinterest-feature-trimmer — the canonical production instance.
- patterns/feature-allowlist-over-blocklist — the allowlist representation this pattern protects.
- patterns/artifact-rides-model-deploy-pipeline — the operational prevention layer (minimise misses).