PATTERN Cited by 1 source

Skip on missing allowlist for safety¶

Problem¶

A producer in a Send-What-You-Use serving path needs an allowlist (per consumer, per version) to trim its payload. What should happen when that allowlist is missing or unknown at request time? A miss can occur for any of these reasons:

The model is newly deployed and its allowlist hasn't propagated yet (rollout gap).
The allowlist artefact was corrupt and the per-bundle map fell back to a stale version that doesn't know about this model.
The request names a model / version combination the producer has no record of.
A bug in the trimmer module itself failed to find the allowlist it should have found.

Two extreme options:

Fail closed: reject or drop the request. This guarantees the consumer never receives a wrongly trimmed feature set, but it turns a trimmer-plane issue into a serving-plane outage.
Fail open: pass the untrimmed payload through. This preserves serving-plane availability, but it forfeits the trim optimisation for that request, returning it to the pre-trimmer, network-bound state.

Solution¶

Pass the untrimmed payload through, unchanged, when the allowlist lookup misses. The trimmer is an optimisation layered on top of a working serving path, and an optimisation must not be allowed to break the path beneath it.
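
A minimal sketch of the lookup-and-passthrough logic, assuming a hypothetical in-memory map keyed by (model name, version) and a payload modelled as a plain dict of features. The names `trim_request` and `AllowlistMap`, and the payload shape, are illustrative assumptions rather than Pinterest's actual API.

```python
from typing import Dict, FrozenSet, Mapping, Tuple

# Hypothetical shape: (model_name, version) -> allowed feature names.
AllowlistMap = Mapping[Tuple[str, str], FrozenSet[str]]


def trim_request(
    features: Dict[str, object],
    model: str,
    version: str,
    allowlists: AllowlistMap,
) -> Tuple[Dict[str, object], bool]:
    """Return (payload, trimmed).

    On an allowlist miss the payload is passed through unchanged
    instead of being rejected.
    """
    allowlist = allowlists.get((model, version))
    if allowlist is None:
        # Skip on miss: fail open so the serving path stays up.
        return features, False
    # Keep only the features the consumer has declared it uses.
    trimmed = {name: value for name, value in features.items() if name in allowlist}
    return trimmed, True
```

Returning the `trimmed` flag alongside the payload is a design choice in this sketch: it lets the caller count passthroughs without the trimmer itself ever raising on a miss.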

Pinterest's articulation¶

From the 2026-05-01 Feature Trimmer post (Source: sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer):

"If no feature allowlist exists for the model, the request proceeds untrimmed."

Later, on the specific rollout-gap case:

"We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs, which can cause a temporary rollout gap. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions."

Two layers of safety here:

Operationally: ship both current and pending allowlists during rolling deploys, so version-specific misses are rare.
Structurally: when a miss still happens, skip trimming rather than reject. The serving path stays up, just temporarily carrying extra bytes.
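
The operational layer can be sketched as a merge step at artifact build time. This is an illustration only: `build_root_artifact`, the `(model, version)` key shape, and the conflict check are assumptions, not details taken from Pinterest's post.

```python
from typing import Dict, FrozenSet, Tuple

Key = Tuple[str, str]  # hypothetical (model_name, version) key


def build_root_artifact(
    current: Dict[Key, FrozenSet[str]],
    pending: Dict[Key, FrozenSet[str]],
) -> Dict[Key, FrozenSet[str]]:
    """Merge current and pending allowlists into one root artifact.

    Shipping both means the root already knows the pending version
    before any pending leaf starts receiving requests.
    """
    # Assumed safety check: the same (model, version) must not map to
    # two different allowlists.
    overlap = current.keys() & pending.keys()
    conflicts = sorted(k for k in overlap if current[k] != pending[k])
    if conflicts:
        raise ValueError(f"conflicting allowlists for {conflicts}")
    return {**current, **pending}
```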

Why it's the right default for Send-What-You-Use¶

The trimmer sits on the critical request path. Pinterest's explicit framing:

"The adoption of the feature trimmer is expected to reduce network bandwidth consumption for root-leaf connections. This places the trimmer on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures. Therefore, robust handling of artifact corruption or deployment failures is essential."

A fail-closed trimmer that rejects requests it can't understand turns any config propagation failure into a partial serving outage. A fail-open trimmer degrades to the pre-trimmer network state for affected requests — painful but not broken. Pinterest picks the second trade-off because the trimmer exists to save cost / bandwidth, not to gate correctness.

This is structurally the same choice as Cloudflare's (concepts/fail-stale) ladder: use-last-known-good > fail-open > fail-closed when the system can be made to tolerate a temporary degradation. Feature Trimmer uses fail-open (untrimmed passthrough) because there's no "last known good allowlist for a model we've never heard of" to fall back to.

When it fits¶

The optimisation is non-essential to correctness. Trimming is about efficiency; inference correctness is determined by the leaf model's feature converter, not the root's trimmer.
The serving path can tolerate temporary degradation. Pinterest can absorb a brief rollout gap of untrimmed requests without a latency blowup, especially with the fbthrift lz4 compression lever still active.
Misses are rare and bounded. Well-sequenced staged deploys (patterns/artifact-rides-model-deploy-pipeline) keep genuine misses short-lived.
Observability catches persistent misses. On-call alerting on init-time parse failures + runtime miss-rate metrics would detect a stuck miss.

When it doesn't fit¶

The filtering is a security boundary. Skip-on-miss would then be skip-on-unknown-user, which is the wrong fail direction. Security filters should fail closed (concepts/fail-open-vs-fail-closed at security altitude).
The "untrimmed" fallback doesn't exist. If the trimming is structural (e.g. "extract the query predicate from this gRPC request"), there is no untrimmed form to fall back to.
The downstream tier can't handle the untrimmed form. If leaves were strictly-typed to the trimmed allowlist and would reject unknown fields, passthrough would make things worse.

Complementary safeguards¶

Three complementary safeguards in the Feature Trimmer design:

Init-failure guardrail: parse failures on boot alert on-call but don't block host launch. "This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself." This is the same fail-open posture, applied at the module-lifecycle altitude.
Per-bundle failure isolation — a bad bundle's stale map only affects its bundle, not the whole consolidated map. (See patterns/file-watcher-atomic-swap-consolidated-map.)
Backward-compatible rolling-deploy artefact — current + pending-version allowlists both shipped to avoid the window where a pending leaf is receiving requests the root hasn't been told about.

Together these three turn a "trimmer blast radius" into a "trimmer availability story": each failure is localised, each failure is observable, and the worst case is lost trim savings, not lost requests.
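
The first two safeguards can be illustrated together. The sketch below assumes bundles arrive as JSON strings and that each bundle parses to a dict; the function name `reload_bundles`, the JSON format, and the logger name are hypothetical, chosen only to show per-bundle isolation with a non-blocking parse failure.

```python
import json
import logging
from typing import Dict, List

log = logging.getLogger("trimmer")  # hypothetical logger name

# Hypothetical layout of one parsed bundle: "model:version" -> feature names.
BundleMap = Dict[str, List[str]]


def reload_bundles(
    raw_bundles: Dict[str, str],
    last_good: Dict[str, BundleMap],
) -> Dict[str, BundleMap]:
    """Parse each bundle independently and never raise to the caller.

    A bundle that fails to parse keeps its last-known-good map (or is
    simply absent if it never loaded). Other bundles are unaffected.
    """
    result: Dict[str, BundleMap] = {}
    for name, raw in raw_bundles.items():
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("bundle must be a JSON object")
            result[name] = parsed
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            # Alert, but do not block startup or the other bundles.
            log.error("allowlist bundle %s failed to parse: %s", name, exc)
            if name in last_good:
                result[name] = last_good[name]
    return result
```

A bundle that has never loaded successfully ends up with no entries at all, so requests for its models miss the lookup and proceed untrimmed.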

Failure modes¶

Silent allowlist staleness masks a real bug. If a model's allowlist is present but wrong (stale or mis-derived) rather than missing, skip-on-miss never triggers and the leaf sees trimmed payloads that lack features it actually needs. This is a deeper correctness failure that the pattern does not catch. It is mitigated by the "signatures are stable across versions" invariant and by the leaf's own feature converter being the authoritative feature selector.
Persistent miss not alerted. If a model is rolled out but its allowlist never arrives, the trimmer silently passes untrimmed payloads indefinitely. A runtime metric (trim_miss_rate per model) would catch this; Pinterest's post doesn't mention such a metric explicitly.
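
A per-model miss-rate metric of the kind suggested above could look like the following. This is a hypothetical in-process sketch: `TrimMissTracker`, its thresholds, and the idea of polling `stuck_models` are assumptions, and a real service would export the counters to its own metrics system instead.

```python
from collections import Counter
from typing import Dict


class TrimMissTracker:
    """Hypothetical per-model counters for allowlist misses."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.misses: Counter = Counter()

    def record(self, model: str, trimmed: bool) -> None:
        """Count one request; an untrimmed passthrough counts as a miss."""
        self.requests[model] += 1
        if not trimmed:
            self.misses[model] += 1

    def miss_rate(self, model: str) -> float:
        total = self.requests[model]
        return self.misses[model] / total if total else 0.0

    def stuck_models(self, threshold: float, min_requests: int) -> Dict[str, float]:
        """Models with enough traffic whose miss rate is at or above threshold."""
        stuck: Dict[str, float] = {}
        for model, total in self.requests.items():
            rate = self.miss_rate(model)
            if total >= min_requests and rate >= threshold:
                stuck[model] = rate
        return stuck
```

The `min_requests` floor is there so that a brief rollout gap on a low-traffic model does not page anyone; only a sustained miss on real traffic is reported.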

Sibling patterns¶

concepts/fail-open-vs-fail-closed — the general framing this pattern's choice lives inside.
concepts/fail-stale — Cloudflare's preferred posture: prefer last-known-good config to fail-open; here there's no "last known good" for an unknown model so fail-open is the next rung.
Seccomp / syscall allowlist with permissive mode — Linux-kernel altitude analogue; permissive mode logs violations without blocking.
Web-application-firewall in detection mode — same fail-open posture at WAF altitude during rollout.

Seen in¶

2026-05-01 Pinterest: Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer (sources/2026-05-01-pinterest-optimizing-ml-workload-network-efficiency-part-i-feature-trimmer). Canonical instance, with three layers of skip-on-miss (no model allowlist → untrimmed; version unknown → latest-version fallback; init parse failure → alert but don't block launch) and explicit reasoning about keeping trimmer failures from becoming serving-availability failures.

Related¶

concepts/send-what-you-use — the optimisation this pattern protects the availability of.
concepts/fail-open-vs-fail-closed — the general framing.
concepts/fail-stale — sibling preferred-posture where last-known-good is available.
systems/pinterest-feature-trimmer — the canonical production instance.
patterns/feature-allowlist-over-blocklist — the allowlist representation this pattern protects.
patterns/artifact-rides-model-deploy-pipeline — the operational prevention layer (minimise misses).