PATTERN Cited by 1 source
High dropout on augmented feature layer¶
Pattern¶
When a model architecture has a layer that consumes a feature derived from synthetic / augmented data (especially when the augmentation has a structural label-leakage risk, e.g., pseudo-context derived from the positive label), apply high dropout to that layer during training so the model cannot shortcut to the augmented feature and instead must keep using the rest of its inputs.
Pinterest's canonical formulation in the Contextual Sequential CG (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models):
"A high dropout rate is used in the context layer during training to ensure the model still relies on the user's historical event sequence (the Transformer output)."
Why use it¶
When you inject a synthesised feature that is, by construction, correlated with the label (e.g., pseudo-context from the positive label), gradient descent will happily learn to read the leaked signal. Without intervention, the model:
- Over-weights the augmented feature during training (its gradient is the cleanest path to the label).
- Under-uses the legitimate features (history, demographics, other inputs).
- Degrades at serving time, when the real (non-leaked) feature replaces the synthetic one and the model has not learned to extract enough signal from the legitimate features to compensate.
High dropout on the consuming layer forces gradient to flow through the legitimate features too: because the augmented-feature path is randomly zeroed out during training, the model cannot rely on it exclusively. At inference time dropout is disabled, so the layer reads the real feature with every activation intact.
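The shortcut effect described above can be reproduced on a toy problem. The following is a minimal sketch under loud assumptions, not Pinterest's setup: a two-weight logistic regression stands in for the model, `history` is a hypothetical noisy but legitimate feature, `pseudo_ctx` copies the label (the leak), and the serving-time context is assumed to agree with the label only 70% of the time. Dropout is applied to the single context input, which here plays the role of the context layer.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_labels_and_history(n, rng):
    # y is the label; history is a legitimate but noisy feature (assumption).
    ys = [rng.randint(0, 1) for _ in range(n)]
    history = [(2 * y - 1) + rng.gauss(0.0, 1.0) for y in ys]
    return ys, history


def train(ys, history, pseudo_ctx, p_drop, rng, steps=300, lr=0.5):
    # Full-batch gradient descent on: logit = w_hist * history + w_ctx * context.
    w_hist, w_ctx = 0.0, 0.0
    keep_scale = 1.0 / (1.0 - p_drop)
    n = len(ys)
    for _ in range(steps):
        g_hist = g_ctx = 0.0
        for y, h, c in zip(ys, history, pseudo_ctx):
            # Inverted dropout on the context path only, resampled every step.
            c_in = c * keep_scale if rng.random() >= p_drop else 0.0
            err = sigmoid(w_hist * h + w_ctx * c_in) - y
            g_hist += err * h
            g_ctx += err * c_in
        w_hist -= lr * g_hist / n
        w_ctx -= lr * g_ctx / n
    return w_hist, w_ctx


def serving_accuracy(w_hist, w_ctx, ys, history, real_ctx):
    # No dropout at serving time; the real context replaces the pseudo-context.
    hits = 0
    for y, h, c in zip(ys, history, real_ctx):
        pred = 1 if w_hist * h + w_ctx * c > 0 else 0
        hits += pred == y
    return hits / len(ys)


rng = random.Random(0)
train_y, train_hist = make_labels_and_history(400, rng)
pseudo_ctx = [2 * y - 1 for y in train_y]  # leaked: a copy of the label

test_y, test_hist = make_labels_and_history(2000, rng)
# Assumed real-time context: matches the label 70% of the time.
real_ctx = [(2 * y - 1) if rng.random() < 0.7 else -(2 * y - 1) for y in test_y]

plain = train(train_y, train_hist, pseudo_ctx, 0.0, random.Random(1))
dropped = train(train_y, train_hist, pseudo_ctx, 0.8, random.Random(1))

acc_plain = serving_accuracy(*plain, test_y, test_hist, real_ctx)
acc_dropped = serving_accuracy(*dropped, test_y, test_hist, real_ctx)
print("no dropout   (w_hist, w_ctx):", plain, "serving acc:", acc_plain)
print("dropout 0.8  (w_hist, w_ctx):", dropped, "serving acc:", acc_dropped)
```

In this toy, the run with dropout should put more weight on `history` and hold up better once the leaked context is swapped for the noisier one. The 0.8 rate and the 70% agreement are illustrative choices only; the source does not disclose either.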
Mechanism¶
During training:
                      ┌─ dropout(p=high) ─┐
augmented features ──►│                   │──► layer output ──► downstream
                      └───────────────────┘
                                │
                                ▼
                  random zero-out of activations
                  forces model to find gradient
                  through legitimate features
During inference:
real-time features ──► layer output ──► downstream
                       (no dropout; full activation)
The asymmetry — heavy dropout in training, none in inference — is exactly the standard dropout protocol; what's distinctive is the deliberate use of high dropout as a leakage mitigation, not as generic overfitting prevention.
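The train/inference asymmetry can be sketched in a few lines of plain Python. This assumes the common "inverted" dropout variant, where surviving activations are rescaled by 1 / (1 - p) during training so that inference needs no adjustment. The function name `dropout`, the rate 0.8, and the all-ones stand-in for the context layer's output are illustrative assumptions, not values from the source.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: while training, zero each activation with
    probability p and rescale survivors by 1 / (1 - p).
    At inference, return the activations unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)
context_out = [1.0] * 10_000  # hypothetical context-layer activations

train_out = dropout(context_out, p=0.8, training=True, rng=rng)
serve_out = dropout(context_out, p=0.8, training=False, rng=rng)

kept_fraction = sum(1 for a in train_out if a != 0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print("kept fraction in training:", kept_fraction)
print("mean activation in training:", train_mean)
print("serving output unchanged:", serve_out == context_out)
```

The rescaling keeps the expected activation the same in both modes, which is why the downstream layers see a consistent scale when dropout is switched off at serving time.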
When to use¶
- A layer in your model consumes a synthesised / augmented feature. Especially if the augmentation is derived from labels or otherwise structurally correlated with the prediction target.
- The model has alternative legitimate features for the same prediction task. Dropout works because there are other paths gradient can take. If the augmented feature is the only way the model can predict, dropout just hurts performance.
- Training-serving parity is the goal. The augmented feature exists in training; the real (non-leaked) feature replaces it at serving time. You want the model to behave well with the real feature, not the augmented one.
When not to use¶
- The augmented feature is the model's primary learning signal. Dropping it heavily means the model can't learn what it needs to. This pattern assumes the augmentation is supplementary, not load-bearing.
- No label leakage in the augmentation. If your augmentation is genuinely independent of labels (data augmentation in vision, paraphrasing in NLP), standard dropout regularisation is sufficient — you don't need the deliberately high rate.
- Layer is small and dropout instability is a problem. Very high dropout on a thin layer can make training unstable. Pinterest doesn't disclose the rate; "high" presumably means high enough to force redundancy without destabilising training.
Companion patterns¶
This pattern pairs with synthetic pseudo-context from label: in the one documented instance (Pinterest) the two ship together, and each depends on the other wherever the augmentation carries label-leakage risk. Without high dropout, a model trained on pseudo-context is likely to degrade at serving time. Without pseudo-context augmentation, there is no leaked feature to regularise away from.
Related patterns at adjacent altitudes:
- Standard dropout as overfitting regulariser: same mechanism, different motivation. Standard dropout typically uses moderate rates (~0.1–0.5) to limit overfitting; this pattern uses higher rates specifically to keep the model from depending on a label-leaked feature.
- Feature masking / random feature dropout: randomly zero entire features (not individual activations) during training. A stronger version of the same idea; appropriate when you want the model to stay robust to features that are missing at serving time.
- Mixup / label-mixing augmentations: a different family of regularisation; not specifically a leakage mitigation.
Hazards¶
Dropout rate is a tuning surface¶
Too low: leakage isn't mitigated; model still shortcuts. Too high: the augmented feature contributes nothing during training, so the model never learns to use the real feature at serving time. Pinterest's "high dropout rate" is qualitative.
Dropout doesn't eliminate leakage, just attenuates it¶
Dropout masks are sampled independently per activation and per example, so on any given training example some of the leaked path's activations usually survive, and the model can still learn shortcut patterns through them. The empirical question is whether serving-time performance with the real feature is strong; Pinterest's online wins suggest it is in this case.
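A quick calculation shows why high dropout attenuates rather than removes the leaked path. Assuming standard element-wise dropout with independent masks, the chance that at least one of a layer's units survives on a given example is 1 - p^width. The rates and widths below are hypothetical, since the source discloses neither.

```python
def p_any_survivor(p, width):
    # Probability that at least one of `width` independent units is kept.
    return 1.0 - p ** width


def expected_survivors(p, width):
    # Mean number of units kept per example.
    return width * (1.0 - p)


for width in (1, 16, 64):
    print(
        f"p=0.9 width={width:>2}: "
        f"P(any unit survives)={p_any_survivor(0.9, width):.4f}, "
        f"expected survivors={expected_survivors(0.9, width):.1f}"
    )
```

Even at p = 0.9, a 16-unit layer passes some leaked signal on roughly four examples out of five, while a single-unit layer passes it only one time in ten. The same arithmetic explains why very high rates on thin layers give a noisy training signal.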
Loss curves can mask the problem¶
Training loss may improve smoothly even when the model is over-relying on leakage (the leaked path keeps producing strong gradient through the dropout-survivor activations). The diagnostic is offline evaluation on real (non-augmented) feature distributions, not training loss.
Caveats¶
- Single named instance on the wiki. Pinterest is the only documented case under this exact framing. Similar regularisation techniques likely exist in other systems but the named pattern is not standard nomenclature.
- Dropout rate value undisclosed. Pinterest doesn't quantify "high."
- Complement, not substitute, for evaluation rigour. Even with dropout, the model should be evaluated on real (non-pseudo) feature distributions to confirm the leakage didn't break serving-time performance.
Seen in¶
- 2026-05-08 Pinterest — Enhancing Ad Relevance (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models) — canonical wiki instance. High dropout on the context layer during training, paired with pseudo-context augmentation derived from positive labels, so the model continues relying on the historical-sequence Transformer encoder rather than shortcutting to the leaked context signal.