
CONCEPT Cited by 1 source

Pseudo-context augmentation

Definition

Pseudo-context augmentation is a training-time data-augmentation technique for models that must learn to consume a request-time-only feature that is, by construction, unavailable in training data. The training pipeline synthesises a pseudo-version of the feature, typically derived from the positive label or from contextually related artefacts already in the training example, so the model learns how to consume a feature of that shape. At serving time the pseudo-feature is replaced by the real one.

Pinterest's canonical instantiation in its Contextual Sequential CG (sources/2026-05-08-pinterest-enhancing-ad-relevance-integrating-real-time-context-into-sequential-recommender-models):

"During model training, we artificially inject pseudo-context information derived from the positive label (the conversion event) into the input sequence. For example, by projecting the interest category features from the positive item, we encourage the model to retrieve items that are semantically related to the context associated with that user session."

Why pseudo-context is sometimes necessary

Real-time context (the page the user is on, the query they typed, the song they're playing) only exists at serving time. Logged training events (offsite conversions, server-side action records) typically don't have an attached "what the user was looking at on our platform during this event" signal — and even when they do, the join is sparse:

"We cannot guarantee that a user has viewed ad impressions on Related Pins between two sequential offsite events."

Pinterest considered merging real onsite context data with offsite history but rejected it: the join is technically hard and the data is too sparse. More generally, a team facing a request-time-only feature has three options:

  1. Don't use the real-time feature. Loses the real-time intent signal — the model is stuck with stale embeddings.
  2. Build the real-context training pipeline. Expensive, often impractical, and produces sparse data.
  3. Synthesise pseudo-context from training-time-available data. Cheap, dense, but introduces a label-leakage hazard.

Pseudo-context augmentation is option 3.

How pseudo-context is generated

The pseudo-context is generated to share the input shape with the real serving-time context, so the same model architecture works in both phases:

Training time:
  positive label → feature projection → pseudo-context (same shape as real context)

Serving time:
  real-time observation → feature projection → real context

In Pinterest's case, the "feature projection" is "projecting the interest category features from the positive item" — interest-category embeddings work for both the subject Pin at serving time and the conversion item at training time, so the context layer sees the same shape.
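The shared-shape requirement can be sketched in a few lines. This is a minimal illustration, not Pinterest's implementation: the source does not name the projection function, so `project_interest_categories`, the `Item` type, the toy category vocabulary, and the multi-hot encoding are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical interest-category vocabulary; the real taxonomy is not given in the source.
CATEGORIES = ["home_decor", "fashion", "food", "travel"]


@dataclass(frozen=True)
class Item:
    """A Pin or conversion item tagged with interest categories (illustrative type)."""
    item_id: str
    interest_categories: Tuple[str, ...]


def project_interest_categories(item: Item) -> List[float]:
    """Assumed projection: a multi-hot vector over CATEGORIES.

    Stands in for the undisclosed projection; any function with a fixed
    output width would serve the same purpose.
    """
    present = set(item.interest_categories)
    return [1.0 if c in present else 0.0 for c in CATEGORIES]


def training_context(positive_item: Item) -> List[float]:
    # Training time: pseudo-context is derived from the positive label.
    return project_interest_categories(positive_item)


def serving_context(subject_pin: Item) -> List[float]:
    # Serving time: real context comes from the Pin the user is viewing.
    return project_interest_categories(subject_pin)


conversion = Item("ad_123", ("home_decor", "food"))
subject = Item("pin_456", ("travel",))

pseudo = training_context(conversion)  # [1.0, 0.0, 1.0, 0.0]
real = serving_context(subject)        # [0.0, 0.0, 0.0, 1.0]
```

Because both paths go through the same projection, the context layer receives a vector of identical width in both phases; only the source of the item differs.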

The label-leakage hazard

Pseudo-context derived from the positive label is structurally a form of information leakage during training: the model can shortcut by reading the pseudo-context as a hint about the label. Without mitigation, the model learns to over-weight the (label-leaked) context and ignore the historical-sequence signal — and then at serving time, when context is just a real subject Pin (not a positive-label projection), the model's predictions degrade.

The structural mitigation is regularisation against over-reliance:

"A high dropout rate is used in the context layer during training to ensure the model still relies on the user's historical event sequence (the Transformer output)."

See patterns/high-dropout-on-augmented-feature-layer for the named pattern. The combination — pseudo-context augmentation + high dropout on the consuming layer — is what makes the technique tractable.
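One way to picture the mitigation is dropout applied only to the context vector before it is combined with the Transformer output. The sketch below uses standard inverted dropout in plain Python. The function names, the concatenation-based fusion, and the default rate of 0.7 are assumptions for illustration, since the source discloses neither the fusion mechanism nor the rate.

```python
import random
from typing import List, Optional, Sequence


def context_dropout(context: Sequence[float], rate: float, training: bool,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Inverted dropout on the context vector only.

    During training each element is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate). At serving time it is the identity.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(context)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in context]


def fuse(sequence_output: Sequence[float], context: Sequence[float],
         rate: float = 0.7, training: bool = False,
         rng: Optional[random.Random] = None) -> List[float]:
    """Assumed fusion: concatenate the untouched sequence output with the
    (possibly dropped) context. The real model may combine them differently."""
    return list(sequence_output) + context_dropout(context, rate, training, rng)


rng = random.Random(0)
# Training: most context elements are zeroed, the sequence output never is.
train_input = fuse([0.1, 0.2], [1.0, 1.0, 1.0, 1.0], training=True, rng=rng)
# Serving: the real context passes through unchanged.
serve_input = fuse([0.1, 0.2], [1.0, 1.0, 1.0, 1.0])
```

Leaving the sequence output outside the dropout mask is the point of the design: the model can always fall back on the user's history, while the label-derived context is available only intermittently during training.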

Where pseudo-context augmentation generalises

The technique applies to any setting where the model must learn to consume a request-time-only feature that cannot be logged in training data:

  • Recommendation systems with on-platform context (Pinterest, e-commerce product detail pages, music streaming queue position).
  • Ad ranking with session context when the ad-impression-time session state isn't logged densely enough.
  • Search ranking with query context when training data is built from offline judgements rather than logged sessions.
  • Retrieval models that want to use "current state" features but only have point-in-time historical data.

In each case the same triplet applies: training-time pseudo-feature derived from artefacts already present, regularisation against over-reliance, shared input shape with the real serving-time feature.
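The shared-input-shape leg of the triplet is cheap to assert in a pipeline test. The helper below is a hypothetical sketch: `assert_context_parity` and its builder arguments are illustrative names, not anything described in the source.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
S = TypeVar("S")


def assert_context_parity(
    build_pseudo: Callable[[T], Sequence[float]],
    build_real: Callable[[S], Sequence[float]],
    training_example: T,
    serving_request: S,
) -> int:
    """Return the shared context width, or raise if the two builders disagree."""
    pseudo = build_pseudo(training_example)
    real = build_real(serving_request)
    if len(pseudo) != len(real):
        raise ValueError(
            f"context width mismatch: pseudo={len(pseudo)} real={len(real)}"
        )
    if not all(isinstance(v, float) for v in list(pseudo) + list(real)):
        raise TypeError("context values must be floats in both phases")
    return len(pseudo)
```

Running a check like this against one sampled training example and one sampled serving request catches a drifted projection before it reaches the model.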

Comparison to other training-serving-parity techniques

  • Batch embedding for index consistency: addresses embedding version skew in two-tower models, which is a different problem (which checkpoint produced the index vs the query tower).
  • Online-offline discrepancy: the umbrella issue; Pinterest's L1 CVR post documents discrepancy from feature-pipeline divergence and version skew. Pseudo-context augmentation is one of several mitigations under the same umbrella.
  • Counterfactual / off-policy training: used in contextual-bandit settings to handle the data-collection-policy mismatch. Different problem (policy mismatch, not feature-availability mismatch) but similar shape (synthesise the missing piece from what you have).

Caveats

  • Single named instance on the wiki. Pinterest is the only documented case under this name. Similar techniques likely exist in other large-scale recsys / ads-ML systems but the named primitive is not standard nomenclature.
  • Projection function unspecified. Pinterest doesn't name the projection function used to generate pseudo-context from the positive item's interest categories.
  • Dropout rate not disclosed. "High dropout" — Pinterest doesn't quantify the rate.
  • Label leakage is structural, not eliminated. High dropout reduces but does not eliminate the leakage; the model still learns some shortcut. The empirical question is whether the shortcut is small enough that serving-time performance with real context (not pseudo-context) is still strong. Pinterest's online wins suggest yes.
  • Doesn't help when the serving feature has no analogue in training data. If the real-time feature has no plausible projection from training-time artefacts, pseudo-context isn't an option and a different approach is needed (e.g., training two models, one with context and one without, or using a separate online model).

Seen in
