

Projection layer for latency

Definition

A projection layer for latency is a learned layer inserted between an expensive upstream encoder (typically a wide Transformer) and expensive downstream layers (feature crossing, tower trees, ranking heads) that reduces the representation width — trading some representational capacity for serving latency, while preserving most of the predictive signal.

Unlike a pure linear projection (W · x), projection-for-latency layers typically use architectures that preserve feature-interaction signal during compression: DCNv2 (Deep & Cross Network, v2), learned attention pooling, or gated pooling. A naive linear projection risks discarding the high-order interactions the downstream layers rely on.

Architectural position

  (expensive upstream encoder — Transformer over long user sequences)
                      │   output: wide embedding
            [ projection layer for latency ]
                      │   output: compressed embedding
  (expensive downstream layers — feature crossing, tower trees)
                   ranking head

The projection is a compression bridge. Its purpose is structural: the upstream encoder's output is too wide for the downstream layers' compute budget, and the projection narrows the representation before handing off.
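The bridge position above can be sketched as a shape trace. This is a minimal sketch assuming a mean-pooled Transformer output and a plain linear projection; all dimensions are illustrative and none are disclosed by Pinterest:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, model_dim, narrow_dim = 512, 2048, 256  # illustrative, not Pinterest's

# Upstream: pooled output of a Transformer over a long user sequence.
token_states = rng.standard_normal((seq_len, model_dim))
wide_embedding = token_states.mean(axis=0)  # mean pooling as a stand-in

# Projection layer for latency: a learned narrowing (here a plain linear map).
W_proj = rng.standard_normal((narrow_dim, model_dim)) / np.sqrt(model_dim)
compressed = W_proj @ wide_embedding

# Downstream crossing / tower layers now operate on the narrow width.
assert wide_embedding.shape == (model_dim,)
assert compressed.shape == (narrow_dim,)
```

Everything downstream of the projection sees only the 256-wide vector, which is where the latency savings come from.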

Canonical wiki instance — Pinterest ads engagement model

Pinterest uses DCNv2 as a projection layer between the long-sequence Transformer output and downstream crossing + tower-tree layers (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We simplified the expensive compute paths by using DCNv2 to project the Transformer outputs into a smaller representation before downstream crossing and tower tree layers, which reduced serving latency while preserving signal."

The choice of DCNv2 specifically (rather than linear projection) is telling — DCNv2's cross layers preserve explicit feature-interaction signal during the compression, so the downstream crossing layers still receive feature crosses from the upstream Transformer output, just in a narrower representation.
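A minimal sketch of that idea, assuming a single full-rank DCNv2-style cross layer (x_next = x0 * (W x + b) + x) followed by a linear down-projection; Pinterest's actual layer counts, ranks, and widths are not disclosed:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_layer(x0, x, W, b):
    """DCNv2-style cross layer: x0 * (W @ x + b) + x (elementwise product)."""
    return x0 * (W @ x + b) + x

wide_dim, narrow_dim = 1024, 128  # illustrative dimensions

x0 = rng.standard_normal(wide_dim)  # wide Transformer output

# One cross layer builds explicit second-order interactions in the wide space...
W_cross = rng.standard_normal((wide_dim, wide_dim)) / np.sqrt(wide_dim)
b_cross = np.zeros(wide_dim)
x1 = cross_layer(x0, x0, W_cross, b_cross)

# ...then a learned linear map narrows the representation for downstream layers,
# so the compressed embedding still carries crossed features.
W_proj = rng.standard_normal((narrow_dim, wide_dim)) / np.sqrt(wide_dim)
compressed = W_proj @ x1

assert x1.shape == (wide_dim,)
assert compressed.shape == (narrow_dim,)
```

The design point: the elementwise product against x0 injects explicit feature crosses before the narrowing, which a plain W · x projection would not do.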

Why projection layers matter

  • Cost multiplies through depth. A wide upstream output makes every downstream layer more expensive. Projecting once trades a small fixed cost for ongoing savings at every subsequent layer.
  • Unified models produce wide outputs. When merging feature maps from multiple surface-specific models into a unified trunk, the union-of-features representation is naturally wider than any individual surface's needs. A projection layer compresses back to a workable width.
  • Long-sequence Transformers produce wide pooled outputs. Transformer pooling typically yields high-dim embeddings (thousands of dimensions); downstream crossing layers don't need that width.
Related patterns

  • Bottleneck layers in CNNs (ResNet's 1x1 conv bottlenecks) — the same structural move at a different layer type.
  • PCA as pre-processing — dimensionality reduction before downstream work; a projection layer is the learned, end-to-end-trained version.
  • Attention pooling — a projection variant that uses attention to compress the sequence dimension.
  • Distinct from quantisation — projection narrows the representation; quantisation lowers numeric precision. The two can be stacked.
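The "cost multiplies through depth" point can be made concrete with back-of-envelope multiply counts; the dimensions below are illustrative, not Pinterest's:

```python
# Multiply counts for one downstream dense layer, before and after a
# projection layer. All dimensions are illustrative.
wide, narrow, hidden = 2048, 256, 1024

proj_cost = wide * narrow        # one-time cost of the projection matmul
layer_before = wide * hidden     # downstream layer fed the wide embedding
layer_after = narrow * hidden    # same layer fed the compressed embedding

savings_per_layer = layer_before - layer_after
# Here the projection pays for itself within a single downstream layer,
# and every additional downstream layer compounds the savings.
assert savings_per_layer > proj_cost
```

With these numbers the projection costs ~0.5M multiplies once but saves ~1.8M at every subsequent wide layer, which is the "small fixed cost for ongoing savings" trade.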

Caveats

  • "Preserving signal" is qualitative — Pinterest doesn't ablate projection vs no-projection vs linear-projection vs DCNv2-projection; the claim is directional.
  • Projection dimensionality trade-off not disclosed — input/output dim of Pinterest's DCNv2 projection not specified.
  • Risk: mis-sized projection. Too-aggressive compression loses signal the downstream layers need; too-conservative compression saves little latency. The right width is found empirically.
