Projection layer for latency¶
Definition¶
A projection layer for latency is a learned layer inserted between an expensive upstream encoder (typically a wide Transformer) and expensive downstream layers (feature crossing, tower trees, ranking heads) that reduces the representation width, trading some representational capacity for lower serving latency while preserving most of the predictive signal.
Unlike a pure linear projection (W · x), projection-for-latency layers typically use architectures that preserve feature-interaction signal during compression — DCNv2 (deep cross network), learned attention-pooling, or gated pooling — because naive linear projection risks discarding high-order interactions the downstream layers rely on.
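Pinterest does not publish the exact layer, so the following is a minimal pure-Python sketch of the general shape: one DCNv2-style cross layer (x_next = x0 ⊙ (W·x + b) + x) to mix explicit feature interactions, followed by a learned linear down-projection to the narrower width. The dimensions, the single-cross-layer structure, and the placement of the down-projection are all illustrative assumptions.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def cross_layer(x0, xl, W, b):
    """DCNv2-style cross: x_{l+1} = x0 * (W @ xl + b) + xl (elementwise *)."""
    h = matvec(W, xl)
    return [x0_i * (h_i + b_i) + xl_i
            for x0_i, h_i, b_i, xl_i in zip(x0, h, b, xl)]

def down_project(x, W):
    """Learned linear projection from d_in to the narrower d_out."""
    return matvec(W, x)

random.seed(0)
d_in, d_out = 8, 3  # hypothetical widths; real systems use far larger dims

x = [random.gauss(0, 1) for _ in range(d_in)]            # wide encoder output
W_cross = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_in)]
b = [0.0] * d_in
W_proj = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]

z = cross_layer(x, x, W_cross, b)   # interaction-preserving mix (still d_in wide)
y = down_project(z, W_proj)         # compressed embedding handed downstream
print(len(x), len(y))               # 8 3
```

The key contrast with a pure linear projection is the cross step: multiplicative interactions between input coordinates are formed before the width is reduced, so they survive the compression as explicit features rather than having to be re-derived downstream.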
Architectural position¶
(expensive upstream encoder — Transformer over long user sequences)
│ output: wide embedding
▼
[ projection layer for latency ]
│ output: compressed embedding
▼
(expensive downstream layers — feature crossing, tower trees)
│
▼
ranking head
The projection is a compression bridge. Its purpose is structural: the upstream encoder's output is too wide for the downstream layers' compute budget, and the projection narrows the representation before handing off.
Canonical wiki instance — Pinterest ads engagement model¶
Pinterest uses DCNv2 as a projection layer between the long-sequence Transformer output and downstream crossing + tower-tree layers (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):
"We simplified the expensive compute paths by using DCNv2 to project the Transformer outputs into a smaller representation before downstream crossing and tower tree layers, which reduced serving latency while preserving signal."
The choice of DCNv2 specifically (rather than linear projection) is telling — DCNv2's cross layers preserve explicit feature-interaction signal during the compression, so the downstream crossing layers still receive feature crosses from the upstream Transformer output, just in a narrower representation.
Why projection layers matter¶
- Cost multiplies through depth. A wide upstream output makes every downstream layer more expensive. Projecting once trades a small fixed cost for ongoing savings at every subsequent layer.
- Unified models produce wide outputs. When merging feature maps from multiple surface-specific models into a unified trunk, the union-of-features representation is naturally wider than any individual surface's needs. A projection layer compresses back to a workable width.
- Long-sequence Transformers produce wide pooled outputs. Pooling a Transformer's token outputs typically yields a high-dimensional embedding (often thousands of dimensions); downstream crossing layers don't need that width.
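The "cost multiplies through depth" point is easy to make concrete with multiply-accumulate (MAC) counts. All widths here are made up for illustration (2048-d encoder output, 512-wide downstream layers, 256-d projected width); the shape of the arithmetic is what matters:

```python
def dense_macs(d_in, d_out):
    """Multiply-accumulates for one dense layer."""
    return d_in * d_out

def stack_macs(widths):
    """Total MACs through a stack of dense layers with the given widths."""
    return sum(dense_macs(a, b) for a, b in zip(widths, widths[1:]))

# Hypothetical widths: wide encoder output, projected width,
# downstream hidden width, scalar ranking head.
d_wide, d_proj, hidden = 2048, 256, 512

# Without projection: the wide embedding feeds the downstream stack directly.
no_projection = stack_macs([d_wide, hidden, hidden, 1])

# With projection: pay d_wide * d_proj once, then run a narrow stack.
with_projection = dense_macs(d_wide, d_proj) + stack_macs([d_proj, hidden, hidden, 1])

print(no_projection, with_projection)  # 1311232 918016
```

Even with only two downstream layers the projection pays for itself (about a 30% MAC reduction here); the deeper or wider the downstream stack, the larger the saving, since the projection cost is paid once.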
Related ideas¶
- Bottleneck layers in CNNs (ResNet's 1x1 conv bottlenecks) — same structural move at a different layer type.
- PCA as pre-processing — dimensionality reduction before downstream work; projection layer is the learned version.
- Attention pooling — a specific projection variant using attention to compress sequence dimension.
- Distinct from quantisation — projection narrows the representation; quantisation lowers the precision. Can be stacked.
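As a concrete instance of the attention-pooling variant listed above, here is a minimal pure-Python sketch: a single learned query vector scores each token, a softmax turns the scores into weights, and the weighted sum compresses the sequence dimension away. The single-query form and the dimensions are illustrative assumptions.

```python
import math
import random

def attention_pool(seq, w):
    """Compress a (seq_len x d) token sequence to one d-dim vector.

    Scores each token by its dot product with learned query w,
    softmaxes the scores, and returns the weighted sum of tokens.
    """
    scores = [sum(wi * ti for wi, ti in zip(w, tok)) for tok in seq]
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    d = len(seq[0])
    return [sum(weights[t] * seq[t][j] for t in range(len(seq)))
            for j in range(d)]

random.seed(1)
seq_len, d = 5, 4  # hypothetical: 5 tokens of width 4
seq = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
w = [random.gauss(0, 1) for _ in range(d)]

pooled = attention_pool(seq, w)
print(len(pooled))  # 4: sequence dimension is gone, width is unchanged
```

Note the difference from the width-reducing projection in the definition above: attention pooling compresses the sequence axis (seq_len x d to d), and is often followed by a separate width projection when the pooled vector is still too wide.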
Caveats¶
- "Preserving signal" is qualitative — Pinterest doesn't ablate projection vs no-projection vs linear-projection vs DCNv2-projection; the claim is directional.
- Projection dimensionality trade-off not disclosed — input/output dim of Pinterest's DCNv2 projection not specified.
- Risk: mis-sized projection width. Too-aggressive compression loses signal the downstream layers need; too-conservative compression saves little latency. Calibration is empirical.
Seen in¶
- 2026-03-03 Pinterest — Unifying Ads Engagement Modeling (sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces) — canonical: DCNv2 projection layer between Transformer output and downstream crossing/tower-tree layers in a unified ads CTR model.