

Projection layer for latency

Definition

A projection layer for latency is a learned layer inserted between an expensive upstream encoder (typically a wide Transformer) and expensive downstream layers (feature crossing, tower trees, ranking heads) that reduces the representation width — trading some representational capacity for serving latency, while preserving most of the predictive signal.

Unlike a pure linear projection (W · x), projection-for-latency layers typically use architectures that preserve feature-interaction signal during compression: DCNv2 (Deep & Cross Network, v2), learned attention pooling, or gated pooling. A naive linear projection risks discarding the high-order interactions the downstream layers rely on.

Architectural position

  (expensive upstream encoder — Transformer over long user sequences)
                      │   output: wide embedding
            [ projection layer for latency ]
                      │   output: compressed embedding
  (expensive downstream layers — feature crossing, tower trees)
                   ranking head

The projection is a compression bridge. Its purpose is structural: the upstream encoder's output is too wide for the downstream layers' compute budget, and the projection narrows the representation before handing off.
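The bridge position above can be sketched as a shape trace. This is a minimal sketch assuming a mean-pooled Transformer output and a plain linear projection; all dimensions are illustrative and none are disclosed by Pinterest:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, model_dim, narrow_dim = 512, 2048, 256  # illustrative, not Pinterest's

# Upstream: pooled output of a Transformer over a long user sequence.
token_states = rng.standard_normal((seq_len, model_dim))
wide_embedding = token_states.mean(axis=0)  # mean pooling as a stand-in

# Projection layer for latency: a learned narrowing (here a plain linear map).
W_proj = rng.standard_normal((narrow_dim, model_dim)) / np.sqrt(model_dim)
compressed = W_proj @ wide_embedding

# Downstream crossing / tower layers now operate on the narrow width.
assert wide_embedding.shape == (model_dim,)
assert compressed.shape == (narrow_dim,)
```

Everything downstream of the projection sees only the 256-wide vector, which is where the latency savings come from.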

Canonical wiki instance — Pinterest ads engagement model

Pinterest uses DCNv2 as a projection layer between the long-sequence Transformer output and downstream crossing + tower-tree layers (Source: sources/2026-03-03-pinterest-unifying-ads-engagement-modeling-across-pinterest-surfaces):

"We simplified the expensive compute paths by using DCNv2 to project the Transformer outputs into a smaller representation before downstream crossing and tower tree layers, which reduced serving latency while preserving signal."

The choice of DCNv2 specifically (rather than linear projection) is telling — DCNv2's cross layers preserve explicit feature-interaction signal during the compression, so the downstream crossing layers still receive feature crosses from the upstream Transformer output, just in a narrower representation.
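A minimal sketch of that idea, assuming a single full-rank DCNv2-style cross layer (x_next = x0 * (W x + b) + x) followed by a linear down-projection; Pinterest's actual layer counts, ranks, and widths are not disclosed:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_layer(x0, x, W, b):
    """DCNv2-style cross layer: x0 * (W @ x + b) + x (elementwise product)."""
    return x0 * (W @ x + b) + x

wide_dim, narrow_dim = 1024, 128  # illustrative dimensions

x0 = rng.standard_normal(wide_dim)  # wide Transformer output

# One cross layer builds explicit second-order interactions in the wide space...
W_cross = rng.standard_normal((wide_dim, wide_dim)) / np.sqrt(wide_dim)
b_cross = np.zeros(wide_dim)
x1 = cross_layer(x0, x0, W_cross, b_cross)

# ...then a learned linear map narrows the representation for downstream layers,
# so the compressed embedding still carries crossed features.
W_proj = rng.standard_normal((narrow_dim, wide_dim)) / np.sqrt(wide_dim)
compressed = W_proj @ x1

assert x1.shape == (wide_dim,)
assert compressed.shape == (narrow_dim,)
```

The design point: the elementwise product against x0 injects explicit feature crosses before the narrowing, which a plain W · x projection would not do.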

Why projection layers matter

  • Cost multiplies through depth. A wide upstream output makes every downstream layer more expensive. Projecting once trades a small fixed cost for ongoing savings at every subsequent layer.
  • Unified models produce wide outputs. When merging feature maps from multiple surface-specific models into a unified trunk, the union-of-features representation is naturally wider than any individual surface's needs. A projection layer compresses back to a workable width.
  • Long-sequence Transformers produce wide pooled outputs. Transformer pooling typically yields high-dim embeddings (thousands of dimensions); downstream crossing layers don't need that width.
Related patterns

  • Bottleneck layers in CNNs (ResNet's 1x1 conv bottlenecks) — the same structural move at a different layer type.
  • PCA as pre-processing — dimensionality reduction before downstream work; a projection layer is the learned, end-to-end-trained version.
  • Attention pooling — a projection variant that uses attention to compress the sequence dimension.
  • Distinct from quantisation — projection narrows the representation; quantisation lowers numeric precision. The two can be stacked.
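The "cost multiplies through depth" point can be made concrete with back-of-envelope multiply counts; the dimensions below are illustrative, not Pinterest's:

```python
# Multiply counts for one downstream dense layer, before and after a
# projection layer. All dimensions are illustrative.
wide, narrow, hidden = 2048, 256, 1024

proj_cost = wide * narrow        # one-time cost of the projection matmul
layer_before = wide * hidden     # downstream layer fed the wide embedding
layer_after = narrow * hidden    # same layer fed the compressed embedding

savings_per_layer = layer_before - layer_after
# Here the projection pays for itself within a single downstream layer,
# and every additional downstream layer compounds the savings.
assert savings_per_layer > proj_cost
```

With these numbers the projection costs ~0.5M multiplies once but saves ~1.8M at every subsequent wide layer, which is the "small fixed cost for ongoing savings" trade.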

Caveats

  • "Preserving signal" is qualitative — Pinterest doesn't ablate projection vs no-projection vs linear-projection vs DCNv2-projection; the claim is directional.
  • Projection dimensionality trade-off not disclosed — input/output dim of Pinterest's DCNv2 projection not specified.
  • Risk: mis-sized projection. Too-aggressive compression loses signal the downstream layers need; too-conservative compression saves little latency. The right width is found empirically.
