
Wukong Turbo

Definition

Wukong Turbo is the optimised runtime evolution of Meta Ads' internal Wukong recommendation architecture, serving inside the Meta Adaptive Ranking Model. It layers runtime refinements on Wukong's existing stack of stackable factorisation machines, sequence learning, and cross-layer attention, specifically targeting the numerical instability and network overhead that emerge when deep ranking models are scaled to LLM-scale complexity (Source: sources/2026-03-31-meta-adaptive-ranking-model-bending-the-inference-scaling-curve).

Three refinements over Wukong

No-bias approach (numerical stability)

Wukong Turbo removes unstable bias terms from the scaled architecture, boosting throughput "without increasing FLOPs or parameter counts." This is a stability-targeted refinement: deep ranking models at LLM scale produce numerically unstable terms that inflate computation or destabilise training; dropping the offending terms is the win.
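A minimal sketch of what "removing bias terms" looks like at the layer level, assuming the offending terms are the additive bias vectors of linear layers (the source does not say exactly which terms are dropped). Pure-Python, illustrative only:

```python
# Hypothetical no-bias refinement: a linear layer computes y = W x
# instead of y = W x + b, so the bias parameters (and their numerically
# troublesome contributions) simply disappear from the model.

def linear(W, x, b=None):
    """Apply a linear layer; omit `b` for the no-bias variant."""
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    if b is not None:
        y = [y_i + b_i for y_i, b_i in zip(y, b)]
    return y

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

print(linear(W, x, b=[0.5, 0.5]))  # biased variant:  [3.5, 7.5]
print(linear(W, x))                # no-bias variant: [3.0, 7.0]
```

Note that the no-bias variant has strictly fewer parameters and FLOPs per layer, which is consistent with a throughput gain that comes "without increasing FLOPs or parameter counts."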

Small-parameter delegation (FSDP → DDP)

Standard practice for large models is FSDP (Fully Sharded Data Parallel), which shards parameters across workers to fit giant models — at the cost of all-gather network traffic. Wukong Turbo offloads small parameters from FSDP to DDP (Distributed Data Parallel), which replicates full parameters per worker — eliminating the all-gather for parameters that don't need to be sharded. This is a workload-aware placement decision: FSDP for parameters that exceed worker memory, DDP for parameters small enough to replicate.
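The placement decision can be sketched as a simple size-threshold policy. The actual cut-off Meta uses is undisclosed (see Caveats), so the threshold and module names below are assumptions:

```python
# Hypothetical workload-aware placement policy: parameters large enough to
# strain worker memory are sharded (FSDP-style, paying all-gather traffic);
# small parameters are replicated on every worker (DDP-style, no all-gather).

SHARD_THRESHOLD = 1_000_000  # assumed cut-off in parameter count, not from the source

def placement(modules):
    """Map each module name to 'FSDP' (shard) or 'DDP' (replicate)."""
    return {
        name: "FSDP" if n_params >= SHARD_THRESHOLD else "DDP"
        for name, n_params in modules.items()
    }

modules = {
    "embedding_table": 500_000_000,  # too big to replicate -> shard
    "cross_attention": 2_000_000,    # large dense block    -> shard
    "gate_mlp": 40_000,              # small -> replicate, no all-gather cost
}
print(placement(modules))
```

In a real PyTorch training loop this policy would correspond to choosing which submodules get wrapped in FSDP units versus left replicated, but that wiring is omitted here.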

Sparsity-based simplification (linear-layer pruning)

Redundant components in linear layers are pruned via sparsity-based simplification, reducing computation without changing the architectural surface.
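Since the sparsity structure is not described (see Caveats), here is one plausible instance, magnitude pruning, sketched in pure Python purely for illustration:

```python
# Hypothetical sparsity-based simplification: zero out the smallest-magnitude
# weights of a linear layer, keeping only `keep_ratio` of them. Whether
# Wukong Turbo uses magnitude, structured, or block sparsity is not public.

def prune_by_magnitude(W, keep_ratio=0.5):
    """Return W with all but the top `keep_ratio` fraction of weights zeroed."""
    flat = sorted((abs(w) for row in W for w in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in W]

W = [[0.9, -0.1], [0.05, -0.8]]
print(prune_by_magnitude(W, keep_ratio=0.5))  # [[0.9, 0.0], [0.0, -0.8]]
```

The zeroed entries can then be skipped at inference time, reducing computation while leaving the layer's shape, and hence the architectural surface, unchanged.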

Architectural lineage

Wukong Turbo builds on the Wukong paper (arXiv:2403.02545), Meta Ads' recsys architecture published in 2024, featuring:

  • Stackable factorisation machines for cross-feature interactions.
  • Sequence learning over user behaviour histories.
  • Cross-layer attention for richer signal interaction.
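For intuition on the building block Wukong stacks, here is a minimal second-order factorisation-machine interaction term (the standard FM formulation; this is not Wukong's exact layer, which the paper extends and stacks):

```python
# Classic second-order FM interaction: sum over feature pairs (i, j) of
# <v_i, v_j> * x_i * x_j, where v_i is a learned embedding for feature i.
# Wukong's stackable FM layers generalise this pairwise-interaction idea.

def fm_interaction(V, x):
    """Second-order FM term: sum_{i<j} <v_i, v_j> * x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total

V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # per-feature embedding vectors
x = [1.0, 2.0, 1.0]                       # feature values
print(fm_interaction(V, x))               # 3.0
```

Stacking such layers lets later layers form interactions of interactions, which is what makes the architecture deep enough for the runtime concerns Wukong Turbo addresses.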

Wukong Turbo is not a new ranking architecture; it is the serving-side runtime that lets Wukong scale deeper without breaking the sub-second latency budget.

Relationship to the containing system

Wukong Turbo is one of three pillars of Meta Adaptive Ranking Model's inference-efficient model scaling:

  • Request-oriented computation sharing optimises the computation flow across candidates.
  • Wukong Turbo optimises the model runtime itself for numerical stability + network efficiency.
  • The feature-preprocessing offload layer optimises the pipeline around the model (CPU → GPU, GPU-native kernels).

Caveats

  • No absolute numbers disclosed — qualitative refinement, no throughput delta, FLOPs count, or parameter count.
  • The FSDP→DDP threshold — which parameters qualify as "small" enough to delegate — is not specified.
  • Sparsity structure for the linear-layer simplification is not described (block sparsity? unstructured? magnitude-based?).
