Netflix Ranker

Ranker is "one of the largest and most complex services at Netflix" — it powers "the personalized rows you see on the Netflix homepage" and "runs at an enormous scale" (Source: sources/2026-03-03-netflix-optimizing-recommendation-systems-with-jdks-vector-api).

This page is a stub; Netflix hasn't publicly described Ranker's end-to-end architecture on the TechBlog. What the 2026-03-03 post does document is a single feature inside Ranker — video serendipity scoring — and the optimization journey that took it from 7.5% → ~1% of node CPU.

What Ranker does (from the 2026-03-03 post)

  • Generates the personalized homepage rows for Netflix subscribers.
  • For each candidate title, scores it against the member's viewing history across multiple features; the downstream recommendation logic consumes those feature values.
  • The scored feature covered in the post is video serendipity: "How different is this new title from what you've been watching so far?" — max cosine similarity between the candidate embedding and each history-item embedding, subtracted from 1 to give a "novelty" score.
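The serendipity formula above is simple to state in code. A minimal sketch in plain Java (class and method names are ours, not Netflix's; the post describes the math, not this implementation):

```java
// Hypothetical sketch of video serendipity: 1 - max cosine similarity
// between a candidate embedding and every history-item embedding.
public final class Serendipity {

    // Cosine similarity between two equal-length vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Serendipity = 1 - max_i cos(candidate, history_i): highest when the
    // candidate is unlike everything the member has watched so far.
    static double serendipity(double[] candidate, double[][] history) {
        double maxSim = -1.0;
        for (double[] h : history) {
            maxSim = Math.max(maxSim, cosine(candidate, h));
        }
        return 1.0 - maxSim;
    }
}
```

A candidate identical to something already watched scores 0; one orthogonal to the entire history scores 1.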

Request shape

The post reports an asymmetric request distribution that shaped Ranker's optimization priorities:

  • ~98% of requests are single-video (one candidate scored at a time).
  • ~2% of requests are large batches (many candidates scored together).
  • By total video volume processed, the split is roughly 50:50 between single and batch — the 2% of batch requests each carry many candidates, so they account for half the fleet compute cost.

This makes batch-path optimization worthwhile even though it barely moves the p50: single-video requests dominate the request count, but the batch path accounts for roughly half the fleet's compute cost.
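The 98:2 request split and the 50:50 volume split together pin down the average batch size. A back-of-envelope check (the implied figure is our inference, not a number Netflix reports):

```java
// Back-of-envelope on the reported request shape. The 0.98 / 0.02 split
// comes from the post; the implied average batch size is our inference.
public final class RequestShape {

    // If pSingle of requests carry 1 video and pBatch carry k videos each,
    // video volume splits 50:50 exactly when pSingle * 1 == pBatch * k.
    static double impliedBatchSize(double pSingle, double pBatch) {
        return pSingle / pBatch;
    }

    public static void main(String[] args) {
        double k = impliedBatchSize(0.98, 0.02);            // ~49 candidates per batch
        double batchVolume = 0.02 * k / (0.98 + 0.02 * k);  // ~0.5 of all videos
        System.out.println(k + " " + batchVolume);
    }
}
```

So under the post's numbers, the average batch request would need to carry on the order of 49 candidates for the volume split to come out even.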

The 2026-03-03 optimization

Ranker's serendipity-scoring operator was consuming 7.5% of total CPU on every Ranker node. Netflix drove it down to ~1% via a four-step sequence:

  1. Batched matrix multiply. Reshape the M×N nested-loop cosine-similarity computation into C = A × Bᵀ (patterns/batched-matmul-for-pairwise-similarity).
  2. Fix memory layout and allocation. Flat double[] row-major buffers + per-thread ThreadLocal<BufferHolder> with grow-but-never-shrink buffers (patterns/flat-buffer-threadlocal-reuse).
  3. Kernel swap: BLAS. Tried netlib-java backed by native BLAS; lost in production to JNI overhead + row/column-major translation costs.
  4. Kernel swap: JDK Vector API. Pure-Java SIMD via systems/jdk-vector-api with fused multiply-add accumulation; runtime-dispatched with scalar fallback (patterns/runtime-capability-dispatch-pure-java-simd).
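Steps 1 and 2 can be sketched together: all pairwise dot products computed as one C = A × Bᵀ pass over flat row-major buffers, with per-thread scratch space that grows but never shrinks. Shapes and names here are our assumptions, not Netflix's code:

```java
// Sketch of steps 1-2: A holds M candidate embeddings and B holds N history
// embeddings, both as flat row-major double[] of dimension D per row.
// C = A * B^T yields all M x N dot products without nested per-pair calls.
public final class BatchedSimilarity {

    // Per-thread scratch buffer: grow-but-never-shrink, so steady-state
    // requests allocate nothing.
    private static final ThreadLocal<double[]> SCRATCH =
            ThreadLocal.withInitial(() -> new double[0]);

    static double[] scratch(int needed) {
        double[] buf = SCRATCH.get();
        if (buf.length < needed) {   // only ever grows
            buf = new double[needed];
            SCRATCH.set(buf);
        }
        return buf;
    }

    // C[i*n + j] = dot(row i of A, row j of B); everything flat row-major,
    // so the inner loop walks both operands sequentially in memory.
    static double[] matmulTransposed(double[] a, double[] b, int m, int n, int d) {
        double[] c = scratch(m * n);
        for (int i = 0; i < m; i++) {
            int ai = i * d;
            for (int j = 0; j < n; j++) {
                int bj = j * d;
                double dot = 0;
                for (int k = 0; k < d; k++) {
                    dot += a[ai + k] * b[bj + k];
                }
                c[i * n + j] = dot;
            }
        }
        return c;
    }
}
```

The row-major B buffer means no transpose is ever materialized; Bᵀ exists only in the index arithmetic.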

Canary + full-rollout results: ~7% drop in CPU utilization, ~12% drop in average latency, ~10% improvement in CPU/RPS.
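The runtime-dispatch pattern from the final kernel swap can be sketched as follows. The detection mechanism shown (reflective probe) and all names are illustrative; the real SIMD path uses jdk.incubator.vector.DoubleVector with fused multiply-add accumulation, which this sketch only falls back from:

```java
// Sketch of runtime-capability dispatch with scalar fallback: probe for the
// (incubating) Vector API once at startup, and run a scalar kernel when it
// is absent. Probe mechanism and names are our illustration.
public final class DotKernel {

    // True only when jdk.incubator.vector is on the module path.
    private static final boolean VECTOR_API_PRESENT = detectVectorApi();

    private static boolean detectVectorApi() {
        try {
            Class.forName("jdk.incubator.vector.DoubleVector");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    static double dot(double[] a, double[] b) {
        // A real implementation would branch to a DoubleVector-based SIMD
        // kernel when VECTOR_API_PRESENT is true; this sketch always takes
        // the scalar path.
        return dotScalar(a, b);
    }

    // Scalar fallback; Math.fma mirrors the SIMD path's fused multiply-add
    // accumulation, so both paths round identically.
    static double dotScalar(double[] a, double[] b) {
        double acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc = Math.fma(a[i], b[i], acc);
        }
        return acc;
    }
}
```

Dispatching once at class-initialization time keeps the capability check off the hot path.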

Stub caveats

Not covered on the wiki:

  • Ranker's full request pipeline (retrieval / candidate gen / feature compute / ranking model / serving path).
  • The ranking model itself (architecture, training, experiment infrastructure).
  • How Ranker composes with the other personalization services at Netflix (retrieval, A/B, content signals).
  • Per-region deployment, instance counts, fleet CPU absolute numbers.
  • How Ranker's features flow from Amber or other feature stores.
