CONCEPT

Query vs document embedding (two distinct serving problems)

Definition

In retrieval, search, and recommendation systems that use vector embeddings, the embedding-inference workload splits into two sub-populations with fundamentally different serving shapes:

  • Query embeddings — short, latency-sensitive, spiky, online-per-request. Typical length: a few to a few hundred tokens; typical latency budget: 100–300 ms.
  • Document (or corpus) embeddings — long, batch-ingested, latency-tolerant, predominantly offline. Typical length: up to thousands of tokens per document; typical latency budget: minutes to hours per batch.

The two have different compute regimes, different scheduler goals, and different optimal batching disciplines — so the serving stack should recognise them as two workloads, not one (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
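A minimal sketch of recognising the two workloads at the edge of the serving stack: classify each incoming request by token count and route it to the appropriate class. The 256-token cutoff and all names here are illustrative assumptions, not values from the source.

```python
# Sketch: routing embedding requests into the two sub-populations.
# The 256-token threshold is an assumed cutoff, not from the source.
from dataclasses import dataclass

QUERY_MAX_TOKENS = 256  # assumed boundary between query- and document-shaped requests

@dataclass
class EmbedRequest:
    text: str
    n_tokens: int

def request_class(req: EmbedRequest) -> str:
    """Return 'query' for short latency-sensitive requests,
    'document' for long batch-tolerant ones."""
    return "query" if req.n_tokens <= QUERY_MAX_TOKENS else "document"

print(request_class(EmbedRequest("what is vector search?", 6)))    # query
print(request_class(EmbedRequest("<full product page>", 1800)))    # document
```

In practice the class could also be carried explicitly by the caller (separate endpoints) rather than inferred from length; length-based inference is only a fallback.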

How they differ on each serving axis

Axis                 Queries                                      Documents
Token length         Short (≤ a few hundred)                      Long (hundreds to thousands)
Length distribution  Highly skewed                                Approximately uniform within a corpus
Compute regime       Memory-bound (far below saturation point)    Compute-bound (near or above saturation point)
Traffic pattern      Spiky                                        Scheduled / backfill-driven
Latency budget       100–300 ms (hard)                            Minutes–hours (soft)
Optimal batching     Token-count batching to saturation point     Large fixed batches, already saturation-adjacent
Error semantics      Retry-tolerant, user-visible 503             Job-retry with checkpointing
Freshness            Per-request                                  Periodic reindex
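The query-side batching discipline in the table can be sketched as follows: batches are capped by total token count (an assumed saturation point of the GPU) rather than by a fixed request count, which is what makes skewed length distributions pack efficiently. The 8192-token cap is an illustrative assumption.

```python
# Sketch of token-count batching: pack variable-length query requests
# into batches capped by total token count rather than request count.
# max_batch_tokens is an assumed saturation point, not a source value.
def token_count_batches(request_lengths, max_batch_tokens=8192):
    batches, current, current_tokens = [], [], 0
    for n_tokens in request_lengths:
        # Close the current batch once adding this request would exceed the cap.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n_tokens)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# A skewed query-length mix: many short requests, a few long outliers.
lengths = [12, 20, 8, 4000, 16, 3000, 25, 2500]
for b in token_count_batches(lengths):
    print(b, sum(b))
```

Note that a request-count cap would either under-fill batches of short requests or overflow the saturation point when a few long requests land in the same batch; the token cap sidesteps both.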

Why the distinction matters for the scheduler

A single embedding-serving stack that treats all requests identically runs the wrong optimisation for at least one sub-population:

  • If configured for queries: document-embedding jobs run in latency-optimised small batches, forfeiting the throughput of large batches and paying queueing overhead for latency guarantees they don't need.
  • If configured for documents: query requests wait in document-oriented batches (long accumulation windows, large request counts), blowing the 100–300 ms latency budget.
  • If not separated: the scheduler oscillates; document requests push the batching regime toward larger batches and longer windows, and then a query spike arrives at an already-overloaded stage.
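One way to avoid the oscillation described above is request-class-aware scheduling inside a single deployment: two queues, each drained with its own batching discipline, with queries pre-empting documents. Everything here (queue structure, cap, batch size) is an illustrative assumption under the source's framing, not a described implementation.

```python
# Sketch: two scheduler queues with different batching disciplines.
# Parameter values are assumed for illustration.
import collections

query_q = collections.deque()   # spiky, latency-sensitive requests (token lengths)
doc_q = collections.deque()     # offline ingestion requests (token lengths)

QUERY_MAX_TOKENS_PER_BATCH = 8192  # assumed saturation point
DOC_BATCH_SIZE = 64                # assumed fixed offline batch size

def next_batch():
    """Queries pre-empt documents: form a token-capped query batch whenever
    queries are waiting; otherwise form a large fixed document batch."""
    if query_q:
        batch, tokens = [], 0
        # Always take at least one request so an oversized head can't stall the queue.
        while query_q and (not batch or tokens + query_q[0] <= QUERY_MAX_TOKENS_PER_BATCH):
            tokens += query_q[0]
            batch.append(query_q.popleft())
        return ("query", batch)
    if doc_q:
        n = min(DOC_BATCH_SIZE, len(doc_q))
        return ("document", [doc_q.popleft() for _ in range(n)])
    return None
```

Because document batches are only formed when the query queue is empty, a query spike never waits behind a long document batch window; the trade-off is that sustained query traffic can starve ingestion, which a real scheduler would bound with an aging or fair-share rule.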

Voyage AI names this directly:

"At Voyage AI by MongoDB, we call these short requests queries, and other requests are called documents. Queries typically must be served with very low latency (typically 100–300 ms)."

The 2025-12-18 post is scoped explicitly to the query side and is silent on document-ingestion serving — precisely because it is a different problem.

Serving-stack implications

  • Two model-server deployments (or two request classes with separate scheduler queues) — one tuned for queries (token-count batching to the saturation point, padding removal, aggressive autoscaling floors), one for documents (large fixed batches, throughput-optimised GPUs, cost-optimised scaling onto cheaper instance types).
  • Two SLOs — query p99 in milliseconds vs document job completion-time SLO in minutes.
  • Different queue substrates — query substrate needs peek + atomic claim for token-count batching; document substrate can use standard offline-job queues (SQS, Kafka, batch frameworks).
  • Different model choices possible — queries are sometimes served by smaller distilled variants, while documents are embedded with the full model for index quality. The embedding-ingestion modes in Expedia's platform framing (batch, insert-API, on-the-fly) compose naturally with this split.
  • Different padding / batching primitives — query-side padding removal is critical for skewed distributions; the document side can afford classical padded batching if corpus lengths cluster tightly.
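The padding point in the last bullet can be quantified with a back-of-the-envelope comparison: a padded batch computes batch_size × max_len token positions, while a packed (padding-free) batch computes only the sum of the actual lengths. The example length mixes below are assumed for illustration.

```python
# Sketch: padding waste for skewed (query-like) vs tightly clustered
# (document-like) length distributions. Lengths are illustrative.
def padded_tokens(lengths):
    """Token positions computed when every sequence is padded to the max."""
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    """Token positions computed with padding removal (packed batch)."""
    return sum(lengths)

skewed = [8, 12, 10, 500]       # one long outlier, typical of query traffic
uniform = [480, 500, 490, 510]  # corpus lengths clustering tightly

for name, ls in [("skewed", skewed), ("uniform", uniform)]:
    waste = 1 - packed_tokens(ls) / padded_tokens(ls)
    print(f"{name}: padding waste {waste:.0%}")
```

With the skewed mix, most of the padded batch is wasted on pad tokens; with the clustered mix the waste is marginal, which is why classical padded batching remains acceptable on the document side.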

Why the distinction is often missed in early serving stacks

Teams typically stand up a single embedding service using an off-the-shelf inference engine (e.g. Hugging Face Inference) sized for the dominant workload — usually documents during the initial corpus build. When query traffic arrives, the service is sized and tuned for long sequences, and it absorbs the query workload with poor MFU, high latency, and a spike-sensitive tail. The retrofit is either separate deployments (Voyage AI's implicit route) or request-class-aware scheduling inside one deployment.
