CONCEPT

Query vs document embedding (two distinct serving problems)

Definition

In retrieval, search, and recommendation systems that use vector embeddings, the embedding-inference workload splits into two sub-populations with fundamentally different serving shapes:

  • Query embeddings — short, latency-sensitive, spiky, online-per-request. Typical length: a few to a few hundred tokens; typical latency budget: 100–300 ms.
  • Document (or corpus) embeddings — long, batch-ingested, latency-tolerant, predominantly offline. Typical length: up to thousands of tokens per document; typical latency budget: minutes to hours per batch.

The two have different compute regimes, different scheduler goals, and different optimal batching disciplines — so the serving stack should recognise them as two workloads, not one (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
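A minimal sketch of recognising the two workloads at the edge of the serving stack: classify each incoming request by token count and route it to the appropriate class. The 256-token cutoff and all names here are illustrative assumptions, not values from the source.

```python
# Sketch: routing embedding requests into the two sub-populations.
# The 256-token threshold is an assumed cutoff, not from the source.
from dataclasses import dataclass

QUERY_MAX_TOKENS = 256  # assumed boundary between query- and document-shaped requests

@dataclass
class EmbedRequest:
    text: str
    n_tokens: int

def request_class(req: EmbedRequest) -> str:
    """Return 'query' for short latency-sensitive requests,
    'document' for long batch-tolerant ones."""
    return "query" if req.n_tokens <= QUERY_MAX_TOKENS else "document"

print(request_class(EmbedRequest("what is vector search?", 6)))    # query
print(request_class(EmbedRequest("<full product page>", 1800)))    # document
```

In practice the class could also be carried explicitly by the caller (separate endpoints) rather than inferred from length; length-based inference is only a fallback.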

How they differ on each serving axis

Axis                 Queries                                      Documents
Token length         Short (≤ a few hundred)                      Long (hundreds to thousands)
Length distribution  Highly skewed                                Approximately uniform within a corpus
Compute regime       Memory-bound (far below saturation point)    Compute-bound (near or above saturation point)
Traffic pattern      Spiky                                        Scheduled / backfill-driven
Latency budget       100–300 ms (hard)                            Minutes–hours (soft)
Optimal batching     Token-count batching to saturation point     Large fixed batches, already saturation-adjacent
Error semantics      Retry-tolerant, user-visible 503             Job-retry with checkpointing
Freshness            Per-request                                  Periodic reindex
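The query-side batching discipline in the table can be sketched as follows: batches are capped by total token count (an assumed saturation point of the GPU) rather than by a fixed request count, which is what makes skewed length distributions pack efficiently. The 8192-token cap is an illustrative assumption.

```python
# Sketch of token-count batching: pack variable-length query requests
# into batches capped by total token count rather than request count.
# max_batch_tokens is an assumed saturation point, not a source value.
def token_count_batches(request_lengths, max_batch_tokens=8192):
    batches, current, current_tokens = [], [], 0
    for n_tokens in request_lengths:
        # Close the current batch once adding this request would exceed the cap.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n_tokens)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# A skewed query-length mix: many short requests, a few long outliers.
lengths = [12, 20, 8, 4000, 16, 3000, 25, 2500]
for b in token_count_batches(lengths):
    print(b, sum(b))
```

Note that a request-count cap would either under-fill batches of short requests or overflow the saturation point when a few long requests land in the same batch; the token cap sidesteps both.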

Why the distinction matters for the scheduler

A single embedding-serving stack that treats all requests identically runs the wrong optimisation for at least one sub-population:

  • If configured for queries: document-embedding jobs run in latency-optimised small batches, forfeiting the throughput of large batches and paying queueing overhead for latency guarantees they don't need.
  • If configured for documents: query requests wait in document-oriented batches (long accumulation windows, large request counts), blowing the 100–300 ms latency budget.
  • If not separated: the scheduler oscillates; document requests push the batching regime toward larger batches and longer windows, and then a query spike arrives at an already-overloaded stage.
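One way to avoid the oscillation described above is request-class-aware scheduling inside a single deployment: two queues, each drained with its own batching discipline, with queries pre-empting documents. Everything here (queue structure, cap, batch size) is an illustrative assumption under the source's framing, not a described implementation.

```python
# Sketch: two scheduler queues with different batching disciplines.
# Parameter values are assumed for illustration.
import collections

query_q = collections.deque()   # spiky, latency-sensitive requests (token lengths)
doc_q = collections.deque()     # offline ingestion requests (token lengths)

QUERY_MAX_TOKENS_PER_BATCH = 8192  # assumed saturation point
DOC_BATCH_SIZE = 64                # assumed fixed offline batch size

def next_batch():
    """Queries pre-empt documents: form a token-capped query batch whenever
    queries are waiting; otherwise form a large fixed document batch."""
    if query_q:
        batch, tokens = [], 0
        # Always take at least one request so an oversized head can't stall the queue.
        while query_q and (not batch or tokens + query_q[0] <= QUERY_MAX_TOKENS_PER_BATCH):
            tokens += query_q[0]
            batch.append(query_q.popleft())
        return ("query", batch)
    if doc_q:
        n = min(DOC_BATCH_SIZE, len(doc_q))
        return ("document", [doc_q.popleft() for _ in range(n)])
    return None
```

Because document batches are only formed when the query queue is empty, a query spike never waits behind a long document batch window; the trade-off is that sustained query traffic can starve ingestion, which a real scheduler would bound with an aging or fair-share rule.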

Voyage AI names this directly:

"At Voyage AI by MongoDB, we call these short requests queries, and other requests are called documents. Queries typically must be served with very low latency (typically 100–300 ms)."

The 2025-12-18 post is scoped explicitly to the query side and is silent on document-ingestion serving — precisely because it is a different problem.

Serving-stack implications

  • Two model-server deployments (or two request classes with separate scheduler queues) — one tuned for queries (token-count batching to the saturation point, padding removal, aggressive autoscaling floors), one for documents (large fixed batches, throughput-optimised GPUs, cost-optimised scaling onto cheaper instance types).
  • Two SLOs — query p99 in milliseconds vs document job completion-time SLO in minutes.
  • Different queue substrates — query substrate needs peek + atomic claim for token-count batching; document substrate can use standard offline-job queues (SQS, Kafka, batch frameworks).
  • Different model choices possible — queries are sometimes served by smaller distilled variants, while documents are embedded with the full model for index quality. The embedding-ingestion modes in Expedia's platform framing (batch, insert-API, on-the-fly) compose naturally with this split.
  • Different padding / batching primitives — query-side padding removal is critical for skewed distributions; the document side can afford classical padded batching if corpus lengths cluster tightly.
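The padding point in the last bullet can be quantified with a back-of-the-envelope comparison: a padded batch computes batch_size × max_len token positions, while a packed (padding-free) batch computes only the sum of the actual lengths. The example length mixes below are assumed for illustration.

```python
# Sketch: padding waste for skewed (query-like) vs tightly clustered
# (document-like) length distributions. Lengths are illustrative.
def padded_tokens(lengths):
    """Token positions computed when every sequence is padded to the max."""
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    """Token positions computed with padding removal (packed batch)."""
    return sum(lengths)

skewed = [8, 12, 10, 500]       # one long outlier, typical of query traffic
uniform = [480, 500, 490, 510]  # corpus lengths clustering tightly

for name, ls in [("skewed", skewed), ("uniform", uniform)]:
    waste = 1 - packed_tokens(ls) / padded_tokens(ls)
    print(f"{name}: padding waste {waste:.0%}")
```

With the skewed mix, most of the padded batch is wasted on pad tokens; with the clustered mix the waste is marginal, which is why classical padded batching remains acceptable on the document side.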

Why the distinction is often missed in early serving stacks

Teams typically stand up a single embedding service using an off-the-shelf inference engine (e.g. Hugging Face Inference) sized for the dominant workload — usually documents during the initial corpus build. When query traffic arrives, the service is sized and tuned for long sequences, and it absorbs the query workload with poor MFU, high latency, and a spike-sensitive tail. The retrofit is either separate deployments (Voyage AI's implicit route) or request-class-aware scheduling inside one deployment.
