## Query vs document embedding (two distinct serving problems)

### Definition
In retrieval / search / recommendation systems that use vector embeddings, the embedding-inference workload splits into two sub-populations with fundamentally different serving shapes:
- Query embeddings — short, latency-sensitive, spiky, online-per-request. Typical length: a few to a few-hundred tokens; typical latency budget: 100–300 ms.
- Document (or corpus) embeddings — long, batch-ingested, latency-tolerant, predominantly offline. Typical length: up to thousands of tokens per document; typical latency budget: minutes to hours per batch.
The two have different compute regimes, different scheduler goals, and different optimal batching disciplines — so the serving stack should recognise them as two workloads, not one (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
### How they differ on each serving axis
| Axis | Queries | Documents |
|---|---|---|
| Token length | Short (≤ few-hundred) | Long (hundreds to thousands) |
| Length distribution | Highly skewed | Approximately uniform within a corpus |
| Compute regime | Memory-bound (far below saturation point) | Compute-bound (near or above saturation point) |
| Traffic pattern | Spiky | Scheduled / backfill-driven |
| Latency budget | 100–300 ms (hard) | Minutes–hours (soft) |
| Optimal batching | Token-count batching to saturation point | Large fixed batches, already saturation-adjacent |
| Error semantics | Retry-tolerant, user-visible 503 | Job-retry with checkpointing |
| Freshness | Per-request | Periodic reindex |
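The split in the table can be enforced at the front door by classifying each request by token count and routing it to the appropriate serving path. A minimal sketch; the 512-token cut-off and the names `EmbedRequest` / `route` are illustrative assumptions, not values from the source:

```python
from dataclasses import dataclass

# Illustrative cut-off (an assumption, not from the source): requests at
# or below this token count take the latency-tuned query path.
QUERY_MAX_TOKENS = 512

@dataclass
class EmbedRequest:
    text: str
    n_tokens: int  # assumed pre-computed by the tokenizer

def route(req: EmbedRequest) -> str:
    """Pick the serving path: 'query' (latency-tuned, token-count
    batching) or 'document' (throughput-tuned, large fixed batches)."""
    return "query" if req.n_tokens <= QUERY_MAX_TOKENS else "document"
```

In practice the classification would happen at the API gateway or scheduler front-end, before requests ever share a queue.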
### Why the distinction matters for the scheduler
A single embedding-serving stack that treats all requests identically runs the wrong optimisation for at least one sub-population:
- If configured for queries: document embedding jobs run at latency-optimised (small) batch sizes, occupying low-latency capacity they don't need and paying unnecessary queueing cost.
- If configured for documents: query requests wait in document-oriented batches (long time windows, large request counts), blowing the 100–300 ms latency budget.
- If not separated: the scheduler oscillates: document requests push the batching regime toward larger batches and longer windows, then a query spike arrives at an overloaded stage.
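The query-side discipline the source advocates, token-count batching, can be sketched as a greedy packer that accumulates requests until a token budget near the GPU saturation point is reached or a flush deadline expires. The budget and deadline values below are illustrative assumptions:

```python
import time
from collections import deque

# Illustrative parameters (assumptions, not from the source):
TOKEN_BUDGET = 8192   # tokens per batch, roughly the saturation point
MAX_WAIT_S = 0.01     # flush deadline so queries keep their latency budget

def form_batch(queue: deque,
               token_budget: int = TOKEN_BUDGET,
               max_wait_s: float = MAX_WAIT_S) -> list:
    """Greedily pack queued requests until adding the next one would
    overshoot the token budget, or the deadline passes. Each queue item
    is a (request_id, n_tokens) pair."""
    batch, tokens = [], 0
    deadline = time.monotonic() + max_wait_s
    while queue and time.monotonic() < deadline:
        rid, n = queue[0]
        if batch and tokens + n > token_budget:
            break  # next request would push the batch past saturation
        queue.popleft()
        batch.append(rid)
        tokens += n
    return batch
```

A production scheduler would additionally block waiting for new arrivals when the queue is empty; the sketch only shows the packing rule.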
Voyage AI names this directly:
"At Voyage AI by MongoDB, we call these short requests queries, and other requests are called documents. Queries typically must be served with very low latency (typically 100–300 ms)."
The 2025-12-18 post is scoped explicitly to the query side and is silent on document-ingestion serving — precisely because it is a different problem.
### Serving-stack implications
- Two model-server deployments (or two request classes with separate scheduler queues) — one tuned for queries (token-count batching to the saturation point, padding removal, aggressive autoscaling floors), one for documents (large fixed batches, throughput-optimised GPUs, cost-optimised scaling to low-cost instance types).
- Two SLOs — query p99 in milliseconds vs document job completion-time SLO in minutes.
- Different queue substrates — query substrate needs peek + atomic claim for token-count batching; document substrate can use standard offline-job queues (SQS, Kafka, batch frameworks).
- Different model choices possible — queries sometimes served by smaller distilled variants; documents embedded with the full model for index quality. The embedding-ingestion modes (batch / insert-API / on-the-fly) in Expedia's platform framing compose naturally with this split.
- Different padding / batching primitives — query-side padding-removal is critical for skewed distributions; document-side can afford classical padded batching if corpus lengths cluster tightly.
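The padding point in the last bullet can be made concrete: padding a batch to its longest sequence wastes compute in proportion to length skew. The lengths below are made-up illustrations of a skewed query batch versus a tightly clustered document batch:

```python
def padding_fraction(lengths: list[int]) -> float:
    """Fraction of computed tokens that are padding when every sequence
    in the batch is padded to the longest one."""
    max_len = max(lengths)
    total = max_len * len(lengths)   # tokens actually computed
    real = sum(lengths)              # tokens carrying content
    return 1 - real / total

# Skewed query batch: one long outlier forces heavy padding.
queries = [8, 12, 10, 9, 200]
# Tightly clustered document batch: little padding even without removal.
docs = [1900, 2000, 1950, 1980]

print(round(padding_fraction(queries), 2))  # 0.76 -> most compute is padding
print(round(padding_fraction(docs), 2))     # 0.02 -> negligible overhead
```

This is why padding removal (or length-sorted bucketing) pays off on the query side but matters far less for document ingestion.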
### Why the distinction is often missed in early serving stacks
Teams typically stand up a single embedding service using an off-the-shelf inference engine (e.g. Hugging Face Inference) sized for the dominant workload, usually documents during the initial corpus build. When query traffic arrives, the service is sized and tuned for long sequences, so it absorbs the query workload with poor MFU, high latency, and a spike-sensitive tail. The retrofit is either separate deployments (Voyage AI's implicit route) or request-class-aware scheduling inside one deployment.
### Seen in
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki instance; the entire post is framed around the query side of embedding serving; document side explicitly scoped out in the opening definition (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).