
PATTERN

Embedding Ingestion Modes (batch + Insert API + on-the-fly)

Definition

Embedding ingestion modes is the pattern in which a centralized embedding platform exposes three complementary ingestion paths — matched to different producer shapes, volumes, and freshness needs — rather than forcing every embedding producer through a single pipeline:

  1. Batch materialization — large-volume, feature-engineered embeddings loaded through a Spark-backed materialization job pulling from one or more offline sources.
  2. Insert API — synchronous real-time writes for small batches or streaming producers that already have their embeddings in hand.
  3. On-the-fly embedding generation — the platform itself calls named embedding models to produce vectors dynamically for producers who don't own an inference path.
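As an illustrative sketch only — the class and method names below (`EmbeddingStore`, `insert`, `ingest_batch`, `embed_and_insert`) are hypothetical, not Expedia's actual API — the three lanes can be modeled as one ingestion facade:

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingStore:
    # collection -> list of (id, vector) pairs; stands in for the storage tier
    collections: dict = field(default_factory=dict)

    def insert(self, collection: str, items: list[tuple[str, list[float]]]):
        """Lane 2: synchronous Insert API -- the caller already has vectors."""
        self.collections.setdefault(collection, []).extend(items)

    def ingest_batch(self, collection: str, offline_rows):
        """Lane 1: batch materialization -- in reality a Spark-backed job over
        offline sources; here just a bulk load for illustration."""
        self.insert(collection, list(offline_rows))

    def embed_and_insert(self, collection: str,
                         texts: list[tuple[str, str]], embed_fn):
        """Lane 3: on-the-fly -- the platform calls the embedding model
        (embed_fn) on behalf of a producer with no inference path."""
        self.insert(collection, [(doc_id, embed_fn(text))
                                 for doc_id, text in texts])

store = EmbeddingStore()
store.insert("hotels", [("h1", [0.1, 0.2])])                 # lane 2
store.ingest_batch("hotels", [("h2", [0.3, 0.4])])           # lane 1
store.embed_and_insert("hotels", [("h3", "ocean view suite")],
                       embed_fn=lambda t: [float(len(t))])   # lane 3
```

All three calls funnel into the same `insert`, mirroring the point that the lanes differ only at ingress, not in the storage tier beneath them.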

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

The three lanes

| Lane | Producer shape | Volume | Freshness | Typical substrate |
|---|---|---|---|---|
| Batch materialization | Offline feature-engineering pipeline | High | Hours–minutes | systems/feast materialization + Spark |
| Insert API | Application writing embeddings it already has | Low–medium | Seconds–minutes | Synchronous REST/RPC |
| On-the-fly generation | Caller who wants the platform to embed for them | Any | Sub-second for the embed step | Platform → embedding model |

Expedia Embedding Store realization

The Expedia Embedding Store Service names exactly this triad:

  1. Batch Ingestion: "For large volumes of embeddings generated through feature engineering processes, the Embedding Store Service provides a batch ingestion mechanism utilizing Feast materialization. This uses a Spark-based process to efficiently load data from one or more offline sources."
  2. Insert API for Small Batches or Real-Time Data: "When working with smaller batches of embeddings or handling real-time embedding generation, users can use the standard Insert API to load data directly into the service."
  3. On-the-Fly Embedding Generation: "For scenarios where embedding generation needs to be offloaded, the Embedding Store Service can generate embeddings dynamically by calling specific models to generate embeddings on the fly."

All three modes terminate in the same dual-write discipline — every ingested embedding lands simultaneously in both online (vector DB) and offline (historical repository) stores, regardless of which lane produced it. This keeps the storage tier uniform behind heterogeneous producers.
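A minimal sketch of that dual-write funnel — the store names and shapes below are assumptions for illustration, not Expedia's internals:

```python
# Hypothetical dual-write: every lane terminates in one write path that
# lands each embedding in both the online and offline stores.

online_store = {}    # stands in for the vector DB (serves latest value)
offline_store = []   # stands in for the historical repository (append-only)

def dual_write(collection: str, doc_id: str, vector: list[float]):
    online_store[(collection, doc_id)] = vector          # online: overwrite
    offline_store.append((collection, doc_id, vector))   # offline: keep history

# All three lanes call the same funnel, so storage stays uniform:
dual_write("hotels", "h1", [0.1, 0.2])
dual_write("hotels", "h1", [0.3, 0.4])  # re-ingest: online holds the latest,
                                        # offline retains both versions
```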

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

The on-the-fly lane is the distinctive one

Batch and Insert API are the usual ingestion primitives any data platform exposes. The on-the-fly lane is the one that's specific to embedding ingestion:

  • It collapses "embedding-generation service" + "embedding-store service" into a single caller-facing API.
  • It serves the small-producer / experimentation case — a team with a few thousand docs and no inference pipeline can just send text + collection name and get vectors stored.
  • It centralizes model access control + version pinning — the collection metadata tells the platform which model version to call, so a producer can never accidentally mix models within a collection.
  • It moves embedding-inference cost onto the platform's billed infrastructure rather than the caller's, which is either a feature or a footgun depending on funding model.
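The version-pinning point can be sketched as follows — the metadata shape, model registry, and function names here are assumptions, not the source's design:

```python
# Hypothetical: collection metadata pins the embedding model + version, so
# on-the-fly callers can never mix models within a collection.

collection_meta = {"reviews": {"model": "text-embed", "version": "v2"}}

model_registry = {
    ("text-embed", "v1"): lambda text: [1.0, 1.0, 1.0, 1.0],
    ("text-embed", "v2"): lambda text: [2.0, 2.0, 2.0, 2.0],
}

def embed_on_the_fly(collection: str, text: str) -> list[float]:
    meta = collection_meta[collection]   # platform, not caller, picks the model
    model = model_registry[(meta["model"], meta["version"])]
    return model(text)
```

The caller sends only text plus a collection name; which model version runs is resolved entirely from the collection's pinned metadata.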

The cost: the platform now has an inference SLO too, not just a storage / retrieval SLO.

Why three lanes and not one

Every single-lane ingestion design eventually fails:

  • Batch-only — fine for initial backfills, misses anything needing seconds of freshness; overkill for small producers.
  • Insert-only — pushes the "generate embeddings somewhere" problem back on every caller; batch producers with hundreds of millions of rows can't reasonably call a synchronous API for each.
  • On-the-fly-only — collapses the producer side but now every write is paying inference cost at request time; batch paths with offline sources can't route through it efficiently.

The three-lane design lets each producer hit the lane that matches its shape, without compromising the platform's uniform storage + serving tier below.

Relation to feature-store ingestion

This pattern is structurally parallel to hybrid batch + streaming + direct-write ingestion in feature stores:

| Feature-store lane | Embedding-store analogue |
|---|---|
| Batch (Spark, historical) | Batch materialization |
| Streaming (near-real-time signals) | Insert API |
| Direct writes (downstream pipeline's output) | Insert API + on-the-fly |

The embedding-specific wrinkle is the on-the-fly lane — a feature store rarely needs it, because features are by definition computed by upstream pipelines. Embeddings are produced by models, and "have the platform call the model for me" is genuinely useful.

When to use it

Apply when:

  • Your embedding producers are heterogeneous — batch feature-engineering + real-time apps + experimental small teams all exist in the same org.
  • You want to centralize embedding-model access + versioning at the platform layer.
  • You're already committed to a dual-write storage tier and can treat three lanes as three ingress funnels onto one storage substrate.

Don't apply when:

  • Only one producer shape exists (e.g. only batch) — run one lane.
  • On-the-fly generation is infeasible (tight SLO, policy reasons, embedding-model availability) — expose just batch + Insert API.
