
PATTERN

Embedding Ingestion Modes (batch + Insert API + on-the-fly)

Definition

Embedding ingestion modes is the pattern in which a centralized embedding platform exposes three complementary ingestion paths — matched to different producer shapes, volumes, and freshness needs — rather than forcing every embedding producer through a single pipeline:

  1. Batch materialization — large-volume, feature-engineered embeddings loaded through a Spark-backed materialization job pulling from one or more offline sources.
  2. Insert API — synchronous real-time writes for small batches or streaming producers that already have their embeddings in hand.
  3. On-the-fly embedding generation — the platform itself calls named embedding models to produce vectors dynamically for producers who don't own an inference path.
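As an illustrative sketch only — the class and method names below (`EmbeddingStore`, `insert`, `ingest_batch`, `embed_and_insert`) are hypothetical, not Expedia's actual API — the three lanes can be modeled as one ingestion facade:

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingStore:
    # collection -> list of (id, vector) pairs; stands in for the storage tier
    collections: dict = field(default_factory=dict)

    def insert(self, collection: str, items: list[tuple[str, list[float]]]):
        """Lane 2: synchronous Insert API -- the caller already has vectors."""
        self.collections.setdefault(collection, []).extend(items)

    def ingest_batch(self, collection: str, offline_rows):
        """Lane 1: batch materialization -- in reality a Spark-backed job over
        offline sources; here just a bulk load for illustration."""
        self.insert(collection, list(offline_rows))

    def embed_and_insert(self, collection: str,
                         texts: list[tuple[str, str]], embed_fn):
        """Lane 3: on-the-fly -- the platform calls the embedding model
        (embed_fn) on behalf of a producer with no inference path."""
        self.insert(collection, [(doc_id, embed_fn(text))
                                 for doc_id, text in texts])

store = EmbeddingStore()
store.insert("hotels", [("h1", [0.1, 0.2])])                 # lane 2
store.ingest_batch("hotels", [("h2", [0.3, 0.4])])           # lane 1
store.embed_and_insert("hotels", [("h3", "ocean view suite")],
                       embed_fn=lambda t: [float(len(t))])   # lane 3
```

All three calls funnel into the same `insert`, mirroring the point that the lanes differ only at ingress, not in the storage tier beneath them.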

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

The three lanes

| Lane | Producer shape | Volume | Freshness | Typical substrate |
|---|---|---|---|---|
| Batch materialization | Offline feature-engineering pipeline | High | Hours–minutes | systems/feast materialization + Spark |
| Insert API | Application writing embeddings it already has | Low–medium | Seconds–minutes | Synchronous REST/RPC |
| On-the-fly generation | Caller who wants the platform to embed for them | Any | Sub-second for the embed step | Platform → embedding model |

Expedia Embedding Store realization

The Expedia Embedding Store Service names exactly this triad:

  1. Batch Ingestion: "For large volumes of embeddings generated through feature engineering processes, the Embedding Store Service provides a batch ingestion mechanism utilizing Feast materialization. This uses a Spark-based process to efficiently load data from one or more offline sources."
  2. Insert API for Small Batches or Real-Time Data: "When working with smaller batches of embeddings or handling real-time embedding generation, users can use the standard Insert API to load data directly into the service."
  3. On-the-Fly Embedding Generation: "For scenarios where embedding generation needs to be offloaded, the Embedding Store Service can generate embeddings dynamically by calling specific models to generate embeddings on the fly."

All three modes terminate in the same dual-write discipline — every ingested embedding lands simultaneously in both online (vector DB) and offline (historical repository) stores, regardless of which lane produced it. This keeps the storage tier uniform behind heterogeneous producers.
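A minimal sketch of that dual-write funnel — the store names and shapes below are assumptions for illustration, not Expedia's internals:

```python
# Hypothetical dual-write: every lane terminates in one write path that
# lands each embedding in both the online and offline stores.

online_store = {}    # stands in for the vector DB (serves latest value)
offline_store = []   # stands in for the historical repository (append-only)

def dual_write(collection: str, doc_id: str, vector: list[float]):
    online_store[(collection, doc_id)] = vector          # online: overwrite
    offline_store.append((collection, doc_id, vector))   # offline: keep history

# All three lanes call the same funnel, so storage stays uniform:
dual_write("hotels", "h1", [0.1, 0.2])
dual_write("hotels", "h1", [0.3, 0.4])  # re-ingest: online holds the latest,
                                        # offline retains both versions
```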

(Source: sources/2026-01-06-expedia-powering-vector-embedding-capabilities)

The on-the-fly lane is the distinctive one

Batch and Insert API are the usual ingestion primitives any data platform exposes. The on-the-fly lane is the one that's specific to embedding ingestion:

  • It collapses "embedding-generation service" + "embedding-store service" into a single caller-facing API.
  • It serves the small-producer / experimentation case — a team with a few thousand docs and no inference pipeline can just send text + collection name and get vectors stored.
  • It centralizes model access control + version pinning — the collection metadata tells the platform which model version to call, so a producer can never accidentally mix models within a collection.
  • It moves embedding-inference cost onto the platform's billed infrastructure rather than the caller's, which is either a feature or a footgun depending on funding model.
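The version-pinning point can be sketched as follows — the metadata shape, model registry, and function names here are assumptions, not the source's design:

```python
# Hypothetical: collection metadata pins the embedding model + version, so
# on-the-fly callers can never mix models within a collection.

collection_meta = {"reviews": {"model": "text-embed", "version": "v2"}}

model_registry = {
    ("text-embed", "v1"): lambda text: [1.0, 1.0, 1.0, 1.0],
    ("text-embed", "v2"): lambda text: [2.0, 2.0, 2.0, 2.0],
}

def embed_on_the_fly(collection: str, text: str) -> list[float]:
    meta = collection_meta[collection]   # platform, not caller, picks the model
    model = model_registry[(meta["model"], meta["version"])]
    return model(text)
```

The caller sends only text plus a collection name; which model version runs is resolved entirely from the collection's pinned metadata.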

The cost: the platform now has an inference SLO too, not just a storage / retrieval SLO.

Why three lanes and not one

Every single-lane ingestion design eventually fails:

  • Batch-only — fine for initial backfills, misses anything needing seconds of freshness; overkill for small producers.
  • Insert-only — pushes the "generate embeddings somewhere" problem back on every caller; batch producers with hundreds of millions of rows can't reasonably call a synchronous API for each.
  • On-the-fly-only — collapses the producer side but now every write is paying inference cost at request time; batch paths with offline sources can't route through it efficiently.

The three-lane design lets each producer hit the lane that matches its shape, without compromising the platform's uniform storage + serving tier below.

Relation to feature-store ingestion

This pattern is structurally parallel to hybrid batch + streaming + direct-write ingestion in feature stores:

| Feature-store lane | Embedding-store analogue |
|---|---|
| Batch (Spark, historical) | Batch materialization |
| Streaming (near-real-time signals) | Insert API |
| Direct writes (downstream pipeline's output) | Insert API + on-the-fly |

The embedding-specific wrinkle is the on-the-fly lane — a feature store rarely needs it, because features are by definition computed by upstream pipelines. Embeddings are produced by models, and "have the platform call the model for me" is genuinely useful.

When to use it

Apply when:

  • Your embedding producers are heterogeneous — batch feature-engineering + real-time apps + experimental small teams all exist in the same org.
  • You want to centralize embedding-model access + versioning at the platform layer.
  • You're already committed to a dual-write storage tier and can treat three lanes as three ingress funnels onto one storage substrate.

Don't apply when:

  • Only one producer shape exists (e.g. only batch) — run one lane.
  • On-the-fly generation is infeasible (tight SLO, policy reasons, embedding-model availability) — expose just batch + Insert API.
