PATTERN

_source field slimming with external re-fetch on update

Definition

_source field slimming with external re-fetch is the OpenSearch / Elasticsearch-specific pattern of:

  1. Removing large fields from the _source (the stored original-document JSON) to save index-storage cost and query latency, while still indexing the fields so they participate in search.
  2. On document update, re-fetching the removed fields from an external KV store (typically DynamoDB) and re-injecting them into the update, because OpenSearch's update path reconstructs the document from _source, silently dropping anything not present.

The pattern exists because of an update-path gotcha: removing a large field from _source saves storage and returns smaller payloads on search, but the next update of the document (e.g. changing the file name) will re-index the document using only what's in _source — which means the removed field is wiped from the index.

(Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)

The _source story

In OpenSearch / Elasticsearch:

  • _source = the original JSON document passed to the indexer, stored verbatim alongside the index.
  • _source is not searchable; it's returned to the client on search responses so clients can render / reason about hits.
  • Field indexing is separate from _source storage. A vector field, a keyword field, a numeric field — all have their own index structures (k-NN graph, inverted index, BKD tree). These can exist with or without the field being present in _source.

Therefore you can:

  • Index the vector field into k-NN so queries work.
  • Exclude the vector field from _source so it's not stored twice and not returned on response.

This is the "slimming" step. For a 1024-dim float32 vector (~4 KB), removing it from _source cuts per-document _source storage by ~4 KB and trims the network payload on every search response that doesn't need the vector back.
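
A minimal sketch of such a mapping, assuming a 1024-dim float32 embedding field (index and field names are illustrative, following the shape accepted by opensearch-py's indices.create()):

```python
# Hypothetical index body: the embedding is indexed as a k-NN vector
# but excluded from _source, so it is searchable yet never stored twice.
frames_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "_source": {"excludes": ["embedding"]},
        "properties": {
            "metadata": {"type": "object"},
            "embedding": {"type": "knn_vector", "dimension": 1024},
        },
    },
}

# per-document _source savings: 1024 floats x 4 bytes = 4096 bytes (~4 KB)
vector_bytes = 1024 * 4
```

With this mapping, k-NN queries against the embedding field work normally, but the field is absent from _source and therefore from search responses.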

The update footgun

OpenSearch / Elasticsearch update semantics:

"On document updates, OpenSearch relies on _source to diff and write the updated document, instead of updating each index in-place. It grabs the existing fields from _source, applies the update, and then writes the document."

So:

  1. You store a document with a vector field → _source excludes the vector → vector lives in the k-NN index.
  2. You update the file name → update path reads _source (no vector) → applies file-name change → writes the new document (no vector) → k-NN index's vector is now gone.

Result: silent data loss on every update. The search pipeline keeps working structurally — OpenSearch accepts the update — but the vector is wiped and the document no longer ranks via semantic search.
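
The merge semantics can be simulated in a few lines of plain Python (a toy model of the update path, not the real engine):

```python
def update_from_source(stored_source, partial_doc):
    # Toy model: OpenSearch reads _source, merges the partial update,
    # and re-indexes the merged result as the new document.
    merged = dict(stored_source)
    merged.update(partial_doc)
    return merged

# _source was slimmed, so the stored copy has no "embedding" key
stored_source = {"metadata": {"file_name": "wireframes.fig"}}

reindexed = update_from_source(
    stored_source, {"metadata": {"file_name": "wireframes-v2.fig"}}
)
assert "embedding" not in reindexed  # the vector is silently gone
```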

Figma found this the hard way during development. Fix: re-fetch the vector from DynamoDB (the KV store where they persist embeddings alongside metadata) and include it in the update payload.

Mechanism (Figma realisation)

def index_frame(frame_id, metadata, embedding):
    # `opensearch` and `dynamodb` are illustrative client handles
    # (e.g. opensearch-py / boto3 wrappers), not literal APIs
    opensearch.index(
        index="frames",
        id=frame_id,
        body={
            "metadata": metadata,
            "embedding": embedding,
        },
        # the index mapping carries the slimming declaration:
        # "_source": {"excludes": ["embedding"]}
    )

def update_frame_metadata(frame_id, new_metadata):
    # the footgun: don't just call update() with new_metadata
    # fetch-and-inject the slimmed field so it survives the rewrite
    embedding = dynamodb.get_embedding(frame_id)
    opensearch.update(
        index="frames",
        id=frame_id,
        body={
            "doc": {
                "metadata": new_metadata,
                "embedding": embedding,   # re-inject
            }
        }
    )

Essentials:

  • Mapping-level _source.excludes declaration tells OpenSearch not to store the vector field in _source.
  • On every update, re-fetch and re-inject the slimmed fields.
  • The KV store (DynamoDB) is the authoritative copy of the slimmed fields; OpenSearch holds the index but not the source of truth.

Why this is better than not slimming

Alternative: keep the vector in _source. Costs:

  • Storage. Vector stored twice (once in k-NN, once in _source). For a billion-document index that's 10⁹ × 4 KB ≈ 4 TB of duplicate storage.
  • Query payload. Even if clients don't want the vector back on response, returning it inflates response size.
  • Network egress. More bytes per search hit.
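
Back-of-envelope, using the ~4 KB vector from above (the document count is illustrative):

```python
docs = 1_000_000_000                 # illustrative billion-document index
vector_bytes = 1024 * 4              # 1024-dim float32 vector = 4 KB
duplicate_source_bytes = docs * vector_bytes

# ~4.1 TB of _source storage avoided by slimming
print(duplicate_source_bytes / 10**12)
```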

For cost-sensitive vector workloads at scale (Figma's OpenSearch was the second-biggest cost driver after thumbnailing; both are memory-bound), slimming _source is meaningful. The re-fetch pattern is the price of storing each vector only once.

Alternatives

  • Use update_by_query with the full document. Rebuild the full document (including the fetched vector) and push. Same idea, different API.
  • Avoid updates; use full re-indexing. If the update frequency is low and idempotency is cheap, just re-index the document from authoritative state each time. Figma's design implies frequent updates (file-name changes propagate), so this doesn't apply.
  • Keep vectors in _source. Don't slim; pay for the duplicate storage. Fine at small scale; painful at billions of vectors.
  • Avoid the vector field entirely; use external k-NN per shard. Niche; increases orchestration complexity.

When to apply

  • OpenSearch / Elasticsearch k-NN deployments at scale where vector _source storage is material.
  • External KV already in the architecture (DynamoDB, Redis) that can serve as the re-fetch source cheaply.
  • Updates are common enough that you'd lose vectors regularly if the re-fetch discipline isn't in place.

Don't apply when:

  • You're not on OpenSearch / Elasticsearch (the footgun is stack-specific).
  • No external KV exists; re-fetching means round-tripping the authoritative DB, which might be slow.
  • Document update rate is low enough that full re-indexing is the simpler discipline.

Caveats

  • Code-path discipline. Every update path in the service has to re-fetch. A single "forgotten" update path wipes vectors silently. Linting / wrapper libraries help.
  • KV consistency matters. The KV must have the same vector OpenSearch had. If the vector is regenerated by an async job, ordering matters — update can't happen before the KV has the new vector.
  • Monitoring. Silent data loss is the worst failure mode. Add a sanity check: periodically sample documents and verify the k-NN index still holds a vector for every document that should have one.
  • Does not apply to delete. Deletes don't need re-fetch; they remove the document entirely.
  • Applies to other slimmed fields too. Vectors are the canonical example because they're big. Binary blobs, long text, anything slimmed from _source has the same footgun.
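
One way to enforce the re-fetch discipline from the first caveat is a single update wrapper that every code path must go through (a sketch; the field list and fetch hook are assumptions):

```python
SLIMMED_FIELDS = ("embedding",)  # fields excluded from _source

def build_update_body(doc_id, partial_doc, fetch_slimmed_field):
    """Build an update body that re-injects every slimmed field,
    so no caller can accidentally wipe one from the index."""
    doc = dict(partial_doc)
    for field in SLIMMED_FIELDS:
        if field not in doc:
            # pull the authoritative copy from the external KV store
            doc[field] = fetch_slimmed_field(doc_id, field)
    return {"doc": doc}

# usage (illustrative):
# opensearch.update(index="frames", id=doc_id,
#                   body=build_update_body(doc_id, changes, kv_get))
```

Routing all updates through one choke point turns "every engineer must remember the re-fetch" into "the wrapper remembers it once".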
