PATTERN
_source field slimming with external re-fetch on update¶
Definition¶
_source field slimming with external re-fetch is the
OpenSearch / Elasticsearch-specific pattern of:
- Removing large fields from the _source (the stored original-document JSON) to save index-storage cost and query latency, while still indexing the fields so they participate in search.
- On document update, re-fetching the removed fields from an external KV store (typically DynamoDB) and re-injecting them into the update, because OpenSearch's update path reconstructs the document from _source, silently dropping anything not present.
The pattern exists because of an update-path gotcha: removing a
large field from _source saves storage and returns smaller payloads
on search, but the next update of the document (e.g. changing the
file name) will re-index the document using only what's in _source
— which means the removed field is wiped from the index.
(Source: sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma)
The _source story¶
In OpenSearch / Elasticsearch:
- _source = the original JSON document passed to the indexer, stored verbatim alongside the index.
- _source is not searchable; it's returned to the client on search responses so clients can render / reason about hits.
- Field indexing is separate from _source storage. A vector field, a keyword field, a numeric field — all have their own index structures (k-NN graph, inverted index, BKD tree). These can exist with or without the field being present in _source.
Therefore you can:
- Index the vector field into k-NN so queries work.
- Exclude the vector field from _source so it's not stored twice and not returned on response.
This is the "slimming" step. For a 1024-dim float32 vector at 4 KB,
removing it from _source cuts per-document storage by ~4 KB and
network payload on non-vector-fetching search responses.
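The slimming step is declared at mapping time. A minimal sketch of such a mapping, with the per-document savings arithmetic (index name and k-NN settings are illustrative, not from the source):

```python
# Hypothetical OpenSearch mapping: the embedding is indexed for k-NN
# search but excluded from the stored _source JSON.
frames_mapping = {
    "settings": {"index.knn": True},
    "mappings": {
        "_source": {"excludes": ["embedding"]},  # the slimming step
        "properties": {
            "metadata": {"type": "object"},
            "embedding": {"type": "knn_vector", "dimension": 1024},
        },
    },
}

# A 1024-dim float32 vector is 4 bytes per component:
# 1024 * 4 = 4096 bytes (~4 KB) saved per document in _source.
per_doc_savings_bytes = 1024 * 4
```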
The update footgun¶
OpenSearch / Elasticsearch update semantics:
"On document updates, OpenSearch relies on _source to diff and write the updated document, instead of updating each index in-place. It grabs the existing fields from _source, applies the update, and then writes the document."
So:
- You store a document with a vector field → _source excludes the vector → vector lives in the k-NN index.
- You update the file name → update path reads _source (no vector) → applies file-name change → writes the new document (no vector) → the k-NN index's vector is now gone.
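The wipe can be simulated in a few lines by modeling the update path as "read _source, merge the partial doc, write the result" (a pure-Python sketch of the semantics, not the actual OpenSearch internals):

```python
def reconstruct_and_update(stored_source: dict, partial_doc: dict) -> dict:
    # Simplified model of the update path: start from _source, apply the
    # partial update, and the merged result becomes the new document.
    return {**stored_source, **partial_doc}

# _source was slimmed: the embedding was indexed but never stored here.
stored_source = {"file_name": "old.fig"}  # no "embedding" key

new_doc = reconstruct_and_update(stored_source, {"file_name": "new.fig"})

# The re-indexed document has no embedding: the k-NN entry is rebuilt
# without a vector, and semantic search silently stops ranking this doc.
assert "embedding" not in new_doc
```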
Result: silent data loss on every update. The search pipeline keeps working structurally — OpenSearch accepts the update — but the vector is wiped and the document no longer ranks via semantic search.
Figma found this the hard way during development. Fix: re-fetch the vector from DynamoDB (the KV store where they persist embeddings alongside metadata) and include it in the update payload.
Mechanism (Figma realisation)¶
```python
def index_frame(frame_id, metadata, embedding):
    opensearch.index(
        index="frames",
        id=frame_id,
        body={
            "metadata": metadata,
            "embedding": embedding,
        },
        # exclude embedding from _source on the mapping:
        # "_source": {"excludes": ["embedding"]}
    )


def update_frame_metadata(frame_id, new_metadata):
    # the footgun: don't just call update() with new_metadata;
    # fetch-and-inject the slimmed field so it survives the rewrite
    embedding = dynamodb.get_embedding(frame_id)
    opensearch.update(
        index="frames",
        id=frame_id,
        body={
            "doc": {
                "metadata": new_metadata,
                "embedding": embedding,  # re-inject
            }
        },
    )
```
Essentials:
- Mapping-level _source.excludes declaration tells OpenSearch not to store the vector field in _source.
- On every update, re-fetch and re-inject the slimmed fields.
- The KV store (DynamoDB) is the authoritative copy of the slimmed fields; OpenSearch holds the index but not the source of truth.
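These essentials can be funneled through a single update helper so the re-fetch can't be forgotten (a sketch with hypothetical client objects and field names, not Figma's actual code):

```python
SLIMMED_FIELDS = ("embedding",)  # fields excluded from _source


def safe_update(opensearch, kv_store, index, doc_id, partial_doc):
    """Update a document, re-injecting every _source-slimmed field.

    Service code calls this instead of opensearch.update() directly,
    so no code path can accidentally drop a slimmed field.
    """
    doc = dict(partial_doc)
    for field in SLIMMED_FIELDS:
        # The KV store (e.g. DynamoDB) is the authoritative copy.
        doc[field] = kv_store.get(doc_id, field)
    return opensearch.update(index=index, id=doc_id, body={"doc": doc})
```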
Why this is better than not slimming¶
Alternative: keep the vector in _source. Costs:
- Storage. Vector stored twice (once in k-NN, once in _source). For a billion-doc index that's billions × 4 KB, i.e. terabytes of duplicated storage.
- Query payload. Even if clients don't want the vector back on response, returning it inflates response size.
- Network egress. More bytes per search hit.
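The storage arithmetic at that scale, worked through (the billion-document count is illustrative):

```python
# 1024-dim float32 vector: 4 bytes per component.
vector_bytes = 1024 * 4            # 4096 bytes, ~4 KB per document
docs = 1_000_000_000               # hypothetical billion-document index

duplicate_storage_tb = docs * vector_bytes / 1e12
# ~4.1 TB of _source storage that exists purely as a second copy
# of vectors already held in the k-NN index.
```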
For cost-sensitive vector workloads at scale (Figma's OpenSearch was the second-biggest cost driver after thumbnailing — both memory-bound), slimming _source is meaningful. The re-fetch pattern is the price of storing each vector only once.
Alternatives¶
- Use update_by_query with the full document. Rebuild the full document (including the fetched vector) and push. Same idea, different API.
- Avoid updates; use full re-indexing. If the update frequency is low and idempotency is cheap, just re-index the document from authoritative state each time. Figma's design implies frequent updates (file-name changes propagate), so this doesn't apply.
- Keep vectors in _source. Don't slim; pay for the duplicate storage. Fine at small scale; painful at billions of vectors.
- Avoid the vector field entirely; use external k-NN per shard. Niche; increases orchestration complexity.
When to apply¶
- OpenSearch / Elasticsearch k-NN deployments at scale where vector _source storage is material.
- External KV already in the architecture (DynamoDB, Redis) that can serve as the re-fetch source cheaply.
- Updates are common enough that you'd lose vectors regularly if the re-fetch discipline isn't in place.
Don't apply when:
- You're not on OpenSearch / Elasticsearch (the footgun is stack-specific).
- No external KV exists; re-fetching means round-tripping the authoritative DB, which might be slow.
- Document update rate is low enough that full re-indexing is the simpler discipline.
Caveats¶
- Code-path discipline. Every update path in the service has to re-fetch. A single "forgotten" update path wipes vectors silently. Linting / wrapper libraries help.
- KV consistency matters. The KV must have the same vector OpenSearch had. If the vector is regenerated by an async job, ordering matters — update can't happen before the KV has the new vector.
- Monitoring. Silent data loss is the worst failure mode. Run a sanity check on the rate of documents with a non-null vector in the k-NN index; _source can't reveal a missing vector, since the field is excluded.
- Does not apply to delete. Deletes don't need re-fetch; they remove the document entirely.
- Applies to other slimmed fields too. Vectors are the canonical
example because they're big. Binary blobs, long text, anything
slimmed from _source has the same footgun.
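The monitoring caveat reduces to a coverage metric (a hypothetical helper; a real deployment would derive the counts from the k-NN index rather than from _source, which can't show the excluded field):

```python
def vector_coverage(total_docs: int, docs_with_vector: int) -> float:
    """Fraction of documents whose k-NN entry actually holds a vector.

    On an index where every document should have a vector, a coverage
    drop below 1.0 is the signature of a forgotten re-fetch path
    wiping embeddings on update.
    """
    if total_docs == 0:
        return 1.0
    return docs_with_vector / total_docs


assert vector_coverage(1000, 1000) == 1.0
assert vector_coverage(1000, 990) < 1.0  # alert-worthy dip
```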
Related¶
- systems/amazon-opensearch-service — the substrate where this footgun lives.
- systems/figma-ai-search — canonical instance.
- systems/dynamodb — the re-fetch source.
- concepts/vector-embedding — what's being slimmed.
Seen in¶
- sources/2026-04-21-figma-the-infrastructure-behind-ai-search-in-figma
— Figma explicitly narrates the bug (updates wiping embeddings
from
_source-slimmed documents) and its fix (re-fetch from DynamoDB on update).