Semantic search over agent memory¶
Definition¶
Semantic search over agent memory is the retrieval model where an AI agent's persistent memory store is accessed via natural-language queries resolved against vector similarity — the user's question (or a reformulated version of it) is embedded, compared against pre-embedded memory chunks, and the top-K closest memories are returned as context for the agent's response.
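A minimal sketch of this loop in Python (the embedding step, vector store, and chunk format are all placeholders; none of these internals is disclosed for Grafana Assistant):

```python
import numpy as np

def semantic_search(query_vec: np.ndarray,
                    memory_vecs: np.ndarray,
                    memory_chunks: list[str],
                    k: int = 5) -> list[tuple[float, str]]:
    """Return the top-K memory chunks by cosine similarity to the query.

    memory_vecs holds the pre-embedded chunks (one row per chunk), computed
    at memory-write time; only the query is embedded at question time.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    sims = m @ q                                  # cosine similarity per chunk
    top = np.argsort(sims)[::-1][:k]              # indices of the K best scores
    return [(float(sims[i]), memory_chunks[i]) for i in top]
```

A production store replaces the brute-force matrix product with an approximate-nearest-neighbor index (see the retrieval-latency note under the canonical framing below).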
Semantic search contrasts with the alternative retrieval models:
- Structured query ("SELECT dependencies FROM services WHERE name='checkout-api'") — requires the user/agent to know the exact service name.
- Full-text / inverted index — requires a lexical term match; misses semantically-related queries phrased differently.
- Direct key lookup — requires the exact key; no fuzzy matching.
Semantic search sits between these: the user asks "what depends on checkout?", the query is embedded, and the checkout-api service group's dependency memory is returned even though the user said "checkout" and the memory key is checkout-api.
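A toy contrast on this example (store contents and helper names are hypothetical; semantic_search() is the sketch from the definition above):

```python
memories = {
    "checkout-api/dependencies":
        "checkout-api depends on payments-api, inventory-api, and redis-cart",
}
query = "what depends on checkout?"

# Direct key lookup: fails; the user never typed the exact key.
assert memories.get(query) is None

# Full-text match: only works when tokens overlap lexically. "depends" and
# "checkout" happen to overlap here, but "upstream of the cart service?"
# would return nothing.
tokens = query.lower().rstrip("?").split()
lexical_hits = [key for key, text in memories.items()
                if any(tok in text for tok in tokens)]

# Semantic search: the embedded query lands near the embedded chunk text,
# so the dependency memory is returned despite the checkout / checkout-api
# key mismatch.
```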
Canonical framing (Grafana Assistant, 2026-05-01)¶
(Source: sources/2026-05-01-grafana-how-grafana-assistant-learns-your-infrastructure-before-you-even-ask)
"This knowledge is stored as searchable chunks in a vector database, so when you or the assistant need information about a specific service, it can be retrieved in milliseconds through semantic search."
Two load-bearing properties:
- Chunked storage. Memories are stored as chunks (not monolithic documents per service). Chunk boundaries are presumably aligned with the five-category schema's axes — so a query about "dependencies" pulls the Dependencies chunk, not the whole service-group summary.
- Milliseconds retrieval. Vector-DB sublinear search (HNSW, SPANN, or similar — not disclosed) puts retrieval in the milliseconds class, well below LLM generation latency, so retrieval is not the bottleneck.
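To make the milliseconds claim concrete, here is what sublinear ANN retrieval looks like with hnswlib, one common HNSW implementation. Grafana discloses neither its vector DB nor its index type, so this is purely illustrative:

```python
import hnswlib
import numpy as np

dim, n_chunks = 384, 100_000       # assumed embedding size and corpus size
chunk_vecs = np.random.rand(n_chunks, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_chunks, ef_construction=200, M=16)
index.add_items(chunk_vecs, np.arange(n_chunks))

index.set_ef(50)                   # query-time recall/speed knob
labels, distances = index.knn_query(
    np.random.rand(dim).astype(np.float32), k=5)
# The graph traversal visits a roughly logarithmic number of nodes rather
# than all 100k vectors, which is what puts retrieval in the milliseconds
# class, well below LLM generation latency.
```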
Why semantic search is the right retrieval shape for agent memory¶
Three shape-properties make semantic search the right fit:
- User phrasing varies. "What depends on payments?", "upstream of payments-api?", "who calls the payment service?" — all should resolve to the same memory. Only embedding-based retrieval handles this natively.
- The agent may reformulate the query. Internal reasoning steps can generate intermediate queries different from the user's literal phrasing. Semantic similarity absorbs this variance.
- Memories use the customer's terminology, queries use the user's. Metric names in memory are the actual Prometheus labels (http_request_duration_seconds{service="checkout-api"}); the user says "latency for checkout." Embedding both in the same semantic space bridges the vocabulary gap.
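One way the reformulation variance could be absorbed, as a sketch: run the same top-K search per phrasing and merge candidates by their best score. The merge rule and helper names (embed, search) are assumptions, not a disclosed mechanism.

```python
def search_with_reformulations(phrasings: list[str], embed, search,
                               k: int = 5) -> list[tuple[str, float]]:
    """Merge top-K results across phrasings, keeping each chunk's best score."""
    best: dict[str, float] = {}
    for q in phrasings:
        for score, chunk_id in search(embed(q), k):
            best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]

# All of these should converge on the same dependency chunk:
# search_with_reformulations(
#     ["what depends on payments?", "upstream of payments-api?",
#      "who calls the payment service?"], embed, search)
```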
Architectural composition¶
Semantic search over agent memory typically composes with:
- Hybrid retrieval — combining dense (vector) and sparse (keyword) retrieval for queries where exact term match matters (metric names, K8s object names). Not explicitly disclosed for Grafana Assistant.
- ACL filtering at retrieval — memories filtered by the user's data-source access list before similarity scoring (concepts/acl-propagated-agent-memory). Grafana Assistant applies this filter so users only see memories derived from data sources they can access.
- Top-K selection — retrieve N > K candidates, filter by ACL, return top K. K tuning balances context-window consumption against recall.
- RAG pipeline — retrieved memories are injected into the LLM's context as grounded reference material for response generation.
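A sketch of how these pieces could compose, using reciprocal rank fusion as the hybrid scheme (a common choice, not disclosed for Grafana Assistant) and placing the ACL filter post-retrieval as in the top-K item. Only the ACL filtering itself is a disclosed behavior; every helper name here is an assumption.

```python
def rrf_fuse(rankings: list[list[str]], c: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: score(d) = sum over rankers of 1/(c + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)
    return scores

def retrieve(query: str, user_sources: set[str], embed, dense_search,
             sparse_search, chunk_source, n: int = 50, k: int = 5) -> list[str]:
    dense = dense_search(embed(query), n)    # vector top-N (chunk ids, best first)
    sparse = sparse_search(query, n)         # keyword top-N (chunk ids, best first)
    fused = rrf_fuse([dense, sparse])
    # ACL filter: drop memories derived from data sources the user cannot read.
    allowed = {cid: s for cid, s in fused.items()
               if chunk_source(cid) in user_sources}
    return sorted(allowed, key=allowed.get, reverse=True)[:k]
```

The returned K chunks are what gets injected into the LLM context in the RAG step.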
Contrast with retrieval models in other agent-memory instances¶
| Instance | Substrate | Retrieval model |
|---|---|---|
| Meta precompute engine | Per-module markdown files | Agent opens specific file by path |
| Cloudflare agents-that-remember | Conversation thread summaries | Keyed by thread ID + semantic recall |
| Grafana Assistant (this page) | Per-service-group structured chunks | Semantic search over vector DB |
Grafana Assistant's bet is that semantic search is more natural for the conversational infrastructure-query use case than structured lookup — users don't know the internal service-group identifier for checkout-api; they just know they want to ask about checkout.
Failure modes¶
- Embedding drift. If the embedding model is updated, old memories must be re-embedded or a mismatch appears between query and memory vectors. Not disclosed how this is handled across weekly refresh cycles.
- Near-miss retrieval. Two service groups with semantically similar names (payments-api vs payments-v2-api) may both match the same query; disambiguation depends on chunk-level metadata.
- Empty result. No memories match above a similarity threshold → agent must gracefully degrade to either a generic-LLM-knowledge response or an admission of ignorance. Failure-handling strategy undisclosed; a sketch of one degradation path follows this list.
- Hallucinated recall. The agent may claim recall of a memory that wasn't retrieved (standard LLM hallucination risk); mitigated by keeping retrieval sources in the context and requiring citations.
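A sketch of mitigations for two of these modes: a similarity floor for the empty-result case, and per-vector model tags to guard against embedding drift. The threshold value and the versioning scheme are assumptions; Grafana discloses neither.

```python
EMBED_MODEL_VERSION = "v2"   # hypothetical tag for the current embedding model
SIM_THRESHOLD = 0.75         # hypothetical floor; needs per-model tuning

def recall(candidates: list[tuple[float, str, str]]):
    """candidates: (similarity, chunk_text, model_version) triples."""
    # Drift guard: vectors embedded under an older model live in a different
    # space; exclude them rather than compare mismatched vectors. A background
    # job would re-embed stale chunks after a model upgrade.
    fresh = [(s, c) for s, c, v in candidates if v == EMBED_MODEL_VERSION]
    hits = [(s, c) for s, c in fresh if s >= SIM_THRESHOLD]
    if not hits:
        # Empty-result degradation: the caller falls back to generic LLM
        # knowledge or an explicit admission of ignorance, never a fake recall.
        return None
    return hits
```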
Seen in¶
- sources/2026-05-01-grafana-how-grafana-assistant-learns-your-infrastructure-before-you-even-ask — canonical wiki instance of semantic search over agent memory at the observability-stack altitude. Embedding model + vector DB + top-K + chunking strategy are all undisclosed; what's disclosed is the retrieval-latency class (milliseconds) and the chunked-store shape.