Cloudflare Agent Memory¶
Overview¶
Agent Memory is Cloudflare's managed memory service for AI agents, announced in private beta on 2026-04-17 (blog.cloudflare.com/introducing-agent-memory). It extracts information from agent conversations, stores it as classified memories outside the context window, and retrieves it via a multi-channel fusion pipeline — so agents "recall what matters, forget what doesn't, and get smarter over time" without filling up context.
Architecturally, it is a Cloudflare Worker coordinating:
- a per-profile Durable Object (raw messages + classified memories in SQLite + FTS + supersession chains)
- a per-profile Vectorize index
- Workers AI models (Llama 4 Scout for extraction / verification / classification / query analysis; Nemotron 3 for synthesis)
- (future) R2 for snapshots + exports
Position in the Cloudflare agent stack¶
Agent Memory is explicitly distinguished from AI Search:
"While search is a component of memory, agent search and agent memory solve distinct problems. AI Search is our primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions."
It is the fifth substrate of agent memory in the 2026-04 Cloudflare arc:
- Project Think Persistent Sessions — tree-structured conversation memory (episodic).
- AI Search per-customer instances — semantic memory (knowledge base).
- Artifacts repo-per-session — filesystem + session-history memory.
- Email Service thread + DO state — email-channel memory.
- Agent Memory — session-conversation-derived recall.
Six-operation API surface¶
Deliberately narrow — the model never burns context on storage-strategy reasoning (the tool-surface-minimization posture applied to the memory tier).
| Operation | Caller | Purpose |
|---|---|---|
| `getProfile(name)` | harness | returns an isolated memory store addressed by name |
| `ingest(messages, {sessionId})` | harness | bulk path at compaction: extract + classify + store |
| `remember({content, sessionId})` | model | direct tool: store a single memory on the spot |
| `recall(query)` | model | run the retrieval pipeline, return a synthesised natural-language answer |
| `list()` | model | enumerate stored memories |
| `forget(memoryId)` | model | mark a memory as no longer important or no longer true |
Exposed both as a Workers binding (env.MEMORY.getProfile(...)) and a REST API for agents running outside Workers. Integrated into the Agents SDK as the reference implementation for the Sessions API memory portion.
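The launch post names the six operations but does not publish type signatures. A minimal TypeScript sketch of the surface, with an in-memory mock standing in for the real binding (all type and field names here are assumptions, and the mock models none of the real pipeline):

```typescript
// Hypothetical shape of the six-operation surface; field names are assumptions.
interface MemoryRecord {
  id: string;
  content: string;
  sessionId?: string;
}

interface MemoryProfile {
  ingest(messages: { role: string; content: string }[], opts: { sessionId: string }): Promise<void>;
  remember(input: { content: string; sessionId?: string }): Promise<string>;
  recall(query: string): Promise<string>;
  list(): Promise<MemoryRecord[]>;
  forget(memoryId: string): Promise<void>;
}

// In-memory mock: enough to exercise the surface; real extraction,
// classification, and five-channel retrieval are not modelled.
class MockProfile implements MemoryProfile {
  private memories = new Map<string, MemoryRecord>();
  private nextId = 0;

  async ingest(messages: { role: string; content: string }[], opts: { sessionId: string }): Promise<void> {
    for (const m of messages) await this.remember({ content: m.content, sessionId: opts.sessionId });
  }

  async remember(input: { content: string; sessionId?: string }): Promise<string> {
    const id = `mem-${this.nextId++}`;
    this.memories.set(id, { id, ...input });
    return id;
  }

  async recall(query: string): Promise<string> {
    // Substring match stands in for the real fusion pipeline.
    return [...this.memories.values()]
      .filter(m => m.content.toLowerCase().includes(query.toLowerCase()))
      .map(m => m.content)
      .join("; ");
  }

  async list(): Promise<MemoryRecord[]> {
    return [...this.memories.values()];
  }

  async forget(memoryId: string): Promise<void> {
    this.memories.delete(memoryId); // the real service marks rather than deletes
  }
}
```

A harness would reach the real implementation via `env.MEMORY.getProfile(...)`; the mock is only there to make the narrow surface concrete.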
Memory taxonomy¶
Each extracted memory is classified into one of four types:
| Type | Definition | Keyed? | Vector-indexed? | Lifecycle |
|---|---|---|---|---|
| Fact | atomic, stable knowledge ("the project uses GraphQL") | yes (normalised topic key) | yes | superseded on new fact with same key |
| Event | what happened at a specific time (deployment, decision) | no | yes | timestamped, retained |
| Instruction | procedure / workflow / runbook | yes (topic key) | yes | superseded on update |
| Task | what is being worked on right now | no | no (FTS-only) | ephemeral by design |
Facts and instructions supersede old values by the same topic key — old → new forward pointer (version chain), not delete. Vectors for superseded memories are deleted in parallel with new upserts. Tasks are excluded from the vector index to keep it lean but remain FTS-discoverable.
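The keyed supersession behaviour can be sketched as a version chain: a new fact with the same normalised topic key links the old memory forward to its replacement instead of deleting it (field names below are assumptions):

```typescript
// Sketch of keyed supersession: a new memory with the same normalised
// topic key supersedes the old one via a forward pointer, not a delete.
interface KeyedMemory {
  id: string;
  topicKey: string;        // normalised key, e.g. "project.api-style"
  content: string;
  supersededBy?: string;   // forward pointer in the version chain
}

class KeyedStore {
  private byId = new Map<string, KeyedMemory>();
  private currentByKey = new Map<string, string>(); // topicKey -> current id

  upsert(id: string, topicKey: string, content: string): void {
    const prevId = this.currentByKey.get(topicKey);
    if (prevId !== undefined) {
      // Old memory stays, pointing forward at its replacement. (In the
      // real service, the superseded memory's vector is deleted here,
      // in parallel with the new upsert.)
      this.byId.get(prevId)!.supersededBy = id;
    }
    this.byId.set(id, { id, topicKey, content });
    this.currentByKey.set(topicKey, id);
  }

  current(topicKey: string): KeyedMemory | undefined {
    const id = this.currentByKey.get(topicKey);
    return id !== undefined ? this.byId.get(id) : undefined;
  }

  // Walk the chain from any version to the latest.
  latestFrom(id: string): KeyedMemory {
    let m = this.byId.get(id)!;
    while (m.supersededBy !== undefined) m = this.byId.get(m.supersededBy)!;
    return m;
  }
}
```

Keeping the chain (rather than deleting) preserves history: any stale reference can still be walked forward to the current value.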
Ingestion pipeline¶
Multi-stage (see patterns/multi-stage-extraction-pipeline):
- Content-addressed ID per message: `SHA-256(sessionId + role + content)[:128 bits]`. Re-ingesting the same conversation is a no-op via `INSERT OR IGNORE` (concepts/content-addressed-id).
- Extractor — two passes in parallel:
  - Full pass chunks at ~10K characters with 2-message overlap, up to 4 chunks concurrent. Transcripts include role labels, relative dates resolved to absolutes ("yesterday" → 2026-04-14), and line indices for source provenance.
  - Detail pass (for conversations ≥ 9 messages) uses overlapping windows focused on concrete values — names, prices, version numbers, entity attributes — that broad extraction generalises away.
  - The two result sets are merged.
- Verifier — 8 checks per memory against the source transcript: entity identity, object identity, location context, temporal accuracy, organisational context, completeness, relational context, and whether inferred facts are supported by the conversation. Items pass, are corrected, or are dropped.
- Classifier — assigns a type (fact / event / instruction / task), a normalised topic key (facts + instructions), and 3-5 search queries used later for vectorisation.
- Storage — `INSERT OR IGNORE` (content-addressed dedup); supersession chain for facts + instructions.
- Response returned to the harness — before vectorisation completes.
- Async vectorisation (background) — embeds the memory with the 3-5 classifier-generated search queries prepended to the memory content before the embedding call, bridging the declarative-write / interrogative-read gap. Superseded-memory vectors are deleted in parallel with new upserts.
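The first stage can be sketched directly: a deterministic 128-bit ID derived from the message, plus an idempotent insert. The exact field concatenation and encoding are assumptions; the point is that re-ingesting the same conversation derives the same IDs and becomes a no-op:

```typescript
import { createHash } from "node:crypto";

// Content-addressed message ID: SHA-256 over (sessionId + role + content),
// truncated to 128 bits (32 hex chars). Concatenation order and encoding
// are assumptions; determinism is what matters.
function messageId(sessionId: string, role: string, content: string): string {
  return createHash("sha256")
    .update(sessionId + role + content)
    .digest("hex")
    .slice(0, 32); // 128 bits of the digest
}

// Idempotent store: the in-memory equivalent of SQLite's INSERT OR IGNORE.
class MessageStore {
  private rows = new Map<string, { role: string; content: string }>();

  // Returns true if a new row was written, false if it was a duplicate.
  insertOrIgnore(sessionId: string, role: string, content: string): boolean {
    const id = messageId(sessionId, role, content);
    if (this.rows.has(id)) return false; // duplicate: no-op
    this.rows.set(id, { role, content });
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}
```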
Retrieval pipeline¶
Five parallel channels fused with RRF (see patterns/parallel-retrieval-fusion):
| Channel | What it does | RRF weight |
|---|---|---|
| FTS with Porter stemming | keyword precision for queries where the exact term matters but the context doesn't | medium |
| Exact fact-key lookup | direct topic-key match | highest (strongest signal) |
| Raw-message FTS | searches stored conversation messages directly — safety net for verbatim detail the extractor generalised | lowest (safety net) |
| Direct vector search | embed query → ANN over memory vectors | medium |
| HyDE vector search | embed a declarative-answer-shaped synthesis of the query → ANN | medium (often catches what direct embedding misses on abstract / multi-hop) |
Stage 1 (query analysis + embedding) is concurrent. Stage 2 runs the five channels in parallel. Stage 3 does RRF with the above weights; ties broken by recency. Top candidates pass to a synthesis model; temporal computations are pre-computed deterministically via regex + arithmetic and injected into the synthesis prompt as facts — models are unreliable at date math, so the service doesn't ask.
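The fusion step above can be sketched as weighted reciprocal rank fusion: each channel contributes `weight / (k + rank)` per document, scores sum across channels, and ties break by recency. The constant `k = 60` is the conventional RRF value and an assumption here, as are the numeric weights:

```typescript
// Weighted reciprocal rank fusion across retrieval channels.
// score(doc) = sum over channels of weight / (k + rank), rank 1-based.
type Ranked = { id: string; timestamp: number };

function fuse(
  channels: { weight: number; results: Ranked[] }[],
  k = 60, // conventional RRF constant (assumed, not from the post)
): Ranked[] {
  const scores = new Map<string, { doc: Ranked; score: number }>();
  for (const { weight, results } of channels) {
    results.forEach((doc, i) => {
      const entry = scores.get(doc.id) ?? { doc, score: 0 };
      entry.score += weight / (k + i + 1);
      scores.set(doc.id, entry);
    });
  }
  return [...scores.values()]
    // Higher fused score first; ties broken by recency (newer first).
    .sort((a, b) => b.score - a.score || b.doc.timestamp - a.doc.timestamp)
    .map(e => e.doc);
}
```

A document surfaced by the heavily weighted fact-key channel outranks one found only by a lower-weighted channel, while agreement across several channels compounds.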
Isolation and scaling¶
One Durable Object instance per memory profile, SQLite-backed, addressed via getByName() so any request globally reaches the right profile:
"DO's `getByName()` addressing means any request, from anywhere, can reach the right memory profile by name, and ensures that sensitive memories are strongly isolated from other tenants."
Storage stratifies:
- DOs (SQLite) — messages + memories + FTS indexing + supersession chains + transactional writes.
- Vectorize — vectors over embedded memories (with classifier queries prepended to the text before embedding).
- R2 (future) — snapshots + exports.
All AI calls carry an x-session-affinity header routed to the memory profile name, so repeated requests hit the same backend for prompt-caching benefits — same pattern as Workers AI's session-affinity-prompt-caching, keyed here on memory-profile rather than user-session.
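A minimal sketch of that routing, assuming only that the header value is the profile name (the header name comes from the post; everything else is an assumption):

```typescript
// Every Workers AI call from a given profile carries the same affinity key,
// so repeated requests land on the same backend and reuse the prompt cache.
function affinityHeaders(profileName: string): Record<string, string> {
  return {
    "content-type": "application/json",
    // Keyed on the memory-profile name, not the user session.
    "x-session-affinity": profileName,
  };
}
```

Because every pipeline stage for a profile shares long, stable prompt prefixes (extraction and verification instructions), cache hits on the same backend are the whole benefit.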
Canonical storage-tier realisation of concepts/one-to-one-agent-instance — same per-instance economics shown across the 2026-04 Cloudflare arc (Project Think DOs, AI Search instances, Artifacts repos, Email per-address DOs, now Agent Memory profiles).
Model selection¶
| Stage | Model | Rationale |
|---|---|---|
| extraction / verification / classification / query analysis | Llama 4 Scout (17B, 16-expert MoE) | structured classification sweet spot; cost/quality/latency tradeoff |
| synthesis | Nemotron 3 (120B MoE, 12B active) | larger reasoning capacity improves natural-language answer quality |
"The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency."
Not monotonic with parameter count — a deliberate counter-thesis to "bigger is always better".
Internal dogfood workloads¶
- Coding-agent memory via an internal OpenCode plugin. Memory across compaction + across sessions. Less-obvious benefit: shared profile across a team — the agent knows what other teammates' agents have already learned; stops asking answered questions; stops repeating corrected mistakes.
- Agentic code reviewer. "Arguably the most useful thing it learned to do was stay quiet." Remembers that a particular comment wasn't relevant in a past review, that a specific pattern was flagged and the author chose to keep it. Reviews get less noisy over time.
- Chat bot. Ingests + lurks message history. Answers future questions based on past conversations.
Exportability posture¶
"Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to."
Explicitly framed as reducing vendor lock-in for accumulated institutional knowledge.
Product iteration method¶
Agent-driven benchmark loop (patterns/agent-driven-benchmark-loop):
- Run benchmarks (LongMemEval, LoCoMo, BEAM — intentionally multiple to avoid overfitting).
- Analyse gaps.
- Agent proposes solutions.
- Human reviews proposals to select strategies that generalise rather than overfit.
- Agent makes the changes.
- Repeat.
Stochasticity (even at temperature 0) required multi-run averaging + trend analysis alongside raw scores. Explicitly guarded against benchmark-specific overfitting ("build systems that overfit for a specific evaluation and break down in production" — the stated failure mode the method avoids).
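The multi-run averaging plus trend analysis can be sketched as: compare per-iteration means rather than single scores, and fit a slope over successive iterations to separate noise from genuine movement (a minimal sketch, not the team's actual tooling):

```typescript
// Benchmark runs are stochastic even at temperature 0, so a single raw
// score is noise; compare means across runs and a trend across iterations.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Least-squares slope over successive iteration means:
// positive = improving, near zero = noise.
function trend(iterationMeans: number[]): number {
  const n = iterationMeans.length;
  const xMean = (n - 1) / 2;
  const yMean = mean(iterationMeans);
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (iterationMeans[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den;
}
```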
Seen in¶
- sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory — launch + architecture post.
Related¶
- companies/cloudflare
- systems/cloudflare-durable-objects — per-profile DO instance; the storage substrate.
- systems/cloudflare-vectorize — per-profile vector index.
- systems/workers-ai — LLM layer (Llama 4 Scout + Nemotron 3).
- systems/cloudflare-workers — coordinator Worker + `env.MEMORY` binding.
- systems/cloudflare-r2 — future snapshot / export tier.
- systems/cloudflare-ai-search — sibling primitive for files, positioned as complementary.
- systems/cloudflare-agents-sdk — reference implementation for the Sessions API memory portion.
- concepts/agent-memory — the broader concept; Agent Memory is a canonical managed realisation.
- concepts/context-rot — the forcing function named in the launch post.
- concepts/memory-supersession — facts + instructions versioned by topic key with forward pointers.
- concepts/memory-compaction — the lifecycle moment Agent Memory hooks into.
- concepts/content-addressed-id — SHA-256-based message + memory IDs enabling idempotent re-ingestion.
- concepts/hyde-embedding — one of the five retrieval channels.
- concepts/reciprocal-rank-fusion — the multi-channel fusion algorithm.
- concepts/hybrid-retrieval-bm25-vectors — generalisation of FTS + vector; Agent Memory extends to 5 channels.
- concepts/session-affinity-prompt-caching — reused with memory-profile-name key.
- concepts/one-to-one-agent-instance — storage-tier realisation.
- patterns/constrained-memory-api — six-operation API as the canonical constrained-memory instance.
- patterns/multi-stage-extraction-pipeline — extract → verify → classify → store → async-vectorise.
- patterns/parallel-retrieval-fusion — five channels in parallel + RRF.
- patterns/agent-first-storage-primitive — per-profile isolation + agent-ergonomic API + scale-to-zero posture.
- patterns/agent-driven-benchmark-loop — the iteration method used to ship.
- patterns/tool-surface-minimization — applied to the memory tier.