Cloudflare — Agents that remember: introducing Agent Memory¶
Summary¶
Cloudflare's 2026-04-17 post launches Agent Memory (private beta) — an opinionated managed service that extracts information from agent conversations, stores it as classified memories outside the context window, and retrieves it via a multi-channel fusion pipeline. Architecturally: a Cloudflare Worker coordinates a per-profile Durable Object (raw messages + classified memories in SQLite + FTS), a per-profile Vectorize index, and Workers AI models (Llama 4 Scout for extraction / verification / classification / query analysis; Nemotron 3 for synthesis). The service exposes a deliberately narrow API — getProfile / ingest / remember / recall / list / forget — so the model never burns context on storage-strategy reasoning. Extraction classifies memories into facts / events / instructions / tasks; facts and instructions are keyed and superseded (old → new pointer, vectors re-upserted in parallel); tasks are FTS-only to keep the vector index lean. Retrieval fuses five channels (full-text with Porter stemming, exact fact-key lookup, raw-message FTS, direct-query vector search, HyDE vector search) via RRF with channel-specific weights. Every memory is content-addressed (SHA-256 of sessionId + role + content, truncated to 128 bits) so re-ingestion is idempotent via INSERT OR IGNORE. Post-API-response background vectorization prepends the 3-5 classifier-generated search queries to the embedded text, bridging declarative writes ("user prefers dark mode") with interrogative reads ("what theme does the user want?"). Productised internally on three workloads — a coding-agent plugin for OpenCode, an agentic code reviewer ("the most useful thing it learned to do was stay quiet"), and a message-history chat bot. Every memory is exportable; Cloudflare commits explicitly to making leaving easy as a trust-earning posture.
Key takeaways¶
-
Agent Memory is a retrieval-based managed primitive, not a filesystem exposed to the model. Cloudflare's explicit architectural choice: "Tighter ingestion and retrieval pipelines are superior to giving agents raw filesystem access. In addition to improved cost and performance, they provide a better foundation for complex reasoning tasks required in production, like temporal logic, supersession, and instruction following." The opposing design — giving the model raw DB / filesystem access and letting it design its own queries — is rejected as "burning tokens on storage and retrieval strategy instead of the actual task." Canonical wiki instance of patterns/constrained-memory-api + patterns/tool-surface-minimization applied to the memory tier. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Context rot is named explicitly as the forcing function. "Even as context window sizes grow past one million (1M) tokens, context rot remains an unsolved problem. A natural tension emerges between two bad options: keep everything in context and watch quality degrade, or aggressively prune and risk losing information the agent needs later." Memory preservation at compaction is the third option: don't discard, don't keep in window — extract, classify, store, retrieve on demand. Same forcing function previously cited by Dropbox Dash. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Six-operation API deliberately narrow.
getProfile(name) returns an isolated memory store. ingest(messages, {sessionId}) is the bulk path called at compaction. remember({content, sessionId}) is the direct model tool for on-the-spot storage. recall(query) runs the full retrieval pipeline and returns a synthesized natural-language answer. list() enumerates stored memories. forget(memoryId) marks memories as no-longer-important. "The tool surface it sees is deliberately constrained so that memory stays out of the way of the actual task." (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Content-addressed IDs make re-ingestion idempotent. Each message gets id = SHA-256(sessionId + role + content)[:128 bits]. Storage writes use INSERT OR IGNORE, so re-ingesting the same conversation is a no-op. Eliminates the deduplication-bookkeeping code path entirely. Canonical wiki instance of concepts/content-addressed-id applied to agent memories. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Extraction is two parallel passes with overlap. The full pass chunks at ~10K characters with 2-message overlap, up to 4 chunks concurrent. For conversations ≥ 9 messages a detail pass runs alongside, using overlapping windows to pull concrete values (names, prices, version numbers, entity attributes) that broad extraction generalises away. Results merge. Transcripts are structured — role labels, relative dates resolved to absolutes ("yesterday" → 2026-04-14), line indices for source provenance. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Verification is an 8-check filter between extraction and storage. Each extracted memory is checked against the transcript for entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Items are passed, corrected, or dropped. Pure LLM-in-the-loop verification applied at the memory-atom level, not at coarse aggregate output. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Four memory types with different lifecycles. Facts = atomic, stable knowledge ("the project uses GraphQL"). Events = what happened at a specific time (deployment, decision). Instructions = procedures, workflows, runbooks. Tasks = what is being worked on now, ephemeral by design. Facts and instructions are keyed by a normalized topic key; a new memory with the same key supersedes the old one via a forward pointer (version chain, not delete). Tasks are excluded from the vector index to keep it lean — FTS-searchable only. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Retrieval fuses five channels in parallel. Stage 1: query analysis + embedding concurrent. Analyzer emits ranked topic keys + FTS terms with synonyms + HyDE (declarative answer-shaped statement). Stage 2: five channels in parallel — (a) FTS with Porter stemming, (b) exact fact-key lookup, (c) raw-message FTS (safety net for verbatim detail the extraction generalised), (d) direct query vector search, (e) HyDE vector search (finds answer-shaped matches missed by direct embedding, especially for abstract / multi-hop). Stage 3: RRF with channel-specific weights (fact-key highest because exact-topic match is the strongest signal; raw-message lowest as safety net). Ties broken by recency. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Declarative writes, interrogative reads — bridged by prepending classifier queries to the embedding text. Memories are written as declarative facts ("user prefers dark mode") but searched interrogatively ("what theme does the user want?"). The vectorization step prepends the 3-5 search queries generated during classification to the memory content itself before embedding. Background vectorization runs asynchronously post-response; superseded-memory vectors are deleted in parallel with new upserts. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Temporal computation is deterministic, not LLM'd. "Temporal computation is handled deterministically via regex and arithmetic, not by the LLM. The results are injected into the synthesis prompt as pre-computed facts. Models are unreliable at things like date math, so we don't ask them to do it." Canonical wiki instance of the posture "use the LLM for what it's good at, and nothing else" applied at retrieval synthesis. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Per-profile isolation via DO getByName(). Each memory profile maps to its own Durable Object instance (SQLite-backed) and its own Vectorize index. "Sensitive memories are strongly isolated from other tenants"; "DO's getByName() addressing means any request, from anywhere, can reach the right memory profile by name". Canonical storage-tier realisation of concepts/one-to-one-agent-instance extended to the memory tier — same per-profile-instance economics shown for Project Think / AI Search / Artifacts / Email in the same 2026-04 Cloudflare arc. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Storage stratifies across the stack, each primitive purpose-built. Memory content in SQLite-backed DOs. Vectors in Vectorize. Future: snapshots + exports in R2. "Each primitive is purpose-built for its workload, we don't need to force everything into a single shape or database." DO handles FTS indexing, supersession chains, transactional writes. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Session affinity for prompt caching. "All AI calls pass a session affinity header routed to the memory profile name, so repeated requests hit the same backend for prompt caching benefits." Extends the session-affinity-prompt-caching pattern previously seen in the XL-LLM post — same wire primitive (x-session-affinity), now keyed on memory-profile rather than user-session. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Model selection is not monotonic with parameter count. Default: Llama 4 Scout (17B params, 16-expert MoE) for extraction / verification / classification / query analysis; Nemotron 3 (120B MoE, 12B active) for synthesis. "Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers. The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency." Structured classification is Scout's sweet spot; reasoning-heavy synthesis wants Nemotron. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Product iteration via agent-driven benchmark loop. "We put it into an agent-driven loop and iterated. The cycle looked like this: run benchmarks, analyze where we had gaps, propose solutions, have a human review the proposals to select strategies that generalize rather than overfit, let the agent make the changes, repeat." Benchmark stack: LongMemEval, LoCoMo, BEAM. "LLMs are stochastic, even with temperature set to zero. This caused results to vary across runs, which meant we had to average multiple runs (time-consuming for large benchmarks) and rely on trend analysis alongside raw scores to understand what was actually working. Along the way we had to guard carefully against overfitting the benchmarks in ways that didn't genuinely make the product better for the general case." Canonical instance of patterns/agent-driven-benchmark-loop + structural-versus-overfitting discipline (sibling of concepts/benchmark-methodology-bias). (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Three internal dogfood workloads, specific in their learned behaviours. (1) Coding-agent memory via an internal OpenCode plugin — "the less obvious benefit has been shared memory across a team: with a shared profile, the agent knows what other members of your team have already learned, which means it can stop asking questions that have already been answered and stop making mistakes that have already been corrected". (2) Agentic code review — "arguably the most useful thing it learned to do was stay quiet. The reviewer now remembers that a particular comment wasn't relevant in a past review, that a specific pattern was flagged, and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter." (3) Chat-bot — ingests + lurks message history, answers from past conversations. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Exportability as vendor-lock-in reduction is a first-class commitment. "As agents become more capable and more deeply embedded in business processes, the memory they accumulate becomes genuinely valuable — not just as an operational state, but as institutional knowledge that took real work to build. We're hearing growing concern from customers about what it means to tie that asset to a single vendor, which is reasonable. The more an agent learns, the higher the switching cost if that memory can't move with it. Agent Memory is a managed service, but your data is yours. Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to." A rare explicit articulation of exportability as strategic posture. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
-
Memory and search are distinct primitives, designed to work together. "While search is a component of memory, agent search and agent memory solve distinct problems. AI Search is our primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions. An agent can use both, and they are designed to work together." Explicit positioning against the common industry shape of "just use a vector DB as memory" — AI Search's substrate is files, Agent Memory's substrate is conversations. (Source: sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory)
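The "deterministic, not LLM'd" temporal posture above reduces to regex plus date arithmetic. A minimal sketch, assuming invented rules — the function name and the set of supported phrases are illustrative, not Cloudflare's published logic:

```typescript
// Resolve a relative date expression against a reference "today", returning an
// absolute ISO date the retrieval pipeline can inject into the synthesis prompt
// as a pre-computed fact. Supported phrases here are illustrative only.
function resolveRelativeDate(phrase: string, today: Date): string | null {
  let offsetDays: number | null = null;
  if (/^yesterday$/i.test(phrase)) offsetDays = -1;
  else if (/^today$/i.test(phrase)) offsetDays = 0;
  else if (/^tomorrow$/i.test(phrase)) offsetDays = 1;
  else {
    const m = phrase.match(/^(\d+)\s+days?\s+ago$/i);
    if (m) offsetDays = -Number(m[1]);
  }
  if (offsetDays === null) return null; // not a recognised relative expression
  const d = new Date(today.getTime());
  d.setUTCDate(d.getUTCDate() + offsetDays);
  return d.toISOString().slice(0, 10); // e.g. "2026-04-14"
}
```

The point is the division of labour: the model never sees the arithmetic, only its result as a stated fact.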
Architecture¶
Three agent components¶
The post opens with an abstract model of any agent:
- Harness — drives repeated calls to a model, facilitates tool calls, manages state.
- Model — takes context and returns completions.
- State — current context window + additional information outside it (conversation history, files, databases, memory).
The critical moment is compaction: the harness decides to shorten context to stay within model limits or avoid context rot. Today most agents discard information permanently at compaction. Agent Memory is the substrate that makes compaction preserve rather than destroy.
Two integration points:
- Bulk ingestion at compaction. Harness ships the conversation to Agent Memory for extraction.
- Direct tool use by the model: recall / remember / forget / list as narrow tools.
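The two integration points could be wired roughly like this, assuming a hypothetical harness interface — wireMemory, onCompaction, and addTool are invented names; only the profile methods mirror the API the post documents:

```typescript
type Msg = { role: "user" | "assistant"; content: string };

// Shape of an Agent Memory profile handle, per the post's API surface.
interface MemoryProfile {
  ingest(messages: Msg[], opts: { sessionId: string }): Promise<void>;
  recall(query: string): Promise<{ result: string }>;
  remember(m: { content: string; sessionId: string }): Promise<unknown>;
}

// Hypothetical generic harness: a compaction hook plus a tool registry.
interface Harness {
  onCompaction(cb: (dropped: Msg[]) => Promise<void>): void;
  addTool(name: string, fn: (arg: string) => Promise<unknown>): void;
}

function wireMemory(harness: Harness, profile: MemoryProfile, sessionId: string): void {
  // 1) Bulk path: messages leaving the context window are ingested, not discarded.
  harness.onCompaction((dropped) => profile.ingest(dropped, { sessionId }));
  // 2) Narrow tool surface: the model sees only the memory verbs.
  harness.addTool("recall", (query) => profile.recall(query));
  harness.addTool("remember", (content) => profile.remember({ content, sessionId }));
}
```

Nothing about storage strategy reaches the model; the harness owns the bulk path and the model owns two small verbs.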
Ingestion pipeline (multi-stage)¶
conversation messages
│
▼
(1) content-addressed ID
id = SHA-256(sessionId + role + content)[:128 bits]
│
▼
(2) extractor — two passes in parallel
├── full pass: chunk at ~10K chars, 2-msg overlap, 4 chunks concurrent
│ structured transcript: role labels + relative→absolute dates + line indices
└── detail pass (≥9 msgs): overlapping windows for names/prices/versions
│
▼ merge two result sets
(3) verifier — 8 checks per memory (entity/object/location/temporal/
organizational/completeness/relational/supported-by-transcript)
pass, correct, or drop
│
▼
(4) classifier — one of 4 types
├── fact (atomic, stable, keyed, vector-indexed)
├── event (timestamped, vector-indexed)
├── instruction (procedure, keyed, vector-indexed)
└── task (ephemeral, FTS-only, NOT vector-indexed)
│
▼
(5) storage
INSERT OR IGNORE (content-addressed → dedup for free)
supersession chain for facts + instructions (old → new forward pointer)
│
▼
(6) return response to harness ─────────────┐
│ background
(7) async vectorization │ (non-blocking)
embed(prepend(3-5 classifier queries) ⊕ memory content)
upsert new vector; delete superseded-memory vector
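Steps (1) and (5) can be sketched in a few lines. The hash recipe (SHA-256 over sessionId + role + content, truncated to 128 bits) matches the post; the Map standing in for the Durable Object's SQLite table and the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

type Message = { sessionId: string; role: string; content: string };

// 128 bits = 32 hex characters of the SHA-256 digest.
function memoryId(m: Message): string {
  return createHash("sha256")
    .update(m.sessionId + m.role + m.content)
    .digest("hex")
    .slice(0, 32);
}

// Stand-in for SQLite's `INSERT OR IGNORE`: a duplicate key leaves the store unchanged.
const store = new Map<string, Message>();
function insertOrIgnore(m: Message): boolean {
  const id = memoryId(m);
  if (store.has(id)) return false; // already ingested: no-op
  store.set(id, m);
  return true;
}
```

Re-running ingestion over the same session needs no deduplication bookkeeping: identical (sessionId, role, content) triples collapse to the same id.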
Retrieval pipeline (parallel fusion)¶
query
│
┌─────────────┴─────────────┐
▼ ▼
query analysis embedding of raw query
├── ranked topic keys
├── FTS terms + synonyms
└── HyDE (answer-shaped statement)
│
┌───────┼────────────────────────────────┐
▼ ▼ ▼ ▼ ▼
FTS fact-key raw-msg vector HyDE
Porter exact FTS (direct) vector
│
▼
RRF fusion with channel weights
(fact-key highest, raw-msg lowest = safety net)
ties broken by recency
│
▼
top candidates → synthesis model
(temporal computations pre-computed deterministically via regex/arithmetic
and injected into synthesis prompt as facts)
│
▼
natural-language answer
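The fusion stage is standard Reciprocal Rank Fusion with per-channel weights. A minimal sketch — the numeric weights and k value below are invented, since the post gives only a qualitative ordering (fact-key highest, raw-message lowest), and recency tie-breaking is omitted:

```typescript
type RankedIds = string[]; // candidate memory IDs, best first

function rrfFuse(
  channels: Record<string, RankedIds>,
  weights: Record<string, number>,
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const [name, ids] of Object.entries(channels)) {
    const w = weights[name] ?? 1;
    ids.forEach((id, rank) => {
      // Standard RRF contribution 1/(k + rank), scaled by the channel weight.
      scores.set(id, (scores.get(id) ?? 0) + w / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const fused = rrfFuse(
  { fts: ["m1", "m2"], factKey: ["m3"], rawMsg: ["m4", "m1"], vector: ["m2", "m3"], hyde: ["m3", "m2"] },
  { factKey: 3, fts: 2, vector: 2, hyde: 2, rawMsg: 1 }
);
// An exact fact-key hit (m3) outscores items found only by broad similarity.
```

With these toy weights, m3 ranks first even though it appears in fewer channels than m2 — which is exactly the "exact-topic match is the strongest signal" behaviour the post describes.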
Platform substrate¶
Agent Memory = Cloudflare Worker ← coordinator
│
├── Durable Object (per memory profile)
│ ├── SQLite: raw messages + classified memories
│ ├── FTS indexing
│ ├── supersession chains
│ └── transactional writes
│
├── Vectorize index (per memory profile)
│ └── embedded memories (search-queries-prepended)
│
├── Workers AI models
│ ├── Llama 4 Scout (17B, 16-expert MoE)
│ │ extraction / verification / classification / query analysis
│ └── Nemotron 3 (120B MoE, 12B active)
│ synthesis
│ All AI calls carry x-session-affinity = memory profile name
│ → backend-stable routing → prompt-caching benefit
│
└── (future) R2: snapshots + exports
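The search-queries-prepended embedding input (ingestion step 7) reduces to string construction before the embed call. A sketch — the newline join format is an assumption; the post says only that the 3-5 classifier-generated queries are prepended to the memory content:

```typescript
// Build the text that actually gets embedded: interrogative queries first,
// declarative memory content last, so question-shaped recall queries land
// near statement-shaped memories in vector space.
function embeddingText(content: string, searchQueries: string[]): string {
  return [...searchQueries, content].join("\n");
}

const text = embeddingText("User prefers dark mode.", [
  "What theme does the user want?",
  "Does the user like dark mode or light mode?",
  "What are the user's UI preferences?",
]);
```

This is the whole declarative-write / interrogative-read bridge: no model call, just shaping the embedded string.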
Narrow API surface¶
// Workers binding
const profile = await env.MEMORY.getProfile("my-project");

// Bulk path at compaction
await profile.ingest(
  [
    { role: "user", content: "Set up the project with React and TypeScript." },
    { role: "assistant", content: "Done. Scaffolded a React + TS project targeting Workers." },
    { role: "user", content: "Use pnpm, not npm. And dark mode by default." },
    { role: "assistant", content: "Got it -- pnpm and dark mode as default." },
  ],
  { sessionId: "session-001" }
);

// Direct tool use by the model
const memory = await profile.remember({
  content: "API rate limit was increased to 10,000 req/s per zone after the April 10 incident.",
  sessionId: "session-001",
});

const results = await profile.recall("What package manager does the user prefer?");
console.log(results.result); // "The user prefers pnpm over npm."

// Also: await profile.list();
//       await profile.forget(memoryId);
Accessible via a Worker binding or a REST API for non-Workers agents (same pattern as other Developer Platform APIs).
Integrates with the Agents SDK as the reference implementation for the memory portion of the Sessions API — the harness's compaction handler + the model's recall/remember tools are pre-wired.
Operational numbers¶
- Message chunks: ~10K characters with 2-message overlap.
- Chunk concurrency: 4 chunks processed in parallel.
- Detail-pass trigger: conversations with ≥ 9 messages.
- Content-addressed ID width: 128 bits (truncated SHA-256 over sessionId + role + content).
- Classifier search queries prepended before embedding: 3-5.
- Memory types: 4 (facts, events, instructions, tasks).
- Vector index exclusion: tasks (FTS-only).
- Verification checks per memory: 8.
- Retrieval channels fused: 5 (FTS-Porter, fact-key, raw-msg, direct-vector, HyDE-vector).
- RRF weight ordering: fact-key > FTS / HyDE / direct-vector > raw-msg.
- Default extraction model: Llama 4 Scout (17B, 16-expert MoE).
- Default synthesis model: Nemotron 3 (120B MoE, 12B active params).
- Benchmarks used: LongMemEval, LoCoMo, BEAM (intentionally multiple, to guard against overfitting any single benchmark).
- Iteration speed: first prototype built in a weekend; productionised internal version in under a month.
- Internal dogfood workloads: 3 (coding-agent memory via OpenCode, agentic code reviewer, message-history chat bot) + several more planned.
Caveats & open questions¶
- Private beta — no public SLA, no general-availability pricing, no throughput / latency / accuracy numbers published (only directional claims like "benchmark scores improved consistently with each iteration").
- No absolute benchmark scores. Post is explicit that stochasticity + overfitting-avoidance meant they relied on trend analysis alongside raw scores; raw scores are not in the post.
- No embedder model named. The post specifies the LLMs for extraction / verification / classification / query analysis / synthesis but does not name the embedding model used for Vectorize upserts.
- Supersession-chain storage format unspecified. Post says "forward pointer from the old memory to the new memory" and "version chain" but does not detail row schema / tombstoning / retention.
- Topic-key normalisation algorithm unspecified. Facts and instructions are "keyed" with "a normalized topic key" but the normalisation rule is not published.
- RRF channel weights are described qualitatively, not numerically. "Fact-key matches get the highest weight" / "raw message matches are also included with low weight" — but no k value, no channel-weight table.
- Export format unspecified. "Every memory is exportable" — but the wire format, granularity (whole-profile / session-scoped / per-memory), and transport are not described.
- Per-profile Durable-Object + Vectorize-index cost model. Cloudflare's posture is that DO + Vectorize cost model supports one-per-profile economics; no cost-per-profile number disclosed.
- No failure / degradation semantics. What happens if Vectorize is unavailable at ingest time? Is ingest degraded to SQL-only and re-vectorized later, or does it fail? Not addressed.
- No privacy / PII handling story. "Sensitive memories are strongly isolated from other tenants" but no discussion of in-profile PII redaction, user-scoped subdivision inside a shared profile, or GDPR-style right-to-erasure semantics distinct from forget.
- No benchmark-against-external-competitors table. Competitors are named generically ("managed services that handle extraction and retrieval in the background" vs "self-hosted frameworks" vs "constrained purpose-built APIs" vs "raw filesystem") but the post doesn't compare numerically against Mem0 / Letta / Zep / MemGPT / Anthropic Managed Agents.
- Benchmark-overfitting risk acknowledged in principle, mitigation described procedurally. The "have a human review the proposals to select strategies that generalize rather than overfit" step is real but subjective; no held-out private eval is named.
Source¶
- Original: https://blog.cloudflare.com/introducing-agent-memory/
- Raw markdown: raw/cloudflare/2026-04-17-agents-that-remember-introducing-agent-memory-e889befc.md
Related¶
- sources/2026-04-16-cloudflare-ai-search-the-search-primitive-for-your-agents — AI Search is the files retrieval primitive; Agent Memory is the sessions-derived recall primitive; explicitly positioned as complementary. Shares the per-profile DO + Vectorize-index substrate + hybrid BM25/vector retrieval shape.
- sources/2026-04-15-cloudflare-project-think-building-the-next-generation-of-ai-agents — Project Think's Persistent Sessions is the episodic-memory sibling (tree-structured conversation history with FTS). Agent Memory is the cross-session / cross-harness equivalent.
- sources/2026-04-16-cloudflare-email-service-public-beta-ready-for-agents — fourth substrate of agent memory in the 2026-04 Cloudflare arc (email thread + DO-embedded state). Agent Memory is the fifth.
- sources/2026-04-16-cloudflare-artifacts-versioned-storage-that-speaks-git — Artifacts is the filesystem + session-history memory substrate; Agent Memory is the conversation-derived substrate. Both realise concepts/agent-first-storage-primitive.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — source of the x-session-affinity header pattern that Agent Memory reuses, keyed on memory-profile name rather than user session.
- sources/2025-11-17-dropbox-how-dash-uses-context-engineering-for-smarter-ai — shares the context-rot framing and the "store tool results locally, not in the context window" move that Agent Memory operationalises at the memory tier.
- systems/cloudflare-agent-memory
- concepts/agent-memory
- concepts/context-rot
- concepts/memory-supersession
- concepts/memory-compaction
- concepts/hyde-embedding
- concepts/content-addressed-id
- patterns/constrained-memory-api
- patterns/multi-stage-extraction-pipeline
- patterns/parallel-retrieval-fusion
- patterns/agent-driven-benchmark-loop
- companies/cloudflare