
Cloudflare Agent Memory

Overview

Agent Memory is Cloudflare's managed memory service for AI agents, announced in private beta on 2026-04-17 (blog.cloudflare.com/introducing-agent-memory). It extracts information from agent conversations, stores it as classified memories outside the context window, and retrieves it via a multi-channel fusion pipeline — so agents "recall what matters, forget what doesn't, and get smarter over time" without filling up context.

Architecturally, Agent Memory is a Cloudflare Worker coordinating:

  • a per-profile Durable Object (raw messages + classified memories in SQLite + FTS + supersession chains)
  • a per-profile Vectorize index
  • Workers AI models (Llama 4 Scout for extraction / verification / classification / query analysis; Nemotron 3 for synthesis)
  • (future) R2 for snapshots + exports

Position in the Cloudflare agent stack

Agent Memory is explicitly distinguished from AI Search:

"While search is a component of memory, agent search and agent memory solve distinct problems. AI Search is our primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions."

— (Cloudflare, 2026-04-17)

It is the fifth substrate of agent memory in the 2026-04 Cloudflare arc:

  1. Project Think Persistent Sessions — tree-structured conversation memory (episodic).
  2. AI Search per-customer instances — semantic memory (knowledge base).
  3. Artifacts repo-per-session — filesystem + session-history memory.
  4. Email Service thread + DO state — email-channel memory.
  5. Agent Memory — session-conversation-derived recall.

Six-operation API surface

The surface is deliberately narrow — the model never burns context on storage-strategy reasoning (the tool-surface-minimization posture applied to the memory tier).

| Operation | Caller | Purpose |
|---|---|---|
| getProfile(name) | harness | returns an isolated memory store addressed by name |
| ingest(messages, {sessionId}) | harness | bulk path at compaction: extract + classify + store |
| remember({content, sessionId}) | model | direct tool: store a single memory on the spot |
| recall(query) | model | run retrieval pipeline, return synthesised natural-language answer |
| list() | model | enumerate stored memories |
| forget(memoryId) | model | mark as no-longer-important/true |

Exposed both as a Workers binding (env.MEMORY.getProfile(...)) and a REST API for agents running outside Workers. Integrated into the Agents SDK as the reference implementation for the Sessions API memory portion.
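
A minimal in-memory mock of that surface, to make the shapes concrete (types and semantics are assumptions inferred from the table above; the real service backs this with a Durable Object and Vectorize, not a Map):

```typescript
// Hypothetical in-memory mock of part of the six-operation surface.
// Shapes are assumptions from the doc's table, not the real binding types.
type StoredMemory = { memoryId: string; content: string; forgotten: boolean };

class MockMemoryProfile {
  private memories = new Map<string, StoredMemory>();
  private nextId = 0;

  // model-facing: store a single memory on the spot
  remember({ content }: { content: string; sessionId?: string }): string {
    const memoryId = `mem-${this.nextId++}`;
    this.memories.set(memoryId, { memoryId, content, forgotten: false });
    return memoryId;
  }

  // model-facing: enumerate stored, still-relevant memories
  list(): StoredMemory[] {
    return [...this.memories.values()].filter((m) => !m.forgotten);
  }

  // model-facing: mark as no-longer-important/true (a flag, not a delete)
  forget(memoryId: string): void {
    const m = this.memories.get(memoryId);
    if (m) m.forgotten = true;
  }
}
```

Note that forget is a marker rather than a delete, matching the table's "mark as no-longer-important/true" semantics.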

Memory taxonomy

Each extracted memory is classified into one of four types:

| Type | Definition | Keyed? | Vector-indexed? | Lifecycle |
|---|---|---|---|---|
| Fact | atomic, stable knowledge ("the project uses GraphQL") | yes (normalised topic key) | yes | superseded on new fact with same key |
| Event | what happened at a specific time (deployment, decision) | no | yes | timestamped, retained |
| Instruction | procedure / workflow / runbook | yes (topic key) | yes | superseded on update |
| Task | what is being worked on right now | no | no (FTS-only) | ephemeral by design |

Facts and instructions supersede old values by the same topic key — old → new forward pointer (version chain), not delete. Vectors for superseded memories are deleted in parallel with new upserts. Tasks are excluded from the vector index to keep it lean but remain FTS-discoverable.
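
The supersession chain can be sketched like this (the row shape and method names are assumptions; the real store is SQLite inside the Durable Object):

```typescript
// Version-chain supersession: a new value for the same topic key sets an
// old -> new forward pointer on the previous row instead of deleting it.
type FactRow = { id: number; key: string; value: string; supersededBy: number | null };

class FactStore {
  private rows: FactRow[] = [];
  private nextId = 1;

  upsert(key: string, value: string): FactRow {
    const current = this.rows.find((r) => r.key === key && r.supersededBy === null);
    const row: FactRow = { id: this.nextId++, key, value, supersededBy: null };
    this.rows.push(row);
    if (current) current.supersededBy = row.id; // forward pointer, not a delete
    return row;
  }

  // the live value is the chain's unsuperseded tail
  latest(key: string): FactRow | undefined {
    return this.rows.find((r) => r.key === key && r.supersededBy === null);
  }

  // the full version history stays queryable
  history(key: string): FactRow[] {
    return this.rows.filter((r) => r.key === key);
  }
}
```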

Ingestion pipeline

Multi-stage (see patterns/multi-stage-extraction-pipeline):

  1. Content-addressed ID per message: SHA-256(sessionId + role + content)[:128 bits]. Re-ingesting the same conversation is a no-op via INSERT OR IGNORE (concepts/content-addressed-id).
  2. Extractor — two passes in parallel, whose result sets are then merged:
     • Full pass — chunks at ~10K characters with 2-message overlap, up to 4 chunks concurrent. Transcripts include role labels, relative dates resolved to absolutes ("yesterday" → 2026-04-14), and line indices for source provenance.
     • Detail pass (for conversations ≥ 9 messages) — overlapping windows focused on concrete values — names, prices, version numbers, entity attributes — that broad extraction generalises away.
  3. Verifier — 8 checks per memory against the source transcript: entity identity, object identity, location context, temporal accuracy, organisational context, completeness, relational context, and whether inferred facts are supported by the conversation. Items pass, are corrected, or are dropped.
  4. Classifier — assigns type (fact / event / instruction / task) + a normalised topic key (facts + instructions) + 3-5 search queries used later for vectorisation.
  5. Storage — INSERT OR IGNORE (content-addressed dedup); supersession chain for facts + instructions.
  6. Response returned to harness — before vectorisation completes.
  7. Async vectorisation (background) — embeds the memory with 3-5 classifier-generated search queries prepended to the memory content before the embedding call. This bridges the declarative-write / interrogative-read gap. Superseded-memory vectors are deleted in parallel with new upserts.
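
Step 1's content-addressed ID can be sketched as follows (using Node-style crypto for illustration; a Workers runtime would typically use Web Crypto):

```typescript
import { createHash } from "node:crypto";

// Content-addressed message ID: SHA-256 over sessionId + role + content,
// truncated to 128 bits (32 hex chars). The same message always hashes to
// the same ID, so INSERT OR IGNORE makes re-ingestion a no-op.
function messageId(sessionId: string, role: string, content: string): string {
  return createHash("sha256")
    .update(sessionId + role + content)
    .digest("hex")
    .slice(0, 32);
}
```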

Retrieval pipeline

Five parallel channels fused with RRF (see patterns/parallel-retrieval-fusion):

| Channel | What it does | RRF weight |
|---|---|---|
| FTS with Porter stemming | keyword precision for queries where the exact term matters but the context doesn't | medium |
| Exact fact-key lookup | direct topic-key match | highest (strongest signal) |
| Raw-message FTS | searches stored conversation messages directly — safety net for verbatim detail the extractor generalised | lowest (safety net) |
| Direct vector search | embed query → ANN over memory vectors | medium |
| HyDE vector search | embed a declarative-answer-shaped synthesis of the query → ANN | medium (often catches what direct embedding misses on abstract / multi-hop queries) |
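
Weighted RRF over these channels might look like the sketch below (the damping constant k and the numeric weights are assumptions; only the relative ordering, fact-key highest and raw-message lowest, comes from the table):

```typescript
// Weighted Reciprocal Rank Fusion: each channel contributes
// weight / (k + rank) for every candidate it returns; fused score decides.
type Channel = { results: string[]; weight: number }; // results = ids, best first

function rrfFuse(channels: Channel[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const { results, weight } of channels) {
    results.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Candidates that surface in several channels accumulate score, which is why a mid-ranked hit in two channels can beat the top hit of one.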

Stage 1 (query analysis + embedding) is concurrent. Stage 2 runs the five channels in parallel. Stage 3 does RRF with the above weights; ties broken by recency. Top candidates pass to a synthesis model; temporal computations are pre-computed deterministically via regex + arithmetic and injected into the synthesis prompt as facts — models are unreliable at date math, so the service doesn't ask.
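
The deterministic temporal pre-computation can be as simple as regex plus date arithmetic (the pattern and output format here are illustrative assumptions):

```typescript
// Resolve relative dates ("3 days ago") to absolute ISO dates before the
// synthesis prompt is built, so the model never does date math itself.
function resolveRelativeDates(text: string, now: Date): string {
  return text.replace(/(\d+) days? ago/g, (_match, n) => {
    const d = new Date(now.getTime() - Number(n) * 86_400_000);
    return d.toISOString().slice(0, 10); // YYYY-MM-DD
  });
}
```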

Isolation and scaling

One Durable Object instance per memory profile, SQLite-backed, addressed via getByName() so any request globally reaches the right profile:

"DO's getByName() addressing means any request, from anywhere, can reach the right memory profile by name, and ensures that sensitive memories are strongly isolated from other tenants."
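
The addressing semantics reduce to: same name, same isolated instance. A runnable sketch, with a plain Map standing in for a Durable Object:

```typescript
// Name-addressed isolation: every profile name maps to exactly one store,
// created on first use. Durable Objects give this guarantee globally;
// here a local Map stands in for the runtime.
class ProfileRegistry {
  private instances = new Map<string, Map<string, string>>();

  getByName(name: string): Map<string, string> {
    let store = this.instances.get(name);
    if (!store) {
      store = new Map();
      this.instances.set(name, store);
    }
    return store;
  }
}
```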

Storage stratifies:

  • DOs (SQLite) — messages + memories + FTS indexing + supersession chains + transactional writes.
  • Vectorize — vectors over embedded memories (with classifier queries prepended to the text before embedding).
  • R2 (future) — snapshots + exports.

All AI calls carry an x-session-affinity header routed to the memory profile name, so repeated requests hit the same backend for prompt-caching benefits — same pattern as Workers AI's session-affinity-prompt-caching, keyed here on memory-profile rather than user-session.
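
The affinity keying reduces to deterministic routing: hash the x-session-affinity value to a backend index, so identical profile names reuse the same warm prompt cache (the hash function and backend count below are illustrative assumptions):

```typescript
// Stable affinity routing: the same x-session-affinity value (here, the
// memory-profile name) always lands on the same backend.
function pickBackend(affinity: string, backendCount: number): number {
  let h = 0;
  for (const ch of affinity) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % backendCount;
}
```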

Canonical storage-tier realisation of concepts/one-to-one-agent-instance — same per-instance economics shown across the 2026-04 Cloudflare arc (Project Think DOs, AI Search instances, Artifacts repos, Email per-address DOs, now Agent Memory profiles).

Model selection

| Stage | Model | Rationale |
|---|---|---|
| extraction / verification / classification / query analysis | Llama 4 Scout (17B, 16-expert MoE) | structured-classification sweet spot; cost/quality/latency tradeoff |
| synthesis | Nemotron 3 (120B MoE, 12B active) | larger reasoning capacity improves natural-language answer quality |

"The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency."

Model choice here is not monotonic with parameter count — a deliberate counter-thesis to "bigger is always better".

Internal dogfood workloads

  1. Coding-agent memory via an internal OpenCode plugin. Memory across compaction + across sessions. Less-obvious benefit: shared profile across a team — the agent knows what other teammates' agents have already learned; stops asking answered questions; stops repeating corrected mistakes.
  2. Agentic code reviewer. "Arguably the most useful thing it learned to do was stay quiet." Remembers that a particular comment wasn't relevant in a past review, that a specific pattern was flagged and the author chose to keep it. Reviews get less noisy over time.
  3. Chat bot. Lurks in channels, ingesting message history, and answers future questions based on past conversations.

Exportability posture

"Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to."

Explicitly framed as reducing vendor lock-in for accumulated institutional knowledge.

Product iteration method

Agent-driven benchmark loop (patterns/agent-driven-benchmark-loop):

  1. Run benchmarks (LongMemEval, LoCoMo, BEAM — intentionally multiple to avoid overfitting).
  2. Analyse gaps.
  3. Agent proposes solutions.
  4. Human reviews proposals to select strategies that generalise rather than overfit.
  5. Agent makes the changes.
  6. Repeat.

Stochasticity (even at temperature 0) required multi-run averaging + trend analysis alongside raw scores. Explicitly guarded against benchmark-specific overfitting ("build systems that overfit for a specific evaluation and break down in production" — the stated failure mode the method avoids).
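
The multi-run scoring can be sketched as a mean plus a least-squares trend over run index (a minimal sketch; the actual analysis method is unspecified):

```typescript
// Benchmark runs are stochastic even at temperature 0, so report the mean
// across runs and a linear trend (least-squares slope over run index)
// rather than trusting any single raw score.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function trendSlope(xs: number[]): number {
  const n = xs.length;
  const mx = (n - 1) / 2; // mean of run indices 0..n-1
  const my = mean(xs);
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - mx) * (xs[i] - my);
    den += (i - mx) ** 2;
  }
  return num / den; // positive slope = scores improving run over run
}
```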
