Skip to content

CONCEPT Cited by 1 source

Content-addressed ID

Definition

A content-addressed ID is an identifier derived deterministically from the content of a record via a cryptographic hash, so that two records with identical content necessarily share the same ID. Storage writes become idempotent by construction: INSERT OR IGNORE silently skips duplicates; there is no separate dedup pass, no duplicate-detection bookkeeping, no risk of a race creating two rows for the same logical item.

The canonical hash choice is a truncated cryptographic hash (SHA-256, BLAKE3, etc.); truncation to 128 bits is sufficient for deduplication at realistic scales (collision probability far below real-world cardinality).

The pattern

id = HASH(stable-identity-fields)[:N bits]
writes = INSERT OR IGNORE (id, ...)

The load-bearing property: the writer doesn't coordinate with any other writer to avoid duplicates. Two parallel paths can both race to store the same message; SQLite / the storage engine sees the same id, takes the first, ignores the second.

Canonical wiki instance: Cloudflare Agent Memory

Agent Memory assigns every ingested message a content-addressed ID:

id = SHA-256(sessionId + role + content)[:128 bits]

"If the same conversation is ingested twice, every message resolves to the same ID, making re-ingestion idempotent."

— (Cloudflare, 2026-04-17)

At storage time, "everything is written to storage using INSERT OR IGNORE so that content-addressed duplicates are silently skipped."

The input fields are deliberate: - sessionId — scopes the same text spoken in two different sessions to different IDs (correct — they're different instances of the same sentence). - role — distinguishes "user: yes" from "assistant: yes". - content — the actual material.

Stripping any of these would either produce over-aggressive deduplication (two users saying "yes" collapsing to one row) or pointless differentiation (same message ingested twice producing two rows). The three-tuple is the minimum stable identity.

Why a truncated hash?

  • Collisions are still astronomically unlikely at 128 bits. For N = 10^15 records, collision probability ≈ N² / 2^129 ≈ 10^−9 — far below real-world concerns for per-profile memory stores.
  • Storage savings vs full 256-bit IDs. Each ID is half the bytes; indexes are narrower; cache footprint smaller.
  • Faster equality checks. Often fits in two machine words.

Content addressing vs sequence-number IDs

Property Content-addressed Sequence number
Dedup cost zero (constraint-based) separate dedup query per write
Distributed-write coordination none needed coordinator / counter
Idempotent retry yes no (retry = new ID)
Human-readable no (opaque hex) yes (monotonic integer)
Sortable by creation time no yes

Content-addressed IDs excel for idempotent ingest paths where the same logical item may be re-presented by different paths (harness retry, client-side crash + replay, two parallel workers consuming the same bus message).

  • concepts/content-addressed-caching — content-addressed cache keys so identical binary content serves one cache hit across independent producers (build systems, container registries).
  • concepts/immutable-object-storage — objects keyed by hash so a key change implies a content change.
  • Git's object model — files + trees + commits keyed by content hash; the origin idiom that popularised content addressing in software engineering.
  • Build-system reproducibility — Bazel / Nix content-addressed outputs.

Agent Memory is the canonical wiki instance of the idiom applied at the individual message / memory granularity for idempotent agent-loop ingestion.

Seen in

Last updated · 200 distilled / 1,178 read