CONCEPT Cited by 1 source
EAV schema for LLM consumption¶
Definition¶
An Entity-Attribute-Value (EAV) schema stores heterogeneous
attributes of entities as (entity_id, attribute_name, value,
…) rows — one row per attribute per entity — rather than
one row per entity with a wide typed schema. The EAV shape
sacrifices read-time type safety and schema rigidity in
exchange for ingestion-pipeline isolation (adding a new
attribute requires no DDL) and storage flexibility (no
sparse-column waste).
EAV is usually treated as an anti-pattern in OLTP + OLAP contexts — type erasure at read time makes joins ugly, hurts the query planner, and trades compile-time errors for run-time surprises. But the calculus changes when the downstream consumer is an LLM:
"Sacrificing strict schema was acceptable: Because the data is consumed by an LLM that already expects unstructured strings. EAV was practical for our use case with frequently changing fields: A
(business_id, field_name, field_group, value, updated_at)table lets us add new attributes without schema changes and keeps ingestion jobs isolated." (Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product)
The LLM-downstream consumer permissions the EAV choice: the anti-pattern argument loses its force when type coercion no longer matters downstream.
Yelp's canonical schema¶
table: business_structured_facts
columns:
  business_id  -- partition key
  field_name   -- clustering key (e.g. "hours_tuesday", "vegan_options", "wifi")
  field_group  -- clustering key (taxonomy grouping)
  value        -- string (LLM-consumable, not typed)
  update_ts    -- recency + idempotency
All four clustering / data columns are LLM-friendly: value is a string the LLM reads directly; field_name + field_group provide enough context for the LLM to disambiguate (hours_tuesday vs wifi); update_ts enables LWW (last-write-wins) + replay idempotency.
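The LWW + replay-idempotency role of update_ts can be made concrete with a runnable sketch. Yelp's table lives in Cassandra; SQLite is used here purely as a stand-in so the shape executes, and the conditional-upsert write path is an assumption about how update_ts is used, not Yelp's actual ingestion code:

```python
import sqlite3

# Stand-in for the Cassandra table (column names follow the schema above).
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE business_structured_facts (
        business_id TEXT NOT NULL,
        field_name  TEXT NOT NULL,
        field_group TEXT NOT NULL,
        value       TEXT NOT NULL,   -- always a string: LLM-consumable
        update_ts   INTEGER NOT NULL,
        PRIMARY KEY (business_id, field_name)
    )
""")

def upsert_fact(business_id, field_name, field_group, value, update_ts):
    """Last-write-wins upsert: replaying a stale event is a no-op."""
    db.execute(
        """INSERT INTO business_structured_facts VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (business_id, field_name)
           DO UPDATE SET field_group = excluded.field_group,
                         value       = excluded.value,
                         update_ts   = excluded.update_ts
           WHERE excluded.update_ts > business_structured_facts.update_ts""",
        (business_id, field_name, field_group, value, update_ts),
    )

upsert_fact("biz-1", "hours_tuesday", "hours", "9am-10pm", 100)
upsert_fact("biz-1", "hours_tuesday", "hours", "9am-11pm", 200)  # newer: wins
upsert_fact("biz-1", "hours_tuesday", "hours", "8am-5pm", 150)   # stale replay: ignored

row = db.execute(
    "SELECT value FROM business_structured_facts"
    " WHERE business_id = ? AND field_name = ?",
    ("biz-1", "hours_tuesday"),
).fetchone()
```

Adding a new attribute is just another upsert_fact call with a new field_name: no DDL, which is the ingestion-pipeline-isolation property the definition names.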
Why LLM consumers permission EAV¶
Three forces align:
- LLMs don't need typed values. Prices, hours, boolean amenities all flow to the LLM as strings; the LLM tokenises and reasons over strings directly.
- LLM prompt composition tolerates heterogeneity. Missing attributes, new attributes, and renamed attributes all surface harmlessly at prompt-assembly time (the caller just emits whatever rows exist). There's no compile-time error waiting for a DDL migration.
- Ingestion-pipeline isolation is a scaling multiplier. New attributes don't require a migration → new attributes don't require cross-team coordination → attribute coverage can grow continuously as upstream pipelines add fields.
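The second force — prompt assembly that tolerates heterogeneity — can be sketched as a loop that serialises whatever rows exist. The function name and output format here are hypothetical, not Yelp's; the point is that missing, new, or renamed attributes need no code change:

```python
def facts_to_prompt(business_id, rows):
    """rows: iterable of (field_group, field_name, value) for one business."""
    lines = [f"Known facts about business {business_id}:"]
    # Emit whatever rows exist -- no per-attribute code, no schema to consult.
    for field_group, field_name, value in sorted(rows):
        lines.append(f"- [{field_group}] {field_name}: {value}")
    return "\n".join(lines)

rows = [
    ("hours", "hours_tuesday", "9am-11pm"),
    ("amenities", "wifi", "free"),
    ("amenities", "vegan_options", "yes"),
    # a brand-new attribute flows through with zero schema or code changes:
    ("amenities", "dog_friendly", "patio only"),
]
prompt = facts_to_prompt("biz-1", rows)
```

A wide typed schema would have surfaced dog_friendly as a migration plus a code change; here it is just one more row the LLM reads as a string.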
Compared to adjacent stances¶
- Typed wide schema — classical OLTP. Adds DDL burden but gives read-time type safety + planner optimisation. Appropriate when consumers are other systems that need types.
- Document-per-entity JSON — a sibling schema-flexible shape. JSON documents work well when the entity has a deep nested structure; EAV works better when attributes are mostly flat + frequently new + per-attribute update-cadences differ.
- Schema-on-read data lake — similar philosophy applied to analytics; the difference is that data lakes require Spark / Presto / etc. to coerce types at query time, whereas EAV-for-LLM just reads strings straight through.
Tradeoffs / gotchas¶
- Query planners struggle. Queries like "all businesses with vegan options that are open after 10pm on Tuesday" require multi-row joins on attribute rows, not a single wide-row read. The EAV shape is not intended for ad-hoc analytical queries — it's intended for per-business LLM consumption.
- Read path has to filter by field_name. Unless the caller already knows which attributes to fetch, a scan-all-rows-for-this-business query becomes the read pattern. (Yelp's content-fetching engine compensates by having the caller specify a subset of sources / attributes.)
- No enforcement of valid attribute values. A value field with "maybe open — check website" is a valid string; an LLM handles it, but downstream non-LLM consumers would fail. If a non-LLM consumer emerges, either a schema layer has to be added or that consumer has to tolerate strings.
- Attribute taxonomy drift. Without DDL-level enforcement, field_name = "opens_tues" vs "hours_tuesday" can both exist from different producers. A soft taxonomy (prefix conventions, ownership lists) is the mitigation.
- Secondary indexes are expensive in Cassandra EAV. Cassandra's secondary-index cost model disfavours per-attribute indexing; the read pattern has to key off the partition (business_id) or accept secondary-index cost.
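The soft-taxonomy mitigation for drift can be sketched as an ingest-time guard. The prefix list and team names below are hypothetical illustrations of "prefix conventions, ownership lists", not Yelp's actual taxonomy:

```python
# Hypothetical ownership list: field_name must start with a registered
# prefix owned by some team. This flags drift like "opens_tues" vs
# "hours_tuesday" at ingest time without any DDL-level enforcement.
REGISTERED_PREFIXES = {
    "hours_": "core-data-team",
    "wifi":   "amenities-team",
    "vegan_": "amenities-team",
}

def check_field_name(field_name):
    """Return the owning team, or None if the name falls outside the taxonomy."""
    for prefix, owner in REGISTERED_PREFIXES.items():
        if field_name.startswith(prefix):
            return owner
    return None

owner = check_field_name("hours_tuesday")  # registered prefix: accepted
drift = check_field_name("opens_tues")     # unregistered: flag, don't store
```

Because the check is soft, a producer can still be onboarded by registering a prefix rather than by running a migration, which preserves the ingestion-isolation property.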
Caveat: not a generalised endorsement of EAV¶
EAV remains an anti-pattern when:
- The downstream consumer is not an LLM / string-tolerant.
- Ad-hoc analytics on the attribute values is a primary workload.
- The attribute set is stable + narrow (wide schema wins on simplicity).
Yelp's EAV choice is load-bearing because the LLM-consumer permission fired and the attribute set drifts continuously — both conditions must hold for the calculus to reverse.
Seen in¶
- sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product — canonical wiki instance. Yelp's BAA stores business structured information in Cassandra with an EAV schema "since the data is consumed by an LLM that already expects unstructured strings".
Related¶
- concepts/data-model-mismatch — the broader class of schema-vs-consumer-mismatch problems EAV is one answer to.
- concepts/schema-evolution — adjacent — EAV trades schema evolution for attribute-taxonomy drift.
- systems/apache-cassandra — Yelp's storage engine for the EAV table.
- systems/yelp-content-fetching-engine — reads the EAV rows and translates them into LLM-consumable shape.
- systems/yelp-biz-ask-anything — the canonical consumer.
- companies/yelp