
EAV schema for LLM consumption

Definition

An Entity-Attribute-Value (EAV) schema stores heterogeneous attributes of entities as (entity_id, attribute_name, value, …) rows — one row per attribute per entity — rather than one row per entity with a wide typed schema. The EAV shape sacrifices read-time type safety and schema rigidity in exchange for ingestion-pipeline isolation (adding a new attribute requires no DDL) and storage flexibility (no sparse-column waste).

EAV is usually treated as an anti-pattern in OLTP + OLAP contexts — type erasure at read time makes joins ugly, hurts the query planner, and trades compile-time errors for run-time surprises. But the calculus changes when the downstream consumer is an LLM:

"Sacrificing strict schema was acceptable: Because the data is consumed by an LLM that already expects unstructured strings. EAV was practical for our use case with frequently changing fields: A (business_id, field_name, field_group, value, updated_at) table lets us add new attributes without schema changes and keeps ingestion jobs isolated." (Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product)

The LLM-downstream consumer permissions the EAV choice — the anti-pattern argument loses its force when type coercion is downstream-irrelevant.

Yelp's canonical schema

table: business_structured_facts
columns:
  business_id    -- partition key
  field_name     -- clustering key (e.g. "hours_tuesday",
                 --                   "vegan_options",
                 --                   "wifi")
  field_group    -- clustering key (taxonomy grouping)
  value          -- string (LLM-consumable, not typed)
  update_ts      -- recency + idempotency

All four clustering / data columns are LLM-friendly: value is a string the LLM reads directly; field_name + field_group provide enough context for the LLM to disambiguate (hours_tuesday vs wifi); update_ts enables last-write-wins (LWW) conflict resolution + replay idempotency.
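The read semantics above can be sketched in a few lines. This is a hypothetical helper, not Yelp's code: it collapses EAV rows for one business to one value per field_name using last-write-wins on update_ts, which is also what makes replayed ingestion rows idempotent.

```python
from datetime import datetime

# One EAV row per attribute per business, mirroring business_structured_facts:
# (business_id, field_name, field_group, value, update_ts)

def latest_facts(rows):
    """Collapse duplicate/replayed rows to one value per field_name,
    keeping the row with the newest update_ts (last-write-wins)."""
    latest = {}
    for business_id, field_name, field_group, value, update_ts in rows:
        prev = latest.get(field_name)
        if prev is None or update_ts > prev[1]:
            latest[field_name] = ((field_group, value), update_ts)
    return {name: fg_val for name, (fg_val, _) in latest.items()}

rows = [
    ("biz-42", "hours_tuesday", "hours",     "9am-10pm", datetime(2026, 1, 1)),
    ("biz-42", "hours_tuesday", "hours",     "9am-11pm", datetime(2026, 2, 1)),  # newer row wins
    ("biz-42", "wifi",          "amenities", "free",     datetime(2026, 1, 15)),
]
facts = latest_facts(rows)
# facts["hours_tuesday"] == ("hours", "9am-11pm")
```

Replaying the same rows is a no-op, which is the idempotency property the update_ts column buys.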

Why LLM consumers permission EAV

Three forces align:

  1. LLMs don't need typed values. Prices, hours, boolean amenities all flow to the LLM as strings; the LLM tokenises and reasons over strings directly.
  2. LLM prompt composition tolerates heterogeneity. Missing attributes, new attributes, and renamed attributes all surface harmlessly at prompt-assembly time (the caller just emits whatever rows exist). There's no compile-time error waiting for a DDL migration.
  3. Ingestion-pipeline isolation is a scaling multiplier. New attributes don't require a migration → new attributes don't require cross-team coordination → attribute coverage can grow continuously as upstream pipelines add fields.
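Force 2 can be made concrete with a sketch (function and field names are illustrative, not Yelp's): prompt assembly just serialises whatever attribute rows exist, so a brand-new field_name flows through with zero code or schema changes, and a missing one simply produces no line.

```python
def facts_to_prompt_block(facts):
    """Render EAV facts ({field_name: (field_group, value)}) as plain
    strings for an LLM prompt. Unknown/new field_names pass through
    untouched; absent ones emit nothing -- no migration required."""
    lines = []
    for field_name, (field_group, value) in sorted(facts.items()):
        lines.append(f"[{field_group}] {field_name}: {value}")
    return "\n".join(lines)

facts = {
    "hours_tuesday": ("hours", "9am-11pm"),
    "vegan_options": ("menu", "yes"),
    "some_brand_new_field": ("misc", "whatever upstream wrote"),  # never declared anywhere
}
print(facts_to_prompt_block(facts))
```

The heterogeneity-tolerance is the point: the caller never enumerates a closed attribute set, so attribute coverage can grow upstream without touching the read path.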

Compared to adjacent stances

  • Typed wide schema — classical OLTP. Adds DDL burden but gives read-time type safety + planner optimisation. Appropriate when consumers are other systems that need types.
  • Document-per-entity JSON — a sibling schema-flexible shape. JSON documents work well when the entity has a deep nested structure; EAV works better when attributes are mostly flat + frequently new + per-attribute update-cadences differ.
  • Schema-on-read data lake — similar philosophy applied to analytics; the difference is that data lakes require Spark / Presto / etc. to coerce types at query time, whereas EAV-for-LLM just reads strings straight through.

Tradeoffs / gotchas

  • Query planners struggle. Queries like "all businesses with vegan options that are open after 10pm on Tuesday" require multi-row joins on attribute rows, not a single wide-row read. The EAV shape is not intended for ad-hoc analytical queries — it's intended for per-business LLM consumption.
  • Read-path has to filter by field_name. Unless the caller already knows which attributes to fetch, a scan-all-rows-for-this-business query becomes the read pattern. (Yelp's content-fetching engine compensates by having the caller specify a subset of sources / attributes.)
  • No enforcement of valid attribute values. A value field with "maybe open — check website" is a valid string; an LLM handles it, but downstream non-LLM consumers would fail. If a non-LLM consumer emerges, either a schema layer has to be added or that consumer has to tolerate strings.
  • Attribute taxonomy drift. Without DDL-level enforcement, field_name = "opens_tues" vs "hours_tuesday" can both exist from different producers. A soft taxonomy (prefix conventions, ownership lists) is the mitigation.
  • Secondary indexes are expensive in Cassandra EAV. Cassandra's secondary-index cost model disfavours per-attribute indexing; the read pattern has to key off the partition (business_id) or accept secondary-index cost.
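The soft-taxonomy mitigation for attribute drift can be as simple as a lint pass producers run in CI rather than any DDL enforcement. A minimal sketch, with illustrative (not Yelp's) prefix conventions and alias list:

```python
# Soft-taxonomy lint: flags drifting field_names instead of enforcing a schema.
CANONICAL_PREFIXES = {"hours_", "amenity_", "menu_"}
ALIASES = {"opens_tues": "hours_tuesday"}  # known drift, mapped back to canonical

def lint_field_name(field_name):
    """Return a warning string for a drifting field_name, or None if it's fine."""
    if field_name in ALIASES:
        return f"'{field_name}' is a known alias; use '{ALIASES[field_name]}'"
    if not any(field_name.startswith(p) for p in sorted(CANONICAL_PREFIXES)):
        return f"'{field_name}' matches no canonical prefix {sorted(CANONICAL_PREFIXES)}"
    return None

assert lint_field_name("hours_tuesday") is None          # canonical, OK
assert lint_field_name("opens_tues") is not None         # known drift, warn
assert lint_field_name("totally_new_field") is not None  # off-taxonomy, warn
```

A warning-only check preserves the ingestion-isolation benefit (producers are never blocked by DDL) while keeping drift visible to the owning team.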

Caveat: not a generalised endorsement of EAV

EAV remains an anti-pattern when:

  • The downstream consumer is not an LLM / string-tolerant.
  • Ad-hoc analytics on the attribute values is a primary workload.
  • The attribute set is stable + narrow (wide schema wins on simplicity).

Yelp's EAV choice is load-bearing because the LLM-consumer permission fired and the attribute set drifts continuously; both conditions must hold, and if either fails the calculus reverses toward a typed schema.
