PINTEREST 2026-03-06 Tier 2

Pinterest — Unified Context-Intent Embeddings for Scalable Text-to-SQL

Summary

Pinterest's data platform team (Keqiang Li, Bin Yang) describes how Pinterest's Analytics Agent evolved from a schema-grounded RAG-based Text-to-SQL prototype into the company's most widely adopted LLM agent (10× the usage of the next agent, 40% of the analyst population within two months of launch, target 50% by year-end). The post's two central engineering claims:

  1. Unified context-intent embeddings — instead of indexing tables by their documentation, index the analyst's historical SQL queries transformed into natural-language descriptions of the business question each query was designed to answer, then embed those descriptions. New user questions match against the questions past queries already answered, not against table names or column descriptions. This sidesteps the vocabulary mismatch between analyst phrasing and schema phrasing.
  2. Structural + statistical patterns with governance-aware ranking — alongside the intent embedding, extract validated join keys, filter conventions, aggregation patterns, and CTEs from the same query history, and combine them with governance metadata (table tiers, freshness, documentation completeness) so the system surfaces not just relevant tables but trustworthy ones.

Pinterest's thesis: "your analysts already wrote the perfect prompt" — every SQL query an analyst has ever written encodes hard-won domain expertise; the agent's job is to turn that corpus into a searchable library of expert-authored analytical solutions.

The post is also a full stack overview of the supporting infrastructure: PinCat (internal data catalog on DataHub) for tiering + glossary terms; AI Table Documentation generating column docs from lineage + queries + existing docs; glossary term propagation along the join graph to auto-tag columns; Vector Database as a Service on AWS OpenSearch + Airflow for scalable vectorized indexing; and the Analytics Agent itself, a four-layer architecture (Agent Orchestration / MCP Integration / Context / Execution) that validates generated SQL via EXPLAIN-before-EXECUTE with bounded retries and a default 100-row LIMIT.

Key takeaways

  1. RAG on table descriptions is not enough at 100K+ analytical tables. Pinterest operates 100,000+ analytical tables and 2,500+ analytical users across dozens of domains. Simple keyword matching and table summaries break when: (a) the question's wording doesn't match any table description, (b) multiple tables could answer but only specific join patterns work, (c) the "right" metric involves Pinterest-specific conventions, (d) quality signals, schemas, and query patterns live in different systems. The wiki's first canonical statement of why table-description RAG is a local maximum. (Source: sources/2026-03-06-pinterest-unified-context-intent-embeddings-for-scalable-text-to-sql.)

  2. The three-step SQL → intent pipeline that makes query history searchable by meaning:
     • Step 1: Domain Context Injection — inject PinCat table/column descriptions, standardized glossary terms (e.g. g_advertiser_id vs adv_id both mapping to "advertiser_id"), metric definitions, and domain-expertise caveats alongside the raw SQL before any LLM sees it.
     • Step 2: SQL to Text — the LLM produces three complementary outputs for each query: a high-level summary (business purpose + domain), a list of analytical questions the query could answer, and a detailed breakdown of query logic in plain English. The analytical-questions slot is the load-bearing design choice — future user questions match against questions this query already answered, not against summaries.
     • Step 3: Text to Embedding — embed the natural-language description into a vector for intent-based retrieval. Two generalisation tricks: descriptions strip temporal specifics (exact dates, individual IDs) and preserve business-meaningful values (metric types, entity categories), so an October-2024 keyword-performance query generalises to future "ad CPC by keyword" questions regardless of date range. (patterns/sql-to-intent-encoding-pipeline)
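The Step-3 generalisation trick can be sketched with simple regex normalisation — the post does not publish Pinterest's actual rules, so `generalize_description` and its placeholder tokens are illustrative assumptions:

```python
import re

def generalize_description(text: str) -> str:
    """Hypothetical normalisation: strip temporal specifics and opaque
    individual IDs from an intent description while leaving
    business-meaningful terms (metrics, entity categories) intact."""
    # Replace ISO dates / year-months with a placeholder.
    text = re.sub(r"\b\d{4}-\d{2}(?:-\d{2})?\b", "<date>", text)
    # Replace long numeric identifiers; short business numbers survive.
    text = re.sub(r"\b\d{6,}\b", "<id>", text)
    return text
```

A description normalised this way lets an October-2024 query match any future date-ranged variant of the same question.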

  3. Structural + statistical patterns extracted from the same query history. Alongside intent embeddings, Pinterest parses queries to extract: join patterns (tables + keys + conditions), common filters (WHERE clauses, partition filters per table), aggregation patterns (COUNT DISTINCT vs SUM, grouping dimensions), subquery structures (common CTEs, nested shapes). From execution metadata they aggregate: table co-occurrence frequency, query success rates, usage recency + volume, author expertise. Together these form a library of validated, trusted SQL building blocks: when the agent generates SQL it does not guess join keys — it reuses patterns "actively used and validated by Pinterest analysts thousands of times."
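A toy illustration of join-pattern mining over a query corpus — the regex and `extract_join_patterns` are hypothetical stand-ins for whatever real SQL parser Pinterest uses:

```python
import re
from collections import Counter

# Naive matcher for "JOIN table [alias] ON a.key = b.key" shapes.
JOIN_RE = re.compile(
    r"JOIN\s+([\w.]+)\s+(?:AS\s+)?\w*\s*ON\s+([\w.]+)\s*=\s*([\w.]+)",
    re.IGNORECASE,
)

def extract_join_patterns(queries):
    """Tally (joined_table, left_key, right_key) triples across a corpus;
    the most frequent triples become trusted join building blocks."""
    counts = Counter()
    for sql in queries:
        for table, left, right in JOIN_RE.findall(sql):
            counts[(table.lower(), left.lower(), right.lower())] += 1
    return counts
```

The same parse pass can tally WHERE clauses and GROUP BY dimensions to recover filter and aggregation conventions.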

  4. Governance-aware ranking fuses similarity with trust signals. Candidates aren't ranked by cosine similarity alone — the system fuses similarity scores with table tier (Tier-1 production / Tier-2 team-owned / Tier-3 staging+legacy), data freshness, documentation completeness, active ownership. "A Tier-1 table with active ownership and fresh data ranks higher than a semantically similar but deprecated or undocumented alternative." Canonical wiki statement of ranking fusion over governance metadata. (patterns/governance-tier-ranking-fusion, concepts/governance-aware-ranking)
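The post does not publish the fusion function; a plausible shape, where `TIER_WEIGHT`, the freshness decay, and the `w_sim`/`w_gov` weights are all made-up assumptions:

```python
TIER_WEIGHT = {1: 1.0, 2: 0.8, 3: 0.4}  # hypothetical per-tier trust weights

def fused_score(similarity, tier, freshness_days, documented, owned,
                w_sim=0.6, w_gov=0.4):
    """Blend semantic similarity with governance trust signals so a
    Tier-1, fresh, owned table can outrank a closer but deprecated one."""
    freshness = max(0.0, 1.0 - freshness_days / 30.0)  # decays over a month
    governance = (TIER_WEIGHT.get(tier, 0.2) + freshness
                  + float(documented) + float(owned)) / 4.0
    return w_sim * similarity + w_gov * governance
```

With these weights, a Tier-1 table at similarity 0.70 beats a stale, undocumented Tier-3 table at 0.75 — the behavior the post describes.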

  5. The asset-first principle drives agent design. The agent should surface existing, trusted assets — tables, curated queries, dashboards, metric definitions — before generating new SQL. Today implemented for tables + queries; future expansion to dashboards, datasets, metric definitions, curated query libraries, workflow artifacts. Design principle: promote reuse + consistency over ad-hoc query invention. (concepts/asset-first-agent-design)

  6. PinCat + governance tiering is the infrastructure that made AI viable. Pinterest's data warehouse once had "hundreds of thousands of tables, most with no clear owner or documentation" — governance roadmap was to reduce footprint from roughly 400K to around 100K through standardization and cleanup. Tier 1 = cross-team, production-quality, strict docs + quality requirements; Tier 2 = team-owned, lighter but still enforced standards; Tier 3 = everything else, aggressive retention + deprecation. Built on open-source DataHub, PinCat is the system of record for table tiers, owners, retention policies, and glossary terms (reusable business concepts like user_id or pin_id). "Governance and AI reinforce each other" — without tiering + glossary, AI-generated SQL would still pull from deprecated tables. (concepts/data-governance-tiering)

  7. Scaling documentation with AI + lineage + semantic propagation. Three complementary mechanisms to keep column-level docs current at warehouse scale:
     • AI-generated table + column docs from data lineage + existing PinCat docs + column-level glossary terms + representative QueryBook queries. Tier-1 tables kept humans in the loop; Tier-2 flipped to LLM-drafts-human-reviews. All AI-generated docs clearly marked, owners notified.
     • Column semantics via join-based lineage (glossary term propagation): analyze query logs to build a join graph between columns (e.g. data.pins_d.id joins to ad.ad_video_event_flat_spark.objectid); when a well-documented column with a glossary term repeatedly joins to an undocumented column, propagate the glossary term. Auto-tagged >40% of columns in scope. (concepts/glossary-term-propagation)
     • Search-based propagation where join patterns are sparse: index glossary terms + column docs into a vector database, enabling semantic similarity search between column descriptions and existing glossary term definitions. Combined reduction of manual documentation work: ≈70%. AI-generated table descriptions rated "usable or better" in >75% of user surveys.
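Join-based glossary term propagation can be sketched as a vote over the join graph — `min_support` and the voting scheme are assumptions; the post only describes the repeated-join heuristic:

```python
from collections import defaultdict

def propagate_glossary_terms(join_edges, known_terms, min_support=3):
    """join_edges: (col_a, col_b, observed_join_count) triples mined from
    query logs; known_terms: column -> glossary term for documented columns.
    min_support is a hypothetical threshold guarding against one-off joins."""
    votes = defaultdict(lambda: defaultdict(int))
    for a, b, n in join_edges:
        # A documented column votes its term onto undocumented join partners.
        for src, dst in ((a, b), (b, a)):
            if src in known_terms and dst not in known_terms:
                votes[dst][known_terms[src]] += n
    inferred = {}
    for col, term_counts in votes.items():
        term, support = max(term_counts.items(), key=lambda kv: kv[1])
        if support >= min_support:
            inferred[col] = term
    return inferred
```

This is the mechanism that auto-tagged >40% of in-scope columns; the search-based variant covers columns the join graph never reaches.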

  8. Vector Database as a Service on AWS OpenSearch + Airflow. With multiple teams at Pinterest building LLM features (table search, Text-to-SQL, AI docs) each reinventing custom indexes + ad hoc ingestion + brittle retrieval logic, Pinterest built an internal platform. Choice: AWS OpenSearch for internal-productivity use cases, paired with Hive tables as the source of truth for vectorized datasets + Airflow for index creation + ingestion DAGs. Teams define an index via simple JSON schema (alias, vector dim like 1536, source Hive table mapping); an Airflow workflow validates, creates, publishes metadata. Handles millions of embeddings with daily incremental updates. Hybrid patterns supported: semantic similarity + metadata filters ("Tier 1 tables semantically similar to user_actions containing impression data"). "Zero to a production-grade vector index in days instead of weeks." (patterns/internal-vector-db-as-service)
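A sketch of what the "simple JSON schema" index definition and its pre-flight validation might look like — field names and checks are illustrative assumptions, not Pinterest's actual contract:

```python
# Hypothetical index definition in the spirit of the post's description:
# an alias, a vector field with its dimensionality, and a source Hive table.
index_spec = {
    "alias": "analytics_query_intents",
    "vector": {"field": "intent_embedding", "dimension": 1536},
    "source_hive_table": "warehouse.query_intent_descriptions",
    "metadata_fields": ["table_name", "tier", "last_updated"],
}

def validate_spec(spec):
    """The kind of checks an Airflow workflow might run before creating
    the OpenSearch index and publishing its metadata."""
    assert spec["alias"], "alias is required"
    assert spec["vector"]["dimension"] > 0, "vector dimension must be positive"
    assert "." in spec["source_hive_table"], "source must be db.table"
    return True
```

Keeping the Hive table as source of truth means re-indexing is a replayable DAG run rather than a bespoke migration.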

  9. Four-layer Analytics Agent architecture:
     • Agent Orchestration — an LLM with Pinterest-specific prompts classifies tasks (doc lookup / table discovery / query discovery / Text-to-SQL / execution) and decides tool calls + order.
     • MCP Integration — MCP tools providing a unified interface to: table search (Vector DB + PinCat), query search (query description index), knowledge search (internal docs), Presto execution with EXPLAIN validation.
     • Context — PinCat schemas + tiers, vector indexes of tables and queries, expert-curated docs + metric definitions, usage patterns from query logs.
     • Execution — Presto for validated SQL with EXPLAIN-before-EXECUTE, tight LIMIT, and error-recovery retry loops. Critical validation disciplines: references only existing tables/columns (PinCat validation); uses column profiling data to ensure filter values match actual data (e.g. 'WEB' not 'web' — the canonical "looks right but returns nothing" failure mode); reuses known join keys + filters from historical queries; EXPLAIN first, iterating on failures up to a bounded retry limit; 100-row LIMIT by default. (patterns/explain-before-execute-validation)
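The EXPLAIN-before-EXECUTE discipline reduces to a bounded repair loop. A minimal sketch, where `explain`, `execute`, and `revise` are injected callables standing in for Presto and the LLM repair step (the post gives the discipline, not code, so these interfaces are assumptions):

```python
def run_validated(sql, explain, execute, revise, max_retries=3, row_limit=100):
    """Validate SQL with EXPLAIN before running it, repairing on failure
    up to a bounded number of retries, with a conservative default LIMIT."""
    if "limit" not in sql.lower():
        sql = f"{sql.rstrip(';')} LIMIT {row_limit}"  # default 100-row cap
    for _ in range(max_retries + 1):
        error = explain(sql)           # validate the plan, touch no data
        if error is None:
            return execute(sql)        # only validated SQL ever runs
        sql = revise(sql, error)       # feed the planner error back for repair
    raise RuntimeError("SQL still failing EXPLAIN after bounded retries")
```

The key property: unvalidated SQL never reaches `execute`, and failures surface as planner errors rather than wrong-but-plausible result sets.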

  10. Conflict-resolution hierarchy when context sources disagree. With multiple sources (query patterns, documentation, schemas, general knowledge) conflicts are inevitable. Pinterest's defined precedence: (1) expert-curated documentation (canonical guides, metric definitions) = primary source of truth for business logic; (2) schema metadata from PinCat = authoritative for column names, types, table structure; (3) query patterns = guidance but validated against schemas before use; (4) general knowledge base = fallback when specialized sources lack coverage. "The agent generates SQL that is both semantically correct (aligned with business intent) and syntactically valid (grounded in actual schemas)."
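The precedence rule is simple enough to state directly in code — this is just the published hierarchy encoded as an ordered scan, with hypothetical source names:

```python
# Highest-precedence source first, per the published hierarchy.
PRECEDENCE = ("expert_docs", "schema_metadata", "query_patterns",
              "general_knowledge")

def resolve(candidates):
    """candidates: source name -> proposed answer (or None).
    Returns the answer from the highest-precedence source that has one."""
    for source in PRECEDENCE:
        if candidates.get(source) is not None:
            return source, candidates[source]
    return None, None
```

So a join convention found only in query history yields to a conflicting PinCat schema fact, which in turn yields to an expert-curated metric definition.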

  11. Self-reinforcing learning cycle — the library grows as analysts work. Every new query an analyst writes becomes a new entry in the knowledge base: new analytical patterns emerge, metric calculation standards evolve, join conventions spread, domain-specific filters become discoverable across teams. "The analyst who figures out how to compute retention by acquisition channel doesn't just answer their own question — they write a reusable recipe that any future analyst can discover." System gets better as usage grows; 2,500+ analysts collectively teach it.

Systems extracted

  • systems/pinterest-analytics-agent — the agent itself. Four-layer architecture (orchestration / MCP / context / execution). Built on table search + query search + metric-definition lookup + Presto execution. Most-used agent at Pinterest (10× the next), 40% analyst population adoption in two months, targeting 50% year-end.
  • systems/pinterest-pincat — Pinterest's internal data catalog built on DataHub. System of record for table tiers, owners, retention, column-level glossary terms. Pre-existing system but this post is its canonical wiki introduction.
  • systems/pinterest-vector-db-service — Pinterest's internal Vector Database as a Service on AWS OpenSearch. Config-driven index definition; Hive tables as source of truth; Airflow for index creation + ingestion. Millions of embeddings with daily incremental updates. Hybrid semantic-plus-metadata filtering.
  • systems/pinterest-ai-table-documentation — AI documentation pipeline. LLM generates table + column descriptions from lineage + existing PinCat docs + glossary terms + representative QueryBook queries. Tier-1 human-in-the-loop; Tier-2 LLM-drafts-human-reviews; all AI-generated clearly marked.
  • systems/pinterest-querybook — Pinterest's collaborative SQL editor (where analysts write / run / share queries). Source of representative-query inputs for AI Table Documentation.
  • systems/datahub — open-source data catalog (datahubproject.io) underneath PinCat. Canonical wiki introduction.
  • systems/opensearch — AWS OpenSearch as the Vector DB substrate.
  • systems/presto — the query execution engine; EXPLAIN-before-EXECUTE validation loop.
  • systems/apache-airflow — index creation + ingestion DAGs for Vector DB as a Service.
  • systems/model-context-protocol — the Analytics Agent's tool surface is an MCP integration layer.

Concepts extracted

  • concepts/text-to-sql — canonical wiki definition. The task of generating executable SQL from a natural-language question over a specific database schema; why naive "LLM + schema" RAG breaks at 100K+ tables; why production Text-to-SQL needs governance + query-history + validation layers.
  • concepts/unified-context-intent-embedding — Pinterest's named contribution. A single embedding space indexing natural-language descriptions of what historical queries were designed to answer, enabling semantic retrieval robust to variations in table names / column structures / phrasing.
  • concepts/analytical-intent-embedding — the vector representation of "the business question a query was designed to answer."
  • concepts/sql-to-text-transformation — the three-output LLM step that produces summary + analytical questions + detailed breakdown from a SQL query (with domain context injected).
  • concepts/governance-aware-ranking — ranking retrieval candidates by fusing similarity scores with trust signals (table tier, freshness, documentation completeness, ownership). Pinterest canonical instance.
  • concepts/glossary-term-propagation — propagating a well-documented column's glossary term to an undocumented column it joins to. Auto-tagged >40% of Pinterest columns.
  • concepts/data-governance-tiering — the Tier-1 / Tier-2 / Tier-3 discipline separating production-grade / team-owned / legacy tables with progressively lighter quality + doc requirements.
  • concepts/asset-first-agent-design — "surface existing, trusted assets before generating new SQL." Design principle that favors reuse + consistency over ad-hoc query invention.
  • concepts/analytical-question-bridge — the design trick that the SQL→text step generates explicit "analytical questions this query answers" so future user questions match question-to-question instead of question-to-table-description.
  • concepts/query-history-knowledge-base — treating accumulated SQL query logs as a durable, searchable library of expert-authored analytical solutions, not throwaway audit data.

Patterns extracted

  • patterns/sql-to-intent-encoding-pipeline — the three-step pipeline (domain context injection → SQL-to-text → text-to-embedding) that converts raw historical SQL into a semantically searchable intent index.
  • patterns/analytical-intent-retrieval — at query time, embed the user's natural-language question and retrieve by similarity to the analytical-question descriptions of past queries, not to table docs.
  • patterns/governance-tier-ranking-fusion — fuse semantic-similarity scores with governance metadata (tier, freshness, docs, ownership) when ranking retrieval candidates.
  • patterns/explain-before-execute-validation — generated SQL passes EXPLAIN before EXECUTE; on failure, iterate up to a bounded retry limit; default a conservative LIMIT (100 rows).
  • patterns/internal-vector-db-as-service — ship an internal platform (OpenSearch + Hive + Airflow at Pinterest) so teams go from zero to a production vector index in days without solving embedding + ingestion + monitoring from scratch.
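The analytical-intent-retrieval pattern above can be sketched with a brute-force cosine scan — a stand-in for the real OpenSearch k-NN query; embedding the question itself (`question_vec`) is out of scope here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(question_vec, intent_index, k=5):
    """intent_index: list of (past_query_id, intent_vec) pairs. Rank past
    queries by how close their answered-questions embedding is to the
    user's question embedding, and return the top-k ids."""
    ranked = sorted(intent_index, key=lambda e: cosine(question_vec, e[1]),
                    reverse=True)
    return [qid for qid, _ in ranked[:k]]
```

In production the scan is replaced by an approximate k-NN search with governance metadata filters applied in the same query.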

Operational numbers

  • 100,000+ analytical tables at Pinterest.
  • 2,500+ analytical users across dozens of domains.
  • 400K → ~100K target table footprint reduction (governance roadmap).
  • >75% of AI-generated table descriptions rated "usable" or better in user surveys.
  • ~40% of manual documentation effort reduced by AI Table Documentation.
  • >40% of columns in scope auto-tagged with glossary terms via join-based lineage.
  • ~70% overall manual-documentation-work reduction from AI + lineage + search combined.
  • Millions of embeddings in the Vector DB, daily incremental updates.
  • 1536-dim embeddings example (post cites 1536 as a representative vector field dimensionality).
  • 40% analyst-population coverage within two months of launch; target 50% by year-end.
  • 10× the usage of the next-most-used agent — the #1 agent at Pinterest.
  • 100 rows default LIMIT on generated SQL.

Caveats

  • No absolute Text-to-SQL accuracy numbers — Pinterest reports "meets expectations for table discovery" and "SQL generation has room for improvement" but does not publish specific accuracy/precision/recall.
  • Hardest cases explicitly called out: complex analytical logic (multi-step calculations + window functions), ambiguous business terms not yet in documentation, cross-domain queries surfacing conflicting join patterns or metric definitions, schema evolution where deprecated-table patterns still appear in the index.
  • Post does not disclose: the embedding model used, the LLM used for SQL-to-text generation, specific MCP server implementations, latency / throughput of the Analytics Agent or Vector DB service, rollout traffic-control mechanism, error-recovery retry bounds, the exact similarity+governance fusion function.
  • Evaluation methodology: Pinterest built an internal benchmarking framework but does not publish head-to-head comparisons vs. other Text-to-SQL systems (DS-STAR, Snowflake Cortex Analyst, etc.).
  • Only certain agent capabilities are productionized at the time of writing — "As the agent gains new capabilities, we will add corresponding test coverage."
  • The recursive propagation loop (LLM-generated docs feed the LLM-generated embedding index) is not discussed as a potential failure mode; AI-generated docs going stale or wrong could silently poison retrieval.
  • No discussion of privacy / access-control on the query-history index — that all analysts can see query patterns from all teams is implied but not stated explicitly.
