
PATTERN

SQL-to-intent encoding pipeline

SQL-to-intent encoding is Pinterest's three-step pipeline that converts raw historical SQL queries into a semantically searchable intent index.

(Source: sources/2026-03-06-pinterest-unified-context-intent-embeddings-for-scalable-text-to-sql.)

The three steps

Step 1 — Domain context injection

Before any LLM reads the SQL, inject Pinterest-specific semantic context alongside the raw query:

  • Table and column descriptions from PinCat to add business context.
  • Standardized glossary terms — e.g. advertiser_id maps to g_advertiser_id in one table and adv_id in another.
  • Metric definitions — e.g. "engaged user" means specific action types.
  • Domain expertise — data quality caveats, recommended date ranges, Pinterest-specific conventions.

Without this step, the LLM sees only raw table and column names and misses the business meaning behind them.
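A minimal sketch of the injection step: gather table/column docs, glossary mappings, and metric definitions, and prepend them to the raw SQL. The catalog, glossary, and metric contents below are illustrative stand-ins for PinCat data, not Pinterest's actual schema.

```python
# Step 1 sketch: attach domain context to a raw SQL query before any LLM
# reads it. CATALOG / GLOSSARY / METRICS are toy stand-ins for PinCat.

CATALOG = {
    "ads_daily": {
        "description": "Daily advertiser-level spend and delivery metrics.",
        "columns": {"g_advertiser_id": "Canonical advertiser identifier."},
    },
}

GLOSSARY = {
    # One business term can map to different physical columns per table.
    "advertiser_id": {"ads_daily": "g_advertiser_id", "ads_events": "adv_id"},
}

METRICS = {
    "engaged user": "A user who performed at least one qualifying action type.",
}


def build_context_bundle(sql: str, tables: list) -> str:
    """Concatenate domain context with the raw SQL for the LLM prompt."""
    parts = ["-- Domain context --"]
    for table in tables:
        meta = CATALOG.get(table, {})
        parts.append(f"Table {table}: {meta.get('description', 'no description')}")
        for col, doc in meta.get("columns", {}).items():
            parts.append(f"  column {col}: {doc}")
    for term, mapping in GLOSSARY.items():
        for table in tables:
            if table in mapping:
                parts.append(f"Glossary: '{term}' is column {mapping[table]} in {table}")
    for metric, definition in METRICS.items():
        parts.append(f"Metric '{metric}': {definition}")
    parts.append("-- Raw SQL --")
    parts.append(sql)
    return "\n".join(parts)


bundle = build_context_bundle(
    "SELECT g_advertiser_id, SUM(spend) FROM ads_daily GROUP BY 1",
    tables=["ads_daily"],
)
print(bundle)
```

With this bundle in place, the LLM in Step 2 sees business meaning (glossary terms, metric definitions, caveats) next to the physical names, instead of bare identifiers.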

Step 2 — SQL to text

An LLM transforms the SQL + context bundle into a structured natural-language description, producing three complementary outputs per query (see concepts/sql-to-text-transformation):

  • High-level summary — captures business purpose + domain.
  • Analytical questions — explicit list of questions this query could help answer. The load-bearing output.
  • Detailed breakdown — column definitions, transformation logic, filters, business purpose in plain English.

Two design choices make this effective at scale:

  1. The analytical questions slot creates the question-to-question bridge that future user questions match against.
  2. Descriptions are kept deliberately generalizable — strip temporal specifics (exact dates, individual IDs), preserve business-meaningful values (metric types, entity categories).
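The generalization in design choice 2 can be sketched as a normalization pass over the generated description. The regexes below are assumptions for illustration; the actual rules would be tuned to the corpus.

```python
# Design-choice-2 sketch: strip temporal specifics and individual IDs from a
# description while leaving business-meaningful terms (metric types, entity
# categories) untouched. Illustrative regexes, not Pinterest's actual rules.
import re


def generalize(description: str) -> str:
    out = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<date>", description)  # exact dates
    out = re.sub(r"\badvertiser \d+\b", "advertiser <id>", out)    # individual IDs
    return out


result = generalize(
    "Daily engaged-user counts for advertiser 82731 between 2024-01-01 and 2024-03-31."
)
print(result)
# → Daily engaged-user counts for advertiser <id> between <date> and <date>.
```

The metric type ("engaged-user") survives generalization; the date range and advertiser ID do not, so the description matches any future question about the same analysis regardless of when or for whom it was run.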

Step 3 — Text to embedding

Embed the natural-language description (or its per-slot components) into vectors in a unified context-intent embedding space. This enables intent-based retrieval: at query time, embed the user's natural-language question the same way and find historical queries that answered similar analytical questions, regardless of keyword overlap.
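The retrieval mechanics can be sketched with toy vectors. `embed` below is a bag-of-words stand-in for a real embedding model; only the index-then-match flow is the point.

```python
# Step 3 sketch: embed each description slot at index time, embed the user's
# question the same way at query time, and rank by cosine similarity.
# `embed` is a toy bag-of-words stand-in for a real embedding model.
import math
from collections import Counter


def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Index of analytical questions generated in Step 2 (toy examples).
index = {
    "q1": embed("which advertisers spent the most last week"),
    "q2": embed("how many engaged users did each campaign reach"),
}

query_vec = embed("top spending advertisers this week")
best = max(index, key=lambda k: cosine(query_vec, index[k]))
print(best)
# → q1
```

Note that the match here rides on shared intent vocabulary ("advertisers", "week"), not on any overlap with the SQL's table or column names; a real embedding model makes this robust to paraphrase as well.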

When to reach for this pattern

Use this pattern when:

  • You have a large corpus of historical domain-specific artifacts (SQL queries here, but the same shape applies to runbooks, incident tickets, investigation notes, etc.).
  • Users naturally phrase their requests in the business-question vocabulary of the artifacts, not in the technical vocabulary of the underlying systems' metadata.
  • You want to reuse validated solutions rather than re-derive them.

Do not reach for this pattern when:

  • Your artifact corpus is too small to amortize the SQL-to-text generation cost.
  • Artifacts are already written in user-vocabulary (a simple embedding without the translation step suffices).
  • You lack the domain-context infrastructure (PinCat-equivalent) needed for meaningful context injection — you'll get garbage summaries.
