Skip to content

PATTERN Cited by 1 source

Metrics layer for deterministic Genie answers

The pattern

Pin a Genie (or any structured-data LLM-powered Q&A interface) to a metrics layer — a curated, schema-bound registry of business measures with explicit SQL definitions — so that the same natural-language question "What's my bank balance?" always produces the same number, independent of LLM trajectory variance.

The pattern is a retrofit observed in production: the default LLM-only output path is nondeterministic enough to be unfit for financial / operational reporting, where users expect the same question to produce the same answer every time. The metrics layer is the bridging primitive that pins the LLM's natural- language interpretation to a deterministic SQL execution path.

The canonical wiki source's verbatim disclosure:

"When early Genie deployments returned inconsistent results for structured queries, the team implemented a metrics layer to ensure they got deterministic answers, critical for financial and operational reporting."

"In the structured content, you need an answer. What is my bank balance? I don't want to see a different number every time." — Suresh Kaudi

(Source: World Bank Group on Databricks)

Why this is the load-bearing failure mode for production Genie

Two structural facts about LLM-powered structured-data Q&A interfaces force this pattern:

  1. LLM-generated SQL is nondeterministic in a way that matters for reporting. Even with deterministic decoding (temperature=0) and identical prompts, semantic ambiguity in the question leaves room for the model to choose between multiple valid SQL formulations of the same business measure. Each formulation may produce different numbers — different join paths, different aggregation grain, different filter semantics, different time- window boundaries. For financial / operational reporting, "a different number every time" is unacceptable even if all the numbers are individually defensible.
  2. Same-question-same-answer is a hard requirement, not a nice- to-have. Stakeholders ask the same question repeatedly across board meetings, exec briefings, regulator calls, and audit review. A bank-balance figure that depends on which Genie trajectory ran is operationally useless — the trust contract between the analyst and the executive depends on the number being the number.

The metrics layer pins each business measure to one SQL definition. Genie then becomes a natural-language router into that registry rather than a SQL author. "What's my bank balance?" maps to a single bank-balance metric definition with a single SQL execution path. Same question → same SQL → same number.

  • patterns/transform-upstream-to-collapse-measures — the Trinity Industries case (600 measure variants → one canonical measure layer). That pattern is about upstream measure consolidation in the Medallion-architecture silver tier before Genie is deployed. This pattern is about pinning Genie's output path to a metrics layer after Genie is deployed and observed to produce inconsistent answers. The two compose: measure consolidation upstream + metrics-layer pinning downstream both reduce Genie's degrees of freedom on the way to a deterministic answer. The Trinity case is the "can Genie answer correctly?" problem; the World Bank case is the "can Genie answer the same way every time?" problem.
  • concepts/source-of-truth-disambiguation — the broader data-agent challenge of ranking authority across multiple sources. The metrics layer is one mechanism for resolving disambiguation on the structured-data path: "the metrics-layer-defined SQL is authoritative; everything else is descriptive."
  • concepts/measure-proliferation — the upstream pathology that makes both this pattern and the Trinity measure-consolidation pattern necessary.
  • concepts/single-source-of-truth-dashboard — the user-facing endpoint for the same authoritative-measure property. Dashboards pin via fixed SQL; Genie pins via the metrics layer.

Required components

  1. A metrics-layer registry — schema-bound, version-controlled, audit-traceable. Each entry has at minimum:
  2. Measure name (canonical business name).
  3. SQL definition (deterministic, parameterisable).
  4. Dimension grain (the level the measure is meaningful at).
  5. Time-window semantics (point-in-time / period / live).
  6. Owner + last-changed-by + change history. The substrate could be Unity-Catalog-governed Metric Views, a custom UC-tags + UDF registry, or a third-party metrics layer wired to UC.
  7. Genie-to-metrics-layer pinning — Genie's NL→SQL path is constrained to dispatch to metrics-layer entries rather than author free-form SQL. Mechanism is a disclosure gap in the canonical source: the post says "the team implemented a metrics layer to ensure they got deterministic answers" but doesn't specify whether Genie is a thin lookup over the registry, or uses the registry as a constraint during SQL generation, or composes the registry's SQL into a larger query.
  8. Out-of-registry escape hatch — for ad-hoc questions that don't map to any registered measure, Genie must either fall back to free-form SQL (re-introducing the nondeterminism for that path only) or refuse the question. The canonical source doesn't disclose this design decision.
  9. (Recommended) Audit trail of question → measure → SQL → answer. The pattern is operationally meaningful only if stakeholders can verify which measure definition produced which number for a given question.
  10. (Recommended) Per-domain metrics layer — in multi-domain deployments (see patterns/intent-domain-decomposer-agentic-router), each per-domain Genie has its own metrics layer. Cross-domain questions get decomposed and dispatched to per-domain metrics layers separately.

When applies / doesn't fit

Applies when

  • The Genie surface serves financial / operational reporting use cases where same-question-same-answer is a hard contract.
  • The user population includes executives or external stakeholders who interpret the LLM's answer as the number, not a number.
  • A canonical measure registry already exists (or can be built) upstream — the metrics layer relies on someone having defined authoritative measures.
  • Genie has been deployed and observed producing inconsistent answers for the same question — the pattern is typically a retrofit, not a greenfield decision.
  • Per-domain pinning is required (multi-domain knowledge platform; see patterns/intent-domain-decomposer-agentic-router).

Doesn't fit when

  • The Genie surface is exploratory analytics where users expect creative SQL formulations and tolerate variance — e.g. ad-hoc data-science exploration, hypothesis-generation workflows.
  • No authoritative measure registry exists. The pattern presupposes someone has done the upstream work of defining the SQL for the business measure. Without that, the metrics layer is empty.
  • Latency-critical conversational UIs where the measure-lookup roundtrip dominates the response budget and free-form SQL generation is faster.
  • The substrate is unstructured-document Q&A (RAG) — the pattern is structured-data-specific. RAG has its own determinism story (citation-grounding) and doesn't compose with a metrics layer.

Failure modes

  • Out-of-registry questions silently re-introduce nondeterminism. When Genie falls back to free-form SQL for ad-hoc questions, those answers are nondeterministic again — but the user has no UI signal that the answer came from outside the metrics layer. Mitigation: explicit "this answer was computed from the X measure" / "this answer was computed ad-hoc" indicator in the response.
  • Stale measure definitions. When a business measure's semantics change (e.g. a regulatory definition update) but the metrics-layer entry isn't updated, Genie returns a consistent-but-wrong number. Mitigation: change-management process for measure definitions; alerting on measure- definition-drift across owner / business / regulatory surfaces.
  • Measure proliferation in the metrics layer itself. The pattern can degenerate into the same measure-proliferation pathology it was meant to solve, with multiple near- duplicate entries for the same business measure ("bank balance v1", "bank balance corrected", "bank balance ex-FX"). Mitigation: same governance discipline as any canonical-measure registry — review board, deprecation flow, naming conventions.
  • Decomposition over per-domain metrics layers can produce inconsistent dimension semantics. When two per-domain metrics layers define a shared dimension ("India") with subtly different scoping, cross-domain decomposed answers drift. Mitigation: shared dimension definitions across per-domain metrics layers (see Composes with).
  • Audit-trail completeness gaps. If the question → measure → SQL → answer trace is partial (e.g., missing the measure-version-at-time-of-answer), audit reconstruction is impossible. Mitigation: full audit-payload capture (Inference Tables is the Databricks-ecosystem instance).
  • Performance regression on registry roundtrip. If the metrics-layer lookup is on the hot path, query latency can regress noticeably for questions that don't need it. Mitigation: cache measure-definition lookups; pre-resolve common questions.

Composes with

Seen in

  • sources/2026-05-22-databricks-how-world-bank-group-uses-databricks-to-eradicate-poverty-through-shared-knowledgecanonical wiki source. World Bank Group retrofitted a metrics layer after observing that "early Genie deployments returned inconsistent results for structured queries". Suresh Kaudi's verbatim diagnosis: "In the structured content, you need an answer. What is my bank balance? I don't want to see a different number every time." The pattern composes with patterns/intent-domain-decomposer-agentic-router in the same deployment: each per-domain Genie has its own metrics layer, and "each Genie instance is built against a specific metrics layer, meaning a separate Genie is needed for each data domain." Caveat: mechanism-light — the post names the metrics-layer purpose and effect but not its concrete implementation. Whether the pinning is via UC Metric Views, a custom UC-tagged-UDF registry, a third-party metrics layer, or another substrate is not disclosed.

What we don't yet know

The canonical source is mechanism-light. Open questions for any future ingest on this pattern:

  • What is the metrics-layer substrate concretely? UC Metric Views? Custom tag + UDF registry? dbt metrics layer wired to Genie? Cube.dev?
  • How does Genie's NL→SQL path interact with the metrics layer at execution time — strict dispatch only, constrained generation, or composed?
  • What's the out-of-registry fallback? Refuse the question? Free-form SQL with a "low-confidence" indicator?
  • How are measure-definition changes propagated to Genie's context — re-prompted on every question, cached with TTL, pushed via a notification mechanism?
  • What's the per-question latency cost of the metrics-layer lookup vs free-form SQL?

A future engineering retrospective on the metrics-layer implementation in any Databricks-ecosystem deployment would close these gaps.

Last updated · 542 distilled / 1,571 read