Skip to content

CONCEPT Cited by 2 sources

Star schema

Definition

A star schema is a dimensional data-warehousing schema design (Kimball, 1996) consisting of:

  • A central fact table holding high-cardinality, often-numeric observations (sales transactions, page views, extraction events) with foreign keys referencing dimension tables.
  • A constellation of dimension tables — one per descriptive axis (date, customer, product, geography) — typically denormalised for query simplicity, low cardinality relative to the fact table.

The diagram looks like a star: the fact table at the centre, the dimension tables at the points.

Why it's the default for analytical workloads

  • Query-friendly join shape. Aggregations are select fact, join dimensions, group by dimension columns — a single layer of joins that query optimisers handle well.
  • Composes with columnar storage. Dimension columns are queried selectively; column-pruning is cheap. The fact table is wide-and-tall but each query touches a predictable subset of columns.
  • Inspectable. Analysts can read the schema and understand the business model directly: "sales by region by product by month" is one query.
  • Extensible. Adding a new dimension is a new table + a new FK on the fact table — additive, not destructive.

Star schema as state model for LLM pipelines

The VF Match FDR pipeline canonicalises an underappreciated use of star schema: the state model for a multi-step LLM extraction pipeline. Verbatim:

"Extensible data modeling: Data at each step is stored in a star schema, simplifying downstream analytics and improving query performance."

In this shape:

  • Central fact table = per-record extraction state. Each row is one source record (URL / scraped page / facility candidate); columns include per-step status (status-based checkpointing), extraction outputs, foreign keys to dimensions.
  • Dimension tables = reference data: countries, organisation types, medical specialties, equipment categories, geographic hierarchies.

This is a cleaner state model than ad-hoc per-step tables joined on row IDs because:

  1. One table to query for pipeline status. "How many records are stuck at step 2?" is one SELECT COUNT(*) GROUP BY step2_status, no multi-table JOIN.
  2. Resumability is a single-table predicate. Re-runs filter the fact table by step status; no JOIN required to find re-runnable rows.
  3. Reference data is shared across steps. The medical-specialty dimension is referenced by both extraction-output FKs and downstream-analytics queries; no duplication.
  4. Downstream analytics for free. Once the pipeline produces the fact + dimension tables, BI / analytics queries against the same shape are immediate — no separate ETL into an analytical schema.

Star schema vs alternatives in lakehouse pipelines

  • Snowflake schema (normalised dimension tables): less query-friendly, more update-friendly. Star is preferred when read performance dominates write performance — typical of analytical and pipeline-state workloads.
  • One Big Table (OBT) / wide-table denormalisation: collapses dimensions into the fact table. Faster on simple queries; harder to evolve; loses dimension reusability. Common for feature-store-style ML training data.
  • Per-step staging tables joined by ID: ad-hoc shape; common in Airflow pipelines built bottom-up. Star schema is the refactor target when staging-table sprawl becomes painful.

Star schema as Gold-layer BI-serving substrate

The 2026-05-27 BI Serving Pointers source canonicalises a different application of star schema: as the default Gold-tier substrate for BI-serving on the Databricks Lakehouse. Verbatim:

"Star schemas remain the gold standard for BI query performance. Wide, denormalized dimension tables joined to fact tables via surrogate keys give the query optimizer clean, predictable join paths."

Three platform-resident primitives are named as Kimball-style dimensional-modelling enablers in Databricks:

  • Primary and foreign key constraints — with the RELY hint to make the optimiser trust the constraint at plan time, enabling join-elimination and other constraint-driven plan rewrites.
  • Identity columns — for surrogate keys on dimension tables.
  • CHECK and NOT NULL constraints — for value-range enforcement.

The recommended layering:

"If you're following a medallion architecture, keep your normalized or Data Vault models in Silver and build denormalized star schemas in Gold for BI consumption."

I.e., the Silver tier holds normalised models (Data Vault is named as a specific shape), and the Gold tier holds denormalised star schemas that are queried by BI tools and AI agents through a semantic layer (e.g. Metric Views).

This is sibling to (not in conflict with) the VF Match FDR use of star schema as the state model for an LLM extraction pipeline: both treat the fact-and-dimension shape as the natural target for multi-step or aggregation-heavy queries, just at different layers of the platform.

Composes with

Failure modes

  • Fact-table column sprawl. Adding extraction outputs as new fact-table columns (rather than extracting them as dimensions) produces wide-and-sparse tables that are hard to evolve. Mitigation: discipline around what belongs as a fact-table column vs a dimension.
  • Dimension drift. When dimension definitions change (new medical-specialty taxonomy), historical fact records may reference out-of-date dimension keys. Mitigation: SCD Type 2 history on dimension tables.
  • Skewed FK distributions. When most fact rows reference one dimension row (the unknown / other bucket), JOINs and aggregations skew. Mitigation: high-cardinality dimensions
  • null handling.

Seen in

  • sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countriesfirst canonical wiki source. VF Match FDR's per-step extraction state and outputs land in a star schema explicitly framed as "simplifying downstream analytics and improving query performance". First wiki canonicalisation of star schema as the state-model substrate for a multi-step LLM extraction pipeline.
  • sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tcosecond canonical wiki source. Star schemas named as the "gold standard for BI query performance" and the recommended Gold-tier substrate in the BI-serving stack on Databricks. Names the platform-resident dimensional-modelling primitives (PK / FK with RELY, identity columns, CHECK / NOT NULL) and the Silver-Data-Vault → Gold-star-schema layering recommendation.
Last updated · 542 distilled / 1,571 read