
CONCEPT

Data contract

Definition

A data contract is an explicit, versioned agreement between a data producer and one or more data consumers about the schema, semantics, quality guarantees, and delivery SLA of a data stream. It is the data-pipeline analogue of an API contract — the producer commits to what shape/content it will emit, the consumer commits to how it will interpret that emission, and both sides can evolve independently as long as the contract is preserved.

The term came into common use in the early-2020s "data mesh" / "modern data stack" conversations as a way to push accountability for data quality upstream to the producing team, rather than leaving it as a recurring firefight for downstream analytics / ML / reporting teams.

Why it matters

Without a data contract, every downstream consumer of a pipeline becomes a de facto schema-and-semantics archaeologist: they reverse-engineer what the producer currently emits, build a consumer around that, and break silently when the producer ships something slightly different. This is the classic cause of brittle analytics pipelines, bad dashboards, and cost-attribution numbers that nobody trusts.

A data contract makes the producer's responsibility explicit:

  • Schema — column types, nullability, required fields.
  • Semantics — what a given field actually means (e.g. "cost" is attributed-dollars-per-resource-per-day, not allocated-budget).
  • Quality guarantees — completeness, freshness, accuracy bounds.
  • Delivery SLA — when data lands, how soon outages are declared.
  • Versioning — how and when breaking changes are signalled.
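
The five elements above can be sketched as a single contract object that a producer publishes and a consumer validates against. This is a minimal, hypothetical sketch — the field names, the semver-style version string, and the `validate_row` check are illustrative assumptions, not a standard data-contract API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a data contract covering schema, semantics,
# quality/SLA, and versioning. Names and checks are illustrative.

@dataclass
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False

@dataclass
class DataContract:
    version: str              # versioning: bumped on breaking change
    schema: list              # schema: list of FieldSpec (types, nullability)
    semantics: dict           # semantics: human-readable field meanings
    freshness_hours: int      # quality/SLA: data must land within this window

    def validate_row(self, row: dict) -> list:
        """Return a list of contract violations for one record."""
        errors = []
        for spec in self.schema:
            value = row.get(spec.name)
            if value is None:
                if not spec.nullable:
                    errors.append(f"{spec.name}: required field is null/missing")
            elif not isinstance(value, spec.dtype):
                errors.append(
                    f"{spec.name}: expected {spec.dtype.__name__}, "
                    f"got {type(value).__name__}")
        return errors

contract = DataContract(
    version="1.2.0",
    schema=[FieldSpec("resource_id", str),
            FieldSpec("cost", float),
            FieldSpec("team", str, nullable=True)],
    semantics={"cost": "attributed dollars per resource per day"},
    freshness_hours=24,
)

print(contract.validate_row({"resource_id": "i-abc", "cost": 3.5}))    # → []
print(contract.validate_row({"resource_id": "i-abc", "cost": "3.5"}))  # one violation: cost type
```

The point of making the contract an object rather than tribal knowledge is that both sides can check against the same artifact: the producer runs `validate_row` in CI before shipping, the consumer runs it at ingest.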

With the contract, the consumer side stops being defensive and starts being collaborative — a schema-evolution request becomes a contract amendment, negotiated up-front, not a post-facto PagerDuty page.

Netflix FPD instance

Netflix's Platform DSE team names data contracts as the coordination primitive that makes its FPD + CEA cloud efficiency platform scale. From the 2025-01-02 post:

"FPD establishes data contracts with producers to ensure data quality and reliability; these contracts allow the team to leverage a common data model for ownership. The standardized data model and processing promotes scalability and consistency."

The FPD team ingests from many internal platforms (Spark and others) — each with its own resource model, its own ownership model, its own usage-emission cadence. A contract per platform normalises what lands in FPD into the shared inventory/ownership/usage model that CEA can build analytics on top of. Without the contract, FPD would be endlessly re-parsing per-platform CSVs with slightly different column names and semantics; with the contract, every producer integration is a stable, versioned interface.
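
That per-platform-contract-to-shared-model pattern can be sketched as a set of adapters, each responsible for honouring its platform's contract and emitting the canonical shape. Everything here is an illustrative assumption — the platform names, field names, and adapter structure are not Netflix's actual FPD interfaces:

```python
# Hypothetical sketch: one adapter per producing platform, each normalising
# that platform's records into a shared inventory/ownership/usage model.
# Field names and platforms are illustrative, not from the Netflix post.

CANONICAL_FIELDS = ("resource_id", "owner", "usage_hours")

def from_spark(record: dict) -> dict:
    # Assumed Spark-side emission: app_id / queue_owner / core_hours.
    return {"resource_id": record["app_id"],
            "owner": record["queue_owner"],
            "usage_hours": record["core_hours"]}

def from_other_platform(record: dict) -> dict:
    # A second platform with its own names for the same concepts.
    return {"resource_id": record["instance"],
            "owner": record["team"],
            "usage_hours": record["hours"]}

ADAPTERS = {"spark": from_spark, "other": from_other_platform}

def normalise(platform: str, record: dict) -> dict:
    """Apply the platform's adapter and enforce the shared model's shape."""
    row = ADAPTERS[platform](record)
    missing = [f for f in CANONICAL_FIELDS if f not in row]
    if missing:
        raise ValueError(f"{platform} adapter violated contract: missing {missing}")
    return row

print(normalise("spark", {"app_id": "a1", "queue_owner": "dse", "core_hours": 12}))
```

The adapter is the contract's enforcement point: a producer can rename or restructure its internal emission freely, and only the adapter (a versioned interface) has to change.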

The published SLA complement ("well-defined Service Level Agreements (SLAs) to set expectations with downstream consumers during delays, outages or changes") is what makes the contract operationally real: the consumer can alert against the SLA, and the producer has a clear failure criterion.
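
"The consumer can alert against the SLA" reduces, in the simplest freshness case, to comparing the newest partition's landing time against the contracted window. A minimal sketch, assuming a 24-hour freshness SLA (the window and function name are illustrative, not from the post):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: a freshness-SLA check a consumer could alert on.
# The 24-hour default is an assumed contract value.

def sla_breached(last_landed: datetime, freshness_hours: int = 24,
                 now: datetime = None) -> bool:
    """True if the newest data is older than the contracted freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_landed > timedelta(hours=freshness_hours)

now = datetime(2025, 1, 2, tzinfo=timezone.utc)
print(sla_breached(now - timedelta(hours=30), 24, now=now))  # → True (stale)
print(sla_breached(now - timedelta(hours=2), 24, now=now))   # → False (fresh)
```

The symmetry is the operational payoff: the same predicate is the consumer's alert condition and the producer's failure criterion, so "is the contract being met?" has one answer, not two.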

(Source: sources/2025-01-02-netflix-cloud-efficiency-at-netflix)
