Data contract¶
Definition¶
A data contract is an explicit, versioned agreement between a data producer and one or more data consumers about the schema, semantics, quality guarantees, and delivery SLA of a data stream. It is the data-pipeline analogue of an API contract — the producer commits to what shape/content it will emit, the consumer commits to how it will interpret that emission, and both sides can evolve independently as long as the contract is preserved.
The term came into common use in the early-2020s "data mesh" / "modern data stack" conversations as a way to push accountability for data quality upstream to the producing team, rather than leaving it as a recurring firefight for downstream analytics / ML / reporting teams.
Why it matters¶
Without a data contract, every downstream consumer of a pipeline becomes a de facto schema-and-semantics archaeologist: they reverse-engineer what the producer currently emits, build a consumer around that, and break silently when the producer ships something slightly different. This is the classic cause of brittle analytics pipelines, misleading dashboards, and cost-attribution numbers that nobody trusts.
A data contract makes the producer's responsibility explicit:
- Schema — column types, nullability, required fields.
- Semantics — what a given field actually means (e.g. "cost" is attributed-dollars-per-resource-per-day, not allocated-budget).
- Quality guarantees — completeness, freshness, accuracy bounds.
- Delivery SLA — when data lands, how soon outages are declared.
- Versioning — how and when breaking changes are signalled.
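The five responsibilities above can be sketched as a minimal contract object the consumer validates against. This is an illustrative sketch only — the feed name, field names, and SLA value are hypothetical, not drawn from any real platform's contract:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class DataContract:
    """Hypothetical data contract: schema + semantics + SLA, versioned."""
    name: str
    version: str              # breaking changes bump the major version
    schema: dict              # field name -> (expected type, nullable?)
    semantics: dict           # field name -> plain-language meaning
    freshness_sla: timedelta  # data must land within this window

    def validate_row(self, row: dict) -> list[str]:
        """Return the contract violations for one record (empty list = OK)."""
        errors = []
        for field_name, (ftype, nullable) in self.schema.items():
            if field_name not in row:
                errors.append(f"missing field: {field_name}")
            elif row[field_name] is None:
                if not nullable:
                    errors.append(f"null in non-nullable field: {field_name}")
            elif not isinstance(row[field_name], ftype):
                errors.append(f"{field_name}: expected {ftype.__name__}, "
                              f"got {type(row[field_name]).__name__}")
        return errors

# Illustrative contract for a daily cost feed.
cost_feed_v1 = DataContract(
    name="daily-cost-feed",
    version="1.0.0",
    schema={
        "resource_id": (str, False),
        "day": (date, False),
        "cost_usd": (float, False),
    },
    semantics={
        "cost_usd": "attributed dollars per resource per day, not allocated budget",
    },
    freshness_sla=timedelta(hours=6),
)

# A row that drops cost_usd is a contract violation, caught before ingestion.
print(cost_feed_v1.validate_row({"resource_id": "i-123", "day": date(2025, 1, 2)}))
# → ['missing field: cost_usd']
```

In practice the contract lives in a registry (or as a schema file checked into the producer's repo) rather than in consumer code, so both sides review amendments through the same change process as any other interface.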
With the contract, the consumer side stops being defensive and starts being collaborative — a schema-evolution request becomes a contract amendment, negotiated up-front, not a post-facto PagerDuty page.
Netflix FPD instance¶
Netflix's Platform DSE team names data contracts as the coordination primitive that makes its FPD + CEA cloud efficiency platform scale. From the 2025-01-02 post:
"FPD establishes data contracts with producers to ensure data quality and reliability; these contracts allow the team to leverage a common data model for ownership. The standardized data model and processing promotes scalability and consistency."
The FPD team ingests from many internal platforms (Spark and others) — each with its own resource model, its own ownership model, its own usage-emission cadence. A contract per platform normalises what lands in FPD into the shared inventory/ownership/usage model that CEA can build analytics on top of. Without the contract, FPD would be endlessly re-parsing per-platform CSVs with slightly different column names and semantics; with the contract, every producer integration is a stable, versioned interface.
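The per-platform normalisation pattern can be sketched as one adapter per producer, each translating that platform's native emission into a shared record type. The record fields and the Spark-side column names here are assumptions for illustration — the real FPD data model is not public:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative shared inventory/usage record; not the actual FPD model.
@dataclass
class UsageRecord:
    platform: str
    resource_id: str
    owner: str
    day: date
    cost_usd: float

def from_spark_emission(raw: dict) -> UsageRecord:
    """One adapter per producing platform: this one maps hypothetical
    Spark-side column names onto the common model, so downstream
    analytics never sees platform-specific naming."""
    return UsageRecord(
        platform="spark",
        resource_id=raw["app_id"],
        owner=raw["workflow_owner"],
        day=date.fromisoformat(raw["run_date"]),
        cost_usd=float(raw["attributed_cost"]),
    )
```

The contract is what pins each adapter's input side: as long as the Spark producer honours its contracted field names and types, the adapter — and everything downstream of the common model — never needs to change.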
The published SLA complement ("well-defined Service Level Agreements (SLAs) to set expectations with downstream consumers during delays, outages or changes") is what makes the contract operationally real: the consumer can alert against the SLA, and the producer has a clear failure criterion.
(Source: sources/2025-01-02-netflix-cloud-efficiency-at-netflix)
Relationship to other wiki concepts¶
- API-contract siblings. Data contracts on the data-pipeline side are the counterpart of:
- concepts/backward-compatibility + concepts/schema-evolution on the API side (e.g. Lyft's protobuf design principles canonicalise the equivalent discipline for RPC APIs).
- concepts/contract-first-design on the service-design side.
- concepts/extensibility-protocol-design on the wire-format side.
- Upstream of chargeback. Data contracts are what make patterns/chargeback-cost-attribution work across teams — if the cost data producer doesn't honour a contract, the downstream bill is noise.
Seen in¶
- sources/2025-01-02-netflix-cloud-efficiency-at-netflix — canonical wiki instance; data contracts as the FPD-producer coordination primitive in Netflix's cloud-efficiency platform.