
CONCEPT Cited by 1 source

Source-of-truth hydration

Definition

Source-of-truth hydration is the ingestion-side primitive of calling back to the source system on each event to fetch the complete current state, rather than relying on state embedded in the event payload. The event stream is a trigger; correctness is anchored in the source-of-truth API.

"MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency." (sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph)

The hydration contract

For each event type a consumer handles, it implements a hydration contract:

  1. Validate the event schema.
  2. Look up the source-system API endpoint for the event's entity type.
  3. Call the source API with the entity identifier from the event.
  4. Transform the response into the consumer's internal model.
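The four steps above can be sketched as a single function. This is a minimal illustration, not the MDS implementation: the `ENDPOINTS` mapping, the `Entity` type, the event field names (`type`, `entity_id`), and the injected `fetch` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical endpoint templates keyed by event type (illustrative only).
ENDPOINTS = {
    "model_instance_created": "/api/v1/instances/{entity_id}",
}


@dataclass(frozen=True)
class Entity:
    """Hypothetical consumer-side internal model."""
    entity_id: str
    attributes: dict


def hydrate(event: dict, fetch: Callable[[str], dict]) -> Entity:
    # 1. Validate the event schema (here: only the fields this sketch needs).
    for field in ("type", "entity_id"):
        if field not in event:
            raise ValueError(f"event missing required field: {field}")

    # 2. Look up the source-system endpoint for this event type.
    template = ENDPOINTS.get(event["type"])
    if template is None:
        raise ValueError(f"no hydration endpoint for event type: {event['type']}")

    # 3. Call the source API with the identifier carried by the event.
    #    `fetch` is injected so the sketch stays independent of any HTTP client.
    response = fetch(template.format(entity_id=event["entity_id"]))

    # 4. Transform the response into the consumer's internal model.
    return Entity(entity_id=event["entity_id"], attributes=dict(response))
```

Note that the event contributes only its type and identifier; everything stored on `Entity.attributes` comes from the source response.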

In Netflix MDS, this looks like:

Event                     Hydration call
model_instance_created    GET /api/v1/instances/{instance_id}
pipeline_run_completed    GET /api/v1/pipeline-runs/{run_id}
ab_test_updated           GET /api/v1/tests/{test_id}

The handlers are per-source-system — each source has its own event schema and its own hydration endpoint, but the consumer-side shape is uniform.
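One way to get a uniform consumer-side shape over per-source endpoints is a routing table. The sketch below reuses the three event types and paths listed above; the `HydrationRoute` type, the `hydration_path` helper, and the assumption that each event carries its identifier under the same name as the path placeholder are hypothetical.

```python
from typing import NamedTuple


class HydrationRoute(NamedTuple):
    id_field: str        # assumed name of the identifier field in the event
    path_template: str   # hydration endpoint for this event type


# One entry per event type; adding a source means adding a row, not new logic.
ROUTES = {
    "model_instance_created": HydrationRoute("instance_id", "/api/v1/instances/{instance_id}"),
    "pipeline_run_completed": HydrationRoute("run_id", "/api/v1/pipeline-runs/{run_id}"),
    "ab_test_updated": HydrationRoute("test_id", "/api/v1/tests/{test_id}"),
}


def hydration_path(event: dict) -> str:
    """Resolve the source endpoint to call for a thin change event."""
    route = ROUTES[event["type"]]
    return route.path_template.format(**{route.id_field: event[route.id_field]})
```

For example, `hydration_path({"type": "pipeline_run_completed", "run_id": "r-42"})` resolves to `/api/v1/pipeline-runs/r-42`.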

Why hydrate vs trust the payload?

Two structural reasons hydration beats payload-trust:

1. Order independence

The source API returns current state regardless of which event arrived first. So:

  • Out-of-order events → each hydration call reads the state as of call time, so the most recent call leaves the consumer with the correct current state, whichever event triggered it.
  • Dropped events → the next event for the same entity catches the state up.
  • Replayed events → idempotent; replay does no harm.

By contrast, if the event payload carried state, the consumer would need version vectors or last-write-wins (LWW) timestamp logic to handle out-of-order arrivals, and a dropped event would leave the consumer with stale state until a backfill or a later event for the same entity corrected it.
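A toy simulation makes the difference concrete. Everything here is hypothetical (the `Source` class, the event shapes, and the values): two writes happen at the source, and the resulting events reach the consumer in reverse order.

```python
class Source:
    """Toy source of truth holding the authoritative current state."""

    def __init__(self):
        self.state = {}

    def write(self, entity_id, value):
        self.state[entity_id] = value
        # Emit both flavours for comparison: a thin event (id only)
        # and a fat event (id plus the state at write time).
        thin = {"entity_id": entity_id}
        fat = {"entity_id": entity_id, "value": value}
        return thin, fat

    def get(self, entity_id):
        return self.state[entity_id]


source = Source()
thin1, fat1 = source.write("m1", "v1")
thin2, fat2 = source.write("m1", "v2")

hydrated = {}  # consumer that calls back to the source
trusted = {}   # consumer that trusts the payload, with no ordering logic

# Deliver the second event before the first.
for thin, fat in [(thin2, fat2), (thin1, fat1)]:
    hydrated[thin["entity_id"]] = source.get(thin["entity_id"])
    trusted[fat["entity_id"]] = fat["value"]

print(hydrated)  # {'m1': 'v2'}  current state
print(trusted)   # {'m1': 'v1'}  stale: the older payload overwrote the newer one
```

The payload-trusting consumer could be repaired with version checks, but the hydrating consumer needs no such logic.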

2. Producer simplicity

"This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements." (sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph)

Producers don't need to know which consumers exist, what fields each consumer wants, or how to serialize their internal state for external consumption. They emit a thin notification-of-change event and let consumers pull what they need.

The cost: source-API read amplification

Every change → at least one hydration call to the source. For hot entities or chatty source systems, this can multiply API load:

"The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don't overload them." (sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph)

Mitigations the post names explicitly:

  • Rate limiting in enrichment workers — bounded per-source call rate.
  • Caching at the consumer — short-TTL cache of recent hydration results.
  • Backoff — exponential backoff on source-side throttling.
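The three mitigations can be combined in one small client wrapper. The sketch below is an assumption about how such a worker might look, not the approach described in the post: `HydrationClient`, the `Throttled` exception, and all parameter defaults are invented for illustration, and `clock` and `sleep` are injectable so the behaviour can be exercised without real waiting.

```python
import time


class Throttled(Exception):
    """Hypothetical error raised when the source API signals throttling."""


class HydrationClient:
    def __init__(self, fetch, *, min_interval=0.1, ttl=5.0, max_retries=4,
                 base_delay=0.2, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval  # rate limit: minimum gap between calls
        self._ttl = ttl                    # cache lifetime in seconds
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._clock = clock
        self._sleep = sleep
        self._cache = {}                   # path -> (expires_at, value)
        self._next_allowed = 0.0

    def get(self, path):
        # Caching: serve a recent hydration result without touching the source.
        hit = self._cache.get(path)
        if hit is not None and hit[0] > self._clock():
            return hit[1]

        for attempt in range(self._max_retries + 1):
            # Rate limiting: space source calls at least min_interval apart.
            wait = self._next_allowed - self._clock()
            if wait > 0:
                self._sleep(wait)
            self._next_allowed = self._clock() + self._min_interval

            try:
                value = self._fetch(path)
            except Throttled:
                if attempt == self._max_retries:
                    raise
                # Backoff: double the delay after each throttled attempt.
                self._sleep(self._base_delay * 2 ** attempt)
                continue

            self._cache[path] = (self._clock() + self._ttl, value)
            return value
```

The cache trades freshness for load: within the TTL, a cached result may lag the source, so the TTL should be short relative to how stale the consumer can tolerate being.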

Implicit but standard practice:

  • Coalescing — if multiple events for the same entity arrive in a short window, hydrate once.
  • Bulk hydration APIs — if the source supports it, fetch N entities per call.
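Coalescing and bulk hydration compose naturally: buffer events for a short window, deduplicate by entity, then fetch in chunks. In this sketch the `entity_id` field, the `fetch_many` callable (standing in for a bulk endpoint that returns a mapping of id to state), and the `batch_size` default are all assumptions.

```python
from typing import Callable, Dict, Iterable, List


def coalesce_and_hydrate(
    events: Iterable[dict],
    fetch_many: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
) -> Dict[str, dict]:
    """Hydrate each distinct entity in a buffered window of events once."""
    # Coalescing: keep one entry per entity, preserving first-seen order.
    unique_ids = list(dict.fromkeys(e["entity_id"] for e in events))

    results: Dict[str, dict] = {}
    # Bulk hydration: fetch up to batch_size entities per source call.
    for start in range(0, len(unique_ids), batch_size):
        chunk = unique_ids[start:start + batch_size]
        results.update(fetch_many(chunk))
    return results
```

Because hydration returns current state, dropping the duplicate events loses nothing: five events for one entity and one event for that entity yield the same result.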

Distinct from CDC payload-trust

Change data capture (CDC) tools such as Debezium emit events containing the row data before and after the change. The consumer trusts the payload. This avoids hydration read amplification, but:

  • The event payload is large.
  • The consumer is tightly coupled to the source-DB schema.
  • Filtering / projection has to happen consumer-side or in the CDC pipeline, not in the source application.

Source-of-truth hydration is the inverse trade: small events, loose coupling to source schemas, source-API read load.
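The two event shapes can be compared side by side. The CDC event below is a simplified, Debezium-style envelope (`before`, `after`, `op`, `source`, `ts_ms`); the table name, columns, and values are invented, and the thin event reuses an event type named earlier on this page.

```python
import json

# Thin notification-of-change event: identifies what changed, nothing more.
thin_event = {"type": "model_instance_created", "instance_id": "inst-123"}

# Simplified CDC-style event: carries full row images from the source database.
cdc_event = {
    "op": "u",
    "ts_ms": 1700000000000,
    "source": {"table": "model_instances"},
    "before": {"id": "inst-123", "status": "training", "owner": "team-a", "framework": "jax"},
    "after": {"id": "inst-123", "status": "ready", "owner": "team-a", "framework": "jax"},
}

thin_size = len(json.dumps(thin_event))
cdc_size = len(json.dumps(cdc_event))

# The CDC consumer can compute the delta locally, with no call to the source,
# but only by knowing the source table's column names.
changed = {
    column
    for column in cdc_event["after"]
    if cdc_event["before"][column] != cdc_event["after"][column]
}

print(thin_size < cdc_size)  # True
print(changed)               # {'status'}
```

The thin event is smaller and schema-agnostic, but the consumer learns the new status only by calling the source API.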

Generalizes beyond MDS

The pattern recurs across systems where:

  • The producer is an authoritative API service that already exposes a query endpoint.
  • The consumer is an out-of-process system (catalog, search index, metadata graph, downstream ML pipeline).
  • State consistency matters more than event-completeness or delta-fidelity.

It is the dominant ingestion shape for catalog / metadata services that mirror state from many upstream sources, where each source already has its own canonical API.

Seen in
