Netflix Data Canary Orchestrator

The Data Canary Orchestrator is Netflix's system for validating catalog metadata transformations against production traffic before publishing to the broader fleet. It detects data corruption in under 10 minutes and automatically blocks bad data from reaching members.

Architecture

Three components form the core:

  1. Orchestrator Instance: a dedicated instance of Netflix's catalog metadata service that coordinates the validation flow. Each time a new catalog version is published to the canary environment, it checks that the baseline and canary clusters are healthy and version-synchronized, then triggers a chaos experiment via ChAP (Netflix's Chaos Automation Platform).

  2. Permanent Baseline Cluster: always serves the latest known-good production catalog version. It provides the control group for comparison.

  3. Permanent Canary Cluster: receives new catalog versions for validation. It provides the experimental group.
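
The pre-flight check the orchestrator performs before triggering an experiment can be sketched as follows. This is a minimal illustration, not Netflix's implementation: `ClusterStatus`, `ready_for_experiment`, and the integer version numbers are hypothetical, and the code assumes that "version-synchronized" means the baseline is on the last known-good version while the canary is on the candidate version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStatus:
    """Hypothetical snapshot of one cluster's state."""
    name: str
    healthy: bool
    catalog_version: int


def ready_for_experiment(baseline, canary, last_known_good, candidate):
    """Return (ok, reason) for starting an experiment.

    Assumption: alignment means baseline serves the last known-good
    version and canary serves the candidate version.
    """
    if not baseline.healthy or not canary.healthy:
        return False, "unhealthy cluster"
    if baseline.catalog_version != last_known_good:
        return False, "baseline not on last known-good version"
    if canary.catalog_version != candidate:
        return False, "canary not on candidate version"
    return True, "ok"
```

Returning a reason alongside the boolean keeps a skipped experiment explainable in logs, which matters when a publish is delayed because a cluster was still catching up.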

Validation flow

  1. New catalog version published to canary environment
  2. Orchestrator confirms baseline + canary are healthy and version-aligned
  3. Orchestrator triggers a chaos experiment routing ~0.2% of production traffic through both clusters
  4. Sticky canaries (session affinity) ensure each user stays on baseline or canary for the duration
  5. SPS (Starts Per Second) is streamed in real time as the primary behavioral metric
  6. On regression detection → immediate abort + publish blocked
  7. On success → orchestrator reports results via generic REST endpoint to the transformer service
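
The steps above can be sketched as a single control loop. Everything here is an assumption made for illustration: the ChAP API is not described in the source, so the experiment, metric, and reporting operations are passed in as plain callables with hypothetical names.

```python
from enum import Enum
from typing import Callable, Iterable, Tuple


class Outcome(Enum):
    SKIPPED = "skipped"   # clusters not ready, no experiment started
    BLOCKED = "blocked"   # regression seen, publish blocked
    PASSED = "passed"     # experiment finished without a regression


def run_validation(
    clusters_ready: Callable[[], bool],
    start_experiment: Callable[[], str],
    sps_samples: Callable[[str], Iterable[Tuple[float, float]]],
    is_regression: Callable[[float, float], bool],
    abort_experiment: Callable[[str], None],
    block_publish: Callable[[], None],
    report_success: Callable[[], None],
) -> Outcome:
    """Walk the validation flow once for a newly published catalog version."""
    # Step 2: baseline and canary must be healthy and version-aligned.
    if not clusters_ready():
        return Outcome.SKIPPED
    # Step 3: trigger the experiment (hypothetical stand-in for ChAP).
    experiment_id = start_experiment()
    # Step 5: consume (baseline_sps, canary_sps) pairs as they stream in.
    for baseline_sps, canary_sps in sps_samples(experiment_id):
        # Step 6: abort on the first detected regression.
        if is_regression(baseline_sps, canary_sps):
            abort_experiment(experiment_id)
            block_publish()
            return Outcome.BLOCKED
    # Step 7: no regression observed, report success.
    report_success()
    return Outcome.PASSED
```

Injecting the collaborators as callables keeps the loop testable without a real experiment platform or metrics stream.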

Production-hardening details

  • In-flight experiment continuity: after a restart, the orchestrator must detect any ongoing experiment and resume polling it rather than start a new one
  • Leader election: prevents duplicate experiments when multiple orchestrator instances run during deployments
  • Version synchronization: tracks per-client-type version state to ensure proper baseline/canary alignment in a multi-tenant service
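
One way the first two points could fit together is shown below, as a sketch under stated assumptions. The source does not say how leader election or experiment state is implemented, so `LeaseStore` is a hypothetical in-memory stand-in for a shared lock service, the 30-second TTL is arbitrary, and the in-flight experiment ID is assumed to be persisted somewhere that survives a restart. Version synchronization is not covered here.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared lease/lock service (hypothetical)."""

    def __init__(self):
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, instance_id, ttl_seconds, now=None):
        """Acquire or renew the lease; return True if instance_id is leader."""
        now = time.monotonic() if now is None else now
        free_or_mine = self._holder in (None, instance_id)
        if free_or_mine or now >= self._expires_at:
            self._holder = instance_id
            self._expires_at = now + ttl_seconds
            return True
        return False


def next_action(store, instance_id, in_flight_experiment_id, now=None):
    """Decide what an orchestrator instance should do on startup or on a tick."""
    # Only the lease holder may touch experiments, so two instances
    # running side by side during a deployment cannot both start one.
    if not store.try_acquire(instance_id, ttl_seconds=30, now=now):
        return ("standby", None)
    # A persisted experiment ID means a previous leader started it:
    # resume polling instead of launching a duplicate.
    if in_flight_experiment_id is not None:
        return ("resume_polling", in_flight_experiment_id)
    return ("start_new", None)
```

For example, if instance `a` starts an experiment and is then replaced, instance `b` stays on standby until the lease expires and then resumes polling the same experiment.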

Design decisions

  • Production traffic over shadow traffic: shadow traffic can replay requests to the catalog service but cannot simulate the full playback lifecycle across multiple services and domains
  • Behavioral metrics over technical metrics: SPS directly measures customer impact; data errors may not manifest as application errors
  • Immediate abort over statistical confidence: trades some statistical confidence for speed within the tight 10-minute window
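
The last trade-off can be illustrated with a threshold check that fires on the first bad sample instead of waiting for a statistical test to converge. This is a sketch, not the detection logic Netflix uses: the function name, the relative-drop formula, and the 5% default threshold are all assumptions.

```python
def first_regression(samples, max_drop=0.05):
    """Return the index of the first sample showing a regression, else None.

    samples: iterable of (baseline_sps, canary_sps) pairs.
    max_drop: hypothetical tolerated relative drop of canary vs. baseline.
    """
    for i, (baseline_sps, canary_sps) in enumerate(samples):
        if baseline_sps <= 0:
            # No baseline signal to compare against; skip this sample.
            continue
        drop = (baseline_sps - canary_sps) / baseline_sps
        if drop > max_drop:
            return i
    return None
```

A single-sample trigger like this reacts quickly but is more exposed to noise than a confidence-based test, which is the cost the design accepts to stay inside the validation window.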

Operational numbers

  • Detection speed: 2.5–4 minutes depending on client type
  • Error differential on corruption: 10× between canary and baseline
  • End-to-end validation window: <10 minutes
  • Traditional code canary analysis: 30–60 minutes (too slow for data cycles)

Extensibility

Because results are reported through a generic REST interface, other Netflix teams can implement their own orchestrator patterns for different data sources without changing transformer code. This extensibility was a deliberate design goal.
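
To make the idea of a generic reporting interface concrete, the sketch below builds (but does not send) a result request using Python's standard library. The source does not describe the endpoint or its schema, so the URL, the field names `catalogVersion`, `passed`, and `source`, and the function name are all hypothetical.

```python
import json
from urllib.request import Request


def build_result_request(endpoint_url, catalog_version, passed,
                         source="data-canary-orchestrator"):
    """Build a POST request carrying a validation result.

    The payload shape is an assumption for illustration only.
    """
    body = json.dumps({
        "catalogVersion": catalog_version,
        "passed": passed,
        "source": source,  # lets other orchestrators identify themselves
    }).encode("utf-8")
    return Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because the payload carries only a version, a verdict, and a source label, a different team's orchestrator could report through the same endpoint by changing `source`, with no change on the receiving side.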
