Netflix ChAP (Chaos Automation Platform)

ChAP (Chaos Automation Platform) is Netflix's internal platform for running automated chaos experiments. It provides the experiment orchestration, metric collection, and threshold-based decision-making infrastructure that teams use to validate resilience hypotheses against production traffic.

Role in the Data Canary

The Data Canary Orchestrator triggers ChAP experiments to validate new catalog metadata versions. This required extending ChAP beyond its original code-deployment-focused design:

  • Custom threshold tuning: standard chaos experiment thresholds were too conservative for the 10-minute validation window, so the thresholds were customized for data validation together with Netflix's Resilience team.
  • Multi-tenant experiment support: Netflix's catalog service serves multiple client types with different traffic patterns. Running a separate experiment per major client type revealed that the playback-request tenant surfaces failures fastest.
  • Immediate abort on regression: rather than collecting data for post-hoc analysis, ChAP streams metrics in real time and aborts an experiment the moment a regression is detected.
  • Sticky canaries: session-affinity routing keeps each user on either baseline or canary for the duration of the experiment, preventing cross-contamination between the two groups.
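
The last two items can be illustrated with a minimal sketch. This is not ChAP's API: `assign_group`, `Sample`, `run_experiment`, `max_error_rate_delta`, and `min_requests` are hypothetical names, and using a cumulative error-rate difference as the regression signal is an assumption made for illustration. The sketch shows one simple way to keep a user on one side of an experiment (a stable hash of the user and experiment IDs) and one way to stop consuming a metric stream as soon as the canary regresses.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Tuple


def assign_group(user_id: str, experiment_id: str, canary_fraction: float = 0.5) -> str:
    """Deterministically map a user to 'canary' or 'baseline' (hypothetical helper).

    Hashing (experiment_id, user_id) yields the same answer on every request,
    so a user stays in one group for the whole experiment.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "canary" if bucket < canary_fraction else "baseline"


@dataclass
class Sample:
    """One metrics tick: error and request counts for each group (assumed shape)."""
    baseline_errors: int
    baseline_requests: int
    canary_errors: int
    canary_requests: int


def run_experiment(
    samples: Iterable[Sample],
    max_error_rate_delta: float,
    min_requests: int = 100,
) -> Tuple[str, int]:
    """Consume samples as they arrive and abort on the first detected regression.

    Returns (verdict, ticks_consumed), where verdict is 'aborted' or 'passed'.
    """
    b_err = b_req = c_err = c_req = 0
    ticks = 0
    for s in samples:
        ticks += 1
        b_err += s.baseline_errors
        b_req += s.baseline_requests
        c_err += s.canary_errors
        c_req += s.canary_requests
        # Wait for enough traffic in both groups before judging.
        if min(b_req, c_req) < min_requests:
            continue
        delta = c_err / c_req - b_err / b_req
        if delta > max_error_rate_delta:
            return "aborted", ticks  # stop immediately; later samples are never read
    return "passed", ticks


if __name__ == "__main__":
    stream = [
        Sample(1, 100, 1, 100),
        Sample(1, 100, 10, 100),  # canary error rate jumps on this tick
        Sample(1, 100, 50, 100),
    ]
    print(run_experiment(stream, max_error_rate_delta=0.02))
    print(assign_group("user-123", "exp-1") == assign_group("user-123", "exp-1"))
```

Because `run_experiment` accepts any iterable, it can be fed a generator backed by a live metric stream, and returning inside the loop means no further samples are pulled once the threshold is crossed. Per-tenant experiments would, under the same assumptions, amount to calling it once per client type with that tenant's own threshold.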

Prior art

ChAP is referenced in a prior Netflix TechBlog post as the platform underlying Netflix's broader chaos engineering practice. It sits above the Simian Army tools as the orchestration and evaluation layer: the simians inject failures, and ChAP evaluates whether the system tolerated them.
