CONCEPT Cited by 1 source

Behavioral metric as primary signal

Using customer-behavior metrics (what users actually do) rather than infrastructure metrics (latency, error rates, CPU) as the primary signal for detecting system degradation — particularly data corruption that may not manifest as technical errors.

Definition

A behavioral metric directly measures customer impact:

  • Starts Per Second (SPS) — actual playback attempts at Netflix
  • Conversion rate, checkout completions, search result clicks
  • User engagement signals (session duration, interaction rate)

Infrastructure metrics (p99 latency, 5xx error rate, connection count) measure system health but can be blind to data-layer corruption, where the service returns technically correct but semantically wrong responses.
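To make the SPS definition concrete, here is a minimal sketch of how a starts-per-second series could be derived from raw playback-start timestamps. The function names and the input format (epoch seconds as floats) are assumptions for illustration, not Netflix's actual pipeline.

```python
import math
from collections import Counter


def starts_per_second(start_timestamps, window_start, window_end):
    """Count playback starts in each whole second of [window_start, window_end).

    start_timestamps: iterable of epoch-second floats (hypothetical input format).
    window_start, window_end: integer second boundaries.
    """
    counts = Counter(
        math.floor(ts)
        for ts in start_timestamps
        if window_start <= ts < window_end
    )
    # Seconds with no starts are reported as 0 rather than omitted,
    # so a total outage shows up as a run of zeros.
    return [counts.get(sec, 0) for sec in range(window_start, window_end)]


def mean_sps(start_timestamps, window_start, window_end):
    """Average starts per second over the window (0.0 for an empty window)."""
    per_second = starts_per_second(start_timestamps, window_start, window_end)
    return sum(per_second) / len(per_second) if per_second else 0.0
```

For example, `starts_per_second([0.1, 0.5, 1.2, 3.9], 0, 4)` yields one count per second of the window, including a zero for the second in which nothing started.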

Why behavioral beats technical for data corruption

Netflix's experience with its catalog metadata canary showed that:

"SPS proved more reliable than latency or error rates for detecting catalog corruption because it directly measures customer impact, and data errors may not always manifest as application errors."

A service can return HTTP 200 with technically valid but semantically corrupt data. Latency may be normal. Error rates may be zero. But if the returned catalog tells the player a title doesn't exist, playback never starts — SPS drops.
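The scenario above can be illustrated with a toy simulation. Everything here (the `Response` type, the `serve` function, the dictionary-based catalog) is a hypothetical stand-in, not a real Netflix API. It shows only that the 5xx error rate can stay at zero while the number of playback starts falls.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int            # HTTP status seen by infrastructure monitoring
    title_available: bool  # what the catalog tells the player


def serve(catalog, title_id):
    # The lookup always succeeds at the HTTP level, even when the entry is wrong.
    return Response(status=200, title_available=catalog.get(title_id, False))


def summarize(catalog, requested_titles):
    """Return (5xx error rate, number of playback starts) for a batch of requests."""
    responses = [serve(catalog, t) for t in requested_titles]
    error_rate = sum(r.status >= 500 for r in responses) / len(responses)
    starts = sum(r.title_available for r in responses)
    return error_rate, starts


healthy_catalog = {"t1": True, "t2": True, "t3": True}
# Hypothetical corruption: two titles are wrongly marked as unavailable.
corrupt_catalog = {"t1": True, "t2": False, "t3": False}
requested_titles = ["t1", "t2", "t3", "t1"]

healthy_summary = summarize(healthy_catalog, requested_titles)
corrupt_summary = summarize(corrupt_catalog, requested_titles)
```

In this sketch both summaries report an error rate of 0.0, but the corrupt catalog produces half as many starts. Only the behavioral count distinguishes the two.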

Trade-offs

| Aspect          | Behavioral metrics    | Technical metrics      |
|-----------------|-----------------------|------------------------|
| Data corruption | ✅ Direct signal      | ❌ Often blind         |
| Code bugs       | ✅ Catches impact     | ✅ Catches directly    |
| Signal speed    | Slower (aggregate)    | Faster (per-request)   |
| Noise floor     | Higher (user variance)| Lower (deterministic)  |
| False negatives | Lower for data issues | Higher for data issues |
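The noise-floor and signal-speed rows interact: because a behavioral metric varies with user activity, a detector usually has to aggregate over a window and allow for normal variance before alerting. Below is a minimal sketch that uses a mean-minus-k-standard-deviations threshold. The threshold rule and the default `k=3.0` are illustrative assumptions, not a method described in the source.

```python
import statistics


def sps_drop_detected(baseline_sps, current_sps, k=3.0):
    """Flag a drop when the current mean SPS falls below baseline mean - k * stdev.

    baseline_sps: per-second (or per-interval) SPS samples from a known-good period;
                  needs at least two samples for a standard deviation.
    current_sps:  samples from the window being evaluated.
    k:            hypothetical sensitivity knob; larger k tolerates more noise
                  but needs a bigger drop before it fires.
    """
    baseline_mean = statistics.fmean(baseline_sps)
    baseline_std = statistics.stdev(baseline_sps)
    threshold = baseline_mean - k * baseline_std
    return statistics.fmean(current_sps) < threshold
```

A noisier baseline widens the tolerated band, which is one way the higher noise floor translates into slower or less sensitive detection.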

Netflix's application

In the Data Canary, SPS is the primary experiment signal. Multi-tenant testing revealed that running traffic through the playback-request tenant consistently identified failures fastest — because playback directly exercises the full metadata path.
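As a rough sketch of using SPS as the experiment signal, the function below compares playback starts between a baseline cohort and a canary cohort over the same window. The function name, the verdict strings, and the 5% `max_relative_drop` default are assumptions for illustration. The source does not describe Netflix's actual decision rule.

```python
def canary_verdict(baseline_starts, canary_starts, window_seconds,
                   max_relative_drop=0.05):
    """Compare canary SPS against baseline SPS over one observation window.

    Assumes both cohorts receive comparable traffic (hypothetical setup).
    Returns "pass", "fail", or "inconclusive".
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    baseline_sps = baseline_starts / window_seconds
    canary_sps = canary_starts / window_seconds

    if baseline_sps == 0:
        # With no baseline playback there is nothing to compare against.
        return "inconclusive"

    relative_drop = (baseline_sps - canary_sps) / baseline_sps
    return "fail" if relative_drop > max_relative_drop else "pass"
```

With these defaults, a canary at 99 SPS against a baseline of 100 SPS passes, while a canary at 80 SPS fails.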

