CONCEPT Cited by 1 source
Behavioral metric as primary signal¶
Using customer-behavior metrics (what users actually do) rather than infrastructure metrics (latency, error rates, CPU) as the primary signal for detecting system degradation — particularly data corruption that may not manifest as technical errors.
Definition¶
A behavioral metric directly measures customer impact:
- Starts Per Second (SPS) — actual playback attempts at Netflix
- Conversion rate, checkout completions, search result clicks
- User engagement signals (session duration, interaction rate)
Infrastructure metrics (p99 latency, 5xx error rate, connection count) measure system health but may be blind to data-layer corruption, where a service returns responses that are technically correct but semantically wrong.
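As a rough illustration of what a behavioral metric like SPS measures, the sketch below counts playback-start events in a time window and divides by the window length. The function name `starts_per_second` and the timestamp-list input are assumptions made for this example; the source does not describe how Netflix actually computes SPS.

```python
def starts_per_second(start_timestamps, window_start, window_end):
    """Mean playback starts per second over [window_start, window_end).

    start_timestamps: iterable of event times in seconds (hypothetical input
    format; a real pipeline would likely read from a metrics stream).
    """
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    # Count only the events that fall inside the half-open window.
    in_window = sum(1 for t in start_timestamps if window_start <= t < window_end)
    return in_window / (window_end - window_start)


# Four starts fall inside the 4-second window; the event at t=10.0 is excluded.
sps = starts_per_second([0.5, 1.2, 1.9, 3.0, 10.0], window_start=0, window_end=4)
```

Note that nothing in this calculation looks at status codes or latency: the metric only moves when users actually begin playback.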
Why behavioral beats technical for data corruption¶
Netflix found, through its catalog metadata canary, that:
"SPS proved more reliable than latency or error rates for detecting catalog corruption because it directly measures customer impact, and data errors may not always manifest as application errors."
A service can return HTTP 200 with technically valid but semantically corrupt data. Latency may be normal. Error rates may be zero. But if the returned catalog tells the player a title doesn't exist, playback never starts — SPS drops.
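The scenario above can be sketched as a toy simulation. Everything here is hypothetical (the `Response` type, the `serve` and `summarize` helpers, and the assumption that a corrupt catalog entry simply marks a title as unavailable); it only aims to show how the error rate can stay at zero while SPS falls.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int            # HTTP status code returned by the service
    title_available: bool  # what the (possibly corrupt) catalog payload claims


def serve(num_requests, corrupt_fraction):
    """Hypothetical service: every request succeeds with HTTP 200, but a
    fraction of responses carry catalog data saying the title does not exist."""
    num_corrupt = int(num_requests * corrupt_fraction)
    return [
        Response(status=200, title_available=(i >= num_corrupt))
        for i in range(num_requests)
    ]


def summarize(responses, window_seconds):
    """Compute one technical metric (5xx rate) and one behavioral metric (SPS)."""
    errors = sum(1 for r in responses if r.status >= 500)
    # Playback starts only when the catalog says the title exists.
    starts = sum(1 for r in responses if r.status == 200 and r.title_available)
    return {
        "error_rate": errors / len(responses),
        "sps": starts / window_seconds,
    }


healthy = summarize(serve(1000, corrupt_fraction=0.0), window_seconds=10)
corrupt = summarize(serve(1000, corrupt_fraction=0.4), window_seconds=10)
```

In this toy model both runs report an error rate of zero, while SPS falls from 100 to 60 in the corrupt run, which is the gap the technical metric cannot see.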
Trade-offs¶
| | Behavioral metrics | Technical metrics |
|---|---|---|
| Data corruption | ✅ Direct signal | ❌ Often blind |
| Code bugs | ✅ Catches impact | ✅ Catches directly |
| Signal speed | Slower (aggregate) | Faster (per-request) |
| Noise floor | Higher (user variance) | Lower (deterministic) |
| False negatives | Lower for data issues | Higher for data issues |
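The "Signal speed" and "Noise floor" rows interact: smoothing a noisy aggregate over a longer window suppresses false alarms but delays detection. The sketch below is a minimal illustration under assumed numbers; the `first_alert` helper, the sample counts, and the threshold of 90 are all invented for this example and do not come from the source.

```python
def first_alert(counts, window, threshold):
    """Return the index of the first sample at which the trailing mean over
    `window` samples drops below `threshold`, or None if it never does."""
    for end in range(window, len(counts) + 1):
        smoothed = sum(counts[end - window:end]) / window
        if smoothed < threshold:
            return end - 1  # index of the newest sample in the window
    return None


# Hypothetical per-second start counts: normal user variance around 100,
# then a real drop to about 60 beginning at index 8.
counts = [100, 96, 104, 88, 103, 99, 101, 97, 60, 62, 58, 61, 59, 60, 61, 60]

unsmoothed = first_alert(counts, window=1, threshold=90)  # fires on the noisy dip at index 3
smoothed = first_alert(counts, window=5, threshold=90)    # fires at index 9, after the real drop
```

With no smoothing, the alert fires on ordinary variance before anything is wrong; with a five-sample window it ignores the dip but reacts one sample after the real drop begins.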
Netflix's application¶
In the Data Canary, SPS is the primary experiment signal. Multi-tenant testing revealed that running traffic through the playback-request tenant consistently identified failures fastest — because playback directly exercises the full metadata path.
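A canary check that treats SPS as the primary signal might look like the following sketch. It compares mean SPS between a baseline group and a canary group and fails the canary if the relative drop exceeds a threshold. The function name `canary_verdict`, the 5% default threshold, and the simple mean comparison are assumptions for illustration; the source does not describe the statistical test Netflix uses.

```python
from statistics import mean


def canary_verdict(baseline_sps, canary_sps, max_relative_drop=0.05):
    """Compare SPS samples from baseline and canary traffic.

    baseline_sps, canary_sps: non-empty lists of SPS samples (hypothetical inputs).
    max_relative_drop: assumed tolerance; a real system would tune this and
    likely use a proper statistical test rather than a fixed threshold.
    """
    baseline_mean = mean(baseline_sps)
    canary_mean = mean(canary_sps)
    if baseline_mean <= 0:
        raise ValueError("baseline SPS must be positive to compute a relative drop")
    relative_drop = (baseline_mean - canary_mean) / baseline_mean
    return {
        "relative_drop": relative_drop,
        "passed": relative_drop <= max_relative_drop,
    }


ok = canary_verdict([100, 102, 98, 101], [99, 101, 97, 100])
bad = canary_verdict([100, 102, 98, 101], [70, 72, 69, 71])
```

Here `ok` passes because the canary's SPS is within about 1% of baseline, while `bad` fails with a drop of roughly 30%, even though neither comparison consults latency or error rates.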