CONCEPT Cited by 1 source
Customer-facing SLA
Definition
A customer-facing SLA is a quantitative promise stated in metrics the customer perceives directly, rather than in internal system metrics. For a shared SQL query service such as Presto at Meta scale, the two headline examples from sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale are:
- Queueing time — how long a submitted query sits before a cluster starts executing it.
- Query failure rate — what fraction of submitted queries fail.
Both are things a data analyst, ML engineer, or product-team user notices directly; neither is a CPU, disk, or network metric.
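As a rough illustration of how the two metrics above could be computed from per-query records, here is a minimal Python sketch. The `QueryRecord` shape, its field names, and the choice of a nearest-rank p95 are assumptions made for this example; they are not taken from the source or from Presto itself. Queries that never started executing are left out of the queueing-time sample here, which is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryRecord:
    # Hypothetical record shape; field names are illustrative only.
    submitted_at: float           # seconds since epoch
    started_at: Optional[float]   # None if the query never started executing
    failed: bool


def queueing_times(records: List[QueryRecord]) -> List[float]:
    """Seconds each query waited between submission and start of execution."""
    return [r.started_at - r.submitted_at for r in records if r.started_at is not None]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def failure_rate(records: List[QueryRecord]) -> float:
    """Fraction of submitted queries that failed."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.failed) / len(records)


records = [
    QueryRecord(submitted_at=0.0, started_at=2.0, failed=False),
    QueryRecord(submitted_at=1.0, started_at=4.0, failed=False),
    QueryRecord(submitted_at=2.0, started_at=12.0, failed=True),
    QueryRecord(submitted_at=3.0, started_at=4.0, failed=False),
]

p95 = percentile(queueing_times(records), 95)
print(f"p95 queueing time: {p95:.1f}s, failure rate: {failure_rate(records):.2%}")
```

Both functions take only what a user could observe (when they submitted, when execution began, whether the query failed), which is the point of defining the SLA on these metrics.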
Why it matters at scale
"Defining SLAs around important metrics like queueing time and query failure rate in a manner that tracks customer pain points becomes crucial as Presto is scaled up. When there is a large number of users, the lack of proper SLAs can greatly hinder efforts to mitigate production issues because of confusion in determining the impact of an incident." — Meta, on scaling Presto (sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale)
Three structural roles the customer-facing SLA plays:
- Shared definition of "bad." Without it, an oncall and a product-team user cannot even agree whether an incident is ongoing.
- Alert trigger. Monitoring fires when an SLA metric breaches its threshold; that alert is what triggers oncall analyzers and automated remediation.
- Prioritization of engineering investment. Where the SLA is breached regularly, automation (canary pipelines, bad-host drain, gateway throttling) gets prioritized; where it is not, ad-hoc fixes suffice.
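The alert-trigger and prioritization roles above can be sketched as a threshold check over time windows. This is a hedged example: `SlaTarget`, `WindowMetrics`, and the threshold values (30 seconds, 1%) are hypothetical and chosen only for illustration, since the source does not publish Meta's actual targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaTarget:
    # Illustrative thresholds; real values are not given in the source.
    max_p95_queue_seconds: float
    max_failure_rate: float


@dataclass(frozen=True)
class WindowMetrics:
    # Customer-facing metrics aggregated over one monitoring window.
    p95_queue_seconds: float
    failure_rate: float


def breaches(target: SlaTarget, metrics: WindowMetrics) -> List[str]:
    """Names of the SLA metrics this window violates (empty list if none)."""
    violated = []
    if metrics.p95_queue_seconds > target.max_p95_queue_seconds:
        violated.append("queueing_time")
    if metrics.failure_rate > target.max_failure_rate:
        violated.append("failure_rate")
    return violated


def breach_ratio(target: SlaTarget, windows: List[WindowMetrics]) -> float:
    """Fraction of windows with at least one breach; a rough prioritization signal."""
    if not windows:
        return 0.0
    return sum(1 for w in windows if breaches(target, w)) / len(windows)


target = SlaTarget(max_p95_queue_seconds=30.0, max_failure_rate=0.01)
windows = [
    WindowMetrics(p95_queue_seconds=12.0, failure_rate=0.002),
    WindowMetrics(p95_queue_seconds=45.0, failure_rate=0.004),
    WindowMetrics(p95_queue_seconds=50.0, failure_rate=0.030),
    WindowMetrics(p95_queue_seconds=10.0, failure_rate=0.001),
]

for i, w in enumerate(windows):
    print(i, breaches(target, w))
print("breach ratio:", breach_ratio(target, windows))
```

A non-empty result from `breaches` is the point at which an alert would fire, while a persistently high `breach_ratio` is the kind of evidence that could justify investing in automation instead of ad-hoc fixes.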
Distinguishing from system SLOs
- System SLO: "coordinator p99 memory < 80%". Operator-facing.
- Customer-facing SLA: "p95 queueing time < N seconds". User-facing.
The first is a leading indicator for the second but not a substitute. Meta's insight: tie the alerting and automation stack to the customer-facing numbers, and let system SLOs be diagnostic supporting evidence.
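One way to picture this split is an alert builder that pages only on customer-facing breaches and carries system SLO violations along as context. The function name, the dictionary layout, and the metric labels below are assumptions for illustration, not a description of Meta's tooling.

```python
from typing import Dict, List, Optional


def build_alert(
    sla_breaches: List[str],
    system_slo_violations: List[str],
) -> Optional[Dict[str, object]]:
    """Return a page only when a customer-facing SLA is breached.

    System SLO violations never page on their own in this sketch; they are
    attached to the alert as diagnostic evidence for whoever handles it.
    """
    if not sla_breaches:
        return None
    return {
        "page": True,
        "reason": sorted(sla_breaches),
        "diagnostics": sorted(system_slo_violations),
    }


# A system SLO violation alone does not page anyone.
print(build_alert([], ["coordinator_p99_memory"]))

# A customer-facing breach pages, with the system signal attached as context.
print(build_alert(["queueing_time"], ["coordinator_p99_memory"]))
```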
Seen in
- sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the first named piece of scaling advice: "Establishing easy-to-understand and well-defined customer-facing SLAs."
Related
- concepts/queueing-theory — queueing-time SLA is literally the output of a queueing model.
- concepts/automated-root-cause-analysis — SLA breach → RCA analyzer is the canonical Meta flow.
- patterns/oncall-analyzer — SLA breach is the trigger.
- concepts/blast-radius — SLAs frame which failures are customer-visible.