CONCEPT Cited by 1 source

Customer-facing SLA¶

Definition¶

A customer-facing SLA is a quantitative promise stated in metrics the customer directly perceives, rather than in internal system metrics. For a shared SQL query service such as Presto at Meta scale, the two headline examples from sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale are:

Queueing time — how long a submitted query sits before a cluster starts executing it.
Query failure rate — what fraction of submitted queries fail.

Both are things a data analyst, ML engineer, or product-team user notices; neither is a CPU, disk, or network metric.
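As a rough illustration of the two metrics defined above, the sketch below derives them from per-query log records. The `QueryRecord` type and its field names are hypothetical and not taken from Presto or from the source; the sample timestamps are made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical per-query log record; the field names are assumptions."""
    submitted_at: float          # seconds at which the user submitted the query
    started_at: Optional[float]  # seconds at which a cluster began executing it, None if never
    failed: bool                 # True if the query ended in failure


def queueing_times(records: Sequence[QueryRecord]) -> list[float]:
    """Queueing time per query: execution start minus submission.

    Queries that never started have no queueing time and are skipped here.
    """
    return [r.started_at - r.submitted_at for r in records if r.started_at is not None]


def failure_rate(records: Sequence[QueryRecord]) -> float:
    """Fraction of submitted queries that failed (0.0 for an empty window)."""
    if not records:
        return 0.0
    return sum(r.failed for r in records) / len(records)


# Made-up sample window of four submitted queries.
records = [
    QueryRecord(submitted_at=0.0, started_at=2.0, failed=False),
    QueryRecord(submitted_at=1.0, started_at=9.0, failed=False),
    QueryRecord(submitted_at=3.0, started_at=4.0, failed=True),
    QueryRecord(submitted_at=5.0, started_at=None, failed=True),
]
print(queueing_times(records))  # [2.0, 8.0, 1.0]
print(failure_rate(records))    # 0.5
```

Both functions take only what a user could observe (when the query was submitted, when it started, whether it failed), which is the point of the definition.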

Why it matters at scale¶

"Defining SLAs around important metrics like queueing time and query failure rate in a manner that tracks customer pain points becomes crucial as Presto is scaled up. When there is a large number of users, the lack of proper SLAs can greatly hinder efforts to mitigate production issues because of confusion in determining the impact of an incident." — Meta, on scaling Presto (sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale)

Three structural roles the customer-facing SLA plays:

Shared definition of "bad." Without it, an oncall engineer and a product-team user cannot even agree on whether an incident is ongoing.
Alert trigger. Monitoring fires when an SLA metric is breached; the breach is what triggers oncall analyzers and automated remediation.
Prioritization of engineering investment. Where the SLA is breached regularly, automation (canary pipelines, bad-host drain, gateway throttling) gets prioritized; where it is not, ad-hoc fixes suffice.
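The alert-trigger role can be sketched as a small check that runs once per monitoring window. This is a minimal sketch under stated assumptions: `SlaTarget`, `check_window`, the metric names, and the threshold values are all hypothetical, and the `trigger` callback stands in for whatever analyzer or remediation hook a real deployment would call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical SLA thresholds; the values used below are placeholders."""
    max_p95_queueing_seconds: float
    max_failure_rate: float


def breached_metrics(p95_queueing_seconds: float, failure_rate: float,
                     target: SlaTarget) -> list[str]:
    """Return the names of the customer-facing metrics that exceed their target."""
    breaches = []
    if p95_queueing_seconds > target.max_p95_queueing_seconds:
        breaches.append("queueing_time")
    if failure_rate > target.max_failure_rate:
        breaches.append("query_failure_rate")
    return breaches


def check_window(p95_queueing_seconds: float, failure_rate: float, target: SlaTarget,
                 trigger: Callable[[list[str]], None]) -> list[str]:
    """Call `trigger` (an assumed analyzer/remediation hook) only when an SLA is breached."""
    breaches = breached_metrics(p95_queueing_seconds, failure_rate, target)
    if breaches:
        trigger(breaches)
    return breaches


target = SlaTarget(max_p95_queueing_seconds=30.0, max_failure_rate=0.01)  # placeholder numbers
fired: list[list[str]] = []
check_window(45.0, 0.002, target, fired.append)  # queueing time over target: trigger fires
check_window(5.0, 0.002, target, fired.append)   # within target: nothing fires
print(fired)  # [['queueing_time']]
```

Because the same breach list serves both the human ("is this an incident?") and the automation, the shared definition and the alert trigger come from one place.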

Distinguishing from system SLOs¶

System SLO: "coordinator p99 memory < 80%". Operator-facing.
Customer-facing SLA: "p95 queueing time < N seconds". User-facing.

The first is a leading indicator for the second but not a substitute. Meta's insight: tie the alerting and automation stack to the customer-facing numbers, and let system SLOs be diagnostic supporting evidence.
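One way to picture this split is an evaluator in which only the customer-facing number decides whether to page, while violated system SLOs are attached as supporting evidence. The sketch below is an assumption-laden illustration, not Meta's implementation: the function names, the SLO names, the nearest-rank percentile choice, and the 15-second limit standing in for the unspecified N are all hypothetical.

```python
import math
from typing import Mapping, Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(queueing_seconds: Sequence[float], sla_p95_limit: float,
             system_slos: Mapping[str, tuple[float, float]]) -> dict:
    """Page on the customer-facing SLA only; report violated system SLOs as diagnostics.

    `system_slos` maps a hypothetical SLO name to (observed value, limit).
    """
    p95 = percentile_nearest_rank(queueing_seconds, 95)
    violated = sorted(name for name, (value, limit) in system_slos.items() if value > limit)
    return {
        "page": p95 > sla_p95_limit,  # driven by the user-visible number alone
        "p95_queueing_seconds": p95,
        "diagnostics": violated,      # supporting evidence, never the trigger
    }


slos = {
    "coordinator_p99_memory_pct": (85.0, 80.0),  # over its limit
    "worker_cpu_pct": (40.0, 90.0),              # within its limit
}
slow = [float(s) for s in range(1, 21)]  # made-up queueing times of 1..20 seconds
print(evaluate(slow, 15.0, slos))
# {'page': True, 'p95_queueing_seconds': 19.0, 'diagnostics': ['coordinator_p99_memory_pct']}
print(evaluate([1.0, 2.0, 3.0], 15.0, slos)["page"])  # False
```

In the second call the coordinator memory SLO is still violated, yet nothing pages, because users are not yet waiting longer than promised.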

Seen in¶

sources/2023-07-16-highscalability-lessons-learned-running-presto-at-meta-scale — the first named piece of scaling advice: "Establishing easy-to-understand and well-defined customer-facing SLAs."

Related¶

concepts/queueing-theory: queueing time is the waiting-time quantity that a queueing model predicts, so a queueing-time SLA target can be reasoned about with one.
concepts/automated-root-cause-analysis: SLA breach → RCA analyzer is the canonical Meta flow.
patterns/oncall-analyzer: an SLA breach is the trigger.
concepts/blast-radius: SLAs frame which failures are customer-visible.
