CONCEPT Cited by 1 source

Service tier classification

Service tier classification is the discipline of labelling services (or operations) by criticality tier, typically Tier-1 (revenue-critical, on the user hot path), Tier-2 (important but not directly revenue-critical), and Tier-3 (internal, back-office, or nice-to-have), in order to prioritise SLO definition, reliability investment, alerting aggressiveness, and on-call coverage.

Definition

Zalando's 2018 framing:

"To try to put some structure into the SLOs we had, Service Tier definitions were published. To help with the Service Tiers, a new SLO reporting tool was developed. The new tool defined canonical SLIs and used the tier classification." (sources/2021-09-20-zalando-tracing-sres-journey-part-ii)

Three defining properties:

  1. Canonical tier enumeration — not ad-hoc adjectives, but a fixed small set (usually 3–5 tiers).
  2. Tier-specific policies — SLO targets, alert severity, on-call coverage, deploy-freeze sensitivity, etc. all key off the tier.
  3. Mandated classification at service onboarding — every new service picks a tier; every existing service has one assigned.
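
As a minimal sketch of the first and third properties, the tier set can be modelled as a closed enumeration, with onboarding refusing any service that does not declare a valid tier. The names `Tier`, `Service`, and `onboard` are hypothetical and are not taken from Zalando's tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Fixed, small tier enumeration; a lower number means more critical."""
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3


@dataclass(frozen=True)
class Service:
    name: str
    owner: str
    tier: Tier


def onboard(name: str, owner: str, tier: int) -> Service:
    """Register a service; reject it unless a valid tier is supplied."""
    try:
        return Service(name=name, owner=owner, tier=Tier(tier))
    except ValueError:
        allowed = ", ".join(str(t.value) for t in Tier)
        raise ValueError(
            f"{name}: tier must be one of {allowed}, got {tier!r}"
        ) from None
```

Because the enumeration is closed, an ad-hoc label such as "critical-ish" or a missing tier fails at onboarding rather than surfacing later as a gap in reporting.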

Why tier classification is load-bearing at scale

  • SLOs are costly to author and maintain. A 4,000-service fleet cannot have bespoke SLOs on every service; the authoring burden alone is prohibitive. Tiering concentrates effort on Tier-1 services (the critical few) and applies lighter-weight defaults, or no SLO at all, to the long tail.
  • Alerting threshold sprawl. Without tiers, a nightly batch failure and a checkout outage carry the same on-call urgency; tiers give the alerting platform a severity key.
  • Capacity + deploy-safety decisions. Can we deploy on Black Friday? The answer depends on the service's tier — Tier-1 services are typically frozen, Tier-3 services can continue normal cadence.
  • Budget & hiring signal. Tier-1 service ownership justifies SRE + on-call investment; Tier-3 service ownership does not.
  • Error-budget proportionality. Tier-1 operations get stricter SLOs (e.g., 99.95%); Tier-3 gets looser (e.g., 99%).
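
To make the error-budget proportionality concrete, the sketch below converts an availability target into allowed downtime. It assumes a time-based availability SLO over a 30-day window; both the window and the function name `error_budget_minutes` are illustrative choices, not something the source prescribes.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_target_percent) / 100.0


# Example targets taken from the tiers discussed in this page.
for tier, target in (("Tier 1", 99.95), ("Tier 2", 99.9), ("Tier 3", 99.0)):
    budget = error_budget_minutes(target)
    print(f"{tier}: {target}% -> {budget:.1f} min per 30 days")
```

Under these assumptions a 99.95% target leaves about 21.6 minutes per 30 days, 99.9% leaves about 43.2 minutes, and 99% leaves 432 minutes (7.2 hours), so a Tier-3 operation at 99% has a budget 20 times larger than a Tier-1 operation at 99.95%.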

Typical tier definitions (Zalando-style)

| Tier | Examples | SLO target | Alerting | On-call | Deploy gating |
|---|---|---|---|---|---|
| Tier 1 | Checkout, payments, product browse | 99.95%+ | Paging | 24/7 primary | Heavy gates, Cyber-Week freeze |
| Tier 2 | Reviews, recommendations, wish list | 99.9% | Email + dashboards | Business hours | Standard gates |
| Tier 3 | Internal tools, analytics, batch jobs | 99% or none | Dashboards only | Best effort | Minimal gates |

Definitions vary by organisation; the structure is what matters.
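
One way to make policies "key off the tier" is a single lookup table that alerting and deploy tooling both consult. The sketch below encodes the example values from the tier table; `TierPolicy`, `POLICIES`, `can_deploy`, and `alert_channel` are hypothetical names, and treating Tier 2 as not frozen during peak events is an assumption, since the text only states the Tier-1 and Tier-3 behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    slo_target_percent: Optional[float]  # None would mean "no SLO"
    alerting: str
    on_call: str
    deploy_gating: str
    frozen_during_peak: bool


POLICIES = {
    1: TierPolicy(99.95, "page", "24/7 primary", "heavy", True),
    # Assumption: Tier 2 keeps deploying during peak events.
    2: TierPolicy(99.9, "email+dashboard", "business hours", "standard", False),
    # The table allows "99% or none"; 99.0 is chosen here for illustration.
    3: TierPolicy(99.0, "dashboard", "best effort", "minimal", False),
}


def can_deploy(tier: int, peak_event_active: bool) -> bool:
    """Block deploys only for tiers that are frozen during a peak event."""
    policy = POLICIES[tier]
    return not (peak_event_active and policy.frozen_during_peak)


def alert_channel(tier: int) -> str:
    """Return the alert routing for a failure in a service of this tier."""
    return POLICIES[tier].alerting
```

With this shape, changing what a tier means is a one-line edit to the table, and no per-service configuration has to be touched.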

Tiering services vs tiering operations

Zalando's Part II post tiers services, but the post's own subsequent pivot to operation-based SLOs suggests the right long-term unit of tiering is the operation (the Critical Business Operation, or CBO), not the service. A single service can participate in both Tier-1 and Tier-3 operations, so any single service-level tier is wrong for some of its traffic; tiering operations avoids the mismatch.
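
The mismatch can be illustrated with a small sketch that derives, for each service, the set of operation tiers it takes part in. The operation names, service names, and mappings below are invented for illustration only.

```python
# Hypothetical operations with their tiers.
OPERATION_TIERS = {
    "place-order": 1,
    "view-order-history": 2,
    "export-sales-report": 3,
}

# Hypothetical mapping from operation to the services on its call path.
OPERATION_SERVICES = {
    "place-order": {"checkout", "payments", "customer-profile"},
    "view-order-history": {"orders", "customer-profile"},
    "export-sales-report": {"reporting", "customer-profile"},
}


def service_tiers(op_tiers, op_services):
    """Map each service to the set of operation tiers it participates in."""
    result = {}
    for operation, services in op_services.items():
        for service in services:
            result.setdefault(service, set()).add(op_tiers[operation])
    return result


tiers = service_tiers(OPERATION_TIERS, OPERATION_SERVICES)
# Services that span more than one tier are where service-level tiering misfits.
mixed = sorted(name for name, seen in tiers.items() if len(seen) > 1)
print(mixed)
```

In this made-up data `customer-profile` serves operations in all three tiers, so labelling it with one service-level tier would either over-alert on report exports or under-protect order placement.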

Scope-limiting as a strategic choice

Zalando's Service Tier rollout was explicitly scoped:

"this work was limited in scope. They targeted a single department, Digital Experience, home to one of the SRE teams. Services in other departments were not included in this effort and there was no mandate for them to adopt the new Service Tier definitions. Attempting to roll this out for the entire company (>4000 services) would not be feasible."

Lessons:

  • Scope limitation is honest when the effort is team-sized, not org-sized. A 7-person SRE team cannot classify 4,000 services with any quality; choosing the Digital Experience (DX) department is the correct tradeoff.
  • Scope-limited rollouts leave an org-wide classification gap. Services outside DX lack tier metadata, so they lack SLO reporting and therefore SLO-based prioritisation. This gap remains open-ended.
  • Company-wide tier mandates require executive air cover: an SRE team cannot push tiering on other departments unless senior leadership requires it.

Prerequisites

  • Consensus on the tier enumeration — usually written as an internal standard (Zalando published its Service Tier definitions).
  • SLO reporting tool aware of tiers — dashboards, alerts, and budget tracking all keyed on the tier attribute.
  • Ownership + tier recorded in the service catalogue — can be a single column in a services registry.
  • Governance process for disputes (for example, a team wants Tier-1 for status but cannot justify it, and SRE pushes back).
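
A rough sketch of the catalogue prerequisite: if the tier is a single column in a services registry, the classification gap left by a scoped rollout can be measured directly. The rows, field names, and the `untiered_by_department` helper are hypothetical.

```python
from collections import Counter

# Hypothetical registry rows; tier is None where no classification exists yet.
CATALOGUE = [
    {"service": "checkout", "department": "Digital Experience", "owner": "team-a", "tier": 1},
    {"service": "wish-list", "department": "Digital Experience", "owner": "team-b", "tier": 2},
    {"service": "ledger-export", "department": "Finance", "owner": "team-c", "tier": None},
    {"service": "route-planner", "department": "Logistics", "owner": "team-d", "tier": None},
    {"service": "label-printer", "department": "Logistics", "owner": "team-d", "tier": None},
]


def untiered_by_department(catalogue):
    """Count services without a tier, grouped by owning department."""
    return Counter(
        row["department"] for row in catalogue if row.get("tier") is None
    )


gap = untiered_by_department(CATALOGUE)
print(dict(gap))
```

A report like this gives the governance process a concrete backlog: departments with untiered services are the ones with no SLO reporting to lean on.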

Seen in

  • sources/2021-09-20-zalando-tracing-sres-journey-part-ii
