CONCEPT Cited by 1 source
Critical business operation¶
A Critical Business Operation (CBO) is a named, top-level user-facing business action (checkout, item view, order placement, sign-in, add-to-cart) that is elevated to a first-class alerting + SLO primitive. The CBO's error rate, latency, and throughput — not the individual backing service's — are what get monitored, alerted on, and held to SLO.
Definition¶
Zalando's instantiation:
"This alert handler monitors the error rate of what we call Critical Business Operations (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem." —
Three defining properties:
- CBO is the alertable unit, not the service. One CBO ("checkout") maps to a set of distributed-tracing root spans whose parent-child graph spans many microservices.
- Error rate is measured at the CBO level. A CBO fails when the root span carries an error tag or a 5xx status, regardless of which service inside the trace caused it.
- Per-CBO SLO, alerting, and dashboards become the canonical way to talk about reliability; per-service metrics are diagnostic but not primary.
Why CBOs are load-bearing¶
Classical service-centric observability breaks down at scale:
- Service A can be 99.99% available while checkout is broken. If 3 of 20 services on the checkout path are degraded (each at 99.5%) and the other 17 are healthy, each service may still report green against its own SLO, yet the user-visible compound failure rate is ~1.5% (1 - 0.995^3 ≈ 0.0149, assuming independent failures).
- User journeys cross team boundaries. "Checkout" touches inventory, pricing, payments, fraud, notifications; no single team owns the compound success rate.
- Service-level alerting produces noise and misses. Alerts fire for services not on any hot path; outages that matter hide in compound rates nobody tracks.
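The compound-rate arithmetic in the first bullet can be checked with a short sketch. It assumes failures are independent and that the 17 healthy services are treated as 100% available for simplicity; the function name `compound_failure_rate` is hypothetical.

```python
def compound_failure_rate(availabilities):
    """Probability that at least one service on the path fails.

    Assumes failures are independent, so path success is the product
    of the individual availabilities.
    """
    success = 1.0
    for availability in availabilities:
        success *= availability
    return 1.0 - success


# Hypothetical checkout path: 17 healthy services (modelled as 100% available)
# plus 3 degraded services at 99.5% each.
checkout_path = [1.0] * 17 + [0.995] * 3
rate = compound_failure_rate(checkout_path)
print(f"compound failure rate: {rate:.2%}")
```

Under these assumptions the result is roughly 1.5%, even though no single service is below 99.5%.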
Defining CBOs as first-class lets the organization:
- Align SLOs with user experience (concepts/operation-based-slo).
- Run symptom-based alerting on things the business cares about.
- Prioritise reliability work by user-facing impact rather than per-service KPIs that may be misaligned.
Relationship to the trace¶
A CBO maps to a root span identity pattern plus some attribute filters:
CBO "checkout_v2" :=
root_span.operation_name == "POST /checkout"
AND root_span.service == "checkout-api"
AND root_span.tag.version >= 2
The error rate of the CBO is the fraction of root spans matching the pattern that carry error=true or http.status >= 500. A CBO catalogue is usually small (single-digit to low hundreds); they are curated business-critical operations, not an auto-discovered enumeration.
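A minimal sketch of the matching and error-rate computation described above, assuming root spans have already been exported into simple in-memory records. The `RootSpan` type, the tag keys (`version`, `error`, `http.status`), and the function names are illustrative assumptions, not the API of any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class RootSpan:
    # Hypothetical, simplified view of a trace's root span.
    operation_name: str
    service: str
    tags: dict = field(default_factory=dict)


def is_checkout_v2(span):
    """Root-span pattern for the example CBO "checkout_v2"."""
    return (
        span.operation_name == "POST /checkout"
        and span.service == "checkout-api"
        and span.tags.get("version", 0) >= 2
    )


def is_failure(span):
    """A root span fails if it carries error=true or an HTTP status >= 500."""
    return span.tags.get("error") is True or span.tags.get("http.status", 0) >= 500


def cbo_error_rate(spans, matches, failed=is_failure):
    """Fraction of matching root spans that failed; None if nothing matched."""
    matched = [s for s in spans if matches(s)]
    if not matched:
        return None
    return sum(1 for s in matched if failed(s)) / len(matched)


spans = [
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 200}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 503}),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "http.status": 200, "error": True}),
    RootSpan("POST /checkout", "checkout-api", {"version": 1, "http.status": 500}),  # wrong version
    RootSpan("GET /item", "catalog-api", {"http.status": 500}),  # different CBO
    RootSpan("POST /checkout", "checkout-api", {"version": 3, "http.status": 201}),
]
rate = cbo_error_rate(spans, is_checkout_v2)
print(rate)  # 4 spans match, 2 of them failed
```

Note that the third span counts as a failure despite returning HTTP 200, because the error tag is evaluated independently of the status code.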
CBOs vs Google's SLI/SLO model¶
Google's SRE-book SLI framework has always allowed per-endpoint or per-user-journey SLIs, and Google's recent user-journey SLO writings align with CBO thinking. The Zalando term Critical Business Operation emphasises two things the generic SRE-book framing soft-pedals:
- Business-critical curation: CBOs are explicitly picked by product + SRE together as the operations whose reliability matters to revenue / customer trust.
- Cross-service compound measurement: the CBO's success rate necessarily crosses service boundaries and must be measured at the root span — this makes OpenTracing / OpenTelemetry a hard prerequisite.
Dependencies¶
- Distributed tracing deployed fleet-wide with consistent root-span identification — see OpenTracing semantic conventions.
- CBO catalogue maintained by product + SRE, typically a small curated list rather than an auto-discovered one.
- Error taxonomy agreed per CBO — what counts as a "checkout failure"? HTTP 5xx? User-abandoned after a 5xx? 5xx including those retried successfully by the client? This is typically the hardest design choice.
- Alerting + SLO tooling aware of CBO as a dimension: vanilla per-service dashboards are insufficient; the CBO needs a first-class definition in the alerting platform.
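To illustrate why the error taxonomy is a real design choice, the sketch below applies three candidate definitions of "checkout failure" to the same attempts and gets three different counts. The `CheckoutAttempt` fields and the taxonomy names are hypothetical and only mirror the questions raised in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutAttempt:
    # Hypothetical per-attempt record; field names are illustrative.
    status: int
    retry_succeeded: bool = False  # client retried and the retry succeeded
    user_abandoned: bool = False   # user gave up after this attempt


def any_5xx(a):
    return a.status >= 500


def unrecovered_5xx(a):
    # A 5xx that a successful client retry did not paper over.
    return a.status >= 500 and not a.retry_succeeded


def abandoned_after_5xx(a):
    return a.status >= 500 and a.user_abandoned


TAXONOMIES = {
    "any_5xx": any_5xx,
    "unrecovered_5xx": unrecovered_5xx,
    "abandoned_after_5xx": abandoned_after_5xx,
}

attempts = [
    CheckoutAttempt(200),
    CheckoutAttempt(503, retry_succeeded=True),
    CheckoutAttempt(500, user_abandoned=True),
    CheckoutAttempt(502),
]

counts = {
    name: sum(1 for a in attempts if predicate(a))
    for name, predicate in TAXONOMIES.items()
}
print(counts)
```

Whichever definition is chosen feeds directly into the CBO's error rate, so the same traffic can look healthy or unhealthy depending on the agreed taxonomy.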
Downstream primitives¶
- concepts/symptom-based-alerting — alerts fire on CBO error rate, not service error rate.
- concepts/adaptive-paging — when a CBO alert fires, the paging target is computed by traversing the trace graph to find the likely root-cause service.
- concepts/operation-based-slo — SLO is defined per CBO, not per service.
- Throughput estimation for capacity planning — a Throughput Calculator projects per-service RPS fan-out from expected CBOs-per-minute.
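One way to picture the trace traversal behind adaptive paging is the sketch below. It is an assumed heuristic, not Zalando's documented algorithm: starting at the failed root span, it follows the first errored child at each level and returns the service of the last errored span on that path. The `Span` type and `likely_origin` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical, simplified span: owning service, error flag, child spans.
    service: str
    error: bool = False
    children: list = field(default_factory=list)


def likely_origin(span):
    """Return the service at the end of the first all-errored path from `span`.

    Returns None if `span` itself is not errored. This is an illustrative
    heuristic only; a real system would likely weigh multiple errored
    branches, timing, and ownership metadata.
    """
    if not span.error:
        return None
    for child in span.children:
        origin = likely_origin(child)
        if origin is not None:
            return origin
    # No errored child: this span is the deepest errored one on the path.
    return span.service


trace = Span(
    "checkout-api",
    error=True,
    children=[
        Span("pricing-service"),
        Span("payment-service", error=True, children=[Span("fraud-service", error=True)]),
    ],
)
print(likely_origin(trace))
```

In this example the page would go to the team owning `fraud-service` rather than to the owner of `checkout-api`, where the symptom was observed.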
Seen in¶
- — Zalando names CBO as the alertable primitive underlying Adaptive Paging and Symptom-Based Alerting. Quote: "monitors the error rate of what we call Critical Business Operations."
- sources/2022-04-27-zalando-operation-based-slos — technical deep-dive. Names the origin of the CBO catalogue: renamed from internal "User Functions" (generated by SREs + experienced engineers for Cyber-Week load-testing work, ordered by revenue impact) to "Critical Business Operation" to encompass non-user operations (e.g. SRE's own "Ingest Metrics", "Query Traces" CBOs). Introduces the senior-manager ownership model: each CBO's SLO is signed off by the Director / VP owning the customer experience, not by any component service's team. Also names per-CBO error-budget tracking across three 28-day windows in the new Service Level Management Tool. Ties the "transport-agnostic SLI" shift (OpenTracing error tag rather than 5xx-rate) to CBOs crossing protocol boundaries, so graceful degradation fallbacks can still register as CBO failures.
- — CBO as the priority-class assignment axis for admission control. Order confirmations are canonical-Zalando CBOs (SLO-protected, revenue-linked); marketing / brand-alert / campaign notifications are non-CBO bulk traffic. The Communication Platform's three-tier priority system P1/P2/P3 maps directly onto the CBO/non-CBO distinction: the platform's Stream Consumer assigns per-event-type AIMD coefficients such that P1 (CBO-carrying) event types barely slow under congestion while P3 contracts sharply. Canonical instance of the CBO classification surfacing in an admission-control coefficient table rather than only in SLO definitions and paging policy; the downstream implication of the business taxonomy reaches into runtime rate control (concepts/per-priority-aimd-coefficients, patterns/priority-differentiated-load-shedding).
- — CBOs as probe-scenario scoping unit. Zalando's 2024 e2e test probe tier scopes its probe scenarios 1-to-1 to CBOs: home→gender→product, catalog→filter→product, product→size→cart→checkout. Declared growth path: "include more of our Critical Business Operations (CBOs) and we also [are] looking at extending this idea to our mobile apps." Canonical wiki instance of CBOs as a probe-level scoping unit that ties browser-altitude synthetic monitoring to the same symptom-primitive as trace-derived CBO error-rate alerts. The probe surfaces a CBO failure mode (frontend interactivity crash from React hydration breakdown) that the trace-derived CBO alert misses because HTTP 200 still flows.
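The per-priority AIMD behaviour mentioned in the admission-control entry above can be sketched as follows. The coefficient values, the `AIMD` table, and the `next_rate` function are illustrative assumptions; the source does not give the Communication Platform's actual numbers.

```python
# Hypothetical coefficient table: priority -> (additive increase per tick,
# multiplicative decrease factor applied under congestion).
AIMD = {
    "P1": (10.0, 0.95),  # CBO-carrying events: barely slow down
    "P2": (5.0, 0.70),
    "P3": (1.0, 0.30),   # bulk traffic: contracts sharply
}


def next_rate(rate, priority, congested, floor=1.0):
    """One AIMD step: multiply down when congested, otherwise add up."""
    increase, decrease = AIMD[priority]
    if congested:
        return max(floor, rate * decrease)
    return rate + increase


# Three consecutive congested ticks, every priority starting at 100 events/s.
rates = {priority: 100.0 for priority in AIMD}
for _ in range(3):
    rates = {p: next_rate(r, p, congested=True) for p, r in rates.items()}
print({p: round(r, 2) for p, r in rates.items()})
```

With these assumed coefficients, P1 keeps most of its throughput after three congested ticks while P3 drops to a small fraction of its starting rate, which is the shape of differentiation the entry describes.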
Related¶
- concepts/symptom-based-alerting
- concepts/operation-based-slo
- concepts/adaptive-paging
- concepts/service-tier-classification
- concepts/error-budget — per-CBO budget tracking across three 28-day windows in the Service Level Management Tool.
- concepts/multi-window-multi-burn-rate — alerting strategy keyed on CBO budget burn.
- concepts/end-to-end-test-probe — browser-altitude primitive that uses CBOs as its scoping unit.
- systems/opentracing
- systems/zalando-adaptive-paging
- systems/zalando-service-level-management-tool
- systems/zalando-throughput-calculator
- patterns/e2e-test-as-synthetic-probe — the pattern that operationalises CBOs as browser-altitude symptom sources.