CONCEPT Cited by 1 source

Critical business operation¶

A Critical Business Operation (CBO) is a named, top-level user-facing business action (checkout, item view, order placement, sign-in, add-to-cart) that is elevated to a first-class alerting + SLO primitive. The CBO's error rate, latency, and throughput — not the individual backing service's — are what get monitored, alerted on, and held to SLO.

Definition¶

Zalando's instantiation:

"This alert handler monitors the error rate of what we call Critical Business Operations (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem." — sources/2021-09-20-zalando-tracing-sres-journey-part-ii

Three defining properties:

CBO is the alertable unit, not the service. One CBO ("checkout") maps to a set of distributed-tracing root spans, each the top of a span tree that reaches into many microservices.
Error rate is measured at the CBO level. A CBO fails when the root span carries an error tag or a server-error (5xx) status, regardless of which service inside the trace caused it.
Per-CBO SLO, alerting, and dashboards become the canonical way to talk about reliability; per-service metrics are diagnostic but not primary.

Why CBOs are load-bearing¶

Classical service-centric observability breaks down at scale:

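Service A can be 99.99% available while checkout is broken. If 3 of 20 services on the checkout path are degraded (each at 99.5%), every service can still report green against its own SLO, yet the user-visible compound failure rate is ~1.5% (1 − 0.995³ ≈ 0.0149, assuming independent failures and the other 17 services effectively perfect).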
User journeys cross team boundaries. "Checkout" touches inventory, pricing, payments, fraud, notifications; no single team owns the compound success rate.
Service-level alerting produces noise and misses. Alerts fire for services not on any hot path; outages that matter hide in compound rates nobody tracks.
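
The compound-failure arithmetic in the first point above can be checked with a short sketch. It assumes that a checkout request must pass through every service in series and that failures are independent; both are simplifying assumptions, and the 17 healthy services are treated as perfect only to isolate the effect of the 3 degraded ones.

```python
import math


def compound_success_rate(availabilities):
    """Probability that a request succeeds when it must pass through every
    service in series, assuming independent failures (a simplification)."""
    return math.prod(availabilities)


# Hypothetical checkout path: 17 healthy services plus 3 degraded ones.
healthy = [1.0] * 17      # treated as perfect to isolate the degraded three
degraded = [0.995] * 3    # each still meets a 99.5% per-service target

failure_rate = 1.0 - compound_success_rate(healthy + degraded)
print(f"compound failure rate: {failure_rate:.2%}")  # prints 1.49%
```

Replacing the 1.0 entries with realistic values such as 0.9999 pushes the compound failure rate slightly higher, which is the point: no per-service number reveals it.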

Defining CBOs as first-class lets the organization:

Align SLOs with user experience (concepts/operation-based-slo).
Run symptom-based alerting on things the business cares about.
Prioritise reliability work by user-facing impact rather than per-service KPIs that may be misaligned.

Relationship to the trace¶

A CBO maps to a root span identity pattern plus some attribute filters:

  CBO "checkout_v2" :=
    root_span.operation_name == "POST /checkout"
    AND root_span.service == "checkout-api"
    AND root_span.tag.version >= 2

The error rate of the CBO is the fraction of root spans matching the pattern that carry error=true or http.status >= 500. A CBO catalogue is usually small (single-digit to low hundreds); they are curated business-critical operations, not an auto-discovered enumeration.
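
As a minimal sketch of that definition, the Python below expresses the "checkout_v2" pattern as a predicate over root spans and computes the error rate over the matches. The `RootSpan` class, its field names, and the sample data are hypothetical and exist only for illustration; a real system would read these attributes from its tracing backend.

```python
from dataclasses import dataclass, field


@dataclass
class RootSpan:
    # Hypothetical in-memory stand-in for a root span from a tracing backend.
    operation_name: str
    service: str
    tags: dict = field(default_factory=dict)
    http_status: int = 200


def is_checkout_v2(span):
    """Root-span pattern for the CBO "checkout_v2" as defined above."""
    return (
        span.operation_name == "POST /checkout"
        and span.service == "checkout-api"
        and span.tags.get("version", 0) >= 2
    )


def is_failure(span):
    """Failure rule from the text: error=true or an HTTP status >= 500."""
    return span.tags.get("error") is True or span.http_status >= 500


def cbo_error_rate(spans, matches, failed):
    """Fraction of matching root spans that failed; None if nothing matched."""
    matched = [s for s in spans if matches(s)]
    if not matched:
        return None
    return sum(1 for s in matched if failed(s)) / len(matched)


spans = [
    RootSpan("POST /checkout", "checkout-api", {"version": 2}, 200),
    RootSpan("POST /checkout", "checkout-api", {"version": 2}, 503),
    RootSpan("POST /checkout", "checkout-api", {"version": 2, "error": True}, 200),
    RootSpan("POST /checkout", "checkout-api", {"version": 1}, 500),  # v1: outside the CBO
    RootSpan("GET /item", "catalog-api", {"version": 2}, 500),        # different operation
    RootSpan("POST /checkout", "checkout-api", {"version": 3}, 404),  # 4xx: not a failure here
]

rate = cbo_error_rate(spans, is_checkout_v2, is_failure)
print(rate)  # 2 failures out of 4 matching root spans -> 0.5
```

Keeping the match rule and the failure rule as separate functions mirrors the split between the CBO's identity pattern and its error taxonomy, so either can change without touching the other.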

CBOs vs Google's SLI/SLO model¶

Google's SRE-book SLI framework has always allowed per-endpoint or per-user-journey SLIs, and Google's recent user-journey SLO writings align with CBO thinking. The Zalando term Critical Business Operation emphasises two things the generic SRE-book framing soft-pedals:

Business-critical curation: CBOs are explicitly picked by product + SRE together as the operations whose reliability matters to revenue / customer trust.
Cross-service compound measurement: the CBO's success rate necessarily crosses service boundaries and must be measured at the root span, which makes distributed tracing (OpenTracing / OpenTelemetry, for example) a hard prerequisite.

Dependencies¶

Distributed tracing deployed fleet-wide with consistent root-span identification — see OpenTracing semantic conventions.
CBO catalogue maintained by product + SRE, typically a small curated list rather than an auto-discovered one.
Error taxonomy agreed per CBO — what counts as a "checkout failure"? HTTP 5xx? User-abandoned after a 5xx? 5xx including those retried successfully by the client? This is typically the hardest design choice.
Alerting + SLO tooling aware of CBO as a dimension: vanilla per-service dashboards are insufficient, and the CBO needs a first-class definition in the alerting platform.

Downstream primitives¶

concepts/symptom-based-alerting — alerts fire on CBO error rate, not service error rate.
concepts/adaptive-paging — when a CBO alert fires, the paging target is computed by traversing the trace graph to find the likely root-cause service.
concepts/operation-based-slo — SLO is defined per CBO, not per service.
Throughput estimation for capacity planning — a Throughput Calculator projects per-service RPS fan-out from expected CBOs-per-minute.
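
The adaptive-paging step listed above can be illustrated with a small sketch. The source only says that tracing data is used to find where the error comes from and to page the closest team; the heuristic below (follow errored child spans down from the root and stop at the deepest one) is an assumption made for illustration, not Zalando's documented algorithm. The `Span` class, the `OWNERS` mapping, and all service and team names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record: identity, parent link, service, error flag.
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool = False


def likely_root_cause(spans):
    """Walk from the root span through errored children and return the service
    of the deepest errored span on that path (None if the root did not fail)."""
    children = {}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)
    if root is None or not root.error:
        return None
    current = root
    while True:
        errored = [c for c in children.get(current.span_id, []) if c.error]
        if not errored:
            return current.service
        current = errored[0]  # simplification: follow the first errored child


# Hypothetical service-to-team ownership mapping.
OWNERS = {
    "checkout-api": "team-checkout",
    "payment-gateway": "team-payments",
    "fraud-scoring": "team-fraud",
    "pricing": "team-pricing",
}

trace = [
    Span("1", None, "checkout-api", error=True),
    Span("2", "1", "pricing"),
    Span("3", "1", "payment-gateway", error=True),
    Span("4", "3", "fraud-scoring", error=True),
]

culprit = likely_root_cause(trace)
page_target = OWNERS[culprit]
print(culprit, page_target)  # fraud-scoring team-fraud
```

A production version would need to handle several errored siblings, missing spans, and services without an owner; this sketch only shows why the page can land on a team other than the one that owns the root span.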

Seen in¶

sources/2021-09-20-zalando-tracing-sres-journey-part-ii — Zalando names CBO as the alertable primitive underlying Adaptive Paging and Symptom-Based Alerting. Quote: "monitors the error rate of what we call Critical Business Operations."