ZALANDO 2024-07-28

Zalando: OpenTelemetry for JavaScript Observability at Zalando

Summary

Aryan Raj (Zalando Engineering, 2024-07-28) publishes the architecture companion to the worker-threads postmortem (sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads promised this followup). It discloses Zalando's two JavaScript OpenTelemetry SDKs — one for Node.js server workloads, one for the browser — both built on top of upstream OTel as a thin Zalando-specific wrapper that pre-configures platform defaults, packages a set of critical metrics, and acts as a single proxy for all underlying dependencies so service owners can instrument with one line of code:

import { SDK } from "@zalando/observability-sdk-node";
new SDK().start();

The server-side SDK was built by the SRE Enablement and Web Platform teams at the end of 2022, directly motivated by the 2022-04 incident in the worker-threads post where the on-call team had "almost zero visibility into what the affected application was doing". It auto-configures from the platform's Kubernetes environment variables, enables HTTP and optional Express.js auto-instrumentations, collects CPU, memory, GC, and event-loop lag metrics by default, and exports spans + metrics to the telemetry backend (ServiceNow Cloud Observability / Lightstep).
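As a rough illustration of what "auto-configures from environment variables" could look like inside such a wrapper, the sketch below resolves a config object from an environment map plus optional overrides. The variable names (APP_NAME, DEPLOY_ENV, TELEMETRY_COLLECTOR_URL), the fallback endpoint, and the config shape are all assumptions made for illustration; the post does not disclose the real names or the wrapper's internals, and the SDK class here is a stand-in, not @zalando/observability-sdk-node.

```javascript
// Hypothetical sketch: env var names, defaults, and config shape are NOT from the post.
function resolveConfig(env, overrides = {}) {
  return {
    serviceName: overrides.serviceName ?? env.APP_NAME ?? "unknown-service",
    environment: overrides.environment ?? env.DEPLOY_ENV ?? "development",
    collectorUrl:
      overrides.collectorUrl ?? env.TELEMETRY_COLLECTOR_URL ?? "http://localhost:4318",
    instrumentations: {
      http: overrides.http ?? true, // HTTP instrumentation on unless switched off
      express: overrides.express ?? false, // Express.js instrumentation is opt-in
    },
    defaultMetrics: ["cpu", "memory", "gc", "event_loop_lag"],
  };
}

// Stand-in for a wrapper class; a real one would wire up upstream OTel here.
class SDK {
  constructor(overrides = {}, env = typeof process !== "undefined" ? process.env : {}) {
    this.config = resolveConfig(env, overrides);
    this.started = false;
  }

  start() {
    // Placeholder: register instrumentations, metric collectors, and exporters.
    this.started = true;
    return this;
  }
}

// Usage: explicit overrides win over the environment, which wins over defaults.
const sdk = new SDK({ express: true }, { APP_NAME: "checkout" }).start();
console.log(sdk.config.serviceName, sdk.config.instrumentations);
```

The point of the design is that a service owner only states what differs from the platform defaults; everything else is derived from the deployment environment.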

The browser SDK was built afterwards. The observability gap on the client was larger because the only pre-2023 tool was Sentry error logging, which told the team that an error had happened but not why. The motivating example: in the web checkout experience, customer requests were sometimes flagged as bots by the WAF and blocked at Skipper; without client-side tracing, there was no way to connect a button click to a missing request at the proxy or to count affected customers. Tracing on the client required solving several problems the server side never had to:

Page-payload performance cost: a full telemetry package can add 400 KB; Zalando cherry-picked OTel packages to add only ~30 KB to the page, deferred non-critical packages until after the initial load, and used sendBeacon() so that exports get the lowest priority on the network.
GDPR consent — every export call gates on explicit customer consent before sending data to the collector.
No AsyncLocalStorage — the browser runtime has no native async-context primitive (TC39 AsyncContext is still a proposal). The alternative, Zone.js, monkey-patches global functions in the customer's browser — which Zalando found distasteful. They opted out of context propagation entirely on the client and resorted to manual span-passing through function parameters (as they had done in OpenTracing); they did the same on the server for migration convenience from their large existing OpenTracing codebase.
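Manual span-passing, the last item above, can be sketched as follows. The tracer is a minimal in-memory stand-in written for this example (it is not the OTel API and not Zalando's code); the function names and the "filter click" scenario are illustrative assumptions. What it shows is that the parent span travels as an ordinary function argument, so nothing depends on ambient async context.

```javascript
// Minimal in-memory tracer stand-in (NOT the OTel API): just enough to
// record parent/child links between spans.
let nextId = 1;
const finished = [];

function startSpan(name, parent = null) {
  const span = {
    id: nextId++,
    name,
    parentId: parent ? parent.id : null,
    finish() {
      finished.push(span);
    },
  };
  return span;
}

// The parent span is an explicit parameter; no ambient context is consulted.
async function fetchProducts(query, parentSpan) {
  const span = startSpan("fetch_products", parentSpan);
  try {
    await Promise.resolve(); // stands in for a network call
    return [`result for ${query}`];
  } finally {
    span.finish();
  }
}

async function onFilterClick(query) {
  const root = startSpan("filter_click");
  try {
    return await fetchProducts(query, root);
  } finally {
    root.finish();
  }
}

const done = onFilterClick("shoes");
```

The cost of this approach is that every function on the traced path needs an extra parameter; the benefit is that the parent/child link survives any number of awaits without Zone.js or AsyncLocalStorage.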

Package layout — an isomorphic API package (used by both SDKs) so applications that run on both server and client can instrument once:

@zalando/observability-api           # types + API
@zalando/observability-sdk-node      # Node wrapper
@zalando/observability-sdk-browser   # web wrapper

Early-2024 rollout into the Rendering Engine's web framework brought unprecedented client-side visibility — page load, entity resolution, and AJAX requests are now traced. The Rendering Engine's renderer concept (independent UI+data units) was extended with a props.tools.observability.traceAs("op_name") API so frontend developers can instrument custom async operations (e.g. filter-update button click → /search?q=shoes) without touching OTel directly; the framework tags operations with HTTP errors and lets the team tie asynchronous client-side work to proxy-level spans.
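A minimal sketch of how a framework might expose such a helper is shown below. Only the call shape props.tools.observability.traceAs("op_name") and the span methods addTags() and finish() come from the post; the factory createObservabilityTools, the report callback, the tag names, and the duration field are hypothetical and exist only to make the example self-contained.

```javascript
// Hypothetical factory: the framework would build this once per renderer and
// hand it over via props.tools.observability.
function createObservabilityTools(report) {
  return {
    traceAs(opName) {
      const startedAt = Date.now();
      const tags = {};
      let isFinished = false;
      return {
        addTags(extra) {
          Object.assign(tags, extra);
          return this;
        },
        finish() {
          if (isFinished) return; // finishing twice must not report twice
          isFinished = true;
          report({ name: opName, tags: { ...tags }, durationMs: Date.now() - startedAt });
        },
      };
    },
  };
}

// Usage inside a renderer: the developer never touches OTel directly.
const reported = [];
const props = {
  tools: { observability: createObservabilityTools((span) => reported.push(span)) },
};

function onFilterUpdate(httpStatus) {
  const span = props.tools.observability.traceAs("filter_update");
  // Tag names are illustrative; the post only says operations are tagged with HTTP errors.
  span.addTags({ "http.status_code": httpStatus, error: httpStatus >= 400 });
  span.finish();
}

onFilterUpdate(503);
```

Keeping the surface this small means the framework can change the underlying tracing library without breaking renderer code.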

Client-side metrics followed: a new OTel exporter plugs into Zalando's existing RUM pipeline to report the four Web Vitals — FCP, LCP, INP, CLS — tagged with arbitrary attributes (a new attribute was added to identify the "designer" experience). With OTel metric attributes, Zalando can correlate web-vital regressions with features — a capability they lacked with their custom RUM setup — and plan to decommission the custom metric-collection service.

Metrics bucketing was a non-trivial engineering problem. OpenTelemetry JS histograms default to the boundaries [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000], which are designed for server latency in milliseconds; a single-value-per-page-load metric like LCP (~600-2000ms) or CLS (0-1) lands in only a handful of these buckets, so most of the distribution's shape is lost. Zalando solved this by declaring per-metric buckets through OTel's view / custom-aggregation API (excerpt as shown in the post):

const metricBuckets = {
  fcp: [0, 100, 200, 300, 350, 400, 450, 500, 550, 650, 750,
        850, 900, 950, 1000, 1100, 1200, 1500, 2000, 2500, 5000],
  lcp: [0, 100, ..., 1500, 1550, 1600, 1650, 1700, 1800, 1900,
        2000, 2500, 5000],  // ~32 buckets for 0-5000ms, denser
                            // around p50-p75 web-vital range
  cumulativeLayoutShift: [0, 0.025, 0.05, ..., 1, 1.25, 1.5,
                          1.75, 2],  // 0-2 range with 0.025
                                     // steps near good scores
};

The author discussed this limitation with OTel contributors at KubeCon Paris and flagged the events API as a potentially better fit for browser-based single-value metrics.
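The loss of resolution described above is easy to reproduce without any OTel dependency. The sketch below counts how many distinct buckets a handful of samples occupy under the default boundaries versus finer ones. The sample values are made up, the CLS boundaries are an illustrative list (the post's CLS list is elided and is not reconstructed here), and the FCP boundaries are the ones quoted from the post. Buckets are treated as upper-inclusive with one overflow bucket, which is how explicit-bucket histograms are specified in OTel.

```javascript
const DEFAULT_BOUNDARIES = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
                            1000, 2500, 5000, 7500, 10000];

// FCP boundaries as quoted in the post.
const FCP_BOUNDARIES = [0, 100, 200, 300, 350, 400, 450, 500, 550, 650, 750,
                        850, 900, 950, 1000, 1100, 1200, 1500, 2000, 2500, 5000];

// Illustrative only: NOT Zalando's CLS list, which the excerpt elides.
const CLS_BOUNDARIES_EXAMPLE = [0, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1, 2];

// Bucket i holds values in (boundaries[i-1], boundaries[i]]; the final
// bucket collects everything above the last boundary.
function bucketCounts(values, boundaries) {
  const counts = new Array(boundaries.length + 1).fill(0);
  for (const v of values) {
    let i = boundaries.findIndex((b) => v <= b);
    if (i === -1) i = boundaries.length; // overflow bucket
    counts[i] += 1;
  }
  return counts;
}

function occupiedBuckets(values, boundaries) {
  return bucketCounts(values, boundaries).filter((c) => c > 0).length;
}

// Made-up samples, one value per page load.
const clsSamples = [0.01, 0.04, 0.08, 0.12, 0.25, 0.4];
const fcpSamples = [320, 480, 610, 880, 1150, 1800];

console.log(occupiedBuckets(clsSamples, DEFAULT_BOUNDARIES));     // 1
console.log(occupiedBuckets(clsSamples, CLS_BOUNDARIES_EXAMPLE)); // 6
console.log(occupiedBuckets(fcpSamples, DEFAULT_BOUNDARIES));     // 4
console.log(occupiedBuckets(fcpSamples, FCP_BOUNDARIES));         // 6
```

With the defaults, every CLS sample falls into the single (0, 5] bucket, so percentiles computed from the histogram carry no information; the finer boundaries separate all six samples.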

Next-step roadmap: move Critical Business Operations (CBOs) to the client side. The example — catalog-page filter-apply, a key conversion-funnel step — currently has health inferred from HTTP status at the proxy; moving it to client-side tracing lets alerts fire on the actual user experience.

Key takeaways

OpenTelemetry is the Zalando observability standard, delivered via platform-owned SDK wrappers. The SDKs are not applications themselves; they are thin layers over upstream OTel that pre-configure 15+ platform conventions (env vars, semantic tags, exporter endpoints, instrumentation sets) and provide a single-statement start-up, so "observability" for a new service is a one-line import. This is the pattern the corpus calls patterns/observability-sdk-as-zalando-specific-wrapper. Zalando SDKs are multi-language; this post focuses on the Node and Browser members.
The Node.js SDK was built at the end of 2022 as a direct remediation of the worker-threads incident, in which responders had "almost zero visibility". Both posts state this explicitly: "Before 2023 the observability state of these applications was quite poor. During an incident, on-call responders would try to locate the root cause of the issue only to find some applications in the request flow having no instrumentation at all. In one specific, very interesting example, we had almost zero visibility into what the affected application was doing" (the hyperlink in the original post points to the worker-threads article). The 53-app adoption figure in the prior source sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads refers to the same SDK described here.
Browser observability had one tool — Sentry — before 2023, and it only answered "did an error happen". The web-checkout-WAF-block example is the canonical motivating gap: a button click at t=0 produces a request that never arrives at Skipper, because the WAF dropped it as a false-positive bot; from the customer's perspective the page hung, from the backend's perspective no request happened. Without client-side trace spans that start in the browser, there is no way to count affected customers or diagnose the WAF false-positive rate.
Bundle-size budget for telemetry on the web is a first-class design constraint — the first-attempt telemetry package added 400 KB to the page; Zalando cherry-picked the packages they actually needed and got to 30 KB added, a >13× reduction. Non-critical telemetry packages are lazy-loaded after the critical load path; the exporter uses sendBeacon() to push network requests to the lowest priority tier (concepts/send-beacon-telemetry-transport). The post also cites Grafana Faro as a greenfield alternative worth evaluating.
Edge proxy (Skipper) doubles as the telemetry-collector ingress for browser data. GDPR and public-internet reachability require an externally addressable collector; Zalando uses Skipper as the ingress, with rate limits configured as endpoint protection, and ships a template to other applications wanting to deploy their own proxy-as-collector (patterns/edge-proxy-as-telemetry-collector-ingress).
GDPR consent gates every telemetry export on the web — the SDK only sends data if the user has consented; this is a quiet but load-bearing piece of client-side tracing architecture that server-side tracing never has to think about (concepts/gdpr-consent-gated-telemetry).
Zalando opted out of OpenTelemetry's context API on both client and server — on the server, because the context-vs-OpenTracing-span-passing API difference made migrating existing instrumentation frustrating; on the browser, because the only way to propagate context automatically is Zone.js, which monkey-patches global functions in the customer's browser (Zalando "are not big fans of this"). They use OTel's approach #3: tracer.startSpan("name", {}, context) and pass span objects manually through function parameters (concepts/async-context-propagation, patterns/manual-span-passing-over-async-context).
Isomorphic SDK architecture: one API package, two runtime SDKs. @zalando/observability-api holds types + APIs; @zalando/observability-sdk-node and @zalando/observability-sdk-browser implement runtime-specific adapters. For isomorphic applications (those that render both on the server and in the browser, e.g. pages served by the Rendering Engine), the same instrumentation code works for both targets. (Source: "This structure became especially useful while instrumenting isomorphic applications".)
Framework-exposed tracing API lets renderer developers instrument custom client-side operations. The Rendering Engine's renderer abstraction was extended with props.tools.observability.traceAs("op_name") which returns a span object with addTags() and finish(). This is how developers can instrument a filter-apply button or an AJAX call as a first-class span without touching OTel APIs directly (patterns/framework-exposed-tracing-api-for-renderer-developers).
Web Vitals as OTel metrics with per-metric custom histogram buckets. FCP, LCP, INP, and CLS have ranges that differ by 3+ orders of magnitude (LCP in ms, CLS unitless in 0-1). OTel JS's default histogram buckets would put most LCP values into the 1000-2500ms bucket (no resolution where p50-p75 lives) and nearly all CLS values into the single 0-5 bucket (no resolution at all, since CLS rarely exceeds 1). Zalando declares custom buckets via OTel's view + custom-aggregation API, with denser buckets around the "good" web-vital band per metric (concepts/custom-histogram-buckets). The author noted at KubeCon Paris that OTel's events API may be a better fit for browser-based single-value metrics than histograms; it is not yet adopted.
Decommissioning the custom RUM metric-collection service — with OTel metrics plus arbitrary attributes available at query time (via the Lightstep / ServiceNow Cloud Observability backend), Zalando's existing custom metrics store stops being necessary. The tradeoff they name: the custom service "worked great over the years" but "we missed flexibility in adding custom attributes to the collected metrics and thus correlating regressions with features was difficult". OTel's open-schema attribute approach replaces a bespoke fixed-column system with a wide-column one at similar cost, because the tooling (Lightstep) already understood OTel.
The next productionisation step is moving CBOs to the client side. Server-side CBOs infer user-experience-health from HTTP status; the catalog-filter example is one where most of the interesting behaviour happens in the browser (asynchronous rendering, partial failures, debouncing). Client-side tracing unlocks real-user-impact CBOs; the post doesn't yet report metrics on how this has affected alert precision or time-to-detect.
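The consent gate and the sendBeacon() transport from the takeaways above can be combined in one small exporter. The sketch below is hypothetical: the post confirms that exports are gated on consent and sent with sendBeacon(), but not how. The collector path, the payload shape, and in particular the choice to drop queued spans when consent is missing are assumptions made here. The consent check and the beacon function are injected so the example runs outside a browser.

```javascript
// Hypothetical exporter; `hasConsent` and `sendBeacon` are injected so the
// sketch does not depend on browser globals.
function createBeaconExporter({ url, hasConsent, sendBeacon }) {
  let queue = [];
  return {
    record(span) {
      queue.push(span);
    },
    flush() {
      if (queue.length === 0) return false;
      if (!hasConsent()) {
        queue = []; // assumption: drop data rather than hold it without consent
        return false;
      }
      const payload = JSON.stringify({ spans: queue });
      const queued = sendBeacon(url, payload); // true if the browser accepted it
      if (queued) queue = [];
      return queued;
    },
  };
}

// Demo with fakes in place of the consent manager and navigator.sendBeacon.
const sent = [];
let consent = false;
const exporter = createBeaconExporter({
  url: "/telemetry/v1/traces", // hypothetical collector path behind the edge proxy
  hasConsent: () => consent,
  sendBeacon: (url, body) => {
    sent.push({ url, body });
    return true;
  },
});

exporter.record({ name: "page_load" });
exporter.flush(); // no consent: nothing leaves the page

consent = true;
exporter.record({ name: "filter_update" });
exporter.flush(); // consent given: one beacon is queued
```

In a browser, the injected function would typically be `(url, body) => navigator.sendBeacon(url, body)`, with flush() called when the page's visibility changes to hidden.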

Architectural shape (one diagram's worth)

┌───────────────────────────────────────────────────────────┐
│ @zalando/obs-api (shared types + API, used by both SDKs)  │
└───────────────────────────────────────────────────────────┘
             ▲                                ▲
             │                                │
┌────────────┴─────────────┐     ┌────────────┴─────────────┐
│ @zalando/obs-sdk-node    │     │ @zalando/obs-sdk-browser │
│                          │     │                          │
│ • auto-config from       │     │ • ~30 KB bundle          │
│   K8s env vars           │     │ • cherry-picked pkgs     │
│ • HTTP + Express         │     │ • sendBeacon exporter    │
│   auto-instrumentation   │     │ • GDPR-gated exports     │
│ • CPU/mem/GC/            │     │ • Web Vitals:            │
│   event-loop-lag         │     │   FCP,LCP,INP,CLS        │
│ • SDK().start()          │     │ • custom histogram       │
│                          │     │   buckets per metric     │
└────────────┬─────────────┘     └────────────┬─────────────┘
             │                                │
             │  OTLP spans+metrics            │  OTLP spans+metrics
             │                                │  via Skipper
             ▼                                ▼
     ┌──────────────────────────────────────────────────┐
     │   Telemetry backend:                             │
     │   ServiceNow Cloud Observability                 │
     │   (formerly Lightstep)                           │
     └──────────────────────────────────────────────────┘

Numbers & shape

Datum	Value	Source
Year Node.js SDK built	2022 (end of)	Post
Year browser SDK rolled out in Rendering Engine	early 2024	Post
Node.js SDK adoption	53 applications by 2024-07	sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads
Peer package trialled (added page weight)	~400 KB	Post
Final browser SDK page weight	~30 KB	Post
Weight reduction via cherry-pick	>13×	Derived
Default OTel-JS histogram buckets	15 boundaries (16 buckets), 0-10000ms	OTel source
Zalando LCP buckets (approx)	~32 buckets, 0-5000ms	Post
Zalando CLS buckets	~32 buckets, 0-2 range	Post
Web Vitals covered	FCP, LCP, INP, CLS	Post
Auto-instrumentations default-on (Node)	HTTP (toggle); Express.js (optional flag)	Post
Metrics default-on (Node)	CPU, memory, GC, event-loop lag	Post

Caveats

No numbers on client-side SDK adoption. The post names Rendering Engine as the first web framework instrumented with the browser SDK but doesn't disclose how many other Zalando frontend apps have adopted it, or what fraction of page loads are now instrumented. The 53-app figure from the prior post is Node.js-only.
No numbers on WAF false-positive rate. The web-checkout-WAF-block scenario is the motivating example for the browser SDK, but the post doesn't disclose what the post-rollout measurement of customer-facing false-positives was — presumably available in the client-side trace data but not reported here.
No disclosure on how custom-histogram-bucket values were chosen. The LCP buckets look approximately log-normal around 800ms; the post does not say whether they were derived from the observed p50-p99 shape, from Google's Web Vitals thresholds for LCP (good ≤2500ms, needs-improvement ≤4000ms, poor >4000ms), or from manual tuning.
No data on CBO migration progress. Moving CBOs to the client is named as the next step; the post doesn't report how many CBOs have been migrated or what the alert-precision lift has been.
Decommissioning timeline for the custom RUM service not stated. "We no longer need our custom setup to collect metrics and can happily de-commission it soon" — no date, no ownership transition detail.
OTel events API roadmap for browser metrics not committed. The author flagged it at KubeCon Paris as a better shape for single-value metrics than histograms, but there is no commitment or timeline to migrate the bucket-based web-vital metrics to the events API.
The worker-threads article promised the SDK's internal shape "in a subsequent post". This post delivers some of that shape but stops short of the internal architecture of the Node SDK (how the wrapper proxies OTel methods, how auto-instrumentations are registered, exporter configuration defaults). A third post may eventually fill this in.
GDPR consent mechanism is not detailed — how the SDK obtains the consent signal (cookie? consent-manager SDK? prop?), and what the degradation shape is when the user hasn't consented (spans are buffered and discarded? not created at all?) are not covered. The pattern is named but the implementation isn't.