Zalando Observability SDK (Node)
What it is
@zalando/observability-sdk-node is Zalando's thin
internal wrapper over upstream
OpenTelemetry core Node.js packages, built by the SRE
Enablement and Web Platform teams at the end of 2022. It
pre-configures Zalando-specific defaults and acts as a proxy
for all underlying OTel dependencies so service owners can
instrument a Node.js application with a single statement,
new SDK().start()
(Source: sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando).
Why it exists
The SDK was built in direct response to the worker-threads incident (sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads): before 2023, the on-call team had "almost zero visibility" into what a misbehaving Node.js service was doing during production incidents. The original 2022-04 positive-feedback-loop incident was fixed without event-loop-lag instrumentation; the team guessed at the root cause. Post-incident, the author of that post called for a dedicated Node.js observability effort; this SDK is that effort.
What it configures
Without any constructor arguments, new SDK().start() does
the following (Source: sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando):
- Parses platform environment variables set by Zalando's Kubernetes platform for all deployed applications — the SDK is auto-configured without service-owner action.
- Registers auto-instrumentations:
  - HTTP module functions are monkey-patched to produce span data on network calls.
  - Optional Express.js instrumentation, enabled via a boolean flag in the (optional) constructor config.
- Enables built-in metrics collection at a configured interval:
  - CPU usage
  - Memory usage
  - Garbage collection metrics
  - Event-loop lag, the specific signal the worker-threads incident was blind to
- Enables span and metric exporters on a specified interval, shipping data to the telemetry backend (Lightstep / ServiceNow Cloud Observability).
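The startup behaviour listed above can be pictured as a configuration-resolution step: platform environment variables supply the defaults, and the optional constructor config only adds to them. The following is a minimal sketch under stated assumptions, not the SDK's actual code. The function names, the config shape, the EXAMPLE_* environment variable names, and the default intervals are all hypothetical; the source names only the behaviours, not the identifiers.

```typescript
// Hypothetical sketch: none of these identifiers come from the real SDK.
type Env = Record<string, string | undefined>;

interface SdkOverrides {
  express?: boolean;          // opt-in flag for Express.js instrumentation
  metricsIntervalMs?: number; // optional override of the metrics interval
}

interface ResolvedConfig {
  serviceName: string;
  metricsIntervalMs: number;
  exportIntervalMs: number;
  instrumentations: string[];
}

// Parse a positive number of milliseconds, falling back when unset or invalid.
function parseIntervalMs(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : fallback;
}

// Platform environment first, optional constructor overrides on top.
function resolveConfig(env: Env, overrides: SdkOverrides = {}): ResolvedConfig {
  const instrumentations = ["http"]; // HTTP instrumentation is on by default
  if (overrides.express === true) {
    instrumentations.push("express");
  }
  return {
    // EXAMPLE_* names are placeholders for whatever the platform really sets.
    serviceName: env["EXAMPLE_APPLICATION_ID"] ?? "unknown-service",
    metricsIntervalMs:
      overrides.metricsIntervalMs ??
      parseIntervalMs(env["EXAMPLE_METRICS_INTERVAL_MS"], 10000),
    exportIntervalMs: parseIntervalMs(env["EXAMPLE_EXPORT_INTERVAL_MS"], 5000),
    instrumentations,
  };
}

// No constructor arguments: everything comes from the environment.
const defaults = resolveConfig({ EXAMPLE_APPLICATION_ID: "checkout" });
// Opting in to Express.js instrumentation via the optional config.
const withExpress = resolveConfig({}, { express: true });
```

Keeping the resolution step a pure function of the environment and the overrides makes the zero-argument path easy to test without starting any exporters.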
OTel-context trade-off
The SDK does not use OTel's context API on the server
side. Instead, Zalando's engineers resorted to manual span
passing through function parameters
(tracer.startSpan("name", {}, context)), because the
differing API shape between OpenTelemetry's
tracer.startActiveSpan and OpenTracing's manual span-passing
"makes it difficult to migrate existing instrumentation
code, especially in a large codebase like ours". Zalando
already had substantial OpenTracing instrumentation in place
and needed migration to be easy.
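The manual span-passing style described above can be sketched as follows. This is an illustration under stated assumptions, not Zalando's code: SimpleSpan, ParentCarrier, and RecordingTracer are dependency-free stand-ins that only mirror the shape of tracer.startSpan(name, options, context), and the handler functions are hypothetical. With the real @opentelemetry/api package, the third argument would be an OTel Context, typically built with trace.setSpan(context.active(), parentSpan).

```typescript
// Stand-in span: records its name, its parent, and whether it has ended.
class SimpleSpan {
  ended = false;
  constructor(
    readonly name: string,
    readonly parent: SimpleSpan | undefined,
    private readonly sink: SimpleSpan[],
  ) {}

  end(): void {
    this.ended = true;
    this.sink.push(this);
  }
}

// Stand-in for an OTel Context that carries the parent span explicitly.
interface ParentCarrier {
  parentSpan?: SimpleSpan;
}

// Stand-in tracer mirroring startSpan(name, options, context).
class RecordingTracer {
  readonly finished: SimpleSpan[] = [];

  startSpan(
    name: string,
    _options: object = {},
    context: ParentCarrier = {},
  ): SimpleSpan {
    return new SimpleSpan(name, context.parentSpan, this.finished);
  }
}

// The parent span travels as an ordinary function parameter,
// as it would in OpenTracing-style instrumentation.
function loadProduct(
  tracer: RecordingTracer,
  parent: SimpleSpan,
  sku: string,
): string {
  const span = tracer.startSpan("load-product", {}, { parentSpan: parent });
  try {
    return `product:${sku}`;
  } finally {
    span.end();
  }
}

function handleRequest(tracer: RecordingTracer): string {
  const requestSpan = tracer.startSpan("handle-request");
  try {
    return loadProduct(tracer, requestSpan, "sku-123");
  } finally {
    requestSpan.end();
  }
}

const tracer = new RecordingTracer();
const result = handleRequest(tracer);
```

Because the parent is an explicit argument, existing OpenTracing-style call sites keep their signatures; the cost is that every function on the path has to accept and forward the span.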
Adoption
Within two years of being built (by 2024-07), 53 Node.js applications at Zalando were instrumented with this SDK (Source: sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads).
Relationship to sibling SDKs
Part of a 3-package family, tied together by an isomorphic API package:
@zalando/observability-api # shared types + API
@zalando/observability-sdk-node # this package
@zalando/observability-sdk-browser # sibling
The isomorphic structure makes it easy to instrument isomorphic applications — those running on both the server and the browser, typically pages served by the Rendering Engine.
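One way to picture the isomorphic arrangement: application code depends only on a shared contract, and each runtime supplies its own implementation. The sketch below is purely illustrative. ObservabilityApi, withSpan, createSdk, and renderProductPage are hypothetical names invented for this example; the source does not describe what the API package actually exports.

```typescript
// What a span record looks like in this toy model.
interface SpanRecord {
  name: string;
  runtime: "node" | "browser";
}

// Hypothetical shared contract, standing in for the API package.
interface ObservabilityApi {
  withSpan<T>(name: string, work: () => T): T;
}

// Hypothetical factory standing in for the two runtime-specific SDKs.
function createSdk(
  runtime: "node" | "browser",
  records: SpanRecord[],
): ObservabilityApi {
  return {
    withSpan<T>(name: string, work: () => T): T {
      try {
        return work();
      } finally {
        // Record the span even if the wrapped work throws.
        records.push({ name, runtime });
      }
    },
  };
}

// Isomorphic application code: it knows the contract, not the runtime.
function renderProductPage(api: ObservabilityApi, title: string): string {
  return api.withSpan("render-product-page", () => `<h1>${title}</h1>`);
}

const records: SpanRecord[] = [];
const serverHtml = renderProductPage(createSdk("node", records), "Shoes");
const browserHtml = renderProductPage(createSdk("browser", records), "Shoes");
```

The same renderProductPage function runs unchanged in both environments; only the object passed in differs.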
Seen in
- sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando — canonical disclosure; architecture, package layout, auto-instrumentation catalogue, manual-span-passing rationale.
- sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads — 53-apps-instrumented figure; motivating incident.
Related
- systems/opentelemetry — upstream standard this wraps.
- systems/zalando-observability-api — shared API package.
- systems/zalando-observability-sdk-browser — browser sibling.
- systems/nodejs — runtime.
- systems/lightstep — backend.
- concepts/observability-sdk-wrapper.
- patterns/observability-sdk-as-zalando-specific-wrapper.
- companies/zalando.