Zalando — Node.js and the tale of worker threads¶
Summary¶
Fabien Lavocat (Zalando Engineering, 2024-07-24) tells a production-incident story from April 2022 about the Zalando campaign service (a Node.js application) running on Kubernetes. A Friday-night page fired because the translation service (which the author's team owns) was receiving 20× its normal request rate from the campaign service; its p99 latency jumped 100 ms → 500 ms and its error rate went 0 % → 4 %. The campaign service was simultaneously burning through CPU/memory allocations. Two services were destabilising each other in what looked like a positive feedback loop. The first, quick fix reduced pod counts on the campaign service and let the translation service scale; the loop seemed to stop — then resumed minutes later "as if in defiance of my gaming night". The longer fix removed the Node.js cluster-mode code from the campaign service; request volume to the translation service dropped from 20,000 → 100 req/min.
The Monday post-mortem revealed the first story was wrong. The file they had read (translation-fetcher.js, which called process.exit(1) on fetch failure) wasn't the one actually running — the live code (translation.js) had a proper fallback to fallbackTranslations. There was no true positive feedback loop from translation-service failures; that was a red herring.
The real root cause was deeper: a rare AWS instance placement (a 48-core host, where all other days had returned 4/8/16 cores), interacting with Node's cluster module using os.cpus().length — which returns host CPU count, not the container's CPU allocation — spawned 48 worker processes inside a single pod with only 2 GB memory and effectively 20 milli-CPU per process. Node.js itself then began killing worker threads to reclaim memory. The cluster.js handler respawned a worker the instant one died; new workers' startup path immediately fetched translations; the socket died mid-request when memory pressure killed the worker again. That explained the context cancelled / Response is closed errors at the translation service (client-side connection hangup, not server failure) and the read timeouts at the campaign service (workers had so little CPU budget they couldn't read the response before being killed).
The takeaway the author named: "Node.js simply starts killing worker threads when it needs to reclaim memory", and the absence of event-loop lag instrumentation on the campaign service meant they had to guess at this during incident response. The post closes by announcing Zalando's Node.js Observability SDK developed in response, now instrumenting 53 Node.js applications two years later, with a subsequent post promised on that SDK's internals.
Key takeaways¶
- os.cpus().length is a container-unaware primitive. When os.cpus() is called inside a Kubernetes container, the returned value reflects the host machine's core count, not the container's resources.cpu allocation. This is not a Kubernetes bug — the Linux /proc/cpuinfo that the call surfaces has always been host-scoped — but it is a load-bearing leaky abstraction when combined with the require('cluster') idiom of "spawn one worker per CPU". Zalando's campaign service had been running on AWS nodes with 4, 8, or 16 cores for months; it was only on a single April 2022 night that AWS placed the pod on a 48-core machine, at which point cluster.js loyally forked 48 workers inside a single 2 GB / 1-CPU-request pod. concepts/os-cpus-container-leak canonicalises this primitive-leak class; concepts/nodejs-cluster-mode canonicalises the cluster idiom and why it's contraindicated in Kubernetes. (Source: sources/2024-07-24-zalando-nodejs-and-the-tale-of-worker-threads)
- Node.js cluster mode in Kubernetes is discouraged. The post names the anti-pattern explicitly: "using cluster mode for Node.js in a Kubernetes environment is discouraged because Kubernetes can help you do this in a simple way out of the box, for example by setting cpu request to 1000m to allocate one CPU core per pod." The long-term fix was to delete cluster mode, add more pod replicas, and let Kubernetes do the horizontal scaling — the native unit for concurrency at the container-orchestration altitude. The pattern patterns/kubernetes-replicas-over-in-process-workers canonicalises the choice.
- Node.js kills worker threads under memory pressure. The author spent most of the post-mortem looking for a garbage-collection or stack-trace signal for why workers were exiting; none existed. The resolution was found by reproducing locally: giving a Docker container 1 CPU + 1000 MB while forcing 50 cluster workers, he saw the same "Worker fragment (pid: X) died / Worker Y started" pattern after ~25 workers had started. The lesson: Node.js unilaterally kills worker threads (likely via OOM-like reclamation triggered internally, not by the Linux OOM killer on the container cgroup) to free memory back to the runtime. Combined with the cluster.js respawn-on-exit handler, this creates an infinite death-spiral loop. concepts/worker-respawn-death-spiral canonicalises the shape: memory pressure → worker-kill → immediate respawn → startup work (here: fetch translations) → more memory pressure.
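The "immediate respawn" half of the spiral is fixable even where a supervisor pattern stays. A minimal sketch of respawn-with-backoff; all names here are illustrative, and the post's cluster.js had no such delay:

```javascript
// Illustrative respawn-with-backoff: instead of forking a replacement
// the instant a worker dies, wait exponentially longer for each
// consecutive death, so memory-pressure kills cannot sustain a spiral.
function respawnDelayMs(consecutiveDeaths, baseMs = 200, capMs = 30_000) {
  // 1st death -> 200 ms, 2nd -> 400 ms, 3rd -> 800 ms ... capped at 30 s
  return Math.min(capMs, baseMs * 2 ** Math.max(0, consecutiveDeaths - 1));
}

// Shape of the exit handler (cluster.fork() elided; sketch only):
// cluster.on('exit', () =>
//   setTimeout(() => cluster.fork(), respawnDelayMs(++deaths)));
```

A healthy worker that stays up can reset the counter, so normal crashes still recover quickly while a spiral is throttled.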
- The positive-feedback-loop diagnosis was wrong — it was a red herring. The Friday-night team read translation-fetcher.js, saw process.exit(1) on fetch failure, and concluded that translation-service slowness was killing workers → respawning workers → re-hitting the translation service → more slowness. Monday's deeper look showed the live entry point was translation.js, which had a proper catch clause returning fallbackTranslations on failure. The translation-service response time had no causal role in the worker-respawn rate; workers were being killed by Node memory pressure regardless of whether the fetch succeeded. The fact that both systems appeared to recover when pods were reduced (and then both deteriorated again) looks like positive feedback but was actually parallel symptoms of a single root cause (too many workers in too little memory). Named explicitly: "there never was a positive feedback loop with the translation service, it was all up in our heads and I felt a bit stupid about it." concepts/red-herring-postmortem canonicalises the anti-pattern: reading code unrelated to the execution path and fitting a causal story to the visible symptoms. The incident shows why reading the live entry point first — not the most plausibly named file — is an incident-response discipline.
- Two different services' errors came from one underlying fact: worker death during in-flight HTTP. The translation-service logs showed java.lang.IllegalStateException: Response is closed; distributed tracing at the campaign service showed context cancelled spans. Both were the same event: a Node worker exited mid-request, so the TCP socket closed from the client side, which the Java translation-service code surfaces as "response is closed" and the Node-side tracing surfaces as "context cancelled". Combined with read timeouts (workers had ~20 milli-CPU each and couldn't read the response body in time), the error signature is a multi-observable composite of a single underlying cause. The incident-response misstep was treating these as independent clues pointing at a translation-service-to-campaign-service interaction, when they were two views of the same in-pod worker death.
- Event-loop lag would have been the right signal. Named explicitly: "this information was not readily available to us because the campaign service did not instrument its event loop lag, the degradation of which is a common root cause of API call read timeouts." Event-loop-lag measurement (a monotonic timer callback every N ms, measuring actual elapsed time vs expected) surfaces the class of single-threaded-runtime starvation that makes read-timeout errors look like client bugs when they are really producer-side CPU starvation. This instrumentation gap is what motivated Zalando's subsequent Node.js Observability SDK (53 applications instrumented by 2024-07, promised as its own blog post); it is the observability counterpart to concepts/event-loop-blocking-single-threaded.
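A minimal sketch of that probe, assuming nothing beyond Node core (modern Node also ships perf_hooks.monitorEventLoopDelay, which packages the same idea as a histogram):

```javascript
// Event-loop-lag probe: schedule a timer every `intervalMs` and compare
// when it actually fires against when it should have. Lag above ~0 means
// something (CPU starvation, sync work, GC) held the loop.
function startLagProbe(intervalMs = 100, onLag = console.log) {
  let expected = process.hrtime.bigint() + BigInt(intervalMs) * 1_000_000n;
  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const lagMs = Number(now - expected) / 1e6; // positive = loop was starved
    onLag(Math.max(0, lagMs));
    expected = now + BigInt(intervalMs) * 1_000_000n;
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for the probe
  return () => clearInterval(timer); // stop function
}
```

Exported as a gauge, this is the metric that would have shown the campaign service's workers starving at ~20 milli-CPU each.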
- Rare host-shape placement was the trigger. The author searched 30 days of logs for the "number of CPUs" startup-log line and found the 48-core observation had occurred exactly twice: Friday's incident, and once on 2022-04-06 at 10:49, when the same cluster.js code had spawned many workers, one pod over-utilised both CPU and memory and was killed repeatedly, and the translation service scaled 4→20 pods on the back of it (despite only 2× requests) — but a replacement pod landed on a 16-core host and the system stabilised by accident. This is a blast-radius multiplier tied to cloud-scheduler randomness: the bug had existed for the full lifetime of the campaign service, but surfaced only on the occasional day when AWS handed out an unusually large instance. The real fix removed the dependency on host shape entirely by killing cluster mode.
- Observability is the post's closing pivot. The post ends by announcing an outcome: Zalando built a Node.js Observability SDK in direct response to this incident, and it was adopted across 53 applications over ~2 years. The commitment is operational, not rhetorical — a concrete "if-event-loop-lag-instrumentation-had-been-in-place-we'd-have-solved-this-Friday" realisation leading to a common-signals SDK as the cross-org remediation. The subsequent SDK post is promised but not in the 2024-07-24 write-up.
Systems / concepts / patterns extracted¶
Systems:
- Node.js — single-threaded runtime; cluster module spawns worker processes; kills them under memory pressure; os.cpus() returns host CPU count unconditionally.
- Kubernetes — container orchestrator; pods with resources.cpu / resources.memory are isolated from host CPU topology (but os.cpus() sees through).
Concepts (new):
- concepts/nodejs-cluster-mode — the require('cluster') idiom of forking one worker per CPU, and why it conflicts with container-native scaling.
- concepts/os-cpus-container-leak — os.cpus().length returns host core count, not container CPU allocation — a container-isolation leak at the /proc layer.
- concepts/positive-feedback-loop-cascading-failure — the failure-mode signature (two systems' degradations feeding each other) vs its confirmation discipline; this incident fits the signature but lacked the mechanism.
- concepts/worker-respawn-death-spiral — memory-pressure worker kills + immediate respawn + startup I/O = sustained-high-load death spiral; generalises beyond Node.js (any supervisor pattern that respawns on exit without backoff).
- concepts/event-loop-lag-instrumentation — timer-callback probe for single-threaded runtime starvation; the observability primitive missing in the campaign service.
- concepts/red-herring-postmortem — incident-response anti-pattern: reading a file that looks relevant to the symptom (here: one with the same domain term in its filename, containing a literal process.exit) without confirming it's on the live call path.
Concepts (extended):
- concepts/event-loop-blocking-single-threaded — a new Seen-in entry for Node.js at Zalando scale; the observability gap makes event-loop blocking invisible.
- concepts/blast-radius — the host-shape-placement multiplier: the same bug was safe on 4/8/16-core hosts, catastrophic on 48-core hosts; cloud-scheduler randomness is a blast-radius axis.
Patterns (new):
- patterns/kubernetes-replicas-over-in-process-workers — at the container-orchestrator altitude, prefer more pod replicas over in-process worker forking; the cpu: 1000m + replicas shape lets Kubernetes do the horizontal scaling.
- patterns/memory-induced-worker-kill-death-spiral — the anti-pattern this post exposes; named for identification in code review and incident response.
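The cpu: 1000m + replicas shape the first pattern names can be sketched as a Deployment fragment; everything here (names, replica count, memory figures) is illustrative, not taken from the post:

```yaml
# Hedged sketch: scale horizontally with replicas, one core per pod,
# instead of forking one process per host CPU inside a single pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: campaign-service        # illustrative name
spec:
  replicas: 8                   # horizontal scaling lives here, not in cluster.fork()
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "1000m"      # one CPU core per pod, as the post recommends
              memory: "2Gi"
            limits:
              memory: "2Gi"
```

With this shape the host's core count is irrelevant to the process model, which is exactly the host-shape dependency the incident exposed.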
Operational numbers¶
| Metric | Value | Note |
|---|---|---|
| Translation service p99 latency (normal → degraded) | 100 ms → 500 ms | |
| Translation service error rate (normal → degraded) | 0 % → 4 % | "slow burn error" |
| Campaign service request rate to translation service (normal → degraded) | 1,000/min → 20,000/min | 20× amplification |
| Workers spawned per pod (48-core host) | 48 | One-per-CPU loyalty of cluster.js |
| Memory per worker (48-core / 2 GB pod) | ~40 MB | "10,000× the Apollo guidance computer" |
| CPU per worker (48-core / 1 CPU pod) | ~20 milli-CPU | Single-thread starvation |
| Worker death rate (log observation) | 20+/sec | Cluster-wide respawn |
| Request-rate drop after fix (cluster-mode removed) | 20,000 → 100/min | |
| Normal AWS host core count | 4, 8, or 16 | 30-day log survey |
| Rare AWS host core count | 48 | Observed twice in 30 days (2022-04-06, incident night) |
| Translation-service replica scaling | 4 → 20 | Previous 2022-04-06 near-incident |
| Node.js Observability SDK adoption | 53 apps | By 2024-07 (2+ years after incident) |
| Local-repro trigger | 1 CPU, 1000 MB, 50 cluster workers | Forced Worker fragment (pid: X) died after ~25 worker starts |
Architectural context and caveats¶
- Campaign service is not on the customer critical path. The 4 % error rate burned for several hours before anyone paged, because the service is not customer-facing enough for operational dashboards to alert on it. The translation service team paged the campaign service owners, not Zalando's customer-facing alert path. This is why the incident's duration wasn't minutes — it was hours — and also why a post-mortem this substantive was possible: the blast radius at the business layer was contained, even while the infrastructure-layer symptoms were severe.
- Kubernetes cgroup memory limits matter. The post doesn't name oom_kill, and the worker death pattern (internal, inside Node.js' process space) is not the Linux OOM killer reaping the pod. But the trigger is still memory pressure inside the container cgroup; if the pod had been given 48× more memory (96 GB instead of 2 GB), Node presumably wouldn't have felt the need to kill workers. The actual remediation wasn't bigger memory — it was fewer workers, achieved by removing cluster mode entirely and letting Kubernetes replicas take over.
- The rollout was not routine. The author notes that "deploying the service to production" was itself a struggle: the service hadn't been deployed in a while and the team was missing permissions. This is a deployment-freshness issue that independently predicts incident severity: services that rarely ship accumulate deployment-pipeline rot, which compounds with incident pressure. Deliberately understated in the post but worth noting for axis-5 (Cyber-Week-prep) completeness.
- No explicit follow-up yet on the Node.js Observability SDK internals. The 2024-07 post promises the SDK's architectural details as a subsequent post. As of this ingest (April 2026), it may or may not have shipped — the wiki search of existing Zalando sources doesn't surface a dedicated Node.js Observability SDK post. The wiki should link the two when the follow-up lands.
- No distributed-tracing instrumentation on the campaign service. The post names this directly: "the campaign service was not instrumented so we could not get much out of our tracing tooling." Distributed tracing alone would not have surfaced the event-loop-lag signal (that's a lower-level runtime metric), but the absence of traces forced the Friday-night team into log-diving and file-reading, which is what enabled the translation-fetcher.js red herring. Observability absence is itself a blast-radius multiplier on incident-response time.
- The cluster module is legacy Node.js concurrency. Modern Node.js (v12+) prefers worker_threads (shared memory, same-process threads) for CPU-parallel workloads, leaving cluster mode as a vestigial multi-process shape. The post's use of cluster.fork() and cluster.isMaster predates that shift; the service was coded before the Kubernetes migration and hadn't been re-architected. This is a legacy-code-meets-new-substrate incident at its core.
Gaps in the public record¶
- The translation service's Java-side hardening, if any, after the incident is not described. Did they add request-rate limits per client? A circuit breaker? The post focuses on the campaign service's fix (cluster-mode removal) and doesn't discuss whether the translation service changed.
- No numbers on incident total duration, engineering hours, or business impact. The duration is qualitative ("for several hours"); no MTTR, no revenue impact, no customer-facing downtime figure.
- No disclosure of how many Zalando services still run Node.js cluster mode. The author's recommendation is "don't do this"; whether Zalando did an org-wide audit and migration is not named.
- The Node.js Observability SDK's internal shape is promised but not delivered in this post. The reader learns that the SDK exists (53 apps) but not what signals it captures beyond event-loop lag.
- The AWS instance family and scheduling logic that occasionally placed the pod on a 48-core host is not named. Is this a Karpenter / Cluster Autoscaler heuristic? A workload-class affinity? The two observations in 30 days (2022-04-06 and incident night) suggest the frequency is tens-per-year at most for this service — rare enough that it's not picked up in routine capacity planning.
- No treatment of the Node.js worker-kill heuristic itself. The author reports the empirical behaviour ("Node.js simply starts killing worker threads when it needs to reclaim memory") but not the runtime's internal decision logic. Is it tied to a specific heap threshold? A V8 GC pressure signal? Worth a deeper Node.js runtime citation; not delivered here.
- No post-2022 confirmation that the fix held. The post is a 2024-07 narrative of an April-2022 incident; the author doesn't provide two-year production data on whether cluster-mode removal + replica-based scaling has been incident-free since. Reasonable to assume "yes, or we'd have heard about it", but not stated.
Source¶
- Original: https://engineering.zalando.com/posts/2024/07/nodejs-tale-worker-threads.html
- Raw markdown: raw/zalando/2024-07-24-nodejs-and-the-tale-of-worker-threads-5b6b264d.md
Related¶
- Systems: systems/nodejs, systems/kubernetes
- Concepts: concepts/nodejs-cluster-mode, concepts/os-cpus-container-leak, concepts/positive-feedback-loop-cascading-failure, concepts/worker-respawn-death-spiral, concepts/event-loop-lag-instrumentation, concepts/red-herring-postmortem, concepts/event-loop-blocking-single-threaded, concepts/blast-radius
- Patterns: patterns/kubernetes-replicas-over-in-process-workers, patterns/memory-induced-worker-kill-death-spiral
- Company: companies/zalando