ZALANDO 2023-04-12

Zalando — How Software Bill of Materials change the dependency game

Summary

A Zalando platform engineer describes how Zalando treats Software Bills of Materials (SBOMs) not as a compliance artifact but as first-class fleet-wide dependency data: every deployed application has its SBOM extracted from its container image (not its source code) and published as a curated dataset in the company data lake, queryable by any engineer via SQL. The post's central architectural claim is that an SBOM corpus over the whole fleet turns dependency governance from a per-repo pull-request-chasing game (dependabot, scala-steward, maven-versions-plugin) into a cross-fleet analytics workload — answering questions like "which of my thousands of microservices use log4j 2.14?" in minutes instead of weeks.

The post canonicalises three concrete operational wins: (1) the log4j / Log4Shell mass-patch playbook — query the SBOM corpus for affected apps, auto-generate change-sets per build-tool type, centrally track patch progress; (2) the Akka license change footprint assessment; (3) a dependency-bloat audit that discovered Java teams importing the full AWS SDK (200 MB+) instead of individual modules, with build-time and image-size gains after remediation.

The post also publishes empirical per-language dependency-count distributions (Python lowest; JS/TS highest at 5–10× Java) with named outliers (jupyter for Python at 2.5× next-biggest; tableau for Java at 3.14× next-biggest). Honest caveats on SBOM data quality for JVM — divergent package names / group IDs, and uber-jars that flatten away the java-archive metadata the scanner needs — are flagged as adoption gotchas. No fleet size / QPS / latency numbers are given (the post is at the platform-capability altitude, not the per-app-runtime altitude), but the architectural shape is well-formed and maps cleanly to established supply-chain-security primitives.
In scope as a Zalando platform / supply-chain-security disclosure; opens a new Zalando axis (supply-chain / SBOM-driven dependency governance) and canonicalises several primitives already implicit in the wiki but not yet first-class.
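
To make the "one SQL query answers a fleet-wide question" idea concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the data-lake data set. The table name `sbom_dependencies`, its columns, the application and team names, and the helper `apps_using` are all hypothetical; the post does not disclose Zalando's schema or query engine.

```python
import sqlite3

# Hypothetical schema: one row per (application, dependency) pair taken from
# each application's container-extracted SBOM. All names and rows below are
# illustrative assumptions, not Zalando's actual data set.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sbom_dependencies (
        application     TEXT,
        owning_team     TEXT,
        package_name    TEXT,
        package_version TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO sbom_dependencies VALUES (?, ?, ?, ?)",
    [
        ("checkout-api", "team-payments", "log4j-core", "2.14.1"),
        ("checkout-api", "team-payments", "jackson-databind", "2.13.0"),
        ("search-indexer", "team-search", "log4j-core", "2.17.1"),
        ("price-feed", "team-pricing", "log4j-core", "2.14.0"),
        ("notebook-svc", "team-ml", "requests", "2.28.1"),
    ],
)


def apps_using(conn, package_name, version_prefix):
    """Return (application, owning_team, version) rows for a package whose
    version starts with the given prefix, ordered by application name."""
    cursor = conn.execute(
        """
        SELECT application, owning_team, package_version
        FROM sbom_dependencies
        WHERE package_name = ? AND package_version LIKE ? || '%'
        ORDER BY application
        """,
        (package_name, version_prefix),
    )
    return cursor.fetchall()


affected = apps_using(conn, "log4j-core", "2.14.")
for application, team, version in affected:
    print(f"{application} ({team}) ships log4j-core {version}")
```

Carrying the owning team on each row is what lets the same query drive notification routing, as the post describes for its image metadata.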

Key takeaways

  1. Ship the SBOM with every deploy — from the container, not the source tree. Zalando generates an SBOM for every application it deploys, based on its container image (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). This catches OS-level packages + app-level dependencies in one artifact, whereas source-tree scanners miss whatever the base image contributes. The generator used at container scan time is syft (from Anchore), which emits a portable format — CycloneDX or SPDX — that downstream tools like grype parse for CVE correlation. See concepts/container-extracted-sbom for the generation-locus trade-off vs source-tree SBOMs.

  2. SBOM as data-lake asset, not per-repo compliance file. Zalando publishes "a curated data set containing dependency data from the SBOM for every application we deploy … in our data lake and thus can be easily queried and visualized by any engineer" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). This is the central architectural primitive of the post — patterns/sbom-as-queryable-data-lake-asset. The shift from "each repo scans itself and opens PRs" to "one SQL query answers a fleet-wide question" is what makes mass-patch response feasible.

  3. Log4Shell-class mass patch: query → change-set → PR fan-out. The post names the pattern verbatim: "For large-scale patch actions (like the famous log4j upgrade), we prepare change sets for different types of build files and automate the Pull Request creation across all repositories. This allows for central tracking of the patch progress and requires minimal support from the team for the deployment." (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). Canonicalised as patterns/vulnerability-fleet-sweep-via-sbom-query. The Docker-image metadata also carries the owning team, so notifications route automatically.

  4. Akka license change turned an ecosystem risk into a scoped query. The 2022 Akka re-licensing (Apache-2.0 → Business Source License, commercial beyond certain revenue thresholds) forced every large JVM shop to assess exposure. Zalando frames this as a direct SBOM use-case alongside log4j: "upgrades to major versions of libraries, changes in licensing of open-source libraries (for example Akka) create the need to understand the library footprint to assess the need for action or migration costs" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). The SBOM corpus makes "which apps link an Akka module?" a one-query question.

  5. Empirical per-language dependency-count distributions. The post publishes observed percentile curves across Zalando's fleet: Python has the lowest dependency count per application; Go is 1.4–2× Python; Java (covering Java, Kotlin, Scala — the SBOM scanner detects java-archives) is 2–3× Go; JavaScript/TypeScript is 5–10× Java (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). Named outliers: Python — jupyter at 2.5× the next-biggest Python app; Java — tableau at 3.14× the next-biggest Java app. Growth is described as exponential across the application-popularity percentile axis. Canonicalised as concepts/dependency-count-by-language-ecosystem.

  6. SBOM-driven dependency-bloat audit discovered AWS SDK over-import. "We noticed that some applications were using the full SDK (200MB+ in Java) instead of its individual modules. Addressing this finding helped reduce build times and lower resulting docker image size significantly." (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). Canonicalised as patterns/sbom-driven-dependency-bloat-audit — a variant of transitive reachability analysis where the reachable set is sourced from an SBOM corpus rather than a language-native tool like goda reach.

  7. Internal library adoption curves reveal template-drift. One graph traces three quarterly snapshots of an internal JVM/Kotlin library: versions 0.22.0+ show expected sawtooth adoption (previous version drops as next is picked up), but version 0.21.0 usage "constantly increases" despite three newer versions being available. The author hypothesises "new applications are created by using the same application template, which misses the dependency update" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game) — a cross-cut with patterns/template-project-nudges-consistency: the nudging template also nudges staleness if its pinned deps aren't refreshed.

  8. JVM SBOM data quality is the adoption gotcha. Two failure modes named: (a) divergent package names / group IDs across scans, making cross-app correlation harder; (b) uber-jars flatten metadata: "some SBOMs did not show any java-archive entries, because the team's build process flattened all dependencies into an uber-jar and the required metadata needed for library detection was lost" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). Canonicalised as concepts/uber-jar-metadata-loss. Recommendation: "caution when using SBOM tools and double-checking that the SBOM generation works correctly for all applications."

  9. Dependabot / scala-steward / versions-plugin are framed as insufficient on their own. The opening paragraph positions per-repo update tooling — dependabot, scala-steward, maven-versions-plugin, gradle-versions-plugin — as "catch-up game" visibility only: "Playing the catch-up game and getting some visibility through incoming pull requests or changes is far from great, though and we can do better here" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). The SBOM corpus is framed as the missing "complete picture of used dependencies over time" these per-repo tools can't provide. This is a direct complement to the wiki's existing patterns/dependency-update-discipline — per-repo is the tactical layer; SBOM-as-data-lake is the strategic fleet layer.

  10. Future direction: correlate dependency hygiene with DORA metrics. "we hope to be able to correlate dependency data with dependency hygiene practices, deployment frequency, change failure rates, and lead times for each application" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game). The long-term vision is to make dependency hygiene a measurable engineering-quality signal, joined against DORA metrics on a per-app basis.
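
The stuck-version signal from takeaway 7 can be approximated from snapshot counts alone. The sketch below is an assumption-laden illustration: the per-quarter application counts are made up (the post shows a graph, not numbers), version 0.24.0 is a hypothetical third newer release, and the rule "rising in every snapshot while a newer version exists" is one possible heuristic rather than the author's method.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.21.0' into (0, 21, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def stuck_versions(snapshots: List[Dict[str, int]]) -> List[str]:
    """Return versions whose usage rose in every consecutive snapshot even
    though a newer version is present in the latest snapshot."""
    if len(snapshots) < 2:
        return []  # a trend needs at least two points
    latest = snapshots[-1]
    newest = max(latest, key=parse_version)
    stuck = []
    for version in sorted(latest, key=parse_version):
        # Versions absent from an early snapshot count as zero users.
        counts = [snapshot.get(version, 0) for snapshot in snapshots]
        rising = all(later > earlier for earlier, later in zip(counts, counts[1:]))
        if rising and parse_version(version) < parse_version(newest):
            stuck.append(version)
    return stuck


# Hypothetical counts of applications per library version, one dict per quarter.
quarterly_snapshots = [
    {"0.21.0": 40, "0.22.0": 30, "0.23.0": 5},
    {"0.21.0": 46, "0.22.0": 18, "0.23.0": 25, "0.24.0": 4},
    {"0.21.0": 53, "0.22.0": 9, "0.23.0": 14, "0.24.0": 30},
]

print(stuck_versions(quarterly_snapshots))
```

In this made-up data the newer versions rise and fall as successors are picked up, while 0.21.0 keeps growing, which is the shape the post attributes to an application template with a stale pinned dependency.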

Systems, concepts, patterns extracted

Systems (SBOM tooling stack)

  • systems/cyclonedx — OWASP-stewarded SBOM format; named by Zalando as a "common format" for portability + tooling integration.
  • systems/spdx — Linux Foundation SBOM format; named alongside CycloneDX as the other canonical format.
  • systems/syft — Anchore's SBOM generator; extracts dependencies from container images, source trees, or archives.
  • systems/grype — Anchore's vulnerability scanner that consumes syft-generated SBOMs to identify CVEs.
  • systems/dependabot — GitHub's per-repo dependency-update bot; named in the opening paragraph as the canonical "per-repo catch-up" tool the SBOM approach complements.
  • systems/scala-steward — Scala-specific equivalent of dependabot; named alongside.
  • systems/log4j — named explicitly in the "vulnerabilities in commonly used libraries (e.g. log4j, spring, commons-text)" enumeration; the canonical forcing function for SBOM adoption industry-wide (Log4Shell, CVE-2021-44228).
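
As a rough illustration of how a portable SBOM format feeds a curated data set, the sketch below flattens a small CycloneDX-style JSON document into one row per component. The sample document is hand-written and heavily trimmed, the application name is made up, and the row shape and the helper `flatten_components` are hypothetical; the post does not describe Zalando's ingestion code.

```python
import json

# A hand-written, minimal CycloneDX-style JSON document. Real SBOMs carry many
# more fields; the component values here are illustrative only.
SAMPLE_BOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "name": "log4j-core", "version": "2.14.1",
     "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1",
     "licenses": [{"license": {"id": "Apache-2.0"}}]},
    {"type": "library", "name": "openssl", "version": "1.1.1n",
     "purl": "pkg:deb/debian/openssl@1.1.1n"}
  ]
}
"""


def flatten_components(bom, application):
    """Turn a parsed CycloneDX-style document into one dict per component,
    tagged with the application it was extracted from (hypothetical row shape)."""
    rows = []
    for component in bom.get("components", []):
        # Licenses are optional; keep only entries that carry an SPDX-style id.
        license_ids = [
            entry["license"]["id"]
            for entry in component.get("licenses", [])
            if "id" in entry.get("license", {})
        ]
        rows.append(
            {
                "application": application,
                "name": component["name"],
                "version": component.get("version"),
                "purl": component.get("purl"),
                "licenses": license_ids,
            }
        )
    return rows


rows = flatten_components(json.loads(SAMPLE_BOM), "checkout-api")
for row in rows:
    print(row["application"], row["name"], row["version"], row["licenses"])
```

Keeping the license ids on each row is what would make a license-change question (such as the Akka case) answerable from the same data as a CVE question.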

Concepts (dependency-governance primitives)

  • concepts/sbom-software-bill-of-materials — what an SBOM is, what it contains (package name, version, license per entry), CycloneDX vs SPDX, source-tree vs container-extraction loci.
  • concepts/container-extracted-sbom — the specific discipline of scanning the built container image rather than the source tree, and why (catches OS packages + detects what actually ships, not what the build system thinks ships).
  • concepts/dependency-count-by-language-ecosystem — the empirical observation that dependency counts grow exponentially across application percentiles and vary by 1–2 orders of magnitude across language ecosystems (Python < Go < Java < JavaScript).
  • concepts/uber-jar-metadata-loss — the JVM-specific gotcha where a shaded / fat / uber-jar flattens dependency metadata such that SBOM tools can't detect the constituent libraries.
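
The post's advice to double-check SBOM generation suggests a simple sanity check for the uber-jar case: flag applications expected to be JVM-based whose SBOM contains no java-archive entries. The sketch below assumes each SBOM has already been reduced to a list of packages with a `type` label, and that a separate source (here the hypothetical `jvm_apps` set) says which applications are JVM-based. Application names and the function name are made up.

```python
def suspect_uber_jar_sboms(sboms, jvm_apps):
    """Return JVM applications whose SBOM lists no java-archive packages,
    a hint that dependency metadata may have been flattened away."""
    suspects = []
    for application in sorted(jvm_apps):
        packages = sboms.get(application, [])
        has_java_archive = any(pkg["type"] == "java-archive" for pkg in packages)
        if not has_java_archive:
            suspects.append(application)
    return suspects


# Hypothetical per-application package lists derived from container scans.
sboms = {
    "checkout-api": [
        {"name": "openssl", "type": "deb"},
        {"name": "log4j-core", "type": "java-archive"},
    ],
    "price-feed": [
        # Only OS packages were detected: the application jar yielded nothing.
        {"name": "openssl", "type": "deb"},
        {"name": "zlib1g", "type": "deb"},
    ],
    "notebook-svc": [
        {"name": "requests", "type": "python"},
    ],
}
jvm_apps = {"checkout-api", "price-feed"}

print(suspect_uber_jar_sboms(sboms, jvm_apps))
```

A hit is only a prompt to inspect the build, not proof of an uber-jar; a missing or failed scan would surface the same way.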

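Patterns

  • patterns/sbom-as-queryable-data-lake-asset — publishing SBOM-derived dependency data for every deployed application as a curated, queryable data set in the data lake.
  • patterns/vulnerability-fleet-sweep-via-sbom-query — the mass-patch playbook: query the SBOM corpus for affected applications, prepare change sets per build-file type, automate pull-request creation, and track patch progress centrally.
  • patterns/sbom-driven-dependency-bloat-audit — using the SBOM corpus to find oversized dependencies, such as the full AWS SDK imported instead of its individual modules.
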
Caveats

  • No fleet size / QPS / latency / storage numbers. The post sits at the platform-capability altitude — it discloses what the SBOM pipeline enables, not how many applications are scanned, what the ingestion throughput is, what the data lake storage footprint looks like, or how quickly a Log4Shell-class query returns over the corpus. No retention window for historical SBOM snapshots is disclosed.
  • No SBOM-pipeline implementation details. How SBOMs flow from CI/CD into the data lake is unnamed (Kafka? S3 + Glue? direct JDBC? batch vs streaming ingestion?). The "data lake" is a single-phrase reference; the storage format (Parquet? Iceberg?) is not disclosed. Query engine / BI tool is unnamed.
  • No quantified log4j response. The post claims "very low time it takes us to analyze the impact of the Akka license change or CVEs," but doesn't quote how long "low" is, how many apps were affected, what fraction self-patched via automated PRs vs required human intervention, or what the MTTR from CVE announcement to fleet-remediation was.
  • No SBOM-consumer API. Any app team can query, but the post doesn't disclose whether there's a programmatic API (REST? GraphQL? direct SQL?), an alerting / subscription hook (e.g. "notify me when any app in my bindle adds a dependency with a critical CVE"), or integration with runtime (e.g. gating deploys on SBOM-derived policy).
  • Adoption footprint undisclosed. Whether SBOM generation is mandatory for deploys, opt-in, enforced by build tooling, or self-service is unstated. The "every application we deploy" phrasing implies mandatory, but coverage percentages aren't given.
  • No named author. The post is labelled as Zalando Engineering generally, without a named author / team — typical of their platform-capability posts, in contrast to the named-author write-ups like the PostGIS post.

Operational numbers disclosed

  • AWS SDK full vs modules: "200MB+ in Java" for the full SDK — the quantified finding that triggered the dependency-bloat audit.
  • Per-language dependency-count ratios: Python 1× → Go 1.4–2× → Java 2–3× of Go → JavaScript/TypeScript 5–10× of Java.
  • Named outliers: jupyter at 2.5× the next-biggest Python app's dependency count; tableau at 3.14× the next-biggest Java app's.
  • Internal library versions discussed: 0.21.0, 0.22.0+ as the canonical sawtooth-vs-stuck-version illustration across three quarterly snapshots.
  • Named CVE sources: log4j, spring, commons-text as the "commonly used libraries" cluster; openssl as the example of a project that "preannounces security updates allowing for more preparation time".
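
The chained per-language ratios above can be composed into ranges relative to the Python baseline. This is back-of-the-envelope arithmetic on the ratios quoted in the post; multiplying range endpoints is a simplifying assumption, since the post reports ratios along percentile curves rather than as fixed multipliers.

```python
import math


def compose(base, factor):
    """Multiply a (low, high) range by a (low, high) ratio range."""
    return (base[0] * factor[0], base[1] * factor[1])


# Ratios as quoted in the post; Python is the 1x baseline.
python_range = (1.0, 1.0)
go_range = compose(python_range, (1.4, 2.0))   # Go is 1.4-2x Python
java_range = compose(go_range, (2.0, 3.0))     # Java is 2-3x Go
js_range = compose(java_range, (5.0, 10.0))    # JS/TS is 5-10x Java

for name, (low, high) in [
    ("Go", go_range),
    ("Java", java_range),
    ("JavaScript/TypeScript", js_range),
]:
    print(
        f"{name}: {low:.1f}x to {high:.1f}x Python "
        f"(log10: {math.log10(low):.2f} to {math.log10(high):.2f})"
    )
```

Under this composition Java lands at roughly 2.8× to 6× Python and JavaScript/TypeScript at roughly 14× to 60× Python, that is, between one and two orders of magnitude above the Python baseline.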

Positioning on the wiki

  • Opens Zalando's twelfth canonical axis: Supply-chain security / SBOM-driven dependency governance. Complements the existing eleven architectural axes (Postgres-on-K8s kernel-latency, experimentation platform, mobile testing, JVM integration testing, Cyber-Week load-automation, unified GraphQL BFF, frontend platform evolution, JVM language governance, MDM / knowledge-graph, Postgres-on-K8s geospatial, ML platform).
  • First wiki canonical instance of SBOM as a first-class primitive. The wiki previously covered supply-chain topics at the per-language altitude (rustls / cargo-audit-style content via Fly.io) and vulnerability-response at the OS-library altitude (concepts/os-library-vulnerability-ungovernable via Meta/WhatsApp). This post adds the fleet-wide / data-lake / cross-language altitude.
  • First wiki canonical instance of CycloneDX / SPDX / syft / grype as systems.
  • Extends patterns/dependency-update-discipline with Zalando as a second seen-in instance at a complementary altitude: Fly.io's instance is single-project, tactical, per-dep-update discipline; Zalando's is fleet-wide, strategic, SBOM-corpus-driven. Both are real; they compose.
  • Extends concepts/transitive-dependency-reachability with an SBOM-corpus-driven variant of the reachability-audit discipline — Datadog's instance uses goda reach; Zalando's uses SBOM queries across a fleet.

Source
