
PATTERN

SBOM as queryable data-lake asset

Intent

Treat every application's Software Bill of Materials as a first-class dataset in a central data lake, not as a per-repo compliance file or per-deploy artifact buried in an object store. Each deploy emits its SBOM as a row (or a small set of rows) in a shared table keyed by (application, deploy_timestamp, component_name, component_version, license). Any engineer with SQL access can then answer fleet-wide dependency questions in minutes instead of weeks.

This inverts the default shape of per-repo dependency tooling (dependabot, scala-steward, maven-versions-plugin, gradle-versions-plugin), which can only answer "what does this repo need to update?" — the SBOM-as-data-lake pattern answers "which of our thousands of repos contains dependency X at version Y right now?".
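The shared table's shape can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for the lake's query engine; the table and column names follow the key described above, and the sample rows are invented for demonstration.

```python
import sqlite3

# In-memory stand-in for the data-lake table, keyed exactly as the pattern
# describes: (application, deploy_timestamp, component_name,
# component_version, license).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sboms (
        application       TEXT,
        deploy_timestamp  TEXT,
        component_name    TEXT,
        component_version TEXT,
        license           TEXT
    )
""")
conn.executemany(
    "INSERT INTO sboms VALUES (?, ?, ?, ?, ?)",
    [
        ("checkout",  "2021-12-10T08:00:00Z", "log4j-core", "2.14.1", "Apache-2.0"),
        ("search",    "2021-12-10T09:00:00Z", "log4j-core", "2.17.0", "Apache-2.0"),
        ("inventory", "2021-12-10T10:00:00Z", "slf4j-api",  "1.7.32", "MIT"),
    ],
)

# The fleet-wide question: which apps contain dependency X at version Y?
# (String comparison suffices for this toy data; real version ordering
# needs a proper comparator.)
affected = conn.execute("""
    SELECT DISTINCT application, component_version
    FROM sboms
    WHERE component_name = 'log4j-core'
      AND component_version < '2.15'
""").fetchall()
print(affected)  # → [('checkout', '2.14.1')]
```

One query, one answer, no per-repo scripting.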

Architecture shape

CI/CD pipeline
    build container image ──► push to registry
          │
          ▼
    [syft](<../systems/syft.md>) scan
          │
          ▼
    SBOM (CycloneDX or SPDX JSON)
          │
          ▼
    ingestion
          │
          ▼
    data-lake tables (Parquet / Iceberg / Glue-cataloged / similar)
          │
          ▼
    SQL query engine (Athena / Presto / BigQuery / similar) + BI visualisation

    "Which apps link log4j 2.0-2.14?" → one query, minutes

The dataset is typically append-only per deploy (one snapshot per image build), enabling time-series analyses: "how did our log4j version distribution evolve over the quarter?", "what's the adoption curve of our internal shared library v0.22.0 vs v0.21.0?".
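The ingestion step above amounts to flattening each SBOM document into table rows. A hedged sketch for the CycloneDX JSON case, assuming the standard CycloneDX layout (a top-level `components` array whose entries carry `name`, `version`, and an optional `licenses` list); the function name and sample document are illustrative:

```python
import json

def sbom_to_rows(application, deploy_timestamp, cyclonedx_json):
    """Flatten one CycloneDX SBOM into (app, ts, name, version, license) rows.

    Assumes the standard CycloneDX shape: a top-level `components` array
    with `name`, `version`, and an optional `licenses` list per entry.
    """
    bom = json.loads(cyclonedx_json)
    rows = []
    for comp in bom.get("components", []):
        licenses = comp.get("licenses") or []
        license_id = (licenses[0].get("license", {}).get("id")
                      if licenses else None)
        rows.append((application, deploy_timestamp,
                     comp.get("name"), comp.get("version"), license_id))
    return rows

# Minimal CycloneDX-shaped document for illustration.
doc = json.dumps({
    "bomFormat": "CycloneDX",
    "components": [
        {"name": "log4j-core", "version": "2.14.1",
         "licenses": [{"license": {"id": "Apache-2.0"}}]},
        {"name": "slf4j-api", "version": "1.7.32"},
    ],
})
rows = sbom_to_rows("checkout", "2021-12-10T08:00:00Z", doc)
for row in rows:
    print(row)
```

Each deploy appends its rows; the snapshot timestamp is what makes the time-series analyses possible.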

Canonical wiki instance (Zalando 2023-04-12)

Zalando publishes "a curated data set containing dependency data from the SBOM for every application we deploy, based on its Container image. The data set is available in our data lake and thus can be easily queried and visualized by any engineer" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game).

The pattern's concrete wins at Zalando:

  • Log4Shell-class mass patch — query for affected apps, auto-generate change-sets per build-tool type, open PRs at fleet scale, centrally track progress (patterns/vulnerability-fleet-sweep-via-sbom-query).
  • Akka license-change footprint assessment — a single query across the corpus.
  • AWS SDK bloat audit — cross-app pattern-detection of apps importing the full SDK vs individual modules (patterns/sbom-driven-dependency-bloat-audit).
  • Internal library adoption curves — three quarterly snapshots show version 0.21.0 stuck while 0.22.0+ exhibits healthy sawtooth adoption; the working hypothesis is that template-project drift is the cause.
  • Per-language dependency-count percentile plots — descriptive statistics over the corpus (concepts/dependency-count-by-language-ecosystem).
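The adoption-curve analysis above reduces to grouping the append-only snapshots by period and version. A sketch, again with SQLite standing in for the lake's engine and invented quarterly data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sboms (application TEXT, deploy_timestamp TEXT, "
             "component_name TEXT, component_version TEXT, license TEXT)")
conn.executemany("INSERT INTO sboms VALUES (?, ?, ?, ?, ?)", [
    # Hypothetical quarterly snapshots of an internal shared library.
    ("checkout", "2023-Q1", "internal-lib", "0.21.0", None),
    ("search",   "2023-Q1", "internal-lib", "0.21.0", None),
    ("checkout", "2023-Q2", "internal-lib", "0.22.0", None),
    ("search",   "2023-Q2", "internal-lib", "0.21.0", None),
    ("checkout", "2023-Q3", "internal-lib", "0.22.0", None),
    ("search",   "2023-Q3", "internal-lib", "0.22.0", None),
])

# Version distribution per snapshot period: the raw material of an
# adoption curve ("is 0.21.0 stuck while 0.22.0 grows?").
curve = conn.execute("""
    SELECT deploy_timestamp AS period, component_version, COUNT(*) AS apps
    FROM sboms
    WHERE component_name = 'internal-lib'
    GROUP BY period, component_version
    ORDER BY period, component_version
""").fetchall()
for row in curve:
    print(row)
```

Feeding the same result set to a BI tool gives the adoption-curve plot directly.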

Why the data-lake shape matters

  • SQL is the lingua franca. Engineers who'd never write syft sbom.json | jq ... will write SELECT app, version FROM sboms WHERE component = 'log4j-core' AND version LIKE '2.%'.
  • Cross-cutting questions become cheap. "Which teams own apps that use deprecated library X?" joins SBOM rows with a team-ownership table. "Which licences do our apps link?" is SELECT DISTINCT license FROM sboms.
  • Time-series analytics come for free. Append-only per deploy means you can ask "when did we stop shipping log4j 2.14?" without special instrumentation.
  • Visualisation is trivial. Any BI tool that speaks the query engine (QuickSight, Tableau, Looker, Metabase) gets dependency dashboards with a few clicks.
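The "cross-cutting questions become cheap" point is literally one join. A sketch with an illustrative ownership table (names invented; Zalando derives ownership from Docker image metadata, per the infrastructure notes below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sboms (application TEXT, deploy_timestamp TEXT, "
             "component_name TEXT, component_version TEXT, license TEXT)")
conn.execute("CREATE TABLE ownership (application TEXT, team TEXT)")
conn.executemany("INSERT INTO sboms VALUES (?, ?, ?, ?, ?)", [
    ("checkout", "2023-01-01", "akka-actor", "2.6.19", "Apache-2.0"),
    ("search",   "2023-01-01", "slf4j-api",  "1.7.32", "MIT"),
])
conn.executemany("INSERT INTO ownership VALUES (?, ?)", [
    ("checkout", "team-payments"),
    ("search",   "team-discovery"),
])

# "Which teams own apps that use deprecated library X?" is one join
# between the SBOM rows and the team-ownership table.
teams = conn.execute("""
    SELECT DISTINCT o.team, s.application
    FROM sboms s JOIN ownership o ON o.application = s.application
    WHERE s.component_name = 'akka-actor'
""").fetchall()
print(teams)  # → [('team-payments', 'checkout')]
```

The same join pattern routes notifications for a fleet-wide patch sweep.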

Adjacent anti-patterns

  • SBOM in the deploy artifact only. Ships an SBOM next to each binary or container but never aggregates. Answers per-app questions, not fleet-wide ones; anything cross-cutting still requires scripting.
  • Per-repo dependency scanner with no aggregation. GitHub's Dependabot alerts show per-repo CVEs; aggregating across a thousand repos into "which repos use X" is a manual GraphQL API + scripting job.
  • Compliance-motivated SBOM generation only. Generate once per release, hand it to security or legal, never analyse at scale. Misses every operational use.
  • Per-deploy SBOM stored without a schema. Dump the syft-json into S3, call it done. No query engine, no SQL, no adoption. The engineering lever is the data-lake table schema, not the raw JSON.

Integration with per-repo tooling

The SBOM-as-data-lake pattern doesn't replace per-repo dependency-update discipline — it complements it. Per-repo tooling answers "what should this repo update?"; the SBOM corpus answers "which repos contain it?".

Both layers are real; they compose. The SBOM corpus tells you which repos to ask dependabot to update; dependabot does the per-repo update mechanic.

Infrastructure primitives this pattern assumes

  • A data lake (S3 + Iceberg, GCS + BigQuery, Delta Lake, etc.) for append-only analytic storage.
  • A SQL query engine (Athena, Presto, Trino, BigQuery, Snowflake).
  • A CI/CD integration that runs the SBOM generator on every build and ships the output to the ingestion path.
  • An application → team ownership table to route notifications (Zalando uses Docker image metadata).

Seen in
