PATTERN Cited by 1 source
SBOM as queryable data-lake asset
Intent
Treat every application's Software Bill of Materials as a
first-class dataset in a central data lake, not as a
per-repo compliance file or per-deploy artifact buried in an
object store. Each deploy emits its SBOM as a set of rows,
one per component, in a shared table keyed by
(application, deploy_timestamp, component_name,
component_version, license). Any engineer with SQL access can
then answer fleet-wide dependency questions in minutes instead
of weeks.
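The table shape described above can be sketched as follows. This is a minimal illustration, not Zalando's actual schema: Python's built-in `sqlite3` stands in for the data-lake query engine, the table name `sboms` and the helper `apps_with_component` are hypothetical, and the sample rows are invented. Only the column names come from the key listed in the text.

```python
import sqlite3

# In-memory SQLite stands in for the data-lake engine (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sboms (
        application       TEXT NOT NULL,
        deploy_timestamp  TEXT NOT NULL,  -- ISO-8601 string, sorts chronologically
        component_name    TEXT NOT NULL,
        component_version TEXT NOT NULL,
        license           TEXT
    )
    """
)

# Hypothetical sample rows: one row per (deploy, component).
rows = [
    ("checkout", "2023-01-10T09:00:00Z", "log4j-core", "2.14.1", "Apache-2.0"),
    ("checkout", "2023-01-10T09:00:00Z", "jackson-databind", "2.13.4", "Apache-2.0"),
    ("search", "2023-01-11T12:30:00Z", "log4j-core", "2.17.1", "Apache-2.0"),
    ("catalog", "2023-01-12T08:15:00Z", "akka-actor", "2.6.20", "Apache-2.0"),
]
conn.executemany("INSERT INTO sboms VALUES (?, ?, ?, ?, ?)", rows)


def apps_with_component(name: str, version: str) -> list[str]:
    """Return applications that ever deployed `name` at exactly `version`."""
    cursor = conn.execute(
        "SELECT DISTINCT application FROM sboms "
        "WHERE component_name = ? AND component_version = ? "
        "ORDER BY application",
        (name, version),
    )
    return [row[0] for row in cursor]


affected = apps_with_component("log4j-core", "2.14.1")
print(affected)
```

In a real deployment the same query would run against the lake's own engine; the point of the sketch is only that "which apps contain dependency X at version Y" reduces to a single filtered `SELECT`.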
This inverts the default shape of per-repo dependency tooling
(dependabot,
scala-steward, maven-versions-plugin,
gradle-versions-plugin), which can only answer "what does
this repo need to update?" — the SBOM-as-data-lake pattern
answers "which of our thousands of repos contains dependency
X at version Y right now?".
Architecture shape
CI/CD pipeline
│
▼
Build container image ──► push to registry
│
▼
[syft](<../systems/syft.md>) scan
│
▼
SBOM (CycloneDX or SPDX JSON)
│
▼
Ingestion
│
▼
Data-lake tables (Parquet / Iceberg / Glue-cataloged / similar)
│
▼
SQL query engine (Athena / Presto / BigQuery / similar) + BI visualisation
│
▼
"Which apps link log4j 2.0-2.14?" → one query, minutes
The dataset is typically append-only per deploy (one snapshot per image build), enabling time-series analyses: "how did our log4j version distribution evolve over the quarter?", "what's the adoption curve of our internal shared library v0.22.0 vs v0.21.0?".
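The ingestion step in the diagram might look roughly like the sketch below, which flattens one CycloneDX JSON document into one row per component. The function name `flatten_cyclonedx`, the row layout, and the sample document are assumptions for illustration; the source does not describe Zalando's ingestion code. The only format detail relied on is that CycloneDX JSON carries a top-level `components` array whose entries have `name`, `version`, and an optional `licenses` list.

```python
import json


def flatten_cyclonedx(sbom_json: str, application: str, deploy_timestamp: str):
    """Turn one CycloneDX JSON document into flat rows, one per component."""
    doc = json.loads(sbom_json)
    rows = []
    for component in doc.get("components", []):
        # CycloneDX license entries hold either a "license" object
        # (with "id" or "name") or an SPDX "expression" string.
        labels = []
        for entry in component.get("licenses", []):
            lic = entry.get("license", {})
            label = lic.get("id") or lic.get("name") or entry.get("expression")
            if label:
                labels.append(label)
        rows.append(
            {
                "application": application,
                "deploy_timestamp": deploy_timestamp,
                "component_name": component.get("name"),
                "component_version": component.get("version"),
                # Joined into one field so each component stays a single row.
                "license": ", ".join(labels) if labels else None,
            }
        )
    return rows


# Hypothetical SBOM for a single image build.
sample = json.dumps(
    {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "components": [
            {
                "type": "library",
                "name": "log4j-core",
                "version": "2.14.1",
                "licenses": [{"license": {"id": "Apache-2.0"}}],
            },
            {"type": "library", "name": "internal-lib", "version": "0.22.0"},
        ],
    }
)
rows = flatten_cyclonedx(sample, "checkout", "2023-01-10T09:00:00Z")
for row in rows:
    print(row)
```

Because every row carries the deploy timestamp, appending the output of each build gives the per-deploy snapshots that the time-series questions rely on.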
Canonical wiki instance (Zalando 2023-04-12)
Zalando publishes "a curated data set containing dependency data from the SBOM for every application we deploy, based on its Container image. The data set is available in our data lake and thus can be easily queried and visualized by any engineer" (Source: sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game).
The pattern's concrete wins at Zalando:
- Log4Shell-class mass patch — query for affected apps, auto-generate change-sets per build-tool type, open PRs at fleet scale, centrally track progress (patterns/vulnerability-fleet-sweep-via-sbom-query).
- Akka license-change footprint assessment — a single query across the corpus.
- AWS SDK bloat audit — cross-app pattern-detection of apps importing the full SDK vs individual modules (patterns/sbom-driven-dependency-bloat-audit).
- Internal library adoption curves — three quarterly snapshots show version-0.21.0 stuck while 0.22.0+ exhibits healthy sawtooth adoption; hypothesis that template-project drift is the cause.
- Per-language dependency-count percentile plots — descriptive statistics over the corpus (concepts/dependency-count-by-language-ecosystem).
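As an illustration of the adoption-curve analysis in the list above, the sketch below counts applications per library version in each snapshot and flags applications that never moved off a given version. The data, the quarter labels, and the function names are invented for the example; they are not taken from Zalando's dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (snapshot_quarter, application, library_version).
observations = [
    ("2022-Q3", "app-a", "0.21.0"),
    ("2022-Q3", "app-b", "0.21.0"),
    ("2022-Q3", "app-c", "0.22.0"),
    ("2022-Q4", "app-a", "0.21.0"),
    ("2022-Q4", "app-b", "0.22.0"),
    ("2022-Q4", "app-c", "0.23.0"),
    ("2023-Q1", "app-a", "0.21.0"),
    ("2023-Q1", "app-b", "0.23.0"),
    ("2023-Q1", "app-c", "0.24.0"),
]


def version_distribution(rows):
    """Count applications per library version within each snapshot."""
    per_snapshot = defaultdict(Counter)
    for snapshot, _application, version in rows:
        per_snapshot[snapshot][version] += 1
    return {snap: dict(counts) for snap, counts in sorted(per_snapshot.items())}


def apps_never_upgraded(rows, version):
    """Applications that report only `version` across all snapshots."""
    seen = defaultdict(set)
    for _snapshot, application, observed in rows:
        seen[application].add(observed)
    return sorted(app for app, versions in seen.items() if versions == {version})


distribution = version_distribution(observations)
stuck = apps_never_upgraded(observations, "0.21.0")
print(distribution)
print(stuck)
```

The second helper is one way to turn a "stuck version" observation into a list of applications to investigate, for example to check whether they were generated from an outdated template project.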
Why the data-lake shape matters
- SQL is the lingua franca. Engineers who'd never write `syft sbom.json | jq ...` will write `SELECT app, version FROM sboms WHERE component = 'log4j-core' AND version LIKE '2.%'`.
- Cross-cutting questions become cheap. "Which teams own apps that use deprecated library X?" joins SBOM rows with a team-ownership table. "Which licences do our apps link?" is `SELECT DISTINCT license FROM sboms`.
- Time-series analytics come for free. Append-only per deploy means you can ask "when did we stop shipping log4j 2.14?" without special instrumentation.
- Visualisation is trivial. Any BI tool that speaks the query engine (QuickSight, Tableau, Looker, Metabase) gets dependency dashboards with a few clicks.
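The cross-cutting and time-series questions above can be sketched with two queries. As before, `sqlite3` is only a stand-in for the lake's SQL engine, and the `sboms` and `app_owners` tables, the team names, and the sample deploys are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sboms (
        application TEXT, deploy_timestamp TEXT,
        component_name TEXT, component_version TEXT, license TEXT
    );
    CREATE TABLE app_owners (application TEXT PRIMARY KEY, team TEXT);
    """
)
conn.executemany(
    "INSERT INTO sboms VALUES (?, ?, ?, ?, ?)",
    [
        ("checkout", "2021-12-01T10:00:00Z", "log4j-core", "2.14.1", "Apache-2.0"),
        ("checkout", "2021-12-20T10:00:00Z", "log4j-core", "2.17.0", "Apache-2.0"),
        ("search", "2021-12-02T10:00:00Z", "log4j-core", "2.14.1", "Apache-2.0"),
        ("search", "2021-12-21T10:00:00Z", "log4j-core", "2.14.1", "Apache-2.0"),
        ("catalog", "2021-12-05T10:00:00Z", "akka-actor", "2.6.18", "Apache-2.0"),
    ],
)
conn.executemany(
    "INSERT INTO app_owners VALUES (?, ?)",
    [
        ("checkout", "team-payments"),
        ("search", "team-discovery"),
        ("catalog", "team-content"),
    ],
)

# Teams whose most recent deploy still ships the given component version.
STILL_SHIPPING = """
    SELECT o.team, s.application
    FROM sboms AS s
    JOIN app_owners AS o ON o.application = s.application
    WHERE s.component_name = ? AND s.component_version = ?
      AND s.deploy_timestamp = (
          SELECT MAX(deploy_timestamp) FROM sboms WHERE application = s.application
      )
    ORDER BY o.team
"""
still_shipping = conn.execute(STILL_SHIPPING, ("log4j-core", "2.14.1")).fetchall()

# Last deploy per application that still contained that version.
LAST_SEEN = """
    SELECT application, MAX(deploy_timestamp)
    FROM sboms
    WHERE component_name = ? AND component_version = ?
    GROUP BY application
    ORDER BY application
"""
last_seen = conn.execute(LAST_SEEN, ("log4j-core", "2.14.1")).fetchall()

print(still_shipping)
print(last_seen)
```

The first query relies on the ownership join to route follow-ups to teams; the second relies only on the append-only snapshots, which is why no extra instrumentation is needed to answer "when did we stop shipping X?".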
Adjacent anti-patterns
- SBOM in the deploy artifact only. Ships an SBOM next to each binary / container, doesn't aggregate. Answers per-app questions but no fleet question. Still requires scripting to do anything cross-cutting.
- Per-repo dependency scanner with no aggregation. GitHub's Dependabot alerts show per-repo CVEs; aggregating across a thousand repos into "which repos use X" is a manual GraphQL API + scripting job.
- Compliance-motivated SBOM generation only. Generate once per release, hand it to security or legal, never analyse at scale. Misses every operational use.
- Per-deploy SBOM stored without a schema. Dump the syft-json into S3, call it done. No query engine, no SQL, no adoption. The engineering lever is the data-lake table schema, not the raw JSON.
Integration with per-repo tooling
The SBOM-as-data-lake pattern doesn't replace per-repo dependency-update discipline — it complements it:
- Per-repo: systems/dependabot / systems/scala-steward open PRs on the specific deps a single repo can update. Tactical.
- Fleet-wide: SBOM corpus answers "which apps need the update?" and enables automated fleet-wide PR fan-out. Strategic.
Both layers are real; they compose. The SBOM corpus tells you which repos to ask dependabot to update; dependabot does the per-repo update mechanic.
Infrastructure primitives this pattern assumes
- A data lake (S3 + Iceberg, GCS + BigQuery, Delta Lake, etc.) for append-only analytic storage.
- A SQL query engine (Athena, Presto, Trino, BigQuery, Snowflake).
- A CI/CD integration that runs the SBOM generator on every build and ships the output to the ingestion path.
- An application → team ownership table to route notifications (Zalando uses Docker image metadata).
Seen in
- sources/2023-04-12-zalando-how-software-bill-of-materials-change-the-dependency-game — canonical wiki instance. Zalando's full stack: container SBOM generation + data-lake publication + SQL query access + visualisation for log4j / Akka / AWS-SDK use-cases.
Related
- concepts/sbom-software-bill-of-materials — the substrate.
- concepts/container-extracted-sbom — the generation-locus choice that makes fleet uniformity tractable.
- patterns/vulnerability-fleet-sweep-via-sbom-query — the headline use-case the data-lake shape enables.
- patterns/sbom-driven-dependency-bloat-audit — the dependency-footprint-analysis use-case.
- patterns/dependency-update-discipline — the per-repo tactical layer this pattern composes with.