CONCEPT Cited by 1 source
Config-as-code pipeline¶
Definition¶
A config-as-code pipeline is a data or infrastructure pipeline whose behaviour is driven primarily by version-controlled configuration files (Python / YAML / Hack / Starlark / …) alongside the runtime code that interprets them. Adding a capability — say, a new data field — requires synchronised edits across multiple subsystems because the config is the composition substrate.
Meta's 2026-04-06 post names the canonical shape:
"Our pipeline is config-as-code: Python configurations, C++ services, and Hack automation scripts working together across multiple repositories. A single data field onboarding touches configuration registries, routing logic, DAG composition, validation rules, C++ code generation, and automation scripts — six subsystems that must stay in sync." (Source: sources/2026-04-06-meta-how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines.)
Meta's specific instance: 4 repositories, 3 languages, 4,100+ files, 6 subsystems that any single data-field change must update coherently.
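A toy illustration of the shape (all names here are hypothetical, not Meta's actual registries): the same logical data field must be declared consistently across several independent config surfaces, and nothing at "compile" time enforces that they stay in sync.

```python
# Hypothetical miniature of a config-as-code surface: one logical data
# field must appear, consistently, in several independent registries.
FIELD_REGISTRY = {
    "user_age": {"type": "int", "nullable": False},
}

ROUTING_RULES = {
    "user_age": "demographics_topic",
}

VALIDATION_RULES = {
    "user_age": lambda v: isinstance(v, int) and 0 <= v < 150,
}

def check_sync():
    # The cross-subsystem invariant: every registered field needs a
    # route and a validator. Forgetting one surface is a silent drift,
    # not a syntax error.
    return [f for f in FIELD_REGISTRY
            if f not in ROUTING_RULES or f not in VALIDATION_RULES]

print(check_sync())  # → []
```

Onboarding a new field means touching every one of these dictionaries (and, in the real pipeline, their C++ and Hack counterparts) in one coherent change.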
Properties that matter¶
- Cross-repo coupling. A logical unit of change spans multiple repos. Monorepo proximity doesn't automatically solve this — the subsystems are still conceptually separate.
- Cross-language coupling. Config files in one language (Python), services in another (C++), automation in a third (Hack) — an agent modifying one language must understand constraints in all three.
- Invisible invariants. Serialisation compatibility, intermediate field-name renames, append-only enum spaces — none of these live in compile-time type systems.
- Silent failure modes dominate. Wrong field name → code-gen passes, wrong output at runtime. Removed enum value → compiles, crashes in production on old serialised payloads. Tests don't catch these without deliberate fixtures.
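The enum failure mode above can be pinned with a deliberate fixture. A minimal sketch (hypothetical enum and wire values): removing a value type-checks and compiles fine, but old serialised payloads still carry the deleted wire value and crash at decode time.

```python
from enum import Enum

# Hypothetical append-only enum: values that have ever been written to a
# serialised payload may never be removed, only deprecated in place.
class EventType(Enum):
    CLICK = 1
    VIEW = 2
    LEGACY_TAP = 3  # deprecated, but old payloads still carry 3

def decode(payload: dict) -> EventType:
    # If someone deletes LEGACY_TAP, EventType(3) raises ValueError on
    # every old payload — yet nothing fails before production.
    return EventType(payload["event_type"])

# The deliberate fixture: every wire value ever emitted must still
# round-trip through the current enum definition.
HISTORICAL_WIRE_VALUES = [1, 2, 3]

def test_append_only() -> None:
    for value in HISTORICAL_WIRE_VALUES:
        EventType(value)  # raises if a historical value was removed
```

Without a fixture like `HISTORICAL_WIRE_VALUES`, no test exercises the deleted value, which is exactly why these failures stay silent.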
The tribal-knowledge density problem¶
Config-as-code pipelines accumulate tribal knowledge faster than any other workload class this wiki has catalogued:
- Codebase-size scaling — tribal knowledge grows roughly linearly with file count. Meta's ~50 tribal patterns over 4,100 files are representative.
- Cross-subsystem scaling — each new subsystem multiplies the pairwise-invariant surface. Meta's 6 subsystems force 15 pairwise-compatibility relationships (C(6, 2) = 15).
- Deprecation scaling — append-only identifier rules accumulate over time; deprecated enum values never truly leave. The "serialisation compatibility" graveyard grows monotonically.
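The cross-subsystem scaling claim is just the unordered-pair count — each pair of subsystems can carry its own compatibility invariant:

```python
from math import comb

# Six subsystems from the Meta example; every unordered pair is a
# potential compatibility relationship an agent must respect.
subsystems = ["config registry", "routing", "DAG composition",
              "validation", "C++ codegen", "automation"]
print(comb(len(subsystems), 2))  # → 15
```

Adding a seventh subsystem would push this to 21, which is why the invariant surface grows faster than the subsystem count.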
Why AI agents struggle here specifically¶
Meta's explicit finding: AI agents pointed at a config-as-code pipeline "would guess, explore, guess again and often produce code that compiled but was subtly wrong." The generic-code-assistant failure decomposes into three modes:
- Overconfidence — the code compiles, so the agent believes it.
- No map — cross-subsystem invariants aren't in any one file.
- Pretraining redundancy fails — these are proprietary codebases; the model has never seen them.
The pretraining-overlap asymmetry that makes context files hurt on Django / matplotlib (2025 academic research) inverts on config-as-code pipelines: the knowledge the agent needs is precisely what is not in pretraining.
Canonical architectural response¶
Meta AI Pre-Compute Engine — a swarm of 50+ agents produces a 59-file compass-not-encyclopedia knowledge layer, a cross-repo dependency index ("what depends on X?" answered in ~200 tokens vs ~6,000 tokens of ad-hoc exploration), and a data-flow map.
The pattern (see patterns/precomputed-agent-context-files): extract the cross-subsystem invariants once, offline; consume them many times, online.
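A minimal sketch of the offline-index / online-query split (file names and the faked edge list are hypothetical; Meta's actual engine extracts these edges from real configs):

```python
from collections import defaultdict

# Offline, once: walk config files and record forward dependencies.
# Here the extraction step is faked with a literal edge list.
FORWARD_DEPS = {
    "routing_rules.py": ["field_registry.py"],
    "dag_composer.py": ["field_registry.py", "routing_rules.py"],
    "codegen.cpp": ["field_registry.py"],
}

def build_reverse_index(forward):
    # Invert the edges so "what depends on X?" is a single lookup.
    reverse = defaultdict(list)
    for src, targets in forward.items():
        for target in targets:
            reverse[target].append(src)
    return dict(reverse)

INDEX = build_reverse_index(FORWARD_DEPS)

# Online, many times: one dictionary lookup an agent can consume in a
# few hundred tokens instead of re-exploring the repos per task.
def dependents_of(path):
    return sorted(INDEX.get(path, []))

print(dependents_of("field_registry.py"))
# → ['codegen.cpp', 'dag_composer.py', 'routing_rules.py']
```

The design point is amortisation: the expensive extraction runs once offline, and every subsequent agent task pays only the cheap lookup.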
Contrast with monorepo-at-scale¶
Config-as-code pipelines are not the same problem as monorepo-at-scale:
- Monorepo-at-scale is about file count, commit velocity, build graph, index size. Response: Sapling + Glean — storage + indexing primitives.
- Config-as-code pipeline is about cross-subsystem semantic invariants. Response: the precompute engine + compass-shaped context files — semantic primitives.
Meta has both problems and has built separate infrastructure for each.
Seen in¶
- Meta Data Platform pipeline (2026-04-06) — canonical wiki instance. Python configs + C++ services + Hack automation scripts over 4 repos / 4,100+ files / 6 synchronised subsystems. Single data-field onboarding touches configuration registries, routing logic, DAG composition, validation rules, C++ code generation, and automation scripts. (Source: sources/2026-04-06-meta-how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines.)
Related¶
- concepts/tribal-knowledge — the category that dominates in these pipelines
- concepts/monorepo — a different scale axis; the two can co-occur, and often do
- systems/meta-ai-precompute-engine — Meta's canonical architectural response
- patterns/precomputed-agent-context-files — the pattern the response instantiates