

Config-as-code pipeline

Definition

A config-as-code pipeline is a data or infrastructure pipeline whose behaviour is driven primarily by version-controlled configuration files (Python, YAML, Hack, Starlark, …) alongside the runtime code that interprets them. Adding a capability, such as a new data field, requires synchronised edits across multiple subsystems because the config is the composition substrate.
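A minimal sketch of why the config is the composition substrate. All names here (`FIELD_REGISTRY`, `ROUTING`, `validate`) are hypothetical illustrations, not Meta's actual structures: the point is that one logical change (a new field) must land in several config structures at once.

```python
# Hypothetical config-as-code fragment: three subsystems coupled
# through shared field names.

# Subsystem 1: configuration registry
FIELD_REGISTRY = {
    "user_id": {"type": "int64", "source": "events"},
    "session_ts": {"type": "timestamp", "source": "events"},
}

# Subsystem 2: routing logic keyed on the same names
ROUTING = {"events": ["user_id", "session_ts"]}

# Subsystem 3: validation rules that must agree with the registry
def validate(record: dict) -> bool:
    """A record is valid only if every registered field is present
    with a non-None value."""
    return all(
        name in record and record[name] is not None
        for name in FIELD_REGISTRY
    )

# Onboarding a new field ("device_os", say) requires synchronised
# edits to all three structures; editing only one yields a pipeline
# that still loads cleanly but drops or rejects data at runtime.
```

Note that nothing type-checks the agreement between the three structures; that agreement is exactly the invisible invariant discussed below.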

Meta's 2026-04-06 post names the canonical shape:

"Our pipeline is config-as-code: Python configurations, C++ services, and Hack automation scripts working together across multiple repositories. A single data field onboarding touches configuration registries, routing logic, DAG composition, validation rules, C++ code generation, and automation scripts — six subsystems that must stay in sync." (Source: sources/2026-04-06-meta-how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines.)

Meta's specific instance: 4 repositories, 3 languages, 4,100+ files, 6 subsystems that any single data-field change must update coherently.

Properties that matter

  • Cross-repo coupling. A logical unit of change spans multiple repos. Monorepo proximity doesn't automatically solve this — the subsystems are still conceptually separate.
  • Cross-language coupling. Config files in one language (Python), services in another (C++), automation in a third (Hack) — an agent modifying one language must understand constraints in all three.
  • Invisible invariants. Serialisation compatibility, intermediate field-name renames, append-only enum spaces — none of these live in compile-time type systems.
  • Silent failure modes dominate. Wrong field name → code-gen passes, wrong output at runtime. Removed enum value → compiles, crashes in production on old serialised payloads. Tests don't catch these without deliberate fixtures.
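The two silent failure modes above can be reproduced in a few lines. This is an illustrative sketch (the names `SCHEMA`, `Status`, `serialize` are hypothetical), not Meta's code:

```python
from enum import Enum

# 1. Wrong field name: a .get() lookup never raises, so a typo
#    passes every static check and silently produces wrong output.
SCHEMA = {"click_count": "int64"}

def serialize(record: dict, field: str):
    return record.get(field)  # typo-tolerant: no error, wrong value

record = {"click_count": 7}
correct = serialize(record, "click_count")    # 7
subtle_bug = serialize(record, "clickcount")  # None, nothing raised

# 2. Removed enum value: the module still imports, but old
#    serialised payloads carrying the retired value crash on decode.
class Status(Enum):
    ACTIVE = 1
    # DELETED = 2  <- removed; value 2 still exists in old payloads

def deserialize_status(raw: int) -> Status:
    return Status(raw)  # raises ValueError on a payload holding 2
```

A test suite without a deliberate fixture for the misspelled field, or for an archived payload containing the retired value, passes on both.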

The tribal-knowledge density problem

Config-as-code pipelines accumulate tribal knowledge faster than any other workload class this wiki has catalogued:

  • Codebase-size scaling — tribal knowledge grows roughly with file count. Meta's ~50 tribal patterns across 4,100+ files are representative.
  • Cross-subsystem scaling — each new subsystem multiplies the pairwise-invariant surface: n subsystems imply n(n−1)/2 pairwise relationships, so Meta's 6 subsystems force 15 pairwise-compatibility relationships.
  • Deprecation scaling — append-only identifier rules accumulate over time; deprecated enum values never truly leave. The "serialisation compatibility" graveyard grows monotonically.
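The cross-subsystem scaling claim, in numbers. A small helper (hypothetical name) makes the quadratic growth concrete:

```python
from math import comb

def pairwise_invariants(n_subsystems: int) -> int:
    """Number of pairwise-compatibility relationships among
    n subsystems: C(n, 2) = n * (n - 1) / 2."""
    return comb(n_subsystems, 2)

# 6 subsystems -> 15 pairwise relationships; a 7th subsystem
# adds 6 more in one step.
assert pairwise_invariants(6) == 15
assert pairwise_invariants(7) == 21
```

This is why adding a subsystem is disproportionately expensive: the invariant surface grows quadratically while the code grows linearly.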

Why AI agents struggle here specifically

Meta's explicit finding: AI agents pointed at a config-as-code pipeline "would guess, explore, guess again and often produce code that compiled but was subtly wrong." The generic-code-assistant failure mode is:

  1. Overconfidence — the code compiles, so the agent believes it.
  2. No map — cross-subsystem invariants aren't in any one file.
  3. Pretraining redundancy fails — these are proprietary codebases; the model has never seen them.

The pretraining-overlap asymmetry that makes context files hurt on Django / matplotlib (2025 academic research) inverts on config-as-code pipelines: the knowledge the agent needs is precisely what is not in pretraining.

Canonical architectural response

Meta AI Pre-Compute Engine — a 50+-agent swarm that produces three artifacts: a 59-file compass-not-encyclopedia knowledge layer; a cross-repo dependency index ("what depends on X?" answered in ~200 tokens vs ~6,000 tokens of ad-hoc exploration); and a data-flow map.

The pattern (see patterns/precomputed-agent-context-files): extract the cross-subsystem invariants once, offline; consume them many times, online.
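A minimal sketch of the extract-once, consume-many shape. All names (`build_index`, `what_depends_on`, the edge paths) are hypothetical illustrations, not Meta's engine:

```python
# --- Offline, once: scan configs, codegen templates, and automation
# scripts; record (dependent, dependency) edges; build a reverse index.
def build_index(dep_edges):
    index = {}
    for dependent, dependency in dep_edges:
        index.setdefault(dependency, []).append(dependent)
    return index

# --- Online, many times: an agent answers "what depends on X?" from
# the index instead of exploring the repos ad hoc.
def what_depends_on(index, name):
    return index.get(name, [])

edges = [
    ("routing/table.py", "fields/user_id"),
    ("validation/rules.py", "fields/user_id"),
    ("codegen/schema.cc", "fields/user_id"),
]
index = build_index(edges)
print(what_depends_on(index, "fields/user_id"))  # three dependents
```

The token asymmetry in the source (~200 vs ~6,000) falls out of this shape: the online step is a dictionary lookup whose answer fits in a short list, while the offline step absorbs the exploration cost once.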

Contrast with monorepo-at-scale

Config-as-code pipelines are not the same problem as monorepo-at-scale:

  • Monorepo-at-scale is about file count, commit velocity, build graph, index size. Response: Sapling + Glean — storage + indexing primitives.
  • Config-as-code pipeline is about cross-subsystem semantic invariants. Response: the precompute engine + compass-shaped context files — semantic primitives.

Meta has both problems and has built separate infrastructure for each.
