Tribal knowledge¶
Definition¶
Tribal knowledge is the set of undocumented domain-specific conventions, invariants, and failure modes that live only in engineers' heads and occasional code comments — never in formal documentation. Classic examples: "we always use field name X in mode A but Y in mode B, and the compiler won't catch it if you mix them"; "never remove a deprecated enum value, it breaks serialisation compatibility"; "the tool assumes the config file is sorted, nobody ever wrote that down."
The term originates in organisational theory but has taken on sharp operational meaning with the arrival of AI coding agents: an LLM has access to the code, the public docs, and the pretraining data, but not to the tribal knowledge — which is precisely the knowledge it needs to modify the code correctly.
Why it matters now¶
Before AI agents, tribal knowledge was a team-scale problem, handled by onboarding, pairing, PR review, and code comments. At small scale this was tolerable. At Meta's scale, the failure mode is explicit: a config-as-code pipeline spanning four repos, three languages, and 4,100+ files accumulates conventions faster than the tribe that carries them can document them.
When AI agents are pointed at code whose tribal knowledge is missing, Meta observes three failure modes (Source: sources/2026-04-06-meta-how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines):
- Silent wrong output — "two configuration modes use different field names for the same operation (swap them and you get silent wrong output)"
- Silent build / code-gen failure — "one pipeline stage outputs a temporary field name that a downstream stage renames (reference the wrong one and code generation silently fails)"
- Backward-compatibility break — "dozens of 'deprecated' enum values must never be removed because serialization compatibility depends on them"
Without tribal knowledge, agents "would guess, explore, guess again and often produce code that compiled but was subtly wrong", burning 15-25 exploratory tool calls per task in the process.
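The first failure mode is the easiest to sketch. In this minimal, invented example (the mode names `batch`/`streaming` and the fields `input_table`/`source_table` are hypothetical, not from Meta's pipeline), two config modes read different field names for the same concept, and an unknown key is silently ignored rather than rejected:

```python
def build_query(mode: str, config: dict) -> str:
    """Build a query from a config dict whose schema depends on the mode."""
    if mode == "batch":
        # Batch mode reads "input_table"; any other key is silently ignored.
        table = config.get("input_table", "default_table")
    else:
        # Streaming mode reads "source_table" for the very same concept.
        table = config.get("source_table", "default_table")
    return f"SELECT * FROM {table}"

# An engineer (or agent) uses the batch field name in streaming mode:
query = build_query("streaming", {"input_table": "events"})
print(query)  # SELECT * FROM default_table -- no error, wrong table
```

Nothing fails to compile and nothing raises; the output is simply wrong, which is exactly why this convention has to live somewhere an agent can read it.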
Where it lives¶
Meta's five-questions framework pairs each question with the place its answer lives:
- What does this module configure? — surface docs / module comments.
- What are the common modification patterns? — recent commit history + team memory.
- What are the non-obvious patterns that cause build failures? — pure tribal knowledge, only in code comments or incident postmortems.
- What are the cross-module dependencies? — partially in build files, partially tribal.
- What tribal knowledge is buried in code comments? — the deepest layer, the mechanism Meta names explicitly: "Question five was where the deepest learnings emerged. We found 50+ non-obvious patterns … None of this had been written down before."
The extraction architecture¶
The canonical architectural response on the wiki: Meta's systems/meta-ai-precompute-engine runs a single-session orchestration of 11 module-analyst agents that ask the five questions of every module and produce 59 compass-not-encyclopedia context files, with Non-Obvious Patterns as a mandated section in every file.
This is a two-stage trade:
- Stage 1 (extraction) — AI reads comments + code + history to make the tribal knowledge explicit.
- Stage 2 (consumption) — AI agents consume the explicit knowledge at request time, avoiding re-derivation.
The extraction cost is paid once; the payoff is collected on every request.
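The shape of the trade can be sketched in a few lines. Everything here is a stand-in: `extract_tribal_knowledge` stubs the expensive stage-1 analyst pass, and `lru_cache` stands in for the on-disk context-file store that makes stage 2 a cheap lookup:

```python
from functools import lru_cache

def extract_tribal_knowledge(module: str) -> str:
    """Stage 1 (expensive, runs once): stub for an AI analyst reading
    comments, code, and history to make the tribal knowledge explicit."""
    return f"non-obvious patterns for {module}"

@lru_cache(maxsize=None)
def context_for(module: str) -> str:
    """Pay the extraction cost on first use; reuse the result thereafter."""
    return extract_tribal_knowledge(module)

def handle_agent_request(module: str, task: str) -> str:
    # Stage 2 (cheap, runs every request): the agent consumes explicit
    # knowledge instead of re-deriving it with exploratory tool calls.
    return f"{task} using [{context_for(module)}]"
```

After the first request for a module, every subsequent request hits the cache, which is the "pays once / collected every request" economics in miniature.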
Tribal knowledge vs regular documentation¶
| Axis | Regular docs | Tribal knowledge |
|---|---|---|
| Location | README / wiki / comments | Heads, incident postmortems, scattered comments |
| Format | Human-paced prose | "If you do X, Y silently breaks" — failure-oriented |
| Audience | New engineers | Agents + new engineers on unfamiliar modules |
| Freshness | Rare updates | Evolves with every refactor |
| Failure mode if missing | Slow onboarding | Silent wrong output |
Meta's explicit finding¶
Academic research on Django / matplotlib found AI-generated context files decreased agent success. Meta's counter: "It was evaluated on codebases like Django and matplotlib that models already 'know' from pretraining. In that scenario, context files are redundant noise. Our codebase is the opposite: proprietary config-as-code with tribal knowledge that exists nowhere in any model's training data."
The pretraining-overlap asymmetry is the variable that distinguishes corpora where context files help from those where they hurt. For codebases that pretrained LLMs have seen, explicit context adds noise. For proprietary codebases with heavy tribal knowledge, explicit context is the difference between silent-wrong and correct.
Seen in¶
- Meta Data Platform pre-compute engine (2026-04-06), the canonical wiki instance. A swarm of 50+ agents extracts 50+ non-obvious patterns from a config-as-code data pipeline (4 repos / 3 languages / 4,100+ files). Results: AI context coverage ~5% → 100%, 40% fewer tool calls per task, zero hallucinated file paths, critic quality score 3.65 → 4.20 / 5.0 across 3 rounds. (Source: sources/2026-04-06-meta-how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines.)
Related¶
- concepts/compass-not-encyclopedia — the format the extracted knowledge lives in
- concepts/config-as-code-pipeline — the workload class with the highest tribal-knowledge density
- concepts/context-engineering — the parent discipline; tribal-knowledge extraction is the offline-preloading variant
- concepts/context-file-freshness — tribal knowledge evolves; extracted context goes stale
- patterns/precomputed-agent-context-files — the canonical architectural pattern
- patterns/five-questions-knowledge-extraction — the per-module methodology
- systems/meta-ai-precompute-engine — the system that extracts and maintains it