CONCEPT
LoRA (Low-Rank Adaptation)¶
Definition¶
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that freezes the weights of a pre-trained base model and trains only a small number of new "adapter" parameters, arranged as low-rank decompositions injected into selected linear layers (typically attention projections).
For a base linear layer y = W · x with W ∈ ℝ^{d×k}, LoRA adds a learned low-rank delta:

y = W · x + B · A · x, i.e. ΔW = B · A,
where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k) — the rank. For a typical LLM hidden size of 4096 and LoRA rank of 8, the adapter adds (8·4096 + 4096·8) = 65,536 parameters per targeted layer — orders of magnitude smaller than the full 4096·4096 ≈ 16.7M of the original weight.
During training: only A and B receive gradients; W is frozen. During serving: either run the adapter as a separate delta path (adapter-at-inference) or fold B·A back into W at deploy time — see concepts/adapter-merging.
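A minimal numpy sketch of the mechanics above — the shapes and the rank-8, hidden-size-4096 numbers come from the text; zero-initialising B (so the delta starts at zero) follows the convention of the original paper:

```python
import numpy as np

d, k, r = 4096, 4096, 8  # hidden sizes and LoRA rank from the example above
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k)) * 0.01   # frozen base weight (no gradients)
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r; zero-init so ΔW = B·A starts at 0

def lora_forward(x):
    # y = W x + B (A x): the delta is applied factor-by-factor,
    # so no d x k matrix is ever materialised for the adapter.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
y = lora_forward(x)

trainable = A.size + B.size
print(trainable)  # 65,536 adapter params vs 16,777,216 in W
```

With B zero-initialised, the adapted layer starts out exactly equal to the frozen base layer, which is why training can begin from the pre-trained model's behaviour.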
Original paper: Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021).
Why LoRA instead of full fine-tuning¶
- Training cost. Full 8B fine-tuning requires storing full optimiser state + gradients for all 8B parameters. LoRA with rank 8 targeting 4 attention matrices per layer reduces trainable parameters by 3-4 orders of magnitude.
- Storage cost per fine-tune. A full fine-tuned 8B checkpoint is ~16 GB; a LoRA delta is typically ~10-100 MB. Teams can ship many task-specific LoRAs for the same base.
- Resistance to catastrophic forgetting. The frozen base preserves original capabilities; adapter-only training is hard-capped in how much it can drift.
- Adapter merging. Post-training, B·A can be added into W so serving has zero inference overhead vs. the base model — see concepts/adapter-merging.
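The merging step in the last bullet is just a matrix addition; a toy numpy check with small illustrative shapes (not from the source) confirms the merged weight reproduces the adapter path exactly:

```python
import numpy as np

d, k, r = 64, 64, 4  # small made-up shapes for illustration
rng = np.random.default_rng(1)

W = rng.standard_normal((d, k))  # frozen base weight
A = rng.standard_normal((r, k))  # trained adapter factors
B = rng.standard_normal((d, r))

# One-time fold at deploy time: serving then uses a plain linear layer.
W_merged = W + B @ A

x = rng.standard_normal(k)
assert np.allclose(W_merged @ x, W @ x + B @ (A @ x))
```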
Caveats¶
- Depth of domain adaptation is capped relative to full fine-tuning. If the domain demands shifting what the base model represents at the deepest layers, LoRA may not get there — in that case continued pretraining or full fine-tuning is the lever.
- Rank choice is a trade-off, not a tuning triviality. Too small and the adapter can't express the task; too large and the cost advantage over full fine-tuning evaporates.
- Multiple-adapter composition is fragile. Composing two independently-trained LoRAs doesn't generally give the composition of their behaviours.
- Target-module choice matters. Applying LoRA only to attention Q/V vs. all linear layers produces different quality/cost profiles.
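The rank trade-off in the caveats can be made concrete: per targeted d×k linear layer the adapter costs r·(d + k) trainable parameters, so cost grows linearly in r. A small illustrative calculation (the 4096 hidden size is from the text; the rank values are arbitrary):

```python
def lora_params(d: int, k: int, r: int) -> int:
    # A is r x k and B is d x r, so the adapter adds r*(k + d) parameters.
    return r * (d + k)

hidden = 4096
full = hidden * hidden  # dense-layer parameter count, for comparison

for r in (4, 8, 64, 512):
    p = lora_params(hidden, hidden, r)
    print(f"rank {r}: {p:,} trainable ({p / full:.2%} of the dense layer)")
```

At rank 8 the adapter is ~0.4% of the dense layer's parameters; by rank 512 it is already 25%, at which point the cost advantage over full fine-tuning has largely evaporated.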
Canonical wiki instance¶
Instacart's Intent Engine (2025-11-13) SRL system uses LoRA to fine-tune Llama-3-8B on training data generated by an offline RAG "teacher" pipeline. The production model:
- Base: Llama-3-8B.
- Fine-tuning technique: LoRA.
- Training data: high-quality curriculum dataset from the offline teacher pipeline.
- Deployment: LoRA adapters merged into base weights before serving (see concepts/adapter-merging) — removes any per-inference adapter overhead.
- Hardware: H100 (upgraded from A100 during latency optimization).
- Latency: ~300 ms target (from ~700 ms out-of-the-box on A100).
- Quality: precision 96.4%, recall 95.0%, F1 95.7% — near-parity with the much larger frontier teacher model.
The adapter's trainable parameters are a tiny fraction of the 8B total, yet the deployment reaches near-frontier precision (96.4%) at ~2% of the frontier model's serving cost — the main economic win of LoRA-based student distillation. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
Related adaptation lever in the wiki corpus¶
- concepts/continued-pretraining — full base-weight continued training on new-domain data. More invasive than LoRA; eBay's e-Llama is the canonical instance. LoRA's adapter-only posture is an explicit lightweight alternative.
- concepts/knowledge-distillation — LoRA is often the mechanism by which a student model absorbs teacher-generated labels in a teacher-student deployment shape. Instacart's SRL combines both.
- concepts/catastrophic-forgetting — LoRA structurally limits this by freezing the base.
Seen in¶
- sources/2025-11-13-instacart-building-the-intent-engine — canonical reference; LoRA fine-tune of Llama-3-8B for production SRL at Instacart, adapter-merged + served on H100.
Related¶
- concepts/adapter-merging — post-training fold of the LoRA delta into base weights for zero-overhead serving
- concepts/knowledge-distillation — the distillation framing within which Instacart uses LoRA
- concepts/catastrophic-forgetting — LoRA's structural mitigation
- concepts/continued-pretraining — more-invasive alternative
- systems/llama-3-1 — base model family Instacart fine-tunes
- systems/instacart-intent-engine
- patterns/head-cache-plus-tail-finetuned-model / patterns/offline-teacher-online-student-distillation
- companies/instacart