
LoRA (Low-Rank Adaptation)

Definition

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that freezes the weights of a pre-trained base model and trains only a small number of new "adapter" parameters, arranged as low-rank decompositions injected into selected linear layers (typically attention projections).

For a base linear layer y = W · x with W ∈ ℝ^{d×k}, LoRA adds a learned delta:

y = W · x + (B · A) · x

where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k) is the rank (in the original paper the delta is additionally scaled by a factor α/r). For a typical LLM hidden size of 4096 and LoRA rank of 8, the adapter adds (8·4096 + 4096·8) = 65,536 parameters per targeted layer — roughly 256× smaller than the full 4096·4096 ≈ 16.8M parameters of the original weight.
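The delta path above can be sketched in a few lines of numpy — an illustrative sketch with the example dimensions, not a reference implementation:

```python
import numpy as np

# Dimensions from the example above: hidden size 4096, rank 8.
d, k, r = 4096, 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k)) * 0.02   # frozen base weight
A = rng.standard_normal((r, k)) * 0.02   # trainable low-rank factor
B = np.zeros((d, r))                     # trainable; zero-init so the
                                         # adapter starts as a no-op

x = rng.standard_normal(k)

# y = W x + (B A) x -- compute the delta as B @ (A @ x) to avoid ever
# materialising the d x k product B @ A during training.
y = W @ x + B @ (A @ x)

adapter_params = A.size + B.size         # 8*4096 + 4096*8 = 65,536
full_params = W.size                     # 4096*4096 = 16,777,216
```

Note the associativity trick: `B @ (A @ x)` costs two skinny matrix-vector products, whereas `(B @ A) @ x` would first build a full d×k matrix.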

During training: only A and B receive gradients; W is frozen. During serving: either run the adapter as a separate delta path (adapter-at-inference) or fold B·A back into W at deploy time — see concepts/adapter-merging.
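That the two serving options are numerically equivalent is easy to check — a minimal sketch with plain numpy arrays and small illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 64, 4          # small dims for illustration only

W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))
x = rng.standard_normal(k)

# Option 1: adapter-at-inference -- keep the delta path separate.
y_adapter = W @ x + B @ (A @ x)

# Option 2: fold B A into W at deploy time (adapter merging).
W_merged = W + B @ A
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)   # identical outputs
```

Merging trades per-request flexibility (hot-swapping adapters) for zero extra inference cost.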

Original paper: Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021).

Why LoRA instead of full fine-tuning

  1. Training cost. Full fine-tuning of an 8B model requires storing gradients plus optimiser state (e.g. Adam's two moment tensors) for all 8B parameters. LoRA with rank 8 targeting 4 attention matrices per layer cuts trainable parameters by roughly three orders of magnitude.
  2. Storage cost per fine-tune. A full fine-tuned 8B checkpoint is ~16 GB; a LoRA delta is typically ~10-100 MB. Teams can ship many task-specific LoRAs for the same base.
  3. Resistance to catastrophic forgetting. The frozen base preserves original capabilities; because only the low-rank delta is trained, the model is structurally limited in how far it can drift from the base.
  4. Adapter merging. Post-training, B·A can be added into W so serving has zero inference overhead vs. the base model — see concepts/adapter-merging.
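The arithmetic behind points 1-2 can be made concrete. A back-of-envelope sketch — the 32-layer count and bf16 storage (2 bytes/param) are illustrative assumptions, not from the source:

```python
# Back-of-envelope storage arithmetic. Assumed for illustration:
# 32 transformer layers, hidden size 4096, bf16 at 2 bytes/param.
hidden = 4096
rank = 8
layers = 32
targets = 4                                # e.g. Q, K, V, O projections

per_matrix = 2 * rank * hidden             # A (r x k) plus B (d x r)
trainable = per_matrix * targets * layers  # 8,388,608 trainable params

full_checkpoint_gb = 8e9 * 2 / 1e9         # 16.0 GB for the full 8B model
lora_delta_mb = trainable * 2 / 1e6        # ~16.8 MB for the LoRA delta

# Reduction in trainable parameters: 8e9 / 8.4e6 ≈ 950x (about three
# orders of magnitude); the shipped delta sits in the 10-100 MB range.
```

Under these assumptions the numbers line up with the claims above: a ~16 GB full checkpoint versus a delta of a few tens of megabytes.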

Caveats

  • Depth of domain adaptation is capped relative to full fine-tuning. If the domain demands shifting what the base model represents at the deepest layers, LoRA may not get there — in that case continued pretraining or full fine-tuning is the lever.
  • Rank choice is a trade-off, not a tuning triviality. Too small and the adapter can't express the task; too large and the cost advantage over full fine-tuning evaporates.
  • Multiple-adapter composition is fragile. Composing two independently-trained LoRAs doesn't generally give the composition of their behaviours.
  • Target-module choice matters. Applying LoRA only to attention Q/V vs. all linear layers produces different quality/cost profiles.
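The rank and target-module knobs from the caveats above appear directly in, for example, the Hugging Face PEFT configuration. A hedged sketch — the module names are Llama-style conventions and the hyperparameter values are illustrative assumptions, not recommendations from the source:

```python
from peft import LoraConfig  # pip install peft

# Narrow profile: attention Q/V projections only (the setting studied
# most in the original paper).
narrow = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wide profile: all linear layers, including the MLP projections --
# more trainable parameters, typically higher quality, higher cost.
wide = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, wide)  # base_model: a loaded HF model
```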

Canonical wiki instance

Instacart's Intent Engine (2025-11-13) describes an SRL system that uses LoRA to fine-tune Llama-3-8B on training data generated by an offline RAG "teacher" pipeline. The production model:

  • Base: Llama-3-8B.
  • Fine-tuning technique: LoRA.
  • Training data: high-quality curriculum dataset from the offline teacher pipeline.
  • Deployment: LoRA adapters merged into base weights before serving (see concepts/adapter-merging) — removes any per-inference adapter overhead.
  • Hardware: H100 (upgraded from A100 during latency optimization).
  • Latency: ~300 ms target (from ~700 ms out-of-the-box on A100).
  • Quality: precision 96.4%, recall 95.0%, F1 95.7% — near-parity with the much larger frontier teacher model.

The adapter's parameters are a tiny fraction of the 8B base, yet the deployment reaches 96.4% precision — near-parity with the frontier teacher — at roughly 2% of the teacher's serving cost. That is the main economic win of LoRA-based student distillation. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
