
LoRA (Low-Rank Adaptation)

Definition

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that freezes the weights of a pre-trained base model and trains only a small number of new "adapter" parameters, arranged as low-rank decompositions injected into selected linear layers (typically attention projections).

For a base linear layer y = W · x with W ∈ ℝ^{d×k}, LoRA adds a learned delta:

y = W · x + (B · A) · x

where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k) is the rank (in the original paper the delta is additionally scaled by a factor α/r). For a typical LLM hidden size of 4096 and LoRA rank of 8, the adapter adds (8·4096 + 4096·8) = 65,536 parameters per targeted layer — roughly 256× smaller than the full 4096·4096 ≈ 16.8M parameters of the original weight.
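The delta path above can be sketched in a few lines of numpy — an illustrative sketch with the example dimensions, not a reference implementation:

```python
import numpy as np

# Dimensions from the example above: hidden size 4096, rank 8.
d, k, r = 4096, 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k)) * 0.02   # frozen base weight
A = rng.standard_normal((r, k)) * 0.02   # trainable low-rank factor
B = np.zeros((d, r))                     # trainable; zero-init so the
                                         # adapter starts as a no-op

x = rng.standard_normal(k)

# y = W x + (B A) x -- compute the delta as B @ (A @ x) to avoid ever
# materialising the d x k product B @ A during training.
y = W @ x + B @ (A @ x)

adapter_params = A.size + B.size         # 8*4096 + 4096*8 = 65,536
full_params = W.size                     # 4096*4096 = 16,777,216
```

Note the associativity trick: `B @ (A @ x)` costs two skinny matrix-vector products, whereas `(B @ A) @ x` would first build a full d×k matrix.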

During training: only A and B receive gradients; W is frozen. During serving: either run the adapter as a separate delta path (adapter-at-inference) or fold B·A back into W at deploy time — see concepts/adapter-merging.
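That the two serving options are numerically equivalent is easy to check — a minimal sketch with plain numpy arrays and small illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 64, 4          # small dims for illustration only

W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))
x = rng.standard_normal(k)

# Option 1: adapter-at-inference -- keep the delta path separate.
y_adapter = W @ x + B @ (A @ x)

# Option 2: fold B A into W at deploy time (adapter merging).
W_merged = W + B @ A
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)   # identical outputs
```

Merging trades per-request flexibility (hot-swapping adapters) for zero extra inference cost.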

Original paper: Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021).

Why LoRA instead of full fine-tuning

  1. Training cost. Full fine-tuning of an 8B model requires storing gradients plus optimiser state (e.g. Adam's two moment tensors) for all 8B parameters. LoRA with rank 8 targeting 4 attention matrices per layer cuts trainable parameters by roughly three orders of magnitude.
  2. Storage cost per fine-tune. A full fine-tuned 8B checkpoint is ~16 GB; a LoRA delta is typically ~10-100 MB. Teams can ship many task-specific LoRAs for the same base.
  3. Resistance to catastrophic forgetting. The frozen base preserves original capabilities; because only the low-rank delta is trained, the model is structurally limited in how far it can drift from the base.
  4. Adapter merging. Post-training, B·A can be added into W so serving has zero inference overhead vs. the base model — see concepts/adapter-merging.
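The arithmetic behind points 1-2 can be made concrete. A back-of-envelope sketch — the 32-layer count and bf16 storage (2 bytes/param) are illustrative assumptions, not from the source:

```python
# Back-of-envelope storage arithmetic. Assumed for illustration:
# 32 transformer layers, hidden size 4096, bf16 at 2 bytes/param.
hidden = 4096
rank = 8
layers = 32
targets = 4                                # e.g. Q, K, V, O projections

per_matrix = 2 * rank * hidden             # A (r x k) plus B (d x r)
trainable = per_matrix * targets * layers  # 8,388,608 trainable params

full_checkpoint_gb = 8e9 * 2 / 1e9         # 16.0 GB for the full 8B model
lora_delta_mb = trainable * 2 / 1e6        # ~16.8 MB for the LoRA delta

# Reduction in trainable parameters: 8e9 / 8.4e6 ≈ 950x (about three
# orders of magnitude); the shipped delta sits in the 10-100 MB range.
```

Under these assumptions the numbers line up with the claims above: a ~16 GB full checkpoint versus a delta of a few tens of megabytes.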

Caveats

  • Depth of domain adaptation is capped relative to full fine-tuning. If the domain demands shifting what the base model represents at the deepest layers, LoRA may not get there — in that case continued pretraining or full fine-tuning is the lever.
  • Rank choice is a trade-off, not a tuning triviality. Too small and the adapter can't express the task; too large and the cost advantage over full fine-tuning evaporates.
  • Multiple-adapter composition is fragile. Composing two independently-trained LoRAs doesn't generally give the composition of their behaviours.
  • Target-module choice matters. Applying LoRA only to attention Q/V vs. all linear layers produces different quality/cost profiles.
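The rank and target-module knobs from the caveats above appear directly in, for example, the Hugging Face PEFT configuration. A hedged sketch — the module names are Llama-style conventions and the hyperparameter values are illustrative assumptions, not recommendations from the source:

```python
from peft import LoraConfig  # pip install peft

# Narrow profile: attention Q/V projections only (the setting studied
# most in the original paper).
narrow = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wide profile: all linear layers, including the MLP projections --
# more trainable parameters, typically higher quality, higher cost.
wide = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, wide)  # base_model: a loaded HF model
```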

Canonical wiki instance

Instacart's Intent Engine (2025-11-13) describes an SRL system that uses LoRA to fine-tune Llama-3-8B on training data generated by an offline RAG "teacher" pipeline. The production model:

  • Base: Llama-3-8B.
  • Fine-tuning technique: LoRA.
  • Training data: high-quality curriculum dataset from the offline teacher pipeline.
  • Deployment: LoRA adapters merged into base weights before serving (see concepts/adapter-merging) — removes any per-inference adapter overhead.
  • Hardware: H100 (upgraded from A100 during latency optimization).
  • Latency: ~300 ms target (from ~700 ms out-of-the-box on A100).
  • Quality: precision 96.4%, recall 95.0%, F1 95.7% — near-parity with the much larger frontier teacher model.

The adapter's parameters are a tiny fraction of the 8B base, yet the deployment reaches 96.4% precision — near-parity with the frontier teacher — at roughly 2% of the teacher's serving cost. That is the main economic win of LoRA-based student distillation. (Source: sources/2025-11-13-instacart-building-the-intent-engine)
