LoRA: Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Category: alignment · Updated: 2026-02-27

LoRA (Hu et al., 2021): a rank-4 decomposition ΔW = BA reduces trainable parameters to ~0.01% of the full model while matching full fine-tuning BLEU on E2E NLG; no inference latency is added after weight merging.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| LoRA trainable parameters (rank 4) | ~0.01% | of full model | Hu et al.: ~4.7M trainable vs 175B total for a GPT-3-scale model at rank 4 |
| Rank used in Hu et al. experiments | 4–8 | rank r | Ranks 4 and 8 match or exceed full fine-tuning; very small r suffices for most tasks |
| E2E NLG BLEU — LoRA vs full fine-tuning | 68.6 vs 68.2 | BLEU | LoRA (rank 4) slightly outperforms full fine-tuning on the E2E NLG benchmark (Hu et al., Table 4) |
| Memory reduction (LoRA vs full fine-tune) | — | GPU memory | No optimizer states for frozen weights; full fine-tuning stores Adam states for all params |
| QLoRA quantization | 4-bit NormalFloat | quantization | Dettmers et al.: 4-bit quantized base model + LoRA adapters; a 65B model fits on a single GPU |

LoRA (Low-Rank Adaptation) addresses the computational challenge of fine-tuning large pretrained models: full fine-tuning requires optimizer states, gradients, and weight copies for every parameter — scaling prohibitively with model size. LoRA reparameterizes weight updates as products of small matrices, reducing trainable parameters by orders of magnitude while retaining task performance.

The Core Reparameterization

For a pretrained weight matrix W₀ ∈ ℝ^{d×k}, full fine-tuning learns a dense update ΔW ∈ ℝ^{d×k}. LoRA instead constrains:

ΔW = B · A, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, rank r ≪ min(d, k)

The forward pass becomes:

h = W₀x + ΔWx = W₀x + BAx

  • W₀ is frozen (no gradient computed)
  • Only A and B are trained
  • A is initialized from N(0, σ²); B is initialized to zero (so ΔW = 0 at start)
  • Scaling factor α/r is applied (α is a hyperparameter, typically equal to r)
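The reparameterized forward pass can be sketched in a few lines. This is a toy pure-Python illustration with made-up small dimensions (real implementations operate on framework tensors at sizes like 4096 × 4096):

```python
import random

random.seed(0)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Toy dimensions; real attention projections use e.g. d = k = 4096.
d, k, r = 6, 6, 2
alpha = 2  # scaling hyperparameter, here set equal to r

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A  = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # Gaussian init
B  = [[0.0] * r for _ in range(d)]                                  # zero init

x = [random.gauss(0, 1) for _ in range(k)]

# h = W0 x + (alpha / r) * B (A x); the low-rank path costs two thin matvecs.
base = matvec(W0, x)
delta = matvec(B, matvec(A, x))
h = [b + (alpha / r) * p for b, p in zip(base, delta)]

# B is zero-initialized, so at step 0 the adapted output equals the base output.
assert all(abs(hi - bi) < 1e-12 for hi, bi in zip(h, base))
```

The zero initialization of B matters: training starts from exactly the pretrained model and moves away from it only as A and B are updated.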

Parameter Efficiency

For a weight matrix of size d=4096, k=4096 (typical attention projection in a large model):

| Method | Trainable params (per matrix) | vs full fine-tune |
|---|---|---|
| Full fine-tuning | 4096 × 4096 = 16.7M | 100% |
| LoRA rank 64 | (4096 + 4096) × 64 = 524K | 3.1% |
| LoRA rank 8 | (4096 + 4096) × 8 = 65.5K | 0.39% |
| LoRA rank 4 | (4096 + 4096) × 4 = 32.8K | 0.20% |
| LoRA rank 1 | (4096 + 4096) × 1 = 8.2K | 0.05% |

For a 175B parameter model, LoRA at rank 4 applied to attention Q/V matrices reduces trainable parameters from 175B to ~4.7M — a reduction of ~37,000×.
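The arithmetic is simple: each adapted d × k matrix contributes (d + k) · r trainable parameters (d · r for B plus r · k for A). A quick sanity check of the counts above:

```python
def lora_trainable_params(d, k, r):
    # B has d*r entries and A has r*k entries.
    return (d + k) * r

d = k = 4096
full = d * k  # dense update for one matrix: 16,777,216 params
for r in (1, 4, 8, 64):
    p = lora_trainable_params(d, k, r)
    print(f"rank {r:2}: {p:7,} params ({100 * p / full:.2f}% of full)")
```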

Benchmark Results (Hu et al., 2021)

| Method | E2E BLEU | WikiSQL Acc | SAMSum R-1 | Trainable params |
|---|---|---|---|---|
| Full fine-tune | 68.2 | 74.0% | 50.3 | 175B |
| Adapter (Houlsby) | 66.3 | 73.2% | 49.8 | +0.3% |
| Prefix tuning | 67.0 | 73.9% | 49.8 | +0.1% |
| LoRA (rank 4) | 68.6 | 73.8% | 50.8 | 0.01% |

LoRA matches or slightly exceeds full fine-tuning on all three benchmarks while using a fraction of the trainable parameters.

Rank Sensitivity Analysis

| Rank r | E2E BLEU | WikiSQL Acc | Behavior |
|---|---|---|---|
| 1 | 68.0 | 73.5% | Near-optimal; lowest cost |
| 2 | 68.4 | 73.7% | Marginal improvement |
| 4 | 68.6 | 73.8% | Sweet spot |
| 8 | 68.5 | 73.9% | Plateau |
| 64 | 68.5 | 74.0% | No benefit over r = 4 |

The empirical result that r=4 nearly saturates performance supports the low intrinsic dimensionality hypothesis of Aghajanyan et al. (2021).

QLoRA: Quantization + LoRA

Dettmers et al. (2023) combined LoRA with 4-bit quantization (NF4 — Normal Float 4, optimized for normally distributed weights):

| Method | GPU memory (65B model) | Performance vs 16-bit |
|---|---|---|
| 16-bit full fine-tune | ~780 GB (not feasible on ≤8 GPUs) | 100% |
| 16-bit LoRA | ~200 GB | ~99% |
| QLoRA (4-bit NF4 + LoRA) | ~48 GB (1× A100 80GB) | ~99% |

QLoRA makes it possible to fine-tune very large models on a single GPU, enabling instruction tuning and alignment fine-tuning at dramatically lower hardware cost.
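For intuition, a toy block-wise absmax quantizer is sketched below. This is not the NF4 scheme itself (NF4 instead maps weights to the quantiles of a normal distribution), but it illustrates the quantize/dequantize round trip QLoRA performs on the frozen base weights:

```python
def quantize_absmax(block, bits=4):
    """Toy block-wise absmax quantization (illustration only, not NF4)."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scale = max(abs(v) for v in block) / qmax
    if scale == 0.0:
        scale = 1.0                        # all-zero block
    return [round(v / scale) for v in block], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.61, -0.30, 0.05, -0.87]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

# Each 4-bit code lies in [-7, 7]; error is bounded by half a quantization step.
assert all(-7 <= c <= 7 for c in codes)
assert all(abs(w - r) <= scale / 2 + 1e-12 for w, r in zip(weights, restored))
```

In QLoRA only the dequantized base weights enter the forward pass; gradients flow solely into the 16-bit LoRA matrices.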

See fine-tuning for the general fine-tuning framework, instruction-tuning for the instruction-following fine-tuning paradigm LoRA is commonly applied to, and knowledge-distillation for an alternative approach to creating smaller, more efficient models.


Frequently Asked Questions

Why does low-rank adaptation work — doesn't the model need to change all its weights?

Aghajanyan et al. (2021) showed that fine-tuning has a low intrinsic dimensionality: models trained for downstream tasks converge to solutions that can be expressed as perturbations in a very low-dimensional subspace of weight space. LoRA exploits this by restricting weight updates to rank-r matrices. Even with r=1 or r=4, the model can capture the task-specific signal because the pretrained weights already encode most of the required knowledge; only a small directional update is needed.

How is LoRA merged for inference — does it add compute?

At inference time, LoRA weights are merged into the frozen weights: W' = W + BA (with the α/r scaling folded in). This requires computing the product BA and adding it to W once, before serving. After merging, the model has the same architecture and computational cost as the original — no adapter layers, no extra forward-pass branches, no latency penalty. This is a key advantage over adapter-style PEFT methods that leave extra modules in the computation graph.
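A toy check of the merge identity (pure Python, made-up small dimensions and a hypothetical α = 2): the merged weight reproduces the adapter forward pass exactly.

```python
import random

random.seed(1)

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]

d, k, r = 5, 5, 2   # toy sizes
s = 2 / r           # alpha / r scaling, with a hypothetical alpha = 2

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(k)]

# Serving with adapters attached: base matvec plus the low-rank branch.
h_adapter = [w + s * p for w, p in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merge once before serving: W' = W + s * (B A), then a single matvec.
W_merged = [[w + s * ba for w, ba in zip(wr, br)]
            for wr, br in zip(W, matmul(B, A))]
h_merged = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(h_adapter, h_merged))
```

Because merging is exact, adapters can also be subtracted back out (W = W' − sBA) to swap tasks on a shared base model.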

Which weight matrices should LoRA be applied to?

Hu et al. (2021) tested applying LoRA to different subsets of the attention weight matrices under a fixed trainable-parameter budget. Adapting all four matrices (query, key, value, and output projection) at a small rank performed best; adapting only the query and value matrices at a correspondingly higher rank was comparable. The feed-forward layers can also be adapted but empirically contribute less per parameter. Most practitioners apply LoRA to at least the query and value projections.
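In practice this choice is a one-line configuration in adapter libraries. A sketch using Hugging Face's `peft` (the `target_modules` names depend on the model architecture; GPT-2 fuses Q/K/V into a single `c_attn` projection, while Llama-style models expose `q_proj`, `v_proj`, etc.):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # alpha; alpha / r scales BA
    target_modules=["c_attn"],  # GPT-2's fused Q/K/V projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```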
