LoRA Fine-Tuning: 99.6% Fewer Parameters, Yet It Beats Full Fine-Tuning

Generated: 2026-06-20 18:28:32

---

Brother, I Almost Drove Myself Crazy Just to Save One A800

You have to come back with me to that late night last year.

I had a project on my hands—turning Qwen-7B into a customer service model. My thinking was simple: full fine-tuning, of course. Everyone knows full-parameter tuning gives you the best results, right?

So what happened?

I crammed it onto a single 80GB A800 with batch size 1. It ran two steps and instantly went OOM. I stared at that red error message on the screen and wanted to throw my computer out the window.

After two days of struggling, I had to borrow two more cards from colleagues just to get it running. Setting up the environment, tweaking parameters, waiting for training to finish—over a week gone.

And the thing that made my blood pressure spike the most? The final result was actually worse than the version I later fine-tuned with LoRA!

That one took only two hours and used less than 20GB of VRAM.

You know what? That night I lay in bed staring at the ceiling, thinking the whole time: What the hell was that whole week of overtime even for?

---

Let's Start With Why Full Fine-Tuning Is Looking More and More Like a Joke

Here's a number for you: GPT-3 has 175 billion parameters.

How much VRAM does full fine-tuning need?

1.2 TB.

You read that right—TB, not GB.

For LLaMA 65B, just storing the parameters as float16 takes 130 GB. Add fp16 gradients and fp32 Adam optimizer states, and the bill climbs to roughly 16 bytes per parameter: about 1 TB in total, or 8 times the weights alone.

I tested it myself—Qwen2-7B full fine-tuning barely fit on one A800. Increase the sequence length or batch size a little, and you blow the VRAM instantly. Qwen2-72B? Forget it with fewer than 10 A800s.

Full fine-tuning has become one of the biggest roadblocks to deploying large models.

But think about it: for a downstream task, like writing customer service replies for your company, do you really need to tweak every one of those billions of parameters?

---

2021–2022: LoRA Is Born, an Idea So Simple It Makes You Doubt

The LoRA paper hit arXiv in 2021 and was published at ICLR in 2022.

Honestly, when I first read it, I thought: "It's just adding two tiny matrices. What's the big deal?"

Later I realized I was the naive one.

LoRA's logic is incredibly straightforward—freeze the original weight matrix W and attach two small matrices, A and B, on the side. During forward propagation:

output = W·x + (B·A)·x × (alpha/r)

Training only updates A and B; W stays frozen.

The parameter r controls the rank, usually 4, 8, or 16. Alpha is the scaling factor that controls the update strength of the LoRA branch.
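To make the formula concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name `LoRALinear`, the default `alpha=16.0`, and the 0.01 init scale are my own illustrative choices, not something from the paper's code; the only thing taken from the formula above is the frozen W plus the scaled B·A branch.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank branch B @ A (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen "pretrained" weight W (random here, standing in for real weights).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # A starts small and random, B starts at zero, so B @ A is zero at step 0.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                       # W·x
        update = (x @ self.lora_A.T) @ self.lora_B.T   # (B·A)·x, computed as B·(A·x)
        return base + update * self.scaling


layer = LoRALinear(in_features=64, out_features=32, r=4, alpha=8.0)
x = torch.randn(2, 64)
y = layer(x)
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(y.shape, trainable)
```

Because B starts at zero, the layer behaves exactly like the frozen W at the first step, and only `lora_A` and `lora_B` show up as trainable.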

Let me give you a concrete number to feel the difference—

For a weight matrix of size (4096, 4096), the original parameter count is about 16.8 million (4096 × 4096 = 16,777,216).

Switch to LoRA with rank=8: A is (8, 4096), which is 32,768 parameters; B is (4096, 8), another 32,768. Total: 65,536.

That's 99.6% fewer parameters than the original.

You read that right: from 16.8 million down to about 65,000. That's the most mind-blowing thing about LoRA.
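If you want to redo that arithmetic for your own layer sizes, a few lines of Python are enough. The helper name `lora_param_counts` is made up for this sketch.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (parameters in the full matrix, parameters in A plus B)."""
    full = d_out * d_in            # W is (d_out, d_in)
    lora = r * d_in + d_out * r    # A is (r, d_in), B is (d_out, r)
    return full, lora


full, lora = lora_param_counts(4096, 4096, 8)
reduction = 1 - lora / full
print(f"{full:,} -> {lora:,} ({reduction:.1%} fewer)")
# prints: 16,777,216 -> 65,536 (99.6% fewer)
```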

In their experiments, the authors of the LoRA paper observed a phenomenon: during fine-tuning, most meaningful weight updates actually lie in a low-dimensional subspace. Constraining the updates to low-rank matrices results in almost zero performance loss.

When I first read that conclusion, I thought: "Really? With that many fewer parameters, the performance is still good?"

Later I ran a bunch of experiments and found out—it's actually true.

---

Diving Deeper: Where Exactly Does LoRA Save VRAM?

Many people's understanding of LoRA stops at "fewer parameters," but those who have actually trained models will ask: when it saves VRAM, where exactly is it saved?

This question deserves to be broken down thoroughly.

Let's start with computation.

You might think that fewer parameters means less computation.

Wrong.

When it comes to computing gradients, LoRA actually increases the workload.

Why?

During forward propagation, the original weight W still has to be computed. LoRA just doesn't compute gradients for W—it doesn't keep W from participating in the computation.

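During backpropagation, the gradient still has to flow back through W to reach the earlier layers, even though W itself gets no weight gradient. On top of that, you now compute gradients for A and B. So the compute bill does not shrink in proportion to the parameter count, and the extra branch adds work of its own.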
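You can watch this happen on a toy example. The sketch below uses bare tensors with made-up sizes and leaves out the alpha/r scaling for brevity; it only shows which tensors end up with a gradient when W is frozen.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16, requires_grad=True)        # stand-in for an earlier layer's output
W = torch.randn(8, 16)                            # frozen: requires_grad is False by default
A = (torch.randn(2, 16) * 0.1).requires_grad_()   # trainable, rank r = 2
B = torch.randn(8, 2).requires_grad_()            # trainable (nonzero so A gets a real gradient)

out = x @ W.T + (x @ A.T) @ B.T
out.sum().backward()

print(W.grad is None)               # True: no weight gradient is stored for W
print(x.grad is not None)           # True: the gradient still flowed back through W to x
print(A.grad.shape, B.grad.shape)   # the only weight gradients that exist
```

The gradient that reaches `x` contains W's contribution, so W is still part of the backward pass; what you skip is computing and storing a gradient for W itself.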

So where do the VRAM savings come from?

It comes from model states.

I've done experiments watching this firsthand. During training, you need to store three things: model parameters, gradients, and optimizer states.

With mixed-precision full fine-tuning, these three together require about 16 bytes per parameter: model parameters in fp16 take 2 bytes, gradients in fp16 take 2 bytes, and optimizer states in fp32 (first moment, second moment, and a master copy of the parameters) take 12 bytes. That is 8 times the size of the fp16 weights.

LoRA's strategy: W still participates in forward and backward passes, but no gradient is computed for it and it is never updated.

This means you don't need to store W's gradients or optimizer states at all.

Take a model with 1 billion parameters, which uses about 2GB of VRAM in fp16 during inference. Full fine-tuning would require roughly 1 billion × 16 bytes = 16GB just for these model states.

If LoRA only fine-tunes some layers, the space saved…

Is enormous.

I did the math: with r=8, the trainable parameters in LoRA account for about 0.78% of total parameters (based on attention layers Wq and Wv). That means gradient VRAM savings of about 99.22%.

LoRA doesn't save on computation—it might even add a little—but on storage, it achieves an order-of-magnitude compression.
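Here is a back-of-the-envelope estimator for that storage gap. It is a rough sketch under stated assumptions: mixed-precision Adam, 2 bytes for every frozen fp16 parameter, 16 bytes for every trainable one (fp16 weight, fp16 gradient, 12 bytes of fp32 optimizer state), and activations ignored entirely. The function name `model_state_gb` is hypothetical, and the 0.78% trainable share is simply the figure quoted above.

```python
def model_state_gb(n_frozen: float, n_trainable: float) -> float:
    """Rough model-state memory in GB (1 GB = 1e9 bytes); activations are ignored."""
    frozen_bytes = 2 * n_frozen               # fp16 weights only
    trainable_bytes = (2 + 2 + 12) * n_trainable  # fp16 weight + fp16 grad + fp32 Adam state
    return (frozen_bytes + trainable_bytes) / 1e9


n = 1e9  # a 1-billion-parameter model
full = model_state_gb(n_frozen=0, n_trainable=n)
lora = model_state_gb(n_frozen=n, n_trainable=0.0078 * n)
print(f"full fine-tuning: {full:.1f} GB, LoRA: {lora:.2f} GB")
# prints: full fine-tuning: 16.0 GB, LoRA: 2.12 GB
```

Real runs will land higher than either number because activations, temporary buffers, and framework overhead are not counted here.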

---

I Wrote My Own Code to Verify It

I ran a small experiment to check whether LoRA's backpropagation was correct.

In simple terms: I constructed an input x, ran it through the original weight W and through the LoRA weight (BA) plus W, then computed gradients. I manually computed gradients with torch.autograd.grad and compared them to the combined gradient of LoRA.

The results matched exactly.

Loss value: tensor(1.8470, grad_fn=)

Seeing that number gave me solid peace of mind.

No matter how fancy the theory sounds, there's nothing like running the code yourself.
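For anyone who wants to repeat this kind of check, here is a minimal sketch. It is not my original script: the sizes, seed, and squared-error loss are arbitrary choices for illustration, so the loss value will not match the one above. It computes the gradients of A and B with `torch.autograd.grad`, then compares them with what the chain rule predicts from the gradient of the merged weight W + (alpha/r)·B·A.

```python
import torch

torch.manual_seed(0)
n, d_in, d_out, r, alpha = 5, 12, 10, 4, 8.0
s = alpha / r
f64 = torch.float64  # double precision keeps the comparison tight

x = torch.randn(n, d_in, dtype=f64)
target = torch.randn(n, d_out, dtype=f64)
W = torch.randn(d_out, d_in, dtype=f64)                    # frozen
A = torch.randn(r, d_in, dtype=f64, requires_grad=True)
B = torch.randn(d_out, r, dtype=f64, requires_grad=True)   # nonzero so both gradients are nontrivial

# LoRA path: two branches, W is never merged.
loss_lora = ((x @ W.T + s * (x @ A.T) @ B.T - target) ** 2).mean()
grad_A, grad_B = torch.autograd.grad(loss_lora, (A, B))

# Reference path: a single merged weight, gradient taken w.r.t. that matrix.
W_merged = (W + s * B.detach() @ A.detach()).requires_grad_()
loss_merged = ((x @ W_merged.T - target) ** 2).mean()
(grad_W,) = torch.autograd.grad(loss_merged, (W_merged,))

# Chain rule through W_merged = W + s * B @ A.
expected_A = s * B.detach().T @ grad_W
expected_B = s * grad_W @ A.detach().T

print(torch.allclose(loss_lora.detach(), loss_merged.detach()))
print(torch.allclose(grad_A, expected_A), torch.allclose(grad_B, expected_B))
```

If all three checks print True, the two-branch forward pass and the merged weight give the same loss, and the gradients of A and B are exactly the merged-weight gradient pushed through the low-rank factorization.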

---

2023–2024: The Siblings of LoRA

After LoRA became popular, variants sprang up like mushrooms.

Let me pick a few that I've actually used and that truly represent the landscape.

QLoRA is the variant I use the most.

Its idea is direct: quantize the pretrained weights to 4-bit, then attach LoRA on top.

Weights that originally needed 16-bit storage are squeezed down to 4-bit. VRAM consumption takes another huge cut.

I tested it on a 7B model. With QLoRA fine-tuning, a single RTX 3090 (24GB VRAM) could handle it.

Would you have dared to think that before? Full fine-tuning a 7B model requires at least 40GB VRAM. Now a single 3090 does it.

The trade-off? QLoRA has to dequantize the 4-bit weights on the fly in every forward and backward pass, which adds a few steps and some training-time overhead compared to plain LoRA. On some tasks, the performance might be slightly lower, but the gap is small, so small that in most scenarios you can't even feel it.
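To get a feel for "4-bit base weights plus a LoRA branch", here is a toy NumPy sketch. Big caveat: it uses plain blockwise absmax rounding to 16 integer levels, which is a simplification I picked for readability and not the quantization data type QLoRA actually uses. The codes are also kept one per int8 rather than packed two per byte, and the function names are made up.

```python
import numpy as np


def quantize_4bit(w: np.ndarray, block_size: int = 64):
    """Toy blockwise absmax quantization to 16 signed levels (-8..7)."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # one scale per block
    scales[scales == 0] = 1.0                                 # avoid division by zero
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales


def dequantize_4bit(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Rebuild an approximate float weight from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)).astype(np.float32)       # "pretrained" weight
A = rng.standard_normal((8, 128)).astype(np.float32) * 0.01  # LoRA A, rank 8
B = np.zeros((128, 8), dtype=np.float32)                     # LoRA B, zero at init
x = rng.standard_normal((4, 128)).astype(np.float32)

codes, scales = quantize_4bit(W)                 # stored form of the frozen weight
W_hat = dequantize_4bit(codes, scales, W.shape)  # rebuilt on the fly for each pass
y = x @ W_hat.T + (x @ A.T) @ B.T                # frozen 4-bit base + full-precision LoRA branch

print(y.shape, float(np.abs(W - W_hat).max()))
```

The frozen weight only ever lives in memory as small integer codes plus one scale per block, while A and B stay in full precision and are the only things that would be trained.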

DoRA is another interesting direction.

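It decomposes each pretrained weight matrix into a magnitude component and a direction component and handles them separately: the magnitude is trained directly, while the direction is updated through a LoRA-style low-rank branch.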

This idea comes from an observation: full fine-tuning changes weights unevenly in direction and magnitude. DoRA tries to control this process more finely.

I have to admit, on some benchmarks DoRA does show an improvement. But in my own business scenarios, the difference from LoRA was negligible—maybe my tasks aren't complex enough yet.
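The magnitude and direction split is easier to see in code. This NumPy sketch is my own simplified reading of the idea, not a reference implementation: it normalizes per column, drops the alpha/r scaling, and the function name `dora_weight` is hypothetical. Real libraries may normalize along a different axis.

```python
import numpy as np


def dora_weight(W0: np.ndarray, m: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Recompose a weight from magnitude m and the direction of (W0 + B @ A)."""
    V = W0 + B @ A                                       # direction, updated through LoRA
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # one norm per column
    return m * V / col_norm                              # unit-norm columns rescaled by m


rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 4
W0 = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                         # zero at init, as in LoRA
m = np.linalg.norm(W0, axis=0, keepdims=True)    # magnitude initialised from W0

W_init = dora_weight(W0, m, A, B)
print(np.allclose(W_init, W0))                   # the layer starts out identical to W0

# After a (random, illustrative) update to B, the column norms still equal m.
B_updated = rng.standard_normal((d_out, r)) * 0.1
W_new = dora_weight(W0, m, A, B_updated)
print(np.allclose(np.linalg.norm(W_new, axis=0, keepdims=True), m))
```

The second check is the whole point of the split: the low-rank branch can only rotate each column, and any change in its length has to come from training `m`.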

---

The Two Most Critical Questions: To LoRA or Not? And How to Use It?

First question: Should every scenario use LoRA?

My answer: No.

But if you ask me "For most business scenarios, is LoRA the best choice?" I'd say with confidence: Yes.

I've summarized a decision logic, four steps:

Step 1: If resources are limited, go straight to QLoRA.

Can't handle full fine-tuning on a single card? Don't agonize—LoRA or QLoRA is your only option.

**Step


Cael Lee
