Efficient Fine-Tuning of Large Models: LoRA Principles Explained and a Deep Dive into the Training Process

Generated: 2026-06-20 16:12:30

---

Last month, a reader messaged me, his screen practically trembling.

He said he'd been stalking JD.com for three months, finally snagged a 4090, and for a moment felt like the king of compute! But then? He eagerly kicked off a fine-tuning run, and the moment the code started, the screen went gray: OOM. It just blew up.

I asked him: "You... didn't run full fine-tuning directly, did you?"

He: "What else would I do?"

Sigh. Just think: that 4090 of his only has 24GB of VRAM. Take full fine-tuning of an 8B-parameter model with mixed-precision Adam. Just storing the weights in fp16 takes 16GB, and the gradients take another 16GB. Then come the optimizer states: a full fp32 copy of the parameters, the first moment, and the second moment, at 32GB each. Good grief, you're looking at a baseline of 128GB of VRAM, and that's before counting activations!
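If you want to redo that back-of-the-envelope math for your own model, here's a minimal sketch. It assumes mixed-precision training with Adam (fp16 weights and gradients, plus an fp32 parameter copy and two fp32 moments) and ignores activations and framework overhead. The function name `full_finetune_memory_gb` is made up for this illustration.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough memory budget for full fine-tuning, in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam. Activations and overhead are NOT included.
    """
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 parameter copy": 4,
        "fp32 Adam first moment": 4,
        "fp32 Adam second moment": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


budget = full_finetune_memory_gb(8e9)  # the 8B model from the story
for name, gb in budget.items():
    print(f"{name:>24}: {gb:6.1f} GB")
```

That works out to 16 bytes per parameter, which is where the 128GB figure for an 8B model comes from.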

See, that's not fine-tuning, that's a funeral for your GPU.

Speaking of which, have you ever felt this way too? You want to do something big, but the hardware just grinds you into the dirt. So many devs dream of tinkering with big models, only to be turned away at the very first step—by VRAM.

I was the same way back then. It hurt so much I questioned my existence. How did I turn things around? I humbly surrendered to LoRA.

The core idea of LoRA, put simply, is this: You were about to move a mountain, but now you just need to prop two little ladders against its base.

The original weight matrix W₀ is the mountain: huge and heavy. We leave it untouched and just place two small matrices, B and A, next to it, and let their product BA stand in for the small change ΔW that fine-tuning needs.
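Written out, the idea looks like this. The notation follows the LoRA paper, specialized to a square d×d layer as in this post; α is the paper's scaling hyperparameter, which the rest of this post doesn't dwell on.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times d},\quad r \ll d
```

Only A and B receive updates; W₀ appears in the forward pass but is never changed.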

Guess what? Suppose the original matrix is 4096×4096. Full fine-tuning would have to update all d² entries, which is about 16.77 million parameters. With LoRA, you train a d×r matrix and an r×d matrix instead. With r=8, that's 2×4096×8 = 65,536 parameters, roughly 65,000.

That's two orders of magnitude fewer! It's not magic—it's low-rank decomposition from linear algebra.
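Here's the same count as a tiny sketch, in case you want to plug in your own layer size and rank. The helper name `lora_param_counts` is just something I made up for this example.

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Return (full, lora): trainable parameters for one weight matrix."""
    full = d_in * d_out            # every entry of the d_out x d_in matrix
    lora = r * d_in + d_out * r    # A is r x d_in, B is d_out x r
    return full, lora


full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

For the 4096×4096 case with r=8, the ratio is 256, which is the "two orders of magnitude" above.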

Think about it: the weight change from fine-tuning, itself a 4096×4096 matrix, can be approximated by the product of two small ones. What does that tell you? That the update has a ton of redundancy! In other words, the directions you need to adjust during fine-tuning are few and squeezed into a low-dimensional subspace.

The paper tested it on GPT-3 175B: training VRAM dropped from 1.2TB with full fine-tuning to 350GB with LoRA. And the saved checkpoint went from 350GB down to about 35MB (with r=4 on the query and value projections), because you only store those low-rank matrices.

The first time I saw that compression ratio, I thought the data must be wrong. But after running it myself a few times, I realized—it's really that insane!

But here's a detail you absolutely would not guess.

You think LoRA's savings come from doing less computation? Wrong! The forward pass still runs through every frozen weight, the backward pass still has to propagate through every layer, and the adapters add a tiny bit of work on top.

You added two little ladders, so you're doing extra multiplication. So where does the saving come from?

It comes from this: you no longer have to keep gradients for the pretrained weights! And you don't have to maintain their optimizer states either!

The original weights W₀ are still there, but they're frozen—they produce no gradients, no updates. How much gradient memory does that free up? Take a 7B model: just the gradients save 7B×2 bytes, which is 14GB. Optimizer states save even more—fp32 momentum, variance, parameter copy—12 bytes per parameter in total, directly saving another 84GB!

And your two little ladders? They only take up a few megabytes to tens of megabytes in total. Their gradients and optimizer states are almost negligible.
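To put numbers on both sides of that trade, here's a rough sketch. The byte counts are the ones used in this post (2 bytes per gradient, 12 bytes of optimizer state per parameter). The adapter layout is purely an assumption for illustration: 64 adapted 4096×4096 matrices (think 32 layers with two projections each) at r=8. Both function names are hypothetical.

```python
def frozen_savings_gb(n_frozen_params: float) -> dict:
    """Memory no longer needed once the base weights are frozen (GB, 1e9 bytes)."""
    return {
        "gradients": n_frozen_params * 2 / 1e9,         # fp16 gradients
        "optimizer_states": n_frozen_params * 12 / 1e9,  # fp32 copy + 2 moments
    }


def adapter_footprint_mb(d: int, r: int, n_adapted_matrices: int) -> dict:
    """Training memory for the LoRA matrices themselves (MB, 1e6 bytes)."""
    n = n_adapted_matrices * 2 * d * r  # one A (r x d) and one B (d x r) each
    return {
        "params": n,
        "weights_mb": n * 2 / 1e6,
        "gradients_mb": n * 2 / 1e6,
        "optimizer_states_mb": n * 12 / 1e6,
    }


savings = frozen_savings_gb(7e9)
# Assumed layout: 64 adapted square matrices, d=4096, r=8.
adapter = adapter_footprint_mb(d=4096, r=8, n_adapted_matrices=64)
print(savings)
print(adapter)
```

Under these assumptions the adapters hold about 4.2 million parameters, so their weights, gradients, and optimizer states together stay in the tens of megabytes, against roughly 98GB saved on the frozen side.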

So you see, the bulk of the VRAM savings happens silently.

And the cost? Training time doesn't necessarily decrease, and it might even get slower, because the two extra matrices add their own forward and backward work. But in the real world, who cares? Waiting a few more hours is way better than not being able to run at all!

Speaking of which, when I first started using LoRA, I stepped into a few pits myself.

Like picking the rank r too casually. Some people think bigger r is always better and crank it up to 64 or even 128. But experiments show that for many tasks, r=4 or 8 is enough. Any bigger and you risk overfitting, and VRAM and compute climb right along with r. Hugging Face's PEFT library defaults to r=8. I suggest you start at 8 and only adjust if needed.

Learning rate is another easy pitfall. The LoRA paper used the same learning rate as full fine-tuning, but in my own experiments LoRA converged more slowly and needed a higher learning rate. A more recent paper, LoRA+, analyzed this specifically: B starts at zero while A starts random, so the two matrices play asymmetric roles, and the authors argue that one shared learning rate is suboptimal for them. They recommend giving B its own learning rate, several times larger than A's. I tried it, it works, and it can boost performance by 1-2 points. But it's not magic on every dataset. If you don't want the hassle, just use a uniform lr=2e-4, which is good enough for most scenarios.
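The "separate learning rate for B" idea boils down to splitting the adapter parameters into two groups. Here's a minimal sketch in plain Python. It assumes the parameter names contain `lora_A` or `lora_B`, and the example names, the helper `lora_plus_param_groups`, and the ratio of 4.0 are all placeholders I picked for illustration, not values from the paper.

```python
def lora_plus_param_groups(param_names, base_lr=2e-4, b_lr_ratio=4.0):
    """Build two optimizer-style groups so B can use a larger learning rate.

    Assumption: names containing 'lora_A' / 'lora_B' identify the two matrices.
    b_lr_ratio is a hypothetical multiplier; tune it for your own task.
    """
    group_a = [n for n in param_names if "lora_A" in n]
    group_b = [n for n in param_names if "lora_B" in n]
    return [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * b_lr_ratio},
    ]


# Hypothetical parameter names, for illustration only.
names = [
    "layers.0.attn.q_proj.lora_A.weight",
    "layers.0.attn.q_proj.lora_B.weight",
    "layers.0.attn.v_proj.lora_A.weight",
    "layers.0.attn.v_proj.lora_B.weight",
]
groups = lora_plus_param_groups(names)
for g in groups:
    print(g["lr"], g["params"])
```

In a real PyTorch run, the `params` lists would hold the tensors rather than their names, and a list shaped like this is what an optimizer such as `torch.optim.AdamW` accepts as per-group settings.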

And that initialization—a lot of people don't think it through.

The LoRA paper initializes B to zero, and A with a random Gaussian. Why make one zero? Why not just random for both?

No can do!

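Think about it: if both are random, right from the start you'll inject a bunch of noise into the model, the loss will spike, and training begins from a worse point than the pretrained model. Keep B=0 and A random, and ΔW = BA is exactly zero before the first step, so the model starts out identical to the pretrained one. In the first training step, B gets a nonzero gradient (it depends on A, which is random), while A's gradient is zero because it is multiplied by B. Only once B has moved away from zero does A start to update. The model remains stable at the initial stage.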
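You can check which matrix actually moves first with a toy example. This is a minimal pure-Python sketch under my own assumptions: a single 4×4 layer, rank 2, a squared-error loss, plain SGD, and no α/r scaling. With h = W₀x + B(Ax), the gradient for B is the outer product of dL/dh and Ax, and the gradient for A is the outer product of Bᵀ(dL/dh) and x.

```python
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def lora_grads(W0, A, B, x, y):
    """Gradients of 0.5 * ||h - y||^2 w.r.t. A and B, with h = W0 x + B (A x)."""
    Ax = matvec(A, x)
    h = [w + d for w, d in zip(matvec(W0, x), matvec(B, Ax))]
    g = [hi - yi for hi, yi in zip(h, y)]        # dL/dh
    grad_B = outer(g, Ax)                        # same shape as B (d x r)
    grad_A = outer(matvec(transpose(B), g), x)   # same shape as A (r x d)
    return grad_A, grad_B


random.seed(0)
d, r, lr = 4, 2, 0.1
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # random init
B = [[0.0] * r for _ in range(d)]                                # zero init
x = [1.0, -2.0, 0.5, 3.0]
y = [0.0, 1.0, 0.0, -1.0]

grad_A, grad_B = lora_grads(W0, A, B, x, y)
step1_A_is_zero = all(v == 0.0 for row in grad_A for v in row)
step1_B_is_nonzero = any(v != 0.0 for row in grad_B for v in row)

# One SGD step: only B changes, because A's gradient is zero.
B = [[b - lr * gb for b, gb in zip(rb, rgb)] for rb, rgb in zip(B, grad_B)]
grad_A2, _ = lora_grads(W0, A, B, x, y)
step2_A_is_nonzero = any(v != 0.0 for row in grad_A2 for v in row)

print(step1_A_is_zero, step1_B_is_nonzero, step2_A_is_nonzero)
```

All three flags come out True: at initialization A has no gradient, B does, and A only starts receiving gradient after B has taken its first step.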

That's a very fine touch.

Later on, LoRA became way more popular in image generation than in LLMs. Almost all custom models on the market are built on LoRA underneath. But diffusion models are a bit special: they denoise over many timesteps, and different timesteps might need different levels of adjustment. The standard approach is to add LoRA to all cross-attention layers in the UNet, and sometimes to the MLP as well.

But remember: training epochs can't be too many—three to five is the max. Go beyond that, and your generated images will look identical to the training set—copy-pasted, no soul.

In the end, LoRA is just like the saying: You don't need to build a new ship—as long as the rudder can turn, that's enough.

You got it?

Cael Lee
