LoRA: Low-Rank Adapters for Large Language Models

Generated: 2026-06-23 08:53:42

---

A few months ago, I was staring down a line of output—"torch.cuda.OutOfMemoryError"—and I nearly ripped my 3090 out of the chassis and threw it across the room! 😡

I had this project: fine-tune a 7B model for text classification. And what happened? Full parameter fine-tuning kicked off, demanded 40GB of VRAM at minimum, and my card just flatlined on the spot. Two OOMs in a row, and by the third I chickened out and started digging into parameter‑efficient fine‑tuning (PEFT) for a lifeline.

Honestly? The first time I saw "Low‑Rank" in the LoRA paper title, all I could think was "matrix decomposition voodoo." I thought: Here we go again—all smoke and mirrors, right?

But then I tried it.

You know what?

It. Was. Amazing. 🔥

---

Before you jump in, let’s break down why LoRA became the savior for us VRAM‑poor folks.

You see, before LoRA came along, you had two choices to fine‑tune a large model:

One was full-parameter fine-tuning: great results, sure. But the cost? You have to update every single weight in the model and keep a full copy for each task. GPT-3 175B? For one task you store 175B weights, for ten tasks it’s 10 × 175B. At 2 bytes per weight in FP16 that’s roughly 350 GB per copy, so thousands of GB just for storage, never mind the VRAM and electricity during training. And there’s another trap: catastrophic forgetting. The model learns a new skill and forgets part of what it knew before, like a bear picking corn, dropping one ear to grab another.

The other option was various fancy Adapters — fewer parameters, but bigger problems! You insert extra layers into the Transformer, but during inference you can’t merge them back into the original weights. Result? Latency skyrockets, online services scream, user experience tanks.

But! In 2021, the Microsoft team came up with LoRA.

The core idea is so simple it’s ridiculous:

"The rank of weight updates is actually very low, so instead of touching the original weight matrix W, we create two small matrices A and B, and use their product to approximate the change in W."

No formulas here, but basically—

During training, the original model stays frozen; you only train those two little things A and B. At inference time, you add their product BA (A projects down to rank r, B projects back up) into W and merge them into new weights. Extra inference latency? Zero! Absolutely zero! It’s like getting something for nothing! 💥
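To convince myself the "zero extra latency" claim isn't marketing, here is a minimal NumPy sketch of the idea. The matrix sizes, the random values, and the alpha / r scaling factor are toy assumptions for illustration, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4      # toy sizes, chosen only for illustration
alpha = 8                       # scaling numerator; the update is scaled by alpha / r

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection
B = rng.normal(scale=0.01, size=(d_out, r))   # trainable up-projection (nonzero, as if already trained)
x = rng.normal(size=(d_in,))

scaling = alpha / r

# Training-time view: frozen path plus a low-rank side branch.
h_branch = W @ x + scaling * (B @ (A @ x))

# Inference-time view: fold the branch into W once, then do a single matmul.
W_merged = W + scaling * (B @ A)
h_merged = W_merged @ x

print("same output:", np.allclose(h_branch, h_merged))
print("frozen params:", W.size, "| trainable params:", A.size + B.size)
```

The merged matrix has exactly the same shape as W, which is why the deployed model costs nothing extra at inference.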

My immediate reaction: No way. What makes you think the update is low‑rank?

The paper just threw a bunch of rock‑solid experiments at you: On GPT‑3 175B, LoRA cut the number of trainable parameters by 10,000 times, slashed GPU memory by two‑thirds, and matched—or even beat—full fine‑tuning in performance.

And that’s not all: with the same base model, you can attach as many LoRAs as you want. Switch tasks? Change a file, just like that—faster than flipping a page.

So really, who could resist?

---

Talking about actual practice—I used HuggingFace’s PEFT library (0.6.1), along with transformers 4.35.0 and bitsandbytes for quantization and mixed precision.

When I was setting up LoraConfig, I stepped into every single pitfall. Let me walk you through them one by one so you don’t have to.

rank (r): This controls the “rank” of the low-rank matrices and directly impacts the number of parameters. A common starting point is 8, which is also PEFT’s default; for most tasks, 8 is enough. I tried everything from 1 to 64, and 8 versus 16 made almost no difference. But if you dare set r to 1? On my task, performance fell off a cliff! r=64 wasn’t any better either, because with only ten thousand data points I was more likely to overfit. Some people like 128 or even higher, but my advice: unless you have an insane amount of data, you’re just burning compute for nothing.
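To get a feel for how r drives the parameter count, here is a quick back-of-the-envelope sketch. The helper name and the hidden size of 4096 are my own assumptions for illustration; plug in your model's real dimensions.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


d = 4096          # hypothetical hidden size of a square projection matrix
full = d * d      # parameters you would train with full fine-tuning

for r in (1, 8, 16, 64):
    n = lora_param_count(d, d, r)
    print(f"r={r:>2}: {n:>9,} trainable vs {full:,} full ({n / full:.2%})")
```

The count grows linearly in r, so going from 8 to 64 means eight times the trainable parameters per adapted matrix.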

lora_alpha: The scaling factor. Together with r it determines how strongly the LoRA branch contributes to the output. In the source code, scaling = alpha / r. With r=16 and alpha=32 (a common pairing; PEFT’s own defaults are r=8 and lora_alpha=8), scaling is 2. If alpha is too big, your effective steps are too large and you risk tearing things apart. I tried alpha=128, and boom: training collapsed on the spot, a textbook gradient explosion.
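The arithmetic is trivial, but writing it down makes it obvious why bumping alpha without touching r gets risky. The function name below is hypothetical; it simply restates the alpha / r formula from above.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update: alpha / r."""
    return lora_alpha / r


# (alpha, r) pairs: the pairing from the text, the one that blew up, and a smaller setup.
for alpha, r in [(32, 16), (128, 16), (16, 8)]:
    print(f"alpha={alpha:>3}, r={r:>2} -> scaling={lora_scaling(alpha, r)}")
```

Going from alpha=32 to alpha=128 at r=16 quadruples the multiplier on the LoRA branch, which fits the collapse I saw.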

target_modules: This is where I got burned the hardest! At first, I followed the paper and only picked q_proj and v_proj. Result? Convergence so slow it broke my heart. Later I dug into the source code and discovered that many open-source models don’t even use those module names! DeepSeek, for example, does have q_proj, k_proj, v_proj, o_proj, plus gate_proj, up_proj, down_proj in the feed-forward layers.

Want to know what layers your model has? Don’t guess—just print them:


```python
from transformers import AutoModelForCausalLM

# Load the model, then list every submodule whose name mentions "proj"
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-chat")
for name, _ in model.named_modules():
    if "proj" in name:
        print(name)
```

Once you see the results, it’s obvious which ones to add to target_modules. My current habit: for attention modules, include all of q, k, v, o; for feed-forward layers, if it’s a generation task, add gate_proj and up_proj. The improvement is immediately noticeable.
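The raw printout repeats the same names for every layer, so it helps to collapse it down to unique suffixes. Below is a small sketch of that; the helper name is my own, and the example paths are hypothetical (LLaMA-style layout) rather than copied from a real run. In practice you would feed it the names from model.named_modules().

```python
def unique_leaf_names(module_names, keyword="proj"):
    """Collapse dotted module paths to their sorted, de-duplicated last component."""
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    return sorted(leaf for leaf in leaves if keyword in leaf)


# Hypothetical names for illustration; real ones come from model.named_modules().
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "model.layers.1.input_layernorm",
]

print(unique_leaf_names(example_names))
```

The resulting list is in the short-name form that target_modules accepts.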

bias: Leave it alone, just set it to "none". LoRA’s whole point is training two projection matrices—why would you touch the bias? Don’t create trouble for yourself.

lora_dropout: I set it to 0.05, mainly to prevent overfitting on small datasets. I tried 0.1 once—convergence was so slow I wanted to swear. Stick with 0.05, okay?
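Putting the settings above together, a config might look roughly like this. Treat it as a sketch: lora_alpha=16 and task_type="CAUSAL_LM" are my assumptions here (the first keeps scaling at 2 with r=8, the second matches the AutoModelForCausalLM loading used earlier), and the target module names must match whatever your own model actually prints.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                    # rank of the low-rank matrices
    lora_alpha=16,          # assumption: gives scaling = alpha / r = 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # must exist in your model
    lora_dropout=0.05,      # light regularisation for small datasets
    bias="none",            # leave bias terms frozen
    task_type="CAUSAL_LM",  # assumption: use "SEQ_CLS" for a classification head
)

# With a loaded base model in `model`, wrapping it would look like:
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
```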

There’s another trick in initialization. The paper initializes A with a random Gaussian (PEFT’s implementation uses Kaiming-uniform by default) and B as all zeros. Some people ask: why not set both to zero?

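Listen, this is counter-intuitive: the gradient for A contains the transpose of B, and the gradient for B contains A. If both are zero, both gradients are zero, so neither can ever learn anything. That’s a dead end. So A must be random and B must be zero. Sure, the product BA still starts at zero, so the model begins identical to the pretrained one, but B gets a real gradient on the very first batch, and once B moves, A starts learning too.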

Practice is the only truth—I’ve confirmed it. 👇
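Here is roughly how that check can be reproduced with plain NumPy. It is a toy sketch under my own assumptions: a single adapted layer h = Wx + BAx, a random vector standing in for the upstream gradient dLoss/dh, the alpha / r scaling left out, and one plain SGD step with a made-up learning rate of 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2
x = rng.normal(size=(d_in,))     # one input vector
g = rng.normal(size=(d_out,))    # stand-in for dLoss/dh coming from upstream


def lora_grads(A, B):
    """Gradients w.r.t. A and B for h = W x + B A x, given g = dLoss/dh."""
    grad_B = np.outer(g, A @ x)      # shape (d_out, r): depends on A
    grad_A = np.outer(B.T @ g, x)    # shape (r, d_in): depends on B
    return grad_A, grad_B


zeros_A = np.zeros((r, d_in))
zeros_B = np.zeros((d_out, r))
rand_A = rng.normal(size=(r, d_in))

# Case 1: both zero. Neither matrix receives any gradient.
gA_zero, gB_zero = lora_grads(zeros_A, zeros_B)
print("both zero:       |grad A| =", np.abs(gA_zero).max(), " |grad B| =", np.abs(gB_zero).max())

# Case 2: A random, B zero. B receives gradient immediately, A does not yet.
gA_init, gB_init = lora_grads(rand_A, zeros_B)
print("A random, B = 0: |grad A| =", np.abs(gA_init).max(), " |grad B| =", np.abs(gB_init).max())

# One SGD step on B is enough to give A a nonzero gradient.
B_after = zeros_B - 0.1 * gB_init
gA_next, _ = lora_grads(rand_A, B_after)
print("after one step:  |grad A| =", np.abs(gA_next).max())
```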

---

After training, what’s most satisfying about LoRA is merging the weights.

One call, model.merge_and_unload(), and you’re done. Of course you can skip merging and just use peft_model.generate() directly with the LoRA branch. But I prefer to merge, because then you get a complete model for deployment: no extra branches to load, saving both VRAM and hassle.
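For reference, the whole merge step can be wrapped in a few lines. This is a sketch, not my exact script: the helper name and the directory paths are hypothetical, and it assumes the adapter was saved earlier with save_pretrained() and that the base model is loaded unquantized.

```python
def merge_lora_adapter(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Load a base model, attach a saved LoRA adapter, fold it in, and save the result."""
    # Imported here so the heavy libraries load only when the helper is called.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    peft_model = PeftModel.from_pretrained(base, adapter_dir)
    merged = peft_model.merge_and_unload()   # base model with the LoRA update folded into W
    merged.save_pretrained(output_dir)


# Hypothetical paths, shown only to illustrate the call:
# merge_lora_adapter("deepseek-ai/deepseek-llm-7b-chat", "path/to/adapter", "path/to/merged")
```

The saved directory can then be loaded like any ordinary checkpoint, with no PEFT dependency at serving time.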

Later, I also tried QLoRA.

First quantize the base model to 4‑bit, then attach LoRA modules and train. VRAM drops to below 8GB! My ancient 3060 could suddenly run a 7B model!

One thing to watch: with a quantized model plus LoRA, the LoRA module’s gradients must stay in FP16 or FP32—don’t calculate gradients in int4. bitsandbytes handles this automatically, but you need to make sure the optimizer doesn’t quantize the LoRA weights too. In terms of performance, there’s a tiny loss compared to pure LoRA, but it’s almost invisible—especially for simple tasks like classification; it’s practically free.
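A QLoRA-style setup might look something like the sketch below. The specific choices (NF4 quantization, FP16 compute dtype, device_map="auto") are my assumptions rather than settings verified against my original run, and it needs a CUDA GPU with bitsandbytes installed.

```python
import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed settings: 4-bit NF4 storage for the frozen base, FP16 for the actual matmuls.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-chat",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training (for example, casting norm layers to FP32).
model = prepare_model_for_kbit_training(model)

# From here, attach LoRA as usual: get_peft_model(model, lora_config)
```

Only the base weights are stored in 4-bit; the LoRA matrices added afterwards stay in regular floating point, which is the point made above.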

---

I always see people arguing online: “Can LoRA really match full parameter fine‑tuning?”

My answer is one sentence: **For most tasks, especially on small‑ to medium‑sized datasets, Lo

Cael Lee
