一文带你熟悉lora微调各类参数,轻松上手deepsee (English)

Generated: 2026-06-20 18:32:49

---

Damn, Fine-Tuning DeepSeek with LoRA Nearly Broke Me!

So the other day I got a job—fine-tuning DeepSeek for a psychology counseling team to handle multi-turn conversations. The moment you hear that, you're probably thinking: "LoRA? Easy peasy!"

That's exactly what I thought too. And the result?

I crashed and burned right out of the gate!

VRAM exploded, loss flatlined like it was dead, and the output was complete gibberish. I asked, "What should I do if a patient has insomnia?" and it replied, "I suggest playing Honor of Kings."

Can you believe it? It took me three whole days to finally get this one thing through my thick skull—

If your LoRA parameters are off, everything else is a waste.

Come on, today I'm gonna lay it all out for you—every pitfall I fell into and every trick I tried. Code, comparisons, parameter tuning—I'll break it all down.

---

1. Don't Just Focus on r and alpha! Have You Been Burned Too?

So many people jump straight in asking, "What should r be?"

Like it's the only thing that matters.

Honestly, the first time I tried, I thought bigger r was always better, so I set it to 128. What happened? Training was slower than a snail, and the results were nowhere near as good as r=16.

I learned my lesson the hard way. Find the sweet spot, you know?

r (Rank) — Don't Get Tricked by Big Numbers

Some use 8, 16, 64, and some even use 128. After a ton of trial and error, I found that for a 7B model on single-turn instruction tuning, r=8 is more than enough.

For multi-turn dialogues with a bit more data? r=16 to 32 is a safer bet.

For models like DeepSeek-R1-32B, they commonly use r=16 paired with alpha=32—see the ratio? alpha=2r.

This isn't just superstition. I ran comparisons with learning rates of 1e-4 and 5e-5, and when alpha=2r, convergence was the smoothest—like sliding a piece of chocolate.

If you're using unsloth, r=64 works too, but note it loads up all 8 target modules. My own experience says—

Just tuning qproj and vproj already covers 90% of the effect!

Add everything else? You eat up about 10% more VRAM for very little gain. Think about it.

target_modules — Don't Be Greedy

Some only add ["qproj", "vproj"], while others throw in oproj, gateproj, upproj, downproj—the whole gang.

I tested on DeepSeek-7B-chat:

Using only q/v, loss drops a bit slower, but the final BLEU score is about the same.

Full set? Converges a tiny bit faster, but VRAM usage skyrockets.

So—

If you're tight on VRAM, just stick with q and v. If you're loaded, go for it—but I'm not rich, so I can't afford that.

dropout — Think Small Data Doesn't Need It?

Default is 0.05. But unsloth's docs say setting it to 0 also works and trains faster.

I tried both 0 and 0.05. At 0, loss tends to oscillate like a rollercoaster; at 0.05, it's more stable—like walking on flat ground.

But here's the key—

If your data set is really small, like a few hundred samples, cranking up dropout (to 0.1) can help prevent overfitting. I tested this myself on 200 samples, and the validation loss dropped by 0.3!

A quick summary:

More LoRA parameters isn't always better. The biggest trap I fell into was thinking bigger r equals better results, only to double training costs with negligible gains.

**Small data (<1k) use r=8 to 16, large data (>10k) go with r=32 to 64, alpha=2r, pick key modules for target_modules, and set dropout based on your needs.**

---

2. Learning Rate and Batch Size—These Are the Real Bosses!

By now, you might think you've got it. But many people go all-in on LoRA parameters and overlook learning rate, optimizer, and batch size.

But these are what let you sleep at night.

Learning Rate — Is 5e-5 Really the Gold Standard?

You often hear "5e-5 is the golden learning rate for LoRA."

At first, I didn't buy it and tried 1e-4. The loss shot up like a rocket!

Then I tried 3e-5? Convergence was slower than a turtle—I kept staring at the loss curve, wishing I could give it a push.

5e-5 is indeed stable. I used a cosine scheduler with warm-up for the first 20% of steps, then a smooth decline—way better than linear.

As for optimizers? AdamW is the standard—don't swap it. The default betas (0.9, 0.999) work fine. I tried tweaking to 0.95, 0.998, didn't make a difference. Total waste of time.

Batch Size & Gradient Accumulation — Not Enough VRAM?

What if you're low on VRAM?

The common trick is: batchsize=1, gradient accumulation steps=32, effectively batchsize=32.

I tested this on a single RTX 3090, running a 7B model, and VRAM usage stayed around 22GB—rock solid.

If you have a weaker card, like a 2080Ti with 11GB, you can push accumulation to 64 steps. The cost? Training takes twice as long.

My go-to recommendation: batch_size=1, accumulate for 32 to 64 steps. This mimics a large batch without blowing up VRAM.

One pitfall I have to warn you about: Don't set batch_size to 4 and accumulate for 8 steps. That might actually use more VRAM because the activation values are larger.

Mixed Precision & Quantization — Lifesavers!

4-bit quantization is a lifesaver for running large models—no exaggeration.

Set loadin4bit=True in BitsAndBytesConfig, with compute_dtype=torch.float16.

Lesson learned: without quantization, a 7B model alone takes 14GB of parameters, and with optimizer and activations, even 24GB on a 3090 can't handle it.

With quantization? VRAM drops to under 8GB! All that freed space goes to activations—sweet.

But careful: don't mess up torch_dtype when loading the model. I usually use torch.float16, but some models are more stable under bf16, like DeepSeek-R1-32B, which needs bf16.

DeepSpeed ZeRO-3 — A Toy for the Big Shots

If you're fine-tuning a 32B or larger model, single-card quantization won't cut it—you need ZeRO.

DeepSpeed Stage 3 with LoRA, using only 0.4% trainable parameters, can tune a 32.9-billion parameter model!

I've tried it on 4 A100s: ZeRO-3 + LoRA + gradient checkpointing went from 70GB per card down to 35GB—finally able to run.

But setup is a pain. The ZeRO config file's offload parameters are tricky; offloading to CPU will slow you down to a crawl.

Gradient Checkpointing — Must-Have for Low VRAM

This is non-negotiable. Everyone says it's "trading time for space"—and it's true.

I tested it: after enabling gradient checkpointing, each training step was 15% slower, but VRAM usage nearly halved.

For those with limited VRAM, this is a must.

---

3. Data Construction — I Took the Bullet for You on Multi-Turn Dialogue Pitfalls

Even if the parameters are perfect, the wrong data will still trip you up.

Multi-turn dialogue data construction was where I fell the most, by far.

The Simplest Method, The Most Effective

There are various methods out there; I used the most basic one—

Concatenate user and assistant turns into a single text, then set the token loss for user parts to -100.

The first time I didn't mask the user output, the model learned to continue the user's speech like a chatterbox.

Picture this

一文带你熟悉lora微调各类参数,轻松上手deepsee (English)

一文带你熟悉lora微调各类参数,轻松上手deepsee (English)

1. Don't Just Focus on r and alpha! Have You Been Burned Too?

r (Rank) — Don't Get Tricked by Big Numbers

target_modules — Don't Be Greedy

dropout — Think Small Data Doesn't Need It?

2. Learning Rate and Batch Size—These Are the Real Bosses!

Learning Rate — Is 5e-5 Really the Gold Standard?

Batch Size & Gradient Accumulation — Not Enough VRAM?

Mixed Precision & Quantization — Lifesavers!

DeepSpeed ZeRO-3 — A Toy for the Big Shots

Gradient Checkpointing — Must-Have for Low VRAM

3. Data Construction — I Took the Bullet for You on Multi-Turn Dialogue Pitfalls

The Simplest Method, The Most Effective

Cael Lee

Ready to get started?