如何系统入门大模型微调并进行相关的实践? (English)

Generated: 2026-06-21 00:21:22

---

From GPT to Ollama: I Spent a Week and Blew My VRAM Three Times Before I Could Tell You This

You have no idea what I went through last week.

Three VRAM blowouts. Two times my model turned into a complete idiot. Countless script revisions until I was staring at "CUDA Out of Memory" in giant red letters at 3 AM, ready to chuck my whole computer out the window.

But let me tell you—when I finally got that medical model I fine-tuned running on Ollama, threw a hepatitis B lab report at it, and watched it spit out spot-on analysis after analysis—every single melted GPU second was worth it.

There are tons of articles online about fine-tuning. But nine out of ten follow the same script: "Install a package, run a command, you're done." Nobody tells you why fine-tuning works the way it does today. Nobody explains how those historical models are actually related.

So I decided to walk the whole chain myself—from theory to deployment—and dump every pitfall and insight I've collected over the years.

---

1. A Little Backstory: The GPT vs. BERT "Palace Drama"

I have to get this straight for you first—the whole GPT vs. BERT feud. Otherwise you'll never understand why fine-tuning looks the way it does today.

Six years ago, there were two camps fighting it out.

The BERT crowd believed in "fill in the blank." Pretrain a model by randomly masking some words and making it guess what's missing. Add a next-sentence prediction task, and it learns bidirectional context cold. Basically, BERT is the king of reading comprehension—classification, named entity recognition, sentiment analysis, all in its wheelhouse. Back then, almost every company just slapped a tiny classifier on top of BERT and called it a day. So easy.

The GPT crowd? They only did one thing: word chain. Predict the next word based on what came before, nothing else. At the time, hardly anyone thought this would go anywhere, but OpenAI kept grinding on it. A lot of people in the industry laughed behind their backs.

You can guess what happened next. GPT kept growing, and at some tipping point, a model that could only do word prediction suddenly developed reasoning and conversational abilities all on its own. Meanwhile, BERT's paradigm of "feature extraction + task adaptation" slowly got buried in the age of large models.

So here's the thing: "fine-tuning" today is a completely different beast from what we used to call "downstream fine-tuning."

Back then, you fine-tuned BERT for classification by adding a classifier on top and retraining those few layers. Now you fine-tune GPT by teaching a model that only knows how to write essays to follow instructions, output specific formats, and answer professional questions. Essentially—you're taking a kid who already knows how to talk and teaching them to do a specific job.

A lot of people writing fine-tuning tutorials don't even understand this fundamental shift. Otherwise they wouldn't just throw a few LoRA commands at you and act like that's all there is to it.

---

2. The Fine-Tuning Methods Are Just a Few—Don't Let the Jargon Scare You

LoRA, QLoRA, Prompt Tuning—they sound fancy, but when you break them down, they're really simple.

Full fine-tuning—you update every parameter. Best performance, but even a 7B model needs tens of GB of VRAM. Unless you have an A100 at home, pretty much forget it.

That's why smart people came up with Parameter-Efficient Fine-Tuning (PEFT). The motto: spend little, get big results.

Prompt Tuning: Prepend a set of trainable virtual tokens to the input, and only train those tokens.
P-tuning / P-Tuning v2: Similar, but more flexible. v2 can add them at different layers.
LoRA: Officially, it's like attaching a low-rank matrix to the weights, training only that small matrix, then merging it back. Just remember—it's like bolting on a little external module to the model and only training that module. Currently the industry's most universally recommended approach.
QLoRA: LoRA plus 4-bit quantization. A 7B model can run on a single 16GB GPU. That's what I used this time.

Each method has its sweet spot. But if you're doing just one project, go with QLoRA without thinking twice. As for why it's called a "low-rank matrix"—doesn't matter. You'll use it fine. Learn the theory later.

---

3. Okay, Here's the Hands-On: How I Got It Working Step by Step

Base model: Qwen2.5-7B. Goal: turn it into a medical Q&A assistant.

Dataset: medical consultation pairs I compiled myself. About 3,000 entries covering lab report interpretation, common disease inquiries, health guidance—all scraped from real-world scenarios.

Hardware environment: Rented a 4090 (24GB VRAM) on AutoDL. With QLoRA, 4-bit, plus gradient checkpointing, batch size could only be set to 1 (don't laugh—I had to do this to avoid blowing VRAM). Training time: about forty minutes.

Framework: LLaMA-Factory. Excellent Chinese support. Just set up the dataset JSON and config, run one command, and you're off. QLoRA configurations are built-in, no need to write distributed training yourself.

Here's the step-by-step:

1. Data preprocessing—converted Q&A pairs into Alpaca format and added a system prompt to set the model's behavior.

2. Training—llamafactory-cli train. Watching the loss drop from around 3.x in the first few epochs to about 0.2 was incredibly satisfying.

3. Merging—After training, generated the LoRA adapter. Used LLaMA-Factory's export to merge it back into the base model. Got a complete HuggingFace-format model folder.

4. Download to local—zipped it up and transferred to my Windows machine for quantization.

5. Quantization—Used llama.cpp's converthftogguf.py to convert to FP16 GGUF, then llama-quantize to compress to Q4K_M. The 7B model went from about 14GB down to 4.2GB. Loss in precision? Nearly imperceptible.

6. Ollama deployment—Wrote a Modelfile, filled in TEMPLATE and SYSTEM prompt. ollama create to import. I adapted it from the official ChatML template and specifically added "Don't call yourself Tongyi Qianwen."

Last command to run: ollama run qwen-med:7b

Done.

The biggest pitfall I hit: path issues. The config generated by LLaMA-Factory requires absolute paths for model files, otherwise llama.cpp throws all kinds of errors. You also need to install all environment variables one by one—sentencepiece, torch—every single one matters.

---

4. Does Fine-Tuning Actually Work? Let's Throw Down Some Real Tests

I don't buy into vague claims like "it feels smarter." Let's put the base model and the fine-tuned model head-to-head with two problems.

Test 1: Code capability

Ask: "Write a binary search in Python."

Model	Output

Original Qwen2.5-7B	`def binary_search(arr, target): for i in range(len(arr)): return -1` —That's linear search, not binary search.

Fine-tuned model	Standard two-pointer: `left, right`, `while left <= right`, `mid = (left+right)//2`, complete and correct.

如何系统入门大模型微调并进行相关的实践? (English)

如何系统入门大模型微调并进行相关的实践? (English)

From GPT to Ollama: I Spent a Week and Blew My VRAM Three Times Before I Could Tell You This

1. A Little Backstory: The GPT vs. BERT "Palace Drama"

2. The Fine-Tuning Methods Are Just a Few—Don't Let the Jargon Scare You

3. Okay, Here's the Hands-On: How I Got It Working Step by Step

4. Does Fine-Tuning Actually Work? Let's Throw Down Some Real Tests

Cael Lee

Ready to get started?