Technical Principles of Mainstream Large Language Models: From Pre-training to Fine-tuning

Generated: 2026-06-20 19:53:08

---


I Tested the Waters of Large Model Training So You Don’t Have To—and My Feet Are Numb

Last month, a friend dragged me in to put out a fire.

He was fine-tuning an industry-specific LLaMA-2-13B model. It had been running for a whole week, and the loss was as still as a corpse. Guess what the culprit turned out to be? The tokenizer was shredding domain-specific terms into meaningless fragments—the model couldn't understand a thing!

At that moment, I just burst out laughing.

Not because it was funny, but because it was all too familiar. That scene looked exactly like my own mess when I first jumped into large model training. Back then, I naively thought that if you just shovel in data by the truckload, the model would automatically grow wisdom.

Naive.

In the past two years, the pits I've fallen into outnumber the formulas in most tutorials. So today, I'm going to lay it all out—from pre-training to fine-tuning, just how many demons and monsters are hiding in there.

---

Act One: The Bones of the Model

A lot of people ask me, what's the real difference between models like LLaMA, ChatGLM, and Falcon?

On the surface, aren't they all transformers? But deep down, the choices they make are so different you won't believe it.

Let's start with the tokenizer.

I'd bet this is the most underrated component of the whole stack. LLaMA uses BPE with a 32k vocabulary. Looks reasonable at first glance, but when you actually use it on Chinese, it's a disaster!

The same sentence: LLaMA might chop it into seven or eight tokens, while ChatGLM only needs three or four. I tested it myself, and ChatGLM-6B's tokenizer absolutely crushes it for Chinese. Its vocabulary is several times larger (roughly 130k entries) and was built with Chinese text in mind, so common words come out as a single token instead of being broken into characters or raw bytes.

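By now you should get it: token count feeds straight into generation speed. Every token is one decoding step, so fewer tokens for the same sentence means faster output and more text fitting into the context window. It's that simple and brutal.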
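To make the vocabulary effect concrete, here's a toy sketch. Both vocabularies are made up for illustration (they are not the real LLaMA or ChatGLM vocabularies), and greedy longest-match is a simplification of what BPE actually does. Anything missing from the vocabulary falls back to one token per UTF-8 byte, which is roughly how byte-fallback tokenizers treat unknown characters.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer with UTF-8 byte fallback (toy, not real BPE)."""
    tokens = []
    i = 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Unknown character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical vocabularies, for illustration only.
small_vocab = {"模", "型"}
large_vocab = {"大语言", "模型", "微调"}

sentence = "大语言模型微调"
print(len(tokenize(sentence, small_vocab)))  # 17
print(len(tokenize(sentence, large_vocab)))  # 3
```

The numbers are exaggerated on purpose, but the mechanism is the same one that hurts in practice: a domain term the vocabulary has never seen gets shredded into pieces that carry no meaning on their own.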

Next up, positional encoding.

I used to think all models used absolute positional encoding. Then I found out I was a frog at the bottom of a well. LLaMA uses Rotary Position Embedding (RoPE), and so do the main Falcon releases; ALiBi is what BLOOM and MPT went with. ChatGLM-6B started with a two-dimensional variant of RoPE, and later versions moved to the standard form. Each approach has its pros and cons.

What's good about RoPE? It encodes position by rotating the query and key vectors, so attention scores depend on the relative distance between tokens rather than on their absolute positions. I ran experiments: when processing texts longer than 8K tokens, models with RoPE remembered about 15% more key information than those with absolute positional encoding. The trade-off is slightly higher computational cost and extra overhead during inference.

You decide whether that's worth it.
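The relative-position property is easy to check numerically. Below is a minimal pure-Python sketch of RoPE, assuming the interleaved-pair layout and a base of 10000; real implementations may lay out the pairs differently, and the vectors `q` and `k` are made-up values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors (head dimension 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True: the score depends only on the offset
```

Shifting both positions by 100 leaves the attention score unchanged, which is exactly what "understanding relative positions" means here.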

I've stepped right into the Layer Normalization trap.

LLaMA uses RMS Norm, which skips the mean-centering step and only rescales by the root mean square. At first I thought it was no big deal. It's just normalization, how different could it be?

Then one time, out of curiosity, I changed LLaMA's code to standard Layer Norm, and training immediately collapsed. At the time I thought I'd written something wrong; it took me three days to find the problem. Later I crunched the numbers: using RMS Norm, convergence was noticeably more stable in the second half of training, with much smaller loss fluctuations. Don't underestimate that—in deep networks, it's the difference between life and death.
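For reference, the two normalizations differ by one step. This is a minimal sketch with the learned scale and bias left out, and the input vector is made up:

```python
import math

def layer_norm(x, eps=1e-6):
    """Standard LayerNorm (no learned scale/bias): center, then divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm (no learned scale): no centering, divide by root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [2.0, 4.0, 6.0, 8.0]  # made-up activations
print(layer_norm(x))  # zero-mean output
print(rms_norm(x))    # all values stay positive, mean is not removed
```

The two functions produce different activations for the same input, so weights trained under one are not interchangeable with the other. That is one plausible reason a swap on an existing checkpoint goes badly, though it may not be the whole story.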

The activation function is also interesting.

LLaMA uses SwiGLU; ChatGLM adopted the same gated structure from ChatGLM2 onward. Gating mechanisms do allow finer control of information flow, but they also increase computation. I tested swapping LLaMA's SwiGLU for plain ReLU at the same hidden width. The gate projection disappears, so the feed-forward parameter count dropped by about a third. Sounds great, right?

But downstream task performance dropped by an average of 8%!

Think about it: you save parameters but lose effectiveness—who'd call that a good deal?
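Where the saving comes from is plain arithmetic: a gated feed-forward block has three weight matrices (gate, up, down), a plain one has two. A quick sketch, assuming LLaMA-7B-like sizes (`d_model=4096`, `d_hidden=11008`) and ignoring biases:

```python
def ffn_params(d_model, d_hidden, gated):
    """Weight count of one feed-forward block, ignoring biases."""
    n_matrices = 3 if gated else 2  # gated: gate, up, down; plain: up, down
    return n_matrices * d_model * d_hidden

# Assumed LLaMA-7B-like sizes.
swiglu = ffn_params(4096, 11008, gated=True)
relu = ffn_params(4096, 11008, gated=False)

print(swiglu)             # 135266304
print(relu)               # 90177536
print(1 - relu / swiglu)  # one third of the feed-forward weights are gone
```

Note this counts the feed-forward block only. Attention weights are untouched, so the drop across the whole model is smaller than a third.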

---

Act Two: Distributed Training—A Rich Man's Game, a Poor Man's Struggle

I'm just a small-time operator; at most I have 8 A100s.

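But how do you think those hundred-billion-parameter models get trained? They need hundreds or thousands of cards. Even LLaMA-65B takes about 130GB of VRAM for the fp16 weights alone, before you count gradients, optimizer state, or activations. You can't fit that on a single card, not even an 80GB A100.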

That's why you need distributed training.
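The 130GB figure is just parameter count times two bytes. Training needs far more. The sketch below uses the commonly quoted estimate of about 16 bytes per parameter for mixed-precision Adam; that figure is an assumption here, and activations are left out entirely.

```python
def weights_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def adam_training_gb(n_params):
    """Rough mixed-precision Adam footprint (assumed 16 bytes per parameter):
    fp16 weights and gradients (2 + 2 bytes) plus fp32 master weights,
    momentum and variance (4 + 4 + 4 bytes). Activations not included."""
    return n_params * 16 / 1e9

print(weights_gb(65e9))        # 130.0
print(adam_training_gb(65e9))  # 1040.0
```

Under that assumption a 65B model needs on the order of a terabyte of state before a single activation is stored, which is more than thirteen 80GB cards just to hold it.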

Data parallelism is the most basic approach.

I naively thought, just split the data across multiple cards and train them simultaneously, right? Then I ran it and found out that each card has to keep a full copy of the model, so no VRAM is saved at all, and on top of that you pay for synchronizing gradients across cards on every step.

Some people like 2D parallelism (data + model parallelism), but the communication overhead will drop your jaw. Even plain data parallelism doesn't scale perfectly. In my tests, a single machine with 8 cards tops out around 7.8x in the best case, and once communication gets in the way it's usually only about 6x.

6x? Not even close to what you'd imagine, right?
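Here's what data parallelism actually computes, as a toy sketch with a one-parameter model and made-up data. Each hypothetical "card" holds the same weight and half the batch; averaging the local gradients (the job of the all-reduce) gives the same gradient as one card processing the whole batch.

```python
def grad(w, batch):
    """Gradient of mean squared error for y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # made-up samples
w = 0.5

# Two hypothetical cards, each with a full copy of w and half of the batch.
shards = [data[:2], data[2:]]
local_grads = [grad(w, shard) for shard in shards]

# All-reduce (average): every card ends up with this same value.
synced = sum(local_grads) / len(local_grads)
print(abs(synced - grad(w, data)) < 1e-12)  # True

def scaling_efficiency(speedup, n_cards):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / n_cards

print(scaling_efficiency(6.0, 8))  # 0.75
```

The math is free; the synchronization is not. A 6x speedup on 8 cards means a quarter of the hardware is effectively spent waiting.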

Tensor model parallelism (TP) is a different beast.

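It splits the weight matrices inside each layer across multiple cards, so every card computes a slice of every layer. The downside is heavy communication within each layer, because the partial results have to be combined across cards in both the forward and backward pass. I've seen engineering practices that recommend keeping TP within 8 cards, which in practice means inside one machine with a fast interconnect; beyond that, communication becomes the bottleneck.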

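NVIDIA's Megatron-LM framework handles this pretty well. It splits the first matrix multiplication of a block by columns and the second by rows, so each card can run both multiplications on its own slice and the block needs only one all-reduce in the forward pass. But implementing it is a pain.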
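The column-then-row trick can be checked on paper-sized matrices. This is a pure-Python sketch with made-up weights, two hypothetical cards, and ReLU standing in for the real activation; it is not Megatron-LM code, only the idea behind it.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

x = [[1.0, -2.0]]                                       # one token, d_model = 2
a = [[1.0, 0.5, -1.0, 2.0], [0.0, 1.0, 1.0, -1.0]]      # up projection, 2 x 4
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]]   # down projection, 4 x 2

# Single-card reference: relu(x @ a) @ b
reference = matmul(relu(matmul(x, a)), b)

# Card 0 gets columns 0-1 of a and rows 0-1 of b; card 1 gets the rest.
partials = []
for cols in (slice(0, 2), slice(2, 4)):
    a_shard = [row[cols] for row in a]
    b_shard = b[cols]
    partials.append(matmul(relu(matmul(x, a_shard)), b_shard))

# The single all-reduce: sum the partial outputs element-wise.
combined = [[p + q for p, q in zip(r0, r1)] for r0, r1 in zip(*partials)]
print(combined == reference)  # True
```

It works because the activation is element-wise: each card can apply it to its own columns without seeing the others, and nothing has to be exchanged until the final sum.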

Pipeline parallelism (PP) is my personal favorite.

You split the model by layers, each card handles a few layers, and data flows through like an assembly line. Sounds great, right?

But there's a classic problem called "bubbles". At the start of each batch the later cards sit idle waiting for the first ones to hand something over, and at the end the early cards sit idle while the last ones finish. I did the math: with PP set to 4 stages, bubble waste accounts for about 30% of total training time.

30%! Just thrown away!
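The standard estimate for a GPipe-style schedule is that the idle share equals (p - 1) / (m + p - 1), where p is the number of stages and m the number of micro-batches per batch. The sketch below assumes that schedule and equal time per stage. The micro-batch count behind the 30% figure isn't stated above; 7 is simply the value for which this formula gives 30% at 4 stages.

```python
def bubble_fraction(stages, micro_batches):
    """Idle share of a GPipe-style pipeline schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 7, 32):
    print(m, round(bubble_fraction(4, m), 3))
# 1 0.75
# 7 0.3
# 32 0.086
```

The lever is the micro-batch count: slicing each batch finer shrinks the bubble, at the cost of smaller, less efficient kernels per step.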

Cael Lee