LLM Inference Optimization Techniques

Generated: 2026-06-20 19:01:40

---


Let me start with a true story.

Last November, I was hunched over my desk, staring at the slowly climbing green line on the GPU monitor, thinking: "Damn, if this thing is really this slow, I'll be the next one the boss lays off." Here's what I was looking at: LLaMA-7B generating a 200-character reply (call it roughly 200 tokens). The prefill phase took just 0.3 seconds, which is fine. But in token-by-token generation, each token needed 50 milliseconds. For 200 tokens, that's 200 × 50 ms = 10 seconds of waiting. Who'd want to use that?

So this article is what I dug out over months of failures, going through the logs line by line. If you find it useful, keep it. If you think I'm bragging, feel free to scroll on.

---

First, Let's Be Clear What Inference Actually Computes

A lot of articles start throwing formulas at you right away—it's headache-inducing. Actually, the core boils down to just two things: Prefill and Decode.

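Prefill is like ordering food. The whole input, say "How's the weather today?", is broken into tokens and processed in one go, and the GPU computes the intermediate states for all tokens in parallel. The work is big matrix multiplications, so the compute units stay busy. Feels great.

Decode? That's like eating the food, chewing one word at a time. Each generated token depends on the computation results of all previous tokens, so tokens can only be produced one after another. The matrix-matrix multiplications turn into matrix-vector multiplications: the full weights still have to be read from VRAM for every token, but they are used for only one token's worth of math. The step becomes limited by memory bandwidth, and compute utilization plummets.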

You see, that "clunky and slow" decoding phase is as painful as waiting for takeout delivery.
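To put a rough number on why decode starves the GPU, here is a back-of-the-envelope sketch that counts FLOPs per byte moved for a single weight matrix multiplication. The function name, the 4096 hidden size, and the FP16 (2 bytes per value) assumption are illustrative choices, not measurements from my setup.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_model) @ (d_model x d_model) matmul."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_model * d_model
    # The weight matrix is read once, no matter how many tokens share it.
    weight_bytes = d_model * d_model * bytes_per_value
    # Read the input activations and write the output activations.
    activation_bytes = 2 * n_tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Prefill: many tokens share one read of the weights.
prefill = arithmetic_intensity(n_tokens=2048, d_model=4096)
# Decode: a single token pays for the same read of the weights.
decode = arithmetic_intensity(n_tokens=1, d_model=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions, decode does about one FLOP per byte read, while prefill does about a thousand times more. That gap is why the same GPU looks busy during prefill and idle during decode.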

FlashAttention—One of the Smartest Optimizations I've Seen

The first time I read the FlashAttention paper, the title was as plain as boiled water. But after going through its tiled computation approach, I posted on my social feed that night: "This person is a genius."

What's the problem with traditional attention computation? Both the computation and the intermediate score matrix grow as O(n²): double the sequence length and you quadruple the work, and the n×n intermediate results keep getting written to VRAM and read back. It's fundamentally an I/O bottleneck.

What's FlashAttention's slick trick? It splits the attention computation into blocks. Each block is computed in the GPU's on-chip shared memory, and the partial results are merged with a running softmax, so the full score matrix never makes the round trip to VRAM. Based on my own test data: at sequence length 2048 (mostly in the prefill phase), switching to FlashAttention improved inference speed by 3.2×. Same GPU, same model, nothing else changed.
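The block-and-merge idea can be sketched in plain Python for a single query vector. This is only a toy illustration of the running-softmax merge, not the real CUDA kernel; the function names, block size, and random test data are all assumptions made for the example.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference version: build every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=4):
    """Walk over keys/values block by block, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]

ref = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=4)
print("max difference:", max(abs(a - b) for a, b in zip(ref, tiled)))
```

The tiled version never holds more than one block of scores at a time, yet it matches the reference up to floating-point rounding. On a GPU, that one block is what sits in shared memory.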

But there's a catch: FlashAttention can crash on GPUs with small memory. I tried it on an RTX 3060—if the block parameters aren’t tuned right, it’s straight to OOM. So don’t blindly chase the new hotness; first check if your hardware can handle it.

Speculative Decoding—Great, But Not a Silver Bullet

The first time I saw speculative decoding, my immediate thought was: Isn’t this cheating? Let a small model guess first, then the big model verifies.

I set up a test environment: the draft model was a small model distilled from the big model (roughly 1/5 the parameters), the target model was LLaMA-13B, and the test set was 500 Chinese Q&A pairs. The result? Time to first token didn't change, but overall throughput increased by 2.3×.

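It bets on an intuition: most of the time, the words the small model guesses will be accepted by the big model. The big model checks all the guessed tokens in a single forward pass. At the first wrong guess, that token and everything after it are thrown away, the big model's own token is used instead, and guessing restarts from there.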
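Here is a minimal sketch of the guess-then-verify loop in its greedy form. The two "models" are hypothetical toy functions (`target_next`, `draft_next`) that I made up so the example runs on its own; a real system would replace them with actual forward passes and verify all draft tokens in one batched pass.

```python
def target_next(ctx):
    """Hypothetical 'big model': a deterministic greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Hypothetical 'small model': agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def speculative_decode(prompt, n_new, k=5):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft model proposes k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) The target model checks every proposed position (one batched pass in practice).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and drop the rest of the draft.
                accepted.append(expected)
                break
        else:
            # All k guesses accepted: the same pass also yields one extra target token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + n_new], target_passes


def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


tokens, passes = speculative_decode([1, 2, 3], n_new=40, k=5)
print("same output as plain decoding:", tokens == greedy_decode([1, 2, 3], 40))
print("target passes used:", passes, "instead of 40")
```

The output is identical to running the big model alone; the only thing that changes is how many big-model passes it takes to get there.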

From my tuning experience: the optimal number of candidate tokens for the draft is 5 to 8. Fewer than 5 gives too little benefit; more than 8 leads to frequent rollbacks, making it slower.

But if you're working on vertical domains like legal documents or medical cases, where the knowledge gap between the small and big models is large, the acceptance rate for speculation might drop to as low as 40%. In that case, running the big model on its own is faster.

So my advice: profile first and measure the acceptance rate for your scenario. If it's below 70%, don't bother.
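For a quick sanity check before profiling, the standard back-of-the-envelope model from the speculative decoding literature can be coded in a few lines. It assumes every draft token is accepted independently with probability `alpha`, which real workloads only approximate, and that one draft step costs `draft_cost_ratio` of a target step. The 0.2 used below is an assumption loosely based on a draft model with 1/5 the parameters. Your measured acceptance rate may also be defined differently, so treat the numbers as a rough guide.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with k draft tokens and acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens gained per round divided by the cost of the round, in units of one target pass."""
    round_cost = k * draft_cost_ratio + 1
    return expected_tokens_per_pass(alpha, k) / round_cost


for alpha in (0.4, 0.7, 0.9):
    speedup = estimated_speedup(alpha, k=5, draft_cost_ratio=0.2)
    print(f"acceptance {alpha:.0%}: estimated speedup {speedup:.2f}x")
```

A value below 1 means speculation is a net loss. Where the break-even point lands depends on the draft length and how cheap your draft model really is, which is exactly why measuring it on your own data beats any rule of thumb.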

Decoding Parameters—Not as Mystical as You Think

Let me say something that might offend: some of today’s "parameter-tuning masters" are peddling mysticism that rivals feng shui.

Temperature, top-k, top-p: I'll explain these three parameters in a few sentences.

Temperature adjusts confidence. Low temperature means the model only dares to pick high-probability words, so answers are precise but may be stiff. High temperature lets the model's imagination run wild, but it also rambles.

Top-k and top-p limit the candidate pool. k=40 means picking from the 40 highest-probability words. p=0.9 means picking from the smallest set of top words whose cumulative probability reaches 90%.
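If that still sounds abstract, the three knobs fit in a few short functions. This is a minimal sketch on a made-up four-word vocabulary; the function names and the example logits are assumptions for illustration, not any library's API.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_k_filter(probs, k):
    """Keep only the k most probable token ids."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
cold = softmax_with_temperature(logits, 0.1)  # sharp: nearly all mass on the top token
warm = softmax_with_temperature(logits, 0.8)  # flatter: other tokens get a real chance
print("top-token probability, cold vs warm:", round(cold[0], 4), round(warm[0], 4))

rng = random.Random(0)
print("sampled token id:", sample(top_p_filter(warm, 0.9), rng))
```

Temperature reshapes the whole distribution, while top-k and top-p just cut off the tail before sampling.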

When I write code, I set temperature to 0.1; when I write stories, I set it to 0.8. Why? After hundreds of rounds of testing, this combo works best for me.

So don’t listen to those "temperature 0.7 is perfect" claims. Every scenario and every model is different. Just run A/B tests—that’s better than anything.

What Really Differentiates Mainstream Models

Someone in a group chat once asked: GPT-4 and LLaMA-3—what’s the fundamental difference under the hood?

I answered like this: The basic architecture is essentially the same—both are Transformer variants. The differences mainly lie in three areas:

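First, architectural details. GPT-4 is widely believed to use MoE (Mixture of Experts), while LLaMA is a dense model. MoE's advantage is that only a few experts run for each token, so for the same per-token compute the model can hold more parameters, and with them more knowledge.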

Second, training data. This is the core difference. What data was GPT-4 fed? Nobody knows. But what I do know is that the same inference optimization technique can yield up to a 50% difference in effectiveness across models trained on different data.

Third, alignment strategy. How well RLHF is done directly determines model performance. I’ve tested several open-source models—the base model capabilities were similar, but after alignment, their conversational abilities could be worlds apart.

Who’s better? It depends on the scenario. For code generation, I pick GPT-4; for Chinese creative writing, I think some domestic models are actually more useful.

Five Optimization Directions—One Article Can’t Cover Them All

Start from a simple intuition: Time ≈ Computation ÷ Compute Power. To shorten inference time, you either reduce computation, utilize compute power more fully, reduce data transfer overhead, make hardware do only the most efficient work, or change the algorithm itself.

Direction 1: Reduce Computation

Quantization, sparsification, and knowledge distillation all belong here. From the few competitions I've taken part in, quantization is the most direct approach. Going from FP16 to INT8 halves the VRAM taken by the weights, and on hardware that supports INT8 acceleration, speed can double. Many models today lose less than 1% accuracy after quantization.
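To make the FP16-to-INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization on a handful of made-up weights, plus the memory arithmetic. It is the simplest possible scheme, not GPTQ or AWQ, and the helper names and sample values are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in quantized]


def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("max round-trip error:", max_error)

# A 7B-parameter model: 2 bytes per weight in FP16, 1 byte in INT8.
print("FP16:", weight_memory_gb(7e9, 2), "GB  INT8:", weight_memory_gb(7e9, 1), "GB")
```

The round-trip error per weight is at most half the scale, so a single large outlier inflates the error for every other weight in the tensor. That is one reason the more careful methods exist.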

But here's the pitfall. When I first used PyTorch's quantization API, the model just collapsed, outputting gibberish. I only later understood that different quantization methods suit different scenarios: GPTQ is good for offline deployment, while AWQ takes the activation distribution into account when quantizing, which makes it friendlier for accuracy-sensitive scenarios.

Direction 2: Improve Compute Utilization

This direction has the most diverse techniques, from FlashAttention to various parallelism strategies.

I tested Tensor Parallelism (TP) and Pipeline Parallelism (PP) on 8×A100. TP improved inference speed by 6.5×, while PP only managed 4.2×. But TP requires high communication bandwidth; if you’re using PCIe instead of NVLink interconnect, the performance gain is much smaller.
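Why TP is so sensitive to interconnect bandwidth is easier to see in a toy version. The sketch below splits one weight matrix by columns across pretend "devices" and gathers the partial results; the function names and the tiny integer data are assumptions, and a real setup would run each shard on its own GPU with a collective operation for the gather.

```python
def matvec(x, w):
    """x: length-d vector, w: d x n matrix stored as a list of rows. Returns a length-n vector."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, n_shards):
    """Split the columns of w into n_shards equal slices (assumes an even split)."""
    step = len(w[0]) // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]


def tensor_parallel_matvec(x, w, n_shards):
    shards = split_columns(w, n_shards)
    # On real hardware each shard lives on its own GPU and these run at the same time.
    partials = [matvec(x, shard) for shard in shards]
    # Gathering the pieces is the communication step, repeated for every layer and token.
    return [v for part in partials for v in part]


x = [1, 2, 3]
w = [
    [1, 0, 2, 1, 0, 3, 1, 1],
    [0, 1, 1, 2, 1, 0, 2, 1],
    [2, 1, 0, 1, 3, 1, 0, 2],
]
print("single device:", matvec(x, w))
print("4-way split:  ", tensor_parallel_matvec(x, w, n_shards=4))
```

Each device does a quarter of the math, but the gather has to happen inside every layer. Over NVLink that cost is small; over PCIe it can eat most of the gain.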

So don’t blindly trust the speedup numbers from official docs—only real measurements tell the truth.

Direction 3: Memory Access Optimization

KV Cache is the heavyweight here. During the decode phase, every generated token needs the previous keys and values. Without caching, you

Cael Lee
