An Illustrated Guide to Large Model Reasoning: Understanding the Reasoning Process and Test-Time Compute
Generated: 2026-06-23 01:28:43
---
Let a Large Model "Think" for a Moment and Results Take Off? After Half a Year of Stumbling, I Finally Got It!
Have you ever seen that scene?
Last autumn, someone in a tech group I belong to posted a math problem, the Olympiad-level kind. Just reading it gave me a headache. Then they gave it to GPT-4o, and it shot back an answer in seconds.
Wrong.
Then they threw it at o1, which churned out hundreds of tokens, muttering to itself like a kid working through scratch paper. Finally, it gave an answer.
Right.
The group went wild. Some said, “That’s just good prompt writing!” Others said, “Hallucination, pure luck.”
Back then, I thought the same: AI is just a fancy search engine plus a probability generator. You ask, it answers—accuracy is all luck. Until I started tuning models myself and got absolutely wrecked by them on a real project. That’s when I understood—
It’s not that simple.
I’ve been writing this account for ten years, and this is the first time I feel like large models can truly drive you “crazy.”
---
That Night, I Couldn’t Sleep
Here’s what happened.
Last year, I took on a customer service project where I wanted to use a large model to answer user questions directly. I figured, this is just a fill-in-the-blank, right? User asks, “Why hasn’t my order shipped yet?” Model outputs, “Dear customer, please wait while I check for you”—perfect.
Result?
Then real users came in and asked, “I bought something last week. The tracking says delivered, but I didn’t get it. What now?” The model responded, “Your order has been delivered. If you have questions, please contact the courier.”
Think about it. That’s like saying nothing at all. The user was furious, the customer service manager was furious, and they called to yell at me: “Did you hire an intern to write the code?”
I couldn’t sleep that night. I tossed and turned, wondering: where did it go wrong?
Later, I tried one trick: I added a line at the end of the prompt saying, “Please analyze the user’s needs step by step before giving the answer.” Miraculously, accuracy jumped from under 60% to nearly 80%.
Back then, I thought I’d struck gold. Later I learned that’s what they call Chain-of-Thought (CoT)—already a basic move in the field.
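Here is a minimal sketch of that prompt change. The instruction wording comes from the line I added. The `build_prompt` helper, the system line, and the prompt layout are made up for illustration, not the code from the project.

```python
# Instruction wording taken from the text above.
COT_INSTRUCTION = "Please analyze the user's needs step by step before giving the answer."


def build_prompt(user_message: str, use_cot: bool = True) -> str:
    """Assemble a customer-service prompt (hypothetical helper and layout)."""
    parts = [
        "You are a customer service assistant.",  # assumed system line
        f"User: {user_message}",
    ]
    if use_cot:
        # Appending the instruction at the end is the only change.
        parts.append(COT_INSTRUCTION)
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("The tracking says delivered, but I didn't get it. What now?"))
```

Comparing outputs with `use_cot=True` and `use_cot=False` on the same set of questions is the simplest way to measure whether the extra line helps on your own task.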
But guess what?
That was just the appetizer.
---
You Think AI Is “Thinking”? It’s Just “Guessing”
Speaking of which, I have to tell you a counterintuitive fact:
Ordinary large models don’t “reason” at all.
They only do one thing—predict the next word based on the input you gave. You say, “The weather is so nice today, let’s go,” and it spits out “play,” “eat,” “take a walk,” just based on probability.
This is the fruit of “train-time compute”: during training, the model sees massive amounts of data and absorbs countless “after A comes B” patterns. When you ask a question, it does not look anything up; it continues your input with whatever those learned patterns make most likely.
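The next-word step can be sketched with a toy distribution. The probabilities below are invented for illustration; a real model computes a distribution over its whole vocabulary at every step.

```python
import random

# Made-up probabilities for the word after
# "The weather is so nice today, let's go".
NEXT_WORD_PROBS = {"play": 0.5, "eat": 0.3, "take a walk": 0.2}


def greedy_next(probs: dict) -> str:
    """Always pick the single most probable next word."""
    return max(probs, key=probs.get)


def sample_next(probs: dict, rng: random.Random) -> str:
    """Draw the next word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy:", greedy_next(NEXT_WORD_PROBS))
    print("sampled:", [sample_next(NEXT_WORD_PROBS, rng) for _ in range(5)])
```

Sampling is why asking the same question several times can give different answers, which the voting method later in this article relies on.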
But here’s the problem.
Ask an ordinary model to solve a math problem, and it might directly output “42,” because in the training data, the answer to many Q&A pairs is “42.” But it has no idea where that “42” came from.
It learns “what to answer,” not “how to answer.”
That’s the fundamental difference between ordinary models and reasoning models.
Think about it—if your teacher only made you memorize answers without showing your steps, would you pass the exam?
---
One Path: Spend Billions on Training; Another: Let the Model “Think Before Speaking”
Until the first half of 2024, there were three mainstream ways to improve model performance: more parameters, more data, and more compute. GPT-4 and Llama 3 both followed this path. This is “train-time compute”: all the money goes into training, and once training ends, the model’s weights are fixed.
But there’s a trap here:
Diminishing returns.
I’ve felt it myself. When fine-tuning small models, going from 7B to 13B parameters gave a big boost—I was happy for three days. But going from 13B up? I rented eight A100s for a week. Guess how much the metric improved?
Less than 3 percentage points.
For that 3%, I spent tens of thousands of yuan. It felt like running a marathon: the first 10K, you get faster and faster; then suddenly, every step forward feels like you’re dragging lead weights.
So when o1 came out, the whole field asked the same question:
Can we make the same model smarter without spending billions on training, just by spending more computation during inference?
That’s the origin of “test-time compute scaling.”
Simply put: don’t change the model weights; instead, let the model “think” a bit longer when generating an answer—internally create more candidate reasoning paths, then select the best one through voting, verification, self-correction, etc.
Think about it: if you ask someone “What’s 2+2?” and they immediately say “4”—fine. But if you ask them to verify from ten different angles, and eight times they get 4, two times they get 5—wouldn’t you be more confident that the answer is 4?
That’s the value of “thinking more.”
---
Three Approaches—I’ve Tried Them All, Pitfalls and Gold
Let me break down the methods I’ve stumbled through and tested.
1. Majority Voting—Crudest, but Most Stable
This method is stupidly simple: ask the model the same question N times, count how many times each answer appears, and output the most frequent one.
I tested this on a math competition problem set with Qwen2.5-7B:
- Single generation: accuracy around 20%
- Majority voting with 16 samples: accuracy jumped to 40%
Doubled!
But at what cost? N times the computation. To get 16 answers, you pay 16 times the inference cost.
Pitfall I fell into: If the model itself is terrible, voting won’t save it. Wrong answers might still be the majority. It’s like asking a group of people who never studied math to vote on what 1+1 equals—they’ll likely say “3.”
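Majority voting fits in a few lines. This is a sketch, not the script I ran: `generate` stands in for whatever function calls your model and returns a final answer string, and the stub at the bottom replays the "eight 4s, two 5s" example from earlier.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(generate: Callable[[str], str], question: str, n: int = 16) -> str:
    """Ask the same question n times, then vote. Costs n model calls."""
    answers = [generate(question).strip() for _ in range(n)]
    return majority_vote(answers)


if __name__ == "__main__":
    # Stub generator (hypothetical): eight runs say "4", two runs say "5".
    canned = iter(["4"] * 8 + ["5"] * 2)
    print(sample_and_vote(lambda q: next(canned), "What's 2+2?", n=10))
```

Note that the vote only counts. If most of the sampled answers are wrong in the same way, `majority_vote` returns that wrong answer with full confidence.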
2. Best-of-N + Verifier—Smarter, But Higher Bar
The problem with majority voting is that it has no quality judgment—it just counts. So people introduced a “verifier” (or reward model) to score the N candidate answers and output the highest-scoring one.
I tried this on a code generation task: used DeepSeek-V2 to generate 10 code snippets, then used a small code-checking model to score them, and picked the highest score. It outperformed majority voting by about 10 points.
But here’s the catch—you need a good verifier first.
At first, I was lazy and used simple rule-based verification. The high-scoring code turned out to be all comments and no runnable logic.
Lesson: The verifier must be tailored to the task, or at least domain-specific. Otherwise, it’s garbage in, garbage out.
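Here is a small sketch of Best-of-N and of how a lazy verifier gets gamed. Everything in it is illustrative: the two scoring functions, the `add` task, and the candidate snippets are made up, and a real verifier would usually be a trained model or a proper test suite.

```python
from typing import Callable, List, Tuple


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> Tuple[str, float]:
    """Score every candidate and return the highest-scoring one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def comment_count_score(code: str) -> float:
    """Naive rule: more comment lines means a higher score. Easy to game."""
    return float(sum(line.strip().startswith("#") for line in code.splitlines()))


def unit_test_score(code: str) -> float:
    """Task-specific check: does the snippet define add() and pass one test?

    exec() on model output is unsafe outside a sandbox; this is a demo only.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return 1.0 if namespace["add"](2, 3) == 5 else 0.0
    except Exception:
        return 0.0


CANDIDATES = [
    "# adds two numbers\n# very carefully\n# trust me\n",  # comments only
    "def add(a, b):\n    return a + b\n",                  # runnable logic
]

if __name__ == "__main__":
    print(best_of_n(CANDIDATES, comment_count_score)[0])  # picks the comments
    print(best_of_n(CANDIDATES, unit_test_score)[0])      # picks the real code
```

The selection logic is identical in both calls. Only the verifier changes, and it decides whether the winner is useful.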
3. Self-Correction—Sounds Beautiful, Easy to Crash
This one is a bit more advanced. The model first generates an initial answer, then reflects on its own steps and revises.
Sounds great, right?
But in practice, it’s prone to crashing. I tried asking the model to “check its reasoning,” and it often changed correct steps to wrong ones, falling into overthinking.
Later, I read the DeepSeek R1 paper, where the authors describe trying a Process Reward Model (PRM) and Monte Carlo Tree Search (MCTS), and ultimately list both as “unsuccessful attempts.”
Exactly the pitfalls I hit.
The key problem: there’s no reliable external signal telling the model “yes, this correction is right” or “no, you made it worse.” Self-correction easily turns into self-hallucination.
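To make the missing piece concrete, here is a sketch of what self-correction looks like when an external signal does exist. It assumes a trustworthy `check` function, which is exactly the hard part. `revise` stands in for a model call that rewrites an answer, and both names are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def guarded_revision(
    answer: T,
    revise: Callable[[T], T],
    check: Callable[[T], float],
    max_rounds: int = 3,
) -> T:
    """Keep a revision only if the external check scores it strictly higher."""
    best, best_score = answer, check(answer)
    for _ in range(max_rounds):
        candidate = revise(best)
        candidate_score = check(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
        # Otherwise the revision is discarded and the previous answer stands.
    return best


if __name__ == "__main__":
    # Toy setup: the "model" always adds 1; the checker knows the target is 4.
    def revise(x: int) -> int:
        return x + 1

    def check(x: int) -> float:
        return -abs(x - 4)

    print(guarded_revision(2, revise, check))  # improves 2 -> 3 -> 4, rejects 5
    print(guarded_revision(4, revise, check))  # already right, every revision rejected
```

Without `check`, the loop would accept every revision, and an answer that started out correct would get "corrected" into a wrong one.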
---
Big News: Someone Is Quietly Tweaking Model Parameters During Inference
Speaking of which, let me tell you about a paper from April
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.