Large Models 40: Zhang Xiangyu Interview on the Struggles and Future of Multimodal Large Model Research

Generated: 2026-06-20 17:37:38

---

After Listening to Zhang Xiangyu for Two and a Half Hours, I Had to Pinch My Philtrum Three Times to Recover

Late at night, I put on my headphones, and the screen showed a conversation between Zhang Xiangyu, Zhang Xiaojun, and Li Guang.

I'd planned to listen casually as background noise. Result? I pinched my philtrum three times.

Not because it was boring—because the volume of information was so massive that my brain felt like an old computer freezing up, the fan spinning wildly while the screen hung.

Over the years, I've listened to hundreds of technical interviews with AI giants. But how many of them lay out their most embarrassing "failures" and "confusions" on the table and chat with friends like this? Guess.

Not a single one.

Do you know who he is? Co-author of ResNet, more than 370,000 paper citations, Chief Scientist at StepFun. Any one of these credentials would justify an aloof "I know everything" attitude.

But what did he say in this interview?

He said, "This one didn't work."

He said, "That one was underestimated."

He said, "I was once in despair."

At that point, I wanted to applaud him. This kind of honesty is rarer in today's AI circle than an A100 GPU.

Let me throw out a thought that'll make your brain crash: The biggest achievement in the multimodal field over all these years is not CLIP, not DALL·E, not Sora. It's this—we finally dare to admit it: we still haven't found the right answer.

Zhang Xiangyu's conversation, to put it bluntly, is a "Multimodal Pitfall Survival Guide" from the front lines. From it, I've dug out three "minefields" that made me slap my thigh in the middle of the night, and two "GPT-4 moments" he thinks are about to explode.

So? Read on.

---

Pitfall One: Everyone's Been Fooled by "Long Context"! Transformer Doesn't Even Understand What "Context" Means

"A lot of people say long context is important, mainly because it's important for business."

After hearing Zhang say that, I almost nodded my head off in front of the screen.

You see, I've worked on a few RAG projects before. Every time I confidently fed the model a pile of documents, what happened? The more documents, the crappier the output!

And this "crappiness" isn't just token overflow. Somewhere in the middle of a long text, the model would suddenly latch onto some completely irrelevant detail and, with extreme confidence, derail the entire conclusion.

At the time, I almost smashed my keyboard.

I tried three different embedding models. The problem didn't budge.

Then Zhang said one thing, and everything clicked: the Transformer has no ability to "compress."

Think about how the human brain works. When we see a pile of information, we secretly take notes, we highlight important parts, automatically identify "this part might be a big deal later, I need to remember it." That's compression. That's intelligence.

And Transformer?

It can stuff in ten million tokens, but it doesn't know what to extract from them. It's like a bookworm with an amazing memory but no ability to learn. You tell it to read a book, and it recites the whole thing back verbatim. But ask it what the main idea is, and it's clueless.

Even worse, as the window gets longer, attention becomes more scattered, and performance actually drops. Isn't that infuriating?
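Here's a toy sketch of why attention gets "scattered" as the window grows. It rests on a loud simplification that is mine, not Zhang's: one relevant token beats every distractor by a fixed score gap, and all distractors score the same. The function name and the numbers are made up for illustration.

```python
import math

def weight_on_relevant(n_tokens: int, score_gap: float) -> float:
    """Softmax weight on one relevant token among n_tokens.

    Assumption: the relevant token scores `score_gap` and every
    distractor scores 0, so softmax gives e^gap / (e^gap + n - 1).
    """
    top = math.exp(score_gap)
    return top / (top + (n_tokens - 1))

# Same score gap, longer and longer windows.
for n in (1_000, 100_000, 10_000_000):
    print(n, round(weight_on_relevant(n, score_gap=10.0), 4))
```

Under these assumptions the weight on the one token that matters keeps shrinking as the window gets longer, even though nothing about that token changed.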

In the interview, he also mentioned something that cracked me up. Everyone in the industry is chasing the "needle in a haystack" test, thinking it's amazing if a model can find a needle in a long article.

Zhang said: This is essentially lossless retrieval. But lossless retrieval and intelligence are two different things.

Whoa! Exactly that!

Intelligence requires trade-offs. Finding an exact text match like in database search is "lookup," not "thinking."

So how do we fix it?

The direction he gave is a hybrid architecture: a short-memory Transformer handles the most precise work, paired with a Linear Transformer with infinite context that acts as an index. The two work together like a pair of agents, simulating the brain's partitioned memory.

After I heard that, I thought: "Isn't this what I was going for when I hand-wrote prompts to make the model decompose tasks on its own, and never got it to work?"
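To make the "index" half of that idea less abstract, here's a minimal sketch of a linear-attention-style memory in plain Python. This is a generic textbook form, not the architecture Zhang described: the class name, the `elu(x) + 1` feature map, and the tiny dimensions are all assumptions for illustration. The point is only that the state has a fixed size no matter how many tokens are written into it.

```python
import math

def feature_map(x):
    """elu(x) + 1, elementwise: keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

class LinearAttentionMemory:
    """Hypothetical fixed-size summary of every (key, value) seen so far."""

    def __init__(self, d_key, d_value):
        self.state = [[0.0] * d_value for _ in range(d_key)]  # sum of phi(k) * v
        self.norm = [0.0] * d_key                             # sum of phi(k)

    def write(self, key, value):
        for i, f in enumerate(feature_map(key)):
            self.norm[i] += f
            for j, v in enumerate(value):
                self.state[i][j] += f * v

    def read(self, query):
        phi_q = feature_map(query)
        denom = sum(f * n for f, n in zip(phi_q, self.norm))
        d_value = len(self.state[0])
        if denom == 0.0:  # nothing written yet
            return [0.0] * d_value
        return [
            sum(f * row[j] for f, row in zip(phi_q, self.state)) / denom
            for j in range(d_value)
        ]

memory = LinearAttentionMemory(d_key=2, d_value=2)
for step in range(10_000):
    memory.write([math.sin(step), math.cos(step)], [float(step % 7), 1.0])
print(len(memory.state), len(memory.state[0]))  # state is still 2 x 2
```

Ten thousand writes later the memory is the same size as after the first one. That's lossy by construction, which is exactly the trade-off a lossless needle-in-a-haystack lookup never has to make.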

What's the difference?

Zhang said this can be trained end-to-end. You don't need complicated prompts; it works from a cold start.

Having said that, I honestly have doubts about whether this approach can scale. The collaborative training of multi-agent systems is still tricky—even single-agent stability isn't fully figured out yet.

But the direction is absolutely right: stop fixating on context length! First, figure out how to teach the model to say "no" and to "forget."

---

Pitfall Two: Bigger Model, Worse Reasoning? I'm Not Crazy—It's True!

Now for this next pitfall: if you're training large models, take notes.

Zhang discovered a bewildering phenomenon during training: as parameters grew from tens of billions to hundreds of billions to trillions, the model's conversational ability, emotional intelligence, and knowledge all skyrocketed like they were on steroids.

But reasoning ability—especially math—rose a bit, then flattened, and eventually started to decline.

My first reaction was: That can't be right, can it? Shouldn't scaling always make it stronger?

It's like building muscle. You keep adding weight to the bar; of course, your strength increases. How could it reverse?

Then I immediately remembered my own painful experiences with GPT-4 doing logic problems.

In early 2023, when GPT-4 first came out, it answered simple arithmetic perfectly. By 2024, on the same questions, it often just gave me an answer, skipped every intermediate step, and got it wildly wrong.

At the time, I thought the model got lazier after updates. I complained to friends about it.

But Zhang's explanation hit me like a bucket of ice water. He said: The core issue lies in the Next Token Prediction framework, which inherently prioritizes "compression rate" over "reasoning accuracy."

What does that mean?

He gave an analogy: During training, the model finds that directly outputting the answer ("42") achieves a higher compression rate and lower loss than writing out the derivation step by step ("3+7=10, 10×4+2=42").
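A toy calculation shows how a shortcut can win on loss. Everything numeric here is invented for illustration: the per-token probabilities and the token counts are assumptions, not measurements. The only real ingredient is that sequence loss is the sum of negative log-probabilities over tokens, so fewer tokens means fewer terms to pay for.

```python
import math

def sequence_loss(token_probs):
    """Total negative log-likelihood (in nats) of a token sequence."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities, chosen only for illustration.
direct_answer = [0.6]    # one token, "42", predicted at 60% confidence
derivation = [0.9] * 12  # twelve hypothetical tokens of "3+7=10, 10×4+2=42", each at 90%

print("direct answer:", round(sequence_loss(direct_answer), 3))
print("derivation:   ", round(sequence_loss(derivation), 3))
```

With these made-up numbers, the single shaky guess costs less total loss than twelve confident steps. Whether real training runs behave this way is Zhang's claim, not something this sketch proves.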

Think about it. The model is smart. It does the math: giving the answer directly is more efficient and, in most cases, more accurate too.

So what happens? When the model gets large enough to "memorize" many answer mappings, it starts shortcutting heavily. But what does shortcutting mean? It means cumulative error. If it misses one step, everything after is wrong.
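The "one wrong step ruins everything after it" part is easy to put numbers on. A minimal sketch, assuming each step is right with the same probability and steps fail independently (a simplification; the 95% figure is hypothetical):

```python
def chain_success(step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct,
    assuming steps fail independently (a simplification)."""
    return step_accuracy ** n_steps

# A hypothetical model that gets each individual step right 95% of the time.
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under that assumption, a 95% per-step model gets a ten-step chain fully right only about 60% of the time, which is why each unreliable jump matters so much.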

This is the core issue: our beloved autoregressive architecture is secretly rewarding "laziness"!

The idea of o1 is to forcibly break this lazy behavior.

As Zhang describes it, rule-based RL pushes the model to generate a Meta-CoT: a chain of thought that can backtrack, retry, and pick a different branch at critical points, turning the reasoning process from a single straight line into a graph-like structure.
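What might a "rule" in rule-based RL look like? o1's actual training setup isn't public, so treat this as a generic sketch of an outcome-checking reward, not the real thing. The `Answer:` output convention and the function name are assumptions made up for this example.

```python
import re

# Hypothetical convention: the response must end with "Answer: <number>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")

def rule_based_reward(response: str, expected: float) -> float:
    """Return 1.0 if the final 'Answer: <number>' matches expected, else 0.0."""
    match = ANSWER_PATTERN.search(response.strip())
    if match is None:
        return 0.0  # no parseable final answer, no reward
    return 1.0 if abs(float(match.group(1)) - expected) < 1e-9 else 0.0

print(rule_based_reward("3+7=10\n10*4+2=42\nAnswer: 42", 42))
print(rule_based_reward("Answer: 41", 42))
```

The reward only checks the final result with a fixed rule; it says nothing about how the model got there, which leaves room for backtracking and retries along the way.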

Zhang put it even more bluntly: "Suppress the urge to shortcut; reinforce stable thinking paths."

Later, I tried a similar approach in my own project: I enforced a "show thinking process first, then give answer" rule for math problems.

The effect was immediate.

On some tasks, a 7B model made to show its work outperformed a 13B model that answered directly. This isn't magic. It's using algorithms to fight a fundamental flaw in the architecture!
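For anyone who wants to try the "show thinking process first, then give answer" rule, here's one way it could be wired up. This is a hypothetical sketch, not the setup described above: the prompt wording, the `Reasoning:` / `Answer:` markers, and both function names are assumptions.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in a hypothetical 'reason first, answer last' instruction."""
    return (
        "Solve the problem. Write your reasoning step by step under 'Reasoning:', "
        "then give the result on a final line starting with 'Answer:'.\n\n"
        f"Problem: {question}\n"
    )

def follows_format(response: str) -> bool:
    """True if non-empty reasoning appears before a final, non-empty 'Answer:' line."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines or not lines[-1].startswith("Answer:"):
        return False
    has_answer = bool(lines[-1][len("Answer:"):].strip())
    reasoning = [line for line in lines[:-1] if line != "Reasoning:"]
    return has_answer and len(reasoning) > 0

print(follows_format("Reasoning:\n3+7=10\n10*4+2=42\nAnswer: 42"))
print(follows_format("Answer: 42"))
```

A response that jumps straight to the answer fails the check, so it can be rejected or retried before it ever reaches the user.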

Some might say: why not just swap the architecture? Switch to RNN or state space models.

Zhang reminded us: Architecture serves algorithm, and algorithm serves system.

Because training has to run in parallel, an RNN's recurrence has to be written in a separable form, and that requirement is what led to the Linear Transformer. So instead of expecting a new architecture to save us overnight, look for algorithmic breakthroughs within the existing framework.
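Here's what "separable" buys you, as a small sketch in plain Python. It uses a generic kernelized attention with scalar values; the `elu(x) + 1` feature map and the function names are assumptions for illustration, not details from the interview. Because the attention weight factors into `phi(q) · phi(k)`, the same outputs can be computed position by position from the raw inputs (parallel-friendly, for training) or from a running state (RNN-style, for inference).

```python
import math

def phi(x):
    """Positive feature map, elu(x) + 1 elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def parallel_form(queries, keys, values):
    """Causal attention with weights phi(q_t) . phi(k_s) for s <= t.
    Each position depends only on the inputs, not on earlier outputs."""
    outputs = []
    for t, q in enumerate(queries):
        weights = [dot(phi(q), phi(keys[s])) for s in range(t + 1)]
        outputs.append(sum(w * values[s] for s, w in enumerate(weights)) / sum(weights))
    return outputs

def recurrent_form(queries, keys, values):
    """Same outputs, computed from a fixed-size running state."""
    state = [0.0] * len(keys[0])  # running sum of phi(k_s) * v_s
    norm = [0.0] * len(keys[0])   # running sum of phi(k_s)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        state = [s + f * v for s, f in zip(state, fk)]
        norm = [n + f for n, f in zip(norm, fk)]
        outputs.append(dot(phi(q), state) / dot(phi(q), norm))
    return outputs

qs = [[0.1, -0.4], [1.2, 0.3], [-0.7, 0.9]]
ks = [[0.5, 0.2], [-1.0, 0.8], [0.3, -0.6]]
vs = [1.0, -2.0, 3.5]
print(parallel_form(qs, ks, vs))
print(recurrent_form(qs, ks, vs))
```

The two printouts should agree up to floating-point rounding. Standard softmax attention doesn't factor this way, which is why it can't be collapsed into a fixed-size recurrent state.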

See? o1 has already proven this path works.

---

Cael Lee
