多模态模型分数高≠落地强，系统能力才是真门槛 (English)

Generated: 2026-06-24 16:10:37

---

Stop Staring at Multimodal Model Scores—I Learned That the Hard Way After Three Mistakes

Friend, have you ever been like this—

You see a new model drop, and your first thought is: What's its ranking? What's its benchmark score? Can it beat the last one?

Honestly, I used to do the same thing. Back in 2023, when someone asked me "which multimodal model is the best," I'd immediately throw a benchmark table at them and add, "See? This one has the highest score."

But if you're still picking models this way today, I have to say—you've been fooled by the scores.

Yes, you heard that right. People who are still comparing parameter sizes and benchmark scores have basically lost the plot on how the game has changed.

The competition in multimodal large models is no longer about "who has the higher score"—that's a student mentality. The real battlefield is system-level capability: Can you tie together "seeing images, understanding speech, writing code, using tools, and running an entire workflow" into a single cohesive thread?

I'm not making this up. If you look at recent survey papers, many of them point to the same conclusion: multimodal has split into three tracks—understanding models, generation models, and omni/agent systems. And the real key to winning is soldering these three tracks together.

Below, I'll tell you three stories based on the mistakes I made.

---

The First Mistake: Architectures That Sound Amazing But Fall Apart in Practice

A lot of people love to geek out over architectures when they talk about multimodal.

"Look at this model—it's natively multimodal, Early Fusion!"

"Gemini uses a unified token space!"

"DeepSeek Janus-Pro, the same Transformer backbone—isn't that incredible?"

…Incredible, and then what?

Let me tell you a true story.

Last month, I took a financial report PDF and tested several models side by side. It had complex tables, line charts, and a management discussion section.

Qwen2.5-VL did okay. It could extract numbers from the tables and roughly describe the chart trends. But then I asked a follow-up—

"Gross margin dropped two points, but R&D expenses went up. Can you analyze this in the context of the management discussion?"

It froze. It just listed the numbers side by side and stopped there. The management discussion clearly said "short-term cost increase due to new product line investment," but it couldn't connect the dots.

What about Gemini 2.5 on the same task?

It directly pulled in the context from the management discussion, wrote a Python script on the fly to calculate quarter-over-quarter changes, and then gave me a complete analysis: the gross margin drop was due to increased R&D spending, and the new product line hadn't yet achieved economies of scale.

See, this isn't about which model "understands images better." Both models could understand the charts and text. The difference is—which one can complete deeper reasoning within a single system and leverage tools to support that reasoning.

Gemini 2.5 did one thing right: understand multimodal input → call a code execution tool → integrate the results and output a complete answer. That's system-level capability. A strong single skill does not equal a strong closed-loop system. Many open-source models have been chasing benchmarks aggressively—their OCR accuracy and image understanding scores are decent when tested in isolation. But throw them into a real workflow—like asking them to automatically write SQL to query a database, generate a chart, then translate it into a report—and they start spinning in place.

You scored 99 on a single test, but you can't actually do the job. What's the use?

Before 2025, you could still get away with pretending this wasn't an issue. But now, the ceiling has been hit.

---

The Second Mistake: Audio Reasoning—Many Models Aren't Actually "Listening"

This one is particularly interesting.

Last year, I ran an audio Q&A test. There were plenty of products bragging about "audio understanding," so I took a recording of a meeting and asked:

"Based on the tone of the last few sentences from this project manager, does it sound a bit impatient?"

Guess what happened?

Several models directly returned a text transcript and then inferred the tone from the text.

Buddy, a transcript loses all the intonation, pauses, and emphasis. Judging tone from text is like guessing the weather with your eyes closed.

Later, I read that audio reasoning survey paper from CUHK, and one sentence hit me: Many so-called audio reasoning tasks can be solved correctly by models just using the text transcript or surface-level cues—they don't need to actually hear the sound.

In other words, you think the model is "listening," but it's actually just "reading the text and guessing with common sense."

Real audio reasoning must be anchored in continuous, granular acoustic evidence. For example, in a conversation, how long someone pauses between two utterances, or how much their pitch suddenly rises—all that information is lost in text.

Right now, not many models can pull this off. Qwen-Omni and Audio-Reasoner do a decent job on grounding, but in my own testing, once you throw in complex scenarios—like multiple people talking at once or background noise—their stability still lags behind closed-source systems.

The lesson here is simple: Don't just check whether a model supports an input format. Check whether the model actually touches the original modality's evidence during reasoning. Otherwise, it's a wolf in sheep's clothing—it's "pretending to listen."

---

The Third Mistake: Demos That Dazzle, Then Crash in Production

This last one hurt the most.

I've helped a few teams integrate multimodal LLMs into real business applications—medical report generation, e-commerce product page review, automated meeting minutes archiving.

These scenarios share a common pattern: User uploads an image or video → Model needs to understand the content → Model calls a backend API to perform an action → Output structured results.

Sounds simple, right?

In practice, everything falls apart.

Take e-commerce product page review. The model needs to understand the product in the image, the price tag, whether the text is compliant, and then write the result into the database.

We started with an open-source model. When we tested just the "extract content from image" step in isolation, the accuracy was above 90%. Looked great, right?

But as soon as we plugged it into the review workflow, all kinds of problems emerged:

— The model kept outputting a bunch of fluff text, causing JSON parsing failures.

— When calling the database API, the parameter format was often wrong. In short, it had no idea what the interface contract for "write to database" looked like.

— When handling repeated requests, there was no state tracking. It would review the same page twice and give inconsistent results.

Later, we switched to Claude's tool-calling approach, plus some prompt engineering, and these problems basically disappeared.

But this process made me realize one thing: The value of a multimodal LLM in a real business doesn't depend on what it can do on its own—it depends on whether it can smoothly coordinate with external tools, environments, and workflows.

The 2025 market report says China's multimodal LLM market is projected to grow from 9.09 billion RMB in 2023 to 66.23 billion RMB in 2028, with a CAGR of nearly 49%. Capital and industries are betting big.

But if you ask the teams that have already deployed these models, their biggest headache is often not that the model isn't capable enough—it's that the model is too hard to integrate with the system, too unstable, and too complex to maintain.

So here's my take: Right now, multimodal LLMs are like a super-intern with extremely broad knowledge. They seem to know a little about everything, but when you actually hand them a specific task, you have to hover over them the whole time to make sure it gets done.

What's the real valuable skill?—Can you get the model to run an entire workflow on its own?

---

Someone Might Ask: Aren't Open-Source Models Catching Up to Closed-Source Ones All the Time?

True.

Meta Llama 4 Scout has a 10-million-token context window. DeepSeek R1 pushed reasoning capabilities to new heights with reinforcement learning. The Qwen 3 series is also rewriting leaderboards.

But I still want to pour some cold water on

多模态模型分数高≠落地强，系统能力才是真门槛 (English)

多模态模型分数高≠落地强，系统能力才是真门槛 (English)

Stop Staring at Multimodal Model Scores—I Learned That the Hard Way After Three Mistakes

The First Mistake: Architectures That Sound Amazing But Fall Apart in Practice

The Second Mistake: Audio Reasoning—Many Models Aren't Actually "Listening"

The Third Mistake: Demos That Dazzle, Then Crash in Production

Someone Might Ask: Aren't Open-Source Models Catching Up to Closed-Source Ones All the Time?

Cael Lee

Ready to get started?