An Introduction to Mainstream Multimodal Large Model Architectures: From LLaVA to Qwen3-VL, Deconstructing the Evolution of Multimodal Large Models

Generated: 2026-06-20 04:14:06

Do you believe that in the AI world, vision and language "got married before they fell in love"?

Back in 2022, DeepMind released Flamingo. How do you combine vision and language? They came up with a brute-force method: hang an "external antenna" on a frozen language model, in the form of gated cross-attention layers slotted between its existing layers, so that it periodically picks up signals from the visual side. This was "shallow fusion": like hiring an interpreter at home so Chinese and English can shout across the room.

Everyone thought this counted as integration.

Not really.

Then in 2023, LLaVA came out. Even simpler: just wire them up directly. A single linear projection layer plugged the output of the CLIP image encoder straight into the input of a LLaMA-based language model. And guess what: as long as the semantic spaces aligned, even the simplest linkup could make a large model "look at pictures and talk." But honestly, it was still two separate skins stitched together. Vision and language, each living its own life.
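To make the "single linear projection" concrete, here is a minimal sketch in plain Python. The sizes (4 patches, 8-dimensional vision features, 16-dimensional language embeddings), the random weights, and the helper names `make_projection` and `project` are all assumptions for illustration; a real projector is trained and far larger.

```python
import random

def make_projection(d_vision, d_model, seed=0):
    """Random weight matrix (d_vision x d_model) and zero bias; stands in for a trained projector."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_vision)]
    bias = [0.0] * d_model
    return weight, bias

def project(patch_features, weight, bias):
    """Map each vision feature vector into the language model's embedding space: x @ W + b."""
    d_model = len(bias)
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(d_model)]
        for x in patch_features
    ]

# Toy inputs (hypothetical): 4 image patches from the vision encoder, 3 text tokens.
rng = random.Random(1)
patches = [[rng.random() for _ in range(8)] for _ in range(4)]
text_embeddings = [[0.0] * 16 for _ in range(3)]  # placeholder text token embeddings

W, b = make_projection(8, 16)
visual_tokens = project(patches, W, b)

# The projected patches are simply placed in the input sequence next to the text tokens.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is how little machinery is involved: one matrix multiply turns image patches into things the language model treats as ordinary tokens.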

That same year, the first Qwen-VL followed the same path. Visual information was treated like "foreign loanwords" crammed into the input sequence. The gap was never truly bridged.

The turning point came in 2024. All of a sudden, Qwen2-VL had a "change of heart."

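It got rid of that adapter wire and switched to a more tightly coupled ViT+Merger architecture: a vision transformer that takes images at their own resolution, followed by a small merger network that fuses neighboring patch features into a single token. (SigLIP-2, the encoder family trained with a sigmoid loss, was only released in 2025; it is the encoder Qwen3-VL builds on, not Qwen2-VL.) More importantly, it could now handle videos more than twenty minutes long. Vision and language had finally started "living together."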
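What the Merger buys can be seen with simple token counting. The sketch below assumes 14-pixel patches and a merger that fuses each 2x2 group of neighboring patches into one token, which are the figures reported for Qwen2-VL; the function name `visual_token_count` is made up for this example, and special boundary tokens are ignored.

```python
def visual_token_count(height, width, patch_size=14, merge=2):
    """Number of visual tokens left after patchifying and merge x merge merging."""
    block = patch_size * merge
    if height % block or width % block:
        raise ValueError("height and width must be multiples of patch_size * merge")
    grid_h, grid_w = height // patch_size, width // patch_size
    return (grid_h // merge) * (grid_w // merge)

small = visual_token_count(224, 224)  # 16 x 16 patches, merged down to 8 x 8
large = visual_token_count(448, 672)  # 32 x 48 patches, merged down to 16 x 24
print(small, large)
```

Because the token count scales with the image area instead of being fixed, a large image costs more tokens and a small one fewer, which is what makes variable-resolution input and long video practical.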

At this point, you might think that's the ceiling.

Then came Qwen3-VL in 2025, and it tells you straight up: everything before was just foreplay.

It did three things, and every single one defied common sense.

First: DeepStack. Previously, all visual information was dumped into the language model in one big lump at the input. Not anymore. Now the image is first taken apart into multiple levels of features drawn from different depths of the vision encoder: roughly, edges, textures, semantic concepts. Each level then "talks privately" with a specific layer of the language model. Think of it like a company meeting: no more flooding everyone with everything, but targeted department-to-department communication.
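The department-to-department idea can be sketched as follows. This is a toy illustration, not Qwen3-VL's code: the names `inject` and `run_layers`, the identity "layers", and the choice to add features after each layer are assumptions made for the example.

```python
def inject(hidden, level_features, visual_positions):
    """Add one level of visual features to the hidden states at the visual token slots."""
    out = [row[:] for row in hidden]
    for pos, feat in zip(visual_positions, level_features):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out

def run_layers(hidden, layers, levels, visual_positions):
    """Run the layers in order; after layer i, add visual level i if there is one."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i < len(levels):
            hidden = inject(hidden, levels[i], visual_positions)
    return hidden

# Toy setup (all values hypothetical): 4 tokens of width 2; tokens 1 and 2 are visual.
hidden = [[0.0, 0.0] for _ in range(4)]
visual_positions = [1, 2]
levels = [
    [[1.0, 1.0], [1.0, 1.0]],  # shallow encoder features (edges)
    [[2.0, 2.0], [2.0, 2.0]],  # mid-depth features (textures)
    [[3.0, 3.0], [3.0, 3.0]],  # deep features (semantic concepts)
]
identity = lambda h: h  # stand-in for a real transformer layer
layers = [identity] * 4

result = run_layers(hidden, layers, levels, visual_positions)
print(result)
```

Only the visual token slots change, and each language-model layer receives a different level of visual detail; the text tokens are left alone.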

Second: MoE sparsification. Earlier mixture-of-experts designs picked six out of 160 routed experts for each token, while keeping one shared expert that is always on. By Qwen3, the shared expert is completely gone: 128 routed experts, with only a single-digit number (eight) activated per token. Here's an analogy: the company used to keep a standing army; now it's all contractors, called in as needed. A load-balancing loss computed over the whole batch keeps the work spread evenly across experts, so the compute spent per token drops sharply while quality remains intact.
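Here is a minimal sketch of top-k expert routing for a single token, the "contractors called in as needed" step. The 8 toy experts, the router scores, and the names `route` and `moe_forward` are hypothetical; real routers are learned, and load balancing is left out.

```python
import math

def route(router_logits, top_k):
    """Pick the top_k experts for one token and softmax-normalize their scores."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:top_k]
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    out = [0.0] * len(x)
    for e, weight in route(router_logits, top_k):
        y = experts[e](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

calls = []  # records which experts actually ran

def make_expert(index):
    def expert(x):
        calls.append(index)
        return [(index + 1) * xi for xi in x]  # toy expert: scale the input
    return expert

experts = [make_expert(i) for i in range(8)]
router_logits = [0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0]  # experts 2 and 5 score highest

output = moe_forward([1.0, 2.0], experts, router_logits, top_k=2)
print(output, sorted(calls))
```

Six of the eight toy experts never run for this token, which is where the savings come from: parameter count grows with the number of experts, but per-token compute grows only with `top_k`.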

Third: M-RoPE. Rotary positional encoding is extended into multiple dimensions: time, height, and width. The idea first appeared in Qwen2-VL, and Qwen3-VL refines it. So when watching a six-hour-plus video, the position of every frame is crystal clear. Note: six hours, not twenty minutes.
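A small sketch of how multi-dimensional position IDs can be assigned, loosely following the scheme described for Qwen2-VL: text tokens get the same index on all three axes, while each visual token gets a (time, height, width) triple. The function names and the tiny 2-frame, 2x2 grid are assumptions for illustration.

```python
def text_positions(n_tokens, start):
    """Text tokens: all three components are equal, like ordinary 1-D positions."""
    return [(start + i, start + i, start + i) for i in range(n_tokens)]

def vision_positions(frames, grid_h, grid_w, start):
    """Visual tokens: one (time, height, width) triple per patch of each frame."""
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]

def next_start(positions):
    """Following tokens continue from the largest index used so far, plus one."""
    return max(max(p) for p in positions) + 1

prompt = text_positions(2, start=0)
video = vision_positions(frames=2, grid_h=2, grid_w=2, start=next_start(prompt))
answer = text_positions(1, start=next_start(prompt + video))

for triple in prompt + video + answer:
    print(triple)
```

Two patches in the same frame share a time index but differ in height or width, so the model can tell "where in the frame" apart from "when in the video".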

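And then there's the attention mechanism: the KV cache has to shrink. MLA (multi-head latent attention), the design introduced with DeepSeek-V2, compresses it several-fold by storing a low-rank latent; the Qwen3 language backbone gets its savings from grouped-query attention (GQA) instead, where several query heads share one key/value head. Either way, you can binge-watch videos and the memory won't blow up.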
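Why the cache size matters can be checked with quick arithmetic. The sketch below counts KV-cache bytes for a hypothetical model (32 layers, 128-dimensional heads, 16-bit values, 100,000 tokens; none of these numbers come from Qwen3-VL) and shrinks the cache by letting query heads share key/value heads. That sharing is a simpler stand-in chosen for illustration, not an implementation of MLA.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration: every one of 32 heads keeps its own keys and values.
full = kv_cache_bytes(100_000, n_layers=32, n_kv_heads=32, head_dim=128)

# Same model, but groups of 4 query heads share one of 8 key/value heads.
shared = kv_cache_bytes(100_000, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{full / 1e9:.1f} GB vs {shared / 1e9:.1f} GB, ratio {full / shared:.0f}x")
```

The cache grows linearly with sequence length, so for hours of video frames any constant-factor compression translates directly into how long a clip fits in memory.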

In other words, Qwen3-VL is no longer just a "chatbot that can see images." It can control computers, phones, robots; it can recognize 130 scripts; it can zero-shot count objects; it can even automatically adjust how densely it samples video frames based on motion speed.

Look back at this trajectory: LLaVA's "external patch-up" → Qwen2-VL's "deep coupling" → Qwen3-VL's "native unification." On the other side, Gemini 1.5 Pro brute-forced it with a million-token context, and GPT-4o went for end-to-end full modality. Different paths, same destination.

You think they're building "AI that can see pictures." Actually, they're building a unified world model.

One last line:

From linear projection to deep synergy took only three years. It proves an iron law: optimizing the internal information flow to the extreme is itself a shortcut to intelligence.

Cael Lee
