Putting Multimodal Large Models into Production: From Qwen3 (English)

Generated: 2026-06-24 16:47:57

---


Have you ever been here? You max out the hardware with 4 H100s, pick Qwen2-VL-7B, and the business requirement isn't even complicated: upload an image, get JSON back, and keep latency reasonable.

And then, in the first month, you hit ten landmines.

That was my friend. Listening to his postmortem, I felt his pain and took notes frantically, because I knew I'd step on every single one of those mines sooner or later.

Let's Get One Thing Straight: The Problem Isn't the Output, It's the Input

Where's the bottleneck for pure text LLMs? Usually, it's the decode phase being slow, or the request queue getting completely clogged up.

But multimodal is different. The input side is the real boss.

The key is visual token explosion.

When an image goes in, it goes through Patchify and Projector, and instantly becomes hundreds to thousands of Visual Tokens. I tested this with a 1024×768 image, threw it into Qwen2-VL, and got 896 tokens back.

The consequence? Prefill becomes the main bottleneck, time-to-first-token balloons, and the visual tokens eat up KV cache that would otherwise serve more requests.

With the same VRAM, a pure-text workload handled 10 concurrent requests; with multimodal input, 3 were enough to trigger an OOM.
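To put a number on the KV cache share of a single image, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension are my assumptions based on the published Qwen2-VL-7B language-model config, and fp16 cache entries are assumed too, so check them against your checkpoint's config.json. Vision encoder activations and prefill memory come on top of this and are not modeled here.

```python
def kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies (assumed Qwen2-VL-7B-style config, fp16)."""
    # Both K and V are cached in every layer, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_mib(num_tokens, **config):
    """KV cache footprint in MiB for a sequence of num_tokens tokens."""
    return num_tokens * kv_cache_bytes_per_token(**config) / 2**20


# The 1024x768 test image produced 896 visual tokens.
print(kv_cache_bytes_per_token())  # 57344
print(kv_cache_mib(896))           # 49.0
```

Under these assumptions one image costs about as much cache as 896 tokens of text, before the prompt and the answer are even counted.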

Even scarier is Grounding bias. In a chat scenario, if the model sees "the red button on the left" as "the blue button on the right," the worst that happens is a wrong answer and everyone has a laugh. But in a control scenario?

Instant production incident.

I saw it happen once. My boss's face was colder than an H100 heatsink.

And then there's fine-tuning. A multimodal model has three components: Vision Tower, Projector, and LLM Backbone. If you go full fine-tuning without caution, the visual understanding capabilities will just collapse.

The first time I did SFT, the model's image understanding skills regressed to zero. A "blind" vision model—do you expect that to do any real work?

Fine-Tuning Has an Order: Build the Road First, Then Drive the Car

I used LLaMA-Factory 0.9.2, base model Qwen2-VL-7B.

Don't just jump straight into full instruction fine-tuning. You'll almost certainly crash—like trying to run a marathon before you can even walk.

I switched to a two-phase strategy.

Phase 1: Build the Road.

Freeze the Vision Tower and the LLM Backbone. Train only the Projector.

Purpose: Make the visual features map more smoothly into the LLM's semantic space. In plain terms, let the model learn to see clearly first.

For data, use high-quality image-captioning data mixed with a small amount of domain-specific "object description" data, at an 8:2 ratio. Train for about 10k steps. Once the loss curve stabilizes, move to the next phase.

Phase 2: Drive the Car.

Keep the Vision Tower frozen. Unless you have hundreds of thousands of high-quality image-text pairs, don't touch it. Keep the Projector trainable, and attach a LoRA adapter to the LLM Backbone.

This is when the model starts learning your business logic.

You can directly copy these core parameters:

target_modules: Cover the Attention and MLP layers. I used q_proj, v_proj, down_proj, up_proj.
lora_rank: 16. For a 7B model, 16 to 32 is enough. Going bigger gives diminishing returns.
lora_alpha: 32. Set alpha to 2x the rank. I tried other ratios, but this one is the most stable.

But there's a trap: don't get greedy with lora_rank.

I used rank=64 before. VRAM went up, but performance actually dipped a bit. The data volume just wasn't there.

It's like putting a huge ship in a small pond—it can't get going, it'll just get stuck.
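To see how fast the adapter grows with rank, here is a rough parameter count for the four target modules. The layer dimensions are my assumptions for a Qwen2-7B-style backbone and are not taken from the training run, so read the real values from your checkpoint's config.json.

```python
# Assumed Qwen2-7B-style dimensions; verify against config.json.
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 3584, 512, 18944, 28

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, KV_DIM),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank):
    """Total LoRA parameters: each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * NUM_LAYERS


for rank, alpha in ((16, 32), (32, 64), (64, 128)):
    # alpha / rank is the LoRA scaling factor; alpha = 2 * rank keeps it at 2.0
    print(rank, alpha / rank, lora_param_count(rank))
```

Under these assumptions rank 16 adds about 25M trainable parameters and rank 64 about 101M, four times as many to fit with the same amount of data.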

Data: 70% of Your Problems Are Right Here

If your fine-tuning results are bad, nine times out of ten it's related to the data. I'm not just saying that; I validated it three times on this project.

How do you source your data?

Collect logs and event tracking from your online production system.
Apply rule-based filtering and do Hard Case mining.
Use humans or models for correction to create a golden dataset.
Use mixed resampling to form your final training data.
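The last step, mixed resampling, can be sketched in a few lines. The pool names and weights below are hypothetical placeholders, not the mix used in this project.

```python
import random


def mixed_resample(pools, weights, total, seed=0):
    """Draw `total` samples with replacement, choosing a pool by weight each time."""
    rng = random.Random(seed)  # fixed seed keeps the training set reproducible
    names = list(pools)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=total)
    return [rng.choice(pools[name]) for name in picks]


# Hypothetical pools: golden set, mined hard cases, and general samples.
pools = {
    "golden": ["g1", "g2", "g3"],
    "hard_case": ["h1", "h2"],
    "general": ["x1", "x2", "x3", "x4"],
}
weights = {"golden": 0.5, "hard_case": 0.3, "general": 0.2}

train_set = mixed_resample(pools, weights, total=1000)
print(len(train_set))
```

Sampling with replacement lets a small hard-case pool be over-represented on purpose; watch for overfitting if one pool is tiny.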

Cleaning and normalization are critical.

You have to consolidate multiple formats expressing the same semantics—otherwise, the model learns noise, not patterns.

For example:

Date formats: 2023/1/1, Jan 1st, 23-01-01—unify everything to ISO 8601.
Multimodal BBox coordinates: Convert absolute coordinates to normalized coordinates to fit the model input.
The Grounding standards also need unification: Dimensions like warm/cold, bright/dark, moving/static must be clearly defined.
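The first two normalizations could look like the sketch below. Several choices are my assumptions: "23-01-01" is read as year-month-day, dates without a year need an explicit default_year, and the 0 to 1000 coordinate scale should be replaced with whatever your model's prompt format expects.

```python
import re
from datetime import datetime

_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%y-%m-%d")


def to_iso_date(text, default_year=None):
    """Normalize a few common date spellings to ISO 8601 (YYYY-MM-DD)."""
    # Strip ordinal suffixes: "1st" -> "1"
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    if default_year is not None:
        # "Jan 1" carries no year, so the caller has to supply one.
        try:
            parsed = datetime.strptime(f"{cleaned} {default_year}", "%b %d %Y")
            return parsed.date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_bbox(box, width, height, scale=1000):
    """Convert absolute (x1, y1, x2, y2) pixels to integers in [0, scale]."""
    x1, y1, x2, y2 = box
    return (round(x1 / width * scale), round(y1 / height * scale),
            round(x2 / width * scale), round(y2 / height * scale))


print(to_iso_date("2023/1/1"))                    # 2023-01-01
print(to_iso_date("Jan 1st", default_year=2023))  # 2023-01-01
print(to_iso_date("23-01-01"))                    # 2023-01-01
print(normalize_bbox((256, 192, 512, 384), 1024, 768))  # (250, 250, 500, 500)
```

Raising on unknown formats instead of guessing is deliberate: those samples go back to the cleaning queue rather than into training.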

The negative sample system is where the real skill lies.

I specifically prepared a batch of samples with "wrong fields," "missing partitions," and "capability not supported." These teach the model to clarify, refuse, and fall back gracefully.

The model absolutely must learn to say "I don't know" or "I can't do that"—that's infinitely better than making a wild guess.
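For illustration, negative samples might be stored like this. The record schema (case, user, target, action) and the example texts are hypothetical, not the format used in this project.

```python
# Hypothetical schema: every negative sample must end in a safe, non-guessing action.
ALLOWED_ACTIONS = {"clarify", "refuse", "fallback"}

NEGATIVE_SAMPLES = [
    {"case": "wrong_field",
     "user": "Read the 'invoice_total' field from this image.",
     "target": {"action": "clarify",
                "message": "I can't find an 'invoice_total' field. Which value do you mean?"}},
    {"case": "missing_partition",
     "user": "Report the value for the 2024-Q3 partition.",
     "target": {"action": "fallback",
                "message": "That partition is not in the input, so I am returning an empty result."}},
    {"case": "capability_not_supported",
     "user": "Click the button for me.",
     "target": {"action": "refuse",
                "message": "I can't do that. I can only describe what is in the image."}},
]


def is_valid_negative(sample):
    """A negative sample must pick an allowed action and explain it."""
    target = sample["target"]
    return target["action"] in ALLOWED_ACTIONS and bool(target["message"].strip())


print(all(is_valid_negative(s) for s in NEGATIVE_SAMPLES))  # True
```

A check like this in the data pipeline catches the sample where an annotator "helpfully" filled in a guessed answer.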

The Deployment Grind: Pushing from 95% to 99%

After fine-tuning the model, deployment gave me another lesson.

Which engine to choose? I compared three:

vLLM: PagedAttention, high throughput, the best ecosystem.
SGLang: Optimized for structured output, low latency.
TensorRT-LLM: Deep optimization by NVIDIA, 30% to 50% faster under high concurrency, but a nightmare to deploy.

I ended up choosing vLLM 0.7.2. Simple reason—great ecosystem, active community. If you hit a wall, you can find someone to ask.

VRAM Budget Management is something I only added after several OOMs.


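class MultimodalRequestHandler:
    """Admission control: reject requests whose images exceed the vision token budget."""

    def __init__(self):
        self.vision_token_budget = 2048  # max estimated vision tokens per request
        self.max_concurrent_requests = 4
        self.rejected_count = 0

    def estimate_vision_tokens(self, image_info):
        # Counts raw 14x14 patches. This is a conservative upper bound when the
        # model merges neighboring patches before the LLM, as Qwen2-VL does.
        width, height = image_info['width'], image_info['height']
        patches = (width // 14) * (height // 14)
        return patches

    def should_accept_request(self, request_data):
        # Sum the estimate over every image attached to the request.
        total_vision_tokens = 0
        if 'images' in request_data:
            for img in request_data['images']:
                total_vision_tokens += self.estimate_vision_tokens(img)
        accepted = total_vision_tokens <= self.vision_token_budget
        if not accepted:
            self.rejected_count += 1  # track rejections for monitoring
        return accepted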

2048 was a value I found through trial and error. Any bigger, and you risk OOM. Any smaller, and you'll mistakenly reject too many legitimate requests.
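One way to reject fewer legitimate requests is to downscale an oversized image until it fits, instead of refusing it. This is an alternative I am sketching here, not something the handler above does. It reuses the same 14-pixel patch estimate, and fit_to_budget is a hypothetical helper name.

```python
import math


def patch_count(width, height, patch=14):
    """Same estimate as the handler: number of full patch x patch tiles."""
    return (width // patch) * (height // patch)


def fit_to_budget(width, height, budget=2048, patch=14):
    """Return a (width, height) that fits the budget, roughly keeping aspect ratio."""
    current = patch_count(width, height, patch)
    if current <= budget:
        return width, height
    # Area scales with the square of the side length, hence the square root.
    scale = math.sqrt(budget / current)
    new_w = max(patch, int(width * scale) // patch * patch)
    new_h = max(patch, int(height * scale) // patch * patch)
    # Guard against rounding leaving us slightly over budget.
    while patch_count(new_w, new_h, patch) > budget:
        if new_w >= new_h and new_w > patch:
            new_w -= patch
        elif new_h > patch:
            new_h -= patch
        else:
            break
    return new_w, new_h


print(fit_to_budget(1024, 768))  # (728, 546)
print(fit_to_budget(448, 448))   # (448, 448), already within budget
```

Whether downscaling is acceptable depends on the task: fine for captioning, risky when small text or precise Grounding matters.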

Four Core Optimizations, Each One Counts

1. ViT CUDA Graph Acceleration

Pain point: The visual encoder incurs a full kernel launch overhead for every inference. It's even worse for video input.

vLLM introduced a Budget-level CUDA Graph capture and replay mechanism. The system pre-captures CUDA Graphs based on the token budget range, then during inference, it selects the optimal graph to replay based


Cael Lee
