垂直领域大模型的思考 (English)

Generated: 2026-06-20 18:22:13

---

It Took Me Two Years to Really Understand What “Vertical Large Language Models” Actually Are

Have you noticed?

Lately, when I scroll through my feed, eight out of ten posts are about ChatGPT, Wenxin Yiyan, or Tongyi Qianwen. You ask if these models are any good? Well, they can chat about anything—write poems, tell stories, recommend movies—they've got a whole routine. But you ask them to do real work—like drafting a contract or analyzing a lab report—and they just brush you off.

"That's exactly when I started questioning my life." — That's what a lawyer friend of mine said last month, slamming the table as we drank together.

Let me roll up my sleeve and show you what I've been through.

I've stepped into plenty of traps over the past two years, and my soles are thick with calluses. Today, I'm not going to bore you with abstract theories. I'm going to break down all those hard-learned lessons, plain and simple.

---

Guess What? Two Years Ago I Almost Wrote Off "Vertical Large Language Models"

Back in late 2023, GitHub was buzzing.

"Legal LLM," "Medical LLM," "Finance LLM"... I clicked into no fewer than dozens of them. Want to know what I thought?

Four words: my heart sank.

90% of them were nothing but a ChatGPT wrapper with a dozen prompts slapped on, pretending to be something special. You ask, "What medication should be used for a patient with myocardial infarction and diabetes?" and it spits out nonsense like "Eat a balanced diet and get plenty of rest."

At that moment, I really thought: the so-called vertical LLM is probably a false proposition.

But later? I ended up eating my own words.

This year, I tested a model called ChatLaw (the one open-sourced by Peking University). I started off skeptical—wasn't it just LangChain? But the more I looked, the more confused I got.

Here's what makes it interesting: It doesn't just search for answers the way ordinary people ask questions. Instead, it trains a tiny specialized model that does only one thing: keyword extraction.

Can you feel the difference?

The traditional way of extracting keywords: just find high-frequency words and proper nouns. "What to do if the company owes my wages" — they'd extract just "owes wages."

But ChatLaw's little module? It does something called "semantic supplementation." With the same query, it doesn't just extract "owes wages"; it automatically generates related keywords like "labor arbitration," "Provisions on Payment of Wages," and "economic compensation" before searching.

See the difference?

I tried a specific scenario. I asked a general-purpose model about "circumstances under which a shareholder resolution is invalid." It listed a bunch of legal provisions—looked professional, but if you read carefully, you could tell they were straight out of a textbook. ChatLaw, on the other hand, directly retrieved five real court cases involving invalid rulings and highlighted the core points of dispute.

That's what I call "calling up real stuff," not "repeating common knowledge."

So here's the thing: Vertical LLMs are not a false proposition, but the ones that are badly done? Those definitely are.

A truly effective vertical domain model has to put serious work into three things: deep cleaning of domain data, fine-grained design of the retrieval logic, and alignment of the output style. Miss any one of the three, and it's just fluff.

---

Let Me Ask You: Do You Think It's Realistic for a Small Company to Jump Straight into Fine-Tuning?

Let me put it bluntly: Don't set your sights on training your own model from scratch right away—that's what the big guys do.

Last year, I took on a project that I'll never forget.

A client wanted to build a medical consultation assistant. The client said straight up, "We want to train our own medical LLM." I asked about the budget, and they replied very seriously, "The boss said we can spend tens of thousands."

I barely held back a laugh, but inside I was crying.

Think about it: with tens of thousands, you can barely train a small model with a couple billion parameters. And you want to go head-to-head with those hundred-billion or trillion-parameter general-purpose models?

I stopped them right there: Don't rush. Don't jump into the pit yet. Here's my advice: use the API of an existing LLM and build a RAG system.

What's RAG? In plain English: Imagine you have a super smart assistant who knows nothing about new information. You lock it in a library full of documents. When you ask a question, it reads the relevant books first and then answers.

How did we do it?

I helped them clean about 2,000 medical records, clinical guidelines, and drug instructions. We chunked them and stored them in a database. When a user asked a question, we first retrieved the 5 to 8 most relevant chunks, then fed them to a large model (back then, we used Qwen-72B) and asked it to answer based on the retrieved material.

It ran for a month. Want to know the results?

Even the doctors thought it was "pretty good." Accuracy for common conditions reached over 85%.

That number might not mean much to some people, but if you've worked in this field, you know—reaching that accuracy in a specialized domain is a pleasant surprise.

But don't think fine-tuning is useless either.

The scenarios where you really need fine-tuning are these: you have extremely strict requirements on output format; you need the model to "understand" the jargon and logic of your domain.

For example, generating legal documents. You must ensure that the contract terms output by the model are legally rigorous and follow a standard format. RAG can't handle that—because the LLM itself doesn't understand the legal relationship between "breach of contract by Party B" and "compensation cap."

Later on, I did add fine-tuning to that medical project, mainly in two directions:

One was teaching the model "rejection." For instance, if a patient asks, "How much does this medicine cost?" the model should recognize this as a non-medical question and politely tell them to ask the pharmacy.

The other was style alignment. The way doctors talk and the way patients talk are completely different. I gathered 2,000 real doctor notes written by actual physicians. After fine-tuning, the model's output stopped being verbose and just gave the diagnosis and treatment plan directly.

So my experience boils down to this: first use RAG to quickly validate the value, then use fine-tuning to optimize for key scenarios. Don't look down on either path.

---

When It Comes to Deployment… Let Me Tell You, It's a Barrel of Tears!

Let's start with document chunking. That thing really tortured me.

You want to build a medical knowledge base. You throw in a bunch of PDFs. You think you can just chop them up and be done?

Naive.

Here's a real blood-and-tears story. There was a 120-page guide called "Guidelines for Primary Care of Hypertension." I had an intern do the usual routine: cut every 500 characters.

Result? The model asked, "What are the contraindications for ACE inhibitors?" The retrieved chunk only contained the second half of a sentence about "precautions for using ACE inhibitors." The key information that said "contraindicated" in the first half? It got cut into the previous chunk and was never retrieved.

Because of that, the model's answer was completely off.

Later, I used a method called "semantic chunking"—splitting by chapter headings and paragraph logic. It was more work, but accuracy jumped nearly 20 percentage points.

Now let's talk about retrieval.

A lot of people think vector search is a silver bullet. I'm here to tell you: not necessarily.

You ask, "What lawsuits have involved the legal person of a subsidiary of a certain company?" Pure vector search will choke. Why? Because the three concepts "subsidiary," "legal person," and "lawsuit" are scattered in the vector space. A single vector has a hard time hitting all three dimensions at once.

I later tried hybrid retrieval—combining vector search with traditional BM25 keyword matching. It worked much better. But what really impressed me was Graph RAG. You need to extract information like corporate annual reports and equity structures into a knowledge graph in advance

垂直领域大模型的思考 (English)

垂直领域大模型的思考 (English)

It Took Me Two Years to Really Understand What “Vertical Large Language Models” Actually Are

Guess What? Two Years Ago I Almost Wrote Off "Vertical Large Language Models"

Let Me Ask You: Do You Think It's Realistic for a Small Company to Jump Straight into Fine-Tuning?

When It Comes to Deployment… Let Me Tell You, It's a Barrel of Tears!

Cael Lee

Ready to get started?