A Deep Dive: An Overview of RAG Techniques for Large Models
Generated: 2026-06-20 23:07:40
---
I've been working on RAG for half a year and almost got hunted down by my client! Today, I'm spilling all the pitfalls!
Guess what? Last year, there was a project that nearly killed me!
The client wanted an enterprise knowledge base—just throw in dozens of PDFs and support multi-turn dialogues. I was so naive back then, thinking, "Oh, it's just chunking → vectorizing → retrieving → generating. Easy peasy!" But what happened? It crashed the moment it went live!
The customer asked: "What's the net profit for Q3?"
The model confidently shot back: "Q2, 250 million."
Then the customer asked: "How exactly do we implement that plan you just mentioned?"
The model just started inventing steps, completely shameless about it!
I was dumbfounded. I thought, "This is nothing like the tutorials online!"
Online resources were either rehashed academic papers—long and boring—or just ran a demo and called it "real-world practice," only to blow up in production.
I've been tinkering with RAG on and off for half a year, and I've fallen into more pits than I can count!
Later I realized: Landing RAG in production is way more than just getting a demo to work!
Now, someone's bound to ask: "But current LLMs have millions of tokens of context, right? Isn't RAG going obsolete?"
Exactly the opposite!
Anyone still saying RAG is outdated? Baloney!
Ultra-long context has a killer problem: "Lost in the Middle." Models tend to use information at the start and end of a long context far better than information buried in the middle! And the token cost is insane. Just imagine piling all the books on the table for every conversation. Who can afford that?
RAG is much smarter: it's like giving the model a precise reference book, letting it flip to the relevant pages before answering. Way more reliable than dumping the whole library on the table!
---
First, let's talk about what RAG actually solves
Large models have some built-in flaws that nobody can avoid:
- Knowledge cutoff: Training data always has a timestamp. Ask about "yesterday's news" and it's never even heard of it!
- Hallucination: Models fabricate facts that sound logical but are actually wrong. In medical, legal, or financial scenarios, would you dare use it raw? That's a disaster waiting to happen!
- Missing private knowledge: Internal company documents, industry manuals—the model has never seen them. How could it know?
So RAG's idea is brutally simple: Don't make the model take a closed-book exam; let it flip through reference books in real time before answering!
That way, answers have evidence, are traceable, and cost little to update. Pretty sweet, isn't it?
Remember the core formula: RAG = Vector Retrieval (real-time knowledge) + LLM Generation (natural language expression)
---
I'll break down RAG into stages, and every stage has its traps
Stage 1: Naive RAG (Basic RAG)
The most basic flow, covered in every beginner tutorial. But I only learned the real tricks after trying it myself.
Indexing stage—the first hurdle is text chunking.
At first, I just split by sentences. For long documents, the retrieved fragments were incomplete and out of context! Then I tried splitting by paragraphs, but some paragraphs were over 1000 characters, and the model's context window couldn't handle it.
What's the most reliable method? Recursive chunking! Set a maximum length—I usually use 512 tokens—and use overlapping windows (e.g., 32 tokens overlap). This keeps semantic coherence and doesn't lose boundary information. Very critical step!
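Here's a minimal sketch of that recursive chunking idea, not the exact code from my project. Assumptions: whitespace-separated words stand in for real tokenizer tokens, the separator list is my own guess at sensible boundaries, and `chunk_text` and `split_recursive` are hypothetical names.

```python
from typing import List

# Assumed separators, tried from coarse to fine: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def count_tokens(text: str) -> int:
    # Assumption: words approximate tokens; swap in a real tokenizer in practice.
    return len(text.split())


def split_recursive(text: str, max_tokens: int, seps: List[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_tokens, preferring coarse boundaries."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not seps:
        # No separators left: hard-cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    pieces: List[str] = []
    for part in text.split(seps[0]):
        pieces.extend(split_recursive(part, max_tokens, seps[1:]))
    return pieces


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 32) -> List[str]:
    """Merge small pieces into chunks of at most max_tokens with an overlapping tail.

    Note: chunks are rejoined with single spaces, so original line breaks are lost.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    # Leave room for the overlap so no chunk exceeds max_tokens.
    pieces = split_recursive(text, max_tokens - overlap)
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last `overlap` words into the next chunk.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Calling `chunk_text(document_text)` with the defaults gives the 512/32 setup described above. The overlap is what keeps a sentence that straddles a chunk boundary retrievable from both sides.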
After chunking, use an embedding model to turn them into vectors. During testing, I used bge-m3, text2vec-large-chinese, and m3e-base.
To be honest, bge-m3 performed best on Chinese long texts, but it ate up a lot of VRAM. For small-scale validation, I used m3e-base and only switched to bge-m3 for production. This trade-off depends on your data volume and hardware.
Retrieval stage: early on, I only used vector similarity (cosine similarity), only to discover later that it missed exact keyword matches!
For example, if the user asks "Hepatitis B treatment plan", vector retrieval might rank a loosely related passage such as "Guidelines for treating hepatitis B" first, while BM25 keyword search hits the passage containing the exact phrase directly! That's the complementary nature of semantic matching and keyword matching.
So later I adopted hybrid retrieval: vector retrieval (semantic matching) + BM25 (precise keyword matching), then merge results with a simple weighted combination.
This step directly improved Top-5 recall by 12 percentage points! You read that right: 12 points!
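To make the "simple weighted combination" concrete, here's a rough sketch under my own assumptions: a textbook-style BM25 written in pure Python, min-max normalization so the two score scales are comparable, and a weight `alpha` that you'd tune on your own data. The function names (`bm25_scores`, `hybrid_rank`) and the default values are illustrative, not from any particular library.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized doc against the tokenized query with BM25."""
    n = len(docs)
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; all-equal scores become 0."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(dense: List[float], sparse: List[float],
                alpha: float = 0.5, top_k: int = 5) -> List[int]:
    """Return doc indices ranked by alpha * dense + (1 - alpha) * sparse."""
    d, s = min_max(dense), min_max(sparse)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(d, s)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:top_k]
```

In use, `dense` would hold `cosine(query_vec, chunk_vec)` for every chunk and `sparse` the matching `bm25_scores` output. With `alpha=1.0` you're back to pure vector retrieval, which makes it easy to compare the two setups side by side.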
Generation stage—you construct a prompt: feed the retrieved text chunks and the user question to the LLM.
I revised the prompt template many times and finally settled on three core instructions:
- Must answer based only on the provided references
- Cannot fabricate
- If the references are irrelevant, say "I don't know"
That simple "I don't know" instruction saved my bacon countless times!
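For reference, a prompt builder along those lines might look like the sketch below. The exact wording is illustrative, not my production template, and `build_prompt` is a hypothetical helper name.

```python
from typing import List

# The three core instructions, spelled out for the model.
RULES = (
    "Answer based only on the references below.\n"
    "Do not fabricate anything that the references do not state.\n"
    "If the references are irrelevant to the question, reply exactly: I don't know."
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble rules, numbered reference chunks, and the user question."""
    refs = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{RULES}\n\n"
        f"References:\n{refs}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )
```

Numbering the references is optional, but it makes it easy to ask the model to cite which chunk an answer came from.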
---
Stage 2: Modular RAG
Naive RAG is too rigid. When faced with complex scenarios, it just falls apart!
Modular RAG is different—it breaks down the retrieval process into independent modules, which you combine like building blocks!
I've tested several patterns:
- Linear pattern: Coarse filtering, fine filtering, context adaptation—narrow down the candidate set step by step.
- Branch pattern: Choose different paths based on question type—legal queries look up laws, technical queries look up articles.
- Loop pattern: After generating an answer, if it's not satisfactory, automatically re-retrieve and regenerate.
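The three patterns above can be sketched as plain functions that you wire together. Everything here is an assumption for illustration: the keyword-based routing, the `is_good` check, the `rewrite` step, and all function names are placeholders for whatever retrievers, rerankers, and LLM calls you actually use.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]
Stage = Callable[[str, List[str]], List[str]]


def linear(question: str, retrieve: Retriever, stages: List[Stage]) -> List[str]:
    """Linear pattern: each stage narrows the candidate set in turn."""
    candidates = retrieve(question)
    for stage in stages:
        candidates = stage(question, candidates)
    return candidates


def route(question: str, routes: Dict[str, Retriever], default: Retriever) -> Retriever:
    """Branch pattern: pick a retriever by keyword (a real router might use a classifier)."""
    q = question.lower()
    for keyword, retriever in routes.items():
        if keyword in q:
            return retriever
    return default


def answer_with_retry(
    question: str,
    retrieve: Retriever,
    generate: Callable[[str, List[str]], str],
    is_good: Callable[[str], bool],
    rewrite: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Loop pattern: re-retrieve with a rewritten query until the answer passes."""
    query, answer = question, ""
    for _ in range(max_rounds):
        chunks = retrieve(query)
        answer = generate(question, chunks)
        if is_good(answer):
            break
        query = rewrite(query, answer)
    return answer
```

The `max_rounds` cap matters: without it, a question the knowledge base simply cannot answer would loop forever and burn tokens on every pass.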
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.