
A Survey of the Hallucination Problem in Large Models: LLM Hallucinatio (English)

By CaelLee | 6 min read


Generated: 2026-06-20 18:54:42

---


That Night, I Almost Got Duped to Tears by ChatGPT

Can you believe it? Me – someone who thinks they deal with AI every single day – got chills from a paper that was completely fabricated by an AI.

Here's what happened. Last year, I wanted to write a technical report, so I asked ChatGPT for an example of “how to use transfer learning for image classification.” And guess what? It just invented a whole paper.

An author named “Zhang so-and-so,” a CVPR 2023 venue, a title, an abstract, a citation format – all of it looked totally real.

I thought: Holy crap, that's amazing! I immediately went to Google Scholar to search for it...

Nothing. Clean as a whistle, like it never existed.

How did I feel in that moment? It was like chatting with an incredibly confident liar. The more fluently he talks, the harder it is to tell truth from fiction. You see, that's what large-model hallucination is – the model isn't lying, it genuinely thinks it knows.

From that day on, I fell deep down the “hallucination” rabbit hole.

---

Speaking of which, you might think “hallucination” is a new term. Actually, before 2020, when people talked about this issue, they basically focused on just two areas: text summarization and data-to-text tasks.

Back then there was an ACL 2020 paper by Joshua Maynez and colleagues that split hallucination into two dimensions: Faithfulness (true to the input?) and Factuality (consistent with real-world facts?).

Let me give you an example, and you'll get it instantly.

Suppose the original text is “Zhang San hit Li Si.” If the summary writes “Li Si hit Zhang San” – that's a Faithfulness failure. And for data-to-text? Say the AI is supposed to generate a phone number, and it just makes up a number that doesn't exist – that's Factuality going off the rails.

The classification back then was simple too:

Intrinsic Hallucination: directly conflicts with the input information.

Extrinsic Hallucination: adds something that isn't in the input (what's made up might coincidentally be true, so you need external verification).

See? The first is nonsense, the second is guessing. Traditional tasks had extremely low tolerance for hallucination – what do users use summaries for? To save time! If you mess with the facts, I might as well just read the original.

Open-domain chat was comparatively forgiving – as long as it doesn't violate common sense, a little rambling doesn't bother anyone.

---

But then ChatGPT came out, and the whole world changed.

By the time we got to the LLM era, things had spiraled out of control like a runaway horse. The models spoke more and more fluently, but the rate of bullshitting was anything but low.

I ran my own stats – using GPT-3.5-turbo (the May 2023 version) for simple Q&A.

What did I find?

About 15% of responses contained at least one clearly fabricated fact.

Fifteen percent! And that was still for simple encyclopedia-style questions. For time-sensitive info – like conference schedules, today's weather – the error rate was even higher.

Later I read a few surveys and found that people had started categorizing this LLM disease systematically. The one from Tencent AI Lab drew the clearest picture, grouping hallucinations into three types:

Input-Conflicting: contradicts the user's instruction or input.

Context-Conflicting: the generated content contradicts itself.

Fact-Conflicting: contradicts widely accepted facts.

Guess which one is the hardest? Fact-Conflicting.

The reason is simple: the user might never realize they've been misled. Especially in medical and legal scenarios, the risk is enough to make your hair stand on end.

The ACM TOIS paper took it further, redefined the terms, and proposed a dichotomy that fits LLMs better:

Factuality hallucination (factual contradiction or fabrication) and Faithfulness hallucination (instruction inconsistency, context inconsistency, logical inconsistency).

This classification is really useful in practice. Why? Because LLMs often “have the context but just go off on their own” – you almost never see this kind of problem in traditional tasks.

---

When it comes to detecting hallucinations, I tried all sorts of methods in my work. Honestly, none of them is fully reliable.

The first thing I tried was uncertainty-based methods. The idea is straightforward: if the model's next-token probability distribution is flat (high entropy) while it generates, it's probably making stuff up.
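To make the entropy idea concrete, here is a minimal sketch. The probability lists and the `ENTROPY_THRESHOLD` value are made up for illustration; in a real setup the distributions would come from the model's own outputs and the cutoff would need tuning on labeled data.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(step_distributions):
    """Average entropy over all generation steps of one response."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)

# Hypothetical distributions over a 4-token vocabulary, one list per step.
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

ENTROPY_THRESHOLD = 0.8  # assumed cutoff, not a recommended value

for name, dists in [("confident", confident), ("uncertain", uncertain)]:
    h = mean_entropy(dists)
    print(f"{name}: mean entropy = {h:.3f}, flagged = {h > ENTROPY_THRESHOLD}")
```

A peaked distribution gives entropy close to zero, while a uniform one over four tokens gives ln 4, so the flat case is the one that gets flagged.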

But there's a killer problem – it's not friendly to closed-source models. Early on, OpenAI's chat API didn't even give you logprobs (it only returned tokens, not the distribution). Later GPT-4 opened up a portion of them, but not all commercial models support it.

Then I tried SelfCheckGPT. This one's clever. The core assumption: if the model is very certain about a fact, then sampling the same prompt multiple times should yield similar responses. If they differ? That might be a hallucination.

I ran it on a batch of GPT-3.5 outputs, using BERTScore to compute pairwise similarity with a threshold of 0.75. The results were decent.
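The sampling-consistency check described above can be sketched roughly like this. Note the swap: to keep the example self-contained it uses `difflib.SequenceMatcher` from the standard library as a crude stand-in for BERTScore, so the 0.75 threshold here is only illustrative. The sample responses are hypothetical, and this is not the SelfCheckGPT package's own API.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Stand-in for BERTScore: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consistency_score(samples: list[str]) -> float:
    """Mean pairwise similarity across responses sampled for one prompt."""
    pairs = list(combinations(samples, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def looks_hallucinated(samples: list[str], threshold: float = 0.75) -> bool:
    """Flag the prompt when the sampled responses disagree too much."""
    return consistency_score(samples) < threshold

# Hypothetical samples; in practice they come from re-sampling the model.
stable = ["Paris is the capital of France."] * 3
unstable = [
    "The paper was published at CVPR 2023 by Zhang.",
    "It appeared in NeurIPS 2021, written by Li and Wang.",
    "I believe this was an ICML 2019 workshop paper.",
]

print(looks_hallucinated(stable))
print(looks_hallucinated(unstable))
```

With n samples there are n(n-1)/2 pairs to score, on top of the n generation calls themselves.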

But guess what?

Way too slow! Each prompt required 5 to 10 sampling runs, multiplying the cost by about as much. For GPT-4, where sampling is even more expensive, it's completely unaffordable.

I also read the ETH Zurich paper on self-contradiction detection. Similar idea, but focused on detecting internal contradictions within the generated content. When I tried it in practice, I found that this kind of contradiction is way too common in long texts. The model would say “deep learning requires lots of labeled data” early on, then later claim “unsupervised learning can achieve the same effect”... If you don't read carefully, you'd never notice.

Later I turned to fact verification tools. The MSRA system is pretty neat: it has the LLM call external knowledge bases or search engines while it checks itself. I deployed a simplified version internally, using the Wikipedia API for real-time verification.

The bottleneck turned out not to be the verification itself, but “deciding which sentence needs verification.” Checking every sentence makes it too slow; only checking entities misses lots of relation-type errors.
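Here's a toy illustration of why entity-only checking misses relation-type errors. Everything in it is hypothetical: the tiny knowledge base, the triple format, and both check functions are assumptions for the sake of the example, not how any real verification system stores facts.

```python
# Hypothetical toy knowledge base: known entities and known
# (subject, relation, object) facts.
KNOWN_ENTITIES = {"Zhang San", "Li Si"}
KNOWN_FACTS = {("Zhang San", "hit", "Li Si")}

def entity_check(claim: tuple[str, str, str]) -> bool:
    """Passes if both entities exist, regardless of how they are related."""
    subject, _, obj = claim
    return subject in KNOWN_ENTITIES and obj in KNOWN_ENTITIES

def relation_check(claim: tuple[str, str, str]) -> bool:
    """Passes only if the whole triple is a known fact."""
    return claim in KNOWN_FACTS

swapped = ("Li Si", "hit", "Zhang San")  # roles reversed

print(entity_check(swapped))    # both names exist, so this check passes
print(relation_check(swapped))  # the reversed triple is not a known fact
```

The reversed claim sails through the entity check because both names are real; only a check on the full relation catches it, and that is the more expensive one to run on every sentence.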

Harvard's ITI (Inference-Time Intervention) is an interesting direction. They found that LLMs contain activation directions related to truthfulness, and that shifting activations along those directions during inference can make answers more truthful. I tried to replicate it (they open-sourced the code) and saw some improvement on TruthfulQA.

But cross-model generalization isn't there – it worked on LLaMA but didn't do much on Mistral. Plus you need internal model access, which makes it basically useless for commercial API users.
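The core operation behind this kind of intervention is just a vector shift. The sketch below assumes a direction has already been found (for example with a linear probe) and uses made-up numbers throughout; `alpha` (intervention strength) and `sigma` (spread of activations along the direction) are illustrative parameters, and this is not the released ITI code.

```python
import math

def unit(direction):
    """Normalize a direction vector to length 1."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]

def intervene(activation, direction, alpha, sigma):
    """Shift one head's activation by alpha * sigma along the direction."""
    u = unit(direction)
    return [a + alpha * sigma * ui for a, ui in zip(activation, u)]

def projection(activation, direction):
    """Component of the activation along the direction."""
    return sum(a * ui for a, ui in zip(activation, unit(direction)))

# Hypothetical 3-dimensional head output and "truthful" direction.
head_output = [0.2, -0.5, 1.0]
theta = [3.0, 0.0, 4.0]

shifted = intervene(head_output, theta, alpha=5.0, sigma=0.1)
print(projection(head_output, theta), projection(shifted, theta))
```

The projection onto the direction grows by exactly `alpha * sigma`, while the components orthogonal to it are untouched. This also shows why API-only users are stuck: the shift has to be applied to activations inside the forward pass.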

---

So what about practical solutions? After all my stumbling around, I found two things to be most solid.

First, data cleaning.

Sounds “dumb,” but it's the most fundamental. LLM pretraining data is loaded with fake news, outdated info, domain biases – that's where the model learns to hallucinate. The Tencent survey mentions that currently the main approach is heuristic rule filtering, like deduplication and filtering low-quality web pages.

I processed a batch of open-source data (a C4 subset) myself and found that after filtering, the generated answers were more stable on factuality – but the data volume shrank by nearly 20%.
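For a sense of what heuristic rule filtering looks like, here is a minimal sketch of exact deduplication plus a crude quality filter. The specific rules and cutoffs (`min_words`, `max_symbol_ratio`) are assumptions picked for the example, not the rules from the survey or from my own run.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_low_quality(text: str, min_words: int = 5,
                   max_symbol_ratio: float = 0.3) -> bool:
    """Assumed heuristics: too short, or dominated by punctuation/symbols."""
    if len(text.split()) < min_words:
        return True
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) > max_symbol_ratio

def clean_corpus(docs: list[str]) -> list[str]:
    """Drop duplicates (after normalization) and low-quality documents."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or is_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Transfer learning reuses a pretrained model for a new task.",
    "transfer learning   reuses a pretrained model for a new task.",
    "Click here!!! >>> $$$ <<<",
    "Too short.",
]
print(clean_corpus(docs))
```

Real pipelines usually add near-duplicate detection on top of exact matching, but even rules this blunt throw away a noticeable share of a web corpus.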

Second, RAG (Retrieval-Augmented Generation).

This thing is basically the industry standard now. I switched one of our company's QA systems from pure generation to RAG – standard pipeline: query → vectorize → retrieve top-5 from knowledge base → concatenate prompt → generate.

The result?

Hallucination rate dropped from 20% to 7%.

Seeing that number, I was so excited I almost slammed the table.
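The pipeline above (query → vectorize → retrieve top-k → concatenate prompt → generate) can be sketched like this. It's a toy under loud assumptions: the "embedding" is a bag-of-words count rather than a learned encoder, the knowledge base is three made-up sentences, the prompt template is invented, and the final generation call is left out.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts instead of a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, knowledge_base: list[str], k: int = 5) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Concatenate retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

kb = [
    "Transfer learning fine-tunes a pretrained network on a new dataset.",
    "RAG retrieves documents and adds them to the prompt.",
    "The city library opens at nine.",
]
question = "How does transfer learning work?"
prompt = build_prompt(question, retrieve(question, kb, k=2))
print(prompt)  # this string would then be sent to the LLM
```

Notice that `retrieve` always returns k passages, even when none of them is actually relevant to the question.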

But don't get too happy too soon – RAG isn't a cure-all. When retrieval fails (the knowledge base has nothing relevant), the model falls back to memory and still makes stuff up. If the retrieved documents themselves are wrong, it just pours gasoline on the fire.

Let me give you a classic case:

Using RAG to answer “what's the latest traffic regulation in City X” – it retrieved a 2020 old


