What to Make of an AI Detection System Rating Moonlight over the Lotus Pond 62.88% Suspected AI

By CaelLee | 6 min read

Generated: 2026-06-20 22:54:59

---

When I saw your question, it hit home right away. This story made me laugh, then fume, and in the end all that was left was a sigh.

Back in the 2024 graduation season, I came across a news report that almost made me drop my phone. Dahe Bao ran a batch of classic texts through a thesis plagiarism detection system, and guess what? Zhu Ziqing's Moonlight over the Lotus Pond came back 62.88% AI-suspected. An excerpt from Liu Cixin's The Wandering Earth: 52.88%. Wang Bo's Preface to the Pavilion of Prince Teng? Take a guess. One hundred percent! Perfect score. At the time, many universities set the red line at 30% to 40%. By that standard, Zhu Ziqing would have to rewrite his essay overnight, Wang Bo would be expelled, and Liu Cixin? His advisor would have to call him in for a chat: "We need to talk about your use of AI."

People in my feed were sharing it as a joke, laughing hard. But as the laughter faded, something didn't sit right: a 1927 essay and a 1,300-year-old piece of parallel prose were flagged as "suspected AI-generated" by a system trained in 2024, and the system itself thinks it's running fine. This isn't a joke. It's a product demo.

But it gets even crazier

Later, CCTV picked up the story. A student handwrote his thesis abstract, typed it out word by word, ran it through the system—99% AIGC rate. Then he fed it a purely AI-generated text. Result: 0%. Who's that student supposed to complain to? And then there's Ge Jiayi from Sichuan University. Her team's project timeline was flagged by CNKI's system as 97% AIGC. A timeline! Stuff like "January: distribute questionnaires, February: analyze data"—that's AI writing? That's just the standard boilerplate everyone uses for project plans.

Now go search "AIGC detection" on Xiaohongshu. The feed is flooded with graduating students venting about their ordeals. One person replaced every comma in the text with a period in a single click, and the AI rate dropped from 38% to 11.51%. Another hand-edited the text to strip out typical AI writing habits, changing "This study collected 500 valid samples through a survey" to "We ran a survey, sent out maybe five or six hundred, got back about five hundred that were usable," and the AI rate plummeted. Think about it: an academic paper has to be rewritten in that kind of casual tone just to count as "human-written." Are we judging content or just phrasing?

What's this system really looking at?

Put simply, it uses two metrics: perplexity and burstiness.

Perplexity measures how "surprising" your word choices are to a language model. AI tends to pick high-probability words, so its perplexity is low. Burstiness tracks how much the text varies from one sentence to the next, most visibly in sentence length. AI writes evenly, while humans mix short and long sentences and occasionally throw in an odd word. From these two signals, the system is supposed to tell human from machine.
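To make those two numbers concrete, here is a minimal sketch, not how any real detector works. It assumes a toy unigram model with add-one smoothing in place of a large language model, and it uses the coefficient of variation of sentence length as a stand-in for burstiness. The function names (tokenize, unigram_perplexity, burstiness) and the sample sentences are made up for illustration, and the crude tokenizer only handles English.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Assumption: a real detector would use a neural language model here;
    the unigram model only illustrates the formula exp(-mean(log p)).
    """
    counts = Counter(tokenize(reference))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text has no tokens")
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Coefficient of variation (std / mean) of sentence lengths in words.

    Assumption: sentence length is used as a simple proxy for burstiness.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [n for n in (len(tokenize(s)) for s in sentences) if n > 0]
    if not lengths:
        raise ValueError("text has no sentences")
    return pstdev(lengths) / mean(lengths)


if __name__ == "__main__":
    reference = "the cat sat on the mat. the dog sat on the rug."
    print(unigram_perplexity("the cat sat on the rug.", reference))
    print(unigram_perplexity("quantum marmalade obliterates velvet.", reference))
    print(burstiness("One two three. Four five six. Seven eight nine."))
    print(burstiness("Stop. The long sentence rambles on for quite a while here. Why?"))
```

In this toy setup, the sentence built from familiar words gets the lower perplexity, and the three equal-length sentences get a burstiness of zero. Neat, evenly paced writing lands on the "machine" side of both numbers, whoever wrote it.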

So here's the question: what kind of text has low perplexity and low burstiness?

Simple: language that's smooth, vocabulary that's precise, sentences that are neat, logic that's seamless, expression that's restrained—and on top of that, structural parallelism.

Notice something? That describes Zhu Ziqing and Wang Bo perfectly. Every line in Preface to the Pavilion of Prince Teng is parallel, every character exact: "The setting clouds and the solitary duck fly together, the autumn river and the vast sky share one color." Faced with perfection that dense, a statistical model concludes that only an AI could pull this off, because no human is that consistent. By the algorithm's own logic, giving the Preface 100% is perfectly reasonable.

Here's the irony

Why did Moonlight over the Lotus Pond score 62.88%? Because Zhu Ziqing's prose has a distinctive restraint and uniformity—reduplicated words, short sentences, rhythm, muted emotions. These are exactly the qualities large language models are trained to mimic. AI is trained on human texts, and those training datasets probably include Zhu Ziqing and Wang Bo. So the more mature an AI gets, the more it writes like a master. And conversely, the more a master looks like an AI.

What the detection system is really measuring is how closely a text's polished, "mature" language matches the distribution of its training corpus. But the number it spits out gets read as the probability of academic misconduct. It's like using how "standard" a dictionary example sentence sounds to decide whether a human wrote it: the more standard it is, the more machine-like it seems.

Speaking of which, remember this? OpenAI quietly shut down its own AI text classifier in July 2023, citing its low rate of accuracy. Yet despite repeated criticism, detection tools haven't disappeared. Instead, they've been widely deployed across the education system.

Now universities in China have made AIGC detection a hard requirement: not just "recommended," but a clearly specified red line. Sichuan University: liberal arts ≤20%, sciences ≤15%. Guangxi Normal University, Nanjing University of Aeronautics and Astronautics: ≤40%. Cross the line, and the thesis is returned. Fix it or no defense.

I asked Ernie Bot about this. It gave a pretty lucid answer: "40% is a detection threshold, not a permitted usage." But how many students can tell the difference? Face to face with the system, you stare at that number, not knowing which parts were flagged, or what the rationale is. This black box holds the key to whether you graduate.

Students start fighting back hard

To bring that number down, students are adopting all sorts of countermeasures. Besides replacing commas with periods, there are even more extreme methods.

Someone discovered that CNKI's system has a "stitching alert" mechanism: if the writing style varies too much across chapters, the score gets bumped up by 10 to 15 percentage points. So you can't just fix a few sections; the whole paper has to be stylistically uniform. CNKI's AIGC detection uses deep learning models to check language patterns, semantic coherence, and characteristic vocabulary at the same time, so simple synonym replacement doesn't work anymore.

CNKI also added weighted detection: the abstract is weighted 1.8x, the introduction and conclusion 1.5x, and the theoretical foundations section only 0.5x. Do the math: what's the most cost-effective strategy? Grind the abstract and introduction.
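Here is that math as a rough sketch. How CNKI actually combines section scores is not public, so the weight-normalized average below is purely an assumption, as are the function name weighted_ai_rate, the section keys, the equal treatment of section lengths, and the per-section rates in the example. Only the four weights come from the figures quoted above.

```python
# Weights as quoted in the text. The combining rule is NOT known;
# a weight-normalized average is assumed here for illustration only.
SECTION_WEIGHTS = {
    "abstract": 1.8,
    "introduction": 1.5,
    "conclusion": 1.5,
    "theory": 0.5,
}


def weighted_ai_rate(section_rates, weights=SECTION_WEIGHTS):
    """Weight-normalized average of per-section AI rates (each in 0..100)."""
    total_weight = sum(weights[name] for name in section_rates)
    if total_weight == 0:
        raise ValueError("no weighted sections supplied")
    weighted_sum = sum(rate * weights[name] for name, rate in section_rates.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up per-section rates, purely illustrative.
    base = {"abstract": 60.0, "introduction": 40.0, "conclusion": 40.0, "theory": 40.0}
    fix_abstract = dict(base, abstract=30.0)  # cut 30 points from the abstract
    fix_theory = dict(base, theory=10.0)      # cut 30 points from the theory section
    print(weighted_ai_rate(base))
    print(weighted_ai_rate(fix_abstract))
    print(weighted_ai_rate(fix_theory))
```

Under this assumed rule, cutting 30 points from the abstract moves the overall number 3.6 times as far as cutting the same 30 points from the theory section (1.8 / 0.5). That is exactly why the effort flows to the abstract and not to the parts that carry the argument.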

These methods circulate in paid groups, becoming the "heretical cultivation manuals" of graduation season. But the more I think about it, the more absurd it feels—students aren't trying to improve the quality of their theses. They're studying how to bypass a detector that can't even recognize Preface to the Pavilion of Prince Teng.

The awkward position of academia

A 2023 study from Stanford already pointed out the systemic issue: GPT detectors mislabeled more than half of the TOEFL essays written by non-native speakers as AI-generated, while the false positive rate for native speakers was near zero. Non-native writers rely more on common sentence patterns and vocabulary, which fall right into the detector's high-risk zone. Chinese students and scholars writing in English, whose sentences tend to be more standardized and clearer, get flagged at far higher rates than native speakers.

This isn't just a technical flaw. It's a systemic punishment of people who write carefully.

A study from the University of Chicago Booth School of Business hits even harder: the actual false positive rate of commercial detection tools is far higher than the claimed <1%, and varies wildly with text type and length. The same passage can score 20 to 40 percentage points differently across CNKI, VIP, and Wanfang. Worse, many top universities use CNKI's AMLC. Whatever platform the school uses, that's what the student is stuck with. No choice.

So we come back to the core issue

A tool that can't tell the difference between a classic essay and AI-generated text—what right does it have to decide whether a student cheated?

To me, the absurdity works on two levels. Technically, a detector can only do statistical inference, not verification—it always outputs "looks like AI," not "is AI." Systemically, universities adopt an immature tool as a hard threshold while pushing human review to the sidelines. One professor ran an experiment: a purely AI-written paper scored 0% AIGC

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
