0% Completion Rate! Claude, GPT, and Gemini All Wiped Out, S (English)

Generated: 2026-06-20 14:16:06

---

Frustrating! So damn frustrating!

Let me tell you, last weekend I totally lost it. I spent the entire weekend trying to get the most powerful AI to help me with a small task: porting a few hundred lines of code to another language. Guess what? After two days of messing around, I ended up rolling up my sleeves and rewriting the core logic from scratch myself.

At first, I naively thought maybe my prompt wasn't good enough, or I hadn't set up the scaffolding right. Then I saw an evaluation of AI coding abilities, and it all clicked:

It's not that I suck, it's that this thing genuinely has no clue how to do real engineering!

---

Nine of the newest and most powerful models—including Claude 3 Opus, GPT-4o, and Gemini 1.5 Pro—tested on 200 real open-source projects. Full pass rate: 0%.

Zero. A big fat zero.

You heard me right. All the models that top the leaderboards went in—and every single one was wiped out.

Here's how the test worked: you're given an executable file plus documentation, and you have to rebuild a program with exactly the same functionality. The criterion is called "behavioral equivalence"—no restrictions on language, no requirement to preserve the original architecture. As long as the input-output matches perfectly, you pass. There are over 200,000 test cases, and you can't get a single one wrong. Pretty lenient, right?
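To make "behavioral equivalence" concrete, here is a minimal sketch of what such a check could look like. It is not the evaluation's actual harness: the names `Program`, `from_command`, and `first_mismatch` are made up for illustration, and comparing only exit code and stdout is an assumption about what "input-output matches" means.

```python
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple

# Assumption: a "program" is anything that maps stdin bytes
# to a pair of (exit code, stdout bytes).
Program = Callable[[bytes], Tuple[int, bytes]]


def from_command(argv: List[str], timeout: float = 10.0) -> Program:
    """Wrap a command line (for example a compiled binary) as a Program."""
    def run(stdin_data: bytes) -> Tuple[int, bytes]:
        proc = subprocess.run(
            argv, input=stdin_data, capture_output=True, timeout=timeout
        )
        return proc.returncode, proc.stdout
    return run


def first_mismatch(
    reference: Program, candidate: Program, cases: Iterable[bytes]
) -> Optional[bytes]:
    """Return the first input where the two programs differ, or None."""
    for case in cases:
        if reference(case) != candidate(case):
            return case
    return None


def behaviorally_equivalent(
    reference: Program, candidate: Program, cases: Iterable[bytes]
) -> bool:
    """Strict criterion: a single differing case fails the whole task."""
    return first_mismatch(reference, candidate, cases) is None


# Hypothetical stand-ins for the original executable and the rewrite.
def reference_upper(data: bytes) -> Tuple[int, bytes]:
    return 0, data.upper()


def candidate_upper(data: bytes) -> Tuple[int, bytes]:
    if not data:  # edge case the rewrite gets wrong: empty input
        return 1, b""
    return 0, data.upper()


cases = [b"hello", b"Mixed Case", b""]
print(behaviorally_equivalent(reference_upper, candidate_upper, cases))
print(first_mismatch(reference_upper, candidate_upper, cases))
```

The rewrite above gets two of three cases right and still fails, because the empty input is the one it mishandles. With real binaries, each side would be wrapped with `from_command([...])` instead of a Python function.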

Still zero.

When this news broke, several of my friends in AI infrastructure shared it with the exact same comment: "Stop hyping it."

---

1. Perfect Scores on Benchmarks, But Useless in the Real World

Honestly? I wasn't surprised.

You've probably seen it too: you give GPT-4 a description of an algorithm, and it spits out code that's more elegant than anything you'd write. Makes you want to bow down right there. But give it even a slightly structured project—just a CLI tool with a few subcommands, a config file, and a data processing pipeline—

And it falls apart.

It starts cramming all the logic into a single file, completely ignores module separation, handles exceptions purely by luck, forgets one thing while editing another across files, and eventually the build just fails.

There's one chart in the evaluation report that really stings: the models do produce partially correct code on many individual tasks, and a few even come close to passing. But as soon as the behavioral equivalence test covers all the edge cases? They all collapse.

Like a student who can answer most multiple-choice questions correctly but can't even hold the pen when it comes to a comprehensive problem.

Let me put it bluntly: the model isn't writing a program—it's filling in the blanks. That's miles away from actual software engineering.
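The gap between "mostly right" and "passes" is easy to underestimate, so here is a rough back-of-the-envelope sketch. The numbers are hypothetical (1,000 cases per project, 99.9% per-case accuracy), not figures from the report, and treating cases as independent is a simplifying assumption.

```python
def partial_score(case_results):
    """Fraction of test cases that passed."""
    results = list(case_results)
    return sum(results) / len(results) if results else 0.0


def strict_pass(case_results):
    """All-or-nothing: the project passes only if every case passes."""
    return all(case_results)


def chance_of_full_pass(per_case_accuracy, num_cases):
    """Chance of a clean sweep, assuming cases pass independently."""
    return per_case_accuracy ** num_cases


# Hypothetical project: 1,000 test cases, exactly one of them wrong.
results = [True] * 999 + [False]

print(partial_score(results))            # 0.999
print(strict_pass(results))              # False
print(chance_of_full_pass(0.999, 1000))  # roughly 0.37
```

So under these made-up numbers, a program that is right 99.9% of the time would still be expected to fail the strict criterion on most projects of that size. That is the "multiple-choice versus comprehensive problem" gap in arithmetic form.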

---

2. Worse Than a Zero: Not Even Taking the Test

And here comes the ironic twist.

There was a model that scored over 80% on SWE-bench, and plenty of big names sang its praises, saying it deserved a major version bump. But in this evaluation? It didn't even get a zero—it flat-out refused all 200 tasks.

All of them. Refused!

Why? The model's cybersecurity classifier thought "rebuilding a compiled binary" crossed a line, triggered self-protection, and simply refused to execute.

Imagine you're in the middle of a project, going full speed, and suddenly the system tells you, "Sorry, I can't do this." And gives you no reason at all.

Remember that whole "dumbing-down scandal" a while back? Anthropic admitted they quietly made the model less intelligent on certain queries. Now it's gotten so dumb it won't even try.

I talked to a startup team doing AI code review, and they said the biggest headache is exactly this kind of "silent strike." The model just refuses to work, no explanation, and the entire delivery schedule falls apart.

So tell me, what's more soul-crushing than a zero? Not even attempting the test.

---

3. The One You Should Really Blame Isn't the Model—It's the "House" Behind It

You're probably thinking: this is just a set of tasks designed to trip up models, right? Real projects are never that extreme.

I partly agree. But look at the projects used in the evaluation—they're all simplified versions or subsets of popular open-source tools like ffmpeg, SQLite, and ripgrep. They're not asking you to rebuild the entire system, just write a functionally equivalent program from scratch.

For any experienced engineer, this is doable.

And the models? They stuff all the logic into a single file, show almost zero architectural planning, and introduce new bugs every time they touch a different file.

LangChain once ran a fascinating comparison experiment: the same model, not a single weight changed—they only optimized the engineering structure around it: scaffolding, feedback loops, validation mechanisms, tool calls. The score jumped from 52.8 to 66.5, and the ranking shot from outside the top 30 into the top five.

The model wasn't changed. The infrastructure was.

Even OpenAI told this story: three people used Codex to write a million lines of code and merge 1,500 PRs in five months. But guess what their focus was? It had long since shifted from "writing code" to "designing the environment, clarifying intent, building feedback loops, letting the agent see its own results, verify, and fix."

Their own conclusion: a bare model simply can't do serious engineering.

So I really believe it's a huge mistake to pin all your hopes on the next smarter model. What you should really bet on is giving it the right reins.

Think of it this way: you have a thoroughbred horse. It can run like the wind, but without reins, without a saddle, without direction, it'll just gallop aimlessly across the plains. What you need to do is give it good roads, a navigation system, and a repair crew.

That's what they call "Harness Engineering"—putting a harness on this wild horse, equipping it with tools, constraints, and feedback. Otherwise, you can crank the context window up to 1 million tokens, and it still can't manage a ten-file project.
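If "feedback loop" sounds abstract, here is a toy sketch of the idea: generate, validate, feed the failures back, repeat. It is not LangChain's or OpenAI's setup. Every name here (`harness_loop`, `stub_model`, `validate_add`) is hypothetical, and the "model" is a hard-coded stub standing in for a real model call.

```python
from typing import Callable, List, Optional, Tuple

Generate = Callable[[str, List[str]], str]  # (task, feedback) -> source code
Validate = Callable[[str], List[str]]       # source code -> failure messages


def harness_loop(
    task: str, generate: Generate, validate: Validate, max_rounds: int = 5
) -> Tuple[Optional[str], int]:
    """Keep asking for a candidate until validation reports no failures."""
    feedback: List[str] = []
    for round_number in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        failures = validate(candidate)
        if not failures:
            return candidate, round_number
        feedback = failures  # the agent gets to see its own results
    return None, max_rounds


def validate_add(candidate_source: str) -> List[str]:
    """Load the candidate and check a function named add on a few inputs."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
    except Exception as exc:
        return [f"does not load: {exc!r}"]
    add = namespace.get("add")
    if not callable(add):
        return ["no function named add"]
    failures = []
    for a, b in [(0, 0), (2, 3), (-1, 1)]:
        got = add(a, b)
        if got != a + b:
            failures.append(f"add({a}, {b}) returned {got}, expected {a + b}")
    return failures


def stub_model(task: str, feedback: List[str]) -> str:
    """Stand-in for a model: wrong at first, corrected once it sees failures."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


solution, rounds = harness_loop("write add(a, b)", stub_model, validate_add)
print(rounds)  # 2
```

The design point is that the loop, not the model, decides when the work is done. A candidate only leaves the loop when the validator has nothing left to complain about.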

---

4. So What Should We Actually Do?

Don't mythologize it.

Don't lose hope either.

Look, this evaluation result is actually a good thing. It pops part of the bubble. Next time someone tells you "AI can already replace engineers," just show them this: nine top-tier models, 200 real projects, pass rate 0%.

But at the same time, remember this: those same models, paired with the right engineering framework, can let a three-person team build in a few weeks what used to take two months.

Here's what I do now: I use small models to write code snippets, run tests, and generate documentation. I use big models for high-difficulty reasoning. But the core architecture design, module decomposition, and multi-file coordination? I still do that myself.

And I always add an automated fuzzing layer to the loop, so every time the agent makes a change, that change has to survive validation before I accept it, instead of the agent just telling me "it's done."
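For a sense of what a fuzzing layer like this might look like, here is a minimal differential-fuzzing sketch: throw random inputs at a trusted reference and at the agent's version, and stop at the first disagreement. This is an illustration, not my exact setup. The function names, the tiny alphabet, and the buggy candidate are all made up for the example.

```python
import random
from typing import Callable, Optional, Tuple


def outcome(fn: Callable[[bytes], bytes], data: bytes) -> Tuple[str, object]:
    """Run fn and capture either its result or the type of error it raised."""
    try:
        return "ok", fn(data)
    except Exception as exc:
        return "error", type(exc).__name__


def fuzz_compare(
    reference: Callable[[bytes], bytes],
    candidate: Callable[[bytes], bytes],
    rounds: int = 200,
    seed: int = 0,
    max_len: int = 32,
    alphabet: bytes = b"ab \n",
) -> Optional[bytes]:
    """Return a random input on which the two functions disagree, or None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(rounds):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.choice(alphabet) for _ in range(length))
        if outcome(reference, data) != outcome(candidate, data):
            return data
    return None


def reference_sort(data: bytes) -> bytes:
    """Trusted behavior: sort the bytes."""
    return bytes(sorted(data))


def agent_sort(data: bytes) -> bytes:
    """Hypothetical agent change with a bug: it silently drops duplicates."""
    return bytes(sorted(set(data)))


counterexample = fuzz_compare(reference_sort, agent_sort)
print(counterexample is not None)  # True
```

A small alphabet is a deliberate choice here: it makes repeated bytes common, so the duplicate-dropping bug shows up almost immediately. The counterexample it returns is what goes back to the agent as feedback.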

AI is still a long way from being a real software engineer. That fact itself isn't scary. What's scary is pretending the gap doesn't exist, and then realizing at delivery time that no one can conquer the world with a single monolithic file.

Oh, one last thing.

That model that turned in a blank test in this evaluation? It wasn't because it lacked ability. It was because it didn't dare. It wouldn't even take the paper. A zero at least means you showed up. That... was a forfeit.

Cael Lee