Three Evolutions of the AI Engineering Paradigm: Prompt Engineeri (English)
Generated: 2026-06-23 03:24:01
---
Guess what? The same model can sometimes be dumber than a rock, and other times it's a straight-up genius!
I need to come clean about something: for the past six months, I've been losing sleep over a single, stupid problem.
Have you ever had this experience? You're using the exact same top-tier AI model. One day it's like an absolute pro, knocking out your work in no time flat. The next day? Oh boy — it's going around in circles on the same old problem, and you just want to reach into the computer and smack it!
At first I thought I just wasn't writing good prompts. Then I figured maybe I wasn't giving it enough background material. Then I spent three days devouring a pile of engineering practice reports and realized: ha! The pit I thought I had fallen into wasn't even the real pit!
First faceplant, and I started questioning everything
Let me tell you about one of my embarrassing moments.
Back at the end of 2023, I was putting serious effort into my prompts. I mean, I was writing them like they were college entrance exam essays — "You are a senior engineer", "Please think through this step by step", "Output format must strictly follow these requirements"…
And the results? Hit or miss, total luck of the draw.
One time it really drove me up the wall. I had an agent troubleshoot a production crash — I pasted in the full stack trace, error logs, wrote a few hundred words of prompt: "You are a top-tier SRE, please analyze the root cause of the crash, pay special attention to xxx"…
And guess what it did?
It went flailing around the codebase like a headless chicken! Dug through tons of irrelevant files, and churned out a report that looked thorough but was completely useless. I stared at the screen, blood pressure through the roof.
Then I ended up troubleshooting it myself. Took me an entire afternoon to untangle the issue. And the root cause? Dead simple — a symbol rename had broken a call chain. That's it. A total beginner-level problem, and this so-called code-savvy AI couldn't handle it.
That's when I started to think: something's off. Are we missing the point here?
All that effort, and it performed like a clown
Later I switched my approach, and the result made me slap my forehead.
I stopped stacking up prompt instructions. Instead, I added something to the agent — let's call it a "crash analysis skill." Nothing fancy, just a handful of concrete steps:
- When you hit a crash, first search for these keywords
- Then call this tool (I'd pre-configured a script index)
- Check the git history to track when the symbol in question was introduced or changed
- Look at how similar cases were solved in the past
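To make the idea concrete, here is a minimal sketch of how a skill like this could be written down as a fixed, ordered playbook rather than as more prompt text. Treat it as an illustration under assumptions, not my exact setup: `Step`, `build_playbook`, `render`, the script path `scripts/index_callers.py`, and the `docs/postmortems` directory are all made-up names. Only `git grep` and `git log -S` (which lists commits that added or removed a string) are real git commands.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    command: list[str]  # argv the harness would execute for this step


def build_playbook(symbol: str) -> list[Step]:
    """Return the fixed investigation path for one suspicious symbol."""
    return [
        # 1. Narrow the search space with a keyword search.
        Step("search_keywords", ["git", "grep", "-n", symbol]),
        # 2. Call the pre-configured tool (hypothetical script index).
        Step("call_chain", ["python", "scripts/index_callers.py", symbol]),
        # 3. Find commits that added or removed the symbol.
        Step("git_history", ["git", "log", f"-S{symbol}", "--oneline"]),
        # 4. Look up similar past cases (hypothetical postmortem directory).
        Step("past_cases", ["git", "grep", "-n", symbol, "--", "docs/postmortems"]),
    ]


def render(playbook: list[Step]) -> str:
    """Format the playbook as numbered instructions to hand to the agent."""
    return "\n".join(
        f"{i}. {step.name}: {shlex.join(step.command)}"
        for i, step in enumerate(playbook, start=1)
    )


if __name__ == "__main__":
    print(render(build_playbook("parse_config")))
```

The point of the design is that the order is data, not advice: the agent gets a path to follow, and the harness can log which step it is on.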
That's it. When I ran it again, the difference was night and day — the agent stopped wandering around blindly. It followed a clear path: first search keywords to narrow things down, then use the tool to analyze the call chain, then check the git log. Pinpointed the problem in half an hour.
I had a full-on rethink: What the agent was missing wasn't "explain the task again more clearly" — it was "a clear, actionable investigation path."
When I say that, do you get that "ohhh, so that's it" feeling too?
Three things, and they're not the same thing at all
From that day on, I started thinking hard about what these three things actually are.
Layer 1: Prompt. It's about "how to explain the task clearly." Is it important? Yes. But don't mythologize it. No matter how smart the model is, deep down it's a statistical machine. It's not going to become more cautious just because you wrote "please analyze carefully." Do you really believe it would?
Layer 2: Context. It's about "what to feed the model." Logs, code, cases, documentation — if you feed it the right stuff, the improvement is clear. But it only optimizes "cognitive input" — knowing more doesn't guarantee executing more reliably.
Layer 3: Harness. This is the real game-changer. It's about "how to make the model reliably do useful work in a real environment."
Think about it — before, when an agent performed poorly, what was my first reaction? "Write a longer prompt."
Now? My first reaction is: what is it really missing — a clear objective, background materials, or engineering scaffolding?
Most of the time, it's the scaffolding.
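That triage question can be written down as a tiny checklist. This is only a sketch of the mental model: `RunReport`, its three fields, and `missing_layer` are names I'm inventing here, and the order of the checks is an assumption (ask about the cheapest thing to fix first).

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    goal_was_stated: bool        # did the prompt define objective and output format?
    had_relevant_material: bool  # were the right logs, code, and cases in context?
    had_tools_and_checks: bool   # tools, a fixed path, validation, recovery


def missing_layer(report: RunReport) -> str:
    """Return the first layer that was missing: prompt, context, or harness."""
    if not report.goal_was_stated:
        return "prompt"
    if not report.had_relevant_material:
        return "context"
    if not report.had_tools_and_checks:
        return "harness"
    return "none"


if __name__ == "__main__":
    # The production-crash story: clear goal, full logs, but no investigation path.
    print(missing_layer(RunReport(True, True, False)))
```

Run against my production-crash faceplant, it lands on the harness: the goal and the logs were there, the investigation path was not.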
A case study that made my jaw drop
Let me tell you a true story. A top-tier AI team overseas ran an experiment: they had an agent build a large-scale production system under the supervision of three engineers, eventually producing millions of lines of code with almost none of it written by hand. The key was a highly counterintuitive rule: no hand-written code allowed.
This wasn't showing off. It was a deliberate forcing constraint.
When you absolutely cannot write code yourself, you have no choice but to convert all that tacit knowledge in your head into explicit rules the system can follow: which architecture patterns are allowed, what checks must pass, how codebase conventions are unified… All the stuff that used to rely on intuition, experience, gut feeling — all written out as hard constraints.
That's the essence of Harness Engineering: turning human implicit knowledge into rules the agent can follow.
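What does "a rule the agent can follow" look like in practice? One possible form is a check that runs on every change and fails loudly. The sketch below is my own illustration, not that team's tooling: the layer names and the `ALLOWED_IMPORTS` table are hypothetical, and it only uses Python's standard `ast` module to read import statements.

```python
import ast

# Hypothetical layering rule: which internal packages each layer may import.
ALLOWED_IMPORTS = {
    "ui": {"service"},
    "service": {"storage"},
    "storage": set(),
}


def violations(layer: str, source: str) -> list[str]:
    """List the imports in `source` that the layering rule forbids for `layer`."""
    internal = set(ALLOWED_IMPORTS)  # only police our own packages
    allowed = ALLOWED_IMPORTS[layer] | {layer}
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top in internal and top not in allowed:
                found.append(f"{layer} must not import {top}")
    return found


if __name__ == "__main__":
    print(violations("ui", "import storage.db\nfrom service import api\n"))
```

A senior engineer "just knows" the UI shouldn't reach into storage. Written like this, the agent gets an error message instead of relying on anyone's gut feeling.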
Think about how mind-blowing that is.
New research says: stop wasting your time on the wrong thing
A review paper published this year put it pretty clearly. It introduced the concept of "binding constraints" — for long-horizon tasks, performance variance mainly comes from the Harness, not the model itself.
What's the evidence? One team only adjusted the editing tool and architecture configuration, keeping the model weights completely unchanged, and improved coding benchmark scores across 15 different models, by up to nearly 10x. Another team kept the same model, just rewrote the system prompt and added intermediate validation steps, and the end-task score jumped from 52.8% to 66.5%. Yet another team used auto-optimization to tune the architecture and hit 76.4%.
Compare that to model iteration itself, which usually brings only a couple of percentage points of improvement.
Think about it: swapping models without swapping the Harness only gets you so far. And I'm not making this up.
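"Intermediate validation steps" sounds abstract, so here is a minimal sketch of the general pattern: check each step's output before moving on, and retry the step if the check fails. I don't know how the teams above implemented it; `run_with_validation`, `Step`, and `Check` are names I made up, and a real harness would call a model where this sketch calls a plain function.

```python
from typing import Callable

Step = Callable[[str], str]    # takes the current state, returns a new state
Check = Callable[[str], bool]  # returns True when the new state is acceptable


def run_with_validation(
    state: str,
    steps: list[tuple[Step, Check]],
    max_retries: int = 2,
) -> str:
    """Run steps in order; re-run a step until its check passes or retries run out."""
    for index, (step, check) in enumerate(steps):
        for _attempt in range(max_retries + 1):
            candidate = step(state)
            if check(candidate):
                state = candidate  # accept the result and move to the next step
                break
        else:
            # No attempt passed the check: stop instead of building on bad output.
            raise RuntimeError(
                f"step {index} failed validation after {max_retries + 1} attempts"
            )
    return state
```

The model is untouched here. All the change is in what happens between its outputs, which is exactly the part the benchmark numbers above are crediting.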
How I do things now
Let me tell you my current workflow — it's completely reversed from before.
I used to write the prompt first, then add materials, and only at the end think about how to run it. Completely backwards. Now? I start with the Harness: What runtime environment does this agent need? How should the toolchain be set up? How do I manage state? How do I recover from failure? Then I design the context flow: what should it see at each step, what memories to keep, what to forget. Only at the end do I write the prompt, defining the objective and output format.
Reverse the order, and the difference is real.
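For the "manage state" and "recover from failure" questions, one simple answer is a checkpoint file. The sketch below is an assumption about how that could look, not a description of any specific framework: `run_resumable` and the JSON checkpoint layout are hypothetical, and each step is a plain function standing in for an agent action.

```python
import json
from pathlib import Path
from typing import Callable


def run_resumable(
    steps: list[tuple[str, Callable[[dict], dict]]],
    checkpoint: Path,
) -> dict:
    """Run named steps in order, saving state after each so a crash can resume."""
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"done": [], "state": {}}
    for name, step in steps:
        if name in saved["done"]:
            continue  # finished in an earlier run, so skip it
        saved["state"] = step(saved["state"])
        saved["done"].append(name)
        checkpoint.write_text(json.dumps(saved))  # persist progress after every step
    return saved["state"]
```

If a step blows up halfway through a long task, the next run picks up from the last completed step instead of starting over, and none of that depends on how the prompt is worded.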
One last honest truth
Think about it — as models get more capable, what do they really need? It's not "stricter control," but "a stage where they can do their thing freely."
Humans need to build that stage, not constantly correct every move they make.
If you're using top models and seeing poor results, odds are it's not that the model isn't smart enough — it's that you haven't given it a good enough environment to run in. Even the best engine is useless in a broken system.
This might be the real battleground for future programmers — not writing code, but designing the environment that enables agents to produce high-quality output.
Let me leave you with this:
**If you want to make an agent better, you either upgrade the model or build a better Harness. Model iteration is already slowing down. The room for improvement on the Harness side? Probably bigger than you think.**
I recently completely overhauled how my team thinks about agent troubleshooting. Most of the problems come down to the Harness.
It's not the model that's failing.
It's that you haven't built the stage yet.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.