解析Agent框架中的上下文管理策略 (English)
解析Agent框架中的上下文管理策略 (English)
Generated: 2026-06-23 13:52:22
---
Okay, I've carefully reviewed the original text, corrected the factual contradictions and inconsistencies in the data, removed the rigid clichés, and broke up the overly neat parallel sentence structures to make the rhythm more natural. Here's the revised final version:
---
Why Does Your Agent Project Always Fail? 90% of People Trip on This One Step
Last month, a friend of mine went for an AI position at a big tech company.
On his resume, he wrote "worked on an Agent project." The interviewer's eyes lit up on the spot.
For the next fifteen minutes, almost every question hit the same sore spot—context management.
When should you compress? How do you continue the conversation after compression? What should the prompt for the compression summary look like? How do you draw the architecture diagram?
Lucky for him, he'd read through two of my code analysis articles before. He held his ground.
In the end, he actually passed.
Afterwards, during his debrief, he said the interviewer cared less about how well you memorized your knowledge base and more about whether you'd personally stepped into that pit.
Think about it—it makes sense. Anyone can draw an architecture diagram on paper. But only someone who's actually done it knows: the model suddenly loses memory because the truncation strategy is too crude; the Agent calls the same tool five times because the key instructions got eaten by the summary.
So I decided to break this topic down completely.
This isn't a conceptual overview.
I'm going to open up, one by one, how the mainstream frameworks handle context today.
Next time an interviewer asks, "How do you manage context in your Agent project?" you'll be able to confidently rattle off a well-structured answer.
Instead of choking out: "We… used truncation."
---
You Have No Idea What Kind of "Box" It's Dealing With
Let's start with the underlying logic.
Large language models have no memory.
Every time you send a question, it reads everything from scratch—the system prompt, the conversation history, the sentence you just asked—all crammed into a fixed-size box.
This capacity is called the context window.
Its unit is tokens. One Chinese character is roughly 1 to 2 tokens; one English word averages about 1 token.
Many models claim a 128K window, which sounds pretty big, right?
After just one or two rounds of conversation plus tool calls, a quarter of it is gone.
But here's the thing—an Agent isn't a chat.
A chatbot can go for dozens of rounds. The earlier "What's the weather like today?" is fine to discard.
An Agent is different. It has to execute tasks, call tools, read files, run tests. One single read_file, and hundreds of lines of code go in. One terminal, and thousands of lines of logs scroll by.
Add to that the tool schemas, error stacks, the user's original requirements, fixed constraints set earlier…
After just a few rounds, the box is full.
What's worse is: being full doesn't just affect the current round.
When the context is stuffed, the model's attention behaves like you in the middle of a long afternoon meeting—it starts losing information, ignoring instructions, even repeating operations it's already done.
What you see is: the Agent is still running, but it's already "losing its memory." It doesn't remember which file it just modified ten minutes ago. It doesn't recall that the user said, "Don't touch this config."
So, at first, I thought this was simple.
When it's full, just truncate. Chop off the front, keep the most recent rounds.
Then I got proven wrong.
Quite spectacularly, too.
---
The Early Truncation Disasters: Not Just One Pit, A Chain of Pits
A few years ago, almost all Agent frameworks used the same crude approach: set a token threshold—say, 80% of the window—and once that's exceeded, compression fires off automatically.
How did they compress? Either they dropped the earliest conversations entirely, or they simply kept the most recent few rounds.
You can imagine what that leads to.
First, cliff-edge triggering.
For the first 90%, the system has no reaction at all. The longer the conversation, the more sluggish the model gets. You think it's thinking? It's already swimming in information overload.
Then, suddenly, compression kicks in.
And in one go, it crushes dozens of rounds of history into a single summary.
The model, just overwhelmed by being overstuffed, now gets most of its context ripped away.
It's like someone is trying hard to digest a big meal, and you reach over and yank every plate off the table, leaving only a single vegetable stem.
Second, full-summary loses details.
Decades of messages compressed into a few hundred words. No matter how well you write your prompt, the variable names, function signatures, error stacks, the user's exact words—they're all gone.
And these are exactly the things the Agent needs most to do its job.
I tried a scenario once:
In the first round, the user said, "Don't modify that config.yml file."
Later, when the Agent was working, it completely forgot. Because it had been summarized into something vague like "keep certain configuration files unchanged."
Third, no priority distinction.
All history is treated equally. An explicit constraint from the user and a casual "that looks nice" are handled the same way.
Large chunks of logs from tool outputs and critical error messages get compressed indiscriminately.
Honestly, this isn't a flaw unique to any one framework. It's a common problem in almost all early implementations.
My very first Agent project used this strategy.
Later, I got sick of revising it.
---
Six Products, Six Philosophies: This Is the Key
Starting last year, many products started taking context compression seriously.
I spent a fair amount of time digging into all the mainstream approaches.
And I found something—
Their ideas diverge wildly.
Here's a table for you, the core secret weapons of each product:
| Product | Core Strategy | One-Sentence Summary |
|---|
| Claude Code | Five-stage pipeline, ordered by increasing cost | Cheap local operations first, LLM summary as a last resort |
|---|
| Codex CLI | Keep recent user messages raw; replace the rest with handoff summaries | What the user says is most accurate; what the model says can be rewritten |
|---|
| OpenCode | Timestamp hiding + structured summary + replay last user message | Doesn't truly delete; theoretically recoverable |
|---|
| Cline | Auto + manual dual mode; /compact generates a summary and continues within the same task | Gives the user a choice |
|---|
| Cursor | Auto-summary + prompt to start new conversation + searchable history | Can trace back to original history even after compression |
|---|
| Amp | No recursive compression; use /handoff to start a new thread with key points | Long conversations themselves are the problem; changing threads is better than compressing |
|---|
| MemGPT/Letta | Context = RAM, history = disk; Agent autonomously swaps in and out | Operating system-level memory management |
|---|
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.