Context Engineering: Managing the Scarcest Resource in Agent Systems
How advanced agent systems manage finite context windows
Consider an agent running a 50-step coding task.
- Step 1: The agent reads a file (2,000 lines of code).
- Step 5: It executes a shell command (500 lines of output).
- Step 30: It has accumulated dozens of tool results.
The context window is now full—not with useful information, but with the "residue" of past actions. This leads to three primary failure modes:
- Overflow: The agent halts mid-task, wasting all prior work.
- Dilution: Attention spreads across stale data, degrading reasoning quality.
- Cost: 200K tokens are sent when 40K would have sufficed.
The solution is not a bigger context window—it's better architecture. Context engineering is the strategic curation and maintenance of an optimal set of tokens during inference[1]. In practice, it is the architectural discipline of managing this expensive, volatile resource across the lifecycle of a long-running agent[2].
We analyzed four production systems—OpenClaw, Manus, Claude Code, and Codex CLI—alongside LangChain/LangGraph. Despite their independent origins, these systems have converged on a unified structural model.
The Layered Context Model #
Advanced systems treat the context window as a composition of distinct layers, each with its own lifecycle and priority.
| Layer | Content | Persistence | Cost |
|---|---|---|---|
| 1 | System Prompt / Identity | Permanent | Fixed (2-5K) |
| 2 | Conversation History | Truncated/Compacted | Growing (10-100K) |
| 3 | Tool Results | Volatile | High Volatility (5-200K) |
| 4 | External Memory | On-demand | Occasional |
Inside the model, each layer has a different lifecycle. System prompts are permanent—they define the agent's identity. Conversation turns have a half-life of tens of steps—recent turns matter, old turns are trimmed or compacted. Tool results have a half-life of one to several steps—they're maximally useful immediately after execution and decay rapidly. External memory lives outside and is fetched only when relevant.
This layered model gives every system a shared framework for deciding what stays and what goes. But every decision about what to keep, trim, or evict has a hidden cost axis that cuts across all four layers: the KV-cache.
The Cache Cost Axis #
LLM providers cache the key-value attention matrices from previous tokens. When the next request shares the same prefix, those cached computations are reused—saving both latency and cost. With cached tokens at $0.30/M versus $3.00/M uncached for Claude Sonnet, the difference is an order of magnitude[3].
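The savings compound over a long run. A minimal sketch of the arithmetic, using the cached/uncached Sonnet prices cited above (the 50-step run and 190K stable prefix are illustrative numbers, not from any of the systems discussed):

```typescript
// Illustrative cost comparison using the per-token prices cited above:
// $3.00/M for uncached input tokens, $0.30/M for cached ones.
const UNCACHED_PER_M = 3.0;
const CACHED_PER_M = 0.3;

function requestCost(promptTokens: number, cachedTokens: number): number {
  const uncached = promptTokens - cachedTokens;
  return (uncached * UNCACHED_PER_M + cachedTokens * CACHED_PER_M) / 1_000_000;
}

// A 50-step run where each step resends a ~200K-token prompt:
const coldCost = 50 * requestCost(200_000, 0);        // no cache hits
const warmCost = 50 * requestCost(200_000, 190_000);  // stable 190K prefix cached

console.log(coldCost.toFixed(2)); // 30.00
console.log(warmCost.toFixed(2)); // ~4.35
```

A stable prefix turns a $30 trace into a $4 trace, which is why the practices below all protect the prefix.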
Manus calls KV-cache hit rate "the single most important metric for a production-stage AI agent." It enforces five practices to maximize cache hits[3]:
- Stabilize prompt prefixes. No timestamps, no session IDs at the start of the system prompt. Even a single-token difference invalidates everything downstream.
- Keep context append-only. New observations extend the context; nothing is rewritten in place.
- Mask tools, don't remove them. Tool definitions sit near the front of the context. Removing tools invalidates the cache. Instead, Manus uses token-level logit masking at decoding—all tools stay in the prompt, but the model is constrained in which ones it can select. Tool names use consistent prefixes (browser_*, shell_*) so groups can be masked efficiently.
- Ensure deterministic serialization. Many languages don't guarantee stable key ordering when serializing JSON objects. Non-deterministic serialization silently breaks the cache because the token sequence differs between calls even when the content is identical.
- Set explicit cache breakpoints. For frameworks without automatic incremental prefix caching, manually insert cache breakpoints at the system prompt boundary. Use session IDs to route requests to consistent workers in distributed deployments.
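The deterministic-serialization point is easy to violate silently. One way to stabilize JSON output is to sort keys recursively before serializing; `stableStringify` below is an illustrative helper, not code from any of the systems discussed:

```typescript
// JSON.stringify follows object insertion order, so two logically identical
// payloads can serialize to different token sequences and miss the cache.
// Sorting keys recursively makes the serialization deterministic.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Same content, different insertion order—identical serialization:
const a = stableStringify({ tool: "shell_exec", args: { cmd: "ls" } });
const b = stableStringify({ args: { cmd: "ls" }, tool: "shell_exec" });
console.log(a === b); // true
```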
This axis cuts across every context management decision in the layers below—pruning, compaction, trimming—and shapes the context engineering architecture.
Layer 1: System Prompt — Assembly and Stability #
The system prompt is the agent's operating manual. A naive approach is to write one giant static prompt. Every production system we studied takes a modular approach instead—assembling the prompt from sections, including only what's relevant for the current session.
The implementations vary—OpenClaw uses programmatic section builders, Codex CLI and Claude Code use markdown files (AGENTS.md and CLAUDE.md), Manus freezes tool definitions at the front of the prompt—but the principle is consistent: assemble the prompt from parts, include only what's relevant.
Here is an example from OpenClaw, where each section builder independently decides whether to contribute based on the session context:
```typescript
function buildSkillsSection({ skillsPrompt, isMinimal }) {
  if (isMinimal) return []; // Subagents skip this entirely
  if (!skillsPrompt) return []; // No eligible skills → no section
  return ["## Skills (mandatory)", "Scan <available_skills>...", skillsPrompt];
}

function buildMemorySection({ isMinimal, availableTools }) {
  if (isMinimal) return [];
  if (!availableTools.has("memory_search")) return []; // No tool → no section
  return ["## Memory Recall", "Before answering about prior work: run memory_search..."];
}
```
Codex CLI takes a file-system approach: AGENTS.md files are concatenated from ~/.codex/ (global) down to the project directory, with closer files overriding earlier guidance[4]. Claude Code's CLAUDE.md serves the same role—persistent project context that loads automatically, providing "free context that survives restarts"[5].
The rationale is the same across all approaches: stable, reusable context belongs in files, not in conversation history. Project conventions, coding standards, and architectural decisions don't change between turns. By loading them once from a file, they cache efficiently at the token level and never compete with the agent's working context for space.
Layer 2: Conversation History — Trimming, Compaction, and Focus #
Conversation history is the most natural form of context, but it grows unboundedly. Systems typically implement two mechanisms to keep it bounded: trimming (dropping old turns) and compaction (summarizing them).
Trimming #
The approaches split into three philosophies:
- Turn-based trimming (OpenClaw, LangGraph): count backward from the most recent message and keep a fixed number of turns. The model loses the exact wording of early turns, but their effects are often already reflected in current state.
- Token-budget preservation (Codex CLI, Claude Code): keep a fixed token budget of recent context (typically 20-25K), which adapts better when message sizes vary.
- Append-only history + result compression (Manus): preserve conversation turns and aggressively compress stale tool outputs instead.
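Token-budget preservation can be sketched as a backward walk over the history. The sketch below uses a crude character-count estimate where a real system would use its tokenizer, and the 25K default matches the range quoted above:

```typescript
interface Message { role: string; content: string; }

// Stand-in for a real tokenizer; any monotone estimate works for the sketch.
const countTokens = (m: Message) => Math.ceil(m.content.length / 4);

// Walk backward from the newest message, keeping turns until the budget is
// exhausted. Adapts naturally when individual messages vary widely in size.
function trimToBudget(history: Message[], budget = 25_000): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = countTokens(history[i]);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

Compare this with turn-based trimming, which would keep a fixed count of messages regardless of how many tokens each one holds.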
Compaction #
When trimming isn't enough, systems summarize older context into a compressed narrative. Every system implements compaction, but the trigger policy differs:
| System | Trigger policy |
|---|---|
| OpenClaw | Adaptive ratio (window size minus reserve tokens) |
| Claude Code | Percentage threshold (~75%, reduced from 95%) |
| Codex CLI | Absolute token thresholds (about 180K-244K, model-dependent) |
| Manus | Per-step stale-result compaction instead of one global threshold |
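The first two trigger policies in the table reduce to simple predicates. A sketch (the numbers mirror the table; the function names are illustrative):

```typescript
// Adaptive ratio (OpenClaw-style): compact when usage crosses the window
// size minus a reserve kept free for the next model response.
function shouldCompactAdaptive(usedTokens: number, windowSize: number, reserve: number): boolean {
  return usedTokens >= windowSize - reserve;
}

// Percentage threshold (Claude Code-style): compact at ~75% of the window.
function shouldCompactPercent(usedTokens: number, windowSize: number, threshold = 0.75): boolean {
  return usedTokens >= windowSize * threshold;
}
```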
Compaction is not free. Summarization is lossy, and the summary itself consumes tokens—compressing 100K tokens might produce a 5K-token summary that permanently occupies the window. Systems that compact repeatedly accumulate summary-of-summary chains, each generation losing nuance while the summaries claim an increasing share of context. Trigger too late, and there's no room to work after compression. Trigger too often, and accumulated information loss degrades reasoning.
Both trimming and compaction modify earlier context, which invalidates the KV-cache from the edit point onward. This is the fundamental tension: freeing tokens through pruning costs cache locality. Manus sidesteps this by never trimming conversation turns—instead they aggressively compress stale tool results (Layer 3) while keeping conversation append-only.
Focus #
Trimming and compaction manage conversation size. But there's a subtler problem: even within the budget, the model forgets its original objective.
The "lost-in-the-middle" problem is well-documented—models attend more strongly to tokens at the beginning and end of the context, with weaker attention to the middle[3]. In a 50-step agent trace, the original task description sits at the very beginning and the current step is at the very end. The 49 tool calls in between occupy exactly the zone where attention is weakest.
Manus's solution: the agent maintains a todo.md file that it rewrites at each step, "reciting its objectives into the end of the context, pushing the global plan into the model's recent attention span"[3]. Codex CLI converged on a similar pattern with its update_plan tool[6].
The trade-off is real. Manus found that ~30% of all agent actions were todo.md updates—tokens spent on plan maintenance rather than task execution[7]. Worse, constant rewriting destroyed KV-cache locality. They evolved to an on-demand pattern: a dedicated Planner sub-agent returns a structured Plan object only when the agent appears to drift—converting overhead from O(n) per step to O(1) per drift event.
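The O(n)-to-O(1) shift can be sketched as a gate in the agent loop. The stall-count drift heuristic below is illustrative, not Manus's actual detector:

```typescript
interface Plan { steps: string[]; currentStep: number; }

// Illustrative drift heuristic: treat the agent as drifting once it has gone
// several steps without completing anything from the current plan.
function isDrifting(stepsSinceProgress: number, maxStall = 5): boolean {
  return stepsSinceProgress >= maxStall;
}

// Invoke the Planner sub-agent only on drift: O(1) per drift event, instead
// of rewriting todo.md on every step (O(n) over the whole trace).
function maybeReplan(stepsSinceProgress: number, requestPlan: () => Plan): Plan | null {
  return isDrifting(stepsSinceProgress) ? requestPlan() : null;
}
```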
Layer 3: Tool Results — Degradation and Isolation #
Tool results are the largest and most volatile component of the context window. A single file read can produce 5,000 tokens. A shell command might return 10,000. After a dozen tool calls, tool results can consume 80%+ of the available context.
The critical insight shared across all systems: tool results have a decay curve. They are maximally useful immediately after execution and become progressively less relevant as the agent moves on.
Lossy Compression vs. Restorable Compression #
The systems diverge in how they manage this decay:
| Approach | Systems | Mechanism | Trade-off |
|---|---|---|---|
| Lossy compression | OpenClaw, Claude Code, Codex CLI | Trim, clear, or summarize tool outputs | Simpler, but discarded detail may be needed later |
| Restorable compression | Manus | Replace old outputs with compact references while preserving full data on disk | More robust recovery, but higher implementation complexity |
Neither lossy nor restorable compression is strictly better. Lossy compression is simpler and works for shorter sessions. Restorable compression prevents the failure mode where the agent needs data that was already discarded—a real problem in 50+ step tasks.
However, one refinement applies to both: don't compress failed tool results. Manus deliberately keeps errors in context because "when the model sees a failed action—and the resulting observation or stack trace—it implicitly updates its internal beliefs"[3]. Removing error context causes the agent to repeat the same failed approaches. Error context costs tokens but saves the much larger cost of repeated failures.
Note the cache tension here: soft-trimming modifies earlier content, which invalidates the KV-cache from the trim point onward. Manus's restorable compression sidesteps this by appending compact references rather than editing in place—preserving the cached prefix while still freeing effective attention from stale results.
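Restorable compression can be sketched as follows. This is a hypothetical store, not Manus's implementation (which uses the sandbox file system), and `writeToDisk` is a stand-in for real persistence; note that failed results are deliberately left intact, per the point above:

```typescript
interface ToolResult { tool: string; ok: boolean; output: string; path?: string; }

// Replace stale successful outputs with a compact on-disk reference while the
// full data survives on disk. Errors stay verbatim so the model keeps the
// evidence of what failed.
function compressStale(
  results: ToolResult[],
  staleAfter: number,                      // results older than this many steps are stale
  writeToDisk: (r: ToolResult) => string   // persists the result, returns its path
): ToolResult[] {
  return results.map((r, i) => {
    const isStale = i < results.length - staleAfter;
    if (!isStale || !r.ok) return r;       // keep fresh results and all errors
    const path = r.path ?? writeToDisk(r);
    return { ...r, path, output: `[output stored at ${path}]` };
  });
}
```

Because the full output remains recoverable by path, a later step that unexpectedly needs the data can re-read it instead of failing.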
Sub-Agent Isolation: Preventing Results from Entering #
When degradation isn't enough, there's a more radical strategy: prevent tool results from entering the main window at all. Every production system except Codex CLI implements sub-agents primarily as a context isolation mechanism—not for specialization, but for keeping the main agent's window clean.
Manus's co-founder frames it via the Go concurrency principle: "share memory by communicating, don't communicate by sharing memory"[3].
A sub-agent might make 30 tool calls, encounter 5 errors, and read 20 files to produce a 3-paragraph answer. Without isolation, all of that would accumulate in the main agent's window. With isolation, the main agent receives only the conclusion—"since context is your fundamental constraint, subagents are one of the most powerful tools available"[5].
OpenClaw's implementation shows the mechanics:
```typescript
// OpenClaw subagent-announce.ts: only the conclusion flows back
const reply = await readLatestAssistantReply({ sessionKey: childSessionKey });
const triggerMessage = [
  `A background task "${label}" just ${statusLabel}.`,
  "Findings:",
  reply || "(no output)",
  `Stats: runtime ${formatDurationShort(duration)} • tokens ${formatTokenCount(total)}`,
].join("\n");
```
Manus adds graduated sharing: for simple discrete tasks, only instructions pass to the sub-agent. For complex interdependent tasks, the planner shares its full context but the sub-agent maintains its own action space[7]. This avoids the overhead of re-establishing context when the sub-task genuinely depends on the parent's history.
Layer 4: External Memory — Persistence Outside the Window #
The first three layers manage what's inside the context window. External memory manages what's outside—a persistent knowledge store that the agent can query when needed, never consuming context space by default.
Every system implements external memory, but they occupy different points on a complexity spectrum—from simple file loading to semantic retrieval.
At the simple end, Codex CLI loads AGENTS.md files and session transcripts wholesale at startup[4]. Claude Code adds a second tier: CLAUDE.md for stable project knowledge and session_memory files for volatile session state, with compaction summaries persisted automatically[5]. Manus treats the entire sandbox file system as infinite memory—agents write intermediate results to files and retain only paths in context[3].
At the sophisticated end, OpenClaw maintains markdown files indexed into a SQLite vector store with hybrid search (70% vector similarity + 30% BM25). Memory is pulled into context on demand via memory_search, never loaded by default. LangChain/LangMem provides the framework abstraction: episodic, procedural, and semantic memory types with embedding-based retrieval[2].
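The hybrid weighting is a straightforward linear blend. A sketch under the assumption that both scores are pre-normalized to [0, 1] (the weights are the 70/30 split described above; the function name is illustrative):

```typescript
interface Scored { id: string; vectorScore: number; bm25Score: number; }

// OpenClaw-style hybrid ranking: 70% vector similarity + 30% BM25.
// Assumes both scores are already normalized to [0, 1].
function hybridRank(candidates: Scored[], vectorWeight = 0.7): Scored[] {
  const bm25Weight = 1 - vectorWeight;
  const score = (c: Scored) => c.vectorScore * vectorWeight + c.bm25Score * bm25Weight;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

The vector term catches semantic matches ("the auth bug" finds "login failure"); the BM25 term catches exact identifiers that embeddings blur.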
The connection to compaction is direct: compaction is lossy, so anything important should be externalized before the context is compressed. OpenClaw implements this as a pre-compaction memory flush—when context nears the compaction threshold, a silent agentic turn fires, asking the model to persist important facts (user preferences, blocking dependencies, architectural decisions) to memory files. The persisted facts survive summarization and remain recoverable via search. This turns the context window from a lossy buffer into a system with explicit persistence semantics.
Key Takeaways #
These systems were built independently. They still converged on the same core lesson: context is not just a model limit, it is the primary systems constraint in long-running agents.
Three system design tensions show up repeatedly:
- Fidelity vs. capacity. Every compression step (trimming, compaction, tool-result pruning) trades information for space. The strongest designs do not compress everything equally. They compress by decay curve: system prompts stay stable, tool outputs degrade quickly, error traces are preserved, and durable facts are pushed to external memory before compaction.
- Pruning vs. caching. Freeing tokens helps capacity, but editing earlier tokens breaks KV-cache locality and increases cost and latency. Append-only strategies preserve cache but accumulate stale context. Aggressive pruning recovers space but invalidates cached prefixes. The practical direction is cache-aware pruning: deterministic serialization, fewer rewrite points, and append-style markers where possible.
- Autonomy vs. overhead. Planning aids (todo.md, update_plan) and sub-agents improve focus, but they consume tokens, time, and coordination effort. The right frequency depends on task length and coupling. Short tasks can stay simple; long tasks often need explicit planning and isolation.
The practical takeaway is architectural, not prompt-level: treat context like a managed resource with lifecycles, budgets, and persistence rules.
References #
Citation #
If you found this useful, please cite this blog as:
Non-linear AI. (Feb 2026). Context Engineering: Managing the Scarcest Resource in Agent Systems. non-linear.ai. https://non-linear.ai/blog/context-engineering/
or
```bibtex
@article{ai2026contextengineering,
  title   = {Context Engineering: Managing the Scarcest Resource in Agent Systems},
  author  = {{Non-linear AI}},
  journal = {non-linear.ai},
  year    = {2026},
  month   = {Feb},
  url     = {https://non-linear.ai/blog/context-engineering/}
}
```