Advanced agent architecture: context is the scarce resource
The foundational post covers the loop, tools, and evals; the patterns post covers topologies. This one is the next layer down — the discipline that actually separates a demo agent from one that survives a long, real task. And it has one subject: the context window is the scarce resource, and managing it is the whole advanced game.
Why context is where it's won or lost
Three things go wrong with a naive, ever-growing context, and all of them get worse the longer the agent runs:
- It fills with stale junk. Old tool outputs, completed sub-tasks, dead-end reasoning — all still in the window, all still being re-read and re-billed every turn.
- The signal gets buried. Models reason best about the start and end of a long window and skim the murky middle ("lost in the middle"). Bury the critical function at token 400,000 and it gets skimmed.
- You re-send the world. Every turn re-processes the whole transcript. Without intervention, cost and latency grow with the square of the conversation.
Everything below is a tool for the same job: keep the window small, relevant, and stable.
The advanced toolkit
Compaction. When the conversation approaches the window limit, summarize the earlier history (server-side, where supported) into a compact block so the loop can keep going past the raw window. The critical implementation detail: preserve the compaction block in your message history — append the full response, not just the text — or you silently lose the summarized state.
Context editing. The complement to compaction: instead of summarizing old content, prune it. Clear stale tool results and completed thinking blocks past a threshold. Compaction condenses; editing removes. Long tool-heavy loops want both — edit the dead weight, compact when you near the limit.
Memory tiers. Be honest about what "memory" you need — most agents need three plain layers, not a vector database:
- Short-term — the conversation, in the window.
- Scratchpad — a file for the current task's working notes.
- Durable — cross-session facts worth persisting (conventions, decisions). A directory of markdown beats a bespoke memory system until a profiler says otherwise. (Software as memory is this layer, governed.)
Reach for a vector "memory" store only when a user complaint — not a roadmap — proves you need it. The classic over-build is a long-term memory layer shipped six months early and maintained forever.
Sub-agent context isolation. The single most powerful advanced move. Fan out work to subagents, each with a clean, scoped context, and have the orchestrator keep only their summaries — not the raw firehose. Searching a big codebase or exploring three approaches in parallel shouldn't pollute the main thread's window. This is the orchestrator-worker topology seen through the context lens: isolation is the point, parallelism is the bonus.
Programmatic tool calling. When the agent needs many sequential tool calls or produces large intermediate results, let it write a script that calls the tools instead of round-tripping each one through its context. The intermediates live in the running code; only the final output returns to the model. A 50-result search that gets filtered down to 3 in a loop costs you 3 results of context, not 50.
Tool search. Don't load 100 tool schemas into every request. Let the agent discover the few relevant tools on demand — and crucially, append them rather than swapping, so the prompt cache survives. Big fixed tool sets are a context tax you pay on every turn.
Caching is part of the architecture, not an afterthought
The stable prefix — system prompt, tool definitions, the codebase — is a design surface. Keep it byte-stable so it caches: no timestamps in the system prompt, deterministic tool ordering, volatile content after the breakpoint. An advanced agent treats "what is the cacheable prefix" as a first-class architectural question, because it's the difference between linear and near-flat cost (the caching lever).
The lazy advanced stack
Advanced doesn't mean maximal. Turn things on as the task demands them:
- Compaction on for anything that might exceed the window.
- Context editing for long, tool-heavy loops.
- A file scratchpad, and a markdown memory directory for cross-session facts.
- Subagents for fan-out, keeping the orchestrator's window clean.
- Caching protected — the prefix is sacred.
- Add programmatic tool calling, tool search, and a real memory store only when a profiler proves the simpler stack ran out of room.
The throughline: you don't make an agent smarter by giving it more context — you make it smarter by giving it the right context and ruthlessly evicting the rest. The frontier model is rented; the context discipline is the part you own, and it's where the hard problems are actually solved. The natural next step is spending that context budget across model tiers — the local-first cascade.