latest / architecture
← all posts
// architecture · architecture

Advanced agent architecture: context is the scarce resource

The foundational post covers the loop, tools, and evals; the patterns post covers topologies. This one is the next layer down — the discipline that actually separates a demo agent from one that survives a long, real task. And it has one subject: the context window is the scarce resource, and managing it is the whole advanced game.

Why context is where it's won or lost

Three things go wrong with a naive, ever-growing context, and all of them get worse the longer the agent runs:

  • It fills with stale junk. Old tool outputs, completed sub-tasks, dead-end reasoning — all still in the window, all still being re-read and re-billed every turn.
  • The signal gets buried. Models reason best about the start and end of a long window and skim the murky middle ("lost in the middle"). Bury the critical function at token 400,000 and it gets skimmed.
  • You re-send the world. Every turn re-processes the whole transcript. Without intervention, cost and latency grow with the square of the conversation.

Everything below is a tool for the same job: keep the window small, relevant, and stable.

The advanced toolkit

Compaction. When the conversation approaches the window limit, summarize the earlier history (server-side, where supported) into a compact block so the loop can keep going past the raw window. The critical implementation detail: preserve the compaction block in your message history — append the full response, not just the text — or you silently lose the summarized state.

Context editing. The complement to compaction: instead of summarizing old content, prune it. Clear stale tool results and completed thinking blocks past a threshold. Compaction condenses; editing removes. Long tool-heavy loops want both — edit the dead weight, compact when you near the limit.

Memory tiers. Be honest about what "memory" you need — most agents need three plain layers, not a vector database:

  • Short-term — the conversation, in the window.
  • Scratchpad — a file for the current task's working notes.
  • Durable — cross-session facts worth persisting (conventions, decisions). A directory of markdown beats a bespoke memory system until a profiler says otherwise. (Software as memory is this layer, governed.)

Reach for a vector "memory" store only when a user complaint — not a roadmap — proves you need it. The classic over-build is a long-term memory layer shipped six months early and maintained forever.

Sub-agent context isolation. The single most powerful advanced move. Fan out work to subagents, each with a clean, scoped context, and have the orchestrator keep only their summaries — not the raw firehose. Searching a big codebase or exploring three approaches in parallel shouldn't pollute the main thread's window. This is the orchestrator-worker topology seen through the context lens: isolation is the point, parallelism is the bonus.

Programmatic tool calling. When the agent needs many sequential tool calls or produces large intermediate results, let it write a script that calls the tools instead of round-tripping each one through its context. The intermediates live in the running code; only the final output returns to the model. A 50-result search that gets filtered down to 3 in a loop costs you 3 results of context, not 50.

Tool search. Don't load 100 tool schemas into every request. Let the agent discover the few relevant tools on demand — and crucially, append them rather than swapping, so the prompt cache survives. Big fixed tool sets are a context tax you pay on every turn.

Caching is part of the architecture, not an afterthought

The stable prefix — system prompt, tool definitions, the codebase — is a design surface. Keep it byte-stable so it caches: no timestamps in the system prompt, deterministic tool ordering, volatile content after the breakpoint. An advanced agent treats "what is the cacheable prefix" as a first-class architectural question, because it's the difference between linear and near-flat cost (the caching lever).

The lazy advanced stack

Advanced doesn't mean maximal. Turn things on as the task demands them:

  • Compaction on for anything that might exceed the window.
  • Context editing for long, tool-heavy loops.
  • A file scratchpad, and a markdown memory directory for cross-session facts.
  • Subagents for fan-out, keeping the orchestrator's window clean.
  • Caching protected — the prefix is sacred.
  • Add programmatic tool calling, tool search, and a real memory store only when a profiler proves the simpler stack ran out of room.

The throughline: you don't make an agent smarter by giving it more context — you make it smarter by giving it the right context and ruthlessly evicting the rest. The frontier model is rented; the context discipline is the part you own, and it's where the hard problems are actually solved. The natural next step is spending that context budget across model tiers — the local-first cascade.

#architecture#agents#context