AI agent architectures that don't fall over

$ Jakub Jirák · Jun 16 · 10 min

It is very easy to build an agent that demos well and very hard to build one that survives contact with real users. The gap is almost never the model. It's the architecture around it — the unglamorous machinery that decides what the model sees, what it's allowed to do, and how you know it's working.

An agent is a loop, not a model

The whole thing reduces to:

# the agent loop
observe   -> gather context (prompt, retrieved docs, tool results)
decide    -> the model proposes an action (a tool call or a final answer)
act       -> execute the tool, capture the result
repeat    -> until done or a budget is hit

Everything that makes agents hard is hidden in observe and act. The model is the easy part — you rent it. The architecture is the part you actually own.
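As a rough illustration, the loop above fits in a few lines of Python. This is a minimal sketch, not a reference implementation: `Action`, `run_agent`, and `fake_decide` are made-up names, and `fake_decide` stands in for a real model call so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                                  # "tool" or "final"
    name: str = ""                             # tool name when kind == "tool"
    args: dict = field(default_factory=dict)   # tool arguments
    text: str = ""                             # final answer when kind == "final"


def run_agent(task, decide, tools, max_steps=5):
    """Observe -> decide -> act, repeated until a final answer or the budget is hit."""
    context = [("user", task)]                 # observe: start from the prompt
    for _ in range(max_steps):
        action = decide(context)               # decide: the model proposes an action
        if action.kind == "final":
            return action.text
        tool = tools.get(action.name)
        if tool is None:
            result = f"error: unknown tool {action.name!r}"
        else:
            result = tool(**action.args)       # act: execute the tool
        # observe again: the tool result becomes part of the next context
        context.append(("tool", f"{action.name}: {result}"))
    return "stopped: step budget exhausted"


def fake_decide(context):
    """Hypothetical stand-in for a model: call `add` once, then answer."""
    if not any(role == "tool" for role, _ in context):
        return Action(kind="tool", name="add", args={"a": 2, "b": 3})
    return Action(kind="final", text=f"done after seeing {context[-1][1]}")


answer = run_agent("what is 2 + 3?", fake_decide, {"add": lambda a, b: a + b})
```

Note how little of this is the model. Swapping `fake_decide` for a real API call changes one function; the budget, the tool lookup, and what gets appended to `context` are all yours to design.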

Context engineering beats prompt engineering

The single biggest lever on quality is what goes in the window, not how cleverly you word the instructions.

Retrieve narrow, not wide. More context is not better context. Irrelevant chunks are noise the model has to fight through. Precision in retrieval beats recall almost every time.
Put the important thing where the model looks: the start and end of the window, not the middle. (The Gemini post covers the "lost in the middle" effect; it is not specific to one model family.)
Cache the stable parts. System prompt, tool definitions, the codebase: these don't change turn to turn. Without caching you pay full price for that same prefix on every turn; with prompt caching the repeated prefix is typically billed at a steep discount, so the cost of each extra turn depends mostly on what's new. That is often the difference between a viable agent and an expensive one.

Most "the model isn't smart enough" complaints are "I gave it the wrong context" in disguise.
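One way the three rules above might look in code. This is a sketch under assumptions: `select_chunks` and `build_prompt` are hypothetical helpers, the relevance scores are assumed to come from whatever retriever you already have, and the 0.5 threshold and `k=3` are placeholder values to tune against your own evals.

```python
def select_chunks(scored_chunks, min_score=0.5, k=3):
    """Retrieve narrow: keep at most k chunks, and only those above a relevance floor."""
    relevant = [(score, chunk) for score, chunk in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in relevant[:k]]


def build_prompt(system, tool_defs, chunks, question):
    """Stable parts first, retrieved chunks in the middle, the question last.

    Returns (stable_prefix, full_prompt). The prefix is byte-identical from
    turn to turn, which is what a provider-side prompt cache needs.
    """
    stable_prefix = "\n\n".join([system, tool_defs])
    volatile = "\n\n".join(list(chunks) + [f"Question: {question}"])
    return stable_prefix, stable_prefix + "\n\n" + volatile


scored = [(0.9, "refund policy"), (0.2, "office hours"), (0.7, "refund form"),
          (0.6, "refund FAQ"), (0.55, "returns address")]
chunks = select_chunks(scored)
prefix, prompt = build_prompt("You are a support agent.", "TOOLS: lookup_order", chunks, "How do refunds work?")
```

The instructions sit at the start and the question at the end, so the two things the model most needs are not buried in the middle. How you actually mark the prefix as cacheable depends on your provider's API and is left out here.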

RAG vs. fine-tuning vs. a bigger window

The eternal team argument. The decision is usually simpler than the debate:

Need                                    Reach for
Inject knowledge that changes often     RAG
Teach a behavior / format / tone        Fine-tune
Reason over one big artifact at once    Long context
Cite sources, control freshness         RAG

RAG is the default because knowledge changes and retrieval is cheap to update — you edit a document, not a model. Fine-tune when you need the model to behave differently, not know differently. Reach for a long window when the problem is a single large thing. These compose; it's rarely either/or.

Tools: the part with teeth

The moment your agent can run code, hit APIs, or touch a filesystem, you've left "language model" and entered "software with a blast radius."

Least privilege, always. Scope every tool to the minimum it needs. An agent that can delete prod will, eventually, on a bad sample.
Make destructive actions confirm. A human in the loop for anything irreversible isn't friction; it's the design. (The same reasoning is behind Claude Code gating side effects and Codex running in a sandbox.)
Validate at the boundary. Treat tool inputs the model generates like untrusted user input, because that's what they are. Check them against a schema, and never interpolate them into a shell command or a query string: pass argument lists and use parameterized queries.
Return structured, bounded results. A tool that dumps 50k tokens of output back into the loop has just poisoned the context. Summarize, paginate, truncate.
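The four rules above can be enforced in one place: the function that sits between the model's proposed call and the real tool. The sketch below assumes a hypothetical note-taking agent; `Tool`, `call_tool`, `validate_path`, and the 2000-character cap are illustrative names and values, not any library's API.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RESULT_CHARS = 2000  # assumed bound on what a tool may return to the loop


@dataclass
class Tool:
    fn: Callable[..., str]
    validate: Callable[[dict], None]   # raises ValueError on bad arguments
    destructive: bool = False          # irreversible actions need confirmation


def call_tool(tools, name, args, confirm, max_chars=MAX_RESULT_CHARS):
    """Single choke point for every tool call the model proposes."""
    tool = tools.get(name)
    if tool is None:                   # least privilege: unregistered tools don't exist
        return {"ok": False, "error": f"tool {name!r} is not allowed"}
    try:
        tool.validate(args)            # validate at the boundary
    except ValueError as exc:
        return {"ok": False, "error": f"invalid arguments: {exc}"}
    if tool.destructive and not confirm(name, args):
        return {"ok": False, "error": "declined by human reviewer"}
    output = str(tool.fn(**args))
    # bounded result: never hand the loop more than max_chars
    return {"ok": True, "output": output[:max_chars], "truncated": len(output) > max_chars}


def validate_path(args):
    """Accept exactly one argument, a path that stays inside notes/."""
    if set(args) != {"path"}:
        raise ValueError("expected exactly one argument: path")
    path = args["path"]
    if not isinstance(path, str) or not path.startswith("notes/") or ".." in path:
        raise ValueError("path must stay inside notes/")


# Toy tools that fake their work so the example has no side effects.
tools = {
    "read_note": Tool(fn=lambda path: "x" * 5000, validate=validate_path),
    "delete_note": Tool(fn=lambda path: f"deleted {path}", validate=validate_path, destructive=True),
}
```

In a real agent `confirm` would prompt a person; here it is any callable returning a bool. Errors come back as structured results rather than exceptions, so the model can read them and try something else.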

Memory, honestly

Most agents don't need a vector-database "memory" system. They need:

The conversation (short-term, in the window).
A scratchpad for the current task (a file, a few notes).
Occasionally, durable facts worth persisting across sessions.

Reach for a real memory store only when a profiler — sorry, a user complaint — proves you need it. A bespoke long-term memory layer is the classic thing built six months early and maintained forever.
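For scale, here is roughly how small the three-part version can be. A sketch only: `AgentMemory` and its method names are invented for illustration, and a single JSON file is just one assumed way to persist a handful of facts.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentMemory:
    facts_path: Path                                   # durable facts, kept across sessions
    conversation: list = field(default_factory=list)   # short-term, lives in the window
    scratchpad: list = field(default_factory=list)     # notes for the current task only

    def load_facts(self):
        """Read persisted facts; an absent file just means nothing is remembered yet."""
        if self.facts_path.exists():
            return json.loads(self.facts_path.read_text(encoding="utf-8"))
        return {}

    def remember(self, key, value):
        """Persist one fact worth keeping beyond this session."""
        facts = self.load_facts()
        facts[key] = value
        self.facts_path.write_text(json.dumps(facts, indent=2), encoding="utf-8")

    def end_task(self):
        """Task notes are disposable; drop them when the task is done."""
        self.scratchpad.clear()
```

If this ever stops being enough, the complaints will tell you which of the three parts to upgrade.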

Evals are the whole game

You cannot improve what you cannot measure, and "it felt better" is not measurement.

Build a golden set early — even 20 real tasks with known-good outcomes. It's the difference between engineering and vibes.
Test the loop, not just the model. Swapping the model is one variable; your retrieval, your tools, and your prompts are the others. Change one at a time and hold the rest fixed, or you won't know which change moved the score.
Track regressions. The change that fixes today's bug can quietly break last week's fix. Without an eval set you find out from users.
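A golden set and a regression check don't need a framework. The sketch below assumes each task is a dict with a hypothetical `check` function that decides pass or fail. The two tasks and both toy agents are invented purely to show the mechanics, not real results.

```python
def run_evals(agent, golden_set):
    """Run every golden task through the agent; return {task_id: passed}."""
    return {task["id"]: bool(task["check"](agent(task["input"]))) for task in golden_set}


def regressions(previous, current):
    """Task ids that passed in the previous run but fail in the current one."""
    return sorted(tid for tid, ok in previous.items() if ok and not current.get(tid, False))


# Hypothetical golden set: real ones come from real user tasks.
golden_set = [
    {"id": "refund-policy", "input": "How long do refunds take?",
     "check": lambda out: "5 days" in out},
    {"id": "greeting", "input": "hi",
     "check": lambda out: out != ""},
]


def agent_v1(prompt):
    return "Refunds take 5 days." if "refund" in prompt.lower() else "Hello!"


def agent_v2(prompt):  # a "fix" elsewhere that drops the concrete number
    return "Refunds are quick." if "refund" in prompt.lower() else "Hello!"


baseline = run_evals(agent_v1, golden_set)
candidate = run_evals(agent_v2, golden_set)
regressed = regressions(baseline, candidate)
pass_rate = sum(candidate.values()) / len(candidate)
```

Here `regressed` names the task that `agent_v2` broke. Store the baseline results alongside the code and run the comparison on every change to the model, the retrieval, the tools, or the prompts.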

The lazy, correct stack

If you're starting today, resist the urge to build the cathedral:

One model, called in a loop.
Narrow retrieval over a plain document store. No vector DB until search quality actually demands it.
A handful of well-scoped, validated tools.
Prompt caching on from day one.
A 20-task eval set you actually run.

That's a real agent. Everything fancier — multi-agent orchestration, custom memory, a framework with sixty abstractions — should have to justify itself against a profiler and a user, not a roadmap. The landscape post is where these principles show up in shipping products.
