Observability for agents: you can't operate what you can't see
You wouldn't run a distributed system in production with no logs, no metrics, and no tracing. An agent is a distributed system with the added joy of being nondeterministic and occasionally creative. Yet most teams ship one and "monitor" it by vibes — "it felt worse this week." That's not operations; it's superstition. Here's what to instrument.
Why agents are hard to see into
A single agent task is a loop: prompt → model → tool call → result → model → … → answer. Things that make it opaque:
- Nondeterminism. The same input can take a different path. A bug that reproduces one time in five is invisible without traces.
- Multi-step. The failure might be in step 7 of 12 — a tool that returned junk, a retrieval that fetched the wrong file — and the final answer just inherits it.
- Cost is hidden. A task that "worked" might have cost ten times what it should because the cache silently stopped hitting.
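To put numbers on that last point, here is a minimal sketch of per-task cost arithmetic. The prices in `PRICE_PER_MTOK`, the `TaskUsage` fields, and the `cache_collapsed` helper are all hypothetical, chosen only so that cached input is ten times cheaper than uncached input; substitute your provider's real rates and field names.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens (assumption, not real rates).
PRICE_PER_MTOK = {"input": 3.00, "cache_read": 0.30, "output": 15.00}


@dataclass
class TaskUsage:
    input_tokens: int        # input tokens billed at the full rate
    cache_read_tokens: int   # input tokens served from the prompt cache
    output_tokens: int

    @property
    def cache_read_ratio(self) -> float:
        """Share of all input tokens that came from the cache."""
        total_in = self.input_tokens + self.cache_read_tokens
        return self.cache_read_tokens / total_in if total_in else 0.0

    @property
    def dollars(self) -> float:
        """Cost of this one task under the hypothetical price table."""
        return (
            self.input_tokens * PRICE_PER_MTOK["input"]
            + self.cache_read_tokens * PRICE_PER_MTOK["cache_read"]
            + self.output_tokens * PRICE_PER_MTOK["output"]
        ) / 1_000_000


def cache_collapsed(usage: TaskUsage, baseline_ratio: float, tolerance: float = 0.2) -> bool:
    """True when the cache-read ratio is more than `tolerance` below the baseline."""
    return usage.cache_read_ratio < baseline_ratio - tolerance


# Same task, same answer: once with a warm cache, once after the cache stopped hitting.
warm = TaskUsage(input_tokens=10_000, cache_read_tokens=90_000, output_tokens=2_000)
cold = TaskUsage(input_tokens=100_000, cache_read_tokens=0, output_tokens=2_000)

for name, usage in (("warm", warm), ("cold", cold)):
    print(name, f"ratio={usage.cache_read_ratio:.2f}", f"cost=${usage.dollars:.3f}")
print("alert:", cache_collapsed(cold, baseline_ratio=warm.cache_read_ratio))
```

Both runs return the same answer to the user, so nothing looks broken from the outside. Only the per-task ratio and dollar figure show that the second run cost several times more.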
What to instrument
Three layers, in order of payoff:
1. Per-step traces. The single highest-value thing. For every task, record the chain: each prompt, each tool call with its arguments, each tool result, each model decision. When something goes wrong, you read the trace and see which step lied — instead of guessing. This is OpenTelemetry-style tracing applied to the agent loop; several LLM-tracing tools speak the OTel format now.
2. Token and cost accounting. Per task, not per month: input/output tokens, cache-read ratio, model used, dollars. Two reasons this is non-negotiable:
- A collapsing cache-hit ratio is the most common silent cost blowup, and it's invisible without this metric (the caching lever only works if you can see it working).
- Cost-per-task is the number your pricing strategy lives or dies on.
3. Outcome metrics. Did it actually work? Track tool error rates, retry counts, human-override rate (how often a person had to step in or redo the work), and task success rate against a golden set of tasks with known-good outcomes.
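The first layer can start as plain structured logging. The sketch below writes one JSON line per agent step; `StepRecord`, `TraceLogger`, and the `kind` values are hypothetical names for illustration, not the schema of any particular tracing tool or of OpenTelemetry itself.

```python
import json
import sys
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, TextIO


@dataclass
class StepRecord:
    task_id: str
    step: int                 # 1-based position in the agent loop
    kind: str                 # e.g. "prompt", "tool_call", "tool_result", "model_decision"
    payload: Dict[str, Any]   # arguments, results, token counts, and so on
    ts: float = field(default_factory=time.time)


class TraceLogger:
    """Appends one JSON line per agent step to an output stream."""

    def __init__(self, out: TextIO = sys.stdout, task_id: Optional[str] = None):
        self.out = out
        self.task_id = task_id or uuid.uuid4().hex
        self._step = 0

    def log(self, kind: str, **payload: Any) -> StepRecord:
        self._step += 1
        record = StepRecord(self.task_id, self._step, kind, payload)
        # default=str keeps logging from crashing on non-JSON values such as paths.
        self.out.write(json.dumps(asdict(record), default=str) + "\n")
        return record


# Usage: call log() at each point in the loop.
trace = TraceLogger(task_id="demo-task")
trace.log("prompt", text="Summarize README.md")
trace.log("tool_call", tool="read_file", args={"path": "README.md"})
trace.log("tool_result", tool="read_file", ok=True, chars=1234)
trace.log("model_decision", action="final_answer", input_tokens=1500, output_tokens=200)
```

Because every line carries the task id and step number, reading a failed task means filtering on one id and scanning the steps in order. The same records can be exported to a tracing backend later if that turns out to be worth it.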
"It felt better" is not a metric. The agent that quietly regressed last week did so on a number you weren't watching — usually cache-hit ratio or override rate.
Evals are the control plane
Telemetry tells you what happened; evals tell you whether a change made things better or worse. This is the same eval discipline as in the agent-architecture post, used here as an operational tool:
- A golden set — even 20 real tasks with known-good outcomes — that you run on every change to the prompt, the tools, the retrieval, or the model.
- Regression tracking. The change that fixes today's bug can quietly undo last week's fix. Without an eval set, you hear about it from users; with one, you catch it in CI.
- Test the loop, not just the model. Swapping the model is one variable; your retrieval, tools, and prompts are the others. Hold them fixed and change one at a time, or you're measuring noise.
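A golden-set run with regression tracking fits in a few lines. This is a sketch under stated assumptions: the agent is any callable from prompt string to answer string, each golden task carries its own `check` function, and the two toy agents stand in for "before the change" and "after the change". All names here are hypothetical.

```python
from typing import Callable, Dict, List


def run_golden_set(agent: Callable[[str], str], tasks: List[dict]) -> Dict[str, bool]:
    """Run every golden task through the agent; a crash counts as a failure."""
    results: Dict[str, bool] = {}
    for task in tasks:
        try:
            results[task["id"]] = bool(task["check"](agent(task["input"])))
        except Exception:
            results[task["id"]] = False
    return results


def regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Ids of tasks that passed in the baseline run but fail now."""
    return sorted(t for t, ok in baseline.items() if ok and not current.get(t, False))


# Toy golden set; in practice these are real tasks with known-good outcomes.
GOLDEN = [
    {"id": "add", "input": "2+2", "check": lambda out: out.strip() == "4"},
    {"id": "upper", "input": "shout: hi", "check": lambda out: "HI" in out},
]


def old_agent(prompt: str) -> str:
    return "4" if prompt == "2+2" else prompt.upper()


def new_agent(prompt: str) -> str:
    # The change under test: still answers the arithmetic task, no longer upper-cases.
    return "4" if prompt == "2+2" else prompt


broken = regressions(run_golden_set(old_agent, GOLDEN), run_golden_set(new_agent, GOLDEN))
print("regressions:", broken)
# In CI, finish with sys.exit(1 if broken else 0) so a regression fails the build.
```

Comparing against a stored baseline, rather than only checking the overall pass rate, is what surfaces the case where one task got fixed and another broke in the same change.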
The lazy, correct stack
Resist buying an "LLM observability platform" before you've earned the need:
- Structured logs of every step (prompt, tool calls, results, tokens, cost) to wherever your logs already go. This alone puts you ahead of most teams.
- A cost-per-task number on a dashboard you actually look at.
- A 20-task eval you run on every change.
That's real observability. Add the fancy tracing UI, automated LLM-judges, and dashboards when a profiler — sorry, a production incident — proves the basics weren't enough (ponytail applies to your tooling too). You instrument to answer specific questions: which step failed, what did it cost, did this change help? Build exactly enough to answer those, and not the cathedral.
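Two of those questions can be answered straight from the step logs with no extra tooling. The sketch below assumes one JSON object per line with hypothetical fields `task_id`, `step`, an optional boolean `ok`, and an optional `cost_usd`; adapt the field names to whatever your logs actually contain.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, dict]:
    """Per task: total cost and the lowest-numbered step logged with ok == false."""
    summary: Dict[str, dict] = defaultdict(
        lambda: {"cost_usd": 0.0, "first_failed_step": None}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        rec = json.loads(line)
        task = summary[rec["task_id"]]
        task["cost_usd"] += rec.get("cost_usd", 0.0)
        if rec.get("ok") is False:
            prev = task["first_failed_step"]
            task["first_failed_step"] = rec["step"] if prev is None else min(prev, rec["step"])
    return dict(summary)


# Usage with a few inline log lines; in practice, pass an open log file instead.
sample = [
    '{"task_id": "t1", "step": 1, "kind": "tool_call", "ok": true, "cost_usd": 0.01}',
    '{"task_id": "t1", "step": 2, "kind": "tool_result", "ok": false, "cost_usd": 0.02}',
    '{"task_id": "t2", "step": 1, "kind": "prompt", "cost_usd": 0.05}',
]
for task_id, info in summarize(sample).items():
    print(task_id, info)
```

The third question, whether a change helped, is the job of the eval run rather than the logs.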
Inspired by the observability and measurement writing at tomaskubica.cz.