Observability for agents: you can't operate what you can't see

$ Jakub Jirák · Jun 1 · 9 min

You wouldn't run a distributed system in production with no logs, no metrics, and no tracing. An agent is a distributed system with the added joy of being nondeterministic and occasionally creative. Yet most teams ship one and "monitor" it by vibes — "it felt worse this week." That's not operations; it's superstition. Here's what to instrument.

Why agents are hard to see into

A single agent task is a loop: prompt → model → tool call → result → model → … → answer. Things that make it opaque:

Nondeterminism. The same input can take a different path. A bug that reproduces one time in five is invisible without traces.
Multi-step. The failure might be in step 7 of 12 — a tool that returned junk, a retrieval that fetched the wrong file — and the final answer just inherits it.
Cost is hidden. A task that "worked" might have cost ten times what it should because the cache silently stopped hitting.

What to instrument

Three layers, in order of payoff:

1. Per-step traces. The single highest-value thing. For every task, record the chain: each prompt, each tool call with its arguments, each tool result, each model decision. When something goes wrong, you read the trace and see which step lied — instead of guessing. This is OpenTelemetry-style tracing applied to the agent loop; several LLM-tracing tools speak the OTel format now.

2. Token and cost accounting. Per task, not per month: input/output tokens, cache-read ratio, model used, dollars. Two reasons this is non-negotiable:

A collapsing cache-hit ratio is the most common silent cost blowup, and it's invisible without this metric (the caching lever only works if you can see it working).
Cost-per-task is the number your pricing strategy lives or dies on.

3. Outcome metrics. Did it actually work? Tool error rates, retry counts, human-override rate (how often a person had to step in or redo it), and task success against a golden set.
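As a rough illustration of the first two layers, here is a minimal sketch of a per-task trace that also rolls up tokens, cache-read ratio, and dollars. The names `Step`, `TaskTrace`, and `PRICES`, the model name, and the prices are all hypothetical placeholders, not any particular vendor's API or rates.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Hypothetical per-million-token prices in dollars; substitute your provider's real rates.
PRICES = {"example-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00}}


@dataclass
class Step:
    kind: str                 # "model" or "tool"
    name: str                 # model name or tool name
    payload: dict[str, Any]   # prompt, tool arguments, or tool result
    input_tokens: int = 0     # uncached input tokens
    cached_tokens: int = 0    # input tokens served from the prompt cache
    output_tokens: int = 0
    error: bool = False


@dataclass
class TaskTrace:
    task_id: str
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        """Append one step of the agent loop, in order."""
        self.steps.append(step)

    def cache_read_ratio(self) -> float:
        """Share of all input tokens that were served from cache."""
        cached = sum(s.cached_tokens for s in self.steps)
        total = cached + sum(s.input_tokens for s in self.steps)
        return cached / total if total else 0.0

    def cost_usd(self) -> float:
        """Dollar cost of the task, summed over model steps only."""
        total = 0.0
        for s in self.steps:
            if s.kind != "model":
                continue
            p = PRICES[s.name]
            total += (
                s.input_tokens * p["input"]
                + s.cached_tokens * p["cached_input"]
                + s.output_tokens * p["output"]
            ) / 1_000_000
        return total


trace = TaskTrace("task-1")
trace.record(Step("model", "example-model", {"prompt": "..."},
                  input_tokens=1000, cached_tokens=3000, output_tokens=200))
trace.record(Step("tool", "search_files", {"args": {"query": "config"}, "result": "..."}))
print(trace.cache_read_ratio())  # 0.75
```

Keeping the cost rollup on the same object as the step list means the trace you read during an incident is also the record that feeds the cost-per-task number.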

"It felt better" is not a metric. The agent that quietly regressed last week did so on a number you weren't watching — usually cache-hit ratio or override rate.
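One way to make "watching the number" concrete is to compute the outcome metrics per batch of tasks and compare them against a stored baseline. The sketch below assumes a hypothetical `TaskOutcome` record and an arbitrary absolute tolerance; both would need tuning for a real system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskOutcome:
    tool_calls: int
    tool_errors: int
    retries: int
    human_override: bool  # True if a person had to step in or redo the task


def outcome_metrics(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Aggregate outcome metrics over a batch; every metric is 'higher is worse'."""
    n = len(outcomes)
    calls = sum(o.tool_calls for o in outcomes)
    return {
        "tool_error_rate": sum(o.tool_errors for o in outcomes) / calls if calls else 0.0,
        "retries_per_task": sum(o.retries for o in outcomes) / n if n else 0.0,
        "override_rate": sum(o.human_override for o in outcomes) / n if n else 0.0,
    }


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.05) -> list[str]:
    """Names of metrics that exceed the baseline by more than `tolerance` (absolute).

    A single absolute tolerance is a simplification: rates and per-task counts
    live on different scales, so per-metric thresholds are likely a better fit.
    """
    return [k for k, v in current.items() if v - baseline.get(k, v) > tolerance]
```

Run it on each week's (or each deploy's) tasks and alert on a non-empty result, so that a quiet regression shows up as a named metric instead of a feeling.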

Evals are the control plane

Telemetry tells you what happened; evals tell you whether a change made things better or worse. They're the same discipline from the agent-architecture post, viewed as an operational tool:

A golden set — even 20 real tasks with known-good outcomes — that you run on every change to the prompt, the tools, the retrieval, or the model.
Regression tracking. The change that fixes today's bug quietly breaks last week's. Without an eval set, you hear about it from users; with one, you catch it in CI.
Test the loop, not just the model. Swapping the model is one variable; your retrieval, tools, and prompts are the others. Hold them fixed and change one at a time, or you're measuring noise.
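A golden-set run with regression tracking can be very small. The sketch below assumes the agent is a plain callable from task input to answer and that exact string match is a good enough check; real tasks usually need a task-specific checker, and a nondeterministic agent may need several runs per task. `run_golden_set` and `newly_broken` are illustrative names.

```python
from __future__ import annotations

from typing import Callable

GoldenTask = tuple[str, str]  # (task input, known-good outcome)


def run_golden_set(agent: Callable[[str], str],
                   golden: list[GoldenTask]) -> dict[str, bool]:
    """Run every golden task through the agent and record pass/fail per task."""
    results: dict[str, bool] = {}
    for task_input, expected in golden:
        try:
            results[task_input] = agent(task_input).strip() == expected
        except Exception:
            # A crash counts as a failure, not as a skipped task.
            results[task_input] = False
    return results


def newly_broken(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Tasks that passed before the change and fail after it."""
    return sorted(t for t, ok in before.items() if ok and not after.get(t, False))
```

In CI, `before` would come from the last accepted run and `after` from the candidate change; failing the build on a non-empty `newly_broken` list is what catches the fix that breaks last week's behavior.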

The lazy, correct stack

Resist buying an "LLM observability platform" before you've earned the need:

Structured logs of every step (prompt, tool calls, results, tokens, cost) to wherever your logs already go. This alone puts you ahead of most teams.
A cost-per-task number on a dashboard you actually look at.
A 20-task eval you run on every change.
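For the first item, a sketch using only Python's standard `logging` and `json` modules is enough to get one JSON object per step into an existing log pipeline. The logger name `agent.steps` and the field names are assumptions; use whatever your log tooling already expects.

```python
import json
import logging
import sys
import time


def make_step_logger(stream=sys.stdout) -> logging.Logger:
    """Logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("agent.steps")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_step(logger: logging.Logger, task_id: str, step: int, kind: str, **fields) -> dict:
    """Emit one structured record for a single step of the agent loop."""
    record = {"ts": time.time(), "task_id": task_id, "step": step, "kind": kind, **fields}
    # default=str keeps logging from failing on values json cannot serialize.
    logger.info(json.dumps(record, default=str))
    return record


logger = make_step_logger()
log_step(logger, "task-1", 1, "model", model="example-model",
         input_tokens=1000, cached_tokens=3000, output_tokens=200, cost_usd=0.0069)
log_step(logger, "task-1", 2, "tool", tool="search_files",
         args={"query": "config"}, error=False)
```

Because every line carries `task_id` and `step`, the existing log search can reconstruct a task's chain and sum its cost without any extra platform.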

That's real observability. Add the fancy tracing UI, automated LLM-judges, and dashboards when a profiler — sorry, a production incident — proves the basics weren't enough (ponytail applies to your tooling too). You instrument to answer specific questions: which step failed, what did it cost, did this change help? Build exactly enough to answer those, and not the cathedral.

Inspired by the observability and measurement writing at tomaskubica.cz.

#observability #agents #evals