Observability for agents: you can't operate what you can't see

You wouldn't run a distributed system in production with no logs, no metrics, and no tracing. An agent is a distributed system with the added joy of being nondeterministic and occasionally creative. Yet most teams ship one and "monitor" it by vibes — "it felt worse this week." That's not operations; it's superstition. Here's what to instrument.

Why agents are hard to see into

A single agent task is a loop: prompt → model → tool call → result → model → … → answer. Things that make it opaque:

  • Nondeterminism. The same input can take a different path. A bug that reproduces one time in five is invisible without traces.
  • Multi-step. The failure might be in step 7 of 12 — a tool that returned junk, a retrieval that fetched the wrong file — and the final answer just inherits it.
  • Cost is hidden. A task that "worked" might have cost ten times what it should because the cache silently stopped hitting.

What to instrument

Three layers, in order of payoff:

1. Per-step traces. The single highest-value thing. For every task, record the chain: each prompt, each tool call with its arguments, each tool result, each model decision. When something goes wrong, you read the trace and see which step lied — instead of guessing. This is OpenTelemetry-style tracing applied to the agent loop; several LLM-tracing tools speak the OTel format now.

2. Token and cost accounting. Per task, not per month: input/output tokens, cache-hit ratio, model used, dollars. Two reasons this is non-negotiable:

  • A collapsing cache-hit ratio is the most common silent cost blowup, and it's invisible without this metric (the caching lever only works if you can see it working).
  • Cost-per-task is the number your pricing strategy lives or dies on.

3. Outcome metrics. Did it actually work? Tool error rates, retry counts, human-override rate (how often a person had to step in or redo it), and task success against a golden set.
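
As a rough illustration of the first two layers, here is a minimal Python sketch that keeps one record per step and derives cost and cache-hit ratio per task. Everything in it is an assumption made for illustration: the `Step` and `TaskTrace` names, the split of input tokens into cached and uncached counts, and the prices in `PRICE_PER_MTOK`, which are placeholders rather than any provider's real rates.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

# Placeholder prices in USD per million tokens (hypothetical values).
PRICE_PER_MTOK = {"input": 2.00, "cached_input": 0.20, "output": 8.00}


@dataclass
class Step:
    """One link in the chain: a model turn or a tool call with its result."""
    index: int
    kind: str                    # "model" or "tool"
    detail: dict[str, Any]       # prompt/decision text, or tool name, args, result
    ok: bool = True
    input_tokens: int = 0        # uncached input tokens
    cached_input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class TaskTrace:
    """Ordered record of every step taken for a single task."""
    task_id: str
    model: str
    steps: list[Step] = field(default_factory=list)

    def record(self, kind: str, detail: dict[str, Any], **fields: Any) -> Step:
        step = Step(index=len(self.steps), kind=kind, detail=detail, **fields)
        self.steps.append(step)
        return step

    def first_failure(self) -> Optional[Step]:
        """Earliest step flagged ok=False, or None if every step succeeded."""
        return next((s for s in self.steps if not s.ok), None)

    def cache_hit_ratio(self) -> float:
        """Share of all input tokens that were read from cache."""
        cached = sum(s.cached_input_tokens for s in self.steps)
        total = cached + sum(s.input_tokens for s in self.steps)
        return cached / total if total else 0.0

    def cost_usd(self) -> float:
        return sum(
            s.input_tokens * PRICE_PER_MTOK["input"]
            + s.cached_input_tokens * PRICE_PER_MTOK["cached_input"]
            + s.output_tokens * PRICE_PER_MTOK["output"]
            for s in self.steps
        ) / 1_000_000

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = TaskTrace(task_id="t-1", model="example-model")
trace.record("model", {"prompt": "Summarise README.md", "decision": "call read_file"},
             input_tokens=1000, cached_input_tokens=9000, output_tokens=500)
trace.record("tool", {"tool": "read_file", "args": {"path": "README.md"},
                      "result": "file not found"}, ok=False)
trace.record("model", {"decision": "answer from memory"},
             input_tokens=2000, cached_input_tokens=8000, output_tokens=500)

failed = trace.first_failure()
print(f"first failing step: {failed.index} ({failed.kind})")
print(f"cache-hit ratio: {trace.cache_hit_ratio():.2f}, cost: ${trace.cost_usd():.4f}")
```

Keeping tokens on the step rather than on the task means the same record answers both "which step lied" and "which step was expensive". In this sketch, if the 17,000 cached tokens were billed at the uncached placeholder rate instead, the same task would cost roughly 2.8 times as much, which is the kind of silent blowup the ratio is there to expose.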

"It felt better" is not a metric. The agent that quietly regressed last week did so on a number you weren't watching — usually cache-hit ratio or override rate.
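
The outcome metrics above reduce to a few ratios over a batch of finished tasks. The sketch below is one possible way to compute them in Python; the `TaskOutcome` fields and the `summarise` helper are hypothetical names, and how you decide `succeeded` (golden set, human review) is left out.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    tool_calls: int
    tool_errors: int
    retries: int
    human_override: bool   # a person had to step in or redo the task
    succeeded: bool        # judged against a known-good outcome


def summarise(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Aggregate per-task outcomes into the numbers worth watching."""
    n = len(outcomes)
    if n == 0:
        return {}
    total_calls = sum(o.tool_calls for o in outcomes)
    total_errors = sum(o.tool_errors for o in outcomes)
    return {
        "tool_error_rate": total_errors / total_calls if total_calls else 0.0,
        "mean_retries": sum(o.retries for o in outcomes) / n,
        "override_rate": sum(o.human_override for o in outcomes) / n,
        "success_rate": sum(o.succeeded for o in outcomes) / n,
    }


this_week = [
    TaskOutcome(tool_calls=5, tool_errors=0, retries=0, human_override=False, succeeded=True),
    TaskOutcome(tool_calls=3, tool_errors=1, retries=1, human_override=False, succeeded=True),
    TaskOutcome(tool_calls=8, tool_errors=2, retries=2, human_override=True, succeeded=False),
    TaskOutcome(tool_calls=4, tool_errors=1, retries=1, human_override=False, succeeded=True),
]
summary = summarise(this_week)
print(summary)
```

Comparing this dictionary week over week is what turns "it felt worse" into a specific number that moved.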

Evals are the control plane

Telemetry tells you what happened; evals tell you whether a change made things better or worse. They're the same discipline from the agent-architecture post, viewed as an operational tool:

  • A golden set — even 20 real tasks with known-good outcomes — that you run on every change to the prompt, the tools, the retrieval, or the model.
  • Regression tracking. The change that fixes today's bug quietly breaks last week's. Without an eval set, you hear about it from users; with one, you catch it in CI.
  • Test the loop, not just the model. The model is one variable; your retrieval, tools, and prompts are the others. Change one per eval run and hold the rest fixed, or you're measuring noise.
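
A golden set with regression tracking can be very small. The following Python sketch assumes an agent is just a function from task input to answer and that exact string match is a good enough check; both are simplifications, and `run_evals`, `regressions`, and the two stand-in agents are hypothetical names for illustration.

```python
from typing import Callable

# task id -> (input, known-good answer)
GoldenSet = dict[str, tuple[str, str]]


def run_evals(agent: Callable[[str], str], golden: GoldenSet) -> dict[str, bool]:
    """Run every golden task through the agent and record pass/fail."""
    return {
        task_id: agent(task_input).strip() == expected
        for task_id, (task_input, expected) in golden.items()
    }


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Tasks that passed before the change and fail after it."""
    return sorted(
        task_id for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


golden: GoldenSet = {
    "add": ("2+2", "4"),
    "upper": ("shout hi", "HI"),
    "echo": ("say ok", "ok"),
}


def agent_before(task: str) -> str:
    # Stand-in for the current agent: cannot do the "upper" task.
    return {"2+2": "4", "say ok": "ok"}.get(task, "")


def agent_after(task: str) -> str:
    # Stand-in for the changed agent: fixes "upper" but breaks "echo".
    return {"2+2": "4", "shout hi": "HI"}.get(task, "")


baseline = run_evals(agent_before, golden)
candidate = run_evals(agent_after, golden)
broken = regressions(baseline, candidate)
print("regressed tasks:", broken)  # a CI job would fail the build if this is non-empty
```

Both versions pass two of three tasks, so the pass rate alone would hide the change. Diffing per-task results is what surfaces the fix that quietly broke something else.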

The lazy, correct stack

Resist buying an "LLM observability platform" before you've earned the need:

  • Structured logs of every step (prompt, tool calls, results, tokens, cost) to wherever your logs already go. This alone puts you ahead of most teams.
  • A cost-per-task number on a dashboard you actually look at.
  • A 20-task eval you run on every change.
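
For the first item, Python's standard `logging` module is enough to emit one JSON object per step. This is a sketch under assumptions: the `agent.steps` logger name, the `step` key passed through `extra`, and the field names are all hypothetical, and the in-memory buffer stands in for stdout or whatever sink your logs already use.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Fields passed via extra={"step": {...}} are merged into the line.
        entry.update(getattr(record, "step", {}))
        return json.dumps(entry, default=str)


def make_step_logger(stream) -> logging.Logger:
    logger = logging.getLogger("agent.steps")
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    logger.handlers.clear()       # keep the sketch safe to re-run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stand-in for stdout or your existing log sink
log = make_step_logger(buffer)
log.info("tool_call", extra={"step": {
    "task_id": "t-1", "step": 3, "tool": "search",
    "input_tokens": 1200, "output_tokens": 80, "cost_usd": 0.0031,
}})

lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(lines[0])
```

Because each line is plain JSON, a cost-per-task number is a group-by on `task_id` in whatever log tooling you already have.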

That's real observability. Add the fancy tracing UI, automated LLM-judges, and dashboards when a profiler — sorry, a production incident — proves the basics weren't enough (ponytail applies to your tooling too). You instrument to answer specific questions: which step failed, what did it cost, did this change help? Build exactly enough to answer those, and not the cathedral.

Inspired by the observability and measurement writing at tomaskubica.cz.

#observability #agents #evals