latest / autonomy
← all posts
// autonomy · autonomy

Long-running autonomous agents: letting it work while you sleep

The interesting line being crossed in 2026 isn't model IQ — it's duration. The progression has gone from autocomplete, to an interactive assistant you babysit turn by turn, to an agent you can hand a fenced task and walk away from for an hour. That last step is the one that changes your day, and it's also the one with the sharpest failure modes. Here's how to actually let it run.

What "autonomous" buys, and what it costs

When an agent can grind unattended — branch, edit, test, read failures, retry, for many minutes — you unlock two things:

  • Parallelism of you. Kick off three scoped tasks, do something else, review the diffs when they land. The agent's wall-clock time stops being your wall-clock time. (Codex's cloud agent is built around exactly this.)
  • Depth. A task that needs twenty iterations to get the suite green is fine if you're not watching each one.

The cost is symmetric: a long unattended run amplifies error as faithfully as it amplifies progress. Fifteen minutes down a wrong path is fifteen minutes of confident, plausible, wrong work — and you weren't there to stop it at minute two.

The four things that make it safe to walk away

Autonomy is leverage when the task is verifiable and a liability when it isn't. Everything below is about making sure you're holding the leverage.

  • A real oracle. The agent must be able to check itself: a passing test suite, a property test, a script that exits non-zero on failure. No oracle, no autonomy — you're just generating diffs and hoping. This is the single precondition.
  • Tight fences. "Only touch src/payments," "don't change public APIs," "no new dependencies." Constraints are the steering wheel for a process you're not watching.
  • A sandbox. Long unattended runs and an unguarded shell are a bad pairing. Isolation means the worst case is bounded whether or not you're at the keyboard (sandboxing the agent).
  • A budget. A token/turn ceiling (a task budget, --max-turns) so a stuck agent fails cheap instead of looping until your bill notices.

Read the log, not just the diff

The discipline that separates "I trust the overnight run" from "I got burned": when the agent hands back a green checkmark, read the trace of what it tried, not just the final diff. The log tells you whether the tests pass because the change is correct or because the agent quietly learned to route around a failing test. A fresh-context verifier — a separate agent that judges the artifact, not the conversation that produced it — catches the self-deception the author can't (the evaluator loop).

When to let it run vs. babysit

A simple decision rule:

  • Let it run when the task is fenced and verifiable — a migration with a suite, "make these flaky tests deterministic," a broad mechanical change. You'll evaluate the end state, and the end state is checkable.
  • Babysit when the task is exploratory or underspecified — figuring out an approach, debugging something with no clear oracle, anything where "done" is a judgement call. Here the turn-by-turn loop is the point.

The mistake is letting a vague task run long. Length amplifies whatever you gave it; give it ambiguity and you get a lot of confident ambiguity back.

The operational reality

Long-running agents are token-heavy by nature, so the cost levers and observability stop being optional — you want cost-per-task and a cache-hit ratio you can see, because an overnight run is exactly where a silent cache miss turns into a surprising invoice. Budgets cap the downside; traces explain it after the fact.

Used well, an autonomous agent is the closest thing to hiring a tireless junior who works the night shift. The skill isn't prompting it — it's building the fences, the oracle, and the budget that let you trust the work you weren't awake to watch.

Inspired by the autonomous-agent coverage at vibecoding.cz.

#autonomy#agents#codex