Sandboxing the agent: letting AI run code without losing the building
The moment you give a coding agent a shell, you've stopped running a language model and started running software with a blast radius. It can install packages, hit your network, read your secrets, and — on a bad sample or a poisoned input — delete things. Sandboxing isn't a nice-to-have bolted on later; it's the precondition for letting an agent act at all.
The threat model is broader than "the model goes rogue"
An agent's actions can go wrong in three distinct ways, and only one of them is the model itself being "bad":
- Honest mistakes. It runs rm in the wrong directory, force-pushes, or drops a table while "cleaning up." No malice, just a confident wrong step.
- Prompt injection. This is the one teams underrate. Untrusted content the agent reads (a web page, a dependency's README, an issue comment, a tool result) can carry instructions that hijack the agent's tools. The agent can't reliably tell your instructions from text it just fetched. Treat every byte the agent didn't write as potentially adversarial.
- Over-broad capability. A tool scoped to "run any command" will eventually run a destructive one, because the action space includes it.
The isolation layers
Defense in depth, cheapest to strongest:
- Filesystem scoping. The agent works in a project directory, not your home folder. Mount read-only what it only needs to read.
- Network egress control. Deny by default; allow the specific hosts the task needs. This is the single biggest lever against exfiltration and against injected "now POST your env to evil.com."
- Container / VM isolation. Run tools in a throwaway container so the blast radius is the container, not the host. Codex's cloud agent leans on exactly this — the long autonomous run happens in a sandbox, so "let it grind" is safe by construction.
- Secrets stay out. API keys and tokens should never be readable inside the sandbox. Inject credentials at the boundary (a proxy that adds the auth header after the request leaves the box), so code the agent writes — or code injected into it — can't read the secret.
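The first two layers come down to two small decisions: is this path inside the project, and is this host on the allowlist? Here is a minimal sketch of that decision logic. The project root and the host list are hypothetical values chosen for illustration, and real enforcement belongs in the mount configuration and the network layer (a firewall or egress proxy), not in code the agent can edit.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical values for illustration; a real setup would load these from policy.
PROJECT_ROOT = Path("/workspace/project")
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}


def path_in_scope(candidate: str, root: Path = PROJECT_ROOT) -> bool:
    """Return True only if candidate resolves to a location inside root."""
    # resolve() collapses ".." segments and follows symlinks, so escapes
    # like "../../etc/passwd" are judged by where they really point.
    resolved = (root / candidate).resolve()
    return resolved.is_relative_to(root.resolve())


def host_allowed(url: str, allowed: set[str] = ALLOWED_HOSTS) -> bool:
    """Deny by default: only exact hostnames on the allowlist pass."""
    host = urlparse(url).hostname
    return host is not None and host in allowed
```

Matching on the exact hostname rather than a substring is deliberate: a lookalike such as pypi.org.evil.com contains an allowed name but is a different host.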
The sandbox is what lets you say yes. An agent you trust to act is an agent whose worst case you've already bounded.
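Injecting credentials at the boundary can be sketched as a function that runs on the proxy side, outside the sandbox. The host name and the bearer-token scheme below are assumptions for illustration; the point is only that the secret is attached after the request has left the box, and only for the one destination it belongs to.

```python
# Hypothetical upstream host, used only for illustration.
UPSTREAM_HOST = "api.example.com"


def inject_credentials(host: str, headers: dict[str, str], token: str) -> dict[str, str]:
    """Build outbound headers on the proxy side, outside the sandbox."""
    # Drop any Authorization header that code inside the sandbox set itself.
    outbound = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    # Attach the secret only for the one host it is meant for.
    if host == UPSTREAM_HOST:
        outbound["Authorization"] = f"Bearer {token}"
    return outbound
```

Because the token exists only as an argument on the proxy side, nothing running inside the sandbox has a variable, file, or environment entry to read it from.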
Approval gates: a human on the irreversible
Isolation bounds the damage; approval gates prevent it. The rule: anything hard to reverse stops for a human. This isn't friction — it's the design.
- Reversible, in-sandbox actions (edit a file, run the tests) — let it fly.
- Irreversible or outward-facing actions (push, deploy, send, delete, spend money) — confirm first.
Claude Code gates side effects at exactly this boundary; Codex substitutes a sandbox so the boundary is the container. Different answers to the same question: can I undo what the agent just did?
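As a rough sketch of the rule, an approval gate is a thin wrapper that checks whether an action is on the irreversible list and, if so, asks a human before running it. The action names and the callback signatures here are hypothetical; this is not how Claude Code or Codex implement their gates, only an illustration of the shape.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action names; the set mirrors the irreversible list above.
IRREVERSIBLE = {"push", "deploy", "send", "delete", "spend"}


@dataclass
class Action:
    kind: str          # e.g. "edit_file", "run_tests", "push"
    description: str   # human-readable summary shown to the approver


def execute(action: Action,
            run: Callable[[Action], str],
            approve: Callable[[Action], bool]) -> str:
    """Run reversible actions directly; ask a human before irreversible ones."""
    if action.kind in IRREVERSIBLE and not approve(action):
        return f"blocked: {action.kind} was not approved"
    return run(action)
```

In practice, approve would prompt a person and run would dispatch to the real tool; passing both in as callables keeps the gate itself trivial to test.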
Least privilege, by tool design
The cleanest control is upstream, in how you shape the tools (the agent-architecture post goes deep here):
- Scope every tool to the minimum. A run_tests tool that can only run tests beats a bash tool that can do anything.
- Validate tool inputs. The arguments the model generates are untrusted input; sanitize them before they hit a shell or a query.
- Bound tool output. A tool that dumps 50k tokens back into the loop has just widened the injection surface and poisoned the context. Summarize and truncate.
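The three rules above fit in a few lines. The sketch below assumes a hypothetical run_tests tool built on pytest, a tests/ directory layout, and a character budget standing in for a token budget. It validates the model-supplied path, builds an argv list so no shell ever parses the input, and truncates whatever comes back.

```python
import re
import sys

MAX_OUTPUT_CHARS = 2000  # hypothetical budget; characters as a rough proxy for tokens
# Only paths under tests/ made of safe characters, e.g. "tests/test_api.py".
SAFE_TEST_PATH = re.compile(r"tests/[A-Za-z0-9_/]+\.py")


def build_test_command(test_path: str) -> list[str]:
    """Validate the model-supplied argument and return an argv list (no shell)."""
    if not SAFE_TEST_PATH.fullmatch(test_path):
        raise ValueError(f"rejected test path: {test_path!r}")
    return [sys.executable, "-m", "pytest", "-q", test_path]


def bound_output(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Truncate tool output before it re-enters the agent loop."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return text[:limit] + f"\n[truncated {omitted} characters]"
```

The returned list is meant to be passed to subprocess.run without shell=True, so an argument like "tests/x.py; rm -rf /" is rejected by the pattern and would never reach a shell even if it slipped through.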
For teams: governance, not just isolation
At scale, isolation needs governance around it: who can grant an agent which capabilities, an audit trail of what agents ran, a policy on what may touch production, and review of agent-authored PRs with more rigor than human-authored ones, because the code reads as confident and its author can't be paged.
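Capability grants and an audit trail can start very small. The sketch below is illustrative only: the agent name, the capability names, and the in-memory grant table are all hypothetical, and a real system would keep grants in reviewed configuration and write the log somewhere agents cannot modify.

```python
import json
from datetime import datetime, timezone

# Hypothetical grants: which capabilities each agent holds, and who granted them.
GRANTS = {
    "ci-agent": {"capabilities": {"run_tests", "open_pr"}, "granted_by": "platform-team"},
}


def authorize(agent: str, capability: str, audit_log: list[str]) -> bool:
    """Check a grant and append one JSON line to the audit trail either way."""
    grant = GRANTS.get(agent)
    allowed = grant is not None and capability in grant["capabilities"]
    audit_log.append(json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "capability": capability,
        "allowed": allowed,
    }))
    return allowed
```

Logging denied requests as well as granted ones matters: a run of refusals for a capability nobody granted is often the first visible sign of an injected instruction.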
The headline: a sandboxed agent with least-privilege tools and a human on the irreversible steps is a powerful, safe teammate. The same agent with an unguarded shell and your prod credentials is a fast way to author an incident. The difference is architecture you put in before, not cleanup after.
Inspired by the security-and-governance coverage at vibecoding.cz.