Headroom: a compression layer between your agent and the model
Of all the cost levers, context compression is the one most people skip because it sounds risky — compress the input and surely the model gets dumber? Headroom is an open-source project built to call that bluff: it sits between your agent and the LLM and shrinks the context before it's sent, reporting accuracy held steady on standard benchmarks.
What it does
The problem it targets is token bloat: coding agents fire enormous tool outputs, logs, RAG chunks, and whole files at the model, most of which is structurally redundant. Headroom runs locally as a compression layer and routes content to type-aware compressors instead of treating everything as a blob of text:
- JSON gets structurally crushed (a dedicated
SmartCrusher). - Code is compressed via AST parsing — Python, JS, Go, Rust, Java, C++ — so structure survives.
- A small model trained on agentic traces handles the general case.
- A cache aligner stabilizes prefixes so provider-side prompt caching keeps hitting.
The numbers it reports on real workloads are the headline:
| Workload | Before → after | Saved |
|---|---|---|
| Code search (100 results) | 17,765 → 1,408 tok | ~92% |
| Incident debugging | 65,694 → 5,118 tok | ~92% |
| GitHub issue triage | 54,174 → 14,761 tok | ~73% |
General range: 60–95% fewer tokens, with near-identical scores on GSM8K, TruthfulQA, SQuAD v2, and BFCL.
The cheapest token is the one you never send. Headroom's bet is that most of your context is filler the model never needed — and on tool-heavy agent loops, it's mostly right.
How you actually run it
The thing I like: it meets you where you are. Three integration modes, in increasing invasiveness:
pip install "headroom-ai[all]"
# 1. zero-code: a local proxy your agent points at
headroom proxy --port 8787
# 2. wrap an agent CLI directly
headroom wrap claude # or codex / cursor / aider / copilot
# 3. library, if you want control
# from headroom import compress
It also runs as an MCP server (headroom_compress, headroom_retrieve, headroom_stats) and has special handling for the GitHub Copilot CLI subscription mode. The proxy mode is the lazy win — point your OpenAI-compatible client at localhost:8787 and you're compressing with no code change.
The clever parts
Two design choices stand out:
- Reversible compression (CCR). It caches the originals locally, so the model can call
headroom_retrieveto pull the full content back on demand. You compress optimistically and only pay the full tokens when the model actually needs the detail — the best of both. - Output shaping too. Beyond input,
HEADROOM_OUTPUT_SHAPER=1appends terseness guidance and routes effort down on routine steps — the same idea as Caveman and Ponytail, built in. Andheadroom learnmines failed sessions to write corrections into yourCLAUDE.md/AGENTS.md.
Where it fits — and where it doesn't
The honest scoping, which the project states itself:
- Reach for it when you run multiple coding agents daily and want savings with no code changes, or you want cross-agent memory and reversible compression.
- Skip it for single-provider setups that already rely on the provider's native compaction, or sandboxed environments where a local proxy doesn't fit.
In the cost stack, Headroom is the input-compression lever — and a strong one. Pair it with prompt caching (it's explicitly built to keep caches hitting), narrow retrieval, and the output-side tools, and you're well into order-of-magnitude territory. The capstone wires it all together.