latest / tooling
← all posts
// tooling · tooling

Headroom: a compression layer between your agent and the model

Of all the cost levers, context compression is the one most people skip because it sounds risky — compress the input and surely the model gets dumber? Headroom is an open-source project built to call that bluff: it sits between your agent and the LLM and shrinks the context before it's sent, reporting accuracy held steady on standard benchmarks.

What it does

The problem it targets is token bloat: coding agents fire enormous tool outputs, logs, RAG chunks, and whole files at the model, most of which is structurally redundant. Headroom runs locally as a compression layer and routes content to type-aware compressors instead of treating everything as a blob of text:

  • JSON gets structurally crushed (a dedicated SmartCrusher).
  • Code is compressed via AST parsing — Python, JS, Go, Rust, Java, C++ — so structure survives.
  • A small model trained on agentic traces handles the general case.
  • A cache aligner stabilizes prefixes so provider-side prompt caching keeps hitting.

The numbers it reports on real workloads are the headline:

WorkloadBefore → afterSaved
Code search (100 results)17,765 → 1,408 tok~92%
Incident debugging65,694 → 5,118 tok~92%
GitHub issue triage54,174 → 14,761 tok~73%

General range: 60–95% fewer tokens, with near-identical scores on GSM8K, TruthfulQA, SQuAD v2, and BFCL.

The cheapest token is the one you never send. Headroom's bet is that most of your context is filler the model never needed — and on tool-heavy agent loops, it's mostly right.

How you actually run it

The thing I like: it meets you where you are. Three integration modes, in increasing invasiveness:

pip install "headroom-ai[all]"

# 1. zero-code: a local proxy your agent points at
headroom proxy --port 8787

# 2. wrap an agent CLI directly
headroom wrap claude   # or codex / cursor / aider / copilot

# 3. library, if you want control
#    from headroom import compress

It also runs as an MCP server (headroom_compress, headroom_retrieve, headroom_stats) and has special handling for the GitHub Copilot CLI subscription mode. The proxy mode is the lazy win — point your OpenAI-compatible client at localhost:8787 and you're compressing with no code change.

The clever parts

Two design choices stand out:

  • Reversible compression (CCR). It caches the originals locally, so the model can call headroom_retrieve to pull the full content back on demand. You compress optimistically and only pay the full tokens when the model actually needs the detail — the best of both.
  • Output shaping too. Beyond input, HEADROOM_OUTPUT_SHAPER=1 appends terseness guidance and routes effort down on routine steps — the same idea as Caveman and Ponytail, built in. And headroom learn mines failed sessions to write corrections into your CLAUDE.md / AGENTS.md.

Where it fits — and where it doesn't

The honest scoping, which the project states itself:

  • Reach for it when you run multiple coding agents daily and want savings with no code changes, or you want cross-agent memory and reversible compression.
  • Skip it for single-provider setups that already rely on the provider's native compaction, or sandboxed environments where a local proxy doesn't fit.

In the cost stack, Headroom is the input-compression lever — and a strong one. Pair it with prompt caching (it's explicitly built to keep caches hitting), narrow retrieval, and the output-side tools, and you're well into order-of-magnitude territory. The capstone wires it all together.

#tooling#cost#context