The architecture that cuts 99% of your LLM bill
"Cut 99% of costs" sounds like a landing-page lie. It isn't — but it's not one magic setting either. It's five independent levers that multiply, and the headline number is what you get when you actually stack them. Let's build it up honestly, with the math.
Lever 1 — prompt caching (the big one)
Most agent and RAG workloads re-send a huge, identical prefix every turn: the system prompt, tool definitions, the codebase, the retrieved documents. Prompt caching charges that prefix at roughly 0.1× after the first write. On a loop where 90% of the input is a stable prefix, you've cut input cost by ~80–90% with zero quality change.
# the silent cache killer
system = f"Current time: {now()}\n{BIG_STABLE_PROMPT}" # ← invalidates everything
# fix: put volatile content AFTER the stable, cached prefix
Caching is a prefix match — one byte change anywhere in the prefix invalidates the rest. A timestamp or unsorted JSON in your system prompt silently turns caching off. Verify with the cache-read counter; if it's zero, you have a leak. This lever alone is often the difference between a viable agent and an unaffordable one.
Lever 2 — model routing (don't pay Opus to rename a variable)
Not every call needs the frontier. Route by difficulty:
- Cheap/fast model for boilerplate, classification, simple edits, summarizing tool output.
- Frontier model only for the genuinely hard reasoning.
If 80% of your calls drop from a $15/Mtok model to a $1/Mtok one, your blended input price falls by more than half before any other lever. The benchmark is sorted by price for exactly this — pick the cheapest model that clears the task's bar, not the most capable one available.
Lever 3 — the Batch API (50% off, for free)
Anything not latency-sensitive — overnight evals, bulk classification, embeddings, report generation — runs on the Batch API at half price, same models, same quality. If a third of your volume is asynchronous, that's a flat 50% off a third of the bill for the cost of one config change.
Lever 4 — context compression (send fewer tokens in)
The cheapest token is the one you never send. Tool outputs, logs, RAG chunks, and file dumps are full of bytes the model doesn't need. A compression layer like Headroom sits between agent and model and reports 60–95% fewer input tokens on real workloads while preserving accuracy. Combine with narrow retrieval — don't fetch 20 chunks when 3 answer the question.
Lever 5 — output shaping (send fewer tokens out)
Output tokens cost 3–5× input. Two moves cut them hard:
- Terse prose. Caveman drops filler and reports ~65% fewer output tokens; a brevity instruction in your system prompt does a lighter version for free.
- Less code. Ponytail makes the agent write the laziest solution that works — measured at 47–77% less cost because it stops generating speculative abstractions you'd delete anyway.
The math, honestly
Levers multiply on the fractions that survive. A worked example for a cache-heavy coding agent:
| Lever | Cost surviving |
|---|---|
| Prompt caching (90% stable prefix) | ~0.20× input |
| Model routing (cheap model for 80% of calls) | ~0.45× blended |
| Context compression (Headroom, ~70% off input) | ~0.30× input |
| Output shaping (Caveman + Ponytail) | ~0.40× output |
You cannot just multiply all of those to "0.5% remaining" — they overlap (caching and compression both act on input) and not every call hits every lever. But even discounting heavily for overlap, stacking three or four of these routinely lands real bills at 1–5% of the naive cost — an order of magnitude, sometimes two. That's the "99%" — a compounded ceiling, not a single switch.
No single lever saves 99%. Five honest levers, multiplied, get you most of the way there — and every one of them is config or a thin proxy, not a model downgrade.
The build order (lazy first)
1. Turn on prompt caching and audit it. Biggest win, lowest effort. Do this before anything else.
2. Route the easy calls to a cheap model. One if statement's worth of logic.
3. Move async work to the Batch API. A flag.
4. Add compression and output shaping once a profiler — sorry, a bill — says the first three weren't enough.
Don't build the compression proxy before you've turned on caching. That's the ponytail rule: the laziest lever that moves the number is the one to pull first. The capstone post stacks all of these into one concrete setup.