latest / savings
← all posts
// savings · savings

Stacking it all: ultra token savings at the same quality

Each post in this series pulls one lever. This is the one where we stack them. The thesis: token savings compound, and the quality cost is near zero — because every lever here either removes tokens that carried no information, or swaps a model down on tasks that didn't need the frontier. Done right, the output the user sees is the same; only the bill changes.

The full stack

Five layers, input to output, each covered in its own post:

your code
   │
   ├─ 1. model routing        cheap model unless the task is genuinely hard
   ├─ 2. prompt caching       0.1× on the stable prefix you resend every turn
   ├─ 3. narrow retrieval     fetch 3 relevant chunks, not 20 plausible ones
   ├─ 4. context compression  Headroom: 60–95% fewer input tokens
   │                            ▼ (the model)
   ├─ 5a. terse prose          Caveman: ~65% fewer output tokens
   └─ 5b. lazy code            Ponytail: 80–94% less code, 47–77% cheaper

Layers 1–4 shrink what goes in; 5a/5b shrink what comes out. They're orthogonal — each acts on a different part of the request — which is exactly why they multiply instead of overlapping into nothing.

A concrete setup

Here's the whole thing wired for a coding agent, lazy-first:

# 4 — context compression in front of the agent (zero code change)
headroom proxy --port 8787
export OPENAI_BASE_URL=http://localhost:8787

# 5a + 5b — output shaping, on every session
/caveman full       # terse prose
/ponytail full      # lazy code
# 1 + 2 — routing + caching in your own calls
model = "haiku" if is_simple(task) else "opus"      # route by difficulty
messages = [
    {"role": "system", "content": STABLE_PROMPT,    # cached prefix — never edit mid-session
     "cache_control": {"type": "ephemeral"}},
    *history,                                        # volatile content AFTER the breakpoint
]
# 3 — retrieve narrow: top-3 reranked chunks, not top-20 (see /p/rag-that-retrieves)

That's it. A proxy, two skill commands, and a routing if. No model downgrade on the hard tasks, no quality knob touched.

The honest math

You cannot multiply every fraction and claim 99.9% — the levers overlap (caching and compression both act on input) and not every call hits every layer. But here's a defensible accounting for a cache-heavy agent loop:

LayerActs onSurviving cost
Prompt cachinginput prefix~0.20×
Context compressionuncached input~0.30×
Model routingblended across calls~0.45×
Output shaping (prose + code)output~0.35×

Input side: caching and compression together land uncached-input cost well under 0.1× of naive. Output side: ~0.35×. Blend in routing, and a typical agent bill settles around 2–5% of where it started — a 20–50× reduction. Not the literal 99.9% the naive product suggests, but unmistakably an order of magnitude, often two. That's the number the cost-architecture post promised, delivered.

The reason quality survives: not one of these levers makes the model dumber. They remove filler tokens and route easy work to cheap models. The frontier still does the hard thinking — it just stops reading and writing things nobody needed.

Why quality holds (the part that matters)

Each lever is quality-neutral by construction:

  • Caching is byte-identical content at a lower price. Zero effect on output.
  • Routing uses the frontier exactly where it's needed; the cheap model only handles work it was always capable of. Calibrate with the benchmark.
  • Compression is reversible — the model pulls full detail back on demand, and accuracy holds on standard evals.
  • Terse prose often improves answers (brevity reverses performance hierarchies) — just keep it on the working channel, not customer-facing text.
  • Lazy code is a better default — less to maintain, fewer 3am incidents — with security and validation explicitly protected.

The build order

Don't install everything on day one. Pull levers in order of payoff-per-effort:

1. Turn on prompt caching, audit the cache-read counter. Biggest win, lowest effort.

2. Route easy calls to a cheap model. One if.

3. Add Ponytail and Caveman. Two commands, output cost drops immediately, code quality improves.

4. Add Headroom and tighten retrieval when a bill — not a hunch — says 1–3 weren't enough.

That ordering is itself the ponytail rule: pull the laziest lever that moves the number first, and stop the moment the bill stops hurting. You rarely need all five. But when you do, they're sitting right here, and they stack.

Questions, corrections, or your own numbers? ai@jakubjirak.com.

#savings#cost#tooling