// cost · cost

Local-first, last-mile-paid: the model cascade that runs mostly free

$ Jakub Jirák · May 25 · 11 min

Here's a claim worth taking seriously: most steps of most agentic tasks don't need a frontier model. Drafting, filtering, summarizing tool output, classifying, simple edits — a capable local model handles all of it at zero marginal cost. If you architect for that, you spend paid tokens only on the genuinely hard last mile. This is the model cascade, and it's the most underused cost architecture in the field.

The ladder

Arrange your models cheapest-to-dearest and do each piece of work at the lowest tier that can do it. Prices from the benchmark:

# the cascade — work flows up only when a tier can't clear the bar
LOCAL   (free)        drafting, filtering, summarizing tool output,
                      classification, retrieval pre-filter, simple edits
  │  escalate on: verification fails / low confidence / known-hard
  ▼
HAIKU   ($1 / $5)     cheap-paid: structured tasks the local model fumbles,
                      quick reasoning, cleanup of a local draft
  ▼
SONNET  ($3 / $15)    the workhorse: real reasoning, agentic coding,
                      multi-file changes
  ▼
OPUS    ($5 / $25)    the last mile only: the subtle bug, the gnarly refactor,
  / FABLE             the thing every cheaper tier already failed

The shape that matters: a wide free base, a narrow expensive tip. If local clears 70–80% of steps and the paid middle handles most of the rest, Opus touches a few percent of the work — and your blended cost looks nothing like "Opus for everything."

Don't pay Opus prices to rename a variable, summarize a log, or draft boilerplate. Pay frontier prices for frontier-hard work, and run everything else for free on hardware you already own.

The escalation triggers — how a tier decides to give up

The cascade lives or dies on when to step up. Four signals, best first:

Verification-gated (the gold standard). The local model produces; a deterministic check runs — tests pass, the JSON validates, the linter is clean, a script exits zero. Pass → done, for free. Fail → escalate one tier. When you have an oracle, this is the cleanest, cheapest router there is: you only pay for capability you provably needed.
Confidence / self-consistency. No hard oracle? Sample the local model a few times; if the answers disagree or it hedges, that's your signal to escalate. Agreement is cheap evidence; disagreement buys a frontier call.
Difficulty classification. Some tasks are obviously hard — route them straight to the right tier instead of wasting a doomed local round-trip. A tiny classifier (or a heuristic on task type) earns its keep.
Retry-ladder. The simplest fallback: fail at tier N, retry at tier N+1, repeat. Crude but effective when paired with a verifier.

What it looks like in code

# verification-gated cascade: cheapest tier that passes wins
tiers = [local, haiku, sonnet, opus]
for model in tiers:
    result = model.run(task)
    if verify(result):          # tests / schema / linter / property check
        return result           # paid for exactly the capability needed
# nothing passed -> surface to a human, don't keep burning frontier tokens

The verifier is the whole architecture. With a good one, escalation is automatic and you never over-pay; without one, you're guessing, and the cascade degrades to "hope the cheap model was right."

The economics, and the honest catch

Economics: a base that's free plus a tip that's rare is the steepest cost curve in this blog. Combined with the other levers — caching, batch, compression — it's how a real bill lands at a few percent of naive (the full stack).

The catch — name it, don't hide it:

A failed cheap attempt is wasted work. If local fails often and you re-do at Sonnet, you paid the latency of both. The cascade only wins when the cheap tiers succeed often enough — measure your escalation rate and tune the tiers to it.
Local latency. Prompt-processing on Apple Silicon or a local GPU is slower to ingest a big prompt than a frontier API. For latency-critical interactive steps, the free tier can be the wrong default.
Operational weight. Keeping a local model warm and a verifier honest is real work. Worth it at volume; overkill for occasional use.

Where it shines, where it doesn't

Shines: high-volume, verifiable, repetitive agentic work — exactly the workloads where a flat paid bill hurts most. It's also the continuity hedge (local can't be cut off) and the answer to subscription pricing that breaks under agentic load.
Doesn't: a one-off genuinely hard task. Don't build a four-tier cascade to fix one subtle bug — just call Opus. The cascade is infrastructure; it pays back on repetition, not on a single shot.

The lazy version

You do not start with four tiers and a trained classifier:

1. Two tiers. Local model drafts/filters the easy stuff; frontier does the rest. One if and a verifier. This alone captures most of the win.

2. Add a paid middle tier (Haiku or Sonnet) only when you measure local failing often enough that re-doing at Opus is expensive.

3. Add confidence/difficulty routing only when verification alone leaves money on the table.

That ordering is the ponytail rule again: the simplest cascade that moves the bill, then stop. Most teams discover two tiers — free local for the 80%, frontier for the hard 20% — is the whole answer, and the elaborate ladder was a fantasy. Build the two-tier version, measure, and let the bill tell you whether the middle tiers are worth it.

#cost#local#routing