Running capable code models locally: Ollama, llama.cpp, vLLM
There are two honest reasons to run a model locally: the code legally can't leave your network, or you want zero marginal cost and full control of the loop. "It's cheaper than the API" is usually not one of them once you price in the hardware and your time — so be clear which reason is yours before you buy a GPU.
What's actually runnable, by tier
The open-weight coding models in 2026 are genuinely good — not frontier, but good enough that the gap is a trade, not a cliff.
- Laptop (16–32 GB RAM, no discrete GPU or an Apple Silicon unified-memory box). Quantized 7B–14B coding models run usably. Think capable autocomplete and single-file help, not repo-scale agents. Apple Silicon punches above its weight here because the memory is unified.
- Workstation (one 24–48 GB GPU). 30B–70B class at 4-bit quantization. This is the sweet spot: a 70B coder at int4 is a real assistant, and it's all yours.
- Server (multi-GPU). The big open coders — 200B–480B mixture-of-experts models — at speed, for a team. This is where local stops being a compromise and starts being a deployment.
The three runtimes, and when each is right
You don't need to evaluate ten tools. There are three that matter:
- Ollama — the lazy default.
ollama run qwen3-coderand you're talking to a model in one line. Great DX, model library, an OpenAI-compatible endpoint so your existing tooling just works. Start here. Don't build anything else until it hurts. - llama.cpp — the engine under much of the ecosystem (Ollama included). Drop to it when you need control: specific quantization, CPU/Metal/CUDA tuning, embedding in your own app. It's the screwdriver behind the appliance.
- vLLM — the serving tier. When you need throughput for many users — continuous batching, paged attention, high concurrency — vLLM is the answer. It's a server, not a desktop tool; reach for it when "one user at a time" stops being enough.
Ollama for one developer, vLLM for a team, llama.cpp when you need to open the hood. Picking a heavier tool than your tier needs is the most common local-LLM mistake.
Quantization: the lever that makes it fit
Quantization is what turns "needs a data-center GPU" into "runs on the thing under my desk." It stores weights at lower precision — int8, int4, and below — shrinking memory and speeding inference.
# rough memory for a 70B model
fp16 ~140 GB -> data-center territory
int8 ~70 GB -> multi-GPU
int4 ~35 GB -> one big GPU, and it's fine
The honest part: you trade accuracy for the fit. The pleasant surprise is how little you trade down to about 4-bit — int4 is the widely-agreed sweet spot where the quality loss is small and the footprint collapse is huge. Below 4-bit the degradation gets real, fast. The GGUF format is the lingua franca for these quantized weights; it's what Ollama and llama.cpp consume.
If you remember one rule: int4 is the default, fp16 is for benchmarking, sub-4-bit is for when you're desperate.
The realistic verdict
- For privacy / air-gapped work, local is not a compromise — it's the only option, and 2026's open coders make it a workable one. A 70B int4 on a workstation is a real teammate for most tasks.
- For raw capability, the frontier hosted models (the benchmark shows the gap) are still ahead, and will be. If the task is hard and the code can leave, the API wins on quality-per-effort.
- For cost, do the math honestly. Heavy daily use across a team amortizes hardware fast; occasional use never pays back the GPU. The cloud's per-token meter is brutal at scale and a bargain at low volume — local inverts both.
The lazy recommendation: start with Ollama and a quantized Qwen-class coder before you spend a cent on hardware. Run it on what you already own, see whether the quality clears your bar, and only then decide if local is your story. Most people discover they want a hosted frontier model for the hard 20% and a local model for the private or high-volume 80% — and that hybrid is a perfectly good answer. The landscape post and the benchmark are where to calibrate that split.