What hardware actually runs these models — decently

Everyone asks "what GPU do I need to run an LLM?" It's the wrong first question. The right one is: how many gigabytes does the model weigh after quantization, and does that fit in fast memory? Get that number and the hardware answer falls out.

The one rule: it has to fit in memory

A model runs well only when its weights sit in memory the compute can reach at high bandwidth — GPU VRAM, or on a Mac, unified memory. Spill past that and you're swapping to system RAM or disk, and tokens-per-second falls off a cliff. So sizing is arithmetic:

# memory ≈ params × bytes-per-weight (int4 = 0.5 bytes), + ~20% for KV cache & overhead
7B  @ int4       ≈ 4–5 GB    -> a laptop
32B @ int4       ≈ 18–22 GB  -> one 24GB GPU / 32GB Mac
70B @ int4       ≈ 38–44 GB  -> one 48GB GPU / 64GB Mac
235B MoE @ int4  ≈ 120 GB+   -> workstation / multi-GPU / 128–192GB Mac

Quantization is the lever that makes these fit — int4 is the sweet spot (see the quantization section). Everything below is just "what hardware gives you that many GB of fast memory, cheapest."
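If you'd rather compute the number than eyeball it, here is a minimal sketch of the sizing rule above. The function name and the flat 20% overhead are illustrative assumptions; real KV-cache use grows with context length, so treat the result as a floor for short prompts, not a guarantee.

```python
def estimate_memory_gb(params_billions: float,
                       bits_per_weight: float = 4.0,
                       overhead: float = 0.20) -> float:
    """Rough fast-memory footprint in GB (1 GB = 1e9 bytes).

    weights  = params × bytes-per-weight
    overhead = flat fraction for KV cache and runtime (assumed 20%)
    """
    weights_gb = params_billions * bits_per_weight / 8  # bits -> bytes
    return weights_gb * (1 + overhead)


for name, params in [("7B", 7), ("32B", 32), ("70B", 70), ("235B MoE", 235)]:
    print(f"{name:>8} @ int4 ~ {estimate_memory_gb(params):.1f} GB")
```

Pass `bits_per_weight=16` to see what the same model costs unquantized; the gap is the whole argument for int4.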

The tiers

Laptop, no discrete GPU (16–32 GB). 7B–14B coding models at int4, usefully. Good for autocomplete and single-file help, not repo-scale agents. On Windows/Linux without a GPU you're on CPU — slow but functional for small models.

Apple Silicon (M-series, 16–192 GB unified). The quiet winner for local LLMs, and why this blog leans Apple. Because memory is unified — CPU and GPU share one fast pool — a 64 GB MacBook can hold a 70B int4 model that would need a $5k+ GPU on the PC side. No 24 GB VRAM ceiling, no PCIe shuffling. Bandwidth (especially on Max/Ultra chips) is high enough that tokens-per-second is genuinely usable. A 128 GB Studio runs models most single GPUs can't load at all. The deep dive on the Apple stack — MLX and Core ML — is its own post.

One NVIDIA GPU (16–48 GB VRAM). The throughput king per dollar when it fits. A 24 GB card (RTX 4090-class) runs 32B int4 fast; 48 GB runs 70B int4. CUDA + vLLM gives the best concurrency for serving a team. The catch is the hard VRAM wall — exceed it and you fall back to system RAM and lose the speed that justified the card.

Multi-GPU / server. For the big mixture-of-experts open models (200B–480B) at speed, or many concurrent users. This is a deployment, not a desk toy — and the point where "local" stops being a compromise.

Pick the tier by the model you actually want to run, then buy the cheapest way to get those gigabytes of fast memory. For most individuals in 2026 that sentence ends in "a Mac."
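That selection rule can be sketched in a few lines. The capacities come from the tiers above; the labels, the `OPTIONS` list, and using "smallest capacity that fits" as a stand-in for "cheapest" are assumptions for illustration, since this post quotes no prices.

```python
# (label, GB of fast memory). Capacities are from the tiers above;
# the list itself is illustrative, not a buying guide.
OPTIONS = [
    ("laptop, 16 GB", 16),
    ("one 24 GB GPU", 24),
    ("32 GB Mac", 32),
    ("one 48 GB GPU", 48),
    ("64 GB Mac", 64),
    ("128 GB Mac", 128),
    ("192 GB Mac", 192),
]


def smallest_fit(required_gb, options=OPTIONS):
    """Return the label of the smallest option that holds required_gb, else None."""
    fitting = [(capacity, label) for label, capacity in options
               if capacity >= required_gb]
    return min(fitting)[1] if fitting else None


print(smallest_fit(20))   # one 24 GB GPU
print(smallest_fit(42))   # one 48 GB GPU
print(smallest_fit(300))  # None: multi-GPU / server territory
```

Swap in real prices as a third tuple field and sort on that instead if you want "cheapest" literally.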

Two numbers that decide speed

Once it fits, two specs predict the experience:

  • Memory bandwidth (GB/s): token generation is memory-bound, so this sets your tokens-per-second more than raw FLOPs. It's why an Ultra-class Mac feels faster than its TFLOPs suggest, and why a bandwidth-starved config disappoints even when the model "fits."
  • Prompt-processing compute: this sets time-to-first-token on long prompts (big RAG context, whole files). Apple is weaker here than a high-end NVIDIA card; on a Mac, ingesting a 100k-token prompt can take longer than generating the reply.
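Both numbers reduce to back-of-envelope division. A rough sketch, assuming a dense model where every generated token streams all the weights from memory once (so bandwidth divided by model size is a ceiling, not a measurement). The bandwidth and prefill figures below are made-up placeholders, not specs of any particular chip.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Ceiling on generation speed when decoding is memory-bound."""
    return bandwidth_gb_s / model_gb


def prefill_seconds(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Time to ingest the prompt before the first output token appears."""
    return prompt_tokens / prefill_tokens_per_sec


model_gb = 35  # 70B params × 0.5 bytes (int4 weights only)

# Placeholder bandwidths: doubling bandwidth doubles the ceiling.
print(f"{decode_tokens_per_sec(400, model_gb):.1f} tok/s")  # 11.4 tok/s
print(f"{decode_tokens_per_sec(800, model_gb):.1f} tok/s")  # 22.9 tok/s

# Placeholder prefill rate of 500 tok/s on a 100k-token prompt.
print(f"{prefill_seconds(100_000, 500):.0f} s to first token")  # 200 s to first token
```

Real throughput lands below the ceiling, and mixture-of-experts models read only their active experts per token, so this estimate is pessimistic for them.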

The honest buying advice

  • Already on a Mac with 32 GB+? You can run real models today. Install Ollama, pull a quantized coder, spend nothing.
  • Buying for local LLMs specifically? Apple Silicon with as much unified memory as you can afford is the least-fuss path for a single developer; a 48 GB NVIDIA card is the move if you need raw serving throughput or CUDA-only tooling.
  • Don't buy for the model you might run. Buy for the one you'll run weekly. The 70B that needs a second GPU is usually a worse daily driver than the 32B that's instant. YAGNI applies to hardware too.
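To check whether the machine you already own is fast enough, you can measure it instead of guessing. The sketch below talks to a locally running Ollama server over its HTTP API and turns the `eval_count` and `eval_duration` fields of the reply into tokens-per-second. It assumes Ollama is listening on its default port and that you have already pulled a model; the helper names are mine, and the model name in the usage note is a placeholder.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request (POST, because data is set)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send the prompt to the local server and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())
```

With the server running, `tokens_per_second(generate("your-coder-model", "Write a binary search in Python."))` gives a number you can compare across quantizations and model sizes before spending anything.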

The frontier still lives in the cloud (see the benchmark). But "decent local coding" is now a solved problem on hardware a lot of developers already own, and increasingly that hardware has an Apple logo.
