Apple Silicon, MLX, and Core ML for on-device LLMs
Apple Silicon turned out to be accidentally great at LLM inference, and Apple has since leaned in hard. If you're running models on a Mac — or shipping AI features in a Mac or iOS app — you'll meet two frameworks: MLX and Core ML. They solve different problems and people constantly pick the wrong one. Here's the map.
Why the Mac is good at this
One reason: unified memory. On Apple Silicon, CPU and GPU (and the Neural Engine) share a single high-bandwidth memory pool. There's no "copy weights from system RAM to VRAM" step and no separate, smaller VRAM budget. A 64 GB Mac can load a model that needs ~40 GB of weights; the equivalent on a discrete GPU means a card with 48 GB of VRAM. That single architectural choice is most of why a laptop can run a 70B model at all (the hardware guide has the sizing math).
MLX — for running and training models now
MLX is Apple's open-source array framework — think NumPy/PyTorch shaped for Apple Silicon, with lazy evaluation and unified-memory awareness baked in. It's the right tool when your goal is inference or fine-tuning of LLMs on a Mac, today.
mlx-lmruns and serves text models in a couple of lines, with an OpenAI-compatible server so your existing tooling just points atlocalhost.- It loads the community's quantized weights directly (the
mlx-communityrepos on Hugging Face), and does 4-bit/8-bit quantization itself. - It can fine-tune (LoRA/QLoRA) on-device — a genuinely useful thing a single Mac can do overnight.
# run a quantized coder, OpenAI-compatible, in one line
pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen3-Coder-30B-4bit
# now any OpenAI client hits http://localhost:8080/v1
For most readers of this blog, MLX (often via Ollama or LM Studio, which use MLX/llama.cpp under the hood) is the answer for local coding models. It tracks new architectures fast and squeezes the hardware well.
MLX is for you running a model on your Mac. Core ML is for shipping a model inside an app to someone else's device.
Core ML — for shipping models in apps
Core ML is the framework for embedding ML in a shipping macOS/iOS app. Its whole reason to exist is one thing MLX doesn't prioritize: the Apple Neural Engine (ANE) — a dedicated low-power accelerator that Core ML can schedule work onto, alongside GPU and CPU.
Reach for Core ML when:
- You're shipping an AI feature in an app and want it to run on every user's device, efficiently, offline.
- Battery and thermals matter — the ANE does inference at a fraction of the power of the GPU, which is the difference between a feature and a battery complaint on an iPhone.
- You want OS integration: background execution, on-device privacy guarantees, App Store distribution without bundling a runtime.
The cost is friction: models must be converted to the Core ML format (coremltools), and not every operation or huge LLM converts cleanly. Apple ships first-party converted models and the Foundation Models framework for the system LLM, which sidesteps conversion for many app use cases entirely.
Choosing, in one table
| You want to… | Use |
|---|---|
| Run a 30B–70B coder on your Mac for dev | MLX (often via Ollama/LM Studio) |
| Fine-tune a model on-device | MLX (LoRA/QLoRA) |
| Ship an on-device AI feature in a Mac/iOS app | Core ML (or Foundation Models) |
| Squeeze battery/thermals on iPhone via the ANE | Core ML |
| Use Apple's built-in system LLM in your app | Foundation Models framework |
The practical workflow
For development and personal use: install Ollama or LM Studio (both lean on MLX/llama.cpp), pull a quantized model, point your editor or agent at the local OpenAI-compatible endpoint. You get private, zero-marginal-cost inference with almost no setup. When you need control — a specific quantization, a custom fine-tune — drop to mlx-lm directly.
For shipping apps: prototype with MLX to validate the model, then convert to Core ML (or use Foundation Models) for the production on-device path, so you get the ANE's efficiency and the OS's privacy and distribution story.
The headline: between unified memory and these two frameworks, Apple has quietly built the most frictionless on-device LLM stack for individual developers. The frontier still lives in the cloud — but a surprising amount of real work now fits on the laptop you already carry.