latest / apple
← all posts
// apple · apple

Apple Silicon, MLX, and Core ML for on-device LLMs

Apple Silicon turned out to be accidentally great at LLM inference, and Apple has since leaned in hard. If you're running models on a Mac — or shipping AI features in a Mac or iOS app — you'll meet two frameworks: MLX and Core ML. They solve different problems and people constantly pick the wrong one. Here's the map.

Why the Mac is good at this

One reason: unified memory. On Apple Silicon, CPU and GPU (and the Neural Engine) share a single high-bandwidth memory pool. There's no "copy weights from system RAM to VRAM" step and no separate, smaller VRAM budget. A 64 GB Mac can load a model that needs ~40 GB of weights; the equivalent on a discrete GPU means a card with 48 GB of VRAM. That single architectural choice is most of why a laptop can run a 70B model at all (the hardware guide has the sizing math).

MLX — for running and training models now

MLX is Apple's open-source array framework — think NumPy/PyTorch shaped for Apple Silicon, with lazy evaluation and unified-memory awareness baked in. It's the right tool when your goal is inference or fine-tuning of LLMs on a Mac, today.

  • mlx-lm runs and serves text models in a couple of lines, with an OpenAI-compatible server so your existing tooling just points at localhost.
  • It loads the community's quantized weights directly (the mlx-community repos on Hugging Face), and does 4-bit/8-bit quantization itself.
  • It can fine-tune (LoRA/QLoRA) on-device — a genuinely useful thing a single Mac can do overnight.
# run a quantized coder, OpenAI-compatible, in one line
pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen3-Coder-30B-4bit
# now any OpenAI client hits http://localhost:8080/v1

For most readers of this blog, MLX (often via Ollama or LM Studio, which use MLX/llama.cpp under the hood) is the answer for local coding models. It tracks new architectures fast and squeezes the hardware well.

MLX is for you running a model on your Mac. Core ML is for shipping a model inside an app to someone else's device.

Core ML — for shipping models in apps

Core ML is the framework for embedding ML in a shipping macOS/iOS app. Its whole reason to exist is one thing MLX doesn't prioritize: the Apple Neural Engine (ANE) — a dedicated low-power accelerator that Core ML can schedule work onto, alongside GPU and CPU.

Reach for Core ML when:

  • You're shipping an AI feature in an app and want it to run on every user's device, efficiently, offline.
  • Battery and thermals matter — the ANE does inference at a fraction of the power of the GPU, which is the difference between a feature and a battery complaint on an iPhone.
  • You want OS integration: background execution, on-device privacy guarantees, App Store distribution without bundling a runtime.

The cost is friction: models must be converted to the Core ML format (coremltools), and not every operation or huge LLM converts cleanly. Apple ships first-party converted models and the Foundation Models framework for the system LLM, which sidesteps conversion for many app use cases entirely.

Choosing, in one table

You want to…Use
Run a 30B–70B coder on your Mac for devMLX (often via Ollama/LM Studio)
Fine-tune a model on-deviceMLX (LoRA/QLoRA)
Ship an on-device AI feature in a Mac/iOS appCore ML (or Foundation Models)
Squeeze battery/thermals on iPhone via the ANECore ML
Use Apple's built-in system LLM in your appFoundation Models framework

The practical workflow

For development and personal use: install Ollama or LM Studio (both lean on MLX/llama.cpp), pull a quantized model, point your editor or agent at the local OpenAI-compatible endpoint. You get private, zero-marginal-cost inference with almost no setup. When you need control — a specific quantization, a custom fine-tune — drop to mlx-lm directly.

For shipping apps: prototype with MLX to validate the model, then convert to Core ML (or use Foundation Models) for the production on-device path, so you get the ANE's efficiency and the OS's privacy and distribution story.

The headline: between unified memory and these two frameworks, Apple has quietly built the most frictionless on-device LLM stack for individual developers. The frontier still lives in the cloud — but a surprising amount of real work now fits on the laptop you already carry.

#apple#mlx#coreml