The LLM coding benchmark

Frontier and open coding models on one screen: how large a context window each offers, how well it closes real GitHub issues, and what it costs.

Legend: open = weights you can self-host; closed = API only; SWE-bench Verified = real-issue solve rate.

Model (vendor)	Weights	Context (tokens)	SWE-bench Verified	Input ($/M tokens)	Output ($/M tokens)	Best for
Claude Opus 4.8 Anthropic	closed	1M	78%	$5	$25	Frontier agentic coding
GLM-5 Zhipu AI	open	200K	77.8%	$1	$3	Strong open coder, 77.8% SWE-Verified
GPT-5.5 Codex OpenAI	closed	400K	77%	$5	$30	Tuned for the Codex agent
GPT-5.5 OpenAI	closed	400K	76%	$5	$30	OpenAI flagship
Claude Sonnet 4.6 Anthropic	closed	1M	73%	$3	$15	Best capability/price for agents
Qwen3.7-Max Alibaba	closed	1M	72.5%	$2.5	$7.5	1M-context agent model
Grok 4.3 xAI	closed	256K	71%	$1.25	$2.5	Cheap frontier, fast
Gemini 3.1 Pro Google	closed	1M	70%	$2	$12	1M context, tiered pricing
Kimi K2.6 Moonshot AI	open	256K	68%	$0.95	$4	Open agentic coder, 1T MoE
DeepSeek V4 Pro DeepSeek	open	128K	68%	$0.44	$0.87	Open frontier, very cheap
Qwen3-Coder 480B Alibaba	open	256K	66%	self-host	self-host	Top open code specialist
MiniMax M3 MiniMax	open	200K	63%	$0.3	$1.2	Cheap, agentic, open
DeepSeek V4 Flash DeepSeek	open	128K	60%	$0.14	$0.28	Cheapest frontier-class
Mistral Large 3 Mistral	closed	256K	58%	$0.5	$1.5	Cheap European flagship
Gemini 3.5 Flash Google	closed	1M	56%	$0.3	$2.5	Cheap, huge context, fast
Claude Haiku 4.5 Anthropic	closed	200K	55%	$1	$5	Fast & cheap
Llama 4 Maverick Meta	open	1M	55%	self-host	self-host	Open, long context
Codestral 25.x Mistral	open	256K	51%	$0.3	$0.9	Lightweight code specialist
GLM-5.2 Zhipu AI	open	200K	n/a	$1	$3	Released Jun 2026 — no benchmarks yet

Methodology & caveats. Figures are indicative, compiled from public reports and vendor disclosures as of June 2026, and they change constantly. Treat this as a shape-of-the-field snapshot, not a live leaderboard.

SWE-bench Verified measures the share of real GitHub issues a model resolves with a passing patch. Scores depend heavily on the agent scaffold around the model, so cross-vendor numbers are directional, not head-to-head. n/a means the vendor published no benchmark (e.g. GLM-5.2 shipped without one).

Prices are list API rates per million tokens and ignore caching, batching, and volume discounts; per-task cost depends on how many tokens your loop actually burns. Self-host models have no per-token price but a real hardware-and-ops cost (see running models locally).

Claude Fable 5 / Mythos 5 are omitted: they were withdrawn globally on 2026-06-12 under a US national-security order and are currently unavailable (see the fallout).

Always verify against primary sources before a buying decision: SWE-bench, Aider polyglot, LMArena, and each vendor's pricing page.
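To make the per-task cost caveat concrete, here is a minimal sketch that turns list prices into a rough cost per agent task. The prices are copied from the table; the helper name `task_cost` and the token counts (2M input, 50K output per task) are assumptions chosen for illustration, not measurements of any real agent loop.

```python
def task_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """List-price cost in dollars of one task (hypothetical helper).

    Prices are dollars per million tokens, as in the table.
    Caching, batching, and volume discounts are ignored.
    """
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# (input $/M, output $/M) list prices taken from the table.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek V4 Pro": (0.44, 0.87),
}

# Assumed workload: an agent loop that re-reads a lot of context
# and writes comparatively little. Replace with your own counts.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 50_000

for model, (input_price, output_price) in PRICES.items():
    cost = task_cost(INPUT_TOKENS, OUTPUT_TOKENS, input_price, output_price)
    print(f"{model}: ${cost:.2f} per task")
```

Under these assumed token counts the three models come out at roughly $11.25, $6.75, and $0.92 per task. Input tokens dominate each total, so how much context a loop re-sends can matter more than the output price.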