
The LLM coding benchmark

Frontier and open coding models on one screen: context window size, how well they close real GitHub issues, and what they cost.

Legend: open = weights you can self-host; closed = API only; SWE-bench Verified = real-issue solve rate. Prices are in USD per million tokens.

| Model | Vendor | Weights | Context | SWE-bench Verified | Input /M | Output /M | Best for |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | closed | 1M | 78% | $5 | $25 | Frontier agentic coding |
| GLM-5 | Zhipu AI | open | 200K | 77.8% | $1 | $3 | Strong open coder, 77.8% SWE-Verified |
| GPT-5.5 Codex | OpenAI | closed | 400K | 77% | $5 | $30 | Tuned for the Codex agent |
| GPT-5.5 | OpenAI | closed | 400K | 76% | $5 | $30 | OpenAI flagship |
| Claude Sonnet 4.6 | Anthropic | closed | 1M | 73% | $3 | $15 | Best capability/price for agents |
| Qwen3.7-Max | Alibaba | closed | 1M | 72.5% | $2.5 | $7.5 | 1M-context agent model |
| Grok 4.3 | xAI | closed | 256K | 71% | $1.25 | $2.5 | Cheap frontier, fast |
| Gemini 3.1 Pro | Google | closed | 1M | 70% | $2 | $12 | 1M context, tiered pricing |
| Kimi K2.6 | Moonshot AI | open | 256K | 68% | $0.95 | $4 | Open agentic coder, 1T MoE |
| DeepSeek V4 Pro | DeepSeek | open | 128K | 68% | $0.44 | $0.87 | Open frontier, very cheap |
| Qwen3-Coder 480B | Alibaba | open | 256K | 66% | self-host | self-host | Top open code specialist |
| MiniMax M3 | MiniMax | open | 200K | 63% | $0.3 | $1.2 | Cheap, agentic, open |
| DeepSeek V4 Flash | DeepSeek | open | 128K | 60% | $0.14 | $0.28 | Cheapest frontier-class |
| Mistral Large 3 | Mistral | closed | 256K | 58% | $0.5 | $1.5 | Cheap European flagship |
| Gemini 3.5 Flash | Google | closed | 1M | 56% | $0.3 | $2.5 | Cheap, huge context, fast |
| Claude Haiku 4.5 | Anthropic | closed | 200K | 55% | $1 | $5 | Fast & cheap |
| Llama 4 Maverick | Meta | open | 1M | 55% | self-host | self-host | Open, long context |
| Codestral 25.x | Mistral | open | 256K | 51% | $0.3 | $0.9 | Lightweight code specialist |
| GLM-5.2 | Zhipu AI | open | 200K | n/a | $1 | $3 | Released Jun 2026; no benchmarks yet |

Methodology & caveats. Figures are indicative, compiled from public reports and vendor disclosures as of June 2026, and they change constantly. Treat this as a shape-of-the-field snapshot, not a live leaderboard.

SWE-bench Verified measures the share of real GitHub issues a model resolves with a passing patch. Scores depend heavily on the agent scaffold around the model, so cross-vendor numbers are directional, not head-to-head.

Prices are list API rates per million tokens and ignore caching, batching, and volume discounts. Per-task cost depends on how many tokens your loop actually burns. Self-host models have no per-token price but carry a real hardware-and-ops cost (see "running models locally").

Claude Fable 5 / Mythos 5 are omitted: they were withdrawn globally on 2026-06-12 under a US national-security order and are currently unavailable (see "the fallout"). n/a means the vendor published no benchmark (e.g. GLM-5.2 shipped without one).

Always verify against primary sources before a buying decision: SWE-bench, Aider polyglot, LMArena, and each vendor's pricing page.
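To make the per-task cost caveat concrete, here is a minimal Python sketch that turns list prices into a rough cost for one agent run. The three price pairs are copied from the table; the token counts (2M input, 100K output) are a purely hypothetical workload, and the names `ListPrice` and `task_cost` are illustrative, not part of any vendor SDK. As with the table, the result ignores caching, batching, and volume discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListPrice:
    input_per_m: float   # USD per million input tokens (list rate)
    output_per_m: float  # USD per million output tokens (list rate)


# List prices copied from the table above.
PRICES = {
    "Claude Opus 4.8": ListPrice(5.0, 25.0),
    "Claude Sonnet 4.6": ListPrice(3.0, 15.0),
    "DeepSeek V4 Flash": ListPrice(0.14, 0.28),
}


def task_cost(price: ListPrice, input_tokens: int, output_tokens: int) -> float:
    """Rough list-price cost in USD for one task, with no discounts applied."""
    return (
        input_tokens * price.input_per_m + output_tokens * price.output_per_m
    ) / 1_000_000


# Hypothetical workload: an agent loop that re-reads a lot of context
# and writes comparatively little. Replace with your own measurements.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 100_000

if __name__ == "__main__":
    for name, price in PRICES.items():
        cost = task_cost(price, INPUT_TOKENS, OUTPUT_TOKENS)
        print(f"{name}: ${cost:.2f}")
```

Under that assumed workload, input tokens dominate the bill, so the input rate matters more than the headline output rate. Measure the token counts of your own loop before comparing models this way.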