The LLM coding benchmark
Frontier and open coding models on one screen: how big a window, how well they close real GitHub issues, and what they cost.
| Model (vendor) | Weights | Context | SWE-bench Verified | Input ($/M tokens) | Output ($/M tokens) | Best for |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 (Anthropic) | closed | 1M | 78% | $5 | $25 | Frontier agentic coding |
| GLM-5 (Zhipu AI) | open | 200K | 77.8% | $1 | $3 | Strong open coder, 77.8% SWE-Verified |
| GPT-5.5 Codex (OpenAI) | closed | 400K | 77% | $5 | $30 | Tuned for the Codex agent |
| GPT-5.5 (OpenAI) | closed | 400K | 76% | $5 | $30 | OpenAI flagship |
| Claude Sonnet 4.6 (Anthropic) | closed | 1M | 73% | $3 | $15 | Best capability/price for agents |
| Qwen3.7-Max (Alibaba) | closed | 1M | 72.5% | $2.5 | $7.5 | 1M-context agent model |
| Grok 4.3 (xAI) | closed | 256K | 71% | $1.25 | $2.5 | Cheap frontier, fast |
| Gemini 3.1 Pro (Google) | closed | 1M | 70% | $2 | $12 | 1M context, tiered pricing |
| Kimi K2.6 (Moonshot AI) | open | 256K | 68% | $0.95 | $4 | Open agentic coder, 1T MoE |
| DeepSeek V4 Pro (DeepSeek) | open | 128K | 68% | $0.44 | $0.87 | Open frontier, very cheap |
| Qwen3-Coder 480B (Alibaba) | open | 256K | 66% | self-host | self-host | Top open code specialist |
| MiniMax M3 (MiniMax) | open | 200K | 63% | $0.3 | $1.2 | Cheap, agentic, open |
| DeepSeek V4 Flash (DeepSeek) | open | 128K | 60% | $0.14 | $0.28 | Cheapest frontier-class |
| Mistral Large 3 (Mistral) | closed | 256K | 58% | $0.5 | $1.5 | Cheap European flagship |
| Gemini 3.5 Flash (Google) | closed | 1M | 56% | $0.3 | $2.5 | Cheap, huge context, fast |
| Claude Haiku 4.5 (Anthropic) | closed | 200K | 55% | $1 | $5 | Fast & cheap |
| Llama 4 Maverick (Meta) | open | 1M | 55% | self-host | self-host | Open, long context |
| Codestral 25.x (Mistral) | open | 256K | 51% | $0.3 | $0.9 | Lightweight code specialist |
| GLM-5.2 (Zhipu AI) | open | 200K | n/a | $1 | $3 | Released Jun 2026; no benchmarks yet |
Methodology & caveats. Figures are indicative, compiled from public reports and vendor disclosures as of June 2026, and they move constantly. Treat this as a shape-of-the-field snapshot, not a live leaderboard.

SWE-bench Verified measures the share of real GitHub issues a model resolves with a passing patch. Scores depend heavily on the agent scaffold around the model, so cross-vendor numbers are directional, not head-to-head. n/a = the vendor published no benchmark (e.g. GLM-5.2 shipped without one).

Prices are list API rates per million tokens and ignore caching, batching, and volume discounts. Per-task cost depends on how many tokens your loop actually burns. Self-host models have no per-token price but carry a real hardware-and-ops cost (see running models locally).

Claude Fable 5 / Mythos 5 are omitted: they were withdrawn globally on 2026-06-12 under a US national-security order and are currently unavailable.

Always verify against primary sources before a buying decision: SWE-bench, Aider polyglot, LMArena, and each vendor's pricing page.
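To make the per-task cost caveat concrete, here is a minimal Python sketch that turns list prices into a rough cost for one agent run. The two price pairs are copied from the table. The token counts (2M input, 50K output) are purely hypothetical, and the function name `task_cost` is an illustrative choice. Like the table, the sketch ignores caching, batching, and volume discounts.

```python
# List prices in USD per million tokens (input, output), copied from the table.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def task_cost(input_per_m, output_per_m, input_tokens, output_tokens):
    """Return the list-price cost in USD of one task.

    Prices are per million tokens; token counts are raw counts.
    """
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical agent run: these token counts are assumptions, not measurements.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 50_000

if __name__ == "__main__":
    for model, (inp, out) in PRICES.items():
        cost = task_cost(inp, out, INPUT_TOKENS, OUTPUT_TOKENS)
        print(f"{model}: ${cost:.2f} per task")
```

Under these assumed token counts the run costs about $6.75 on Claude Sonnet 4.6 and about $0.29 on DeepSeek V4 Flash. Input tokens dominate both totals, which is typical of agent loops that repeatedly re-read context, so measure your own loop's token usage before comparing models on price.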