Updated: July 20, 2025
This leaderboard aggregates coding performance data from five major benchmarks: Livebench, Aider, ProLLM Acceptance, WebDev Arena, and CanAiCode. Models are ranked using Z-score normalization, which standardizes scores across benchmarks with different scales. The resulting ranking gives a balanced view of each model's overall coding capability, with a higher average Z-score indicating better performance relative to the other models.
Rank | Model Name | Competitive Coding | AI-Assisted Code | Code Acceptance | Web Development | Coding Interview | Std. Dev. | Z-Score Avg |
---|---|---|---|---|---|---|---|---|
1🥇 | Gemini 2.5 Pro | 73.90 | 0.79 ⭐️ | - | 1423.33 🥇 | - | 0.43 | 1.35 |
2🥈 | ChatGPT o3 Pro | 76.78 ⭐️ | 0.85 🥇 | - | - | - | 0.34 | 1.31 |
3🥉 | ChatGPT o3 High | 76.71 ⭐️ | 0.81 🥈 | - | - | - | 0.27 | 1.23 |
4 | Claude 4 Opus | 73.58 | 0.72 ⭐️ | - | 1403.61 🥉 | 147 | 0.40 | 1.07 |
5 | DeepSeek R1 | 76.07 | 0.71 | 0.96 | 1407.45 🥈 | 148 🥉 | 0.38 | 1.04 |
6 | Claude Sonnet 4 | 78.25 🥉 | 0.61 | 0.98 ⭐️ | 1378.38 ⭐️ | - | 0.33 | 1.04 |
7 | Claude 3.7 Sonnet | 74.28 | 0.65 | 0.97 | 1357.1 ⭐️ | - | 0.27 | 0.95 |
8 | O1 Preview | - | - | 0.98 🥉 | - | - | - | 0.90 |
9 | ChatGPT o3 | 76.71 ⭐️ | 0.77 ⭐️ | - | 1188.66 | 146 | 0.37 | 0.80 |
10 | Grok 4 | 71.34 | 0.80 🥉 | - | 1163.08 | - | 0.52 | 0.75 |
11 | Claude 3 Opus | 73.58 | - | - | - | 147 | 0.05 | 0.73 |
12 | abab7 | - | - | 0.96 | - | - | - | 0.69 |
13 | ChatGPT 4.1 | 73.19 | 0.52 | 0.99 🥇 | 1254.73 | 147 | 0.22 | 0.69 |
14 | Claude 3.5 Sonnet | 73.90 | 0.64 | 0.94 | 1237.7 | 146 | 0.12 | 0.66 |
15 | Kimi K2 | 71.78 | 0.59 | - | - | - | 0.05 | 0.62 |
16 | ChatGPT 4.5 Preview | 76.07 | 0.45 | 0.97 | - | - | 0.43 | 0.58 |
17 | Llama 3.1 Nemotron 70B | - | - | 0.95 | - | - | - | 0.56 |
18 | ChatGPT o3 Mini | 77.86 ⭐️ | 0.60 | 0.97 ⭐️ | 1091.69 | - | 0.50 | 0.56 |
19 | ChatGPT o4 Mini | 79.98 🥇 | 0.72 ⭐️ | 0.89 | 1101.36 | 147 | 0.57 | 0.55 |
20 | GPT-4 Turbo | - | - | 0.94 | - | - | - | 0.54 |
21 | Qwen 3 235B | 66.41 | 0.60 | - | 1181.7 | 148 🥈 | 0.20 | 0.50 |
22 | DeepSeek V3.1 | 68.91 | - | - | - | - | - | 0.49 |
23 | DeepSeek V3 | 68.91 | 0.55 | 0.98 ⭐️ | 1206.69 | 142 | 0.21 | 0.48 |
24 | ChatGPT o1 | - | 0.62 | 0.98 ⭐️ | 1044.88 | 148 🥇 | 0.58 | 0.43 |
25 | ChatGPT 4.1 Mini | 72.11 | 0.32 | 0.98 🥈 | 1187.59 | 147 | 0.52 | 0.42 |
26 | Gemini 2.5 Flash | 63.53 | 0.55 | 0.89 | 1298.81 | - | 0.39 | 0.41 |
27 | MiniMax-Text-01 | - | - | 0.93 | - | - | - | 0.39 |
28 | Quasar Alpha | - | 0.55 | - | - | - | - | 0.38 |
29 | Grok 3 Beta | 73.58 | 0.53 | 0.90 | 1142.64 | - | 0.28 | 0.33 |
30 | Optimus Alpha | - | 0.53 | - | - | - | - | 0.31 |
31 | ChatGPT 4o Mini | 79.98 🥈 | 0.72 | 0.89 | 1101.36 | 134 | 0.71 | 0.30 |
32 | Qwen 3 32B | 64.24 | 0.40 | 0.90 | - | 148 ⭐️ | 0.37 | 0.20 |
33 | Mistral Medium 3 | 61.48 | - | - | 1175.57 | - | 0.12 | 0.14 |
34 | Phi 4 | 60.59 | - | - | - | 143 | 0.17 | 0.14 |
35 | ChatGPT o1 Mini | - | 0.33 | 0.97 | 1041.83 | 148 ⭐️ | 0.68 | 0.12 |
36 | ChatGPT 4o | 77.48 ⭐️ | 0.27 | 0.96 | 964 | 146 | 0.85 | 0.09 |
37 | Mistral Large | 62.89 | - | 0.88 | - | 142 | 0.13 | 0.07 |
38 | Qwen 2.5 72B 🏆 MID AWARD | - | - | 0.89 | - | - | - | 0.04 |
39 | Gemini 2.5 Flash Lite | 59.25 | - | - | - | - | - | -0.11 |
40 | Grok 3 Mini Beta | 54.52 | 0.49 | - | - | - | 0.28 | -0.12 |
41 | ChatGPT 4.1 Nano | 63.92 | 0.09 | 0.95 | - | 141 | 0.82 | -0.16 |
42 | QwQ 32B Preview | - | - | 0.86 | - | - | - | -0.27 |
43 | Gemini 2.0 Pro | - | 0.36 | - | 1088.84 | - | 0.06 | -0.36 |
44 | Llama 3.3 70B | - | - | 0.85 | - | - | - | -0.38 |
45 | Llama 4 Maverick | 54.19 | 0.16 | 0.92 | 1026.87 | 140 | 0.56 | -0.40 |
46 | QwQ 32B | - | 0.26 | 0.87 | - | - | 0.31 | -0.50 |
47 | Gemini 1.5 Pro | - | - | 0.94 | 892.48 | - | 1.04 | -0.50 |
48 | Grok 2 | - | - | 0.83 | - | - | - | -0.50 |
49 | Llama 4 Scout 17B | 54.19 | - | 0.85 | 900.17 | 142 | 0.62 | -0.51 |
50 | Qwen 2.5 Coder 32B | 57.26 | 0.16 | 0.90 | 902.26 | 142 | 0.69 | -0.53 |
51 | Qwen Max | 66.79 | 0.22 | - | 974.97 | - | 0.64 | -0.55 |
52 | Mistral Small | 49.65 | - | 0.84 | - | - | 0.10 | -0.60 |
53 | Gemini 2.0 Flash | 59.31 | 0.22 | 0.94 | 1039.88 | 120 | 0.81 | -0.61 |
54 | Hunyuan Turbos | 50.35 | - | - | - | - | - | -0.66 |
55 | Jamba 1.5 Large | - | - | 0.82 | - | - | - | -0.68 |
56 | Gemma 3 27B | 48.94 | 0.05 | 0.91 | - | 133 | 0.67 | -0.73 |
57 | Qwen 3 30B | 47.47 | - | - | - | - | - | -0.84 |
58 | DeepSeek R1 Distill Qwen 32B | 47.03 | - | - | - | - | - | -0.87 |
59 | Claude 3.5 Haiku | 53.17 | 0.28 | 0.88 | 1132.99 | 107 | 1.13 | -0.89 |
60 | DeepSeek R1 Distill Llama 70B | 46.65 | - | - | - | - | - | -0.89 |
61 | Gemini 2.0 Flash Thinking | - | 0.18 | - | 1029.57 | - | 0.24 | -0.91 |
62 | Command A | 54.26 | 0.12 | - | - | - | 0.50 | -0.91 |
63 | Nova Pro | - | - | 0.77 | - | - | - | -1.12 |
64 | Codestral | - | 0.11 | - | - | 131 | 0.31 | -1.14 |
65 | Gemma 3 12B | 42.16 | - | - | - | - | - | -1.17 |
66 | Llama 3.1 8B | - | - | - | - | 126 | - | -1.30 |
67 | Yi Lightning | - | 0.13 | - | - | - | - | -1.37 |
68 | Mistral Nemo | 62.89 | - | 0.48 | 1167 | 118 | 1.72 | -1.42 |
69 | Llama 3.1 405B | - | - | 0.80 | 809.67 | - | 0.61 | -1.46 |
70 | OpenHands LM 32B | - | 0.10 | - | - | - | - | -1.48 |
71 | Gemma 2 27B | - | - | 0.72 | - | - | - | -1.57 |
72 | Gemma 2 9B | - | - | 0.71 | - | - | - | -1.66 |
73 | Gemma 3n E4B | 31.48 | - | - | - | - | - | -1.83 |
74 | Command R Plus | 27.13 | - | - | - | - | - | -2.10 |
75 | Command R | 26.10 | - | - | - | - | - | -2.16 |
76 | Gemma 3n E2B | 16.44 | - | - | - | - | - | -2.76 |
77 | Gemma 3 4B | 15.68 | - | - | - | - | - | -2.81 |
78 | GPT-3.5 Turbo | - | - | 0.58 | - | - | - | -2.98 |
* Scores are aggregated from various benchmarks using Z-score normalization. Missing values are excluded from the average calculation.
Z-Score Avg: This shows how well a model performs across all benchmarks compared to other models. A positive score means the model performs better than average, while a negative score means it performs below average. Think of it as a standardized "overall performance score."
Std. Dev. (Standard Deviation): This measures how consistent a model's performance is across the different benchmarks. A low value means the model performs similarly across all benchmarks: it is consistently good (or consistently average) at everything. A high value means the model's performance varies: it might be exceptional at some tasks but struggle with others.
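To make the aggregation concrete, here is a minimal Python sketch of the approach described above: each benchmark's scores are standardized into Z-scores, and each model's Z-Score Avg and Std. Dev. are computed over whichever benchmarks it has results for, with missing values simply skipped. The model names, scores, and helper functions below are hypothetical illustrations, not the leaderboard's actual pipeline.

```python
import math

# Hypothetical raw scores per model (None = benchmark not run).
# Values are illustrative, not the leaderboard's actual data.
raw = {
    "Model A": {"livebench": 76.7, "aider": 0.81, "prollm": None, "webdev": None,   "canaicode": None},
    "Model B": {"livebench": 73.9, "aider": 0.79, "prollm": None, "webdev": 1423.3, "canaicode": None},
    "Model C": {"livebench": 57.3, "aider": 0.16, "prollm": 0.90, "webdev": 902.3,  "canaicode": 142},
}

benchmarks = ["livebench", "aider", "prollm", "webdev", "canaicode"]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# 1. Per-benchmark mean and standard deviation, using only the models
#    that actually have a score for that benchmark.
stats = {}
for b in benchmarks:
    vals = [scores[b] for scores in raw.values() if scores[b] is not None]
    if len(vals) >= 2 and std(vals) > 0:
        stats[b] = (mean(vals), std(vals))

# 2. Convert each available result to a Z-score, then average per model.
#    Missing values are simply excluded, as described in the note above.
for model, scores in raw.items():
    zs = [(scores[b] - m) / sd for b, (m, sd) in stats.items() if scores[b] is not None]
    z_avg = mean(zs)
    z_std = std(zs) if len(zs) > 1 else float("nan")
    print(f"{model}: Z-Score Avg = {z_avg:+.2f}, Std. Dev. = {z_std:.2f}")
```

The real leaderboard may differ in details such as sample vs. population standard deviation or minimum-coverage rules for sparsely benchmarked models; the sketch only demonstrates the Z-score idea.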
Benchmark Emojis: 🥇, 🥈, 🥉 indicate that the model ranked in the top 3 for that specific benchmark. ⭐️ indicates the model is in the top 15% for that specific benchmark (excluding the top 3 medalists).
Benchmark | Description | Strength | Weakness |
---|---|---|---|
Competitive Coding (Livebench)↗ | Tests LLMs on their ability to generate complete code solutions for competitive programming problems (e.g., LeetCode) and to complete partially provided code solutions. | Resistant to contamination thanks to objective, frequently updated questions. | Limited to specific coding tasks; does not cover broader software development aspects. |
AI-Assisted Code (Aider)↗ | Focuses on AI-assisted coding, measuring how well LLMs can work with existing codebases, understand context, and make useful modifications. | Tests practical utility in existing projects. | Results depend heavily on the quality and style of the initial codebase. |
Code Acceptance (ProLLM)↗ | Measures the rate at which code generated by LLMs is accepted by professional developers or automated checks in a simulated professional workflow. | Reflects practical, real-world acceptance criteria. | Acceptance can be subjective or influenced by specific project guidelines. |
Web Development (WebDev Arena)↗ | Assesses LLMs on tasks related to web development, including HTML, CSS, and JavaScript generation and debugging. | Specific to a common and important domain of coding. | May not be representative of performance in other coding domains like systems programming or data science. |
Coding Interview (CanAiCode)↗ | A benchmark that tests a wide range of coding capabilities, from simple algorithmic tasks to more complex problem-solving. | Self-evaluating tests across multiple languages (Python, JavaScript) with controlled sandbox environments. | Currently focuses on junior-level coding tasks and has a smaller test suite (12 tests) compared to other benchmarks. |