Updated: June 13, 2025
This leaderboard aggregates performance data on a range of coding tasks from several major coding benchmarks: LiveBench, Aider, ProLLM Acceptance, WebDev Arena, and CanAiCode. Models are ranked using Z-score normalization, which standardizes scores across benchmarks with very different scales. The final ranking represents a balanced view of each model's overall coding capabilities, with higher Z-scores indicating better performance relative to the other models.
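For a concrete sense of that normalization step, here is a minimal sketch. The data slice, column names, and use of pandas are illustrative assumptions; the leaderboard's exact procedure and any per-benchmark weighting are not published here.

```python
import pandas as pd

# Illustrative slice of the leaderboard: each column uses a different scale
# (LiveBench percentage, WebDev Arena Elo, ProLLM acceptance rate), so the
# raw values cannot be averaged directly.
scores = pd.DataFrame(
    {
        "competitive_coding": [72.87, 78.25, 76.71],
        "web_development": [1443, 1389, 1188],
        "code_acceptance": [None, 0.98, 0.97],  # missing benchmark stays NaN
    },
    index=["Gemini 2.5 Pro", "Claude Sonnet 4", "ChatGPT o3"],
)

# Z-score each benchmark column: (value - column mean) / column standard deviation.
# pandas skips NaN when computing the column statistics.
z = (scores - scores.mean()) / scores.std()
print(z.round(2))
```

After this step each column has mean 0 and unit spread, so a model's standing on a WebDev Arena Elo rating becomes directly comparable to its standing on a LiveBench percentage.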
Rank | Model Name | Competitive Coding | AI-Assisted Code | Code Acceptance | Web Development | Coding Interview | Std. Dev. | Z-Score Avg |
---|---|---|---|---|---|---|---|---|
1🥇 | Gemini 2.5 Pro (May 2025) | 72.87 | 0.77 🥈 | - | 1443 🥇 | - | 0.62 | 1.38 |
2🥈 | Claude 4 Opus (May 2025) | 73.58 | 0.72 🥉 | - | 1412 🥈 | - | 0.49 | 1.27 |
3🥉 | Gemini 2.5 Pro (Mar 2025) | 72.87 | 0.72 ⭐️ | - | 1408 🥉 | - | 0.52 | 1.23 |
4 | Claude Sonnet 4 (May 2025) | 78.25 🥈 | 0.71 ⭐️ | 0.98 🥉 | 1389 ⭐️ | - | 0.38 | 1.18 |
5 | Claude 3.7 Sonnet (Feb 2025) | 74.28 | 0.65 ⭐️ | 0.94 | 1357 ⭐️ | - | 0.43 | 0.88 |
6 | ChatGPT o3 | 76.71 | 0.80 🥇 | 0.97 | 1188 | 146 | 0.51 | 0.86 |
7 | Claude 3 Opus | - | - | - | - | 147 | - | 0.69 |
8 | Quasar Alpha | - | 0.55 | - | - | - | - | 0.67 |
9 | DeepSeek R1 | 76.07 | 0.57 | 0.96 | 1198 | 148 🥉 | 0.19 | 0.66 |
10 | ChatGPT o3 Mini | 77.86 🥉 | 0.60 | 0.97 | 1136 | - | 0.39 | 0.66 |
11 | ChatGPT 4.1 | 73.19 | 0.52 | 0.99 🥇 | 1256 | 147 | 0.09 | 0.66 |
12 | ChatGPT o4 Mini | 79.98 🥇 | 0.72 ⭐️ | 0.89 | 1100 | 147 | 0.70 | 0.64 |
13 | Optimus Alpha | - | 0.53 | - | - | - | - | 0.59 |
14 | ChatGPT 4.5 Preview | 76.07 | 0.45 | 0.97 | - | - | 0.27 | 0.55 |
15 | Claude 3.5 Sonnet (Oct 2024) | 73.90 | 0.52 | 0.94 | 1238 | 146 | 0.16 | 0.54 |
16 | ChatGPT o1 (Dec 2024) | - | 0.62 | 0.98 ⭐️ | 1045 | 148 🥇 | 0.56 | 0.47 |
17 | Qwen 3 235B | 66.41 | 0.60 | - | 1183 | 148 🥈 | 0.44 | 0.46 |
18 | Gemini 2.5 Flash (May 2025) | 62.83 | 0.55 | - | 1312 | - | 0.76 | 0.40 |
19 | Grok 3 Beta | 73.58 | 0.53 | 0.90 | - | - | 0.30 | 0.39 |
20 | ChatGPT 4.1 Mini | 72.11 | 0.32 | 0.98 🥈 | 1187 | 147 | 0.37 | 0.36 |
21 | DeepSeek Chat V3 | 68.91 | 0.48 | - | 1207 | - | 0.19 | 0.32 |
22 | Phi 4 | - | - | - | - | 143 | - | 0.30 |
23 | Mistral Large Instruct 2411 | - | - | - | - | 143 | - | 0.30 |
24 | ChatGPT 4o Latest | 77.48 ⭐️ | 0.45 | 0.96 | 964 | 146 | 0.67 | 0.28 |
25 | Mistral Large Instruct 2407 | - | - | - | - | 142 | - | 0.20 |
26 | ChatGPT o1 Mini (Sep 2024) | - | 0.33 | 0.97 | 1042 | 148 ⭐️ | 0.54 | 0.13 |
27 | DeepSeek V3 | 68.91 | 0.55 | 0.98 ⭐️ | 960 | 142 | 0.59 | 0.11 |
28 | ChatGPT 4o (Aug 2024) | 77.48 ⭐️ | 0.23 | 0.96 | 964 | 146 | 0.79 | 0.07 |
29 | Qwen 3 32B | 64.24 | 0.40 | 0.90 | - | 148 ⭐️ | 0.46 | 0.05 |
30 | ChatGPT 4o (Nov 2024) 🏆 MID AWARD | 77.48 ⭐️ | 0.18 | 0.96 | 964 | 142 | 0.80 | -0.05 |
31 | Gemini Exp 1206 | - | 0.38 | - | - | - | - | -0.08 |
32 | Gemini 2.5 Flash (Apr 2025) | 60.33 | 0.47 | - | 1144 | - | 0.55 | -0.15 |
33 | Gemini 2.0 Pro Exp (Feb 2025) | - | 0.36 | - | 1089 | - | 0.00 | -0.20 |
34 | ChatGPT 4.1 Nano | 63.21 | 0.09 | 0.95 | - | 141 | 0.69 | -0.38 |
35 | Grok 3 Mini Beta | 54.52 | 0.49 | 0.90 | - | - | 0.86 | -0.40 |
36 | ChatGPT 4o Mini (Jul 2024) | 74.22 | 0.04 | 0.89 | - | 134 | 0.84 | -0.44 |
37 | QwQ 32B | - | 0.21 | 0.87 | - | - | 0.27 | -0.60 |
38 | Qwen Max (Jan 2025) | 66.79 | 0.22 | - | 975 | - | 0.32 | -0.63 |
39 | Llama 4 Maverick | 54.19 | 0.16 | 0.92 | 1026 | 140 | 0.67 | -0.64 |
40 | Qwen 2.5 Coder 32B Instruct | 57.26 | 0.16 | 0.90 | 902 | 142 | 0.64 | -0.72 |
41 | Gemini 2.0 Flash Thinking Exp (Jan 2025) | - | 0.18 | - | 1030 | - | 0.22 | -0.77 |
42 | Gemini 2.0 Flash Exp | 59.31 | 0.22 | 0.94 | 980 | 120 | 0.71 | -0.87 |
43 | Llama 4 Scout | 54.19 | - | - | 900 | 142 | 0.80 | -0.92 |
44 | DeepSeek Chat V2.5 | - | 0.18 | - | - | - | - | -1.01 |
45 | Codestral 25.01 | - | 0.11 | - | - | 131 | 0.22 | -1.10 |
46 | Gemma 3 27B IT | 48.94 | 0.05 | 0.91 | - | 133 | 0.85 | -1.13 |
47 | Claude 3.5 Haiku (Oct 2024) | 53.17 | 0.28 | 0.88 | 1133 | 107 | 1.21 | -1.14 |
48 | Yi Lightning | - | 0.13 | - | - | - | - | -1.24 |
49 | Command A (Mar 2025) | - | 0.12 | - | - | - | - | -1.28 |
50 | OpenHands LM 32B v0.1 | - | 0.10 | - | - | - | - | -1.36 |
51 | Gemini-1.5-Pro-002 | - | - | - | 892 | - | - | -1.38 |
52 | Mistral Nemo | 62.89 | - | 0.48 | 1167 | 118 | 1.48 | -1.53 |
53 | Llama 3.1 8B Instruct | - | - | 0.50 | 810 | 126 | 0.87 | -2.21 |
* Scores are aggregated from various benchmarks using Z-score normalization. Missing values are excluded from the average calculation.
Z-Score Avg: This shows how well a model performs across all benchmarks compared to other models. A positive score means the model performs better than average, while a negative score means it performs below average. Think of it as a standardized "overall performance score."
Std. Dev. (Standard Deviation): This measures how consistent a model's performance is across different benchmarks. A low value means the model performs similarly across all benchmarks: it is consistently good (or consistently average) at everything. A high value means its performance varies: it might be exceptional at some tasks but struggle with others. A small computational sketch of both summary figures follows these notes.
Benchmark Emojis: 🥇, 🥈, 🥉 indicate that the model ranked in the top 3 for that specific benchmark. ⭐️ indicates the model is in the top 15% for that specific benchmark (excluding the top 3 medalists).
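As a rough illustration of how the Z-Score Avg and Std. Dev. columns relate, the sketch below aggregates hypothetical per-benchmark z-scores for a single model, dropping missing benchmarks as the footnote describes. The values and the sample-standard-deviation convention are assumptions, not the leaderboard's published method.

```python
import numpy as np

# Hypothetical per-benchmark z-scores for one model; np.nan marks the
# benchmarks it was not evaluated on, which are excluded from both figures.
z_scores = np.array([0.9, 1.1, np.nan, 1.4, np.nan])

z_avg = np.nanmean(z_scores)                # "Z-Score Avg": overall standing vs. other models
consistency = np.nanstd(z_scores, ddof=1)   # "Std. Dev.": spread across benchmarks (sample std assumed)
print(round(float(z_avg), 2), round(float(consistency), 2))
```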
Benchmark | Description | Strength | Weakness |
---|---|---|---|
Competitive Coding (LiveBench)↗ | Tests LLMs on their ability to generate complete code solutions for competitive programming problems (e.g., LeetCode) and to complete partially provided solutions. | Resistant to contamination thanks to objective, frequently updated questions. | Limited to specific coding tasks; does not cover broader software development aspects. |
AI-Assisted Code (Aider)↗ | Focuses on AI-assisted coding, measuring how well LLMs can work with existing codebases, understand context, and make useful modifications. | Tests practical utility in existing projects. | Depends heavily on the quality and style of the initial codebase. |
Code Acceptance (ProLLM)↗ | Measures the rate at which code generated by LLMs is accepted by professional developers or automated checks in a simulated professional workflow. | Reflects practical, real-world acceptance criteria. | Acceptance can be subjective or influenced by specific project guidelines. |
Web Development (WebDev Arena)↗ | Assesses LLMs on tasks related to web development, including HTML, CSS, and JavaScript generation and debugging. | Specific to a common and important domain of coding. | May not be representative of performance in other coding domains like systems programming or data science. |
Coding Interview (CanAiCode)↗ | A benchmark that tests a wide range of coding capabilities, from simple algorithmic tasks to more complex problem-solving. | Self-evaluating tests across multiple languages (Python, JavaScript) with controlled sandbox environments. | Currently focuses on junior-level coding tasks and has a smaller test suite (12 tests) compared to other benchmarks. |