
Top AI Models: Best LLMs for Coding

Updated: June 13, 2025

This leaderboard aggregates performance data on various coding tasks from several major coding benchmarks: Livebench, Aider, ProLLM Acceptance, WebDev Arena, and CanAiCode. Models are ranked using Z-score normalization, which standardizes scores across different benchmarks with varying scales. The final ranking represents a balanced view of each model's overall coding capabilities, with higher Z-scores indicating better performance relative to other models.

| Rank | Model Name | Competitive Coding | AI-Assisted Code | Code Acceptance | Web Development | Coding Interview | Std. Dev. | Z-Score Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 🥇 Gemini 2.5 Pro (May 2025) | 72.87 | 0.77 🥈 | - | 1443 🥇 | - | 0.62 | 1.38 |
| 2 | 🥈 Claude 4 Opus (May 2025) | 73.58 | 0.72 🥉 | - | 1412 🥈 | - | 0.49 | 1.27 |
| 3 | 🥉 Gemini 2.5 Pro (Mar 2025) | 72.87 | 0.72 ⭐️ | - | 1408 🥉 | - | 0.52 | 1.23 |
| 4 | Claude Sonnet 4 (May 2025) | 78.25 🥈 | 0.71 ⭐️ | 0.98 🥉 | 1389 ⭐️ | - | 0.38 | 1.18 |
| 5 | Claude 3.7 Sonnet (Feb 2025) | 74.28 | 0.65 ⭐️ | 0.94 | 1357 ⭐️ | - | 0.43 | 0.88 |
| 6 | ChatGPT o3 | 76.71 | 0.80 🥇 | 0.97 | 1188 | 146 | 0.51 | 0.86 |
| 7 | Claude 3 Opus | - | - | - | - | 147 | - | 0.69 |
| 8 | Quasar Alpha | - | 0.55 | - | - | - | - | 0.67 |
| 9 | DeepSeek R1 | 76.07 | 0.57 | 0.96 | 1198 | 148 🥉 | 0.19 | 0.66 |
| 10 | ChatGPT o3 Mini | 77.86 🥉 | 0.60 | 0.97 | 1136 | - | 0.39 | 0.66 |
| 11 | ChatGPT 4.1 | 73.19 | 0.52 | 0.99 🥇 | 1256 | 147 | 0.09 | 0.66 |
| 12 | ChatGPT o4 Mini | 79.98 🥇 | 0.72 ⭐️ | 0.89 | 1100 | 147 | 0.70 | 0.64 |
| 13 | Optimus Alpha | - | 0.53 | - | - | - | - | 0.59 |
| 14 | ChatGPT 4.5 Preview | 76.07 | 0.45 | 0.97 | - | - | 0.27 | 0.55 |
| 15 | Claude 3.5 Sonnet (Oct 2024) | 73.90 | 0.52 | 0.94 | 1238 | 146 | 0.16 | 0.54 |
| 16 | ChatGPT o1 (Dec 2024) | - | 0.62 | 0.98 ⭐️ | 1045 | 148 🥇 | 0.56 | 0.47 |
| 17 | Qwen 3 235B | 66.41 | 0.60 | - | 1183 | 148 🥈 | 0.44 | 0.46 |
| 18 | Gemini 2.5 Flash (May 2025) | 62.83 | 0.55 | - | 1312 | - | 0.76 | 0.40 |
| 19 | Grok 3 Beta | 73.58 | 0.53 | 0.90 | - | - | 0.30 | 0.39 |
| 20 | ChatGPT 4.1 Mini | 72.11 | 0.32 | 0.98 🥈 | 1187 | 147 | 0.37 | 0.36 |
| 21 | DeepSeek Chat V3 | 68.91 | 0.48 | - | 1207 | - | 0.19 | 0.32 |
| 22 | Phi 4 | - | - | - | - | 143 | - | 0.30 |
| 23 | Mistral Large Instruct 2411 | - | - | - | - | 143 | - | 0.30 |
| 24 | ChatGPT 4o Latest | 77.48 ⭐️ | 0.45 | 0.96 | 964 | 146 | 0.67 | 0.28 |
| 25 | Mistral Large Instruct 2407 | - | - | - | - | 142 | - | 0.20 |
| 26 | ChatGPT o1 Mini (Sep 2024) | - | 0.33 | 0.97 | 1042 | 148 ⭐️ | 0.54 | 0.13 |
| 27 | DeepSeek V3 | 68.91 | 0.55 | 0.98 ⭐️ | 960 | 142 | 0.59 | 0.11 |
| 28 | ChatGPT 4o (Aug 2024) | 77.48 ⭐️ | 0.23 | 0.96 | 964 | 146 | 0.79 | 0.07 |
| 29 | Qwen 3 32B | 64.24 | 0.40 | 0.90 | - | 148 ⭐️ | 0.46 | 0.05 |
| 30 | ChatGPT 4o (Nov 2024) 🏆 MID AWARD | 77.48 ⭐️ | 0.18 | 0.96 | 964 | 142 | 0.80 | -0.05 |
| 31 | Gemini Exp 1206 | - | 0.38 | - | - | - | - | -0.08 |
| 32 | Gemini 2.5 Flash (Apr 2025) | 60.33 | 0.47 | - | 1144 | - | 0.55 | -0.15 |
| 33 | Gemini 2.0 Pro Exp (Feb 2025) | - | 0.36 | - | 1089 | - | 0.00 | -0.20 |
| 34 | ChatGPT 4.1 Nano | 63.21 | 0.09 | 0.95 | - | 141 | 0.69 | -0.38 |
| 35 | Grok 3 Mini Beta | 54.52 | 0.49 | 0.90 | - | - | 0.86 | -0.40 |
| 36 | ChatGPT 4o Mini (Jul 2024) | 74.22 | 0.04 | 0.89 | - | 134 | 0.84 | -0.44 |
| 37 | QwQ 32B | - | 0.21 | 0.87 | - | - | 0.27 | -0.60 |
| 38 | Qwen Max (Jan 2025) | 66.79 | 0.22 | - | 975 | - | 0.32 | -0.63 |
| 39 | Llama 4 Maverick | 54.19 | 0.16 | 0.92 | 1026 | 140 | 0.67 | -0.64 |
| 40 | Qwen 2.5 Coder 32B Instruct | 57.26 | 0.16 | 0.90 | 902 | 142 | 0.64 | -0.72 |
| 41 | Gemini 2.0 Flash Thinking Exp (Jan 2025) | - | 0.18 | - | 1030 | - | 0.22 | -0.77 |
| 42 | Gemini 2.0 Flash Exp | 59.31 | 0.22 | 0.94 | 980 | 120 | 0.71 | -0.87 |
| 43 | Llama-4-Scout-17B-16E-Instruct | 54.19 | - | - | 900 | 142 | 0.80 | -0.92 |
| 44 | DeepSeek Chat V2.5 | - | 0.18 | - | - | - | - | -1.01 |
| 45 | Codestral 25.01 | - | 0.11 | - | - | 131 | 0.22 | -1.10 |
| 46 | Gemma 3 27B IT | 48.94 | 0.05 | 0.91 | - | 133 | 0.85 | -1.13 |
| 47 | Claude 3.5 Haiku (Oct 2024) | 53.17 | 0.28 | 0.88 | 1133 | 107 | 1.21 | -1.14 |
| 48 | Yi Lightning | - | 0.13 | - | - | - | - | -1.24 |
| 49 | Command A (Mar 2025) | - | 0.12 | - | - | - | - | -1.28 |
| 50 | OpenHands LM 32B v0.1 | - | 0.10 | - | - | - | - | -1.36 |
| 51 | Gemini-1.5-Pro-002 | - | - | - | 892 | - | - | -1.38 |
| 52 | Mistral Nemo | 62.89 | - | 0.48 | 1167 | 118 | 1.48 | -1.53 |
| 53 | Llama3.1-8B Instruct | - | - | 0.50 | 810 | 126 | 0.87 | -2.21 |

* Scores are aggregated from various benchmarks using Z-score normalization. Missing values are excluded from the average calculation.
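To make that aggregation concrete, here is a minimal sketch of how a Z-Score Avg column could be computed. The benchmark names and numbers below are placeholders rather than the leaderboard's actual data, and the exact pipeline behind this page may differ; the sketch only illustrates the two stated rules: each benchmark is standardized independently, and missing values are left out of a model's average.

```python
from statistics import mean, pstdev

# Illustrative raw scores per benchmark (None = model not evaluated there).
# Benchmark names and numbers are placeholders, not the leaderboard's data.
raw_scores = {
    "competitive_coding": {"model_a": 78.2, "model_b": 72.9, "model_c": 54.2},
    "web_development":    {"model_a": 1389, "model_b": 1443, "model_c": None},
    "coding_interview":   {"model_a": None, "model_b": 147,  "model_c": 142},
}

def z_normalize(column):
    """Standardize one benchmark: z = (score - mean) / standard deviation."""
    observed = [v for v in column.values() if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    return {model: (v - mu) / sigma if v is not None else None
            for model, v in column.items()}

# Normalize each benchmark independently so their different scales
# (pass rates, Elo-style ratings, point totals) become comparable.
z_columns = {name: z_normalize(col) for name, col in raw_scores.items()}

# Average each model's z-scores, skipping benchmarks it was not run on.
models = {m for col in raw_scores.values() for m in col}
z_avg = {
    m: mean(col[m] for col in z_columns.values() if col[m] is not None)
    for m in models
}

for model, score in sorted(z_avg.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: Z-Score Avg = {score:+.2f}")
```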

Understanding the Leaderboard

Z-Score Avg: This shows how well a model performs across all benchmarks compared to other models. A positive score means the model performs better than average, while a negative score means it performs below average. Think of it as a standardized "overall performance score."

Std. Dev. (Standard Deviation): This measures how consistent a model's performance is across different benchmarks. A low value means the model performs similarly across all benchmarks: it's consistently good (or consistently average) at everything. A high value means the model's performance varies: it might be exceptional at some tasks but struggle with others.
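For example, two models can share the same Z-Score Avg while having very different Std. Dev. values; the snippet below uses invented per-benchmark z-scores purely to show how that difference is read.

```python
from statistics import mean, stdev

# Invented per-benchmark z-scores for two hypothetical models.
consistent_model = [0.55, 0.60, 0.50, 0.58]    # similar result on every benchmark
spiky_model      = [1.80, -0.70, 1.10, 0.03]   # strong at some tasks, weak at others

for name, zs in [("consistent", consistent_model), ("spiky", spiky_model)]:
    # Same average, very different spread: only Std. Dev. tells them apart.
    print(f"{name}: Z-Score Avg = {mean(zs):+.2f}, Std. Dev. = {stdev(zs):.2f}")
```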

Benchmark Emojis: 🥇, 🥈, 🥉 indicate that the model ranked in the top 3 for that specific benchmark. ⭐️ indicates the model is in the top 15% for that specific benchmark (excluding the top 3 medalists).
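In other words, the emoji labels follow a simple per-benchmark rule: medal the top three models that have a score on that benchmark, then star anything else that falls within roughly the top 15%. The function below is a hypothetical re-implementation of that rule for illustration only, not the code behind this page.

```python
import math

def label_benchmark(scores, higher_is_better=True):
    """Assign 🥇/🥈/🥉 to the top three models and ⭐️ to the rest of the top 15%.

    `scores` maps model name -> raw benchmark score; models without a score
    on this benchmark are simply omitted from the dict.
    """
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    star_cutoff = math.ceil(len(ranked) * 0.15)  # size of the "top 15%" group
    labels = {}
    for position, model in enumerate(ranked):
        if position < 3:
            labels[model] = "🥇🥈🥉"[position]  # medals for ranks 1-3
        elif position < star_cutoff:
            labels[model] = "⭐️"  # top 15%, excluding the medalists
        else:
            labels[model] = ""
    return labels
```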

About the Benchmarks:

| Benchmark | Description | Strength | Weakness |
| --- | --- | --- | --- |
| Competitive Coding (Livebench) | Tests LLMs on their ability to generate complete code solutions for competitive programming problems (e.g., LeetCode) and to complete partially provided code solutions. | Resistant to contamination, with objective, frequently updated questions. | Limited to specific coding tasks; does not cover broader software development aspects. |
| AI-Assisted Code (Aider) | Focuses on AI-assisted coding, measuring how well LLMs can work with existing codebases, understand context, and make useful modifications. | Tests practical utility in existing projects. | Depends heavily on the quality and style of the initial codebase. |
| Code Acceptance (ProLLM) | Measures the rate at which code generated by LLMs is accepted by professional developers or automated checks in a simulated professional workflow. | Reflects practical, real-world acceptance criteria. | Acceptance can be subjective or influenced by specific project guidelines. |
| Web Development (WebDev Arena) | Assesses LLMs on tasks related to web development, including HTML, CSS, and JavaScript generation and debugging. | Specific to a common and important domain of coding. | May not be representative of performance in other coding domains like systems programming or data science. |
| Coding Interview (CanAiCode) | Tests a wide range of coding capabilities, from simple algorithmic tasks to more complex problem-solving. | Self-evaluating tests across multiple languages (Python, JavaScript) with controlled sandbox environments. | Currently focuses on junior-level coding tasks and has a smaller test suite (12 tests) compared to other benchmarks. |