Top AI Models: Best LLMs for Coding

Updated: July 20, 2025

This leaderboard aggregates performance data on various coding tasks from several major coding benchmarks: Livebench, Aider, ProLLM Acceptance, WebDev Arena, and CanAiCode. Models are ranked using Z-score normalization, which standardizes scores across different benchmarks with varying scales. The final ranking represents a balanced view of each model's overall coding capabilities, with higher Z-scores indicating better performance relative to other models.

| Rank | Model Name | Competitive Coding | AI-Assisted Code | Code Acceptance | Web Development | Coding Interview | Std. Dev. | Z-Score Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 🥇 Gemini 2.5 Pro | 73.90 | 0.79 ⭐️ | - | 1423.33 🥇 | - | 0.43 | 1.35 |
| 2 | 🥈 ChatGPT o3 Pro | 76.78 ⭐️ | 0.85 🥇 | - | - | - | 0.34 | 1.31 |
| 3 | 🥉 ChatGPT o3 High | 76.71 ⭐️ | 0.81 🥈 | - | - | - | 0.27 | 1.23 |
| 4 | Claude 4 Opus | 73.58 | 0.72 ⭐️ | - | 1403.61 🥉 | 147 | 0.40 | 1.07 |
| 5 | DeepSeek R1 | 76.07 | 0.71 | 0.96 | 1407.45 🥈 | 148 🥉 | 0.38 | 1.04 |
| 6 | Claude Sonnet 4 | 78.25 🥉 | 0.61 | 0.98 ⭐️ | 1378.38 ⭐️ | - | 0.33 | 1.04 |
| 7 | Claude 3.7 Sonnet | 74.28 | 0.65 | 0.97 | 1357.1 ⭐️ | - | 0.27 | 0.95 |
| 8 | O1 Preview | - | - | 0.98 🥉 | - | - | - | 0.90 |
| 9 | ChatGPT o3 | 76.71 ⭐️ | 0.77 ⭐️ | - | 1188.66 | 146 | 0.37 | 0.80 |
| 10 | Grok 4 | 71.34 | 0.80 🥉 | - | 1163.08 | - | 0.52 | 0.75 |
| 11 | Claude 3 Opus | 73.58 | - | - | - | 147 | 0.05 | 0.73 |
| 12 | abab7 | - | - | 0.96 | - | - | - | 0.69 |
| 13 | ChatGPT 4.1 | 73.19 | 0.52 | 0.99 🥇 | 1254.73 | 147 | 0.22 | 0.69 |
| 14 | Claude 3.5 Sonnet | 73.90 | 0.64 | 0.94 | 1237.7 | 146 | 0.12 | 0.66 |
| 15 | Kimi K2 | 71.78 | 0.59 | - | - | - | 0.05 | 0.62 |
| 16 | ChatGPT 4.5 Preview | 76.07 | 0.45 | 0.97 | - | - | 0.43 | 0.58 |
| 17 | Llama 3.1 Nemotron 70B | - | - | 0.95 | - | - | - | 0.56 |
| 18 | ChatGPT o3 Mini | 77.86 ⭐️ | 0.60 | 0.97 ⭐️ | 1091.69 | - | 0.50 | 0.56 |
| 19 | ChatGPT o4 Mini | 79.98 🥇 | 0.72 ⭐️ | 0.89 | 1101.36 | 147 | 0.57 | 0.55 |
| 20 | GPT-4 Turbo | - | - | 0.94 | - | - | - | 0.54 |
| 21 | Qwen 3 235B | 66.41 | 0.60 | - | 1181.7 | 148 🥈 | 0.20 | 0.50 |
| 22 | DeepSeek V3.1 | 68.91 | - | - | - | - | - | 0.49 |
| 23 | DeepSeek V3 | 68.91 | 0.55 | 0.98 ⭐️ | 1206.69 | 142 | 0.21 | 0.48 |
| 24 | ChatGPT o1 | - | 0.62 | 0.98 ⭐️ | 1044.88 | 148 🥇 | 0.58 | 0.43 |
| 25 | ChatGPT 4.1 Mini | 72.11 | 0.32 | 0.98 🥈 | 1187.59 | 147 | 0.52 | 0.42 |
| 26 | Gemini 2.5 Flash | 63.53 | 0.55 | 0.89 | 1298.81 | - | 0.39 | 0.41 |
| 27 | MiniMax-Text-01 | - | - | 0.93 | - | - | - | 0.39 |
| 28 | Quasar Alpha | - | 0.55 | - | - | - | - | 0.38 |
| 29 | Grok 3 Beta | 73.58 | 0.53 | 0.90 | 1142.64 | - | 0.28 | 0.33 |
| 30 | Optimus Alpha | - | 0.53 | - | - | - | - | 0.31 |
| 31 | ChatGPT 4o Mini | 79.98 🥈 | 0.72 | 0.89 | 1101.36 | 134 | 0.71 | 0.30 |
| 32 | Qwen 3 32B | 64.24 | 0.40 | 0.90 | - | 148 ⭐️ | 0.37 | 0.20 |
| 33 | Mistral Medium 3 | 61.48 | - | - | 1175.57 | - | 0.12 | 0.14 |
| 34 | Phi 4 | 60.59 | - | - | - | 143 | 0.17 | 0.14 |
| 35 | ChatGPT o1 Mini | - | 0.33 | 0.97 | 1041.83 | 148 ⭐️ | 0.68 | 0.12 |
| 36 | ChatGPT 4o | 77.48 ⭐️ | 0.27 | 0.96 | 964 | 146 | 0.85 | 0.09 |
| 37 | Mistral Large | 62.89 | - | 0.88 | - | 142 | 0.13 | 0.07 |
| 38 | Qwen 2.5 72B 🏆 MID AWARD | - | - | 0.89 | - | - | - | 0.04 |
| 39 | Gemini 2.5 Flash Lite | 59.25 | - | - | - | - | - | -0.11 |
| 40 | Grok 3 Mini Beta | 54.52 | 0.49 | - | - | - | 0.28 | -0.12 |
| 41 | ChatGPT 4.1 Nano | 63.92 | 0.09 | 0.95 | - | 141 | 0.82 | -0.16 |
| 42 | QwQ 32B Preview | - | - | 0.86 | - | - | - | -0.27 |
| 43 | Gemini 2.0 Pro | - | 0.36 | - | 1088.84 | - | 0.06 | -0.36 |
| 44 | Llama 3.3 70B | - | - | 0.85 | - | - | - | -0.38 |
| 45 | Llama 4 Maverick | 54.19 | 0.16 | 0.92 | 1026.87 | 140 | 0.56 | -0.40 |
| 46 | QwQ 32B | - | 0.26 | 0.87 | - | - | 0.31 | -0.50 |
| 47 | Gemini 1.5 Pro | - | - | 0.94 | 892.48 | - | 1.04 | -0.50 |
| 48 | Grok 2 | - | - | 0.83 | - | - | - | -0.50 |
| 49 | Llama 4 Scout 17B | 54.19 | - | 0.85 | 900.17 | 142 | 0.62 | -0.51 |
| 50 | Qwen 2.5 Coder 32B | 57.26 | 0.16 | 0.90 | 902.26 | 142 | 0.69 | -0.53 |
| 51 | Qwen Max | 66.79 | 0.22 | - | 974.97 | - | 0.64 | -0.55 |
| 52 | Mistral Small | 49.65 | - | 0.84 | - | - | 0.10 | -0.60 |
| 53 | Gemini 2.0 Flash | 59.31 | 0.22 | 0.94 | 1039.88 | 120 | 0.81 | -0.61 |
| 54 | Hunyuan Turbos | 50.35 | - | - | - | - | - | -0.66 |
| 55 | Jamba 1.5 Large | - | - | 0.82 | - | - | - | -0.68 |
| 56 | Gemma 3 27B | 48.94 | 0.05 | 0.91 | - | 133 | 0.67 | -0.73 |
| 57 | Qwen 3 30B | 47.47 | - | - | - | - | - | -0.84 |
| 58 | DeepSeek R1 Distill Qwen 32B | 47.03 | - | - | - | - | - | -0.87 |
| 59 | Claude 3.5 Haiku | 53.17 | 0.28 | 0.88 | 1132.99 | 107 | 1.13 | -0.89 |
| 60 | DeepSeek R1 Distill Llama 70B | 46.65 | - | - | - | - | - | -0.89 |
| 61 | Gemini 2.0 Flash Thinking | - | 0.18 | - | 1029.57 | - | 0.24 | -0.91 |
| 62 | Command A | 54.26 | 0.12 | - | - | - | 0.50 | -0.91 |
| 63 | Nova Pro | - | - | 0.77 | - | - | - | -1.12 |
| 64 | Codestral | - | 0.11 | - | - | 131 | 0.31 | -1.14 |
| 65 | Gemma 3 12B | 42.16 | - | - | - | - | - | -1.17 |
| 66 | Llama 3.1 8B | - | - | - | - | 126 | - | -1.30 |
| 67 | Yi Lightning | - | 0.13 | - | - | - | - | -1.37 |
| 68 | Mistral Nemo | 62.89 | - | 0.48 | 1167 | 118 | 1.72 | -1.42 |
| 69 | Llama 3.1 405B | - | - | 0.80 | 809.67 | - | 0.61 | -1.46 |
| 70 | OpenHands LM 32B | - | 0.10 | - | - | - | - | -1.48 |
| 71 | Gemma 2 27B | - | - | 0.72 | - | - | - | -1.57 |
| 72 | Gemma 2 9B | - | - | 0.71 | - | - | - | -1.66 |
| 73 | Gemma 3n E4B | 31.48 | - | - | - | - | - | -1.83 |
| 74 | Command R Plus | 27.13 | - | - | - | - | - | -2.10 |
| 75 | Command R | 26.10 | - | - | - | - | - | -2.16 |
| 76 | Gemma 3n E2B | 16.44 | - | - | - | - | - | -2.76 |
| 77 | Gemma 3 4B | 15.68 | - | - | - | - | - | -2.81 |
| 78 | GPT-3.5 Turbo | - | - | 0.58 | - | - | - | -2.98 |

* Scores are aggregated from various benchmarks using Z-score normalization. Missing values are excluded from the average calculation.
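
To make that concrete, here is a minimal sketch (in Python; not the leaderboard's actual code) of the normalization step: each benchmark's raw scores are converted to z-scores using only the models evaluated on that benchmark, so scales as different as Arena-style ratings and 0-1 acceptance rates become comparable. The `benchmark_z_scores` helper and the `raw_scores` structure are assumptions made for illustration.

```python
from statistics import mean, pstdev

def benchmark_z_scores(raw_scores):
    """Convert one benchmark's raw scores into z-scores (illustrative helper).

    Models not evaluated on this benchmark are marked None and skipped,
    matching the note above that missing values are excluded.
    """
    evaluated = {model: score for model, score in raw_scores.items() if score is not None}
    mu = mean(evaluated.values())
    sigma = pstdev(evaluated.values())
    return {model: (score - mu) / sigma for model, score in evaluated.items()}

# Tiny made-up example: two models evaluated, one missing.
print(benchmark_z_scores({"Model A": 79.98, "Model B": 73.90, "Model C": None}))
```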

Understanding the Leaderboard

Z-Score Avg: This shows how well a model performs across the benchmarks it was evaluated on, compared to other models. A positive score means the model performs better than average, while a negative score means it performs below average. Think of it as a standardized "overall performance score."

Std. Dev. (Standard Deviation): This measures how consistent a model's performance is across the different benchmarks. A low value means the model performs similarly across benchmarks: it is consistently good (or consistently average) at everything. A high value means its performance varies: it might be exceptional at some tasks but struggle with others.
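
The sketch below (illustrative Python, continuing the assumptions above) shows how both columns could be derived for one model: the Z-Score Avg is the mean of the z-scores from the benchmarks the model appears in, and the Std. Dev. is the spread of those same z-scores. The `per_benchmark` mapping, benchmark name to {model: z-score}, is a hypothetical structure used only for this example.

```python
from statistics import mean, pstdev

def model_summary(model, per_benchmark):
    """Return (z_score_avg, std_dev) for one model (illustrative helper).

    Only benchmarks the model was actually evaluated on contribute; with a
    single benchmark the spread is undefined, shown as "-" in the table.
    """
    zs = [scores[model] for scores in per_benchmark.values() if model in scores]
    z_avg = mean(zs)                              # "Z-Score Avg" column
    z_std = pstdev(zs) if len(zs) > 1 else None   # "Std. Dev." column
    return z_avg, z_std
```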

Benchmark Emojis: 🥇, 🥈, 🥉 indicate that the model ranked in the top 3 for that specific benchmark. ⭐️ indicates the model is in the top 15% for that specific benchmark (excluding the top 3 medalists).
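
One plausible reading of that badge rule is sketched below (hypothetical helper; the exact rounding and tie-breaking used by the leaderboard are not specified): the top 3 models on a benchmark receive medals, and the remaining models within the top 15% receive a star.

```python
def badges(benchmark_scores):
    """Assign 🥇/🥈/🥉 to the top 3 and ⭐️ to the rest of the top 15% (illustrative)."""
    ranked = sorted(benchmark_scores, key=benchmark_scores.get, reverse=True)
    cutoff = max(3, round(0.15 * len(ranked)))  # assumed: 15% of models evaluated on this benchmark
    result = dict(zip(ranked[:3], ["🥇", "🥈", "🥉"]))
    result.update({model: "⭐️" for model in ranked[3:cutoff]})
    return result
```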

About the Benchmarks:

| Benchmark | Description | Strength | Weakness |
| --- | --- | --- | --- |
| Competitive Coding (Livebench) | Tests LLMs on their ability to generate complete code solutions for competitive programming problems (e.g., LeetCode) and to complete partially provided code solutions. | Resistant to contamination, with objective, frequently updated questions. | Limited to specific coding tasks; does not cover broader software development aspects. |
| AI-Assisted Code (Aider) | Focuses on AI-assisted coding, measuring how well LLMs can work with existing codebases, understand context, and make useful modifications. | Tests practical utility in existing projects. | Depends heavily on the quality and style of the initial codebase. |
| Code Acceptance (ProLLM) | Measures the rate at which code generated by LLMs is accepted by professional developers or automated checks in a simulated professional workflow. | Reflects practical, real-world acceptance criteria. | Acceptance can be subjective or influenced by specific project guidelines. |
| Web Development (WebDev Arena) | Assesses LLMs on tasks related to web development, including HTML, CSS, and JavaScript generation and debugging. | Specific to a common and important domain of coding. | May not be representative of performance in other coding domains, such as systems programming or data science. |
| Coding Interview (CanAiCode) | Tests a wide range of coding capabilities, from simple algorithmic tasks to more complex problem-solving. | Self-evaluating tests across multiple languages (Python, JavaScript) with controlled sandbox environments. | Currently focuses on junior-level coding tasks and has a smaller test suite (12 tests) than other benchmarks. |