Top AI Models: Best LLMs for Coding

Updated: July 20, 2025

This leaderboard aggregates performance data on various coding tasks from several major coding benchmarks: Livebench, Aider, ProLLM Acceptance, WebDev Arena, and CanAiCode. Models are ranked using Z-score normalization, which standardizes scores across different benchmarks with varying scales. The final ranking represents a balanced view of each model's overall coding capabilities, with higher Z-scores indicating better performance relative to other models.

| Rank | Model Name | Competitive Coding | AI-Assisted Code | Code Acceptance | Web Development | Coding Interview | Std. Dev. | Z-Score Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 🥇 Gemini 2.5 Pro | 73.90 | 0.79 ⭐️ | - | 1423.33 🥇 | - | 0.43 | 1.35 |
| 2 | 🥈 ChatGPT o3 Pro | 76.78 ⭐️ | 0.85 🥇 | - | - | - | 0.34 | 1.31 |
| 3 | 🥉 ChatGPT o3 High | 76.71 ⭐️ | 0.81 🥈 | - | - | - | 0.27 | 1.23 |
| 4 | Claude 4 Opus | 73.58 | 0.72 ⭐️ | - | 1403.61 🥉 | 147 | 0.40 | 1.07 |
| 5 | DeepSeek R1 | 76.07 | 0.71 | 0.96 | 1407.45 🥈 | 148 🥉 | 0.38 | 1.04 |
| 6 | Claude Sonnet 4 | 78.25 🥉 | 0.61 | 0.98 ⭐️ | 1378.38 ⭐️ | - | 0.33 | 1.04 |
| 7 | Claude 3.7 Sonnet | 74.28 | 0.65 | 0.97 | 1357.1 ⭐️ | - | 0.27 | 0.95 |
| 8 | O1 Preview | - | - | 0.98 🥉 | - | - | - | 0.90 |
| 9 | ChatGPT o3 | 76.71 ⭐️ | 0.77 ⭐️ | - | 1188.66 | 146 | 0.37 | 0.80 |
| 10 | Grok 4 | 71.34 | 0.80 🥉 | - | 1163.08 | - | 0.52 | 0.75 |
| 11 | Claude 3 Opus | 73.58 | - | - | - | 147 | 0.05 | 0.73 |
| 12 | abab7 | - | - | 0.96 | - | - | - | 0.69 |
| 13 | ChatGPT 4.1 | 73.19 | 0.52 | 0.99 🥇 | 1254.73 | 147 | 0.22 | 0.69 |
| 14 | Claude 3.5 Sonnet | 73.90 | 0.64 | 0.94 | 1237.7 | 146 | 0.12 | 0.66 |
| 15 | Kimi K2 | 71.78 | 0.59 | - | - | - | 0.05 | 0.62 |
| 16 | ChatGPT 4.5 Preview | 76.07 | 0.45 | 0.97 | - | - | 0.43 | 0.58 |
| 17 | Llama 3.1 Nemotron 70B | - | - | 0.95 | - | - | - | 0.56 |
| 18 | ChatGPT o3 Mini | 77.86 ⭐️ | 0.60 | 0.97 ⭐️ | 1091.69 | - | 0.50 | 0.56 |
| 19 | ChatGPT o4 Mini | 79.98 🥇 | 0.72 ⭐️ | 0.89 | 1101.36 | 147 | 0.57 | 0.55 |
| 20 | GPT-4 Turbo | - | - | 0.94 | - | - | - | 0.54 |
| 21 | Qwen 3 235B | 66.41 | 0.60 | - | 1181.7 | 148 🥈 | 0.20 | 0.50 |
| 22 | DeepSeek V3.1 | 68.91 | - | - | - | - | - | 0.49 |
| 23 | DeepSeek V3 | 68.91 | 0.55 | 0.98 ⭐️ | 1206.69 | 142 | 0.21 | 0.48 |
| 24 | ChatGPT o1 | - | 0.62 | 0.98 ⭐️ | 1044.88 | 148 🥇 | 0.58 | 0.43 |
| 25 | ChatGPT 4.1 Mini | 72.11 | 0.32 | 0.98 🥈 | 1187.59 | 147 | 0.52 | 0.42 |
| 26 | Gemini 2.5 Flash | 63.53 | 0.55 | 0.89 | 1298.81 | - | 0.39 | 0.41 |
| 27 | MiniMax-Text-01 | - | - | 0.93 | - | - | - | 0.39 |
| 28 | Quasar Alpha | - | 0.55 | - | - | - | - | 0.38 |
| 29 | Grok 3 Beta | 73.58 | 0.53 | 0.90 | 1142.64 | - | 0.28 | 0.33 |
| 30 | Optimus Alpha | - | 0.53 | - | - | - | - | 0.31 |
| 31 | ChatGPT 4o Mini | 79.98 🥈 | 0.72 | 0.89 | 1101.36 | 134 | 0.71 | 0.30 |
| 32 | Qwen 3 32B | 64.24 | 0.40 | 0.90 | - | 148 ⭐️ | 0.37 | 0.20 |
| 33 | Mistral Medium 3 | 61.48 | - | - | 1175.57 | - | 0.12 | 0.14 |
| 34 | Phi 4 | 60.59 | - | - | - | 143 | 0.17 | 0.14 |
| 35 | ChatGPT o1 Mini | - | 0.33 | 0.97 | 1041.83 | 148 ⭐️ | 0.68 | 0.12 |
| 36 | ChatGPT 4o | 77.48 ⭐️ | 0.27 | 0.96 | 964 | 146 | 0.85 | 0.09 |
| 37 | Mistral Large | 62.89 | - | 0.88 | - | 142 | 0.13 | 0.07 |
| 38 | Qwen 2.5 72B 🏆 MID AWARD | - | - | 0.89 | - | - | - | 0.04 |
| 39 | Gemini 2.5 Flash Lite | 59.25 | - | - | - | - | - | -0.11 |
| 40 | Grok 3 Mini Beta | 54.52 | 0.49 | - | - | - | 0.28 | -0.12 |
| 41 | ChatGPT 4.1 Nano | 63.92 | 0.09 | 0.95 | - | 141 | 0.82 | -0.16 |
| 42 | QwQ 32B Preview | - | - | 0.86 | - | - | - | -0.27 |
| 43 | Gemini 2.0 Pro | - | 0.36 | - | 1088.84 | - | 0.06 | -0.36 |
| 44 | Llama 3.3 70B | - | - | 0.85 | - | - | - | -0.38 |
| 45 | Llama 4 Maverick | 54.19 | 0.16 | 0.92 | 1026.87 | 140 | 0.56 | -0.40 |
| 46 | QwQ 32B | - | 0.26 | 0.87 | - | - | 0.31 | -0.50 |
| 47 | Gemini 1.5 Pro | - | - | 0.94 | 892.48 | - | 1.04 | -0.50 |
| 48 | Grok 2 | - | - | 0.83 | - | - | - | -0.50 |
| 49 | Llama 4 Scout 17B | 54.19 | - | 0.85 | 900.17 | 142 | 0.62 | -0.51 |
| 50 | Qwen 2.5 Coder 32B | 57.26 | 0.16 | 0.90 | 902.26 | 142 | 0.69 | -0.53 |
| 51 | Qwen Max | 66.79 | 0.22 | - | 974.97 | - | 0.64 | -0.55 |
| 52 | Mistral Small | 49.65 | - | 0.84 | - | - | 0.10 | -0.60 |
| 53 | Gemini 2.0 Flash | 59.31 | 0.22 | 0.94 | 1039.88 | 120 | 0.81 | -0.61 |
| 54 | Hunyuan Turbos | 50.35 | - | - | - | - | - | -0.66 |
| 55 | Jamba 1.5 Large | - | - | 0.82 | - | - | - | -0.68 |
| 56 | Gemma 3 27B | 48.94 | 0.05 | 0.91 | - | 133 | 0.67 | -0.73 |
| 57 | Qwen 3 30B | 47.47 | - | - | - | - | - | -0.84 |
| 58 | DeepSeek R1 Distill Qwen 32B | 47.03 | - | - | - | - | - | -0.87 |
| 59 | Claude 3.5 Haiku | 53.17 | 0.28 | 0.88 | 1132.99 | 107 | 1.13 | -0.89 |
| 60 | DeepSeek R1 Distill Llama 70B | 46.65 | - | - | - | - | - | -0.89 |
| 61 | Gemini 2.0 Flash Thinking | - | 0.18 | - | 1029.57 | - | 0.24 | -0.91 |
| 62 | Command A | 54.26 | 0.12 | - | - | - | 0.50 | -0.91 |
| 63 | Nova Pro | - | - | 0.77 | - | - | - | -1.12 |
| 64 | Codestral | - | 0.11 | - | - | 131 | 0.31 | -1.14 |
| 65 | Gemma 3 12B | 42.16 | - | - | - | - | - | -1.17 |
| 66 | Llama 3.1 8B | - | - | - | - | 126 | - | -1.30 |
| 67 | Yi Lightning | - | 0.13 | - | - | - | - | -1.37 |
| 68 | Mistral Nemo | 62.89 | - | 0.48 | 1167 | 118 | 1.72 | -1.42 |
| 69 | Llama 3.1 405B | - | - | 0.80 | 809.67 | - | 0.61 | -1.46 |
| 70 | OpenHands LM 32B | - | 0.10 | - | - | - | - | -1.48 |
| 71 | Gemma 2 27B | - | - | 0.72 | - | - | - | -1.57 |
| 72 | Gemma 2 9B | - | - | 0.71 | - | - | - | -1.66 |
| 73 | Gemma 3n E4B | 31.48 | - | - | - | - | - | -1.83 |
| 74 | Command R Plus | 27.13 | - | - | - | - | - | -2.10 |
| 75 | Command R | 26.10 | - | - | - | - | - | -2.16 |
| 76 | Gemma 3n E2B | 16.44 | - | - | - | - | - | -2.76 |
| 77 | Gemma 3 4B | 15.68 | - | - | - | - | - | -2.81 |
| 78 | GPT-3.5 Turbo | - | - | 0.58 | - | - | - | -2.98 |

* Scores are aggregated from various benchmarks using Z-score normalization. Missing values are excluded from the average calculation.
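
To make that concrete, here is a minimal sketch (in Python; not the leaderboard's actual code) of the normalization step: each benchmark's raw scores are converted to z-scores using only the models evaluated on that benchmark, so scales as different as Arena-style ratings and 0-1 acceptance rates become comparable. The `benchmark_z_scores` helper and the `raw_scores` structure are assumptions made for illustration.

```python
from statistics import mean, pstdev

def benchmark_z_scores(raw_scores):
    """Convert one benchmark's raw scores into z-scores (illustrative helper).

    Models not evaluated on this benchmark are marked None and skipped,
    matching the note above that missing values are excluded.
    """
    evaluated = {model: score for model, score in raw_scores.items() if score is not None}
    mu = mean(evaluated.values())
    sigma = pstdev(evaluated.values())
    return {model: (score - mu) / sigma for model, score in evaluated.items()}

# Tiny made-up example: two models evaluated, one missing.
print(benchmark_z_scores({"Model A": 79.98, "Model B": 73.90, "Model C": None}))
```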

Understanding the Leaderboard

Z-Score Avg: This shows how well a model performs across the benchmarks it was evaluated on, compared to other models. A positive score means the model performs better than average, while a negative score means it performs below average. Think of it as a standardized "overall performance score."

Std. Dev. (Standard Deviation): This measures how consistent a model's performance is across the different benchmarks. A low value means the model performs similarly across benchmarks: it is consistently good (or consistently average) at everything. A high value means its performance varies: it might be exceptional at some tasks but struggle with others.
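
The sketch below (illustrative Python, continuing the assumptions above) shows how both columns could be derived for one model: the Z-Score Avg is the mean of the z-scores from the benchmarks the model appears in, and the Std. Dev. is the spread of those same z-scores. The `per_benchmark` mapping, benchmark name to {model: z-score}, is a hypothetical structure used only for this example.

```python
from statistics import mean, pstdev

def model_summary(model, per_benchmark):
    """Return (z_score_avg, std_dev) for one model (illustrative helper).

    Only benchmarks the model was actually evaluated on contribute; with a
    single benchmark the spread is undefined, shown as "-" in the table.
    """
    zs = [scores[model] for scores in per_benchmark.values() if model in scores]
    z_avg = mean(zs)                              # "Z-Score Avg" column
    z_std = pstdev(zs) if len(zs) > 1 else None   # "Std. Dev." column
    return z_avg, z_std
```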

Benchmark Emojis: 🥇, 🥈, 🥉 indicate that the model ranked in the top 3 for that specific benchmark. ⭐️ indicates the model is in the top 15% for that specific benchmark (excluding the top 3 medalists).
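
One plausible reading of that badge rule is sketched below (hypothetical helper; the exact rounding and tie-breaking used by the leaderboard are not specified): the top 3 models on a benchmark receive medals, and the remaining models within the top 15% receive a star.

```python
def badges(benchmark_scores):
    """Assign 🥇/🥈/🥉 to the top 3 and ⭐️ to the rest of the top 15% (illustrative)."""
    ranked = sorted(benchmark_scores, key=benchmark_scores.get, reverse=True)
    cutoff = max(3, round(0.15 * len(ranked)))  # assumed: 15% of models evaluated on this benchmark
    result = dict(zip(ranked[:3], ["🥇", "🥈", "🥉"]))
    result.update({model: "⭐️" for model in ranked[3:cutoff]})
    return result
```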

About the Benchmarks:

| Benchmark | Description | Strength | Weakness |
| --- | --- | --- | --- |
| Competitive Coding (Livebench) | Tests LLMs on their ability to generate complete code solutions for competitive programming problems (e.g., LeetCode) and to complete partially provided code solutions. | Resistant to contamination, with objective, frequently updated questions. | Limited to specific coding tasks; does not cover broader software development aspects. |
| AI-Assisted Code (Aider) | Focuses on AI-assisted coding, measuring how well LLMs can work with existing codebases, understand context, and make useful modifications. | Tests practical utility in existing projects. | Depends heavily on the quality and style of the initial codebase. |
| Code Acceptance (ProLLM) | Measures the rate at which code generated by LLMs is accepted by professional developers or automated checks in a simulated professional workflow. | Reflects practical, real-world acceptance criteria. | Acceptance can be subjective or influenced by specific project guidelines. |
| Web Development (WebDev Arena) | Assesses LLMs on tasks related to web development, including HTML, CSS, and JavaScript generation and debugging. | Specific to a common and important domain of coding. | May not be representative of performance in other coding domains, such as systems programming or data science. |
| Coding Interview (CanAiCode) | Tests a wide range of coding capabilities, from simple algorithmic tasks to more complex problem-solving. | Self-evaluating tests across multiple languages (Python, JavaScript) with controlled sandbox environments. | Currently focuses on junior-level coding tasks and has a smaller test suite (12 tests) than other benchmarks. |