LLMs for writing software are changing fast, and it's tough to stay updated. Many reviews lean heavily on benchmark scores, but those scores don't always show how useful the models are for actual daily work. As a software engineer, I'm sharing my view on what really works well, based on using these tools for real tasks, such as building this website.
So, let's forget the official test results for a moment. Trying these tools on actual projects often shows what they're good at and where they struggle, things that benchmarks just can't show you.
It's easy to get swayed by impressive benchmark numbers. However, these standardized tests can be misleading for several reasons: models can be tuned to the tests themselves, the tasks rarely resemble the messy, context-heavy work of a real codebase, and a high score says little about how a model feels to work with day to day.
My focus here is on practical experience: how these tools help or hinder a software engineer's daily work.
After extensive use across various projects, here's my current take on some of the leading LLMs for programming tasks. Please note that this area is incredibly dynamic and evolving, and these observations reflect my experience up to mid-2025.
When it comes to brainstorming user interfaces or thinking through user experience flows, Gemini 2.5 Pro has been a standout. Its ability to generate creative yet practical design suggestions is remarkable. It's not just about spitting out code; it has a better grasp of aesthetic principles and user-centric design considerations than its peers.
For example, when asked to propose a dashboard layout for a complex data analytics application, Gemini 2.5 Pro provided several distinct layout options, each with reasoning for component placement and suggested interactions. It went beyond simple wireframes, offering ideas that genuinely improved the initial designs.
For the core task of understanding existing code and figuring out tricky problems, Claude 3.7 Sonnet has recently become my go-to. It demonstrates a strong ability to grasp the conventions and patterns within a given codebase, making its suggestions more contextually relevant. This is particularly helpful when parachuting into an unfamiliar project.
ChatGPT (specifically 4o/4.1) remains a very strong contender here. It's adept at dissecting complex logic and offering solutions. The gap isn't huge, but I've found Sonnet to have a slight edge in consistently understanding the "why" behind existing code, not just the "what."
Consider this simplified scenario. I fed both models a moderately complex Python function with a subtle bug related to an edge case in data processing.
```python
# Original buggy function
def process_data(records):
    # ... assumes records is always a list of dicts
    # and each dict has a 'value' key
    processed_count = 0
    for record in records:
        if record['value'] > 10:  # Potential KeyError if 'value' missing
            # Perform some processing
            processed_count += 1
    return processed_count
```
Claude 3.7 Sonnet was quicker to identify the potential `KeyError` when a record lacked a `'value'` key, and it went further, suggesting a check on the type of `records` itself in case it was `None` or not a list at all, which aligns well with the defensive programming practices I try to follow. ChatGPT also found the `KeyError` but was less proactive about other potential input issues without more explicit prompting.
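For reference, a hardened version along the lines both models eventually converged on might look like this; the exact guards are my own sketch rather than either model's verbatim output:

```python
def process_data(records):
    """Count records whose 'value' exceeds 10, tolerating malformed input."""
    # Guard against None or a non-list input, as Sonnet suggested checking.
    if not isinstance(records, list):
        return 0

    processed_count = 0
    for record in records:
        # Skip entries that aren't dicts or lack a 'value' key,
        # rather than raising KeyError mid-loop.
        if isinstance(record, dict) and record.get('value', 0) > 10:
            # Perform some processing
            processed_count += 1
    return processed_count
```

Whether you silently skip bad records or raise a descriptive error is a design choice; the point is that the failure mode becomes deliberate rather than accidental.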
Raw response speed often matters, especially during rapid iteration or when you need a quick answer. In this department, ChatGPT 4o/4.1 generally feels the snappiest for most coding-related queries. Its ability to generate code and explanations quickly is a definite plus.
However, a new player looks very promising: Gemini 2.5 Flash. Google is touting it as a lighter, faster model. I haven't had enough hands-on time with it to give a full evaluation, but initial impressions suggest it could be a significant contender for speed-critical tasks. I'll be closely watching its development and practical performance.
While raw speed is one thing, the balance of speed and the quality of the output is often more important for coding. Claude 3.5 Sonnet (a slightly older version than 3.7, but still very relevant) hits a fantastic sweet spot here. It's noticeably faster than some larger models but still provides high-quality, useful code suggestions and explanations.
If you need good code reasonably quickly and aren't tackling the most complex theoretical problems, Claude 3.5 Sonnet often provides the best bang for your buck in time saved versus output utility.
Generating functional code is one thing; generating readable and maintainable code is another. ChatGPT 4o often excels at this. It tends to produce code that follows common style guides and is well-commented, making it easier for humans to understand and maintain later.
Gemini 2.5 Pro is also quite good in this regard, often producing clean and well-structured code. Interestingly, while Claude 3.7 Sonnet is excellent at problem-solving, its generated code can sometimes be more verbose or less idiomatic than I'd prefer, requiring a little more cleanup for optimal readability. This isn't a major issue, but it's a noticeable difference in my experience.
For example, when asking for a Python function to recursively find all files with a specific extension in a directory, ChatGPT 4o produced:
```python
import os

def find_files_by_extension(directory, extension):
    """
    Recursively finds all files with the given extension in a directory.

    Args:
        directory (str): The directory to search.
        extension (str): The file extension to look for (e.g., '.py').

    Returns:
        list: A list of paths to the found files.
    """
    found_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(extension):
                found_files.append(os.path.join(root, file))
    return found_files

# Example usage:
# my_project_py_files = find_files_by_extension("./my_project", ".py")
# print(f"Found {len(my_project_py_files)} Python files.")
```
The code is clear, includes a docstring, and uses standard library functions effectively. Gemini 2.5 Pro produced something very similar.
Refactoring existing code can be a delicate operation. You want to improve structure or performance without introducing new bugs. My early experiences suggest Gemini 2.5 Pro has a good aptitude for this. It understands the intent behind the code and can suggest meaningful refactors.
For instance, when given a repetitive block of code, Gemini 2.5 Pro was adept at suggesting how to extract it into a reusable function, including identifying appropriate parameters and return values. However, I need to conduct more extensive tests to form a definitive opinion, especially on larger, more complex refactoring tasks.
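As a hypothetical illustration of the kind of extraction it suggests (the function and field names here are mine, not Gemini's actual output):

```python
# Before: the same validation logic repeated for every field.
def validate_user(user):
    if 'email' not in user or not user['email'].strip():
        raise ValueError("email is required")
    if 'name' not in user or not user['name'].strip():
        raise ValueError("name is required")
    if 'role' not in user or not user['role'].strip():
        raise ValueError("role is required")

# After: the repetition extracted into a reusable helper,
# with the field name promoted to a parameter.
def require_field(user, field):
    if field not in user or not user[field].strip():
        raise ValueError(f"{field} is required")

def validate_user(user):  # redefined here purely for the before/after demo
    for field in ('email', 'name', 'role'):
        require_field(user, field)
```

The valuable part isn't the mechanical extraction itself but identifying which parts vary (the field name) and which stay fixed, which is exactly what a good refactor needs.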
Understanding context and nuance is an area where Gemini 2.5 Pro truly stands out. It often demonstrates a level of "common sense" and an ability to understand the finer points of a request that can be very impressive. It doesn't always follow instructions in an overly literal, robotic way if the context suggests a different interpretation.
A simple, non-coding example illustrates this: I asked it to draft an announcement for a new feature but to "write it as an April Fool's joke." Many AIs would explicitly state "April Fool's Edition!" or make the joke so obvious it loses its subtlety. Gemini 2.5 Pro, however, crafted a humorously exaggerated and subtly absurd message, capturing the spirit of an April Fool's joke without explicitly labelling it as such. This ability to read between the lines is also incredibly valuable in technical contexts, where implicit assumptions and unstated requirements are common.
Choosing the "best" LLM depends heavily on your needs and priorities.
If I had to pick an overall performer, especially when design and deep understanding are most important, I'd lean towards Gemini 2.5 Pro. Its design intuition is a significant bonus, and its grasp of context is top-tier. The main drawback is that it is slower than other options.
For a balance of strong coding performance and good speed, Claude 3.7 Sonnet (and its sibling Claude 3.5 Sonnet for pure coding speed-to-performance) is an excellent choice. It's a workhorse for problem-solving and understanding codebases.
ChatGPT 4o/4.1 remains a solid, reliable option, especially if consistent speed is your primary concern. Its performance is generally good enough for a wide range of programming tasks, and its ecosystem and widespread availability are also in its favour.
The wild card is Gemini 2.5 Flash. It could significantly alter these rankings if it delivers on its promise of speed while retaining a good portion of the Pro version's intelligence. I strongly recommend exploring it as it becomes more accessible, and I will certainly be putting it through its paces.
LLMs for coding are changing very quickly. The tools we have now are already making a big difference in how software is made, and they keep getting better. I've found that benchmarks can give you some ideas, but using these tools for your actual work is the best way to see what they can do.
In my opinion, we already have a lot of amazing LLMs for programming. These tools are incredible and can dramatically improve our productivity. I believe the main hurdle now often lies more in our communication skills and how well we engineer our prompts.
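To make that concrete, here's a made-up contrast for the buggy `process_data` function from earlier; the wording is mine, purely to illustrate the difference:

```text
Vague prompt:
  "Fix this function."

Engineered prompt:
  "This Python function should count records where 'value' > 10, but it
  raises KeyError on records missing that key. Make it tolerate missing
  keys and a None input, keep the signature unchanged, and add a docstring."
```

The second prompt gives the model the intent, the failure mode, and the constraints, and in my experience every model on this list performs noticeably better with that kind of framing.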
In the end, the right LLM for you is the one that works best with how you work, the kinds of projects you do, and what you like. I suggest you try out some of these top models yourself. What you learn from trying them will help you the most.