Best LLM for Programming: Software Engineer's Review (May 2025)

By Wei Ming T. on May 25, 2025

LLMs for writing software are changing fast, and it's tough to stay updated. Many reviews lean heavily on benchmark scores, but those scores don't always reflect how useful a model is in actual daily work. As a software engineer, I'm sharing my view on what really works well, based on using these tools for real tasks, such as building this website.

So, let's set the official test results aside for a moment. Trying these tools on actual projects often reveals what they're good at and where they struggle, things that benchmarks just can't show you.

Why Benchmarks Don't Always Reflect Reality

It's easy to get swayed by impressive benchmark numbers. However, these standardized tests can sometimes be misleading for several reasons:

  • Training on Evaluation Sets: Some models might inadvertently (or intentionally) be trained on parts of the evaluation sets. This is like a student knowing the exam questions beforehand, leading to inflated scores that don't represent true problem-solving ability on unseen tasks.
  • Limited Scope: Benchmarks often test narrow, specific capabilities. They might not assess how well an LLM integrates into a complex workflow, understands broad project context, or handles ambiguous requests, all of which are common scenarios in software engineering.
  • Lack of Practical Usage Patterns: How an LLM performs on a curated dataset can differ greatly from its performance when dealing with messy, incomplete, or poorly documented legacy code. Actual coding involves more than just generating pristine functions in isolation.
  • Overfitting to Specific Metrics: Models can be optimized to excel at the particular metrics a benchmark uses, even if those metrics don't perfectly align with what a human developer values (e.g., long-term maintainability vs. sheer lines of code produced).
  • Ignoring the "Human Factor": The interaction style, the ability to understand follow-up questions, and the general "collaborative feel" of an LLM are important for productivity but are hard to quantify in a benchmark.

My focus here is on practical experience: how these tools enhance or hinder a software engineer's daily grind.

The Contenders

After extensive use across various projects, here's my current take on some of the leading LLMs for programming tasks. Please note that this area is evolving incredibly fast; these observations reflect my experience as of mid-2025.

UI & UX Design: Gemini 2.5 Pro Shines

When it comes to brainstorming user interfaces or thinking through user experience flows, Gemini 2.5 Pro has been a standout. Its ability to generate creative and practical design suggestions is remarkable. It doesn't just spit out code; it has a better grasp of aesthetic principles and user-centric design considerations than the others.

For example, when asked to propose a dashboard layout for a complex data analytics application, Gemini 2.5 Pro provided several distinct options, along with reasoning for component placement and interaction suggestions. It went beyond simple wireframes, offering ideas that genuinely improved the initial designs.

Problem-Solving and Understanding Codebase Conventions: Claude 3.7 Sonnet Leads, ChatGPT Close Behind

For the core task of understanding existing code and figuring out tricky problems, Claude 3.7 Sonnet has recently become my go-to. It demonstrates a strong ability to grasp the conventions and patterns within a given codebase, making its suggestions more contextually relevant. This is particularly helpful when parachuting into an unfamiliar project.

ChatGPT (specifically 4o/4.1) remains a very strong contender here. It's adept at dissecting complex logic and offering solutions. The gap isn't huge, but I've found Sonnet to have a slight edge in consistently understanding the "why" behind existing code, not just the "what."

Consider this simplified scenario. I fed both models a moderately complex Python function with a subtle bug related to an edge case in data processing.

# Original buggy function
def process_data(records):
    # ... assumes records is always a list of dicts
    # and each dict has a 'value' key
    processed_count = 0
    for record in records:
        if record['value'] > 10: # Potential KeyError if 'value' missing
            # Perform some processing
            processed_count += 1
    return processed_count

Claude 3.7 Sonnet was quicker to identify the potential KeyError when a record lacked a 'value' key, and it also suggested validating the records argument itself, asking whether it could be None or something other than a list. That aligns well with the defensive programming practices I try to follow. ChatGPT also found the KeyError but was less proactive about other potential input issues without more explicit prompting.
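For illustration, here's a minimal sketch of the kind of defensive rewrite both models converged on after prompting. The specific guard clauses, and the choice to skip malformed records rather than fail, are my own assumptions rather than verbatim model output.

# Defensive version (illustrative sketch, not verbatim model output)
def process_data(records):
    # Reject inputs that aren't a list at all (e.g. None)
    if not isinstance(records, list):
        raise TypeError("records must be a list of dicts")

    processed_count = 0
    for record in records:
        # Skip entries that aren't dicts or are missing the 'value' key
        if not isinstance(record, dict) or 'value' not in record:
            continue
        if record['value'] > 10:
            # Perform some processing
            processed_count += 1
    return processed_count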

Speed: ChatGPT 4o/4.1 Leads for Now, but Watch Gemini 2.5 Flash

Raw response speed is often important, especially during rapid iteration or when you need a quick answer. In this department, ChatGPT 4o/4.1 generally feels the snappiest for most coding-related queries. Its ability to generate code and explanations quickly is a definite plus.

However, a new player looks very promising: Gemini 2.5 Flash. Google is touting it as a lighter, faster model. I haven't had enough hands-on time with it to give a full evaluation, but initial impressions suggest it could be a significant contender for speed-critical tasks. I'll be closely watching its development and practical performance.

Speed-to-Performance Sweet Spot for Coding: Claude 3.5 Sonnet

While raw speed is one thing, the balance of speed and the quality of the output is often more important for coding. Claude 3.5 Sonnet (a slightly older version than 3.7, but still very relevant) hits a fantastic sweet spot here. It's noticeably faster than some larger models but still provides high-quality, useful code suggestions and explanations.

If you need good code reasonably quickly and aren't tackling the absolute most complex theoretical problems, Claude 3.5 Sonnet often provides the best bang for your buck in terms of time saved versus output utility.

Code Readability & Maintainability: ChatGPT 4o and Gemini 2.5 Pro

Generating functional code is one thing; generating readable and maintainable code is another. ChatGPT 4o often excels at this. It tends to produce code that follows common style guides and is well-commented, making it easier for humans to understand and maintain later.

Gemini 2.5 Pro is also quite good in this regard, often producing clean and well-structured code. Interestingly, while Claude 3.7 Sonnet is excellent at problem-solving, its generated code can sometimes be more verbose or less idiomatic than I'd prefer, requiring a little more cleanup for optimal readability. This isn't a major issue, but it's a noticeable difference in my experience.

For example, when asking for a Python function to recursively find all files with a specific extension in a directory, ChatGPT 4o produced:

import os

def find_files_by_extension(directory, extension):
    """
    Recursively finds all files with the given extension in a directory.
    Args:
        directory (str): The directory to search.
        extension (str): The file extension to look for (e.g., '.py').
    Returns:
        list: A list of paths to the found files.
    """
    found_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(extension):
                found_files.append(os.path.join(root, file))
    return found_files

# Example usage:
# my_project_py_files = find_files_by_extension("./my_project", ".py")
# print(f"Found {len(my_project_py_files)} Python files.")

The code is clear, includes a docstring, and uses standard library functions effectively. Gemini 2.5 Pro produced something very similar.

Refactoring Code: Gemini 2.5 Pro Shows Promise

Refactoring existing code can be a delicate operation. You want to improve structure or performance without introducing new bugs. My early experiences suggest Gemini 2.5 Pro has a good aptitude for this. It understands the intent behind the code and can suggest meaningful refactors.

For instance, when given a repetitive block of code, Gemini 2.5 Pro was adept at suggesting how to extract it into a reusable function, including identifying appropriate parameters and return values. However, I need to conduct more extensive tests to form a definitive opinion, especially on larger, more complex refactoring tasks.
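As a rough illustration of the kind of extraction it suggests (the functions and the predicate-based helper below are hypothetical examples of mine, not the model's actual output):

# Repetitive original (hypothetical example)
def report_active_users(users):
    active = [u for u in users if u.get('active')]
    print(f"Active users: {len(active)}")

def report_admins(users):
    admins = [u for u in users if u.get('role') == 'admin']
    print(f"Admins: {len(admins)}")

# Suggested refactor: extract the shared filter-and-report pattern
def report_matching(users, label, predicate):
    matching = [u for u in users if predicate(u)]
    print(f"{label}: {len(matching)}")

# Example usage:
# report_matching(users, "Active users", lambda u: u.get('active'))
# report_matching(users, "Admins", lambda u: u.get('role') == 'admin')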

Understanding Context & Common Sense: Gemini 2.5 Pro's Subtlety

This is an area where Gemini 2.5 Pro truly stands out. It often demonstrates a level of "common sense" and an ability to understand the finer points of a request that can be very impressive. It doesn't always follow instructions in an overly literal, robotic way if the context suggests a different interpretation.

A simple, non-coding example illustrates this: I asked it to draft an announcement for a new feature but to "write it as an April Fool's joke." Many AIs would explicitly state "April Fool's Edition!" or make the joke so obvious it loses its subtlety. Gemini 2.5 Pro, however, crafted a humorously exaggerated and subtly absurd message, capturing the spirit of an April Fool's joke without explicitly labelling it as such. This ability to read between the lines is also incredibly valuable in technical contexts, where implicit assumptions and unstated requirements are common.

My Current Recommendations

Choosing the "best" LLM depends heavily on your needs and priorities.

If I had to pick an overall performer, especially when design and deep understanding are most important, I'd lean towards Gemini 2.5 Pro. Its design intuition is a significant bonus, and its grasp of context is top-tier. The main drawback is that it is slower than other options.

For a balance of strong coding performance and good speed, Claude 3.7 Sonnet (and its sibling Claude 3.5 Sonnet for pure coding speed-to-performance) is an excellent choice. It's a workhorse for problem-solving and understanding codebases.

ChatGPT 4o/4.1 remains a solid, reliable option, especially if consistent speed is your primary concern. Its performance is generally good enough for a wide range of programming tasks, and its ecosystem and widespread availability are also in its favour.

The wild card is Gemini 2.5 Flash. It could significantly alter these rankings if it delivers on its promise of speed while retaining a good portion of the Pro version's intelligence. I strongly recommend exploring it as it becomes more accessible, and I will certainly be putting it through its paces.

Conclusion

LLMs for coding are changing very quickly. The tools we have now are already making a big difference in how software is made, and they keep getting better. I've found that benchmarks can give you some ideas, but using these tools for your actual work is the best way to see what they can do.

In my opinion, we already have a lot of amazing LLMs for programming. These tools are incredible and can exponentially improve our productivity. I believe the main hurdle now often lies more in our communication skills and how well we engineer our prompts.

In the end, the right LLM for you is the one that works best with how you work, the kinds of projects you do, and what you like. I suggest you try out some of these top models yourself. What you learn from trying them will help you the most.
