GPT-5 Fails Coding Tests: Why I'm Sticking with GPT-4o Now
GPT-5 in Programming: An Honest and Disappointing Assessment
I tested GPT-5's programming capabilities, and the results were disappointing enough that I'm sticking with GPT-4o for now. In my evaluation, GPT-5 delivered a broken plugin, error-ridden scripts, and confidently incorrect answers, the kind of mistakes that could derail any project not under close human supervision. Here's what you need to know before relying on it.

GPT-5, OpenAI's new flagship model, failed half of the programming tests I conducted, while previous versions achieved near-perfect scores. Fortunately, OpenAI has since restored the option to switch back to its other large language models (LLMs), so reliable alternatives are available again.

GPT-5 is now available, and it's the talk of the tech world, but it arrives with some fundamental problems. The model failed half of the programming tests I designed, the worst performance ever by a flagship OpenAI model in my testing.

Read also: The best AI tools for programming in 2025 (and what to avoid).

Before diving into the details, let's pause on a curious new feature: an "Edit" button that appears above generated code. Clicking it opens an embedded code editor, where I directly changed the author field in ChatGPT's output. It seemed promising but offered no practical utility. When I closed the editor, it asked whether I wanted to save my changes, which I did, only to be met with an unhelpful error message. I couldn't return to my original session and had to resubmit my original request, letting GPT-5 perform the task a second time. But wait, there's more. Let's dig into my test results.
1. Writing a WordPress plugin

This was the first real test I ever used to evaluate an AI's programming ability. It's the same test that gave me that "the world is about to change" feeling when I first ran it against GPT-3.5, the older model that first showed off ChatGPT's strong programming capabilities.
Subsequent runs of the same prompt against different AI models yielded mixed results: some models did an excellent job, some flopped, and others, like those from Microsoft and Google, improved over time.
Read also: How I test an AI chatbot's coding ability – and you can too.
ChatGPT has been the gold standard for this test since its inception, which makes GPT-5's results all the more concerning. The actual coding experience with GPT-5 was only partially successful.

Initially, the model generated a single code block that I could paste into a file and run, and it produced the user interface the plugin needed. When I pasted in the test names, it dynamically updated the line count, although it used "line" for both singular and plural.

But when I tested the core functionality by clicking the "Randomize" button, the plugin failed to work as expected; instead, it incorrectly redirected me to the `tools.php` page. That's a significant regression: earlier models like GPT-3.5, GPT-4, and GPT-4o had no trouble with this fundamental WordPress development test. Watching GPT-5, ostensibly OpenAI's most advanced model, stumble at this first hurdle was profoundly frustrating.
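For what it's worth, the singular/plural nit is a one-line fix in WordPress, which ships a pluralization helper for exactly this situation. Here's a minimal sketch, not GPT-5's actual output (the `my-randomizer` text domain and the `$lines` variable are hypothetical):

```php
<?php
// Minimal sketch, not GPT-5's output: WordPress's _n() helper picks the
// singular or plural string based on the count, so "1 line" and "2 lines"
// both render correctly. 'my-randomizer' is a hypothetical text domain.
$count = count( $lines ); // $lines: the pasted names, split into an array
printf( _n( '%d line', '%d lines', $count, 'my-randomizer' ), $count );
```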
Then I gave GPT-5 this prompt: "When I click Randomize, I'm redirected to http://testsite.local/wp-admin/tools.php. I don't get a list of random results. Can you fix that?" The result was a single replacement line to fix the error, an approach I don't like because it forces the user to hunt through the code and swap in the line without making a mistake. So I asked for a complete plugin. This time, it gave me the full plugin code to copy and paste, and it worked: it randomized the lines and separated duplicates as requested. Finally.
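I can't show you GPT-5's broken code here, but in my experience, a redirect to a bare `tools.php` usually means the form submission never reached a registered handler. Here's a sketch of the standard WordPress pattern that avoids the problem; every function, field, and action name below is hypothetical:

```php
<?php
// Hypothetical names throughout; this is the standard pattern, not the
// plugin GPT-5 wrote. A form that posts to admin-post.php (with a hidden
// input: name="action" value="myplugin_randomize") needs a matching
// admin_post_{action} hook, or the user lands somewhere unexpected.
add_action( 'admin_post_myplugin_randomize', 'myplugin_handle_randomize' );

function myplugin_handle_randomize() {
    check_admin_referer( 'myplugin_randomize' );            // verify the nonce
    $raw   = wp_unslash( $_POST['myplugin_lines'] ?? '' );
    $lines = array_filter( array_map( 'trim', explode( "\n", $raw ) ) );
    shuffle( $lines );                                      // the actual randomizing
    set_transient( 'myplugin_result', $lines, MINUTE_IN_SECONDS );
    // Send the user back to the plugin's own Tools subpage, not bare tools.php.
    wp_safe_redirect( admin_url( 'tools.php?page=myplugin' ) );
    exit;
}
```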
Read also: Stop Collecting AI Tools, Start Making Them Work Together.

I'm sorry, OpenAI, but I have to fail you on this test. You would have passed if the only error had been using the singular instead of the plural, but delivering a non-working plugin on the first try is a failure, even if the second try fixed it. However you justify it, this is a step backward.
2. Rewriting a string function

GPT-5 performed well on this test. It delivered a simple, straightforward result because it didn't add any extra error checking, such as handling non-numeric input, excess whitespace, thousands separators, or currency symbols. But none of that was required by the prompt. I asked it to rewrite a specific function, and the original didn't include any error checking either. GPT-5 did exactly what was asked without unnecessary additions, which is the right call, since it had no way of knowing whether the surrounding code had already performed those checks.
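To illustrate the kind of extra error checking I'm talking about, here's a generic sketch, not the function from my test:

```php
<?php
// Generic sketch, not the function from my test: defensive parsing that
// strips whitespace, currency symbols, and thousands separators, and
// rejects non-numeric input instead of guessing at it.
function parse_dollar_amount( string $input ): ?float {
    $clean = trim( $input );                                // stray whitespace
    $clean = str_replace( array( '$', ',' ), '', $clean );  // "$1,234.56" -> "1234.56"
    return is_numeric( $clean ) ? (float) $clean : null;    // null on bad input
}
```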
GPT-5 passed this test.
3. Finding a complex programming bug
This test came from a personal experience: I was struggling with an obscure bug in my own code. Without getting into the weeds of how the WordPress framework operates, the obvious solution was not the correct one. Solving the problem requires specialized knowledge of how WordPress "filters" pass their information, a detail that has been a stumbling block for many large language models (the sketch below shows the mechanism in general terms).
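Here's a generic illustration of that mechanism, not my actual bug: a filter callback receives the value being filtered, must return it, and only gets the extra arguments that `add_filter()` explicitly declares.

```php
<?php
// Generic illustration of WordPress filters, not the bug from my test.
// The trailing '2' tells WordPress to pass two arguments to the callback;
// omit it, and $post_id never arrives (a classic gotcha).
add_filter( 'the_title', 'myplugin_tag_title', 10, 2 );

function myplugin_tag_title( $title, $post_id ) {
    return '[' . $post_id . '] ' . $title; // always return the filtered value
}
```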
Important: Disappointment with generative AI looms, according to the Gartner Hype Cycle, a model that tracks the maturity of new technologies from the "peak of inflated expectations" through the "trough of disillusionment" to the "slope of enlightenment" and, finally, the "plateau of productivity."

Nevertheless, GPT-5, just like GPT-4 and GPT-4o before it, understood the problem and provided a clear and correct solution.
GPT-5 passed this test.
4. Writing a script
This test asks the AI to integrate Keyboard Maestro, a powerful but relatively obscure macOS automation tool, with Apple's AppleScript and Chrome's scripting interface. It's a true measure of an AI's breadth of knowledge, its understanding of how web pages are built, and its ability to write working code across three interconnected environments.
A significant number of AI models have failed this test, with the failure point usually being a lack of knowledge of Keyboard Maestro. GPT-3.5 knew nothing about it, but ChatGPT has been successfully passing this test since the release of GPT-4. Until now.
Where do we start? The good news is that GPT-5 handled the Keyboard Maestro part of the problem well. But it catastrophically botched the AppleScript, inventing a non-existent property: a classic case of an AI confidently giving a completely wrong answer.
Read also: ChatGPT Now Comes with Character Presets – And Other Upgrades You Might Have Missed.
AppleScript is case-insensitive by default; to make a comparison case-sensitive, you must wrap it in a `considering case` block, and GPT-5's failure to do so is what triggered the error message I saw. That message referred to the title of one of my articles because my article happened to be the front window in Chrome, and the script checks the front window and acts based on its title. Mishandling case sensitivity wasn't the only mistake in GPT-5's generated AppleScript, either: the code also referenced a variable named `searchTerm` without ever defining it, a practice that produces errors in almost any programming language.
Fail, fail, fail.
The Internet Has Spoken
OpenAI seemed to suffer from the same arrogance its AI sometimes exhibits. It confidently migrated all users to GPT-5 and removed the ability to revert to GPT-4o. I pay $20 a month for my ChatGPT Plus account, and on Friday, I couldn't switch back to GPT-4o for programming tasks, nor could anyone else.
However, this decision sparked a strong reaction from users. And by "strong," I mean the entire internet. So, by Saturday, OpenAI had added a new option to ChatGPT. To access it, go to ChatGPT's settings and enable the "Show Older Models" option. After that, you can simply open the model menu and choose the model you want. Note: this option is only available to paid subscribers. If you're using ChatGPT for free, you get what you're given.

Since this generative AI craze began in early 2023, ChatGPT has been the gold standard for programming tools, at least according to my tests.
You might also be interested in: Microsoft Copilot for Gaming: Windows 11's New AI Assistant and Performance Concerns.
Now? I'm not really sure. It's only been one day since GPT-5 was released, and its results will likely improve over time. But for now, I'll stick with GPT-4o for programming, although I appreciate GPT-5's deep thinking capabilities. What about you? Have you tried GPT-5 for programming tasks yet? Did it perform better or worse than previous versions like GPT-4o? Were you able to get working code on the first try, or did you have to guide it toward fixes? Will you use GPT-5 for programming or stick with older models? Share your thoughts in the comments below.

You can follow my daily project updates on social media. Be sure to subscribe to my weekly newsletter, and follow me on X/Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.