
GPT-5 Fails Coding Tests: A Major Step Backwards

2025-08-11 · David Gewirtz · 4 minute read
GPT-5
AI Programming
OpenAI

The tech world is buzzing over the release of GPT-5, but the initial excitement has been dampened by some serious performance issues. I'm not going to bury the lede: GPT-5 failed half of my standardized programming tests. This is the worst performance I've ever recorded from an OpenAI flagship model.

Here are the key takeaways from my initial testing:

  • OpenAI's new GPT-5 flagship failed half of the programming challenges.
  • Previous OpenAI models, like GPT-4o, have delivered nearly perfect results on these same tests.
  • Following user feedback, OpenAI has enabled fallbacks, allowing users to revert to older, more reliable models.

Before diving into the test results, it's worth noting a buggy new feature: the 'Edit' button for code blocks. While it opens a neat little code editor, saving the changes proved futile, leading to an unhelpful error message and a lost session. A frustrating start.

[Screenshot: the new 'Edit' button on a code block]

But let's get to the core of the problem: the test results.

Test 1: A Flawed WordPress Plugin

This is my classic test, the one that first demonstrated the world-changing potential of AI coding with GPT-3.5. ChatGPT models have always aced this test, making GPT-5's results particularly jarring.

Initially, GPT-5 produced a single block of code that I could install. The user interface appeared correctly, but the core functionality was broken. Clicking the 'Randomize' button didn't randomize the list; it simply redirected the page. This is a fundamental failure on a test that GPT-3.5, GPT-4, and GPT-4o have all passed effortlessly.

[Screenshot: the generated WordPress plugin's interface]

I prompted GPT-5 to fix the issue:

When I click randomize, I'm taken to http://testsite.local/wp-admin/tools.php. I do not get a list of randomized results. Can you fix?

It responded with a patch, asking me to replace a single line of code. This is a poor solution, as it requires manual intervention and is prone to user error. I requested a full, working plugin instead. The second attempt worked, correctly randomizing the lines as requested.

However, a model should not fail so spectacularly on the first try. This is a clear step back and a fail for this test.

Test 2: Rewriting a String Function

This second test requires the AI to rewrite a simple string function to correctly validate dollars and cents, not just whole numbers. GPT-5 handled this task perfectly. It did exactly what was asked without adding unnecessary embellishments like error checking, which may have already been handled by other parts of the code. This was a clean pass.
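The test's actual code isn't published, and it lives in a different codebase, but a minimal Python sketch of the kind of validation being asked for, where a function that accepted only whole numbers must also accept dollars-and-cents amounts, might look like this (the function name and regex are illustrative assumptions, not the test's real code):

```python
import re

# Hypothetical sketch: accept whole dollars ("12") or dollars and cents
# ("12.34"), with an optional leading "$". Exactly two decimal places are
# required when a decimal point is present.
_AMOUNT_RE = re.compile(r"^\$?\d+(\.\d{2})?$")

def is_valid_amount(text: str) -> bool:
    """Return True if text is a whole-dollar or dollars-and-cents amount."""
    return bool(_AMOUNT_RE.match(text.strip()))
```

Note that, like GPT-5's passing answer, this sketch deliberately does only what was asked: it adds no extra error handling, on the assumption that the surrounding code already covers it.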

Test 3: Successfully Finding an Annoying Bug

My third test involves identifying a non-obvious bug within the WordPress framework. Solving it requires some obscure knowledge about how WordPress filters work, a challenge that has stumped many AI models. Like GPT-4 and GPT-4o before it, GPT-5 correctly understood the problem and articulated a clear, effective solution. This was another pass.

Test 4: A Complete Scripting Failure

This final test is the most complex, requiring the AI to integrate knowledge of a niche Mac tool called Keyboard Maestro, AppleScript, and Chrome scripting. While GPT-4 mastered this test, GPT-5 failed miserably.

It handled the Keyboard Maestro part correctly but got the AppleScript code completely wrong. It confidently invented a property that doesn't exist and fundamentally misunderstood how AppleScript handles case sensitivity. AppleScript is natively case-insensitive, a fact GPT-5 ignored. Furthermore, it referenced an undefined variable, a basic programming error. This was an undeniable, multi-faceted failure.

[Screenshot: GPT-5's flawed AppleScript output]

Public Backlash and a Quick Reversal

OpenAI initially moved all users to GPT-5, effectively burning the bridges back to the more stable GPT-4o. The user backlash was swift and overwhelming. Within a day, OpenAI reversed course and added an option in the settings for paid users to "Show legacy models," allowing a return to GPT-4o.

[Screenshot: the "Show legacy models" setting]

Final Thoughts: Sticking with GPT-4o for Now

ChatGPT has long been the gold standard for AI-assisted programming in my tests. This release has shaken that confidence. While GPT-5 may improve over time, its current performance on coding tasks is a significant regression. For now, I'm sticking with the reliable and proven capabilities of GPT-4o for any serious development work, even though I appreciate the deeper reasoning of GPT-5 for other tasks.

What has your experience been? Have you tried GPT-5 for programming? Did it perform better or worse than previous versions for you? Let us know your thoughts in the comments below.
