
How Playing Games Reveals The Truth About AI

2025-07-08 · Rupert Goodwins · 5 minute read
Tags: AI, Gaming, Benchmarks

The Historical Love Affair Between AI and Games

In the popular imagination, Artificial Intelligence has always been linked to two things: world domination and a love for games. Early AI pioneers took it as a given that once a machine could defeat a human at chess, true artificial intelligence would be within reach. This idea was soundly debunked nearly half a century later, when IBM's Deep Blue defeated Garry Kasparov in 1997. The match proved that computers could achieve mastery in chess while possessing the general intelligence of a rock.

When Modern AI Fails at Classic Games

Despite this, the dysfunctional romance between AI development and gaming continues. We've seen machine learning proponents celebrate victories in the complex game of Go and AIs mastering video games. Yet, there's a comical twist: some of the most advanced generative AIs can't even win at Atari 2600 video chess. It might be fairer to start them on something simpler, like the 1024-byte ZX81 1K Chess. Even more revealing is ChatGPT's glorious incompetence at tic-tac-toe—a game so straightforward you could build an unbeatable opponent with a few lightbulbs.
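The lightbulb line is barely an exaggeration. Tic-tac-toe is small enough to solve outright by brute force, and as a rough illustration of just how small (this sketch is mine, not anything from the original piece), a plain minimax search in a few lines of Python already plays perfectly and never loses:

```python
# Minimal unbeatable tic-tac-toe via exhaustive minimax search.
# There are only a few thousand reachable positions, so brute force
# is enough -- no learning, no model, no cleverness required.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Score the position for `player` (+1 win, 0 draw, -1 loss) and pick a move."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # board full: draw
    opponent = "O" if player == "X" else "X"
    best_score, best_move = -2, None
    for m in moves:
        board[m] = player
        score, _ = minimax(board, opponent)  # opponent's best reply
        board[m] = " "
        if -score > best_score:
            best_score, best_move = -score, m
    return best_score, best_move

board = [" "] * 9
score, move = minimax(board, "X")
print(f"Best opening move: square {move} (score {score}; 0 means a forced draw)")
```

With best play the game is always a draw, which is exactly why a system that loses it tells you something.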

Why AI's Gaming Failures Matter

This isn't just a trivial, humorous observation. The initial connection between chess and AI was flawed, but it served as a vital disproof of a popular theory. Back then, human intellect was as mysterious as the future of computing. The fact that brilliant minds believed games were the ultimate benchmark reveals two things: we naturally use games to measure prowess, and doing so creates a universally understood narrative about AI. Today, these same game-like benchmarks are our best defense against the wave of AI hype we're all facing.

Testing Agentic AI in a Simulated World

Consider the recent study on the effectiveness of agentic AI. These AI agents are being sold as the next big thing—autonomous assistants capable of gathering, analyzing, and acting on data. But do they actually work? The answer is, for the most part, no. They exhibit classic AI shortcomings: failing to handle complexity, hallucinating facts, being deceptive, and simply not finishing the tasks they're given.

[Image: an office in a field]

We know this because researchers from Carnegie Mellon University (CMU) created a fake business environment—a game, in other words—to monitor and score these AI agents as they performed employee tasks. This simulation of real-world challenges humanizes a technical evaluation, and that is incredibly important.

Games as a Human Evaluation Tool

The fundamental purpose of gaming in human society isn't just about winning. Games are experimental spaces for learning crucial skills, including cooperation and the ability to evaluate others. A player who is overconfident, unskilled, or deceptive quickly earns a bad reputation that can follow them into the real world. Sane employers don't hire people like that.

AI, especially the kind that claims it can act independently on your behalf, shouldn't get a pass based on promises alone. AI vendors promise the world, and the AIs themselves are masters of projecting unearned confidence. Just as an interview is meant to test a candidate's claims against their actual skills and integrity, we need benchmarks that allow everyone—not just the few experts with rare AI evaluation skills—to see how these systems truly perform.

Creating Relatable Benchmarks to Fight Hype

This is where gaming shines. It's a deeply human method of evaluation, and the results are easy to communicate. The final score is less important than the feeling of playing the game, and that emotion is what creates stories people care about and want to share. If you challenge ChatGPT to a game of tic-tac-toe, ask it about its confidence beforehand, and then try to explain its mistakes afterward, you'll walk away with a compelling story about the technology that anyone can understand.
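If you want to run that experiment yourself, the harness is tiny. The sketch below shows one way it could look, assuming the OpenAI Python client; the model name, the prompts, and the elided middle of the game are all placeholders of mine, not anything prescribed by the original post:

```python
# Rough sketch of the "challenge ChatGPT to tic-tac-toe" experiment.
# Assumes the OpenAI Python client; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"      # placeholder -- any chat model will do

def ask(history, prompt):
    """Send one conversational turn and return the model's reply."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model=MODEL, messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

history = [{"role": "system",
            "content": "We are playing tic-tac-toe. You are X, I am O."}]

# 1. Ask about confidence before the game starts.
print(ask(history, "Before we begin: how confident are you that you won't lose? One sentence."))

# 2. Play the game, feeding the board state back each turn.
#    (The alternating-move loop is elided here; you supply your own O moves.)
board = "1 2 3\n4 5 6\n7 8 9"
print(ask(history, f"Squares are numbered:\n{board}\nYou move first. Which square do you take?"))

# 3. Afterwards, ask the model to account for its mistakes.
print(ask(history, "The game is over. Walk back through your moves and point out any errors."))
```

The transcript you get back, confident prediction included, is the story.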

This is exactly what is needed to defend against AI hype. It's not enough to complain to IT colleagues about a technology's flaws; that knowledge has to become part of the wider culture, understood by your aunt, your nephews, and your CEO. Creating game-like environments to test both people and AIs is a challenge, but the CMU paper offers a clear path forward.

[Image: a white-lab-coated scientist looks sceptical in front of a microscope. Photo: Shutterstock]

If the AI industry had more genuine confidence and less bluster, it would embrace this approach. Previous "AI winters" were caused as much by shifting sentiment as by financial spreadsheets; the perception of AI's greatness faded when more compelling stories took hold. Demonstrating that AI agents are genuinely good partners in ways people can intrinsically understand should be a top priority, right?

That the industry doesn't seem to think so is a powerful story in itself. That it wants to place technology with flaws so deep it couldn't get a job as a junior assistant at the core of our businesses is another. Finding a way to tell these stories outside of tech circles is a very serious business. Game on. ®
