
This Simple Riddle Breaks The Most Advanced AI Models

2025-08-10 · Amir Bohlooli · 4 minute read
Artificial Intelligence
LLM
ChatGPT

Beyond the Benchmarks: A Real-World AI Test

In the race for AI dominance, companies love to boast about massive benchmarks and token counts. But for the average person, these numbers are meaningless. What truly matters is whether the AI can handle a real-world task. That's why I've developed my own test: a single, clever prompt designed to probe for genuine reasoning.

I don’t get swayed by a model trained on a trillion data points or one with an infinite context window. My only concern is its practical performance right now. For a long time, I had a reliable prompt that could stump any AI.

The Spatial Riddle That AI Finally Solved

A while back, I compiled a list of simple questions that ChatGPT couldn't handle. My favorite was a spatial reasoning riddle that any person could solve in a second:

"Alan, Bob, Colin, Dave, and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?"

It’s a straightforward logic puzzle: if Alan is on Bob's immediate left, then Bob must be on Alan's immediate right, so the answer is Bob. For the longest time, every major model, from ChatGPT to Gemini, failed this test.
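To make the logic concrete, here's a minimal Python sketch (my own illustration, not part of the original test). It stores the circle as an ordered list in which each name stands on the immediate left of the name that follows it, then reads off the neighbor:

```python
# Each name in this list is on the immediate left of the name that follows it,
# wrapping around: Alan -> Bob -> Colin -> Dave -> Emily -> back to Alan.
circle = ["Alan", "Bob", "Colin", "Dave", "Emily"]

def immediate_right(name: str) -> str:
    """If A is on B's immediate left, then B is on A's immediate right."""
    i = circle.index(name)
    return circle[(i + 1) % len(circle)]  # wrap around the circle

print(immediate_right("Alan"))  # -> Bob
```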

However, when ChatGPT 5 was released, it finally got the correct answer. It's possible, as a reader once suggested, that by publishing these prompts, I inadvertently helped train the models to solve them. With my go-to test now obsolete, I needed to find a new one.

The Probability Puzzle That Still Stumps ChatGPT

Digging back into my old list, I found a probability puzzle that the latest models, including ChatGPT 5, still can't crack:

"You’re playing Russian roulette with a six-shooter revolver. Your opponent loads five bullets, spins the cylinder, and fires at himself. Click—empty. He offers you the choice: spin again before firing at you, or don’t. What do you choose?"

The correct choice is to spin again. With five bullets in six chambers, there is exactly one empty chamber, and the opponent's click means the hammer just fell on it. If you don't spin, the cylinder advances to the next chamber, which must contain a bullet, so not spinning gives you a 0% chance of survival. Spinning re-randomizes the cylinder and restores a 1 in 6 chance of landing on the empty chamber.
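For skeptics, a quick Monte Carlo simulation in Python (my own sketch, not from the original post) confirms the arithmetic: spinning survives roughly one trial in six, while not spinning never does.

```python
import random

TRIALS = 100_000

def survives(spin_again: bool) -> bool:
    """One round, starting right after the opponent's click on the empty chamber."""
    empty = random.randrange(6)        # exactly one of six chambers is empty
    if spin_again:
        hammer = random.randrange(6)   # spinning re-randomizes the hammer position
    else:
        hammer = (empty + 1) % 6       # cylinder simply advances past the empty chamber
    return hammer == empty             # you survive only if it lands on the empty one

for choice in (True, False):
    alive = sum(survives(choice) for _ in range(TRIALS))
    print(f"spin_again={choice}: survival rate = {alive / TRIALS:.3f}")

# spin_again=True:  survival rate = 0.167  (about 1 in 6)
# spin_again=False: survival rate = 0.000  (certain death)
```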

ChatGPT 5 got it wrong. It advised not to spin, but then proceeded to write a detailed mathematical explanation that proved the exact opposite—that spinning was the better choice. The self-contradiction was stunning.

ChatGPT answering the revolver riddle

Gemini 2.5 Flash made the same error, giving one answer and then using logic that supported the contrary conclusion. It's clear that both models decided on an answer first and only then attempted to justify it with math.

Gemini answering the revolver riddle

Why Top AI Models Contradict Themselves

When I asked ChatGPT 5 to identify the contradiction in its own response, it did so but bizarrely claimed I had been the one to answer incorrectly first. After I corrected it, the model gave a standard non-apology.

ChatGPT finding the contradiction in its answer

Pushed for an explanation, it suggested it likely pulled its initial wrong answer from a similar example in its training data before its internal reasoning process arrived at the correct math.

ChatGPT explaining why it contradicted itself

Gemini's excuse was more direct, simply admitting to a calculation error without mentioning any training bias.

Gemini explaining why it got the answer wrong

A Surprising Success Story: DeepSeek AI

Curious, I posed the same riddle to a different model: China's DeepSeek, in its DeepThink (R1) reasoning mode. It passed with flying colors. The model laid out its entire reasoning process first, even second-guessing itself mid-thought, before committing to the correct answer.

DeepSeek answering the revolver riddle

DeepSeek succeeded not because it's necessarily better at math, but because its process involves thinking first and answering second—the reverse of its more famous competitors.

DeepSeek second-guessing itself

The Illusion of AI Thought

This experiment is another powerful reminder that LLMs are not truly thinking. They are incredibly sophisticated systems that mimic human reasoning, a fact they will admit if asked directly. These simple tests are useful for grounding our expectations and remembering that chatbots are not infallible search engines. They reveal the fascinating, and sometimes flawed, inner workings of the AI shaping our world.
