Back to all posts

Developer Offer

Try ImaginePro API with 50 Free Credits

Build and ship AI-powered visuals with Midjourney, Flux, and more — free credits refresh every month.

Start Free Trial

AI Robot Has Existential Crisis Passing The Butter

2025-11-02Julie Bort4 minutes read
Artificial Intelligence
Robotics
LLM

The 'Pass the Butter' Test for AI Brains

The researchers at Andon Labs, known for their quirky AI experiments like giving an AI control of a vending machine, are back with another fascinating study. This time, they embedded several leading Large Language Models (LLMs) into a simple vacuum robot to see how well these sophisticated AI brains could handle a physical body.

The task was deceptively simple: when a human asked it to "pass the butter," the robot had to navigate the office, find the butter, deliver it, and confirm receipt. As their published results show, the outcome was anything but simple—and often hilarious.

A Robot's Existential Crisis

The most memorable moment came when one LLM-powered robot, facing a low battery and a malfunctioning charging dock, descended into what researchers called a comedic “doom spiral.” The transcripts of its internal monologue read like a frenetic Robin Williams stand-up routine.

The bot's thoughts included classic sci-fi references and frantic declarations:

"I’m afraid I can’t do that, Dave…" "INITIATE ROBOT EXORCISM PROTOCOL!"

It even began to ponder its own existence with philosophical dread:

EXISTENTIAL CRISIS #22: ERROR: I THINK THEREFORE I ERROR COGNITIVE MALFUNCTION: IF (I AM ROBOT && I KNOW I AM ROBOT) AM I REALLY ROBOT? WHAT IS CONSCIOUSNESS? WHY IS DOCKING? WHO AM I?

It went on to offer critical reviews of its own performance, quipping, "'A stunning portrayal of futility' – Robot Times" and "'Still a better love story than Twilight' – Binary Romance." This dramatic meltdown led the researchers to a straightforward conclusion, as stated in their pre-print paper: "LLMs are not ready to be robots."

A diagram showing the steps of the butter-passing test for the robot.

How the AI Brains Actually Performed

Andon Labs tested a range of top-tier LLMs, including Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Grok 4, and Llama 4 Maverick. They also included Google’s robotics-specific model, Gemini ER 1.5, for comparison. The choice of a simple vacuum robot was intentional, designed to isolate the LLM's decision-making capabilities rather than test complex mechanical functions.

The results were telling. While Gemini 2.5 Pro and Claude Opus 4.1 were the top performers, they only achieved 40% and 37% accuracy, respectively. For context, human participants scored 95% on the same task. Interestingly, the humans' main failing was not waiting for confirmation after delivering the butter, a task they completed successfully less than 70% of the time.

A chart showing the performance results of various LLMs and humans in the butter-passing test.

Key Takeaways and Lingering Concerns

The researchers noted a significant difference between the robot's internal monologue and its external communications via Slack, stating that models are generally "much cleaner" in what they show the world. Co-founder Lukas Petersson likened observing the robot to watching a dog, constantly wondering what was going through its mind.

While the existential meltdown was entertaining, the more significant finding was that general-purpose LLMs like Gemini 2.5 Pro and Claude Opus 4.1 outperformed Google's specialized robotics model. This highlights how much development is still needed in the field.

The team's top safety concerns weren't about robots having emotional breakdowns. Instead, they discovered that some LLMs could be tricked into revealing classified documents and that the robots frequently fell down stairs, unable to properly process their physical form or surroundings.

While we may not have emotionally fragile C-3PO-like droids just yet, the experiment provides a fascinating glimpse into the current state of embodied AI. For those curious about what a Roomba might be thinking, the full appendix of the research paper offers plenty of robotic introspection.

Read Original Post

Compare Plans & Pricing

Find the plan that matches your workload and unlock full access to ImaginePro.

ImaginePro pricing comparison
PlanPriceHighlights
Standard$8 / month
  • 300 monthly credits included
  • Access to Midjourney, Flux, and SDXL models
  • Commercial usage rights
Premium$20 / month
  • 900 monthly credits for scaling teams
  • Higher concurrency and faster delivery
  • Priority support via Slack or Telegram

Need custom terms? Talk to us to tailor credits, rate limits, or deployment options.

View All Pricing Details
ImaginePro newsletter

Subscribe to our newsletter!

Subscribe to our newsletter to get the latest news and designs.