AI Agent Race: ChatGPT Leads, But Humans Still Reign

2025-05-14 • John Koetsier • 4 minute read

AI Agent Benchmark Reveals New Frontrunner

OpenAI's recent o3 model has outperformed competitors such as Anthropic's Claude, Google's Gemini, and Hangzhou-based DeepSeek in a benchmark that tests AI agents on web research tasks. However, a considerable gap still separates human researchers from even the best AI agents available today.

Study Highlights Performance Metrics

Research firm FutureSearch subjected 11 major large language models to a total of 89 messy, real-world research assignments. Each model was evaluated on its ability to perform tasks such as finding original sources, seeking out data, gathering evidence, compiling data, and validating claims.

The highest score achieved by any AI agent was 0.51, on a scale where an estimated perfect agent would reach approximately 0.8. In other words, even top-tier AI agents are still comfortably outperformed by humans in these complex research scenarios.

“We can conclude that frontier agents … substantially underperform smart generalist researchers who are given ample time,” the study states.

AI Model Rankings: The Current Leaders

Here is how the various AI models scored in the FutureSearch benchmark:

OpenAI o3: 0.51
Claude 3.7 Sonnet (Thinking): 0.49
Claude 3.7 Sonnet (Standard): 0.48
Gemini 2.5 Pro: 0.45
GPT-4.1: 0.42
DeepSeek R1: 0.31
Mistral Small: 0.30
GPT-4 Turbo: 0.27
Gemma 3: 0.20
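
To put these scores in context, it can help to express each one as a share of the roughly 0.8 that an estimated perfect agent would reach. The short Python sketch below does that for a handful of the listed models. The helper name share_of_ceiling and the choice to divide by 0.8 are illustrative assumptions, not something taken from the FutureSearch study, and the ceiling itself is only an approximation.

```python
# Scores as reported in the list above. The 0.8 ceiling is the article's
# approximate figure for a perfect agent, so the percentages are rough.
ESTIMATED_CEILING = 0.8

scores = {
    "OpenAI o3": 0.51,
    "Claude 3.7 Sonnet (Thinking)": 0.49,
    "Gemini 2.5 Pro": 0.45,
    "DeepSeek R1": 0.31,
    "GPT-4 Turbo": 0.27,
}


def share_of_ceiling(score: float, ceiling: float = ESTIMATED_CEILING) -> float:
    """Return a score as a fraction of the estimated perfect-agent score."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return score / ceiling


# Print models from highest to lowest score with their share of the ceiling.
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<30} {score:.2f}  {share_of_ceiling(score):.0%} of ceiling")
```

Under these assumptions, the leading agent lands at roughly 64 percent of the estimated ceiling, while GPT-4 Turbo sits at about a third of it.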

Rapid Pace of AI Agent Improvement

Despite the current gap, AI agents are improving at a rapid pace. Using the year-old GPT-4 Turbo's score of 0.27 as a baseline, researchers indicate that “about 45 percent of the gap between smart generalist researchers and frontier agents” was closed within just a year of development.
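
The “about 45 percent” figure can be roughly reproduced from the numbers in this article. The sketch below assumes the gap is measured from the older score (0.27) up to the approximately 0.8 estimated for a perfect agent, and that the newer score is o3's 0.51. The study may define the gap differently, so treat the function gap_closed and its inputs as an illustration rather than the researchers' actual method.

```python
def gap_closed(old_score: float, new_score: float, ceiling: float) -> float:
    """Fraction of the distance from old_score to ceiling covered by new_score."""
    if ceiling <= old_score:
        raise ValueError("ceiling must be greater than old_score")
    return (new_score - old_score) / (ceiling - old_score)


# Assumed inputs, all taken from the article: GPT-4 Turbo (0.27), o3 (0.51),
# and the approximate perfect-agent score (0.8).
fraction = gap_closed(old_score=0.27, new_score=0.51, ceiling=0.8)
print(f"Share of the gap closed in a year: {fraction:.0%}")
```

With these inputs the calculation is 0.24 / 0.53, which comes out to about 45 percent and is consistent with the quoted figure.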

Accessibility and Performance: Free Versus Paid Models

Furthermore, free or inexpensive agents such as DeepSeek are not significantly behind the paid, top-end AI agents from major developers like OpenAI. OpenAI's o3 model leads the pack, with Claude and Gemini close behind, and closed models are currently clearly superior for research-heavy tasks. Even so, free and open-source models are becoming increasingly capable.

Persistent Challenges AI Agents Face

All LLM-based AI agents, however, still face major issues. They generally fall short of smart human researchers, particularly in strategic planning, thoroughness, evaluating sources for quality, and what the study calls “memory management,” where agents tend to forget earlier findings mid-task. One problem in particular is that AI agents often engage in “satisficing,” which means accepting a good-enough answer instead of continuing to search until they find the highest-quality response.

Why OpenAI's o3 Model Excelled

Resisting this tendency towards satisficing is a core reason why OpenAI's o3 model came in first. Compared to other models, o3 tended to validate its answers more thoroughly and was less likely to stop short of better available answers.

The Path to Surpassing Human Performance

Given that a single year has closed almost half the gap between elite human researchers and the best AI agents, it may not be long until AI agents outperform even the best humans at these tasks.

However, progress does not always follow a straight line. Recent challenges, such as OpenAI's latest model being too agreeable, make it clear that development will involve overcoming various obstacles.

Conclusion: Verifying AI Outputs Remains Key

For now, at least, it remains essential to double-check any results from generative AI applications such as AI agents to ensure accuracy. Human oversight and critical evaluation are still crucial when using these powerful tools.
