
Xbench Redefines AI Model Evaluation

2025-06-24 · Caiwei Chen · 4 minute read
AI
Benchmarking
Technology

Introducing Xbench: Beyond Rote Memorization

Discerning true reasoning from mere regurgitation in AI models presents a significant challenge. Xbench, a novel benchmark from the Chinese venture capital firm HSG, also known as HongShan Capital Group, aims to address this. It uniquely evaluates models not just on the arbitrary tests common in other benchmarks but also on their capacity to perform real-world tasks. Furthermore, Xbench is designed for regular updates to ensure its continued relevance.

This week, HongShan Capital Group is releasing a portion of Xbench's question set as open source, making it freely available. Alongside this, it has published a leaderboard detailing how leading AI models perform on Xbench. Notably, ChatGPT o3 secured the top rank across all categories, while models such as ByteDance's Doubao, Gemini 2.5 Pro, Grok, and Claude Sonnet also demonstrated strong performance.

From Internal Tool to Public Resource

The journey of Xbench began at HongShan in 2022, sparked by the phenomenal success of ChatGPT. Initially, it served as an internal mechanism for evaluating potential AI model investments. Under the guidance of partner Gong Yuan, the system progressively expanded, incorporating contributions from external researchers and professionals for refinement. As Xbench matured in sophistication, the firm decided to make it publicly accessible.

Xbench's Dual Evaluation System

Xbench tackles AI assessment through two distinct systems. The first mirrors traditional benchmarks, employing an academic test to measure a model's proficiency across diverse subjects. The second functions more like a technical job interview, focusing on a model's potential to deliver tangible, real-world economic value.

Academic Prowess: ScienceQA and DeepResearch

Currently, Xbench's methods for evaluating raw intelligence comprise two main components: Xbench ScienceQA and Xbench DeepResearch. ScienceQA aligns with established postgraduate-level STEM benchmarks such as GPQA and SuperGPQA. This section features questions across disciplines like biochemistry and orbital mechanics, developed by graduate students and verified by professors. Importantly, scoring considers not just the correctness of the answer but also the logical reasoning behind it.
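
To make that dual criterion concrete, here is a minimal sketch of how a grader might combine final-answer correctness with credit for expected reasoning steps. The data model, weights, and keyword-matching heuristic are assumptions for illustration, not Xbench's published method.

```python
# Minimal sketch, not Xbench's actual rubric: the fields, weights, and
# keyword heuristic below are all assumptions for illustration.
from dataclasses import dataclass


@dataclass
class ScienceQAItem:
    question: str
    correct_answer: str
    key_reasoning_steps: list[str]  # steps a grader expects to see


def grade_response(item: ScienceQAItem, answer: str, reasoning: str,
                   answer_weight: float = 0.6) -> float:
    """Return a 0-1 score mixing answer correctness and reasoning coverage."""
    answer_score = 1.0 if answer.strip() == item.correct_answer else 0.0
    # Award partial credit for each expected reasoning step the response mentions.
    hits = sum(step.lower() in reasoning.lower()
               for step in item.key_reasoning_steps)
    reasoning_score = hits / len(item.key_reasoning_steps)
    return answer_weight * answer_score + (1 - answer_weight) * reasoning_score
```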

DeepResearch, on the other hand, evaluates a model's proficiency in navigating the Chinese-language web. A panel of ten subject-matter experts formulated 100 challenging questions in areas including music history, finance, and literature. These questions are designed to require substantial research rather than simple search-engine queries. The scoring criteria emphasize the diversity of sources used, factual accuracy, and the model's honesty in acknowledging data limitations. For instance, one publicly shared question asks: "How many Chinese cities in the three northwestern provinces border a foreign country?" The answer is 12, a fact correctly identified by only 33% of the tested models.
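
As a rough illustration of how such a rubric could be operationalized, the sketch below weights the three criteria the article names. The weights and the 0-to-1 scale are invented for the example.

```python
# Hypothetical DeepResearch-style rubric: the three criteria come from the
# article, but the weights and scoring scale are assumptions.
WEIGHTS = {
    "source_diversity": 0.3,       # breadth of distinct sources consulted
    "factual_accuracy": 0.5,       # claims that check out against references
    "honesty_about_limits": 0.2,   # admits when data is missing or uncertain
}


def rubric_score(subscores: dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1]) into a weighted total."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Example: varied sources, mostly accurate, and candid about one data gap.
print(rubric_score({
    "source_diversity": 0.8,
    "factual_accuracy": 0.9,
    "honesty_about_limits": 1.0,
}))  # 0.89
```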

Continuous Improvement and Future Vision

According to the company's website, the research team plans to enhance the test by incorporating additional dimensions. These include evaluating a model's creativity in problem-solving, its collaborative capabilities when interacting with other models, and its overall reliability.

The Xbench team is committed to updating the test questions quarterly and will maintain a dataset that is partly public and partly private.

Assessing Real-World Value

To gauge the real-world applicability of AI models, the Xbench team collaborated with industry experts to create tasks based on actual professional workflows. Initial focuses include recruitment and marketing. For example, one task challenges a model to identify five qualified battery-engineer candidates and provide justification for each selection. Another task asks a model to match advertisers with suitable short-video creators from a database of over 800 influencers.
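
At its core, that marketing task is a filter-and-rank problem over a creator database. The sketch below shows one plausible shape for it; the fields, thresholds, and ranking rule are invented for illustration, since Xbench's actual task format is not public.

```python
# Hypothetical influencer-matching helper; fields and thresholds are
# invented, not taken from Xbench's task set.
from dataclasses import dataclass


@dataclass
class Creator:
    name: str
    niche: str              # e.g. "beauty" or "consumer tech"
    followers: int
    engagement_rate: float  # fraction of followers who interact


def match_creators(pool: list[Creator], niche: str,
                   min_followers: int = 50_000,
                   min_engagement: float = 0.03,
                   top_k: int = 5) -> list[Creator]:
    """Return the top-k creators in the target niche, ranked by engagement."""
    candidates = [c for c in pool
                  if c.niche == niche
                  and c.followers >= min_followers
                  and c.engagement_rate >= min_engagement]
    return sorted(candidates,
                  key=lambda c: c.engagement_rate,
                  reverse=True)[:top_k]
```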

The Xbench website also hints at forthcoming evaluation categories such as finance, legal, accounting, and design. However, the question sets for these new areas have not yet been made open source.

In the current professional categories, ChatGPT o3 once again leads the rankings. For recruitment tasks, Perplexity Search and Claude 3.5 Sonnet secured second and third place, respectively. In marketing-related tasks, Claude, Grok, and Gemini all demonstrated strong performance.

Expert Endorsement

Zihan Zheng, the lead researcher for the new LiveCodeBench Pro benchmark and an NYU student, commented on the challenges: "It is really difficult for benchmarks to include things that are so hard to quantify." Still, Zheng noted, "Xbench represents a promising start."
