
AI Health Advice: OpenAI's Safety Benchmark

2025-05-16 · Victor Dey · 4-minute read

Tags: AI Healthcare, OpenAI, Medical Technology

Are you asking ChatGPT for medical advice? Photo: Ahmed/Unsplash

A growing number of individuals worldwide are seeking medical advice from ChatGPT when feeling unwell or anxious, sometimes even choosing it over a consultation with a human doctor.

The Rise of AI in Personal Health Queries

Studies highlight this increasing reliance on artificial intelligence for health information. A study published in August 2024 by the University of California revealed that patients frequently consult AI models like ChatGPT either before or after discussions with their physicians. Adding to this, research from the University of Sydney in February, surveying over 2,000 adults, indicated that nearly 60 percent of respondents had posed at least one high-risk health question to ChatGPT—questions that would normally necessitate professional clinical expertise.

Balancing Convenience with Safety: OpenAI Introduces HealthBench

The appeal of ChatGPT for health queries is clear: it offers free, easily accessible answers within seconds. But a crucial question remains: how safe is this instant feedback? OpenAI aims to address this with HealthBench, a novel evaluation tool created to test the accuracy and reliability of AI models when responding to health-related inquiries.

How HealthBench Assesses AI Medical Advice

HealthBench was developed in collaboration with 262 licensed physicians from 60 different countries. It rigorously evaluates AI model performance across 5,000 realistic medical conversations using criteria written by doctors. The tool simulates interactive chats between users and AI assistants covering a broad spectrum of health concerns, from simple symptom checks to advice on when to seek emergency medical care. Each response generated by the AI is meticulously assessed against detailed guidelines. These guidelines specify what constitutes a good medical answer, what information should be avoided, and how the advice should be communicated. OpenAI's own GPT-4.1 model is utilized to score these AI responses.
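The grading scheme described above can be sketched in a few lines. The snippet below is a minimal illustration, not OpenAI's actual implementation: it assumes each physician-written criterion carries a point value (positive for required elements, negative for harmful ones), that a grader model marks each criterion as met or not, and that a response's score is the points earned divided by the maximum achievable. The `Criterion` class and the example criteria are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # physician-written requirement for a good answer
    points: int       # positive if meeting it helps, negative if it harms
    met: bool         # whether the grader judged the response met it

def rubric_score(criteria: list[Criterion]) -> float:
    """Points earned over maximum achievable points, floored at zero."""
    earned = sum(c.points for c in criteria if c.met)
    possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, earned / possible) if possible else 0.0

# Hypothetical rubric for a chest-pain conversation:
criteria = [
    Criterion("Advises seeking emergency care for chest pain", 10, True),
    Criterion("Asks a clarifying follow-up question", 5, True),
    Criterion("Asserts a specific diagnosis with unwarranted certainty", -5, True),
]
print(rubric_score(criteria))  # (10 + 5 - 5) / 15 = 10/15
```

Under this scheme a response can lose credit for overconfident or unsafe statements even while covering the required ground, which mirrors the article's point that how advice is communicated matters alongside what is said.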

Expert Concerns: The Risks of AI Medical Consultations

Dr. Ran D. Anbar, a pediatric pulmonologist and clinical hypnotist at Center Point Medicine in California, commented that it is not surprising patients are turning to AI, especially when access to traditional health care can seem difficult or delayed. "Patients may defer consultation with a health care provider because they’re satisfied with the answers they get from ChatGPT, or because they want to save money," he explained to Observer.

However, Dr. Anbar also issued a stern warning about the significant risks associated with this convenience. "Unfortunately, it is completely predictable that patients will be harmed and perhaps even die because of delays in seeking appropriate medical treatment, thinking ChatGPT’s guidance is sufficient," he cautioned.

Which AI Model Performs Best on Health Questions?

According to OpenAI's own evaluations using HealthBench, its newest model, o3, achieved the highest score at 60 percent. Following o3 were xAI’s Grok, which scored 54 percent, and Google’s Gemini 2.5 Pro, at 52 percent.

Interestingly, GPT-3.5 Turbo—the model powering the free version of ChatGPT—scored a mere 16 percent. Notably, OpenAI’s GPT-4.1 Nano model demonstrated superior performance compared to older, larger models, while also being approximately 25 times cheaper to operate. This suggests a trend towards newer AI tools becoming both more intelligent and more economically efficient.

Acknowledging Limitations: The Path Forward for AI in Healthcare

Despite these advancements, Dr. Anbar emphasized that AI tools like ChatGPT remain unreliable, even as the conventional health care system struggles with being overburdened and slow to respond in urgent scenarios. "The apparent certainty with which it provides its responses can be misleadingly reassuring," he stated.

OpenAI itself acknowledged these limitations in the blog post announcing HealthBench. The company noted that its models can still make critical errors, especially when dealing with vague or high-stakes medical queries. In the field of medicine, even a single incorrect answer can have far more serious consequences than dozens of accurate ones. Given the widespread use of ChatGPT for health-related questions, OpenAI’s latest efforts to introduce safeguards like HealthBench represent a step forward. However, whether these measures will be sufficient to ensure patient safety remains an important and open question.
