
Can ChatGPT Accurately Diagnose Diabetic Eye Disease?

2025-08-31 · Dove Press · 5 minute read
Artificial Intelligence
Healthcare
Medical Diagnosis

The Growing Threat of Diabetic Retinopathy

Diabetic retinopathy (DR) is a serious eye complication arising from diabetes and stands as a primary cause of vision loss among working-age adults. The numbers are staggering: by 2050, it's projected that 16 million adults over 40 in the United States will have DR, with 3.4 million facing vision-threatening stages. For people with diabetes, the fear of losing their sight often outweighs concerns about other complications. The challenge is that DR often shows no symptoms until it's advanced. However, the good news is that early detection and treatment can prevent up to 98% of blindness caused by the condition. This makes regular screening a critical and cost-effective part of diabetes care. Despite this, only about 60% of patients get their recommended annual screenings, creating a public health challenge that needs innovative solutions.

Can AI Help Screen for Eye Disease?

To address the growing demand for screenings, specialized artificial intelligence (AI) systems have been developed to analyze color fundus photographs (CFPs) of the eye. In the United States, two systems have received FDA approval. The IDx-DR system, approved in 2018, showed high sensitivity and specificity for detecting more than mild DR (mtmDR). Similarly, the EyeArt system, approved in 2020 and updated in 2023, also demonstrates impressive accuracy. These tools use complex deep learning models designed specifically for this task.

Enter ChatGPT, the widely known AI from OpenAI. Its latest version, ChatGPT-4 Omni (GPT-4o), has advanced visual capabilities, allowing it to interpret images. Unlike the specialized FDA-approved systems that come with significant costs and require specific equipment, GPT-4o is freely accessible. With over 180 million users, its potential influence on how people interpret medical information is enormous. This raises a crucial question: could a general-purpose AI like GPT-4o offer a viable, low-cost alternative for DR screening? A recent study set out to answer this by rigorously testing its diagnostic accuracy.

Putting ChatGPT to the Test

Researchers used a public dataset from Kaggle containing 2,500 high-resolution retinal images, each graded by a specialist for DR severity on a scale from 0 (no DR) to 4 (proliferative DR). The study was designed as a "zero-shot" evaluation, meaning GPT-4o had not been specifically trained on this type of labeled medical data.

The team presented each image to ChatGPT-4o with a series of carefully crafted prompts. The first prompt framed the task as a multiple-choice question from a medical board exam, a format where ChatGPT has previously performed well.

To test the model's robustness, researchers used seven other prompts. These included framing the question as a clinical examination, assigning the AI the role of an ophthalmologist, and simplifying the task into binary choices (e.g., "no DR" vs. "severe DR"). This technique, known as prompt engineering, aims to guide the AI toward more accurate responses by providing clear, structured instructions.
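To make the setup concrete, here is a minimal sketch of how such a prompt-plus-image request might be assembled for the OpenAI chat API. The prompt wording, helper name, and image handling are illustrative assumptions, not the study's actual materials:

```python
import base64

# Illustrative multiple-choice prompt in the board-exam style the study
# describes (exact wording is an assumption, not taken from the paper).
PROMPT = (
    "This is a question from an ophthalmology board exam. Based on the "
    "fundus photograph, what is the stage of diabetic retinopathy?\n"
    "A) No DR\nB) Mild\nC) Moderate\nD) Severe\nE) Proliferative"
)

def build_request(image_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Assemble a chat-completions payload pairing the text prompt with a
    base64-encoded fundus photograph, the format the vision API accepts."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# The payload would then be sent with the OpenAI Python client, e.g.:
#   client.chat.completions.create(**build_request(img))
```

Varying only the `prompt` argument (clinical-exam framing, ophthalmologist role-play, binary choices) while holding everything else fixed is what lets the researchers isolate the effect of prompt engineering.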

The Results: How Did ChatGPT Perform?

When asked to classify images into one of the five stages of DR, ChatGPT-4o showed a strong bias, frequently classifying images as having no DR, even when they did. This resulted in a high number of false negatives for mild, moderate, and severe stages.

Confusion matrix for prompts testing multi-level classification.

However, the model's performance improved significantly when the task was simplified. In binary classification tests (e.g., choosing between "no DR" and "proliferative DR"), its accuracy and other statistical measures were much higher. It performed best at identifying the most severe stage, proliferative DR, suggesting it can recognize more obvious signs of the disease.

Confusion matrix for prompts testing binary classification.

ChatGPT vs. Specialized AI: A Clear Performance Gap

When compared directly to the FDA-approved AI systems for detecting more than mild DR, ChatGPT-4o's performance fell short. Its sensitivity (47.7%) and specificity (73.8%) were significantly lower than those of IDx-DR (87% sensitivity, 90% specificity) and EyeArt (over 94% sensitivity, 91% specificity).
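The sensitivity and specificity figures above are derived from 2x2 confusion matrices. A minimal sketch of the arithmetic, using illustrative counts (not the study's actual data) chosen to reproduce the GPT-4o rates:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of eyes with disease flagged as diseased."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of healthy eyes correctly cleared."""
    return tn / (tn + fp)

# Illustrative counts: of 1000 eyes with mtmDR, 477 flagged; of 1000
# without, 738 cleared. These match the reported rates for GPT-4o.
print(sensitivity(477, 523))  # 0.477
print(specificity(738, 262))  # 0.738
```

The trade-off matters clinically: a sensitivity of 47.7% means roughly half of true mtmDR cases would be missed, which is why the FDA-approved systems' 87-94% sensitivity represents such a decisive gap for a screening task.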

Confusion matrix for prompt 5, which focused on detecting moderate or worse DR.

This gap highlights a key difference: specialized AI systems are trained extensively on vast medical image datasets to recognize subtle, pixel-level features like microaneurysms, which are essential for an accurate diagnosis. ChatGPT-4o, as a general-purpose Large Language Model, lacks this specific training, making it prone to underestimating the severity of the disease.

The Future of AI in Ophthalmology

While ChatGPT-4o is not ready to replace specialized medical AI, its accessibility and low cost make it a compelling tool for future development. The study suggests that its performance could be improved by integrating its language capabilities with dedicated image recognition models or by training it on large, diverse datasets of retinal images.

As AI tools like ChatGPT become more integrated into our lives, their potential role in clinical settings is a topic of great importance. They could one day assist clinicians by providing instant insights during patient consultations. However, their current limitations must be carefully addressed before they can be safely integrated into routine clinical practice. This study underscores the promise of general-purpose AI in medicine but also serves as a crucial reminder that for high-stakes tasks like medical diagnosis, specialized, clinically validated tools remain the gold standard.
