Can You Trust Dr. ChatGPT? New Study Reveals Flaws
The Rise of AI-Powered Self-Diagnosis
More and more people are turning to generative AI tools like ChatGPT when they feel sick, typing in their symptoms in search of a quick diagnosis. But how reliable is the medical advice provided by these large language models (LLMs)?
A recent study published in the journal iScience put ChatGPT to the test, and the results were a mix of surprising successes and significant failures.
Putting ChatGPT Under the Microscope
The research was led by Ahmed Abdeen Hamed, a research fellow at the Thomas J. Watson College of Engineering and Applied Science at Binghamton University. He collaborated with researchers from AGH University of Krakow, Howard University, and the University of Vermont. Hamed, who works in Professor Luis M. Rocha’s Complex Adaptive Systems and Computational Intelligence Lab, has a background in verifying AI-generated content. He previously developed an algorithm called xFakeSci that can detect fake scientific papers with high accuracy. This new research extends that work to the biomedical capabilities of LLMs.
“People talk to ChatGPT all the time these days, and they say: ‘I have these symptoms. Do I have cancer? Do I have cardiac arrest? Should I be getting treatment?’” Hamed explained. “It can be a very dangerous business, so we wanted to see what would happen if we asked these questions, what sort of answers we got and how these answers could be verified from the biomedical literature.”
The Surprising Strengths of AI Medical Knowledge
The researchers tested ChatGPT on its knowledge of disease terms, drug names, genetics, and symptoms. The results in the first three categories were unexpectedly high, far exceeding Hamed's initial expectation of "at most 25% accuracy."
- Disease Terms: 88-97% accuracy
- Drug Names: 90-91% accuracy
- Genetic Information: 88-98% accuracy
“The exciting result was ChatGPT said cancer is a disease, hypertension is a disease, fever is a symptom, Remdesivir is a drug and BRCA is a gene related to breast cancer,” Hamed noted. “Incredible, absolutely incredible!”
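The kind of check Hamed describes can be pictured as a simple classification probe: ask the model which category a biomedical term belongs to, then compare its answer against a trusted label. Below is a minimal sketch of that idea, assuming the openai Python client; the model name, prompt wording, and the small hand-labelled reference list are illustrative stand-ins, not the study's actual protocol or terminology sources.

```python
# Illustrative sketch only: probe an LLM with biomedical terms and score its
# category answers against a small hand-labelled reference list. The terms,
# labels, model name, and prompt are assumptions for demonstration purposes,
# not the study's actual benchmark.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Tiny stand-in for a formal vocabulary (the study drew on curated biomedical sources)
reference = {
    "hypertension": "disease",
    "fever": "symptom",
    "remdesivir": "drug",
    "BRCA1": "gene",
}

def classify(term: str) -> str:
    """Ask the model for a one-word category: disease, symptom, drug, or gene."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the study tested ChatGPT
        messages=[{
            "role": "user",
            "content": f"In one word (disease, symptom, drug, or gene), "
                       f"what category does the biomedical term '{term}' belong to?",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

correct = sum(classify(term) == label for term, label in reference.items())
print(f"accuracy: {correct}/{len(reference)}")
```

At scale, the reference labels would come from curated biomedical ontologies rather than a hand-written dictionary, which is exactly where the study found the model's informal language starts to cause trouble.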
Where AI Falters: Symptoms and Hallucinations
Despite its strengths, ChatGPT showed a significant weakness in identifying symptoms, with accuracy between 49% and 61%. The researchers attribute this to a mismatch in language: medical professionals use formal biomedical ontologies for precision, while ChatGPT, trained on a vast amount of internet data, tends to use more informal, "friendly" language to communicate with everyday users. It simplifies medical terminology, which can lead to inaccuracies.
An even more concerning issue was the AI's tendency to "hallucinate," or invent information. When the researchers asked for specific accession numbers for DNA sequences from the National Institutes of Health's GenBank database—unique identifiers like NM_007294.4 for the BRCA1 gene—ChatGPT simply made them up. Hamed views this as a major flaw.
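One way to catch this particular kind of hallucination is to check a returned accession number against GenBank itself before trusting it. The sketch below does that with NCBI's public E-utilities esearch endpoint; it assumes the requests library and treats a zero hit count as "not found", a simplification for illustration rather than the study's verification method.

```python
# Illustrative sketch: verify that an accession number an LLM hands back
# actually exists in GenBank's nucleotide database, using NCBI's E-utilities
# esearch endpoint. Error handling and rate limiting are omitted for brevity.
import requests
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def accession_exists(accession: str) -> bool:
    """Return True if GenBank's nucleotide database has a record for `accession`."""
    resp = requests.get(
        ESEARCH,
        params={"db": "nucleotide", "term": accession},
        timeout=30,
    )
    resp.raise_for_status()
    count = ET.fromstring(resp.text).findtext("Count")
    return count is not None and int(count) > 0

print(accession_exists("NM_007294.4"))    # real BRCA1 mRNA accession -> True
print(accession_exists("NM_0000000.99"))  # made-up identifier -> expected False
```

Pairing model output with this sort of database lookup is a modest example of the grounding in biomedical resources that Hamed argues for below.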
The Path Forward: Improving AI for Healthcare
Despite the discovery of these hallucinations, the research points to a clear opportunity for improvement. Hamed believes that the flaws can be fixed, making these AI tools even more powerful and reliable.
“Maybe there is an opportunity here that we can start introducing these biomedical ontologies to the LLMs to provide much higher accuracy, get rid of all the hallucinations and make these tools into something amazing,” he suggested.
Hamed's ultimate goal is not to discredit LLMs, but to expose their current limitations so that data scientists can refine the models, ensuring that the knowledge they provide is accurate and trustworthy.