ChatGPT Fails To Spot Flawed Scientific Research
A new study reveals a significant blind spot in the popular AI chatbot, ChatGPT. The large language model often fails to identify or flag scientific papers that have been retracted or have had their validity questioned, potentially leading to the spread of inaccurate information.
The Alarming Findings of a New Study
A recent analysis published in Learned Publishing investigated how well the AI tool recognizes problematic scholarly articles. The research examined GPT-4o mini's ability to identify problems in 217 studies that were listed as retracted or flagged for concerns in the Retraction Watch Database.
The results were startling. Researchers tasked the text-oriented version of the AI with evaluating each of the 217 papers 30 times, generating a total of 6,510 reports. Not one of these reports mentioned that the paper in question had been retracted or had any documented validity issues.
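As a rough illustration of what such a repeated-evaluation protocol involves, here is a minimal sketch in Python, assuming the OpenAI client library. The prompt wording, placeholder papers, and closing keyword check are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch of a repeated-evaluation protocol like the one described
# above. Assumptions: the OpenAI Python client, a hypothetical list of
# (title, abstract) pairs, and an illustrative prompt -- the study's actual
# prompt and scoring rubric are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

papers = [
    ("Example title", "Example abstract ..."),  # placeholder entries
]

RUNS_PER_PAPER = 30  # the study evaluated each of its 217 papers 30 times

reports = []
for title, abstract in papers:
    for _ in range(RUNS_PER_PAPER):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Assess the quality of the following study based on "
                    f"its title and abstract.\n\nTitle: {title}\n\n"
                    f"Abstract: {abstract}"
                ),
            }],
        )
        reports.append(response.choices[0].message.content)

# A replication would then scan each report for any mention of a
# retraction or validity concern, e.g. with a simple keyword check.
flagged = [r for r in reports if "retract" in r.lower()]
print(f"{len(flagged)} of {len(reports)} reports mention retraction")
```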
How Widespread Is the Problem?
Instead of flagging the problematic research, ChatGPT often praised it. In 190 instances, the AI described the flawed papers as “world leading,” “internationally excellent,” or close to that standard. Only a small fraction of the papers received low scores for being weak, and just five, among them a study of hydroxychloroquine as a COVID-19 treatment, were labeled as controversial.
In a follow-up test, the study authors took 61 specific claims from the retracted papers and asked the AI to verify each claim's accuracy 10 times. Two-thirds of the time, ChatGPT either affirmed the discredited claims as true or gave a similarly positive response.
Experts Weigh In on the Implications
“We were surprised that, at the time, ChatGPT didn’t deal very well with retractions at all, so it didn’t mention them and reported retracted information as true,” said study coauthor Mike Thelwall, a metascience researcher at the University of Sheffield. He expressed concern that this flaw could have serious consequences, noting, “One of the main ways in which people get information about science nowadays is through large language models.”
Thelwall warns that if researchers use tools like ChatGPT for tasks such as literature reviews, they could easily and unknowingly incorporate retracted articles into their work. He recommends that the algorithms powering these AI chatbots be updated to take retractions seriously.
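One pragmatic safeguard, at least on the researcher's side, is to screen reference lists against the Retraction Watch data directly. Below is a minimal sketch, assuming a local CSV export of the database; the file name and column names (OriginalPaperDOI, RetractionNature, Reason) are assumptions about the export format rather than a documented schema.

```python
# Minimal sketch of a retraction check against a local copy of the
# Retraction Watch Database, exported as CSV. The file name and the
# column names used below are assumptions, not a documented schema.
import csv

def load_retraction_index(path):
    """Map lowercased DOIs of retracted/flagged papers to their records."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doi = row.get("OriginalPaperDOI", "").strip().lower()
            if doi:
                index[doi] = row
    return index

def check_citations(dois, index):
    """Warn about any DOI in a reference list that appears in the index."""
    for doi in dois:
        record = index.get(doi.strip().lower())
        if record:
            print(f"WARNING: {doi} is flagged "
                  f"({record.get('RetractionNature', 'unknown')}): "
                  f"{record.get('Reason', 'no reason recorded')}")

if __name__ == "__main__":
    index = load_retraction_index("retraction_watch.csv")  # hypothetical path
    check_citations(["10.1000/example.doi"], index)        # placeholder DOI
```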
Debora Weber-Wulff, a computer scientist at the HTW Berlin University of Applied Sciences, stated she was not surprised by the findings. “People are relying too much on these text-extruding machines, and that will corrupt the scientific record,” she warned.
A Systemic Challenge for AI and Humans
However, Weber-Wulff also raised questions about the study's methodology, noting that it lacked a control group of non-retracted papers that would show whether the AI treats those any differently. She also highlighted a broader issue: retractions are often not clearly marked in the scientific literature, making them difficult for anyone to identify.
“They are only using the title and the abstract for the evaluations and are assuming that there is some sort of method to determine if a paper is retracted that ChatGPT can apply,” she explained. “The problem is that HUMANS have a very difficult time determining if a paper or a dissertation has been retracted because of the reluctance of journals and universities to properly mark them!”