
AI Summaries Failing: Study Shows Worsening Accuracy

2025-05-18 | Joe Wilkins | 4 minute read
AI
Chatbots
Research

The AI Promise Meets a Troubling Reality

Ask the CEO of any AI startup and you will probably get an earful about the technology's potential to transform work or revolutionize the way we access knowledge.

Indeed, there is no shortage of promises that AI is only getting smarter. We are told this progress will speed up the rate of scientific breakthroughs, streamline medical testing, and breed a new kind of scholarship.

Study Uncovers Alarming Inaccuracies in AI Summaries

But according to a new study published by the Royal Society, as many as 73 percent of seemingly reliable answers from AI chatbots could actually be inaccurate.

The collaborative research paper looked at nearly 5,000 large language model (LLM) summaries of scientific studies by ten widely used chatbots, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, and LLaMA 3.3 70B. It found that even when explicitly goaded into providing the right facts, AI answers lacked key details at a rate five times that of human-written scientific summaries.

"When summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study," the researchers wrote.

Newer AI Models Surprisingly Less Accurate

Alarmingly, the LLMs' rate of error was found to increase the newer the chatbot was, the exact opposite of what AI industry leaders have been promising us. This is in addition to a correlation between an LLM's tendency to overgeneralize and how widely it is used, posing "a significant risk of large-scale misinterpretations of research findings," according to the study's authors.

For example, use of the two ChatGPT models listed in the study doubled from 13 to 26 percent among US teens between 2023 and 2025. Though the older ChatGPT-4 Turbo was roughly 2.6 times more likely to omit key details compared to the original texts, the newer ChatGPT-4o models were nine times as likely. This tendency was also found in Meta's LLaMA 3.3 70B, which was 36.4 times more likely to overgeneralize compared to older versions.

The Challenge of Teaching AI to Summarize

The job of synthesizing huge swaths of data into just a few sentences is a tricky one. Though it comes pretty easily to fully grown humans, it is a really complicated process to program into a chatbot.

While the human brain can instinctively learn broad lessons from specific experiences, like touching a hot stove, complex nuances make it difficult for chatbots to know which facts to focus on. A human quickly understands that stoves can burn while refrigerators do not, but an LLM might reason that all kitchen appliances get hot unless told otherwise. Expand that metaphor out a bit to the scientific world, and it gets complicated fast.

The Risks of Flawed AI Summaries in Critical Applications

But summarizing is also time-consuming for humans; the researchers list clinical medical settings as one area where LLM summaries could have a huge impact on work. It cuts the other way, too, though: in clinical work, details are extremely important, and even the tiniest omission can compound into a life-changing disaster.

This makes it all the more troubling that LLMs are being shoehorned into every possible workspace, from high school homework to pharmacies to mechanical engineering, despite a growing body of work showing widespread accuracy problems inherent to AI.

Study Limitations and the Path Forward for AI

However, there were some important drawbacks to their findings, the scientists pointed out. For one, the prompt fed to an LLM can have a significant impact on the answer it spits out. Whether this affects LLM summaries of scientific papers is unknown, suggesting a future avenue for research.
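As a rough illustration of the prompt-sensitivity point, here is a minimal sketch of two summary-request templates, one neutral and one that explicitly demands fidelity to the source. These templates are hypothetical, not the study's actual prompts; the study notably found that even accuracy-demanding prompts did not eliminate overgeneralization.

```python
# Hypothetical prompt templates for requesting a study summary.
# Wording alone can change what an LLM includes or omits.

NEUTRAL = "Summarize the following study abstract:\n{abstract}"
STRICT = (
    "Summarize the following study abstract. Preserve all qualifiers "
    "(sample size, population, effect size) and do not generalize "
    "beyond the stated results:\n{abstract}"
)

def build_prompt(template: str, abstract: str) -> str:
    """Fill a summary-request template with the abstract text."""
    return template.format(abstract=abstract)

# Usage: the same abstract, framed two different ways.
abstract = "In a trial of 120 adults, drug X reduced symptoms by 12%."
neutral_prompt = build_prompt(NEUTRAL, abstract)
strict_prompt = build_prompt(STRICT, abstract)
```

Testing whether such instruction differences measurably reduce omissions in scientific summaries is exactly the kind of follow-up research the authors suggest.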

Regardless, the trendlines are clear. Unless AI developers can set their new LLMs on the right path, you will just have to keep relying on humble human bloggers to summarize scientific reports for you (wink).

For more on AI developments, you might be interested in reading about how senators are demanding safety records from AI chatbot apps as controversy grows.
