AI Medical Advice for Knee Injuries Put to the Test
Can You Trust AI with Your Health Questions?
As artificial intelligence becomes a part of our daily lives, more people are turning to large language models like ChatGPT and Gemini for quick answers to complex questions, including those about their health. But how reliable is this advice, especially for serious medical issues like sports injuries? A recent study delved into this question, specifically examining the quality of AI-generated recommendations for anterior cruciate ligament (ACL) and meniscal injuries, two of the most common knee problems for athletes and active individuals.
The Study: A Head-to-Head AI Comparison
Researchers set out to determine if the medical advice provided by ChatGPT and Gemini aligns with the gold standard of care. The study's goal was to compare the AI responses against the highly respected Evidence-Based Clinical Practice Guidelines (CPGs) published by the American Academy of Orthopaedic Surgeons (AAOS). This comparison provides a crucial benchmark for evaluating the safety and reliability of using these AI tools for initial medical information.
How The AIs Were Tested
To create a fair and standardized test, the researchers formulated questions based directly on the official AAOS guideline statements for ACL and meniscus injuries. These questions were then posed to both ChatGPT and Gemini. A team of two reviewers independently evaluated each AI-generated response, classifying it into one of three categories:
- Agree: The AI's advice was consistent with the AAOS guidelines.
- Neutral: The response did not directly contradict the guidelines but was not fully aligned.
- Disagree: The AI's recommendation conflicted with the established medical best practices.
To quantify how consistently the two reviewers graded, Cohen's kappa coefficient was used to assess interrater reliability, and statistical analyses were performed to compare the performance of the two AI models.
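Cohen's kappa measures how often two raters agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected from each rater's label frequencies. A minimal sketch of the calculation, using hypothetical reviewer ratings (not the study's actual data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two reviewers each classify 11 AI responses
# as Agree ("A"), Neutral ("N"), or Disagree ("D").
r1 = ["A", "A", "A", "N", "A", "D", "A", "A", "N", "A", "A"]
r2 = ["A", "A", "A", "N", "A", "A", "A", "A", "N", "A", "A"]
print(round(cohens_kappa(r1, r2), 2))  # → 0.76
```

A kappa near 1 indicates near-perfect agreement; values above roughly 0.6 are conventionally read as substantial.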
Key Findings: AI Accuracy and Sourcing
The results revealed a high degree of accuracy from both platforms, though with some important distinctions. Of the 11 CPG recommendations with strong or moderate evidence, ChatGPT agreed with the guidelines 82% of the time (9 out of 11), while Gemini showed a 73% agreement rate (8 out of 11). Notably, both AIs demonstrated perfect concordance with the guidelines for meniscal injuries. When comparing their responses on ACL injuries, or their overall performance on strong versus moderate recommendations, there were no statistically significant differences between the two models.
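The study does not say which statistical test was used; for small 2x2 counts like 9/11 versus 8/11 agreement, a Fisher's exact test is a common choice. A minimal sketch under that assumption, implemented directly from the hypergeometric distribution:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# ChatGPT: 9 agree / 2 not; Gemini: 8 agree / 3 not (the study's counts).
p = fisher_exact_2x2(9, 2, 8, 3)
print(round(p, 3))  # → 1.0
```

With such small samples, a one-response difference (9/11 vs. 8/11) yields a p-value far above 0.05, consistent with the study's finding of no significant difference between the models.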
A critical difference, however, emerged in their approach to sourcing information. ChatGPT provided no references or citations for its claims. In stark contrast, Gemini supplied 25 PubMed resources, of which 23 were found to be relevant and correctly supported the information given.
The Verdict: Is AI a Reliable Medical Source?
The study concludes that while AI language models are not a substitute for a consultation with a qualified doctor, they are becoming increasingly aligned with expert clinical guidelines for certain conditions. Patients using these tools for information on ACL and meniscal injuries are likely to receive advice that is generally appropriate. However, the lack of citations from some platforms remains a concern for transparency and verification. Gemini's ability to provide supporting evidence marks a significant step forward. For healthcare providers, this means being aware that patients may arrive with well-informed questions from these platforms, but it also highlights the need to guide them toward verified medical resources.