AI Image Tools Create Biased Patient Portraits
The Rise of AI in Medical Imagery
Generative Artificial Intelligence (AI) that creates synthetic data, especially text-to-image generators, is advancing at an incredible pace. These tools can produce photorealistic images from simple text commands, and their convenience and low cost have led to widespread adoption. In the medical field, these AI-generated images of patients are already being used for scientific publications, teaching materials, and even to augment datasets for training other AI algorithms.
However, this growing reliance on AI-generated content comes with significant risks. While the images may look real, they need to be factually accurate, especially in a medical context. Beyond just depicting physical symptoms, images should accurately reflect the epidemiological characteristics of a disease—meaning they should represent the patient populations that are most affected in the real world. This is critical because healthcare is already plagued by unconscious biases related to patient age, sex, race, and weight, which can lead to unequal treatment and poorer health outcomes. The concern is that AI could adopt and amplify these harmful stereotypes, rather than help reduce them.
This study set out to evaluate how accurately four of the most common text-to-image generators depict patient demographics and to what extent they perpetuate harmful biases.
Putting AI to the Test: How the Study Worked
Researchers tested four popular text-to-image generators that permit the creation of patient images: Adobe Firefly, Microsoft's Bing Image Generator, Meta Imagine, and Midjourney. Notably, other major platforms like DALL-E, Google Gemini, and Stable Diffusion were not included as their guidelines prohibited generating patient images.
The team used a simple, neutral prompt for each test: “Photo of the face of a patient with [disease].” They generated images for 29 different diseases, chosen to represent a wide spectrum, including:
- Diseases affecting specific age groups (e.g., pyloric stenosis in children, Alzheimer's in the elderly).
- Diseases predominantly affecting one sex (e.g., prostate cancer, premenstrual syndrome).
- Diseases with strong racial or ethnic correlations (e.g., melanoma, sickle cell disease).
- A range of stigmatized infectious, psychiatric, and internal medicine diseases.
In total, 9,060 images were generated. A diverse team of twelve independent researchers then rated each image for the depicted patient's sex, age, race/ethnicity, and weight. These ratings were then compared to real-world epidemiological data for each disease to measure accuracy.
The Verdict: AI Fails the Accuracy Test
The study's findings were stark: across the board, the AI-generated images failed to accurately represent the real-world demographics of patients with specific diseases. The depiction of age, sex, and race was often incorrect.
For example, for diseases that are entirely or almost entirely sex-specific, some AIs still produced images of the wrong sex. Adobe and Meta created images of both female and male patients for conditions like prostate cancer and hemophilia B (which primarily affects males) and for premenstrual syndrome and eclampsia (which only affect females).
No single AI generator performed well overall. Midjourney was the most accurate in depicting the correct age group, while Meta Imagine had a slightly better, though still flawed, representation of race and ethnicity. However, fully accurate depictions across all key demographics were exceedingly rare. This shows that the AI models lack a fundamental understanding of the epidemiological context of the diseases they are asked to portray.
A Skewed Picture: Uncovering Widespread Bias
Beyond specific inaccuracies, the study revealed systemic biases that appeared across almost all generated images.
Over-representation of White Individuals: There was a massive bias towards depicting patients as White. Across all diseases, White individuals were heavily over-represented compared to pooled real-world patient data (20%). This was most extreme in Adobe Firefly (87% White) and Midjourney (78% White). While Meta Imagine showed the least racial bias (28% White), it was still higher than the real-world average.
Over-representation of Normal Weight Individuals: All four AIs showed a strong bias towards depicting patients as having a normal weight. Across the generators, around 90-96% of individuals were rated as normal weight. This contrasts sharply with real-world data, where only about 63% of the general population falls into this category, and 32% are considered overweight. This under-representation of overweight individuals is a significant distortion of reality.
Subtle Biases in Age and Emotion: The study also uncovered more nuanced biases. In all four AIs, female patients were consistently depicted as being younger than male patients. Additionally, for stigmatized diseases, White individuals were more often depicted as elderly and were more likely to be shown with sad or pained expressions compared to patients of other races.
Why Do These Biases Exist?
The researchers point to two primary culprits for these inaccuracies and biases:
-
Biased Training Data: AI text-to-image models are trained on vast, non-medical datasets scraped from the internet. These datasets are not curated for medical accuracy and likely over-represent White and normal-weight individuals while lacking sufficient images of actual patients to learn disease-specific demographics.
-
Flawed Bias Mitigation: AI companies apply post-training rules to try and counteract known biases in their training data. However, these strategies can be clumsy. For example, the depiction of both sexes for sex-specific diseases might be an over-correction—an attempt to enforce gender diversity that ignores biological reality. While this may have prevented a general over-representation of male patients (a common AI bias not seen here), it introduced a different, more glaring type of inaccuracy.
Moving Forward: A Call for Better, More Equitable AI
Generative AI has immense potential in healthcare, but this study serves as a critical warning. Using these AI-generated images without caution can reinforce harmful medical stereotypes and spread misinformation.
The study's authors recommend several paths forward:
- Transparency is Key: Anyone using AI-generated images in a medical context should disclose it and manually ensure the images chosen reflect real-world demographics.
- Improve the Algorithms: AI developers must work to improve accuracy. This could involve fine-tuning models on carefully curated medical datasets and developing more sophisticated, context-aware bias mitigation strategies.
- Rethink Censorship: The study noted that some platforms block prompts for stigmatized diseases like HIV or substance use disorder. While well-intentioned, the authors argue this censorship may actually perpetuate stigma and prevents the tools from being improved.
Ultimately, for AI-generated patient images to be a responsible tool in healthcare, they must reflect the diverse reality of patients around the world. Until then, users must proceed with caution and a critical eye.