
Unlocking Pharma AI: The Power of Quality Data

2025-06-13 · By Alessio Zoccoli, Dr Carlos Velez, Dr Remco Jan Geukes Foppen, Vincenzo Gioia · 9 minute read

Tags: AI in Pharma, Data Quality, Drug Development

Industry experts Remco Jan Geukes Foppen, Vincenzo Gioia, Alessio Zoccoli, and Carlos Velez highlight the critical need for high-quality data to fully harness the capabilities of multimodal language models (MLMs) in drug development.


The Crucial Role of Data Quality in AI for Healthcare

Regulatory bodies such as the US Food and Drug Administration (FDA) and European Medicines Agency (EMA) are increasingly stressing the importance of data quality for AI applications within the healthcare sector. High-quality, well-structured data and metadata are defined by their accuracy, consistency, and completeness. These characteristics are fundamental for generating reliable insights and trustworthy AI outputs. Effectively addressing data quality issues, which include missing data, inconsistencies, and biases, is essential to realize the full potential of MLMs in drug discovery and to ensure patient safety.

Fueling the Rise of Multimodal Language Models

The success of MLMs can be attributed to two key factors: open access policies and the advent of next-generation sequencing (NGS). Access to open clinical data has provided the vast amounts of raw material necessary for training multimodal models. Concurrently, NGS has generated massive and complex datasets, which are ideally suited for the integration capabilities of generative AI (GenAI). Clinical genomics, by enabling patient stratification and validating therapeutic targets, has underscored the value of a multimodal approach. This approach, which integrates genomic, clinical, and pharmacological data, is pivotal in identifying personalized therapies. These elements have cultivated an ecosystem where MLMs can effectively tackle the challenges inherent in drug development.

How MLMs are Revolutionizing Drug Discovery

MLMs possess the capability to analyze genomic data, images, and scientific literature simultaneously. This allows them to uncover correlations and interactions far more rapidly than traditional manual research methods, which could take years. Furthermore, the automatic generation of molecular structures by MLMs accelerates the design of drug candidates by predicting their affinity to specific targets.

Slashing Costs and Boosting Efficiency with MLMs

Through enhanced efficiency, MLMs can significantly lower drug development costs. This is achieved by identifying and eliminating low-potential drug candidates early, minimizing wasted resources. In preclinical phases, MLMs combine laboratory data with computational models to provide more accurate predictions of safety and efficacy. Real-time analysis of clinical trials enables the quick identification of issues, preventing costly and unnecessary protocol modifications. Additionally, the automation of complex tasks, such as histological image analysis, reduces costs associated with manual labor.

Enhancing Clinical Trial Success with Advanced AI

MLMs also analyze extensive volumes of data to predict clinical trial outcomes with greater accuracy. This allows pharmaceutical companies to adjust their strategies based on the probability of success (PoS). Targets can be validated with increased confidence, thereby reducing risk. Advanced stratification techniques enable the design of drugs specific to patient subgroups, enhancing efficacy and lowering failure risks, particularly in fields like oncology with targeted therapies. Moreover, GenAI improves clinical trial design by more effectively identifying eligible patients and shortening recruitment times.

Navigating the Challenges of Multimodal Data Integration

Managing and integrating data from diverse and heterogeneous sources—such as genomic sequences, clinical data, biological images, and chemical structures—presents a significant challenge, particularly concerning data normalization. Differences in data formats, quality, and granularity make it difficult to build robust, standardized analytical pipelines. Furthermore, interpreting the patterns generated by multimodal models often requires multidisciplinary skills that are not yet widely available or utilized in the industry.
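
To make the normalization problem concrete, here is a minimal sketch that harmonizes the same lab measurement reported by two hypothetical sources using different units and date formats. The site names and field names are invented for illustration, not taken from any real system.

```python
# A minimal sketch of normalizing heterogeneous source data into one schema.
# Two hypothetical sites report the same hemoglobin measurement in different
# units (g/dL vs g/L) and date formats; all names here are invented.
from datetime import datetime

def normalize(record: dict, source: str) -> dict:
    """Map a site-specific record onto a shared schema (g/dL, ISO dates)."""
    if source == "site_a":    # reports g/dL and ISO 8601 dates
        value_g_dl = record["hgb"]
        visit = datetime.fromisoformat(record["date"])
    elif source == "site_b":  # reports g/L and US-style dates
        value_g_dl = record["hemoglobin_g_l"] / 10.0
        visit = datetime.strptime(record["visit"], "%m/%d/%Y")
    else:
        raise ValueError(f"unknown source: {source}")
    return {"hemoglobin_g_dl": round(value_g_dl, 2),
            "visit_date": visit.date().isoformat()}

print(normalize({"hgb": 13.5, "date": "2024-03-01"}, "site_a"))
print(normalize({"hemoglobin_g_l": 135, "visit": "03/01/2024"}, "site_b"))
# Both print: {'hemoglobin_g_dl': 13.5, 'visit_date': '2024-03-01'}
```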

Defining Excellence: What Makes Data High-Quality for AI

High-quality data is characterized by its accuracy, completeness, consistency, timeliness, and uniqueness. Data quality is the bedrock for generating reliable insights, making accurate predictions, and driving impactful decision-making; it effectively sets the upper limit on how accurate, relevant, and coherent an AI system's results can be. While this challenge is widely recognized, efforts to improve data quality are often still insufficient. For AI systems to produce trustworthy and explainable results, high-quality data and comprehensive metadata are indispensable. Data quality metrics must be evaluated continuously, and measures to mitigate bias prioritized; this ongoing assessment safeguards the integrity and reliability of both the data feeding AI models and the outputs they produce. As data repositories grow and datasets are increasingly augmented, maintaining data quality is vital if AI is to understand and generate content reliably across different modalities.

Data quality helps models produce better results and lets data scientists spend less time debugging informatics pipelines. Key metrics for assessing data quality include the following; a short sketch after this list shows how some of these checks can be automated:

  • Factual and conceptual accuracy: Ensuring that information, facts, and definitions are correct, which improves model trustworthiness and reliability.
  • Contextual accuracy: Accurate context ensures that information is relevant, self-contained, and applicable to the intended use cases, enabling models to generate more useful outputs.
  • Consistency: Data must be consistent both internally and with respect to its sources. Consistent data improves the ability of models to learn, extract patterns, and mitigate biases. Consistency across different data sources is also important; while heterogeneous sources are acceptable, it is crucial to provide the model with a way to understand information cohesively.
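
Here is a minimal sketch, using pandas, of how completeness, duplication, and key-integrity checks might be automated for a tabular dataset. The column names ("patient_id", "diagnosis_date", "biomarker") are hypothetical placeholders, not a reference implementation.

```python
# A minimal sketch of automated data-quality checks with pandas. The columns
# ("patient_id", "diagnosis_date", "biomarker") are hypothetical placeholders.
import pandas as pd

def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Compute simple completeness, duplication, and key-integrity metrics."""
    return {
        # Completeness: share of non-missing cells, per column.
        "completeness": (1 - df.isna().mean()).round(2).to_dict(),
        # Duplication: fraction of rows that are exact duplicates of another.
        "duplicate_row_rate": float(df.duplicated().mean()),
        # Key integrity: every record should carry a unique identifier.
        "duplicate_keys": int(df[key].duplicated().sum()),
    }

df = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "diagnosis_date": ["2024-01-03", None, "2024-02-10", "2024-02-28"],
    "biomarker": [0.8, 1.2, 1.2, None],
})
# Patient 2 appears twice with conflicting records: flagged as a key violation.
print(quality_report(df, key="patient_id"))
```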

Further exploration into GenAI in pharma and its current standing can provide additional context.

Overcoming Common Hurdles in Data Quality Management

Several common challenges impede data quality:

  • Missing data: This is a significant obstacle to accuracy as it omits valuable context and information. This issue also relates to data discontinuity, where there are pockets of very rich data alongside areas of sparse data.
  • Inconsistencies: Data should be formatted and labeled in a well-defined manner, as ambiguity can mislead both humans and AI models. This aspect is also crucial for generating interpretable results.
  • Duplicate data: Redundancy in data can introduce noise and reinforce biases, leading to poorly performing models.
  • Data traceability and immutability: Meticulously documenting metadata related to data sources, quality, and context is essential to provide AI applications with the necessary contextual information during data processing. One lightweight way to record such provenance metadata is sketched after this list.
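
The sketch below illustrates one way to support traceability, assuming each dataset version is registered before it enters an AI pipeline: a content hash paired with source and consent metadata. The schema and field names are illustrative, not a standard.

```python
# A minimal sketch of dataset traceability: each dataset version is
# registered with a content hash plus source and consent metadata before
# it enters an AI pipeline. The schema is illustrative, not a standard.
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: a provenance record should be immutable
class ProvenanceRecord:
    source: str         # where the data came from (e.g., a lab or registry)
    consent: str        # the consent terms under which the data may be used
    sha256: str         # content hash: detects silent modification later
    registered_at: str  # UTC timestamp of registration

def register(data: bytes, source: str, consent: str) -> ProvenanceRecord:
    """Create an immutable provenance record for one dataset version."""
    return ProvenanceRecord(
        source=source,
        consent=consent,
        sha256=hashlib.sha256(data).hexdigest(),
        registered_at=datetime.now(timezone.utc).isoformat(),
    )

record = register(b"raw NGS export bytes", source="sequencing-run-42",
                  consent="research-use-with-explicit-consent")
print(json.dumps(asdict(record), indent=2))
```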

Ensuring Trust and Compliance in AI-Driven Drug Development

Beyond technical hurdles, there are also significant legal, ethical, and financial challenges. In the life sciences domain, where decisions critically impact people’s lives, these risks cannot be overlooked. Misguided AI predictions raise serious concerns about accountability and liability; in this sense, data quality is not merely a technical challenge but a legal one as well. Notably, in 2025 the FDA issued draft guidance emphasizing the importance of data quality, defining “fit-for-use” in terms of metrics like relevance and reliability. This FDA guidance on AI in drug development stresses that data quality is crucial for reliable AI-driven results, and that variability in data quality, size, and representation can introduce bias and undermine confidence in AI model outputs.

The guidance urges the use of relevant and reliable datasets for training, evaluating, and maintaining AI models. A key component is a risk-based credibility assessment framework, which includes defining the question of interest and the ‘context of use’, and assessing model risk. This assessment involves evaluating the data used in model development and ensuring its adequacy. Furthermore, the guidance highlights the need for lifecycle maintenance, including continuous monitoring of model output and accuracy, to address potential data drift and ensure consistent performance over time; a brief sketch of one such drift check follows below.
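
As a sketch of what continuous monitoring can look like in practice, the snippet below computes the Population Stability Index (PSI), a common heuristic for detecting distribution drift between development-time and production data. The 0.2 alert threshold is a widely used rule of thumb, not a regulatory requirement.

```python
# A minimal sketch of drift monitoring using the Population Stability Index
# (PSI), comparing a model's development-time data distribution with what it
# sees in production. The 0.2 threshold is a rule of thumb, not a mandate.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a current sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6  # avoids log(0) and division by zero in empty bins
    ref_pct, cur_pct = ref_pct + eps, cur_pct + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # distribution seen during development
live = rng.normal(0.4, 1.0, 10_000)   # shifted distribution in deployment
score = psi(train, live)
print(f"PSI = {score:.3f} -> {'investigate drift' if score > 0.2 else 'stable'}")
```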

In the EU, this focus is reinforced by the EU AI Act, which regulates accountability for AI deployment. It mandates that organizations meticulously document the methodologies used to generate AI outputs, clearly articulate the intended purpose of the AI system, and obtain explicit consent for all data used. Furthermore, all data inputs must be subject to rigorous consent procedures and undergo thorough filtering processes.

High-quality multimodal data, such as well-labeled medical images, genomic sequences, and textual annotations, enables these models to identify patterns, generate insights, and assist in diagnostics with greater accuracy. Conversely, poor data quality – marked by errors, entrenched biases, or inconsistencies – can lead to erratic decision-making, flawed diagnostics, and the propagation of misinformation, with serious implications in sensitive areas like healthcare and drug discovery. Indeed, in drug development, the potential impact of multimodal analysis is immense.

Charting the Future: Towards Integrated AI in Pharma

While AI has long been utilized in drug discovery, its application has largely been confined to ‘point solutions,’ addressing isolated inefficiencies or obstacles. The integration of AI, particularly MLMs, with vast genomic and clinical datasets—enabled by advancements in next-generation sequencing and open data policies—is paving the way for integrated, end-to-end AI. Because the drug development pipeline inherently relies on integrating diverse data sources – from molecular structures to clinical outcomes – MLMs offer a powerful data-driven framework for uncovering meaningful insights. The ultimate aim is to increase the probability of success (PoS) and expedite the development of novel therapies. By revealing intricate biological relationships that unimodal approaches might miss, MLMs provide a richer understanding of complex biological systems. Considering whether AI in pharmaceutical development is hype or panacea helps frame this evolution.

If AI in drug development is to transition from fragmented point solutions to integrated, end-to-end approaches, several challenges must be addressed: managing data complexity, ensuring data quality, and resolving ethical questions around data privacy and consent. This shift fosters a systems-thinking approach, reducing redundancies and optimizing strategies based on probability of success. It enables management to make data-driven decisions about resource allocation, project prioritization, and go/no-go calls, while refined PoS assessment reduces risk and accelerates the development of more effective, targeted therapies.

The Tangible Benefits of MLMs in Pharmaceuticals

The adoption of MLMs has significantly impacted the pharmaceutical industry by enhancing operational efficiency and potentially improving treatment quality for patients. Benefits include quicker access to pharmacological treatments due to faster drug development and marketing processes, as well as reduced treatment costs stemming from overall development efficiencies. Moreover, GenAI’s ability to learn from new data fosters continuous innovation and opens up new therapeutic opportunities.
