
Vast Genetic Data Fuels Quest For Biology AI

2025-06-18 | James Dinneen | 5 minutes read
AI
Biotechnology
Genomics

Basecamp researchers gathering genetic data in Malta. Credit: Greg Funnell

The Quest for a Biological "ChatGPT"

A UK-based biotech company, Basecamp Research, has spent several years amassing genetic data from microorganisms living in some of the planet's most extreme environments. The company reports that this effort has uncovered more than a million previously unknown microbial species and nearly ten billion genes new to science. Basecamp Research believes this database of Earth's biodiversity will help it train what it calls a "ChatGPT of biology," an AI capable of answering profound questions about life itself. Whether that effort will succeed remains uncertain.

Is More Data Always Better Data?

Jörg Overmann of the Leibniz Institute DSMZ in Germany, home to one of the world's most diverse microbial culture collections, acknowledges the value of expanding the set of known genetic sequences. He cautions, however, that simply increasing the volume of data may not lead to practical breakthroughs in areas like drug discovery or chemistry. Overmann emphasizes the need for more contextual information about the organisms from which these sequences originate. "I’m not convinced that in the end the understanding of really novel functions will be accelerated by this brute-force increase in the sequence space," he states.

AI in Biology: Current Landscape and Challenges

In recent years, researchers have made significant strides in developing machine learning models designed to discern patterns and predict relationships within massive biological datasets. The most prominent example is AlphaFold, an AI developed by Google DeepMind. AlphaFold can predict the three-dimensional structure of a protein from its amino acid sequence, an achievement that earned its creators a share of the 2024 Nobel Prize in Chemistry.

According to Frances Ding from the University of California, Berkeley, while these "generative biology" models have increased in complexity, their performance hasn't necessarily seen a corresponding improvement. A potential reason for this plateau is the lack of biodiverse data. Ding points out, “Current models in biology are trained on datasets that disproportionately represent well-studied species (e.g., E. coli, mice, humans), and these models are worse at predicting properties about sequences from other parts of the tree of life.”

Bridging the Biodiversity Gap: Basecamp's Strategy

Basecamp Research aims to directly address this biodiversity deficit. As detailed in a company report, their expanding database now includes samples from over 120 locations across 26 countries. Jonathan Finn, the company’s Chief Science Officer, explains that their collection efforts have concentrated on extreme environments previously under-sampled. These range from the icy waters beneath Arctic sea ice to hot springs found in jungles. "Most of the samples that we’ve been going after are prokaryotic samples: bacteria, microbes and their viruses," Finn notes, adding, "I know we’ve got some fungi in there."

Genetic analysis of these unique samples has highlighted variations in genes that are almost universally shared across the entirety of life. Based on these findings, Basecamp estimates their data encompasses information from over 1 million species not present in existing public genomic datasets typically used for training AI biology models. Collectively, this new data is said to contain approximately 9.8 billion newly identified genes. This represents a tenfold expansion in the total number of known genes, with each new gene potentially encoding a useful protein, according to the researchers.

"By showing these models a large piece of nature, they should have a better understanding of how biology works," Finn elaborates. "We’re trying to build a ChatGPT of biology."

The Vastness of Uncharted Biological Territory

Estimates suggest that Earth could be home to as many as a trillion microbial species, the vast majority of which remain poorly characterized. Given this immense unknown, it is not entirely surprising that Basecamp has identified such a significant amount of new biological information. Leopold Parts from the Wellcome Sanger Institute in the UK comments, “It’s almost inevitable that if you explore more you get more different gene variants.”

Despite the expected nature of these discoveries, Basecamp is betting that this novel material holds significant value, and they are not alone in this optimism. "This is one of the most exciting things I’ve seen in a long time," says Nathan Frey, a machine learning researcher at US biotech firm Genentech. Frey notes that, generally, research on AI models for biology has prioritized algorithmic improvements or laboratory-based data generation over fieldwork and sample collection from diverse natural environments.

From Data to Discovery: The Unanswered Questions

Nevertheless, skepticism persists regarding whether this new database will indeed lead to the radically improved AI models Basecamp envisions. A key uncertainty is the extent to which this newly discovered diversity of proteins translates into valuable new functions—such as enzymes capable of degrading plastics or proteins suitable for repurposing in gene editing. "They have to show that this novelty is useful in some way," Parts emphasizes.

Furthermore, Overmann raises another critical point: if these new genes are truly substantially different from those currently known, it is unclear how existing computational tools could predict their functions, or how such data could effectively train new AI models. "You don’t have any clue what the majority of the genes do," he cautions. While Basecamp may have compiled a treasure trove of new biological information, without substantial traditional laboratory work to understand its contents, this data might remain enigmatic, even for the most advanced AI.
