AI Designs Proteins From Simple Text Prompts
Just as many now use ChatGPT for quick information like summarizing movie plots before a new release, scientists are increasingly harnessing the power of large language models (LLMs). Proteins, the essential molecules that drive cellular functions, possess their own unique language. This language is composed of 20 amino acids, each represented by a letter, which combine in sequences to form functional proteins, with their order dictating both structure and purpose.
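To make the analogy concrete, here is a minimal sketch in Python (not part of any protein design tool) showing how a protein can be represented as a string over the standard 20-letter amino acid alphabet; the sequence shown is purely illustrative.

```python
# The 20 standard amino acids, each written as a single letter
# (for example, M is methionine, K is lysine, L is leucine).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(sequence: str) -> bool:
    """Check that a sequence uses only the 20 standard amino acid letters."""
    return len(sequence) > 0 and all(residue in AMINO_ACIDS for residue in sequence)

# An illustrative fragment; the order of the letters determines how the
# chain folds and, ultimately, what the protein does.
fragment = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(is_valid_protein(fragment))  # True
```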
Inspired by the success of LLMs, researchers are developing protein language models capable of designing entirely new proteins. While some such tools are available, they often demand significant technical expertise. The exciting question arises: what if any researcher could prompt an AI to design a protein with a simple text instruction?
Introducing Pinal, a Text-to-Protein AI
This vision became more concrete last month when researchers unveiled a new AI, Pinal, that functions much like ChatGPT for protein design. By describing the desired type, structure, or function of a protein, users can receive potential candidates from the algorithm. In a notable demonstration, Pinal successfully generated multiple proteins capable of breaking down alcohol, a function validated within living cells. Interested users can even experiment with Pinal online.
Pinal joins a growing suite of algorithms that can translate plain English into novel proteins. These sophisticated protein designers comprehend both natural language and the principles of structural biology, serving as accessible guides for scientists exploring custom protein creation with minimal technical background.
The international team behind Pinal described it as an "ambitious and general approach" in their preprint on bioRxiv. The AI leverages the "descriptive power and flexibility of natural language" to democratize designer protein creation for biologists. When benchmarked against existing protein design algorithms, Pinal showed a better grasp of the primary goal for a target protein and was more likely to yield proteins that function in living cells.
Fajie Yuan, an AI scientist at Westlake University in China who led the team, told Nature, "We are the first to design a functional enzyme using only text. It’s just like science fiction."
Designing Proteins Beyond Nature's Blueprint
Proteins are fundamental to life, forming our physical structures, powering metabolic processes, and serving as targets for numerous medications. These complex molecules originate from sequences of amino acid "letters" that bond and fold into intricate three-dimensional shapes. Specific structural elements—like loops, weaves, or pockets—are crucial for their designated functions.
Scientists have long endeavored to engineer proteins with novel capabilities, such as developing enzymes that efficiently degrade plastics. Traditional methods typically involve customizing existing proteins for specific biological, chemical, or medical applications. However, these strategies, as the Pinal authors noted, "are limited by their reliance on existing protein templates and natural evolutionary constraints." In contrast, protein language models can conceptualize a vast array of new proteins, unconstrained by evolutionary history.
Instead of processing text, images, or video files like conventional LLMs, these specialized algorithms learn the language of proteins by training on extensive datasets of protein sequences and structures. For instance, EvolutionaryScale’s ESM3 model was trained on over 2.7 billion protein sequences, structures, and functions. Similar models have already been employed to design antibodies for combating viral infections and to create novel gene editing tools.
Despite their power, many existing algorithms are challenging for non-experts to use. Pinal, however, is designed with the average scientist in mind. The team described it as being like a DSLR camera on auto mode, as the model "bypasses manual structural specifications," simplifying the creation of desired proteins.
How Pinal Translates Your Words Into Proteins
To utilize Pinal, a user prompts the AI to construct a protein using several keywords, phrases, or even an entire paragraph. The AI first parses the specific requirements from the prompt on the front end. Then, on the back end, it converts these instructions into a functional protein design.
This process is somewhat analogous to asking ChatGPT to compose a restaurant review or an essay. Designing proteins, however, is inherently more complex. Although proteins are also composed of "letters," their final three-dimensional shape is paramount to their function. One method, known as end-to-end training, translates a prompt directly into protein sequences. But this approach forces the AI to search an enormous space of possible sequences, making it difficult to land on ones that are accurate and functional. Generating and interpreting protein structure, the final 3D shape, is comparatively easier for the algorithm.
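To get a sense of how enormous that sequence space is, a quick back-of-the-envelope calculation helps (basic combinatorics, not a figure from the paper): with 20 choices at every position, even a modest protein has astronomically many possible sequences.

```python
# With 20 possible amino acids at each position, a modest 100-residue protein
# has 20**100 possible sequences, far more than the estimated number of atoms
# in the observable universe (roughly 10**80).
protein_length = 100
possible_sequences = 20 ** protein_length
print(f"{possible_sequences:.2e}")  # ~1.27e+130
```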
Another challenge is sourcing adequate training data. The Pinal team addressed this by leveraging existing protein databases and employing LLMs to label them. This effort resulted in a massive library of 1.7 billion protein-text pairs, where protein structures are matched with textual descriptions of their functions.
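The preprint does not spell out the exact schema of those pairs, but conceptually each training record couples a protein with a plain-English annotation, roughly along these lines (a Python sketch with hypothetical field names and an illustrative sequence, not the dataset's actual format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProteinTextPair:
    """One illustrative training record pairing a protein with a text label.
    Field names here are hypothetical, not the dataset's actual schema."""
    sequence: str                    # amino acid sequence in one-letter code
    description: str                 # plain-English description of the protein's function
    structure: Optional[str] = None  # some encoding of the 3D structure, when available

example = ProteinTextPair(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    description="Hypothetical annotation describing an enzyme's activity.",
)
```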
The completed Pinal algorithm employs 16 billion parameters—the internal connections within an AI—to translate plain English into the language of biology.
Pinal operates in a two-step process. First, it translates user prompts into structural information. This step deconstructs a protein into manageable structural elements, or "tokens." In the second step, a protein-language model named SaProt takes into account user intent and protein functionality to design protein sequences that are most likely to fold into a working protein meeting the user’s specifications.
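Put together, the flow looks roughly like the sketch below. This is schematic only: the function names are invented placeholders, and the real components, Pinal's structure generator and SaProt, are far more involved than a few lines of Python.

```python
def text_to_structure_tokens(prompt: str) -> list[str]:
    """Step 1 (schematic): turn the user's prompt into structural 'tokens',
    coarse building blocks describing the protein's overall shape."""
    raise NotImplementedError("placeholder for Pinal's structure model")

def structure_tokens_to_sequence(tokens: list[str], prompt: str) -> str:
    """Step 2 (schematic): a sequence model (SaProt, in Pinal's case) proposes
    an amino acid sequence likely to fold into that structure while still
    matching the user's stated intent."""
    raise NotImplementedError("placeholder for the sequence model")

def design_protein(prompt: str) -> str:
    """Text prompt in, candidate amino acid sequence out."""
    tokens = text_to_structure_tokens(prompt)
    return structure_tokens_to_sequence(tokens, prompt)

# The prompt used in the team's demonstration:
# design_protein("Please design a protein that is an alcohol dehydrogenase.")
```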
Pinal In Action And The Road Ahead
When compared to cutting-edge protein design algorithms that also use text as input, including the prominent ESM3 model, Pinal demonstrated superior performance in both accuracy and novelty—that is, generating proteins not previously known to exist. When using just a few keywords to design a protein, the researchers found that while "half of the proteins from Pinal exhibit predictable functions, only around 10 percent of the proteins generated by ESM3 do so."
In a practical test, the team provided Pinal with a concise prompt: "Please design a protein that is an alcohol dehydrogenase." These enzymes are responsible for breaking down alcohol. From over 1,600 candidate proteins generated by Pinal, the team selected the eight most promising ones for testing in living cells. Two of these successfully metabolized alcohol at body temperature, while others showed higher activity at a warmer 158 degrees Fahrenheit (70 degrees Celsius).
More detailed prompts, which included information about the protein’s function and examples of similar molecules, led to the generation of candidates for antibiotics and proteins designed to aid cells in recovering from infection.
Pinal is not the sole AI in the text-to-protein space. The startup 310 AI has developed an AI named MP4 to generate proteins from text, with the company suggesting its results could lead to treatments for heart disease.
However, this approach is not without its flaws. Similar to how LLMs can "hallucinate" incorrect information, protein language models can also generate unreliable or repetitive sequences, which diminishes the likelihood of a functional end product. The specific wording of prompts also significantly influences the final protein structure. Nevertheless, the current state of this AI technology is akin to the initial versions of DALL-E: it's a tool to experiment with, and its outputs should be validated using other established methods.