<img src="https://news.mit.edu/sites/default/files/styles/news_article__cover_image__original/public/images/202501/MIT-Antibodies-ai-01_0.jpg?itok=Y9SPR7l6" />
By adapting artificial intelligence models known as large language models, researchers have made great strides in predicting the structure of a protein from its sequence. However, this approach has been less successful with antibodies, in part because of the hypervariability observed in this type of protein.
To overcome that limitation, MIT researchers have developed a computational technique that allows large language models to predict antibody structures more accurately. Their work could allow researchers to screen millions of potential antibodies to identify those that could be used to treat SARS-CoV-2 and other infectious diseases.
“Our method allows us to scale, where others don't, to the point that we can find some needles in the haystack,” says Bonnie Berger, the Simons Professor of Mathematics, head of the Computation and Biology group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the senior authors of the new study. “If we could help keep pharmaceutical companies from entering clinical trials with the wrong product, we would really save a lot of money.”
The technique, which focuses on modeling the hypervariable regions of antibodies, also has the potential to analyze entire antibody repertoires from individual people. This could be useful for studying the immune response of people who respond exceptionally well to diseases such as HIV, helping to reveal why their antibodies fend off the virus so effectively.
Bryan Bryson, an associate professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, is also a senior author of the paper, which appears this week in the Proceedings of the National Academy of Sciences. Rohit Singh, a former CSAIL research scientist who is now an assistant professor of biostatistics, bioinformatics, and cell biology at Duke University, and Chiho Im '22 are the lead authors of the paper. Researchers from Sanofi and ETH Zurich also contributed to the research.
Hypervariability modeling
Proteins are made up of long chains of amino acids, which can fold into an enormous number of possible structures. In recent years, predicting these structures has become much easier thanks to the use of artificial intelligence programs such as AlphaFold. Many of these programs, such as ESMFold and OmegaFold, are based on large language models, which were originally developed to analyze large amounts of text, allowing them to learn to predict the next word in a sequence. This same approach can work for protein sequences, by learning which protein structures are most likely to form from different amino acid patterns.
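As a toy illustration of that next-token idea (not the authors' model), a bigram model over amino-acid letters can stand in for the "predict the next word" objective that protein language models apply to residues. Everything below, including the tiny training set, is invented for illustration:

```python
from collections import defaultdict

# The 20 standard amino acids, written as single letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_bigram(sequences):
    """Count residue-to-residue transitions across training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, residue):
    """Predict the most frequent successor of a residue (None if unseen)."""
    followers = counts.get(residue)
    if not followers:
        return None
    return max(followers, key=followers.get)

# Tiny made-up training set; real models train on many millions of sequences.
training = ["ACDEFG", "ACDKLM", "ACDEFH"]
model = train_bigram(training)
```

A real protein language model replaces these counts with a deep network, but the training signal is the same: patterns in the sequence predict what comes next.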
However, this technique does not always work with antibodies, especially in a segment of the antibody known as the hypervariable region. Antibodies typically have a Y-shaped structure, and these hypervariable regions are located at the tips of the Y, where they detect and bind to foreign proteins, also known as antigens. The bottom of the Y provides structural support and helps antibodies interact with immune cells.
Hypervariable regions vary in length but typically contain fewer than 40 amino acids. It has been estimated that the human immune system can produce up to 1 trillion different antibodies by changing the sequence of these amino acids, helping to ensure that the body can respond to a huge variety of potential antigens. Those sequences are not evolutionarily constrained in the same way as other protein sequences, so it is difficult for large language models to learn to predict their structures accurately.
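The scale of that diversity follows from simple combinatorics: with 20 standard amino acids, a region of length n admits 20^n possible sequences, so even a short stretch outstrips any experimental enumeration. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope combinatorics (illustrative only): with 20 standard
# amino acids, a hypervariable region of length n admits 20**n sequences.
NUM_AMINO_ACIDS = 20

def sequence_space(length):
    """Number of distinct amino-acid sequences of a given length."""
    return NUM_AMINO_ACIDS ** length

# Even a 10-residue stretch admits over 10 trillion possibilities, far more
# than could ever be enumerated or solved structurally in the lab.
print(sequence_space(10))  # 10240000000000
```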
“Part of the reason language models can predict protein structure well is that evolution constrains these sequences, so the model can learn what those constraints imply,” Singh says. “It's similar to learning grammar rules by looking at the context of words in a sentence, which lets you figure out what they mean.”
To model those hypervariable regions, the researchers created two modules that are based on existing protein language models. One of these modules was trained on hypervariable sequences from approximately 3,000 antibody structures found in the Protein Data Bank (PDB), allowing it to learn which sequences tend to generate similar structures. The other module was trained with data that correlates about 3,700 antibody sequences with the strength with which they bind to three different antigens.
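Purely as a schematic of this two-module idea, with every name and computation below invented for illustration (AbMap's real modules are fine-tuned neural networks, and its actual interfaces are described in the paper):

```python
# Hypothetical schematic of the two-module design described above.
# Neither function resembles the real learned models; they only show the
# data flow: hypervariable sequence -> embedding -> predicted binding.

def structure_module(hv_sequence):
    """Stand-in for the module trained on ~3,000 PDB antibody structures:
    maps a hypervariable sequence to a structure-aware embedding.
    Toy featurization: one number in [0, 1] per residue."""
    return [(ord(c) - ord("A")) / 25.0 for c in hv_sequence]

def binding_module(embedding):
    """Stand-in for the module trained on ~3,700 sequence/affinity pairs:
    maps an embedding to a predicted binding strength."""
    return sum(embedding) / max(len(embedding), 1)

def predict_binding(hv_sequence):
    """End-to-end prediction: sequence -> embedding -> binding strength."""
    return binding_module(structure_module(hv_sequence))

score = predict_binding("GFTFSSYA")  # made-up hypervariable fragment
```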
The resulting computational model, known as AbMap, can predict antibody structures and binding strength based on their amino acid sequences. To demonstrate the usefulness of this model, the researchers used it to predict antibody structures that would strongly neutralize the spike protein of the SARS-CoV-2 virus.
The researchers started with a set of antibodies that had been predicted to bind to this target and then generated millions of variants by changing the hypervariable regions. Their model was able to identify antibody structures that would be more successful, with much more precision than traditional protein structure models based on large language models.
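The variant-generation step can be sketched generically. The parent fragment, region boundaries, and mutation count below are all made up, and the study's exact mutagenesis protocol may differ:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate_hypervariable(seq, region, n_variants, n_mutations=1, seed=0):
    """Generate variants of an antibody sequence by substituting random
    residues inside the hypervariable region, given as a (start, end) slice.
    A generic mutagenesis sketch, not the study's exact protocol."""
    rng = random.Random(seed)
    start, end = region
    variants = set()
    while len(variants) < n_variants:
        chars = list(seq)
        for pos in rng.sample(range(start, end), n_mutations):
            chars[pos] = rng.choice(AMINO_ACIDS)
        variants.add("".join(chars))
    return sorted(variants)

parent = "EVQLVESGGGLVQPGGSLRL"   # made-up antibody fragment
library = mutate_hypervariable(parent, region=(8, 16), n_variants=5)
```

At scale, the same loop generates millions of candidates, each of which the model can then score without any wet-lab work.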
The researchers then took the additional step of grouping the antibodies into groups that had similar structures. They chose antibodies from each of these groups to test experimentally, working with researchers at Sanofi. Those experiments found that 82 percent of these antibodies had better binding strength than the original antibodies that were included in the model.
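Grouping by similarity can be sketched with a generic greedy clustering. Here Hamming distance on sequences stands in for whatever structural metric (e.g., a distance between predicted structures) the authors actually used:

```python
def hamming(a, b):
    """Stand-in distance; a structural metric would replace this."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(items, threshold):
    """Assign each item to the first cluster whose representative is within
    `threshold`; otherwise start a new cluster. A generic sketch, not the
    paper's exact procedure."""
    reps, clusters = [], []
    for item in items:
        for i, rep in enumerate(reps):
            if hamming(item, rep) <= threshold:
                clusters[i].append(item)
                break
        else:
            reps.append(item)
            clusters.append([item])
    return clusters

# Toy antibodies: one representative per resulting group would then be
# chosen for experimental testing.
candidates = ["AAAA", "AAAT", "TTTT", "TTTA", "GGGG"]
groups = greedy_cluster(candidates, threshold=1)
```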
Identifying a variety of good candidates early in the development process could help pharmaceutical companies avoid spending a lot of money testing candidates that end up failing later, researchers say.
“They don't want to put all their eggs in one basket,” Singh says. “They don't want to say, ‘I'm going to take this antibody and put it through preclinical trials, and then it turns out to be toxic.' They would rather have a set of good possibilities and pursue them all, so they have some options if one goes wrong.”
Comparing antibodies
Using this technique, researchers could also try to answer some long-standing questions about why different people respond differently to infection. For example, why do some people develop much more severe forms of Covid and why do some people exposed to HIV never become infected?
Scientists have been trying to answer those questions by sequencing the RNA of single immune cells from individuals and comparing them, a process known as antibody repertoire analysis. Previous work has shown that the antibody repertoires of two different people can overlap by as little as 10 percent. However, sequencing does not provide as complete a picture of antibody performance as structural information, because two antibodies that have different sequences may have similar structures and functions.
The new model can help solve that problem by rapidly generating structures for all the antibodies found in an individual. In this study, the researchers showed that when structure is taken into account, there is much more overlap between individuals than the 10 percent observed in sequence comparisons. They now plan to further investigate how these structures may contribute to the body's overall immune response against a particular pathogen.
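The contrast between sequence-level and structure-level overlap can be illustrated with a toy Jaccard comparison. The sequences and fold labels below are invented; in the study, the structural assignments would come from a model like AbMap:

```python
def jaccard(a, b):
    """Jaccard overlap between two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical data: two people's antibodies share few exact sequences...
person1_seqs = ["QQSYSTP", "QQYNSYP", "HQYGSSP", "QQRSNWP"]
person2_seqs = ["QQSYSTP", "MQALQTP", "QQYGSSP", "QHRSNWP"]

# ...but once mapped to structural groups (labels invented here), different
# sequences collapse onto the same folds and the overlap is far larger.
structure_of = {
    "QQSYSTP": "fold-A", "QQYNSYP": "fold-A", "HQYGSSP": "fold-B",
    "QQRSNWP": "fold-C", "MQALQTP": "fold-A", "QQYGSSP": "fold-B",
    "QHRSNWP": "fold-C",
}

seq_overlap = jaccard(person1_seqs, person2_seqs)
struct_overlap = jaccard([structure_of[s] for s in person1_seqs],
                         [structure_of[s] for s in person2_seqs])
```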
“This is where a language model fits in very well because it has the scalability of sequence-based analysis, but approaches the precision of structure-based analysis,” Singh says.
The research was funded by Sanofi and the Abdul Latif Jameel Clinic for Machine Learning in Health.