Biology is a wonderful but delicate tapestry. At the heart is DNA, the protein-coding master weaver responsible for orchestrating the many life-sustaining biological functions within the human body. However, our body is like a tuned instrument, susceptible to losing its harmony. After all, we face an unforgiving and ever-changing natural world: pathogens, viruses, diseases and cancer.
Imagine if we could speed up the process of creating vaccines or drugs for newly emerged pathogens. What if we had gene editing technology capable of automatically producing proteins to rectify DNA errors that cause cancer? The quest to identify proteins that can bind strongly to targets or accelerate chemical reactions is vital for drug development, diagnostics, and numerous industrial applications, but is often a time-consuming and expensive undertaking.
To improve our capabilities in protein engineering, MIT CSAIL researchers created “FrameDiff,” a computational tool to create new protein structures beyond what nature has produced. The machine learning approach generates “frames” that align with the inherent properties of protein structures, allowing it to build new proteins independently of pre-existing designs, facilitating unprecedented protein structures.
“In nature, protein engineering is a slow-burning process that takes millions of years. Our technique aims to provide an answer to man-made problems that evolve much faster than the pace of nature,” says Jason Yim, an MIT CSAIL PhD student and a lead author of a new paper on the work. “The goal, regarding this new ability to generate synthetic protein structures, opens up a myriad of improved capabilities, such as better binders. This means designing proteins that can bind to other molecules more efficiently and selectively, with widespread implications related to selective drug delivery and biotechnology, where it could result in the development of better biosensors. It could also have implications for the field of biomedicine and beyond, offering possibilities such as the development of more efficient photosynthetic proteins, the creation of more effective antibodies and engineered nanoparticles for gene therapy.”
FrameDiff
Proteins have complex structures, made up of many atoms connected by chemical bonds. The most important atoms that determine the three-dimensional shape of the protein are called the “backbone,” akin to the protein’s spine. Each triplet of atoms along the backbone shares the same bonding pattern and atom types. The researchers noted that this pattern can be leveraged to build machine learning algorithms using ideas from differential geometry and probability. This is where frames come in: Mathematically, these triplets can be modeled as rigid bodies called “frames” (common in physics) that have a 3D position and rotation.
These frames provide each triplet with enough information to know its spatial environment. The task is then for a machine learning algorithm to learn how to move each frame to build a protein backbone. By learning to build existing proteins, the algorithm is expected to generalize and be able to create new proteins never before seen in nature.
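To make the idea of a frame concrete, here is a minimal Python sketch of how one backbone triplet could be turned into a position plus a rotation. It assumes the triplet consists of the nitrogen, alpha-carbon, and carbon atoms, places the frame's origin on the alpha-carbon, and uses Gram-Schmidt orthogonalization to build the rotation. These choices, the function names, and the coordinates are illustrative assumptions, not details taken from the FrameDiff paper.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def frame_from_triplet(n_atom, ca_atom, c_atom):
    """Return (axes, origin) for one backbone triplet.

    axes is a list of three orthonormal vectors (the rotation);
    origin is the alpha-carbon position (the translation).
    Assumed convention, for illustration only.
    """
    e1 = normalize(sub(c_atom, ca_atom))          # first axis: CA -> C
    v2 = sub(n_atom, ca_atom)
    proj = dot(v2, e1)
    # Gram-Schmidt: remove the part of CA -> N that lies along e1.
    e2 = normalize([v2[i] - proj * e1[i] for i in range(3)])
    e3 = cross(e1, e2)                            # completes a right-handed basis
    return [e1, e2, e3], list(ca_atom)

def to_local(frame, point):
    """Express a point in the frame's own coordinate system."""
    axes, origin = frame
    d = sub(point, origin)
    return [dot(d, axis) for axis in axes]

# Hypothetical coordinates (in angstroms) for a single residue.
frame = frame_from_triplet([-0.5, 1.4, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0])
local_n = to_local(frame, [-0.5, 1.4, 0.0])
```

Because every triplet has the same internal geometry, the atoms look the same in each frame's local coordinates; what differs from residue to residue is only where the frame sits and how it is oriented, which is exactly what the learning algorithm has to move.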
Training a model to build proteins using “diffusion” involves injecting noise that randomly moves all frames and blurs the appearance of the original protein. The algorithm’s job is to move and rotate each frame until it looks like the original protein. Although simple, the development of diffusion in frames requires stochastic calculus techniques on Riemannian manifolds. From a theoretical point of view, researchers developed “SE(3) diffusion” to learn probability distributions that non-trivially connect the translation and rotation components of each frame.
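The noising step described above can be illustrated with a short Python sketch. It is a simplified stand-in, not the SE(3) diffusion the researchers developed: here the translation simply receives Gaussian noise, and the rotation is composed with a random rotation whose axis is uniform and whose angle is Gaussian. The function names and the noise scales `sigma_rot` and `sigma_trans` are hypothetical.

```python
import math
import random

def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula: rotation matrix for a unit axis and an angle in radians."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t,     x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t,     y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def random_unit_vector(rng):
    # Normalizing an isotropic Gaussian sample gives a uniformly random direction.
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def noise_frame(rotation, translation, sigma_rot, sigma_trans, rng):
    """Randomly perturb one frame (illustrative noise model, assumed here)."""
    axis = random_unit_vector(rng)
    angle = rng.gauss(0.0, sigma_rot)
    # Apply a random rotation on top of the frame's current rotation.
    noisy_rotation = matmul(rotation_from_axis_angle(axis, angle), rotation)
    # Jitter the frame's position with independent Gaussian noise.
    noisy_translation = [t + rng.gauss(0.0, sigma_trans) for t in translation]
    return noisy_rotation, noisy_translation

rng = random.Random(0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_rot, noisy_pos = noise_frame(identity, [0.0, 0.0, 0.0], 0.5, 1.0, rng)
```

Applying `noise_frame` to every frame, with noise scales that grow over many steps, gradually scrambles a protein. A generative model would then be trained to undo such perturbations, which is the part that needs the more careful mathematics mentioned above.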
The subtle art of diffusion
In 2021, DeepMind introduced AlphaFold2, a deep learning algorithm to predict 3D protein structures from their sequences. When creating synthetic proteins, there are two essential steps: generation and prediction. Generation means the creation of new protein structures and sequences, while prediction means finding out what the three-dimensional structure of a sequence is. It is no coincidence that AlphaFold2 also used frames to model proteins. SE(3) diffusion and FrameDiff were inspired to take the idea of frames further by incorporating frames into diffusion models, a generative AI technique that has become immensely popular in image generation, as in Midjourney, for example.
The shared frames and principles between protein structure generation and prediction meant that the best models from both ends were compatible. In collaboration with the Institute for Protein Design at the University of Washington, SE(3) diffusion is already being used to create and experimentally validate new proteins. Specifically, they combined SE(3) diffusion with RoseTTAFold2, a protein structure prediction tool very similar to AlphaFold2, leading to “RFdiffusion.” This new tool brought protein designers closer to solving crucial problems in biotechnology, including the development of highly specific protein binders for accelerated vaccine design, symmetric protein engineering for gene delivery, and robust motif scaffolding for precise enzyme design.
Future work on FrameDiff involves improving its generality to problems that combine multiple requirements for biologics such as drugs. Another extension is to generalize the models to all biological modalities, including DNA and small molecules. The team posits that by expanding FrameDiff’s training on more substantial data and improving its optimization process, it could generate foundational structures with design capabilities on par with RFdiffusion, while preserving FrameDiff’s inherent simplicity.
“Discarding a pre-trained structure prediction model [in FrameDiff] opens up possibilities for rapidly generating structures that extend to large lengths,” says Harvard University computational biologist Sergey Ovchinnikov. The researchers’ innovative approach offers a promising step toward overcoming the limitations of current structure prediction models. Although this is still preliminary work, it is an encouraging step in the right direction. “As such, the vision of protein design, which plays a critical role in addressing humanity’s most pressing challenges, appears increasingly achievable, thanks to the pioneering work of this MIT research team.”
Yim wrote the paper alongside Columbia University postdoc Brian Trippe; Valentin De Bortoli, a researcher at the Center for Data Science of the French National Center for Scientific Research in Paris; University of Cambridge postdoc Emile Mathieu; and Arnaud Doucet, an Oxford University statistics professor and senior research scientist at DeepMind. MIT professors Regina Barzilay and Tommi Jaakkola advised the research.
The team’s work was supported, in part, by MIT’s Abdul Latif Jameel Clinic for Machine Learning in Health, EPSRC grants and a Prosperity Partnership between Microsoft Research and the University of Cambridge, the National Science Foundation Graduate Research Fellowship Program, an NSF Expeditions grant, the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the DTRA Discovery of Medical Countermeasures against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, and the Sanofi Computational Antibody Design grant. This research will be presented at the International Conference on Machine Learning in July.