Protein engineering is essential for designing proteins with specific functions, but the fitness landscape of protein mutations is complex, which makes optimal sequences difficult to find. Zero-shot approaches, which predict mutational effects without relying on homologs or multiple sequence alignments (MSAs), reduce some dependencies but struggle to predict diverse protein properties. Learning-based models trained on deep mutational scanning (DMS) or multiplexed assays of variant effect (MAVE) data have been used to predict fitness landscapes, either alone or in combination with MSAs or language models. Even so, these data-driven models often struggle when experimental data are sparse.
Microsoft Research AI for Science researchers introduced µFormer, a deep learning framework that integrates a pre-trained protein language model with specialized scoring modules to predict protein mutational effects. µFormer predicts higher-order mutants, models epistatic interactions, and handles insertions. Using reinforcement learning, µFormer efficiently explores vast mutant spaces to design improved protein variants. The model predicted mutants with a 2000-fold increase in bacterial growth rate, driven by enhanced enzymatic activity. µFormer’s success extends to challenging scenarios, including multipoint mutations, and its predictions were validated through laboratory experiments, highlighting its potential for optimizing protein design.
The µFormer model is a deep learning approach designed to predict the fitness of mutated protein sequences. It works in two stages: first, a masked protein language model (PLM) is pre-trained on a large dataset of unlabeled protein sequences; second, fitness scores are predicted by three scoring modules built on top of the pre-trained model. These modules (residual level, motif level, and sequence level) capture different aspects of the protein sequence, and their outputs are combined to generate the final fitness score. The model is trained on known fitness data to minimize the error between predicted and actual scores.
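The second stage can be illustrated with a minimal sketch. It assumes the three module outputs are combined as a weighted sum and that training minimizes a mean squared error; the article states neither detail, so both are assumptions. The three scoring functions below are hypothetical toy stand-ins, not the learned modules that µFormer builds on the pre-trained PLM.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[str], float]

# Hypothetical toy stand-ins for the three learned scoring modules.
def residue_score(seq: str) -> float:
    # Toy per-residue signal: fraction of hydrophobic residues.
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)

def motif_score(seq: str) -> float:
    # Toy local-pattern signal: occurrences of an arbitrary motif "GS".
    return float(seq.count("GS"))

def sequence_score(seq: str) -> float:
    # Toy global signal: number of distinct residues, scaled to [0, 1].
    return len(set(seq)) / 20.0

def combined_fitness(seq: str,
                     modules: Sequence[ScoreFn],
                     weights: Sequence[float]) -> float:
    # Assumption: module outputs are combined as a weighted sum.
    return sum(w * m(seq) for m, w in zip(modules, weights))

def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    # Assumption: the training objective is a mean squared error.
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

if __name__ == "__main__":
    modules = [residue_score, motif_score, sequence_score]
    weights = [1.0, 1.0, 1.0]
    variants = ["MKGSAIL", "MKGSAIV", "MKDDAIL"]
    measured = [1.2, 1.1, 0.4]  # made-up fitness labels, for illustration only
    predicted = [combined_fitness(v, modules, weights) for v in variants]
    print("loss:", mse(predicted, measured))
```

In a real model the weights and the modules themselves would be learned jointly by gradient descent on this kind of loss; the sketch only shows how the pieces fit together.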
Furthermore, µFormer is combined with a reinforcement learning (RL) strategy to explore the vast space of possible mutations efficiently. The protein engineering problem in this framework is modeled as a Markov decision process (MDP), with proximal policy optimization (PPO) used to optimize mutation policies. Dirichlet noise is added during the mutation search process to ensure effective exploration and avoid local optima. Baseline comparisons were performed using models such as ESM-1v and ECNet, and evaluated on datasets such as FLIP and ProteinGym.
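The search loop can be sketched in a few lines. The example below is illustrative only: it treats the current sequence as the MDP state and a single substitution as the action, and mixes Dirichlet noise into the policy's action probabilities before sampling. It does not implement PPO; the policy is a fixed uniform placeholder, and the mixing weight `eps`, the concentration `alpha`, and all function names are assumptions rather than values from the paper.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dirichlet(alpha, n, rng):
    # Sample a symmetric Dirichlet by normalizing independent Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow
        return [1.0 / n] * n
    return [d / total for d in draws]

def add_dirichlet_noise(probs, rng, eps=0.25, alpha=0.3):
    # Blend policy probabilities with noise to encourage exploration.
    noise = dirichlet(alpha, len(probs), rng)
    return [(1 - eps) * p + eps * n for p, n in zip(probs, noise)]

def apply_mutation(seq, action):
    # An action is a (position, amino acid) substitution.
    pos, aa = action
    return seq[:pos] + aa + seq[pos + 1:]

def uniform_policy(seq, actions):
    # Placeholder policy; a PPO-trained network would go here.
    return [1.0 / len(actions)] * len(actions)

def rollout(start, policy, reward_fn, n_steps, rng):
    # One episode: apply n_steps sampled mutations, then score the result.
    seq = start
    actions = [(i, aa) for i in range(len(start)) for aa in AMINO_ACIDS]
    for _ in range(n_steps):
        probs = add_dirichlet_noise(policy(seq, actions), rng)
        action = rng.choices(actions, weights=probs, k=1)[0]
        seq = apply_mutation(seq, action)
    return seq, reward_fn(seq)

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy reward standing in for a fitness predictor: count of alanines.
    final_seq, reward = rollout("MKTWLQ", uniform_policy,
                                lambda s: s.count("A"), 3, rng)
    print(final_seq, reward)
```

In the framework described above, the reward would come from the fitness predictor and the policy would be updated with PPO from many such rollouts.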
µFormer, a hybrid model combining a self-supervised protein language model with supervised scoring modules, efficiently predicts protein fitness scores. Pre-trained on 30 million protein sequences from UniRef50 and fine-tuned with three scoring modules, µFormer outperformed ten methods on the ProteinGym benchmark, achieving a mean Spearman correlation of 0.703. It predicts higher-order mutations and epistasis, with strong correlations for multi-site mutations. In protein optimization, µFormer, coupled with reinforcement learning, designed TEM-1 variants that significantly improved growth, with a double mutant outperforming a known quadruple mutant.
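For readers unfamiliar with the metric, Spearman correlation measures how well the predicted ranking of variants matches the measured ranking, regardless of scale. The following is a minimal pure-Python sketch of the standard definition (Pearson correlation of average ranks); it is not the benchmark's own evaluation code, and the example scores are made up.

```python
from math import sqrt

def average_ranks(values):
    # Assign 1-based ranks; tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation computed on the ranks of x and y.
    # Undefined (division by zero) if either input is constant.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)

if __name__ == "__main__":
    predicted = [0.1, 0.4, 0.35, 0.8]  # made-up model scores
    measured = [1.0, 2.5, 2.0, 3.0]    # made-up assay values
    print(spearman(predicted, measured))
```

A value near 1 means the model orders variants almost exactly as the experiment does, which is what matters when the goal is to pick the best candidates to test.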
In conclusion, previous studies have demonstrated the potential of sequence-based protein language models in tasks such as enzyme function prediction and antibody design. µFormer, a sequence-based model with three scoring modules, was developed to generalize across diverse protein properties. It achieved state-of-the-art performance in fitness prediction tasks, including those involving complex mutations and epistasis. µFormer also demonstrated its ability to optimize enzyme activity, particularly in predicting TEM-1 variants with improved activity against cefotaxime. Despite its success, further gains in accuracy may come from incorporating structural data, developing phenotype-aware models, and creating models capable of handling longer protein sequences.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.