Large language models have found their way into almost every domain. From natural language processing and understanding to computer vision, these models provide powerful solutions across artificial intelligence. Advances in AI and machine learning have shown that language models can also be used to predict protein structure and function. Protein language models (PLMs), pretrained on large-scale protein sequence datasets, have demonstrated the ability to improve the prediction of protein function and structure.
Proteins, which are essential for biological growth, cell repair, and regeneration, also have important applications in drug discovery and healthcare. However, existing PLMs learn protein representations from sequences alone, capturing coevolutionary information but not protein functions or other crucial features such as subcellular locations. In other words, these models do not explicitly acquire knowledge of protein functionality.
Textual property descriptions are available for many proteins and provide information about their important functions and properties. Building on this, a team of researchers presented ProtST, a framework for enhancing protein sequence pretraining and understanding with biomedical texts. The team also built a dataset called ProtDescribe, which pairs protein sequences with text descriptions of their functions and other properties. Based on the ProtDescribe dataset, the ProtST framework aims to preserve the representation power of conventional PLMs in capturing coevolutionary information during pretraining.
Three pretraining tasks were designed to inject protein property information of different granularities into a PLM while maintaining the model's original representation power. The first is unimodal masked prediction, whose goal is to preserve the PLM's ability to capture coevolutionary information through masked protein modeling. By masking certain regions of protein sequences and training the model to predict the masked tokens from the surrounding context, the PLM retains its representation ability even as property information is added.
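For readers who want a concrete picture, here is a minimal PyTorch sketch of what masked protein modeling can look like; the encoder, vocabulary size, and masking rate are illustrative assumptions rather than details from the ProtST paper.

```python
# Minimal sketch of unimodal masked protein modeling (illustrative only;
# module names and dimensions are assumptions, not ProtST's actual code).
import torch
import torch.nn as nn

VOCAB_SIZE = 25   # 20 amino acids + special tokens (assumed)
MASK_ID = 24      # index of the [MASK] token (assumed)

class ProteinEncoder(nn.Module):
    """Stand-in for a pretrained protein language model (PLM)."""
    def __init__(self, dim=128, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.lm_head = nn.Linear(dim, VOCAB_SIZE)

    def forward(self, tokens):
        hidden = self.encoder(self.embed(tokens))
        return self.lm_head(hidden), hidden

def masked_protein_loss(model, tokens, mask_prob=0.15):
    """Mask random residues and train the model to recover them."""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    logits, _ = model(corrupted)
    # The loss is computed only on the masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# Usage: a batch of 8 sequences, each 100 residues long.
model = ProteinEncoder()
seqs = torch.randint(0, 20, (8, 100))
loss = masked_protein_loss(model, seqs)
loss.backward()
```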
The second is multimodal representation alignment, in which protein sequences are aligned with the representations of their associated texts. Structured text representations of protein property descriptions are extracted with a biomedical language model, and by aligning protein sequences with these text representations, the PLM captures the semantic relationship between the sequences and their textual descriptions.
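A common way to implement this kind of sequence-text alignment is a symmetric contrastive (InfoNCE-style) objective over pooled embeddings. The sketch below assumes that setup, with the embedding dimension and temperature chosen arbitrarily; it is not taken from the ProtST code.

```python
# Sketch of sequence-text representation alignment via a symmetric
# contrastive (InfoNCE-style) objective; dimensions and temperature are
# illustrative assumptions, not values from the ProtST paper.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(seq_emb, text_emb, temperature=0.07):
    """seq_emb, text_emb: (batch, dim) pooled embeddings from the protein
    encoder and the biomedical text encoder for matched pairs."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = seq_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(seq_emb.size(0))         # matched pairs lie on the diagonal
    # Symmetric loss: sequence-to-text and text-to-sequence directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings for a batch of 16 pairs.
seq_emb = torch.randn(16, 128)
text_emb = torch.randn(16, 128)
loss = contrastive_alignment_loss(seq_emb, text_emb)
```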
The third task, i.e., multimodal masked prediction, models fine-grained dependencies between residues in protein sequences and words in protein property descriptions. A fusion module produces multimodal representations of both residues and words and is used to predict masked residues and words; in doing so, the PLM captures the fine-grained connections between protein sequences and the textual descriptions of their properties.
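One plausible shape for such a fusion module is cross-attention between residue states and word states, followed by separate prediction heads. The sketch below assumes that design, with all layer sizes illustrative; it should not be read as ProtST's actual architecture.

```python
# Rough sketch of a fusion module for multimodal masked prediction:
# cross-attention lets residue states attend to word states (and vice versa)
# before predicting masked tokens. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4, residue_vocab=25, word_vocab=30522):
        super().__init__()
        self.res_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_res = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.residue_head = nn.Linear(dim, residue_vocab)
        self.word_head = nn.Linear(dim, word_vocab)

    def forward(self, residue_states, word_states):
        # Residues attend to the text description, words attend to the sequence.
        fused_res, _ = self.res_to_text(residue_states, word_states, word_states)
        fused_word, _ = self.text_to_res(word_states, residue_states, residue_states)
        return self.residue_head(fused_res), self.word_head(fused_word)

# Usage: 8 sequences of 100 residues and descriptions of 40 words,
# already encoded by the two unimodal encoders.
fusion = CrossModalFusion()
residue_states = torch.randn(8, 100, 128)
word_states = torch.randn(8, 40, 128)
residue_logits, word_logits = fusion(residue_states, word_states)
# Masked-position losses on residue_logits / word_logits would then be
# computed exactly as in the unimodal sketch above.
```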
On evaluation, the team found that supervised learning with ProtST benefits from the enriched protein representations and performs better on various representation learning benchmarks; the ProtST-induced PLM outperforms previous models on these tasks. ProtST also showed strong performance on zero-shot protein classification: even for classes that were not present during training, the trained model was able to classify proteins into the corresponding functional categories. ProtST further enables the retrieval of functional proteins from a large database without any function annotation.
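Conceptually, zero-shot classification in such an aligned embedding space reduces to similarity search between a protein embedding and the text embeddings of class descriptions; the sketch below illustrates that idea with random stand-in embeddings, not ProtST's actual pipeline.

```python
# Illustrative sketch of zero-shot protein classification in an aligned
# sequence-text embedding space: score a protein against text embeddings of
# class descriptions and pick the closest one. Encoders and labels are assumed.
import torch
import torch.nn.functional as F

def zero_shot_classify(protein_emb, class_text_embs):
    """protein_emb: (dim,) embedding of one protein sequence.
    class_text_embs: (num_classes, dim) embeddings of textual class descriptions."""
    protein_emb = F.normalize(protein_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    scores = class_text_embs @ protein_emb      # cosine similarity per class
    return scores.argmax().item(), scores

# Usage with random stand-in embeddings for 5 unseen functional classes.
protein_emb = torch.randn(128)
class_text_embs = torch.randn(5, 128)
predicted_class, scores = zero_shot_classify(protein_emb, class_text_embs)
```

Retrieval without function annotation works the same way in reverse: embed a textual function description and rank all protein embeddings in the database by their similarity to it.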
In conclusion, this framework for enhancing protein sequence pretraining and understanding with biomedical texts looks promising and is a welcome addition to advances in AI.
Check out the Paper and GitHub link.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.