Proteins, the workhorses of the cell, are used in applications ranging from materials to therapeutics. A protein is a chain of amino acids that folds into a particular three-dimensional structure. Thanks to the development of low-cost sequencing technology, a large number of new protein sequences have been discovered in recent years. Because functional annotation of a new protein sequence remains expensive and time-consuming, accurate and efficient in silico protein function annotation methods are needed to bridge the growing sequence-function gap.
Because many protein functions are governed by how proteins fold, many data-driven approaches rely on learning representations of protein structures. These representations can then be applied to tasks such as protein design, structure classification, model quality assessment, and function prediction.
Because experimental determination of protein structures is difficult, the number of published structures is far smaller than the datasets available in other machine learning domains. For example, the Protein Data Bank contains 182K experimentally determined structures, compared with 47M protein sequences in Pfam and 10M annotated images in ImageNet. To bridge this gap, several studies have exploited the abundance of unlabeled protein sequence data, and many researchers have used self-supervised learning to pretrain protein encoders on millions of sequences.
Recent advances in accurate deep learning-based protein structure prediction have made it feasible to predict the structures of many protein sequences efficiently and reliably. However, sequence-based techniques do not explicitly capture or exploit information about protein structure, which is known to determine how proteins function. To make better use of structural information, many structure-based protein encoders have been proposed. Unfortunately, these models have not yet explicitly addressed edge interactions, which are crucial for modeling protein structure. Furthermore, because of the scarcity of experimentally determined protein structures, until recently relatively little work had been done on pretraining techniques that take advantage of unlabeled 3D structures.
Taking inspiration from these breakthroughs, the researchers created a protein encoder that is pretrained on the largest feasible set of protein structures and can be applied to a variety of property prediction tasks. They propose a simple but effective structure-based encoder, the GeomEtry-Aware Relational Graph Neural Network (GearNet), which encodes spatial information by adding different types of sequential and structural edges to protein residue graphs and then performing relational message passing on them. To further improve protein structure encoding, they propose a sparse edge message passing technique, the first effort to apply edge-level message passing in GNNs for protein structure encoding; the idea was inspired by the triangle attention design in Evoformer.
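To make the idea of relational message passing on a residue graph concrete, here is a minimal PyTorch sketch. It assumes the standard relational-GNN recipe of one linear transform per edge type; the edge types (sequential neighbors at several offsets, spatial k-NN edges, etc.), tensor names, and shapes are illustrative, not the authors' actual API.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """One layer of relational message passing over a residue graph.

    Hypothetical sketch: each edge carries a relation id (e.g. sequential
    offset or spatial neighbor), and each relation gets its own transform.
    """
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        # One linear transform per relation type (R-GCN style).
        self.relation_linears = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_relations)]
        )
        self.self_linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, edge_index, edge_type):
        # h:          (num_residues, in_dim) node features
        # edge_index: (2, num_edges) source/target residue indices
        # edge_type:  (num_edges,) relation id for each edge
        out = self.self_linear(h)
        src, dst = edge_index
        for r, linear in enumerate(self.relation_linears):
            mask = edge_type == r
            if mask.any():
                msg = linear(h[src[mask]])         # transform source features per relation
                out.index_add_(0, dst[mask], msg)  # aggregate messages at target residues
        return torch.relu(out)
```

The edge-level variant described in the paper can be pictured as running an analogous message passing step on a graph whose nodes are the edges above, so that geometric relations between pairs of edges (e.g. the angle they form) also propagate information.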
They also provide a geometric pretraining approach based on the popular contrastive learning framework to learn the protein structure encoder. They propose novel augmentation functions that discover co-occurring, biologically related substructures within proteins, increasing the similarity between learned representations of substructures from the same protein while decreasing the similarity between those from different proteins. At the same time, they propose a set of straightforward baselines based on self-prediction.
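A minimal sketch of the contrastive objective, assuming an InfoNCE-style loss over a batch where z1[i] and z2[i] are embeddings of two augmented substructures (for example, a cropped subsequence and a spatial neighborhood) of the same protein; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.07):
    """Contrastive loss between two batches of substructure embeddings.

    z1, z2: (batch_size, dim); row i of each comes from the same protein.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (B, B) pairwise similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Matching crops (the diagonal) are positives; all other pairs are negatives.
    return F.cross_entropy(logits, targets)
```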
By benchmarking these pretraining methods on various downstream property prediction tasks, they established a solid foundation for pretraining protein structure representations. The self-prediction tasks include masked prediction of various geometric or physicochemical attributes, such as residue types, Euclidean distances, and dihedral angles. Extensive experiments on a variety of benchmarks, including Enzyme Commission number prediction, Gene Ontology term prediction, fold classification, and reaction classification, show that GearNet augmented with edge message passing consistently outperforms existing protein encoders on most tasks in a supervised setting.
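As one example of a self-prediction baseline, here is a hedged sketch of masked residue-type prediction in the spirit of masked language modeling. The `encoder`, `head`, and `graph` arguments are placeholders for a structure encoder, a linear classifier over the 20 amino-acid types, and a residue graph; the mask-token index and ratio are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_residue_type_loss(encoder, head, graph, residue_types, mask_ratio=0.15):
    """Mask a fraction of residues and predict their amino-acid types
    from the encoder's contextual representations."""
    num_residues = residue_types.size(0)
    mask = torch.rand(num_residues, device=residue_types.device) < mask_ratio
    corrupted = residue_types.clone()
    corrupted[mask] = 20                   # assumed index of a special [MASK] token
    h = encoder(graph, corrupted)          # (num_residues, hidden_dim)
    logits = head(h[mask])                 # predict only at masked positions
    return F.cross_entropy(logits, residue_types[mask])
```

Masked distance or dihedral-angle prediction follows the same pattern, with a regression head over masked geometric attributes instead of a classifier over residue types.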
Moreover, with the proposed pretraining strategy, their model trained on fewer than a million samples matches or even exceeds state-of-the-art sequence-based encoders pretrained on datasets of millions or billions of sequences. The codebase, written in PyTorch and TorchDrug, is publicly available on GitHub.