Messenger RNA (mRNA) plays a crucial role in protein synthesis, translating genetic information into proteins through sequences of nucleotide triplets called codons. However, current language models for biological sequences, especially mRNA, fail to capture the hierarchical organization of codons, which leads to suboptimal performance when predicting properties or generating diverse mRNA sequences. Modeling mRNA is uniquely challenging because of the many-to-one relationship between codons and the amino acids they encode: multiple synonymous codons encode the same amino acid yet differ in their biological properties. This hierarchical structure of synonymous codons is crucial to the functional roles of mRNA, particularly in applications such as vaccines and gene therapies.
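To make the many-to-one mapping concrete, here is a minimal Python sketch. The codon assignments come from the standard genetic code, but the dictionary excerpt and the `is_synonymous` helper are illustrative, not from the paper:

```python
# Excerpt of the standard genetic code: many codons map to one amino acid.
GENETIC_CODE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",   # 4 codons -> Alanine
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu", "CUC": "Leu",
    "CUA": "Leu", "CUG": "Leu",                                # 6 codons -> Leucine
    "AUG": "Met",                                              # also the start codon
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",               # stop signals
}

def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """Two codons are synonymous if they encode the same amino acid."""
    return GENETIC_CODE[codon_a] == GENETIC_CODE[codon_b]

print(is_synonymous("GCU", "GCG"))  # True: both encode Alanine
print(is_synonymous("GCU", "AUG"))  # False: Alanine vs. Methionine
```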
Researchers from Johnson & Johnson and the University of Central Florida propose a new approach to mRNA language modeling called Hierarchical Encoding for mRNA Language Modeling (HELM). HELM incorporates hierarchical codon relationships directly into the training objective. This is achieved by modulating the loss function based on codon synonymy, which aligns training with the biological reality of mRNA sequences. Specifically, HELM scales the penalty for a prediction error depending on whether the error involves synonymous codons (considered less significant) or codons encoding different amino acids (considered more significant). The researchers evaluate HELM against existing mRNA models on various tasks, including mRNA property prediction and antibody region annotation, and find that it improves accuracy by roughly 8% on average over existing models.
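A minimal sketch of this idea follows. It is not the paper's exact formulation: the `is_synonymous` helper from the snippet above, the argmax-based check, and the weight values are illustrative assumptions. It simply down-weights the cross-entropy for tokens where the model's mistake stays within a synonymous codon group:

```python
import torch
import torch.nn.functional as F

def synonymy_weighted_loss(logits, targets, codon_vocab,
                           syn_weight=0.5, nonsyn_weight=1.0):
    """Cross-entropy in which synonymous-codon mistakes are penalized less.

    logits:      (batch, vocab) unnormalized scores over codon tokens
    targets:     (batch,) indices of the true codons
    codon_vocab: list mapping token index -> codon string (e.g., "GCU")
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    preds = logits.argmax(dim=-1)
    weights = torch.ones_like(per_token)
    for i, (p, t) in enumerate(zip(preds.tolist(), targets.tolist())):
        if p != t:
            syn = is_synonymous(codon_vocab[p], codon_vocab[t])
            weights[i] = syn_weight if syn else nonsyn_weight
    return (weights * per_token).mean()
```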
The core of HELM lies in its hierarchical encoding approach, which integrates the codon structure directly into language model training. This is done through a hierarchical cross-entropy (HXE) loss, in which each mRNA codon is weighted according to its position in a tree-like hierarchy representing biological relationships among codons. The hierarchy begins with a root node covering all codons, branches into coding and non-coding codons, and is further subdivided by biological function, such as "start" and "stop" signals or the specific amino acid encoded. During pre-training, HELM uses masked language modeling (MLM) and causal language modeling (CLM) objectives, weighting errors according to where the codons involved sit within this hierarchy. Synonymous codon substitutions are therefore penalized less, fostering a nuanced understanding of codon-level relationships. Additionally, HELM supports common language model architectures and can be adopted without major changes to existing training pipelines.
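A compact sketch of a hierarchical cross-entropy of this kind is shown below. It follows the common formulation in which each edge on the path from the true leaf to the root contributes a conditional log-probability term, weighted by exp(−α·height) so that mistakes near the root cost more. The toy tree, node names, and `ANCESTORS` table are illustrative assumptions, not the paper's exact hierarchy:

```python
import math
import torch

# Toy hierarchy: root -> {coding -> per-amino-acid groups -> codons, non-coding}.
# ANCESTORS maps each leaf codon to its path from immediate parent up to the root.
ANCESTORS = {
    "GCU": ["Ala", "coding", "root"],
    "GCC": ["Ala", "coding", "root"],
    "AUG": ["Met", "coding", "root"],
    "UAA": ["stop", "coding", "root"],
    # ... one entry per codon in the vocabulary
}

def hxe_loss(leaf_probs, target_codon, codon_vocab, alpha=0.4):
    """Hierarchical cross-entropy for a single prediction.

    leaf_probs:   (vocab,) softmax probabilities over leaf codons
    target_codon: string name of the true codon
    The probability of an internal node is the sum of the probabilities
    of the leaves beneath it; edges nearer the root get larger weights.
    """
    def node_prob(node):
        idx = [i for i, c in enumerate(codon_vocab)
               if node == c or node in ANCESTORS[c]]
        return leaf_probs[idx].sum()

    path = [target_codon] + ANCESTORS[target_codon]  # leaf ... root
    loss = torch.zeros(())
    for height, (child, parent) in enumerate(zip(path[:-1], path[1:])):
        cond = node_prob(child) / node_prob(parent).clamp_min(1e-12)
        loss = loss - math.exp(-alpha * height) * torch.log(cond.clamp_min(1e-12))
    return loss
```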
HELM was evaluated on multiple datasets, including antibody-related mRNA and general mRNA sequences. Compared to non-hierarchical language models and state-of-the-art RNA foundation models, HELM demonstrated consistent improvements, outperforming standard pre-training methods by 8% on average on predictive tasks across six diverse datasets. For example, in annotating antibody mRNA sequences, HELM achieved an accuracy improvement of around 5%, indicating that it captures biologically relevant structure better than traditional models. The hierarchical approach also produced stronger clustering of synonymous sequences, a sign that the model captures biological relationships more accurately. Beyond classification, HELM was evaluated for its generative capabilities and was shown to generate diverse mRNA sequences whose distribution aligns more closely with real data than those of non-hierarchical baselines. Fréchet biological distance (FBD) was used to measure how well the generated sequences match real biological data, and HELM consistently achieved lower FBD scores, indicating closer alignment with real biological sequences.
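FBD is computed in the same spirit as the Fréchet inception distance used for images: fit a Gaussian to embeddings of real sequences and another to embeddings of generated sequences, then take the Fréchet distance between the two. The NumPy/SciPy sketch below assumes an upstream biological sequence encoder that produces the embeddings; it implements the standard Fréchet-distance formula rather than the paper's exact pipeline:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    real_emb, gen_emb: (n_samples, dim) arrays of sequence embeddings.
    Lower values mean the generated distribution is closer to the real one.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```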
The researchers conclude that HELM represents a significant advance in modeling mRNA sequences, particularly in its ability to capture the biological hierarchies inherent to mRNA. By incorporating these relationships directly into the training process, HELM achieves superior results on both predictive and generative tasks, while requiring minimal modifications to standard model architectures. Future work could explore more advanced methods, such as training HELM in hyperbolic space to better capture hierarchical relationships that Euclidean space cannot easily model. Overall, HELM paves the way for better analysis and application of mRNA, with promising implications for areas such as therapeutic development and synthetic biology.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.