The study of evolution by natural selection at the molecular level has advanced significantly with the arrival of genomic technologies. Traditionally, researchers have focused on observable traits such as flowering time or growth. However, gene expression provides an intermediate phenotype that connects genomic data to these macroscopic traits, offering a deeper understanding of selection pressures. In a recent study with Ivyleaf Morning Glory (*Ipomoea hederacea*), researchers used RNA sequencing to analyze gene expression under natural field conditions. The challenge of dealing with high-dimensional and small sample size data, typical of transcriptomics, was addressed using machine learning methods. These methods, known for their ability to handle complex, multivariate data, revealed that genes related to photosynthesis, stress response, and light response were crucial in predicting fitness. This demonstrates the potential of ML models to uncover important biological processes and genes under selection in natural environments, overcoming the limitations of traditional statistical approaches.
Furthermore, intricate patterns of codon usage, which vary significantly between and within species, are shaped by evolutionary selection. One study explored whether AI could predict codon sequences from given amino acid sequences in different organisms, including yeast and bacteria. The researchers used the transformer-based mBART architecture to capture complex dependencies in codon usage that simple frequency-based methods fail to detect. Their findings indicate that AI can learn and predict these codon patterns, particularly in highly expressed genes and longer proteins. This suggests that codon choice is influenced by evolutionary pressures related to protein expression and folding. The approach improves our understanding of codon bias and its impact on protein synthesis, and it provides a new tool for optimizing codon usage in biotechnology and synthetic biology applications.
Method Summary:
The study used NCBI coding sequences from *S. cerevisiae*, *S. pombe*, *E. coli*, and *B. subtilis*, divided into training, validation, and test sets. CD-HIT clustered the amino acid sequences, and each cluster was kept within a single set. BLAST identified similar sequences, and proteins were categorized by expression level. Codon prediction models included frequency-based methods and mBART models with different configurations. The training protocol included pre-training and fine-tuning with specific hyperparameters. Fixed-size windows were applied during inference, and predictions were averaged across windows. Accuracy and perplexity metrics evaluated model performance against the true codon sequences.
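To make the frequency-based baseline and the two evaluation metrics concrete, here is a minimal Python sketch. It is not the study's code: the function names, the toy training pairs, and the choice to predict the single most frequent codon per amino acid are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_codon_frequencies(pairs):
    """Estimate P(codon | amino acid) from (amino_acid, codon) training pairs."""
    counts = defaultdict(Counter)
    for aa, codon in pairs:
        counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def predict_most_frequent(freqs, protein):
    """Baseline: pick the most frequent training codon for each residue."""
    return [max(freqs[aa], key=freqs[aa].get) for aa in protein]


def accuracy(predicted, true_codons):
    """Fraction of positions where the predicted codon equals the true codon."""
    return sum(p == t for p, t in zip(predicted, true_codons)) / len(true_codons)


def perplexity(freqs, protein, true_codons):
    """exp of the mean negative log-probability given to the true codons.

    Assumes every true codon was seen in training (no smoothing in this toy).
    """
    nll = -sum(math.log(freqs[aa][c]) for aa, c in zip(protein, true_codons))
    return math.exp(nll / len(true_codons))


# Toy data, invented for illustration only.
train = [("K", "AAA")] * 3 + [("K", "AAG")] + [("M", "ATG")] * 2
freqs = fit_codon_frequencies(train)
protein, truth = "MKK", ["ATG", "AAA", "AAG"]
predicted = predict_most_frequent(freqs, protein)
print(predicted, accuracy(predicted, truth), perplexity(freqs, protein, truth))
```

Accuracy rewards only exact matches, while perplexity also reflects how much probability the model gave to the true codon, so a model can improve on one metric without improving on the other.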
Training and Evaluation of mBART Models:
mBART models were trained to predict codon sequences from amino acid sequences in two modes, masking and mimicry. Masking involved predicting codons from the amino acid sequence alone, while mimicry involved predicting codons based on those of an orthologous protein from a different organism. The mimicry approach rests on the hypothesis that codons can influence the rate of translation elongation, which is critical for cotranslational folding of proteins. The training data sets consisted of proteins from *S. cerevisiae*, *S. pombe*, *E. coli*, and *B. subtilis*, divided into training, validation, and test sets with no amino acid sequence overlap between the training and test sets. Evaluation showed that mBART models generally outperformed frequency-based baselines, especially in predicting codons for proteins with higher expression levels. This suggests that mBART can learn and use long-range interactions between codons more effectively.
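The requirement that similar sequences never straddle the training and test sets can be illustrated by splitting at the cluster level rather than the protein level. The sketch below is a hedged illustration, not the authors' pipeline: `split_by_cluster`, the 80/10/10 fractions, and the `cluster_of` mapping (which would in practice be parsed from a clustering tool's output) are all assumptions.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test.

    cluster_of: {protein_id: cluster_id}. The fractions are illustrative
    defaults, not values reported by the study.
    """
    clusters = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)  # reproducible shuffle

    n_train = int(fractions[0] * len(cluster_ids))
    n_val = int(fractions[1] * len(cluster_ids))
    groups = {
        "train": cluster_ids[:n_train],
        "validation": cluster_ids[n_train:n_train + n_val],
        "test": cluster_ids[n_train + n_val:],
    }
    # Expand each group of cluster ids back into protein ids.
    return {
        name: [pid for cid in cids for pid in clusters[cid]]
        for name, cids in groups.items()
    }


# Toy example: 20 hypothetical proteins in 10 clusters of two.
cluster_of = {f"p{i}": i // 2 for i in range(20)}
splits = split_by_cluster(cluster_of)
print({name: len(pids) for name, pids in splits.items()})
```

Because the unit of assignment is the cluster, two proteins that were grouped as similar always land in the same set, which is what prevents leakage between training and evaluation.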
Accuracy of Masking and Mimicry Predictions:
Masking-mode predictions from mBART models were more accurate than those of frequency-based methods, demonstrating the ability to capture complex patterns in codon usage. Different window sizes were tested, with the 30-codon window model performing best. Although predictions in mimicry mode were slightly less accurate than predictions in masking mode, they still showed potential, especially in eukaryotic organisms and for highly conserved orthologous segments. The performance of the mBART models did not benefit significantly from sequence similarities between the training and test sets, indicating robust learning of codon usage patterns. Furthermore, the accuracy of the models varied among proteins with different expression levels and molecular functions, with notable improvements for proteins involved in ribosomal functions, nucleic acid binding, and catalytic activities in *S. cerevisiae* and *E. coli*.
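Inference with fixed-size windows whose predictions are averaged can be sketched as follows. This is an illustration under stated assumptions: `predict_window` stands in for any model that returns one codon probability dictionary per residue, and the stride and the extra final window that covers the sequence tail are choices made here, not details from the study.

```python
from collections import defaultdict


def windowed_predict(protein, predict_window, window=30, stride=10):
    """Average per-position codon probabilities over overlapping windows.

    predict_window(segment) -> list of {codon: probability}, one per residue.
    It is a hypothetical stand-in for the real model.
    """
    if stride > window:
        raise ValueError("stride must not exceed window, or positions are skipped")
    n = len(protein)
    sums = [defaultdict(float) for _ in range(n)]
    hits = [0] * n

    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:  # add a last window so the tail is covered
        starts.append(n - window)

    for start in starts:
        segment = protein[start:start + window]
        for offset, dist in enumerate(predict_window(segment)):
            for codon, p in dist.items():
                sums[start + offset][codon] += p
            hits[start + offset] += 1

    averaged = [{c: p / hits[i] for c, p in sums[i].items()} for i in range(n)]
    best = [max(d, key=d.get) for d in averaged]
    return best, averaged


def toy_model(segment):
    """Toy model: certain of 'AAA' in the first two window slots, 'AAG' after."""
    return [{"AAA": 1.0} if k < 2 else {"AAG": 1.0} for k in range(len(segment))]


best, averaged = windowed_predict("KKKKKK", toy_model, window=4, stride=2)
print(averaged)
```

In the toy run, positions 2 and 3 are covered by two windows that disagree, so their averaged distribution splits evenly between the two codons, while positions covered by a single window keep that window's prediction.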
Methods:
Tissue was collected from *Ipomoea hederacea*, an annual vine distributed throughout the eastern United States. A field experiment involved planting 100 individuals from 56 populations in a greenhouse and transplanting them to a field. A year later, soil samples were analyzed for heavy metals. Leaf tissue was collected after 71 days, and mRNA was extracted and sequenced. Data processing included aligning reads to the *Ipomoea nil* genome, transforming gene counts, and filtering out low-expression genes. Analytical methods involved principal component regression and supervised modeling using neural networks and gradient tree boosting. Significant genes were identified, and GO term enrichment analysis was performed using Blast2GO and goseq.
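The count-processing step can be pictured with a small sketch. The source does not state which transformation or expression cutoff was used, so the log2(count + 1) transform, the mean-count threshold of 5, the order of the two steps, and the function name below are all assumptions chosen for illustration.

```python
import math


def filter_and_transform(counts, min_mean=5.0):
    """Drop low-expression genes, then log-transform the remaining counts.

    counts: {gene_id: [raw read count per sample]}.
    min_mean and the log2(count + 1) transform are illustrative assumptions.
    """
    kept = {}
    for gene, values in counts.items():
        if sum(values) / len(values) >= min_mean:
            kept[gene] = [math.log2(v + 1) for v in values]
    return kept


# Hypothetical gene ids and counts for three samples.
raw = {"geneA": [0, 1, 2], "geneB": [7, 15, 31]}
processed = filter_and_transform(raw)
print(processed)
```

Removing genes with very low counts before modeling reduces the number of noisy features, which matters when there are far more genes than sampled plants.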
Insights into AI-Powered Codon Prediction and Gene Expression Analysis:
AI models such as mBART have been used to predict codon usage in various organisms, and machine learning methods have been used to analyze the impact of gene expression on fitness. The codon models highlight significant correlations between codon usage and protein expression, evolutionary conservation, and functional attributes. Highly expressed genes and conserved proteins exhibit more predictable codon patterns. Additionally, machine learning approaches effectively identify gene expression patterns related to fitness, particularly in genes associated with stress response and reproductive development. This underscores the usefulness of AI in decoding complex biological sequences and improving our understanding of evolutionary biology and gene regulation.
Sana Hassan, a consulting intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and artificial intelligence to address real-world challenges. With a strong interest in solving practical problems, she brings a new perspective to the intersection of AI and real-life solutions.