Organic synthesis, the construction of molecules through chemical reactions, is an important branch of synthetic chemistry. One of the most important tasks in computer-aided organic synthesis is retrosynthesis analysis: proposing likely reaction precursors for a desired product. Finding the best reaction routes among a large set of possibilities requires accurate prediction of reactants. In this article, following the Microsoft researchers, “reactants” means the substrates that contribute atoms to a product molecule; solvents and catalysts, which facilitate a reaction but contribute no atoms to the final product, are not counted as reactants. Recently, machine learning-based methods have shown considerable promise in tackling this problem. Many of these approaches use encoder-decoder frameworks, in which the encoder maps the molecular sequence or graph to high-dimensional vectors and the decoder turns the encoder’s output into the output sequence, generated autoregressively, token by token.
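To make the token-by-token generation described above concrete, here is a minimal sketch of greedy autoregressive decoding. It is not the researchers' model: `toy_scores` is a hypothetical stand-in for a trained decoder that simply replays a fixed answer, and `encoded_product` is shown as a plain string where a real system would pass the encoder's vectors.

```python
from typing import Callable, Dict, List

BOS, EOS = "<bos>", "<eos>"


def greedy_decode(
    next_token_scores: Callable[[str, List[str]], Dict[str, float]],
    encoded_product: str,
    max_len: int = 50,
) -> List[str]:
    """Generate output tokens one at a time until EOS or max_len is reached."""
    output: List[str] = [BOS]
    for _ in range(max_len):
        # Each step conditions on the encoder output and on every token so far.
        scores = next_token_scores(encoded_product, output)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        output.append(best)
    return output[1:]  # drop the BOS marker


def toy_scores(encoded_product: str, prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder stub: always prefers the next token of a fixed answer."""
    target = ["C", "C", "O", EOS]
    step = len(prefix) - 1  # number of tokens generated so far
    wanted = target[min(step, len(target) - 1)]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in ["C", "O", EOS]}


tokens = greedy_decode(toy_scores, encoded_product="CCO")
print(tokens)
```

Greedy selection is the simplest choice; sequence models of this kind are often decoded with beam search instead, which keeps several candidate prefixes at each step.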
Retrosynthesis analysis has been framed as a translation from one language to another, in this case from the product to the reactants. A Molecular Transformer has been used to predict retrosynthetic routes, combined with exploration strategies based on Bayesian-like probability. Recasting retrosynthesis analysis as a machine translation problem makes it possible to apply deep neural networks that are already well developed in natural language processing.
In the decoding stage, SMILES output strings are built by token-by-token autoregression; in conventional methods, the elementary tokens in SMILES strings typically refer to individual atoms in a molecule. This is not immediately intuitive or explainable for chemists engaged in synthesis design or retrosynthesis analysis. When faced with a real-world route scouting challenge, most synthetic chemists rely on years of training and experience, combining their knowledge of existing reaction pathways with an abstract understanding of the underlying mechanisms derived from first principles. Human retrosynthesis analysis commonly begins with molecular fragments or substructures that are chemically similar to, or preserved in, the target molecules. These fragments or substructures are pieces of a puzzle that, if put together correctly, could lead to the final product through a series of chemical reactions.
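As an illustration of what atom-level SMILES tokens look like, the following is a minimal tokenizer sketch. The regular expression is a simplified assumption covering common SMILES symbols, not the tokenizer used by the researchers, and `tokenize_smiles` is a hypothetical helper name.

```python
import re

# Simplified pattern: bracket atoms, two-letter halogens, single-letter atoms,
# aromatic atoms, ring-closure digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#+\-\\/.:~@*$])"
)


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into elementary tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse input containing characters the simplified pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)O"))  # ['C', 'C', '(', '=', 'O', ')', 'O']
print(tokenize_smiles("BrCCCl"))   # ['Br', 'C', 'C', 'Cl']
```

Almost every token here is a single atom, bond, or bracket, which is why a sequence of such tokens says little to a chemist who thinks in terms of larger fragments.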
The researchers propose using substructures that are commonly preserved in organic synthesis, without resorting to expert systems or template libraries. These substructures are extracted from large sets of known reactions and capture subtle similarities between reactants and products. This lets them frame retrosynthesis analysis as a sequence-to-sequence learning problem at the substructure level.
Modeling of extracted substructures
In organic chemistry, “substructures” are molecular fragments or smaller building blocks that are chemically similar to, or preserved within, target molecules. They are crucial for retrosynthesis analysis because they help reveal how complex molecules are assembled.
Using this idea as inspiration, the framework has three primary parts:
Reaction retrieval: Given a product molecule, this module retrieves reactions that yield similar products. It uses a learnable cross-lingual memory retriever that aligns reactants and products in a high-dimensional vector space.
Substructure extraction: The researchers use molecular fingerprints to extract the substructures shared between the product molecule and the top cross-aligned candidates. These substructures provide the fragment-to-fragment mapping between substrates and products at the reaction level.
Substructure-level sequence-to-sequence learning: The researchers convert the original token sequences into sequences of substructures. The new input sequence consists of the SMILES strings of the substructures, followed by the SMILES strings of the remaining fragments labeled with virtual numbers. The output sequences are the fragments labeled with virtual numbers. The virtual numbers mark the sites where bonds form and fragments connect.
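The first two parts of the framework can be sketched in a few lines. This is a toy illustration under stated assumptions, not the researchers' implementation: the vectors in `memory` stand in for embeddings that a trained retriever would produce, the integer sets stand in for molecular fingerprint features that a cheminformatics toolkit would compute, and all names (`retrieve`, `shared_features`, `rxn_a`, and so on) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(
    query: Sequence[float], memory: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k stored reactions whose vectors are closest to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in memory.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def shared_features(product_fp: Set[int], candidate_fps: List[Set[int]]) -> Set[int]:
    """Fingerprint features of the product that every candidate also contains."""
    common = set(product_fp)
    for fp in candidate_fps:
        common &= fp
    return common


# Toy embeddings for three stored reactions and one query product (assumed values).
memory = {"rxn_a": [1.0, 0.0], "rxn_b": [0.8, 0.6], "rxn_c": [0.0, 1.0]}
top = retrieve([1.0, 0.1], memory, k=2)

# Toy fingerprint feature sets for the product and the stored reactions.
fingerprints = {"rxn_a": {1, 2, 3, 9}, "rxn_b": {2, 3, 4, 7}, "rxn_c": {5, 6}}
common = shared_features({1, 2, 3, 4}, [fingerprints[name] for name, _ in top])

print([name for name, _ in top])  # ['rxn_a', 'rxn_b']
print(sorted(common))             # [2, 3]
```

In a real pipeline the shared features would then be mapped back to atoms in the product to delimit the conserved substructure; that mapping depends on the fingerprint type and is left out of this sketch.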
Compared with previously evaluated methods, the approach achieves comparable or higher top-one accuracy overall. Model performance improves significantly on the subset of data for which substructures were successfully extracted.
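For readers unfamiliar with the metric, top-one accuracy is the fraction of test products for which the model's highest-ranked prediction matches the recorded reactants. The sketch below shows the general top-k form; the data are made up for illustration, and the exact string comparison is an assumption (evaluations of this kind typically compare canonicalized SMILES rather than raw strings).

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of examples whose reference appears among the first k predictions."""
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1 for preds, ref in zip(ranked_predictions, references) if ref in preds[:k]
    )
    return hits / len(references)


# Made-up ranked predictions for four test products (best guess first).
predictions = [
    ["CCO.CC(=O)O", "CCO"],
    ["CCN", "CCBr.N"],
    ["c1ccccc1", "Cc1ccccc1"],
    ["CC(=O)Cl.CN", "CC(=O)O.CN"],
]
references = ["CCO.CC(=O)O", "CCBr.N", "Brc1ccccc1", "CC(=O)Cl.CN"]

print(top_k_accuracy(predictions, references, k=1))  # 0.5
print(top_k_accuracy(predictions, references, k=2))  # 0.75
```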
The method successfully extracted substructures for 82 percent of the products in the USPTO test dataset, which indicates that it generalizes well.
Because the model only needs to generate the fragments connected to virtually labeled atoms in the substructures, the string representations of molecules are shorter and fewer atoms need to be predicted.
In conclusion, the Microsoft researchers devised a way to derive commonly preserved substructures for use in retrosynthesis prediction. The substructures are extracted without any human assistance. The overall method closely resembles the way human chemists carry out retrosynthesis analysis. The current implementation improves on previously published models. The researchers also show that improving the underlying substructure extraction procedure can help the model perform better in retrosynthesis prediction. Their goal is to spark readers’ interest in the exciting, multidisciplinary field of retrosynthesis prediction and related research.
Check out the Microsoft article. All credit for this research goes to the researchers on this project.
Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.