Is it possible to build machine learning models without machine learning experience?
Jim Collins, the Termeer Professor of Medical Engineering and Science in the Department of Biological Engineering at MIT and faculty lead of life sciences at the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic), decided with several colleagues to tackle this problem when they faced a similar conundrum. An open-access paper on their proposed solution, called BioAutoMATED, was published on June 21 in Cell Systems.
Recruiting machine learning researchers can be a time-consuming and financially costly process for science and engineering labs. Even with a machine learning expert, selecting the appropriate model, formatting the data set for the model, and then fine-tuning it can dramatically change how the model performs, and each step takes a lot of work.
“In your machine learning project, how much time will you typically spend on data preparation and transformation?” asks a 2022 Google course on the fundamentals of machine learning (ML). The two options offered are “Less than half the project time” or “More than half the project time.” If you guessed the latter, you’d be right; Google claims that it takes more than 80 percent of the project time to format the data, and that doesn’t even take into account the time needed to frame the problem in machine learning terms.
“It would take many weeks of effort to figure out the appropriate model for our data set, and this is a really prohibitive step for many people who want to use machine learning or biology,” says Jacqueline Valeri, a fifth-year PhD student of biological engineering in Collins’s lab, who is co-first author of the paper.
BioAutoMATED is an automated machine learning system that can select and build an appropriate model for a given data set and even take care of the laborious task of data preprocessing, reducing a months-long process to just a few hours. Automated machine learning (AutoML) systems are still at a relatively nascent stage of development; current usage is geared primarily toward image and text recognition, and they remain largely unused in subfields of biology, notes co-first author and Jameel Clinic postdoc Luis Soenksen PhD ’20.
“The fundamental language of biology is based on sequences,” explains Soenksen, who earned his PhD from MIT’s Department of Mechanical Engineering. “Biological sequences such as DNA, RNA, proteins and glycans have the surprising informational property of being intrinsically standardized, like an alphabet. Many AutoML tools are developed for text, so it made sense to extend them to [biological] sequences.”
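To make the alphabet analogy concrete, here is a minimal sketch of one common way to turn a sequence into model-ready numbers: one-hot encoding. The helper name `one_hot_encode` and the four-letter DNA alphabet are assumptions chosen for illustration; this is not taken from the BioAutoMATED code.

```python
# Hypothetical illustration, not BioAutoMATED code: one-hot encode a DNA sequence.
DNA_ALPHABET = "ACGT"  # assumed alphabet; RNA or protein would use a different one


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one row per sequence position and one column per alphabet letter."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for letter in sequence.upper():
        if letter not in index:
            raise ValueError(f"unexpected character {letter!r} for alphabet {alphabet!r}")
        row = [0] * len(alphabet)
        row[index[letter]] = 1  # mark the column that matches this letter
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for row in one_hot_encode("GATTACA"):
        print(row)
```

Because every position maps to a fixed-width row, sequences of the same length become equally shaped numeric grids, which is the kind of standardization the quote describes.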
Additionally, most AutoML tools can only explore and build a limited range of model types. “But you can’t really know from the beginning of a project which model will be best for your data set,” says Valeri. “By incorporating multiple tools under a single tool, we truly enable a much larger search space than any single AutoML tool could achieve on its own.”
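The idea of searching across candidates from several sources can be sketched as scoring every candidate on the same held-out data and keeping the best one. The toy "models" below are hand-written rules and the validation set is made up for illustration; a real AutoML search would train and tune actual models, and none of these names come from BioAutoMATED.

```python
# Hypothetical illustration: compare candidate models on shared validation data.

def accuracy(model, examples):
    """Fraction of (sequence, label) pairs the model predicts correctly."""
    correct = sum(1 for seq, label in examples if model(seq) == label)
    return correct / len(examples)


def select_best(candidates, validation):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: accuracy(model, validation) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for models that different tools might propose.
def gc_rich(seq):
    return int((seq.count("G") + seq.count("C")) / len(seq) > 0.5)


def starts_with_g(seq):
    return int(seq.startswith("G"))


def always_zero(seq):
    return 0


if __name__ == "__main__":
    validation = [("GGCC", 1), ("ATAT", 0), ("GCGA", 1), ("TTGA", 0), ("CCGT", 1)]
    candidates = {"gc_rich": gc_rich, "starts_with_g": starts_with_g, "always_zero": always_zero}
    best, scores = select_best(candidates, validation)
    print(best, scores)
```

The more candidate families the search covers, the less the outcome depends on guessing the right model type up front.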
BioAutoMATED’s repertoire of supervised ML models includes three types: binary classification models (which divide data into two classes), multiclass classification models (which divide data into multiple classes), and regression models (which fit continuous numerical values or measure the strength of key relationships between variables). BioAutoMATED can even help determine how much data is required to properly train the chosen model.
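The three task types differ mainly in what the labels look like. The sketch below guesses the task type from a list of labels using a simple heuristic; the function name `infer_task_type` and the heuristic itself are assumptions for illustration, not how BioAutoMATED decides, and in practice a tool may simply ask the user to state the task.

```python
# Hypothetical heuristic, not BioAutoMATED logic: guess the task type from labels.

def infer_task_type(labels):
    """Return 'regression', 'binary_classification', or 'multiclass_classification'."""
    distinct = set(labels)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct label values")
    # Any fractional value suggests a continuous target rather than a class label.
    has_fractional = any(isinstance(v, float) and not v.is_integer() for v in distinct)
    if has_fractional:
        return "regression"
    if len(distinct) == 2:
        return "binary_classification"
    return "multiclass_classification"


if __name__ == "__main__":
    print(infer_task_type([0, 1, 1, 0]))
    print(infer_task_type(["low", "medium", "high"]))
    print(infer_task_type([0.12, 3.4, 2.2]))
```

The heuristic will misread integer-valued continuous measurements as class labels, which is one reason an explicit task setting is usually safer.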
“Our tool explores models that are better suited to smaller, more sparse biological data sets, as well as more complex neural networks,” says Valeri. This is an advantage for research groups with new data that may or may not be suitable for a machine learning problem.
“Performing novel and successful experiments at the intersection of biology and machine learning can cost a lot of money,” explains Soenksen. “Currently, biology-focused labs need to invest in significant digital infrastructure and human resources trained in AI-ML before they can even see if their ideas are ready to pay off. We want to reduce these barriers for biology experts.” With BioAutoMATED, researchers are free to run initial experiments to evaluate whether it is worth hiring a machine learning expert to build a different model for further experiments.
The open-source code is publicly available and, the researchers emphasize, easy to run. “What we would love to see is people taking our code, improving it, and collaborating with larger communities to make it a tool for everyone,” Soenksen says. “We want to prepare the biological research community and raise awareness related to AutoML techniques, as a really useful avenue that could merge rigorous biological practice with accelerated AI-ML practice better than what is achieved today.”
Collins, the senior author of the paper, is also affiliated with the MIT Institute for Medical Engineering and Science (IMES), the Harvard-MIT Program in Health Sciences and Technology, the Broad Institute of MIT and Harvard, and the Wyss Institute. Other MIT contributors to the paper include Katherine M. Collins ’21; Nicolaas M. Angenent-Mari PhD ’21; Felix Wong, a former postdoc in the Department of Biological Engineering, IMES, and the Broad Institute; and Timothy K. Lu, a professor of biological engineering and of electrical engineering and computer science.
This work was supported, in part, by a grant from the Defense Threat Reduction Agency, the Defense Advanced Research Projects Agency SD2 program, the Paul G. Allen Frontiers Group, the Wyss Institute for Biologically Inspired Engineering at Harvard University; an MIT-Takeda Scholarship, a Siebel Foundation Scholarship, a CONACyT Scholarship, an MIT-TATA Center Scholarship, a Johnson & Johnson Undergraduate Research Scholarship, a Barry Goldwater Scholarship, a Marshall Scholarship, Cambridge Trust, and the National Institute of Allergy and Infectious Diseases of the National Institutes of Health. This work is part of the Antibiotics-AI Project, which is supported by the Audacious Project, Flu Lab, LLC, Sea Grape Foundation, Rosamund Zander and Hansjorg Wyss for the Wyss Foundation, and an anonymous donor.