One strategy for cellular reprogramming involves using specific genetic interventions to engineer a cell into a new state. The technique holds great promise in immunotherapy, for example, where researchers could reprogram a patient’s T cells to be more potent against cancer. One day, the approach could also help identify life-saving cancer treatments or regenerative therapies that repair organs ravaged by disease.
But the human body has about 20,000 genes, and a genetic perturbation could involve a combination of genes or target any of the more than 1,000 transcription factors that regulate them. Because the search space is vast and genetic experiments are expensive, scientists often struggle to find the ideal perturbation for their particular application.
Researchers at MIT and Harvard University developed a new computational approach that can efficiently identify optimal genetic perturbations based on a much smaller number of experiments than traditional methods.
Their algorithmic technique takes advantage of the cause-effect relationship between factors in a complex system, such as genome regulation, to prioritize the best intervention in each round of sequential experiments.
The researchers conducted a rigorous theoretical analysis to determine that their technique did, in fact, identify optimal interventions. With that theoretical framework established, they applied the algorithms to real biological data designed to mimic a cell reprogramming experiment. Their algorithms proved more efficient and effective than competing methods.
“Too often, large-scale experiments are designed empirically. A careful causal framework for sequential experimentation may allow optimal interventions to be identified with fewer trials, thereby reducing experimental costs,” says co-senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) who is also co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and a researcher at the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS) at MIT.
Joining Uhler on the paper, which appears today in Nature Machine Intelligence, are lead author Jiaqi Zhang, a graduate student and member of the Eric and Wendy Schmidt Center; co-senior author Themistoklis P. Sapsis, professor of mechanical and ocean engineering at MIT and member of the IDSS; and others at Harvard and MIT.
Active learning
When scientists try to design an effective intervention for a complex system, such as in cellular reprogramming, they often perform experiments sequentially. These setups are ideal for using a machine learning approach called active learning. Data samples are collected and used to learn a model of the system that incorporates the knowledge collected so far. From this model, an acquisition function is designed: an equation that evaluates all possible interventions and chooses the best one to test in the next trial.
This process is repeated until an optimal intervention is identified (or resources to fund subsequent experiments are exhausted).
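The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes a hypothetical black-box system with 20 candidate interventions, a simple running-average model, and a generic upper-confidence-bound acquisition function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system: each of 20 candidate interventions has an unknown
# true effect that can only be observed through noisy experiments.
true_effects = rng.normal(size=20)

def run_experiment(target):
    """Perform one noisy experiment on the chosen intervention target."""
    return true_effects[target] + rng.normal(scale=0.1)

estimates = np.zeros(20)  # model of the system learned so far
counts = np.zeros(20)     # how many times each intervention was tried

for trial in range(30):
    # Acquisition function: score every candidate, balancing how good it
    # looks so far against how little it has been explored (UCB-style).
    bonus = np.where(counts == 0, np.inf,
                     np.sqrt(2 * np.log(trial + 1) / np.maximum(counts, 1)))
    target = int(np.argmax(estimates + bonus))

    # Run the chosen experiment and update the model with the new data.
    outcome = run_experiment(target)
    counts[target] += 1
    estimates[target] += (outcome - estimates[target]) / counts[target]

best = int(np.argmax(estimates))  # intervention believed optimal so far
```

The loop can stop early at any round and still report the best intervention found so far, mirroring the resource-limited setting the article describes.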
“While several generic acquisition functions exist to design experiments sequentially, these are not effective for problems of such complexity, leading to very slow convergence,” explains Sapsis.
Acquisition functions typically consider the correlation between factors, such as which genes are coexpressed. But focusing only on correlation ignores the regulatory relationships, or causal structure, of the system. For example, a genetic intervention can only affect the expression of downstream genes, but a correlation-based approach would not be able to distinguish between upstream and downstream genes.
“Some of this causal knowledge can be learned from the data and used to design an intervention more efficiently,” Zhang explains.
The MIT and Harvard researchers took advantage of this underlying causal structure for their technique. First, they carefully built an algorithm so that it could only learn models of the system that took causal relationships into account.
The researchers then designed the acquisition function to automatically evaluate interventions using information about these causal relationships. They crafted this function to prioritize the most informative interventions, that is, those most likely to lead to the optimal intervention in subsequent experiments.
“By considering causal models rather than correlation-based models, we can already rule out certain interventions. Then, each time new data is obtained, a more precise causal model can be learned and thus reduce the space for interventions even further,” explains Uhler.
This smaller search space, along with the acquisition function’s special focus on the most informative interventions, is what makes their approach so efficient.
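The pruning idea can be illustrated on a toy causal graph. This sketch assumes a made-up five-gene regulatory network and a target gene "E"; since an intervention can only affect downstream genes, any candidate that is not an ancestor of the target can be ruled out immediately.

```python
# Toy causal graph over five "genes": edges point from regulator to regulated.
# The reprogramming target is gene "E".
edges = {"A": ["B"], "B": ["E"], "C": ["D"], "D": [], "E": []}

def ancestors(graph, target):
    """Return all nodes with a directed path into `target`."""
    parents = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            parents[dst].append(src)
    seen, stack = set(), [target]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Only interventions on ancestors of "E" can change its expression,
# so "C" and "D" are eliminated without running a single experiment.
candidates = set(edges) - {"E"}
viable = candidates & ancestors(edges, "E")
```

As new experimental data refines the estimated graph, the same pruning step can be rerun, shrinking the intervention space further, as Uhler describes.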
The researchers further improved their acquisition function using a technique known as output weighting, inspired by the study of extreme events in complex systems. This method carefully emphasizes interventions that are likely to be closest to the optimal intervention.
“Essentially, we consider an optimal intervention as an ‘extreme event’ within the space of all possible suboptimal interventions and use some of the ideas we have developed for these problems,” says Sapsis.
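A simplified version of this weighting idea can be sketched as follows. The numbers and the Gaussian predictive model are illustrative assumptions, not the paper's actual formulation: each candidate's predictive uncertainty is up-weighted by how plausible it is that the candidate's outcome lands in the extreme, near-optimal tail.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model posterior over 10 candidate interventions:
# predicted mean effect and predictive uncertainty for each.
means = rng.normal(size=10)
stds = np.full(10, 0.5)

# Probability that each candidate's outcome exceeds the current best
# predicted effect, under a Gaussian predictive distribution.
threshold = means.max()
tail_prob = np.array([0.5 * math.erfc((threshold - m) / (s * math.sqrt(2)))
                      for m, s in zip(means, stds)])

# Output-weighted score: uncertainty emphasized where an "extreme"
# (near-optimal) outcome is plausible; plain uncertainty sampling would
# ignore tail_prob and query the most uncertain candidate regardless.
scores = stds * tail_prob
next_target = int(np.argmax(scores))
```

The effect is that experimental budget is concentrated near the suspected optimum rather than spread uniformly over the whole intervention space.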
Improved efficiency
They tested their algorithms using real biological data in a simulated cell reprogramming experiment. For this test, they looked for a genetic perturbation that would result in a desired change in average gene expression. Their acquisition functions consistently identified better interventions than the baseline methods at every step of the multistage experiment.
“If the experiment is interrupted at any stage, our method would still be more efficient than the baselines. This means fewer experiments could be performed to obtain the same or better results,” says Zhang.
The researchers are currently working with experimenters to apply their technique to cell reprogramming in the laboratory.
Their approach could also be applied to non-genomics problems, such as identifying optimal prices for consumer products or enabling optimal feedback control in fluid mechanics applications.
In the future, they plan to extend their technique to optimizations beyond those that seek to match a desired mean. Additionally, their method assumes that scientists already understand the causal relationships in their system, but future work could explore how to use AI to learn that information as well.
This work was supported, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the MIT J-Clinic for Machine Learning and Health, the Eric and Wendy Schmidt Center at the Broad Institute, a Simons Investigator Award, the Air Force Office of Scientific Research, and a National Science Foundation graduate fellowship.