Microbial sequence databases contain a wealth of information on enzymes and other molecules that could be adapted for biotechnology. But these databases have grown so much in recent years that it has become difficult to efficiently search for enzymes of interest.
Now, scientists at MIT’s McGovern Institute for Brain Research, the Broad Institute of MIT and Harvard, and the National Center for Biotechnology Information (NCBI) at the National Institutes of Health have developed a new search algorithm that has identified 188 new types of rare CRISPR systems in bacterial genomes, encompassing thousands of individual systems. The work appears today in Science.
The algorithm, which comes from the laboratory of pioneering CRISPR researcher Professor Feng Zhang, uses big-data clustering approaches to rapidly search massive amounts of genomic data. The team used their algorithm, called fast locality-sensitive hashing-based clustering (FLSHclust), to mine three major public databases containing data from a wide range of unusual bacteria, including ones found in coal mines, breweries, Antarctic lakes, and dog saliva. The scientists found a surprising number and diversity of CRISPR systems, including some that could make edits to DNA in human cells, others that can target RNA, and many with a variety of other functions.
The new systems could be leveraged to edit mammalian cells with fewer off-target effects than current Cas9 systems. They could also one day be used as diagnostics or serve as molecular records of activity within cells.
The researchers say their search highlights an unprecedented level of CRISPR diversity and flexibility and that there are likely many more rare systems yet to be discovered as databases continue to grow.
“Biodiversity is a hidden treasure, and as we continue to sequence more genomes and metagenomic samples, there is a growing need for better tools, like FLSHclust, to search that sequence space for molecular gems,” says Zhang, a co-senior author of the study and the James and Patricia Poitras Professor of Neuroscience at MIT, with joint appointments in the departments of Brain and Cognitive Sciences and Biological Engineering. Zhang is also an investigator at MIT’s McGovern Institute for Brain Research, a core member of the Broad Institute, and an investigator at the Howard Hughes Medical Institute. Eugene Koonin, a distinguished investigator at NCBI, is also a co-senior author of the study.
Searching for CRISPR
CRISPR, which stands for clustered regularly interspaced short palindromic repeats, is a bacterial defense system that has been engineered into many genome editing and diagnostic tools.
To mine protein and nucleic acid sequence databases for new CRISPR systems, the researchers developed an algorithm based on an approach borrowed from the big data community. This technique, called locality-sensitive hashing, groups together objects that are similar but not exactly identical. Using this approach allowed the team to probe billions of protein and DNA sequences (from the NCBI, its Whole Genome Shotgun database, and the Joint Genome Institute) in weeks, whereas previous methods that look for identical objects would have taken months. They designed their algorithm to look for genes associated with CRISPR.
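To make the idea of locality-sensitive hashing concrete, here is a minimal sketch in Python. It is not FLSHclust and does not reproduce the team's method: it uses MinHash signatures with banding, one common textbook form of locality-sensitive hashing, and every name in it (`kmers`, `minhash_signature`, `lsh_clusters`, the toy sequences, and the parameter values) is an assumption chosen for illustration.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers (shingles) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(shingles, num_hashes=32):
    """For each seeded hash function, keep the smallest hash over all shingles.

    Assumes `shingles` is non-empty, i.e. the sequence has at least k residues.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # each salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingles
        ))
    return signature


def lsh_clusters(seqs, k=3, num_hashes=32, bands=16):
    """Group sequences whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, seq in seqs.items():
        sig = minhash_signature(kmers(seq, k), num_hashes)
        for b in range(bands):
            # Sequences that share a (band index, band values) key share a bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(name)

    # Union-find: merge every pair of sequences that landed in a common bucket.
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for name in seqs:
        clusters[find(name)].append(name)
    return sorted(sorted(c) for c in clusters.values())


# Made-up toy strings, not real protein sequences.
toy_proteins = {
    "variant_a": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "variant_b": "MKTAYIAKQRQISFVKSHFSRQLEDRLGLIEVQ",  # one substitution vs. variant_a
    "unrelated": "GSPWDNCHHYTTPLMMNNDDWWCCGGPPYYHH",
}
for cluster in lsh_clusters(toy_proteins):
    print(cluster)
```

The point of the design is that sequences are only ever compared through the buckets they hash into, so similar sequences tend to meet without an all-against-all comparison of the whole database. A production tool would need far more care with parameters, memory, and verification of candidate pairs than this sketch shows.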
“This new algorithm allows us to analyze data in a short enough period of time that we can retrieve results and formulate biological hypotheses,” says Soumya Kannan PhD ’23, a co-first author of the study. Kannan was a graduate student in Zhang’s lab when the study began and is currently a postdoc and junior fellow at Harvard University. Han Altae-Tran PhD ’23, a graduate student in Zhang’s lab during the study and currently a postdoc at the University of Washington, was the study’s other co-first author.
“This is proof of what can be done when you improve exploration methods and use as much data as possible,” says Altae-Tran. “It’s really exciting to be able to improve the scale at which we conduct searches.”
New systems
In their analysis, Altae-Tran, Kannan and their colleagues noted that the thousands of CRISPR systems they found fell into some existing categories and many new ones. They studied several of the new systems in greater detail in the laboratory.
They found several new variants of known Type I CRISPR systems, which use a guide RNA that is 32 base pairs long rather than the 20-nucleotide guide of Cas9. Because of their longer guide RNAs, these Type I systems could be used to develop more precise gene-editing technology that is less prone to off-target editing. Zhang’s team showed that two of these systems could make short edits in the DNA of human cells. And because these Type I systems are similar in size to CRISPR-Cas9, they could likely be delivered to animal or human cells using the same gene-delivery technologies used for CRISPR today.
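A rough back-of-the-envelope calculation suggests why guide length matters for specificity. The sketch below is an illustration only, not an analysis from the study: it assumes a genome of about 3.1 billion base pairs with uniformly random bases, searched on both strands, and counts only exact matches. Real off-target editing also depends on partially matching sites and other factors that this ignores. The function name and default values are hypothetical.

```python
def expected_chance_matches(guide_len, genome_bp=3_100_000_000, strands=2):
    """Expected exact matches to one guide in a uniformly random genome (toy model)."""
    # Each position matches a given guide with probability (1/4) ** guide_len.
    return genome_bp * strands / 4 ** guide_len


for guide_len in (20, 32):
    print(guide_len, expected_chance_matches(guide_len))
```

Under these simplifying assumptions, each added base divides the chance of a coincidental exact match by four, so a 32-base guide is 4^12 (roughly 17 million) times less likely than a 20-base guide to match a random site by chance.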
One of the Type I systems also showed “collateral activity”: broad degradation of nucleic acids after the CRISPR protein binds its target. Scientists have used similar systems to build infectious disease diagnostics such as SHERLOCK, a tool capable of rapidly detecting a single molecule of DNA or RNA. Zhang’s team believes the new systems could also be adapted for diagnostic technologies.
The researchers also discovered new mechanisms of action for some Type IV CRISPR systems and a Type VII system that precisely targets RNA, which could potentially be used in RNA editing. Other systems could potentially be used as recording tools (a molecular document of when a gene was expressed) or as sensors of specific activity in a living cell.
Mining data
The scientists say their algorithm could help in the search for other biochemical systems. “This search algorithm could be used by anyone who wants to work with these large databases to study how proteins evolve or discover new genes,” says Altae-Tran.
The researchers add that their findings illustrate not only how diverse CRISPR systems are, but also that most are rare and only found in unusual bacteria. “Some of these microbial systems were found exclusively in coal mine water,” Kannan says. “If someone hadn’t been interested in that, we might never have seen those systems. Expanding our diversity of samples is really important to continue to expand the diversity of what we can discover.”
This work was supported by the Howard Hughes Medical Institute; the K. Lisa Yang and Hock E. Tan Center for Molecular Therapeutics at MIT; Broad Institute Programmable Therapeutics Gift Donors; The Pershing Square Foundation, William Ackman and Neri Oxman; James and Patricia Poitras; BT Charitable Foundation; Asness Family Foundation; Kenneth C. Griffin; the Phillips family; David Cheng; and Robert Metcalfe.