Electronic health records (EHRs) need a new public relations manager. Ten years ago, the US government passed a law that strongly encouraged the adoption of electronic health records with the intent of improving and expediting care. The enormous amount of information in these now-digital records could be used to answer very specific questions beyond the scope of clinical trials: What is the correct dose of this drug for patients of this height and weight? What about patients with a specific genomic profile?
Unfortunately, most of the data that could answer these questions is trapped in doctors’ notes, full of jargon and abbreviations. These notes are difficult for computers to understand using current techniques: extracting information requires training multiple machine learning models. Models trained at one hospital also do not perform well at others, and training each model requires domain experts to label large amounts of data, a costly and time-consuming process.
An ideal system would use a single model that can extract many types of information, perform well across multiple hospitals, and learn from a small amount of labeled data. But how? Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), led by Monica Agrawal, a doctoral candidate in electrical engineering and computer science, believed that to unlock the data, they needed to turn to something bigger: large language models. To pull out that important medical information, they used a very large GPT-3-style model to do tasks like expanding overloaded jargon and acronyms and extracting medication regimens.
For example, the system takes an input, in this case a clinical note, and “prompts” the model with a question about the note, such as “expand this abbreviation, CTA.” The system returns a result such as “clear to auscultation” rather than, say, CT angiography. The goal of extracting this clean data, the team says, is to eventually enable more personalized clinical recommendations.
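In code, that interaction amounts to a single prompt-and-completion round trip. The sketch below is illustrative rather than the team’s actual prompt: it assumes the legacy OpenAI completion API (the pre-1.0 openai Python package) with an API key already configured, and the note text and model name are made up for the example.

```python
import openai  # assumes the openai package is installed and an API key is configured

note = "Lungs: CTA bilaterally, no wheezes or crackles."

# Give the model the note as context, then ask it to expand the abbreviation.
prompt = (
    f"Clinical note: {note}\n"
    "In the note above, expand the abbreviation 'CTA'.\n"
    "Expansion:"
)

response = openai.Completion.create(
    model="text-davinci-002",  # illustrative GPT-3-style model
    prompt=prompt,
    max_tokens=10,
    temperature=0,             # deterministic output for extraction
)

print(response["choices"][0]["text"].strip())
# Given the lung-exam context, the expected expansion is "clear to auscultation".
```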
Medical data is understandably a complicated resource to navigate freely. There is a lot of red tape around using public resources to test the performance of large models because of data-use restrictions, so the team decided to put together their own. Using a set of publicly available short clinical snippets, they cobbled together a small dataset for evaluating how well large language models extract information.
“It is challenging to develop a single, general-purpose clinical natural language processing system that meets everyone’s needs and is robust against the enormous variation seen in healthcare data sets. As a result, to this day, most clinical notes are not used in downstream analysis or to support live decision-making in electronic health records. These large language model approaches could potentially transform clinical natural language processing,” says David Sontag, MIT professor of electrical engineering and computer science, principal investigator at CSAIL and the Institute for Medical Engineering and Science, and supervising author of a paper on the work, which will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction make scaling possible. Even if you have hundreds of different use cases, no problem: you can build each model with a few minutes of work, instead of having to label a ton of data for that particular task.”
For example, the researchers found that without any labels, these models could achieve 86% accuracy at expanding overloaded acronyms, and the team developed additional methods to boost this further to 90% accuracy, still with no labels required.
Trapped in an EHR
Researchers have been steadily building large language models (LLMs) for quite some time, but they broke into the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on an enormous amount of text from the internet to complete sentences and predict the next most likely word.
While earlier, smaller models such as previous GPT iterations or BERT have achieved good performance at extracting medical data, they still require considerable manual data-labeling effort.
For example, the note “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but experienced nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medications, recording side effects, disambiguating common abbreviations, and so on). In addition to expanding abbreviations, they investigated four other tasks, including whether the models could parse clinical trials and extract richly detailed medication regimens.
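The practical upshot of using a single model is that switching tasks means switching the instruction text, not retraining anything. A hypothetical sketch of that idea follows; the prompts, model name, and legacy OpenAI completion API usage are illustrative assumptions, not the paper’s exact setup.

```python
import openai

note = "pt will dc vanco due to n/v"

# Different extraction tasks, same model: only the instruction changes.
tasks = {
    "medications": "List the medications mentioned in the note.",
    "status": "For each medication, say whether it is active or discontinued.",
    "abbreviations": "Expand all abbreviations in the note.",
}

for name, instruction in tasks.items():
    completion = openai.Completion.create(
        model="text-davinci-002",
        prompt=f"Clinical note: {note}\n{instruction}\nAnswer:",
        max_tokens=40,
        temperature=0,
    )
    print(name, "->", completion["choices"][0]["text"].strip())
```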
“Previous work has shown that these models are sensitive to the precise wording of the prompt. Part of our technical contribution is a way to format the prompt so that the model gives you outputs in the correct format,” says Hunter Lang, a CSAIL PhD student and author of the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string. It can be a list. It can be a quote from the original input. So there is more structure than just free text. Part of our research contribution is prompting the model to give you an output with the correct structure. That significantly cuts down on post-processing time.”
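One way to picture that in code, purely as an illustration rather than the authors’ actual templates: the prompt pins the output to a machine-readable format, and a small “resolver” maps the raw completion back into structured records.

```python
import json

# A prompt that fixes the output format up front, so the completion can be
# parsed mechanically instead of being cleaned up by hand afterwards.
PROMPT_TEMPLATE = (
    "Clinical note: {note}\n"
    "Extract every medication as a JSON list of objects with the keys "
    '"medication" and "status".\n'
    "JSON:"
)

def resolve(raw_completion):
    """Map the model's raw text back into the structured output space."""
    records = json.loads(raw_completion)
    # Keep only well-formed records; anything else would need manual review.
    return [r for r in records if {"medication", "status"} <= r.keys()]

prompt = PROMPT_TEMPLATE.format(note="pt will dc vanco due to n/v")

# Suppose the model returned this completion for the prompt above:
raw = '[{"medication": "vancomycin", "status": "discontinued"}]'
print(resolve(raw))  # [{'medication': 'vancomycin', 'status': 'discontinued'}]
```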
The approach cannot be applied out of the box to health data at a hospital: that requires sending private patient information across the open internet to an LLM provider like OpenAI. The authors showed that it is possible to work around this by distilling the model into a smaller one that can be used on site.
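A toy version of that distillation step, with strong assumptions: the snippets and pseudo-labels below are invented, and a scikit-learn bag-of-words classifier stands in for whatever compact model a hospital would actually train and deploy inside its own network.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pseudo-labels produced once by the large model; the small "student" model
# is then trained and run entirely on site, with no data leaving the hospital.
snippets = [
    "pt will dc vanco due to n/v",
    "continue metoprolol 25 mg daily",
    "hold lisinopril, restart after procedure",
    "started azithromycin for CAP",
]
pseudo_labels = ["discontinued", "active", "on-hold", "active"]

student = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
student.fit(snippets, pseudo_labels)

print(student.predict(["will stop vanco today"]))
```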
Like humans, the model is not always beholden to the truth. Here is what a potential problem might look like: say you ask for the reason someone took a medication. Without proper guardrails and checks, the model might output the most common reason for that medication even if nothing is explicitly mentioned in the note. This led to the team’s efforts to force the model to extract more quotes from the data and less free text.
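A simple guardrail in that spirit (an illustrative check, not the paper’s method) is to accept an extracted value only if it appears verbatim in the note:

```python
def grounded(extraction: str, note: str) -> bool:
    """Accept an extracted value only if it is quoted from the note itself."""
    return extraction.lower() in note.lower()

note = "pt will dc vanco due to n/v"

print(grounded("n/v", note))        # True: the reason is stated in the note
print(grounded("infection", note))  # False: a plausible guess, but not grounded
```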
The team’s future work includes extending to languages other than English, creating additional methods to quantify uncertainty in the model, and obtaining similar results with open source models.
“Clinical information buried in unstructured clinical notes has unique challenges compared to general-domain text, mostly due to the heavy use of acronyms and the inconsistent textual patterns used across different healthcare facilities,” says Sadid Hasan, AI lead at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work sets forth an interesting paradigm for leveraging the power of general-domain large language models for several important zero-/few-shot clinical NLP tasks. Specifically, the proposed guided prompt design of LLMs to generate more structured outputs could lead to further development of smaller, deployable models by iteratively utilizing model-generated pseudo-labels.”
“AI has accelerated in the last five years to the point where these large models can make contextualized recommendations with benefits rippling across a variety of domains, such as suggesting novel drug formulations, understanding unstructured text, making code recommendations, or creating artwork inspired by any number of human artists or styles,” says Parminder Bhatia, who was formerly head of machine learning at AWS Health AI and is currently head of machine learning for low-code applications leveraging large language models at AWS AI Labs.
As part of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, Agrawal, Sontag, and Lang co-authored the paper with Yoon Kim, an MIT assistant professor and CSAIL principal investigator, and Stefan Hegselmann, a visiting PhD student from the University of Muenster. First author Agrawal’s research was supported by a Takeda grant, MIT’s Deshpande Center for Technological Innovation, and the MLA@CSAIL initiatives.