<img src="https://news.mit.edu/sites/default/files/styles/news_article__cover_image__original/public/images/202409/MIT-3Q-ai-Label-01-press.jpg?itok=KQRE3kmz" />
Artificial intelligence systems are increasingly being used in healthcare situations where safety is critical. However, these models sometimes provide incorrect information, make biased predictions, or fail for unexpected reasons, which could have serious consequences for patients and doctors.
In an opinion article published today in Nature Computational Science, MIT Associate Professor Marzyeh Ghassemi and Boston University Associate Professor Elaine Nsoesie argue that, to mitigate these potential harms, AI systems should be accompanied by responsible use labels, similar to the labels the U.S. Food and Drug Administration requires on prescription drugs.
MIT News spoke with Ghassemi about the need for such labels, the information they should convey, and how labeling procedures could be implemented.
Q: Why do we need responsible use labels for AI systems in healthcare settings?
A: In healthcare, we have an interesting situation where doctors often rely on technologies or treatments that are not fully understood. Sometimes this lack of understanding is fundamental (the mechanism behind acetaminophen, for instance), but other times it is simply a limitation of expertise. We don't expect doctors to know how to maintain an MRI machine, for example. Instead, we have certification systems through the FDA or other federal agencies that certify the use of a medical device or drug in a specific setting.
Importantly, medical devices also have service contracts – a manufacturer’s technician will repair your MRI machine if it is miscalibrated. For approved medicines, there are post-marketing surveillance and reporting systems in place to be able to address adverse effects or events – for example, if many people taking a medicine seem to develop a condition or allergy.
Models and algorithms, whether or not they incorporate AI, bypass many of these long-term monitoring and approval processes, and that is something we need to be wary of. Many previous studies have shown that predictive models need more careful evaluation and monitoring. For more recent generative AI in particular, we cite work showing that generation is not guaranteed to be appropriate, robust, or unbiased. Because we do not have the same level of oversight over model predictions or generation, it would be even more difficult to detect a model's problematic responses. The generative models currently being used in hospitals could be biased. Having usage labels is one way to ensure that models do not automate biases learned from human practitioners or from poorly calibrated clinical decision support scores of the past.
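To make the missing post-deployment surveillance concrete, here is a minimal sketch, not drawn from the article, of the kind of routine check a deploying hospital could run: re-scoring a recent batch of labeled predictions and flagging subgroups (for example, care sites) whose discrimination drifts below an agreed floor. The function name, field names, and the 0.75 AUROC threshold are illustrative assumptions; the sketch uses scikit-learn's roc_auc_score.

```python
# A minimal sketch (not from the article) of post-deployment monitoring for a
# clinical prediction model: re-score a recent labeled batch and flag
# subgroups whose AUROC falls below an agreed floor. The names and the 0.75
# threshold are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score


def audit_subgroup_performance(y_true, y_score, subgroup, min_auroc=0.75):
    """Return {subgroup: AUROC} for subgroups performing below min_auroc."""
    flagged = {}
    for group in np.unique(subgroup):
        mask = subgroup == group
        # AUROC is undefined if only one outcome class is present in the group.
        if len(np.unique(y_true[mask])) < 2:
            continue
        auroc = roc_auc_score(y_true[mask], y_score[mask])
        if auroc < min_auroc:
            flagged[str(group)] = round(auroc, 3)
    return flagged


# Illustrative synthetic "post-market" batch: the model is informative at
# site_A but essentially random at site_B, so site_B should be flagged.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
subgroup = rng.choice(["site_A", "site_B"], size=1000)
y_score = np.where(
    subgroup == "site_B",
    rng.random(1000),  # uninformative scores at site_B
    np.clip(0.6 * y_true + rng.normal(0.2, 0.2, size=1000), 0, 1),
)
print(audit_subgroup_performance(y_true, y_score, subgroup))
```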
Q: Your article describes several components of a responsible use label for AI, following the FDA's approach to creating prescription labels, including approved use, ingredients, potential side effects, and so on. What basic information should these labels convey?
A: A label should make clear the time, place, and manner in which a model is intended to be used. For example, the user should know that a model was trained with data from a specific period: does that data include the Covid-19 pandemic or not? Healthcare practices during the Covid pandemic were very different, and that could affect the data. That's why we recommend disclosing the model's “ingredients” and “studies completed.”
As for location, we know from previous research that models trained in one location tend to perform worse when moved to another. Knowing where data came from and how a model was optimized within that population can help ensure that users are aware of “potential side effects,” “warnings and precautions,” and “adverse reactions.”
If a model is being trained to predict an outcome, knowing the time and place of training can help make smart decisions about deployment. However, many generative models are incredibly flexible and can be used for many tasks. In this case, time and place may not be as informative, and more explicit instructions about “labeling conditions” and “approved use” versus “unapproved use” come into play. If a developer has evaluated a generative model for reading a patient’s clinical notes and generating prospective billing codes, that evaluation may reveal the model has a bias toward overbilling for specific conditions or underrecognizing others. A user would not want to use this same generative model to decide who gets a referral to a specialist, even though they could. This flexibility is why we advocate for additional details about how models should be used.
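For concreteness, here is a hedged sketch of how the label sections discussed above could be recorded as a machine-readable artifact shipped alongside a model. The field names mirror the FDA-style sections named in the interview; the class name and all example values are hypothetical, loosely following the billing-code scenario Ghassemi describes.

```python
# A hedged sketch of a machine-readable responsible use label. The field
# names mirror the FDA-style sections discussed in the interview; the class
# name and all example values are hypothetical.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ResponsibleUseLabel:
    approved_use: str                                  # task and setting the model was validated for
    unapproved_uses: list = field(default_factory=list)
    ingredients: dict = field(default_factory=dict)    # training data: source, time window, population
    studies_completed: list = field(default_factory=list)
    potential_side_effects: list = field(default_factory=list)
    warnings_and_precautions: list = field(default_factory=list)
    adverse_reaction_reporting: str = ""               # where users report post-deployment failures


label = ResponsibleUseLabel(
    approved_use="Drafting prospective billing codes from clinical notes, with clinician review",
    unapproved_uses=["Deciding who gets a referral to a specialist"],
    ingredients={
        "data_source": "a single academic hospital",
        "time_window": "2016-2019 (pre-Covid-19)",
        "population": "adult inpatients",
    },
    studies_completed=["Retrospective audit of coding accuracy"],
    potential_side_effects=[
        "Overbilling for specific conditions",
        "Underrecognizing other conditions",
    ],
    warnings_and_precautions=["Performance may degrade at other sites or in other time periods"],
    adverse_reaction_reporting="Report suspected failures to the deploying institution's safety team",
)
print(json.dumps(asdict(label), indent=2))
```

A record of this kind could be validated before deployment and updated as post-market evidence accumulates, in the spirit of the drug-surveillance systems described earlier.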
In general, we recommend training the best model you can with the tools available. But even then, there should be a lot of transparency. No model is going to be perfect. As a society, we now understand that no pill is perfect: there is always some risk. We should have the same understanding of AI models. Any model, with or without AI, is limited. It can give you realistic, well-trained forecasts of potential futures, but those should be taken with the appropriate amount of caution.
Q: If AI labels were implemented, who would do the labeling, and how would labels be regulated and enforced?
A: If you don't intend for your model to be used in practice, then the disclosures you would make for a high-quality research publication are sufficient. But once you intend for your model to be deployed in a human-facing environment, developers and implementers should do some initial labeling, based on some of the established frameworks. There should be validation of these claims before deployment; in a safety-critical environment like healthcare, many agencies within the Department of Health and Human Services might be involved.
For model developers, I think knowing that they will need to label the limitations of a system prompts more careful consideration of the process itself. If I know that at some point I will need to reveal the population that a model was trained on, I wouldn’t want to reveal that it was trained only on dialogue from male chatbot users, for example.
Thinking about things like who the data is being collected about, over what time period, what the sample size was, and how it was decided what data to include or exclude can open your mind to potential problems in implementation.