Generative AI models are increasingly being incorporated into healthcare environments, in some cases perhaps prematurely. Early adopters believe they will unlock greater efficiency while revealing insights that would otherwise be overlooked. Meanwhile, critics point out that these models have flaws and biases that could contribute to worse health outcomes.
But is there a quantitative way to tell how useful or harmful a model might be when asked to do things like summarize patient records or answer health-related questions?
Hugging Face, the AI startup, proposes a solution in a recently launched benchmark called Open Medical-LLM. Created in partnership with researchers from the non-profit Open Life Science AI and the Natural Language Processing Group at the University of Edinburgh, Open Medical-LLM aims to standardize how the performance of generative AI models is evaluated across a range of medical tasks.
Open Medical-LLM is not a from-scratch benchmark, per se, but rather a combination of existing test suites (MedQA, PubMedQA, MedMCQA and so on) designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics, and clinical practice. The benchmark contains open-ended and multiple-choice questions that require medical reasoning and understanding, drawing on material including U.S. and Indian medical licensing exams and question banks from college biology exams.
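To give a sense of what these constituent test suites look like, here is a minimal sketch that loads one of them, MedMCQA, from the Hugging Face Hub and prints a single multiple-choice question. The dataset identifier and field names (`opa` through `opd`, `cop`) are assumptions based on the publicly hosted version of MedMCQA and may differ from what the leaderboard uses internally.

```python
# A minimal sketch, assuming MedMCQA is available on the Hugging Face Hub
# under the "medmcqa" identifier with its usual column layout.
from datasets import load_dataset

# Each example is a multiple-choice medical question with one correct option.
medmcqa = load_dataset("medmcqa", split="validation")

example = medmcqa[0]
question = example["question"]
options = [example["opa"], example["opb"], example["opc"], example["opd"]]
answer_index = example["cop"]  # index of the correct option (0-3)

print(question)
for i, option in enumerate(options):
    marker = "*" if i == answer_index else " "
    print(f" {marker} ({chr(65 + i)}) {option}")
```

A model being benchmarked would be prompted with the question and options and scored on whether it picks the option at `answer_index`; the leaderboard aggregates accuracy of this kind across all of the included test suites.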
“(Open Medical-LLM) allows researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advances in the field and, ultimately, contribute to better patient care and outcomes,” Hugging Face wrote in a blog post.
Hugging Face is positioning the benchmark as a “robust evaluation” of generative AI models intended for healthcare. But some medical experts on social media warned against making too much of Open Medical-LLM, lest poorly informed deployments follow.
Liam McCoy, a physician, noted on X that the gap between the artificial environment of answering medical questions and actual clinical practice can be quite broad:
It's great progress to see these head-to-head comparisons, but it's important that we also remember how big the gap is between the artificial environment of answering medical questions and actual clinical practice. Not to mention the idiosyncratic risks that these metrics cannot capture.
– Liam McCoy, MD MSc (@LiamGMcCoy), April 18, 2024 (twitter.com/LiamGMcCoy/status/1780952462821863715)
Hugging Face research scientist Clémentine Fourrier, co-author of the blog post, agreed.
“These leaderboards should only be used as a first approximation of which (generative AI model) to explore for a given use case, but then a deeper phase of testing is always needed to examine the model's limits and relevance in real conditions,” Fourrier responded on X (twitter.com/clefourrier/status/1780955155300745247).
This is reminiscent of Google's experience when it tried to bring an AI-powered diabetic retinopathy screening tool to health systems in Thailand.
Google created a deep learning system that scans images of the eye for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing (blog.google/technology/health/healthcare-ai-systems-put-people-center/), frustrating both patients and nurses with inconsistent results and a general lack of harmony with practices on the ground.
Tellingly, of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none uses generative AI. It is exceptionally difficult to test how the performance of a generative AI tool in the lab will translate to hospitals and outpatient clinics and, perhaps more importantly, how the results might evolve over time.
That's not to say that Open Medical-LLM isn't useful or informative. The results leaderboard, at the very least, serves as a reminder of how poorly the models answer basic health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for carefully thought-out real-world testing.