Large language models (LLMs) are increasingly used in healthcare for tasks such as question answering and document summarization, with performance approaching that of subject matter experts. However, their effectiveness on traditional biomedical tasks such as structured information extraction remains unclear. While LLMs have successfully generated free-text outputs, current approaches mainly focus on improving the models' internal knowledge through methods such as fine-tuning and in-context learning. These methods rely on readily available data, which is often insufficient in the biomedical domain due to domain shift and a lack of resources for specific structured tasks, making zero-shot performance critical but underexplored.
Researchers from several institutions, including ASUS Intelligent Cloud Services, Imperial College London, and the University of Manchester, conducted a study to evaluate the performance of LLMs on medical classification and named entity recognition (NER) tasks. Their aim was to analyze how factors such as task-specific reasoning, domain knowledge, and the incorporation of external experts influence LLM performance. Their findings revealed that standard prompting outperformed more complex techniques such as chain-of-thought (CoT) reasoning and retrieval-augmented generation (RAG). The study highlights the challenges of applying advanced prompting methods to biomedical tasks and emphasizes the need for better integration of external knowledge into LLMs for real-world applications.
The existing literature on benchmarking LLMs in the medical domain primarily focuses on tasks such as question answering, summarization, and clinical coding, and often neglects structured prediction tasks such as document classification and named entity recognition. While prior work has provided valuable resources for traditional structured tasks, many benchmarks overlook them in favor of evaluating domain-specific models. Recent approaches to improving LLM performance include domain-specific pretraining, instruction tuning, CoT reasoning, and RAG. However, these methods lack systematic evaluation in the context of structured prediction, a gap the study aims to address.
To evaluate the performance of LLMs on structured prediction tasks, the study compares a variety of models on biomedical text classification and NER in a realistic zero-shot setting. This approach assesses the models' inherent parametric knowledge, which matters because annotated biomedical data is sparse. The researchers compare this baseline performance against the gains from CoT reasoning, RAG, and self-consistency while holding the parametric knowledge constant. The techniques are evaluated on a variety of datasets, including English and non-English sources, and the models are constrained to produce structured output.
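The paper's exact prompts and pipeline are not reproduced here, but the short Python sketch below illustrates how the compared zero-shot strategies (Standard Prompting, CoT, and RAG) are typically constructed for a biomedical classification task; the label set, prompt wording, and function names are hypothetical, not taken from the study.

```python
# Illustrative sketch only: label set, wording, and helper names are
# hypothetical stand-ins for a zero-shot biomedical classification setup.

LABELS = ["cardiology", "oncology", "neurology"]  # hypothetical label set


def standard_prompt(document: str) -> str:
    # Standard (direct) prompting: ask for the label only, constraining
    # the model to a structured, parseable answer.
    return (
        f"Classify the biomedical document below into one of {LABELS}.\n"
        f"Answer with the label only.\n\nDocument:\n{document}\nLabel:"
    )


def cot_prompt(document: str) -> str:
    # Chain-of-thought prompting: elicit task-specific reasoning before
    # the final structured answer.
    return (
        f"Classify the biomedical document below into one of {LABELS}.\n"
        "Think step by step about the clinical evidence, then give the "
        "final label on a new line prefixed with 'Label:'.\n\n"
        f"Document:\n{document}"
    )


def rag_prompt(document: str, retrieved_passages: list[str]) -> str:
    # Retrieval-augmented prompting: prepend external domain knowledge
    # (e.g., passages retrieved from a biomedical corpus) to the query.
    context = "\n".join(retrieved_passages)
    return (
        f"External knowledge:\n{context}\n\n"
        f"Using the knowledge above, classify the document into one of {LABELS}.\n"
        f"Answer with the label only.\n\nDocument:\n{document}\nLabel:"
    )
```

Self-consistency, the remaining technique, would then sample the CoT prompt several times and take a majority vote over the parsed labels.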
The evaluation results reveal that reasoning and knowledge-enhancement techniques generally do not improve performance. Standard Prompting consistently yields the highest F1 scores on classification tasks across all models, with BioMistral-7B, Llama-2-70B, and Llama-2-7B scoring 36.48%, 40.34%, and 34.92%, respectively. More complex methods such as CoT prompting and RAG frequently fail to outperform Standard Prompting. Larger models such as Llama-2-70B do improve significantly, especially on tasks requiring advanced reasoning. However, performance on multilingual and private datasets remains inferior, high-complexity tasks still leave room for improvement, and RAG techniques show inconsistent benefits.
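For context on how such F1 numbers are obtained, the snippet below is a minimal sketch of scoring a model's structured classification outputs against gold labels with scikit-learn; the toy labels and the macro-averaging choice are assumptions, not details from the paper.

```python
# Minimal sketch of classification F1 scoring; the labels below are toy
# values, not data from the study.
from sklearn.metrics import f1_score

gold = ["oncology", "cardiology", "neurology", "oncology"]
pred = ["oncology", "neurology", "neurology", "oncology"]

# Macro-averaged F1 over the label set (the averaging scheme used in the
# paper is not restated here, so "macro" is an assumption).
print(f1_score(gold, pred, average="macro"))
```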
The study compares LLMs on medical classification and NER, revealing important insights. Despite advanced techniques such as CoT and RAG, Standard Prompting consistently outperforms these methods across all tasks. This underscores a fundamental limitation in the generalizability and effectiveness of LLMs for extracting structured biomedical information. The results show that current state-of-the-art prompting methods do not translate well to biomedical tasks, emphasizing the need to integrate domain-specific knowledge and reasoning capabilities to improve LLM performance in real-world healthcare applications.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a Consulting Intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.