Large language models (LLMs) are increasingly used in healthcare for tasks such as question answering and document summarization, with performance approaching that of subject matter experts. However, their effectiveness on traditional biomedical tasks such as structured information extraction remains unclear. While LLMs have successfully generated free-text outputs, current approaches mainly focus on improving the models' internal knowledge through methods such as fine-tuning and in-context learning. These methods rely on readily available data, which is often insufficient in the biomedical domain due to domain shift and a lack of resources for specific structured tasks, making zero-shot performance critical but underexplored.
Researchers from several institutions, including ASUS Intelligent Cloud Services, Imperial College London, and the University of Manchester, conducted a study to evaluate the performance of LLMs on medical classification and named entity recognition (NER) tasks. Their aim was to analyze how factors such as task-specific reasoning, domain knowledge, and the incorporation of external experts influence LLM performance. Their findings revealed that standard prompting outperformed more complex techniques such as chain-of-thought (CoT) reasoning and retrieval-augmented generation (RAG). The study highlights the challenges of applying advanced prompting methods to biomedical tasks and emphasizes the need for better integration of external knowledge into LLMs for real-world applications.
The existing literature on benchmarking LLMs in the medical domain primarily focuses on tasks such as question answering, summarization, and clinical coding, and often neglects structured prediction tasks such as document classification and named entity recognition. While prior work has provided valuable resources for traditional structured tasks, many benchmarks overlook them in favor of evaluating domain-specific models. Recent approaches to improving LLM performance include domain-specific pretraining, instruction tuning, CoT reasoning, and RAG. However, these methods lack systematic evaluation in the context of structured prediction, a gap the study aims to address.
To evaluate the performance of LLMs on structured prediction tasks, the study compares a variety of models on biomedical text classification and NER in a realistic zero-shot setting. This approach assesses the models' inherent parametric knowledge, which matters because annotated biomedical data is sparse. The researchers compare this baseline performance against the gains from CoT reasoning, RAG, and self-consistency while holding the parametric knowledge constant. The techniques are evaluated on a variety of datasets, including English and non-English sources, and the models are constrained to produce structured output.
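The paper's exact prompts and pipeline are not reproduced here, but the short Python sketch below illustrates how the compared zero-shot strategies (Standard Prompting, CoT, and RAG) are typically constructed for a biomedical classification task; the label set, prompt wording, and function names are hypothetical, not taken from the study.

```python
# Illustrative sketch only: label set, wording, and helper names are
# hypothetical stand-ins for a zero-shot biomedical classification setup.

LABELS = ["cardiology", "oncology", "neurology"]  # hypothetical label set


def standard_prompt(document: str) -> str:
    # Standard (direct) prompting: ask for the label only, constraining
    # the model to a structured, parseable answer.
    return (
        f"Classify the biomedical document below into one of {LABELS}.\n"
        f"Answer with the label only.\n\nDocument:\n{document}\nLabel:"
    )


def cot_prompt(document: str) -> str:
    # Chain-of-thought prompting: elicit task-specific reasoning before
    # the final structured answer.
    return (
        f"Classify the biomedical document below into one of {LABELS}.\n"
        "Think step by step about the clinical evidence, then give the "
        "final label on a new line prefixed with 'Label:'.\n\n"
        f"Document:\n{document}"
    )


def rag_prompt(document: str, retrieved_passages: list[str]) -> str:
    # Retrieval-augmented prompting: prepend external domain knowledge
    # (e.g., passages retrieved from a biomedical corpus) to the query.
    context = "\n".join(retrieved_passages)
    return (
        f"External knowledge:\n{context}\n\n"
        f"Using the knowledge above, classify the document into one of {LABELS}.\n"
        f"Answer with the label only.\n\nDocument:\n{document}\nLabel:"
    )
```

Self-consistency, the remaining technique, would then sample the CoT prompt several times and take a majority vote over the parsed labels.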
The evaluation results reveal that reasoning and knowledge-enhancement techniques generally do not improve performance. Standard Prompting consistently yields the highest F1 scores on classification tasks across all models, with BioMistral-7B, Llama-2-70B, and Llama-2-7B scoring 36.48%, 40.34%, and 34.92%, respectively. More complex methods such as CoT prompting and RAG frequently fail to outperform Standard Prompting. Larger models such as Llama-2-70B do improve significantly, especially on tasks requiring advanced reasoning. However, performance on multilingual and private datasets remains inferior, high-complexity tasks still leave room for improvement, and RAG techniques show inconsistent benefits.
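For context on how such F1 numbers are obtained, the snippet below is a minimal sketch of scoring a model's structured classification outputs against gold labels with scikit-learn; the toy labels and the macro-averaging choice are assumptions, not details from the paper.

```python
# Minimal sketch of classification F1 scoring; the labels below are toy
# values, not data from the study.
from sklearn.metrics import f1_score

gold = ["oncology", "cardiology", "neurology", "oncology"]
pred = ["oncology", "neurology", "neurology", "oncology"]

# Macro-averaged F1 over the label set (the averaging scheme used in the
# paper is not restated here, so "macro" is an assumption).
print(f1_score(gold, pred, average="macro"))
```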
The study compares LLMs on medical classification and NER, revealing important insights. Despite advanced techniques such as CoT and RAG, Standard Prompting consistently outperforms these methods across all tasks. This underscores a fundamental limitation in the generalizability and effectiveness of LLMs for extracting structured biomedical information. The results show that current state-of-the-art prompting methods do not translate well to biomedical tasks, emphasizing the need to integrate domain-specific knowledge and reasoning capabilities to improve LLM performance in real-world healthcare applications.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a Consulting Intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.