The extraction, analysis, and interpretation of medical data from unstructured clinical literature falls under the emerging discipline of clinical natural language processing (NLP). Despite its importance, clinical NLP poses particular methodological difficulties. For example, clinical texts are dense with acronyms and specialized medical terminology that can confound general-purpose NLP models. Fortunately, recent developments in large language models (LLMs) offer a promising solution: pre-trained on massive corpora with billions of parameters, these models naturally capture substantial clinical knowledge.
These developments highlight the need for methods that adapt LLMs to clinical settings, addressing the complexity of the terminology and fitting the models to clinical data. Although generic LLMs have great potential, using them directly for inference over clinical text is not always practical in real-world settings. First, these LLMs typically have billions of parameters, demanding substantial compute even at inference time, which translates into high infrastructure costs and long latencies. Confidential patient information contained in clinical text also raises concerns about privacy and regulatory compliance. Generating synthetic training data with LLMs is a promising technique for addressing these issues, as it harnesses the capabilities of LLMs in a resource- and privacy-aware manner.
When trained on synthetic datasets that replicate real-world clinical data, models can operate at high performance levels while complying with data privacy laws. Creating synthetic data with foundation models is one of the most active areas of study in general machine learning. However, using LLMs trained on publicly available text to create clinical data presents special obstacles: the generated data must be high quality and follow the distribution of the original dataset. To evaluate the quality of data produced by existing techniques, the researchers perform a comprehensive analysis focusing on diversity and distribution. The Central Moment Discrepancy (CMD) score and t-SNE embedding visualizations reveal a notable shift between the synthetic and real data distributions.
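To make the distribution comparison concrete, here is a minimal sketch of a CMD-style score between two sets of embeddings: it sums the distance between means and between higher-order central moments, with lower values indicating more similar distributions. This is an illustrative implementation of the general CMD formula, not the paper's exact evaluation code; the function name `cmd` and the `k_max` cutoff are our choices.

```python
import numpy as np

def cmd(x, y, k_max=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between two embedding matrices.

    x, y : (n_samples, dim) arrays, assumed bounded in [a, b].
    Lower scores mean the two distributions are more similar.
    """
    scale = b - a
    mx, my = x.mean(axis=0), y.mean(axis=0)
    # First term: distance between the means.
    score = np.linalg.norm(mx - my) / scale
    cx, cy = x - mx, y - my
    # Remaining terms: distances between k-th central moments.
    for k in range(2, k_max + 1):
        ck_x = (cx ** k).mean(axis=0)
        ck_y = (cy ** k).mean(axis=0)
        score += np.linalg.norm(ck_x - ck_y) / scale ** k
    return score
```

Comparing embeddings of synthetic text against real text with such a score makes a distribution shift quantifiable rather than just visible in a t-SNE plot.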
They also analyze the counts and frequencies of clinically relevant entities in the synthetic data; a significant decrease is observed when comparing the synthetic data with the real data. Although several studies have explored generating clinical data with language models, many of these efforts are task-specific, covering, for example, electronic medical records, clinical notes, medical text extraction, and medical conversations. These studies often rely on large amounts of training data and typically use language models directly for text generation. Few coherent approaches exist for adapting LLMs to produce synthetic text that benefits downstream clinical applications.
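An entity-frequency comparison of the kind described above can be sketched with simple string matching; in practice one would use a clinical NER model, but the toy documents, the entity vocabulary, and the `entity_stats` helper below are all hypothetical illustrations of the idea.

```python
from collections import Counter

def entity_stats(docs, entities):
    """Count how often each clinical entity string appears across documents."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for ent in entities:
            counts[ent] += text.count(ent.lower())
    return counts

# Toy examples standing in for real and LLM-generated clinical notes.
real = ["Patient presents with hypertension and diabetes.",
        "History of diabetes mellitus; started metformin."]
synthetic = ["The patient felt unwell and was given medication."]
vocab = ["hypertension", "diabetes", "metformin"]

real_counts = entity_stats(real, vocab)
syn_counts = entity_stats(synthetic, vocab)
# Entities present in real data but absent from synthetic data
# signal the coverage drop the analysis reports.
coverage_drop = sum(1 for e in vocab
                    if real_counts[e] > 0 and syn_counts[e] == 0)
```

Here the synthetic note mentions none of the clinical entities, so `coverage_drop` flags all three vocabulary terms, mirroring in miniature the decrease observed at scale.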
Inspired by previous research, researchers from Emory University and the Georgia Institute of Technology introduced CLINGEN, a generic framework infused with clinical knowledge for producing high-quality clinical text in few-shot settings. Its ultimate goals are to increase the topical diversity of the generated text and to close the gap between synthetic and real data. To achieve this, they propose using clinical knowledge extraction to contextualize the prompts: clinical topics are drawn from knowledge graphs (KGs) and LLMs, and writing-style suggestions are elicited from LLMs. In this way, CLINGEN combines the parametric knowledge embedded in large language models with non-parametric insights from external clinical knowledge graphs.
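The prompt-contextualization step can be pictured as combining sampled KG topics with a sampled writing style before calling the generator. The sketch below is our own simplification under that reading; `build_prompt`, its parameters, and the example topics and styles are assumptions for illustration, not CLINGEN's actual prompt templates.

```python
import random

def build_prompt(task, kg_topics, styles, n_topics=2, seed=None):
    """Assemble a knowledge-infused generation prompt.

    kg_topics : clinical entities sampled from an external knowledge
                graph (or suggested by an LLM).
    styles    : writing-style hints elicited from the LLM itself.
    """
    rng = random.Random(seed)
    topics = rng.sample(kg_topics, n_topics)   # vary topics per prompt
    style = rng.choice(styles)                 # vary writing style
    return (f"Write a {style} for the task '{task}'. "
            f"The text should discuss: {', '.join(topics)}.")

prompt = build_prompt(
    task="medical question answering",
    kg_topics=["sepsis", "pneumonia", "anticoagulant therapy"],
    styles=["concise clinical note", "patient-doctor dialogue"],
    seed=0,
)
```

Sampling a fresh topic set and style for each generation call is what pushes the synthetic corpus toward greater diversity instead of collapsing onto a few stereotyped outputs.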
It is important to note that CLINGEN can be easily applied to various fundamental clinical NLP tasks and requires very little additional human effort. Below is a summary of the work's contributions:
• They propose CLINGEN, a generic, clinical-knowledge-infused framework for generating clinical text data in few-shot settings.
• They offer a simple yet effective method of using clinical knowledge extraction to tailor prompts to the target clinical NLP task, which can be easily applied across clinical NLP activities. This involves drawing clinical topics from KGs and LLMs and eliciting writing-style suggestions from LLMs.
• They perform a comprehensive analysis of synthetic clinical data generation across 16 datasets and 7 clinical NLP tasks. Experimental results show that CLINGEN increases the diversity of the generated training samples while aligning more closely with the original data distribution. The empirical gains (8.98% for PubMedBERT-Base and 7.27% for PubMedBERT-Large) are consistent across tasks with different LLMs and classifiers.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.