Recent advances in medical multimodal large language models (MLLMs) have demonstrated significant progress in medical decision making. However, many models, such as Med-Flamingo and LLaVA-Med, are task-specific and require large datasets and heavy computational resources, limiting their feasibility in clinical settings. While the Mixture of Experts (MoE) strategy offers a way to reduce computational cost by activating smaller, task-specific modules, its application in the medical domain remains largely unexplored. Lightweight yet effective models that handle diverse tasks and scale well are essential for broader clinical utility in resource-limited settings.
Researchers from Zhejiang University, the National University of Singapore, and Peking University present Med-MoE, a lightweight framework for multimodal medical tasks such as Med-VQA and medical image classification. Med-MoE integrates domain-specific experts with a global meta-expert, emulating hospital workflows in which cases are routed to specialists. The model aligns medical images with text, uses instruction tuning for multimodal tasks, and employs a router to activate the relevant experts. Med-MoE outperforms or matches state-of-the-art models such as LLaVA-Med while activating only 30%-50% of the parameters. Evaluated on datasets such as VQA-RAD and Path-VQA, it shows strong potential for improving medical decision making in resource-limited settings.
Advances in medical MLLMs such as Med-Flamingo, Med-PaLM M, and LLaVA-Med have significantly improved medical diagnosis by building on general-purpose AI models such as ChatGPT and GPT-4. These models improve few-shot learning and medical question answering, but they are often expensive to train and deploy, leaving them underutilized in resource-limited settings. The MoE approach in MLLMs improves task handling and efficiency, either by activating different experts for specific tasks or by replacing standard layers with MoE structures. However, existing methods often struggle with modality biases and lack effective specialization for diverse medical data.
The Med-MoE framework is trained in three stages. First, in the Multimodal Medical Alignment stage, a vision encoder converts medical images into image tokens, which are combined with text tokens to train the language model to align the two modalities. Second, during Instruction Tuning and Routing, the model learns to handle diverse medical tasks and generate responses, while a router is trained to identify the modality of each input. Finally, in Domain-Specific MoE Tuning, the framework replaces the model's feed-forward network with an MoE structure in which a meta-expert captures global information and domain-specific experts handle specialized tasks, optimizing the model for accurate medical decision making.
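To make this structure concrete, here is a minimal PyTorch sketch of an MoE feed-forward layer with an always-active meta-expert plus router-selected domain experts. The class names, dimensions, and top-k routing choice are illustrative assumptions for exposition, not the authors' released code.

```python
# Illustrative sketch only: a simplified MoE feed-forward layer with one
# always-active meta-expert and router-selected domain experts, loosely
# following the Med-MoE description above. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardExpert(nn.Module):
    """A standard transformer FFN block acting as one expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MedMoELayer(nn.Module):
    """Replaces a dense FFN: meta-expert output plus weighted top-k domain experts."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=4, top_k=2):
        super().__init__()
        self.meta_expert = FeedForwardExpert(d_model, d_hidden)  # global knowledge
        self.experts = nn.ModuleList(
            FeedForwardExpert(d_model, d_hidden) for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores each domain expert
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Each token is routed to its top-k experts.
        scores = self.router(x)                         # (B, S, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (B, S, top_k)
        weights = F.softmax(weights, dim=-1)
        out = self.meta_expert(x)                       # meta-expert is always active
        # For clarity every expert runs densely on all tokens; efficient
        # implementations dispatch only the tokens routed to each expert.
        expert_outs = [expert(x) for expert in self.experts]
        for k in range(self.top_k):
            for e, e_out in enumerate(expert_outs):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens sent to expert e
                out = out + mask * weights[..., k:k + 1] * e_out
        return out

# Quick smoke test
layer = MedMoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```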
The study evaluates Med-MoE using multiple datasets and metrics, including accuracy, precision, and recall, building on StableLM (1.7B) and Phi-2 (2.7B) as base language models. Med-MoE (Phi-2) outperforms LLaVA-Med on VQA and medical image classification tasks, achieving 91.4% accuracy on PneumoniaMNIST. MoE-Tuning consistently outperforms standard supervised fine-tuning (SFT), and integrating LoRA reduces GPU memory usage and improves inference speed. Simpler router architectures and well-specialized experts improve model efficiency, with two to four activated experts striking an effective balance between throughput and computation.
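As a rough illustration of the sparsity arithmetic behind such savings, the helper below computes the activated-parameter fraction of an MoE layer; the formula and sizes are back-of-envelope assumptions for intuition, not figures from the paper.

```python
# Back-of-envelope sketch (assumed formula, not from the paper): with n_experts
# domain experts of which top_k are activated per token, plus always-on shared
# parameters, the activated fraction of MoE-layer parameters is
# (shared + top_k * expert) / (shared + n_experts * expert).
def activated_fraction(shared: float, expert: float, n_experts: int, top_k: int) -> float:
    total = shared + n_experts * expert
    active = shared + top_k * expert
    return active / total

# e.g., shared block and experts of equal size, 4 experts, top-2 routing:
print(f"{activated_fraction(shared=1.0, expert=1.0, n_experts=4, top_k=2):.0%}")  # 60%
```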
In conclusion, Med-MoE is a streamlined framework for multimodal medical tasks that delivers strong performance in resource-constrained environments by aligning medical images with language-model tokens, instruction tuning for specific tasks, and domain-specific MoE fine-tuning. It achieves state-of-the-art results while reducing the number of activated parameters. Despite its efficiency, Med-MoE faces challenges such as limited medical training data, driven by privacy concerns and high manual-annotation costs. The model also struggles with complex, open-ended questions and must deliver reliable, explainable results for critical healthcare applications. Med-MoE offers a practical path to advanced medical AI in constrained environments but needs improvements in data scalability and model reliability.
Take a look at the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter (twitter.com/Marktechpost) and LinkedIn, and join our Telegram Channel.
If you like our work, you will love our newsletter.
Don't forget to join our 50k+ ML SubReddit.
Sana Hassan, a Consulting Intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.