Large-scale multimodal foundation models have achieved remarkable success in understanding complex visual patterns and natural language, which has generated interest in applying them to medical vision-and-language tasks. Progress has been made by building medical datasets of image-text pairs and fine-tuning general-domain models on them. However, these datasets have limitations. They lack multi-granular annotations that link local and global information within medical images, which is crucial for identifying specific lesions from regional details. Furthermore, current methods for building these datasets rely heavily on pairing medical images with reports or captions, which limits their scalability.
Researchers at the University of California, Santa Cruz, Harvard University, and Stanford University have introduced MedTrinity-25M, a large-scale multimodal medical dataset containing over 25 million images across ten modalities. The dataset includes detailed multi-granular annotations for over 65 diseases, spanning global information, such as disease type and imaging modality, as well as local annotations, such as bounding boxes and segmentation masks for regions of interest (ROIs). Using an automated process, the researchers generated these comprehensive annotations without relying on paired text descriptions, enabling advanced multimodal tasks and supporting large-scale pretraining of medical AI models.
Medical multimodal foundation models have garnered increasing interest due to their ability to understand complex visual and textual features, leading to advances in medical vision and language tasks. Models such as Med-Flamingo and Med-PaLM have been fine-tuned on medical datasets to improve their performance, but the scale of available training data often limits these models. To address this issue, researchers have focused on building large medical datasets. However, datasets such as MIMIC-CXR and RadGenome-Chest CT are constrained by the laborious process of matching images to detailed textual descriptions. In contrast, the MedTrinity-25M dataset uses an automated process to generate comprehensive multi-granular annotations for unpaired images, offering a significantly larger and more detailed dataset.
The MedTrinity-25M dataset includes over 25 million images organized into triplets of {image, ROI, description}. Images span ten modalities and cover 65 diseases, sourced from repositories such as TCIA and Kaggle. ROIs are highlighted with masks or bounding boxes, which pinpoint key anatomical features or abnormalities, and multi-granular textual descriptions cover the imaging modality, disease, and ROI-level findings. Construction of the dataset involves generating coarse captions, identifying ROIs with models such as SAT and BA-Transformer, and leveraging medical knowledge to obtain accurate descriptions. MedTrinity-25M stands out for its scale, diversity, and detailed annotations compared to other datasets.
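To make the triplet structure concrete, the sketch below models one {image, ROI, description} record as a Python dataclass. The field names and file layout here are illustrative assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ROIAnnotation:
    """Local annotation for one region of interest (fields are illustrative)."""
    bounding_box: List[float]                     # [x_min, y_min, x_max, y_max] in pixels
    segmentation_mask_path: Optional[str] = None  # path to a binary mask, if one exists


@dataclass
class MedTrinityTriplet:
    """One {image, ROI, description} triplet as described in the paper.

    The actual MedTrinity-25M schema may differ; this is a hypothetical layout.
    """
    image_path: str           # source image, one of ten modalities
    modality: str             # global information, e.g. "X-ray", "CT", "MRI"
    disease: str              # global label, one of the 65+ covered diseases
    rois: List[ROIAnnotation] # local annotations highlighting abnormalities
    description: str          # multi-granular text: modality, disease, ROI details


# Example usage with placeholder values
triplet = MedTrinityTriplet(
    image_path="images/chest_xray_000123.png",
    modality="X-ray",
    disease="pneumonia",
    rois=[ROIAnnotation(bounding_box=[112.0, 84.0, 256.0, 198.0])],
    description="Chest X-ray showing a pneumonia-related opacity in the right lower lobe ...",
)
```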
The study evaluated LLaVA-Med++ on biomedical visual question answering (VQA) tasks using the VQA-RAD, SLAKE, and PathVQA datasets to assess the impact of pretraining on the MedTrinity-25M dataset. The initial pretraining followed the LLaVA-Med methodology, with additional fine-tuning on the VQA datasets over three epochs. The results show that LLaVA-Med++ with MedTrinity-25M pretraining outperforms the baseline model by approximately 10.75% on VQA-RAD, 6.1% on SLAKE, and 13.25% on PathVQA. It achieves state-of-the-art results on two of the three benchmarks and ranks third on the remaining one, demonstrating significant performance gains from MedTrinity-25M pretraining.
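For readers unfamiliar with how such VQA benchmarks are typically scored, the snippet below sketches an exact-match accuracy metric for closed-ended (yes/no or multiple-choice) questions. It is a simplified stand-in: the official LLaVA-Med evaluation scripts may normalize answers differently and score open-ended questions with a separate, recall-style metric.

```python
def closed_vqa_accuracy(predictions, references):
    """Exact-match accuracy for closed-ended VQA questions.

    A simplified version of the scoring convention used by benchmarks such as
    VQA-RAD, SLAKE, and PathVQA; real evaluation code may apply extra answer
    normalization (punctuation stripping, synonym matching, etc.).
    """
    assert len(predictions) == len(references), "prediction/reference count mismatch"
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Example with placeholder model outputs
preds = ["yes", "no", "left lung"]
refs = ["yes", "yes", "left lung"]
print(f"Closed-ended accuracy: {closed_vqa_accuracy(preds, refs):.2%}")  # 66.67%
```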
The study presents MedTrinity-25M, a vast multimodal medical dataset with over 25 million image-ROI-description triplets from 90 sources, spanning ten modalities and covering over 65 diseases. Unlike previous methods that relied on paired image-text data, MedTrinity-25M is built using an automated process that generates detailed annotations from unpaired images, leveraging advanced multimodal large language models (MLLMs) and domain-specific expert models. The dataset's rich multi-granular annotations support a variety of tasks, including captioning, report generation, and classification. The model pre-trained on MedTrinity-25M achieved state-of-the-art results on VQA tasks, highlighting the dataset's effectiveness for training multimodal medical AI models.
Take a look at the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter (twitter.com/Marktechpost) and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Don't forget to join our ML Subreddit.
Find upcoming AI webinars here.
Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.