The development of vision-language models (VLMs) in the biomedical domain is hampered by the lack of large-scale, annotated, publicly accessible multimodal datasets across fields. While datasets have been built from biomedical literature such as PubMed, they often focus narrowly on domains like radiology and pathology, neglecting complementary areas such as molecular biology and pharmacogenomics that are critical for holistic clinical understanding. Privacy concerns, the complexity of expert-level annotation, and logistical constraints further impede the creation of comprehensive datasets. Previous efforts such as ROCO, MEDICAT, and PMC-15M have relied on domain-specific filtering and supervised models to extract millions of image-caption pairs, but these strategies often fail to capture the broader diversity of biomedical knowledge needed to build generalist biomedical VLMs.
Beyond dataset limitations, training and evaluating biomedical VLMs present their own challenges. Contrastive learning approaches such as PMC-CLIP and BiomedCLIP have shown promise by pairing literature-derived datasets with vision transformer models for image-text alignment, but their performance is constrained by smaller datasets and more limited computational resources than general-domain VLMs enjoy. Furthermore, current evaluation protocols, focused mainly on radiology and pathology tasks, lack standardization and broader applicability. Reliance on additional learnable parameters and limited datasets undermines the reliability of these evaluations, highlighting the need for scalable datasets and robust evaluation frameworks that can address the diverse demands of biomedical vision-language applications.
Researchers at Stanford University introduced BIOMEDICA, an open-source framework designed to extract, annotate, and organize the entire PubMed Central Open Access subset into one easy-to-use dataset. The archive includes more than 24 million image-text pairs from 6 million articles, enriched with metadata and expert annotations. They also released BMCA-CLIP, a suite of CLIP-style models pre-trained on BIOMEDICA via streaming, which removes the need to store the 27 TB dataset locally. These models achieve state-of-the-art performance across 40 tasks spanning radiology, dermatology, and molecular biology, with an average 6.56% improvement in zero-shot classification and reduced computational requirements.
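The streaming idea can be illustrated with a minimal, self-contained sketch. The snippet below mimics how a WebDataset-style loader yields image-caption pairs directly from a tar shard without unpacking it to disk; the shard builder, file names, and captions are illustrative stand-ins, not BIOMEDICA's actual code (in practice one would point the `webdataset` library at remote shard URLs).

```python
import io
import tarfile

def make_demo_shard() -> bytes:
    """Build a tiny in-memory tar shard using WebDataset-style naming:
    one sample key per pair of files (e.g. 000001.jpg / 000001.txt).
    The contents here are placeholders for illustration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, caption in [("000001", "Chest X-ray, PA view"),
                             ("000002", "H&E-stained tissue section")]:
            for ext, payload in [("jpg", b"\xff\xd8fake-image-bytes"),
                                 ("txt", caption.encode())]:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def stream_pairs(shard_bytes: bytes):
    """Yield (image_bytes, caption) pairs by grouping consecutive tar
    members on their shared key, the way a streaming loader consumes a
    shard sequentially without local extraction."""
    sample = {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes)) as tar:
        for member in tar:
            key, ext = member.name.rsplit(".", 1)
            if sample and sample.get("key") != key:
                yield sample["jpg"], sample["txt"].decode()
                sample = {}
            sample["key"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample["jpg"], sample["txt"].decode()
```

Because samples are consumed sequentially, a training loop only ever holds the current batch in memory, which is what makes pretraining on a 27 TB corpus feasible without local storage.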
BIOMEDICA's curation pipeline involves dataset mining, concept labeling, and serialization. Articles and media files are downloaded from the NCBI server, with metadata, captions, and figure references extracted from nXML files and the Entrez API. Images are clustered using DINOv2 embeddings and labeled with an expert-refined hierarchical taxonomy; labels are assigned by majority vote and propagated across clusters. The resulting dataset, containing over 24 million image-caption pairs and extensive metadata, is serialized in the WebDataset format for efficient streaming. With 12 global and 170 local image concepts, the taxonomy covers categories such as clinical imaging, microscopy, and data visualization, emphasizing scalability and accessibility.
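The majority-vote labeling step can be sketched in a few lines. Assuming each image has a cluster assignment (derived from DINOv2 embeddings) and only a subset carries expert labels, every image inherits its cluster's most common expert label; the function and variable names here are hypothetical, not taken from the BIOMEDICA codebase.

```python
from collections import Counter

def label_clusters(cluster_of: dict, expert_labels: dict) -> dict:
    """Propagate sparse expert labels to all images by majority vote
    within each cluster.

    cluster_of:    image id -> cluster id (from embedding clustering)
    expert_labels: image id -> expert-assigned concept (sparse subset)
    Returns:       image id -> propagated label (None if cluster unseen)
    """
    # Tally expert votes per cluster.
    votes = {}
    for image_id, label in expert_labels.items():
        cluster = cluster_of[image_id]
        votes.setdefault(cluster, Counter())[label] += 1
    # Each cluster takes its most common expert label.
    cluster_label = {c: counts.most_common(1)[0][0]
                     for c, counts in votes.items()}
    # Every image inherits its cluster's label.
    return {img: cluster_label.get(cluster_of[img]) for img in cluster_of}
```

Majority voting smooths out occasional annotator disagreement inside a cluster, while propagation extends a small set of expert decisions to the full 24 million images.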
Continual pretraining on the BIOMEDICA dataset was evaluated on 40 datasets: 39 established biomedical classification tasks plus a new Flickr-based retrieval dataset. The classification benchmark spans pathology, radiology, biology, surgery, dermatology, and ophthalmology; average precision was used for classification and recall at 1, 10, and 100 for retrieval. Concept filtering, which excludes overrepresented topics, outperformed both concept balancing and pretraining on the full dataset. The models trained on BIOMEDICA achieved state-of-the-art results, significantly outperforming previous methods on classification, retrieval, and microscopy tasks while using less data and compute.
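The retrieval metric is straightforward: for a query with a single relevant match, recall@k is 1 if that match appears in the top-k results, and the benchmark score averages this over all queries. A minimal sketch (helper names are illustrative, not the authors' evaluation code):

```python
def recall_at_k(ranked_ids: list, relevant_id: str, k: int) -> float:
    """Recall@k for single-relevant-item retrieval: 1.0 if the matching
    item appears among the top-k ranked results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(rankings: list, k: int) -> float:
    """Average recall@k over queries.

    rankings: list of (ranked result ids, the one relevant id) pairs,
    e.g. captions ranked by image-text similarity for each query image.
    """
    return sum(recall_at_k(r, rel, k) for r, rel in rankings) / len(rankings)
```

Reporting recall at 1, 10, and 100 then amounts to calling `mean_recall_at_k` with each cutoff.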
In conclusion, BIOMEDICA is a comprehensive framework that transforms the PubMed Central Open Access (PMC-OA) subset into the largest deep-learning-ready biomedical dataset: 24 million image-caption pairs enriched with 27 metadata fields. Designed to address the shortage of diverse, annotated biomedical datasets, BIOMEDICA provides a scalable, open-source solution for extracting and annotating multimodal data from over 6 million articles. Through continual pre-training of CLIP-style models on BIOMEDICA, the framework achieves state-of-the-art zero-shot classification and image-text retrieval across 40 biomedical tasks while requiring 10x less compute and 2.5x less data. All resources, including models, datasets, and code, are publicly available.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and artificial intelligence to address real-world challenges. With a strong interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.