Supervised learning for medical image classification is constrained by the scarcity of labeled data, since expert annotations are costly and difficult to obtain. Vision-language models (VLMs) address this problem by leveraging visual-textual alignment, enabling unsupervised learning and reducing reliance on labeled data. Pre-training on large medical image-text datasets lets VLMs generate accurate labels and captions, cutting annotation costs. Active learning prioritizes the most informative samples for expert annotation, while transfer learning fine-tunes pre-trained models on specific medical datasets. VLMs can also generate synthetic images and annotations, improving data diversity and model performance on medical imaging tasks.
Researchers from Mohamed Bin Zayed University of Artificial Intelligence and the Inception Institute of Artificial Intelligence propose MedUnA, an unsupervised adaptation method for medical image classification. MedUnA employs two-stage training: the adapter is first pre-trained on text descriptions generated by an LLM and aligned with class labels, followed by unsupervised training on unlabeled images. The adapter is attached to MedCLIP's visual encoder, and entropy minimization is used to align visual and text embeddings. MedUnA addresses the modality gap between textual and visual data, improving classification performance without extensive pre-training. The method efficiently adapts vision-language models for medical tasks, reducing reliance on labeled data and improving scalability.
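To make the setup concrete, here is a minimal PyTorch sketch of a lightweight cross-modal adapter sitting on top of a frozen visual encoder, reading off class logits from similarity with text embeddings. The layer sizes, temperature, and names (`CrossModalAdapter`, `visual_dim`, etc.) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAdapter(nn.Module):
    """Maps frozen visual features into the shared embedding space so that
    class logits can be read off via cosine similarity with text embeddings."""
    def __init__(self, visual_dim=512, text_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, visual_feats, text_embeds, temperature=0.07):
        # visual_feats: (B, visual_dim) from a frozen visual encoder (e.g. MedCLIP)
        # text_embeds:  (C, text_dim) embeddings of LLM-generated class descriptions
        z = F.normalize(self.net(visual_feats), dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        return z @ t.t() / temperature  # (B, C) class logits
```

Only the adapter (and, later, a learnable prompt) is trained; the visual and text encoders stay frozen.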
A common way of using VLMs for medical images involves extensive pre-training on large datasets, followed by fine-tuning for tasks such as classification, segmentation, and report generation. Unlike these resource-intensive strategies, MedUnA leverages the existing alignment between visual and textual embeddings to avoid large-scale pre-training. It relies on unlabeled images together with disease-category descriptions generated automatically by an LLM. A lightweight adapter and a learnable prompt vector are trained to minimize self-entropy, encouraging consistent predictions across multiple data augmentations. MedUnA therefore delivers improved efficiency and performance without the need for extensive pre-training.
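The self-entropy objective mentioned above can be written as the Shannon entropy of the softmax prediction, averaged over a batch; the sketch below is a generic formulation in PyTorch, not the paper's exact loss.

```python
import torch

def self_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax predictions; minimizing it pushes
    the adapter toward confident class assignments on unlabeled images."""
    probs = logits.softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
```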
The methodology consists of two stages: adapter pre-training and unsupervised training. In Stage 1, textual descriptions are generated for each class with an LLM and embedded by a text encoder. A cross-modal adapter is trained by minimizing the cross-entropy between the resulting logits and the ground-truth class labels. In Stage 2, the adapter is further trained on medical images in an unsupervised manner: weak and strong augmentations of each input are passed through two branches. The strong branch carries a learnable prompt, and training minimizes the discrepancy between the outputs of the two branches. Inference uses the optimized strong branch.
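Under the assumptions above (frozen encoders, trainable adapter and prompt), the two stages might look roughly like the sketch below. The consistency term between branches is written as a KL divergence purely for illustration, and the function and variable names (`stage1_step`, `desc_embeds`, `visual_encoder`, ...) are hypothetical rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

# Stage 1 (sketch): train the adapter on text alone. Because the VLM's text
# and image embeddings share one space, each LLM-generated description
# embedding can stand in for a visual feature and be supervised with
# cross-entropy against its class label.
def stage1_step(adapter, desc_embeds, desc_labels, class_embeds, optimizer):
    logits = adapter(desc_embeds, class_embeds)        # (N_descriptions, C)
    loss = F.cross_entropy(logits, desc_labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Stage 2 (sketch): unsupervised training on images. Weak and strong
# augmentations of the same image go through the frozen visual encoder;
# the strong branch (which carries the learnable prompt and is used at
# inference) is pulled toward the weak branch's prediction.
def stage2_step(adapter, visual_encoder, weak_img, strong_img, class_embeds, optimizer):
    with torch.no_grad():                              # encoder stays frozen
        weak_feats = visual_encoder(weak_img)
        strong_feats = visual_encoder(strong_img)
    weak_probs = adapter(weak_feats, class_embeds).softmax(dim=-1).detach()
    strong_logp = adapter(strong_feats, class_embeds).log_softmax(dim=-1)
    loss = F.kl_div(strong_logp, weak_probs, reduction="batchmean")
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```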
The experiments evaluated the proposed method on five public medical datasets covering diseases such as tuberculosis, pneumonia, diabetic retinopathy, and skin cancer. Text descriptions for the classes in each dataset were generated with GPT-3.5 and other language models and then used to train a text classifier. The method was evaluated with both CLIP and MedCLIP visual encoders, with MedCLIP performing better overall. Unsupervised learning was used to generate pseudo-labels for unlabeled images, and the models were trained with the SGD optimizer. The results show that MedUnA achieves superior accuracy compared to the baseline models.
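As an example of how pseudo-labels could be produced in this setup, the hypothetical helper below keeps only confident predictions from the adapter; the 0.9 threshold and the SGD hyperparameters in the comment are illustrative choices, not values reported in the paper.

```python
import torch

def pseudo_label(adapter, visual_encoder, images, class_embeds, threshold=0.9):
    """Assign each unlabeled image its most similar class and keep only
    predictions whose confidence exceeds the threshold."""
    with torch.no_grad():
        feats = visual_encoder(images)
        probs = adapter(feats, class_embeds).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return labels[keep], keep

# Only the adapter (and prompt) parameters are optimized, e.g. with SGD:
# optimizer = torch.optim.SGD(adapter.parameters(), lr=1e-3, momentum=0.9)
```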
The study analyzes the experimental results, comparing MedUnA against methods such as CLIP, MedCLIP, LaFTer, and TPT. MedUnA shows notable accuracy gains on several medical datasets, outperforming zero-shot MedCLIP in most cases. Improvement on the pneumonia dataset is minimal, likely because MedCLIP's pre-training already covers similar chest X-ray data. t-SNE plots indicate that MedUnA produces clearer clustering of embeddings, which is consistent with its higher classification accuracy. The paper also explores the correlation between the text-classifier accuracy of various LLMs and MedUnA's performance, and includes an ablation study on the impact of different loss functions.
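To reproduce the kind of qualitative check described here, the adapted image embeddings can be projected with t-SNE; this is a generic visualization sketch using scikit-learn and matplotlib, with perplexity and colormap chosen for illustration rather than taken from the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="t-SNE of adapted image embeddings"):
    """Project (N, D) feature vectors to 2-D and color points by class label."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```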
Check out the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.