In fashion search and recommendation, multimodal techniques fuse textual and visual data to achieve greater accuracy and personalization. Because these systems can evaluate both images and textual descriptions of clothing items, users get more precise search results and more personalized recommendations. By combining image recognition with natural language processing, they offer a more natural, contextual way to shop, helping users discover clothing that best suits their tastes and preferences.
Marqo releases two new state-of-the-art multimodal models for search and recommendation in the fashion domain: Marqo-FashionCLIP and Marqo-FashionSigLIP. Both models generate text and image embeddings for use in downstream search and recommendation systems. The models were trained on over one million fashion items with extensive metadata, including materials, colors, styles, keywords, and descriptions.
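A minimal sketch of generating embeddings with these models, assuming the checkpoints are published on the Hugging Face Hub under an ID like `Marqo/marqo-fashionCLIP` and are loadable through the open_clip library (the hub ID and image file name here are assumptions for illustration):

```python
import open_clip
import torch
from PIL import Image

# Load the model and its preprocessing transforms from the Hugging Face Hub.
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:Marqo/marqo-fashionCLIP"
)
tokenizer = open_clip.get_tokenizer("hf-hub:Marqo/marqo-fashionCLIP")
model.eval()

image = preprocess(Image.open("dress.jpg")).unsqueeze(0)  # 1 x 3 x H x W
text = tokenizer(["a red floral summer dress"])           # 1 x context_len

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    # Normalize so dot products become cosine similarities.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

print(f"cosine similarity: {(image_emb @ text_emb.T).item():.3f}")
```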
The team fine-tuned two pre-existing base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) using GCL (Generalized Contrastive Learning). The seven-part loss optimizes over metadata fields such as keywords, categories, details, colors, materials, and long descriptions. This multi-part loss proved far superior to the conventional InfoNCE text-image loss for contrastive fine-tuning, producing models that yield better search results on both shorter, keyword-like queries and descriptive text.
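An illustrative sketch of a multi-part contrastive loss in the spirit of GCL: one InfoNCE term per metadata field, combined with per-field weights. The field names and weights below are hypothetical, not the authors' exact configuration:

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings."""
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def multi_part_loss(img_emb: torch.Tensor,
                    field_embs: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-field contrastive losses against the same images."""
    return sum(weights[name] * info_nce(img_emb, emb)
               for name, emb in field_embs.items())

# Hypothetical per-field weights: each field contributes its own text batch.
weights = {"title": 1.0, "keywords": 0.5, "colors": 0.5,
           "materials": 0.5, "description": 1.0}
```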
The researchers evaluated the models on seven publicly available fashion datasets that were not part of the training data: iMaterialist, DeepFashion (In-Store), DeepFashion (Multimodal), Fashion200K, KAGL, Atlas, and Polyvore. Each dataset is linked to distinct downstream tasks depending on its available metadata. The evaluation focused on three task types: text-to-image, category-to-product, and subcategory-to-product retrieval. The text-to-image task uses distinct text fields to mimic longer descriptive queries (akin to tail queries), while the category-to-product and subcategory-to-product tasks use shorter, keyword-like queries (akin to head queries) that admit multiple valid results.
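A minimal sketch of the text-to-image metric, assuming one ground-truth image per query: recall@1 is the fraction of queries whose top-ranked image is the correct one (for the category tasks, precision@1 would instead check whether the top result is any of the valid products):

```python
import torch

def recall_at_1(text_embs: torch.Tensor, image_embs: torch.Tensor) -> float:
    """text_embs[i] is the query for image_embs[i]; both L2-normalized."""
    sims = text_embs @ image_embs.T          # queries x gallery similarities
    top1 = sims.argmax(dim=1)                # highest-scoring image per query
    correct = (top1 == torch.arange(len(text_embs))).float()
    return correct.mean().item()
```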
In a comprehensive performance comparison, Marqo-FashionCLIP and Marqo-FashionSigLIP outperform both their base models and prior fashion-specific models across the board. Compared to FashionCLIP 2.0, Marqo-FashionCLIP improved recall@1 (text-to-image) by 22%, precision@1 (category-to-product) by 8%, and precision@1 (subcategory-to-product) by 11%. Similarly, Marqo-FashionSigLIP improved on those same three metrics by 57%, 11%, and 13% respectively, demonstrating its superiority over other models.
The study covers queries of varying lengths, from simple categories to long descriptions, and the results, broken down by query type, demonstrate the models' robustness across query lengths and types. Marqo-FashionCLIP and Marqo-FashionSigLIP deliver superior performance while remaining efficient, offering roughly 10% faster inference than current fashion-specific models.
Researchers have released Marqo-FashionCLIP and Marqo-FashionSigLIP under the Apache 2.0 license. Users can download the standard implementations directly from Hugging Face and use them anywhere.
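A hedged end-to-end usage sketch for a simple search application: embed a small catalog of item images once, then rank them against a free-text query by cosine similarity. The hub ID `Marqo/marqo-fashionSigLIP` and the image file names are assumptions for illustration:

```python
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:Marqo/marqo-fashionSigLIP"
)
tokenizer = open_clip.get_tokenizer("hf-hub:Marqo/marqo-fashionSigLIP")
model.eval()

# Embed the catalog once, offline (hypothetical file names).
paths = ["jacket.jpg", "sneakers.jpg", "scarf.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])
with torch.no_grad():
    catalog = model.encode_image(images)
    catalog = catalog / catalog.norm(dim=-1, keepdim=True)

# Embed the query at search time and rank items by cosine similarity.
with torch.no_grad():
    q = model.encode_text(tokenizer(["black leather jacket"]))
    q = q / q.norm(dim=-1, keepdim=True)

ranking = (q @ catalog.T).squeeze(0).argsort(descending=True)
print([paths[i] for i in ranking])  # catalog items, best match first
```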
Take a look at the Details and Model Card. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Don't forget to join our 48k+ ML SubReddit.
Find upcoming AI webinars here.
Dhanshree Shenwai is a Computer Science Engineer with extensive experience in FinTech companies spanning the Finance, Cards & Payments and Banking space and is keenly interested in the applications of artificial intelligence. She is excited to explore new technologies and advancements in today’s ever-changing world, making life easier for everyone.