Pre-training robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale data sets that can be noisy, potentially misaligned, and long-tailed. Previous work has shown promising results in augmenting data sets with synthetically generated samples. However, these methods support only domain-specific, ad hoc use cases (e.g., images or text alone, but not both) and offer limited data diversity because they lack fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for robust and data-efficient multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., delete, add, or replace operations), and recompose the elements to synthesize images or texts. The decompose-and-recompose design in CtrlSynth lets users control data synthesis at a fine granularity by defining custom control policies that manipulate the basic elements. CtrlSynth leverages pre-trained foundation models, such as large language models or diffusion models, to reason over and recompose the basic elements, so that synthetic samples are natural and diverse in composition. CtrlSynth is a modular, training-free, closed-loop framework that readily supports different pre-trained models. In extensive experiments on 31 data sets spanning different vision and vision-language tasks, we demonstrate that CtrlSynth substantially improves the zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
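To make the decompose-control-recompose idea concrete, below is a minimal, hypothetical Python sketch of the pipeline loop. The data structure `VisualElements`, the policy format, and the function names are illustrative assumptions rather than the paper's implementation, and the language-model and diffusion-model recomposers are stubbed out as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical container for the basic elements decomposed from an image
# (objects, attributes, relations); the schema is an assumption for illustration.
@dataclass
class VisualElements:
    objects: List[str]      # e.g., ["dog", "frisbee"]
    attributes: List[str]   # e.g., ["brown"]
    relations: List[str]    # e.g., ["catching"]

def apply_policy(elements: VisualElements, op: str, field: str, value: str) -> VisualElements:
    """Apply one user-specified control policy (add / delete / replace) to one element field."""
    items = list(getattr(elements, field))
    if op == "add":
        items.append(value)
    elif op == "delete":
        items = [x for x in items if x != value]
    elif op == "replace":
        old, new = value.split("->")
        items = [new if x == old else x for x in items]
    else:
        raise ValueError(f"unknown op: {op}")
    return VisualElements(**{**elements.__dict__, field: items})

def synthesize(
    elements: VisualElements,
    policies: List[Tuple[str, str, str]],
    recompose_text: Callable[[VisualElements], str],  # stand-in for an LLM-based recomposer
    recompose_image: Callable[[str], object],         # stand-in for a text-to-image model
):
    """Decompose -> apply control policies -> recompose into a synthetic (caption, image) pair."""
    for op, field, value in policies:
        elements = apply_policy(elements, op, field, value)
    caption = recompose_text(elements)
    image = recompose_image(caption)
    return caption, image

# Example usage: replace the object "dog" with "cat" and add the attribute "snowy".
if __name__ == "__main__":
    elems = VisualElements(objects=["dog", "frisbee"], attributes=["brown"], relations=["catching"])
    caption, _ = synthesize(
        elems,
        policies=[("replace", "objects", "dog->cat"), ("add", "attributes", "snowy")],
        recompose_text=lambda e: f"A {' and '.join(e.attributes)} {e.objects[0]} {e.relations[0]} a {e.objects[1]}",
        recompose_image=lambda prompt: None,  # placeholder for a diffusion-model call
    )
    print(caption)  # "A brown and snowy cat catching a frisbee"
```

In a real instantiation of the framework, the two recomposer callables would wrap a pre-trained large language model and a diffusion model, which is what allows the recomposed samples to remain natural rather than template-like.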