Researchers from the University of Southern California, the University of Washington, Bar-Ilan University, and Google Research presented DreamSync, a framework that improves the text alignment and aesthetic appeal of diffusion-based text-to-image (T2I) models without human annotations, model architecture changes, or reinforcement learning. It works by generating candidate images, evaluating them with visual question answering (VQA) models, and fine-tuning the T2I model on the best candidates.
Previous studies proposed using VQA models to evaluate T2I generation, exemplified by TIFA, which facilitates assessment across 12 categories with 4K prompts and 25K questions. SeeTrue and training-based methods, such as RLHF and adapter training, also address T2I alignment, while training-free techniques, such as SynGen and StructuralDiffusion, adjust inference to improve alignment.
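The core idea behind TIFA-style evaluation is simple: score a generated image by the fraction of auto-generated question-answer pairs that a VQA model answers correctly. A minimal sketch of that scoring logic follows; `vqa_answer` is a hypothetical stand-in for a real VQA model (here a stub keyed on a toy "image" dictionary), not TIFA's actual API.

```python
def vqa_answer(image, question):
    # Hypothetical VQA model call; stubbed as a lookup on a toy "image" dict.
    return image.get(question, "unknown")

def tifa_score(image, qa_pairs):
    """Return the fraction of QA pairs the VQA model answers correctly."""
    if not qa_pairs:
        return 0.0
    correct = sum(1 for q, a in qa_pairs if vqa_answer(image, q) == a)
    return correct / len(qa_pairs)

# Toy example: an "image" represented directly by its VQA responses.
image = {"What animal is shown?": "dog", "What color is it?": "brown"}
qa = [("What animal is shown?", "dog"), ("What color is it?", "black")]
print(tifa_score(image, qa))  # 0.5: one of two questions answered correctly
```

In the real benchmark, the questions are generated automatically from the prompt by a language model, so the score needs no human labels.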
DreamSync addresses challenges in T2I models, improving fidelity to user intent and aesthetic appeal without relying on a specific architecture or labeled data. It introduces a model-agnostic framework that uses vision-language models (VLMs) to identify discrepancies between generated images and the input text. The method generates multiple candidate images, evaluates them with VLMs, and fine-tunes the T2I model on the best ones. DreamSync outperforms baseline methods on image alignment and can also improve other image characteristics, extending its applicability beyond alignment alone.
DreamSync employs a model-agnostic framework that aligns T2I generation with feedback from VLMs. Given a prompt, it generates multiple candidate images and evaluates them for both textual fidelity and visual aesthetics using two dedicated VLMs. The best image, as determined by VLM feedback, is used to fine-tune the T2I model, and the process repeats until convergence. In effect, this is iterative bootstrapping: the VLMs act as teacher models that label unlabeled data for T2I training.
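One iteration of the loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate`, `fidelity_score`, `aesthetic_score`, and `finetune` are hypothetical stand-ins for the T2I model, the two VLM judges, and the fine-tuning step, and the threshold value is assumed for illustration.

```python
def dreamsync_iteration(prompts, generate, fidelity_score, aesthetic_score,
                        finetune, n_candidates=8, fidelity_threshold=0.9):
    """One self-training round: sample, filter by fidelity, pick by aesthetics."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # Keep only candidates faithful enough to the prompt...
        faithful = [c for c in candidates
                    if fidelity_score(prompt, c) >= fidelity_threshold]
        if not faithful:
            continue  # no candidate passes; this prompt yields no training pair
        # ...then pick the most aesthetically pleasing one among them.
        best = max(faithful, key=aesthetic_score)
        selected.append((prompt, best))
    finetune(selected)  # fine-tune the T2I model on its own best outputs
    return selected
```

Repeating this loop with the freshly fine-tuned model as the new `generate` gives the iterative bootstrapping behavior: each round's model produces the training data for the next.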
DreamSync improves both the SDXL and SD v1.4 T2I models. Three iterations on SDXL yield a 1.7 point improvement in fidelity on TIFA and a 3.7 point improvement on DSG, while visual aesthetics improve by 3.4 points. Applying DreamSync to SD v1.4 produces a 1.0 point fidelity improvement on TIFA and a 1.7 point absolute gain on DSG, with a 0.3 point aesthetic improvement. In a comparative study, DreamSync outperforms the base SDXL on alignment, producing images with more relevant components and 3.4 more correct VQA responses. It achieves superior textual fidelity without compromising visual appearance on the TIFA and DSG benchmarks, improving steadily over iterations.
In conclusion, DreamSync is a versatile framework evaluated on challenging T2I benchmarks, showing significant improvements in alignment and visual appeal in both in-distribution and out-of-distribution settings. The framework incorporates dual feedback from vision-language models and has been validated with human ratings and a preference prediction model.
Future enhancements to DreamSync include enriching the feedback with detailed annotations, such as bounding boxes, to localize misalignments. Tailoring prompts at each iteration could target specific improvements in text-to-image synthesis, and exploring linguistic structure and attention maps aims to improve attribute-object binding. Training reward models with human feedback could further align generated images with user intent. Extending DreamSync to other model architectures, evaluating its performance, and conducting additional studies in varied settings are areas of ongoing research.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.