Diffusion models have taken over image generation in recent months. The movement led by Stable Diffusion has been so successful at generating images from text prompts that the line between human-made and AI-generated images has become blurred.
Although this progress has made them photorealistic image generators, aligning their outputs with the text prompt is still a challenge. It can be hard to convey to the model what you actually want to generate, and it can take a lot of trial and error before you get the image you are after. This is especially problematic if you want text in the output or want certain objects placed at specific locations in the image.
But if you’ve used ChatGPT or any other large language model, you’ve probably noticed that they are extremely good at understanding what you really want and responding accordingly. So if the alignment problem is largely solved for LLMs, why do we still have it for text-to-image models?
You might first ask, “How did the LLMs manage that?” The answer is reinforcement learning from human feedback (RLHF). RLHF methods first learn a reward function that captures the aspects of the task humans care about, using human feedback on model outputs. The language model is then fine-tuned using this learned reward function.
Can’t we take the approach that fixed the alignment problem for LLMs and apply it to image generation models? That is exactly the question researchers at Google and Berkeley asked: they set out to transfer the recipe that solved LLM alignment to text-to-image models.
Their solution is to fine-tune the model for better alignment using human feedback. It is a three-step process: generate images from a set of text prompts; collect human feedback on these images; train a reward function on this feedback and use it to update the model.
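To make the loop concrete, here is a minimal sketch of the three stages. The callables it receives (generate, collect_labels, train_reward, finetune) are hypothetical placeholders standing in for the stages described above, not code from the authors.

```python
def align_with_human_feedback(generate, collect_labels, train_reward, finetune, prompts):
    """Illustrative outline of the three-stage alignment loop.

    All callables are hypothetical stand-ins for the components described
    in the text, not an implementation from the paper.
    """
    # 1) Generate candidate images for a diverse set of text prompts.
    images = [generate(p) for p in prompts]

    # 2) Collect binary human feedback (1 = image matches its prompt, 0 = it does not).
    labels = collect_labels(prompts, images)

    # 3) Train a reward function on the feedback, then use it to update the model.
    reward_model = train_reward(prompts, images, labels)
    return finetune(reward_model, prompts, images)
```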
Human data collection begins with generating a diverse set of images using the existing model, focusing on prompts where pretrained models are prone to errors, such as generating objects with specific colors, counts, and backgrounds. These generated images are then evaluated by human raters, and each one is assigned a binary label indicating whether it matches its prompt.
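Each entry of the resulting dataset boils down to a prompt, a generated image, and a binary label. The record below only illustrates that structure; the field names are assumptions, not the paper's data format.

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    prompt: str       # text prompt, e.g. one probing specific colors, counts, or backgrounds
    image_path: str   # image the pretrained model generated for this prompt
    label: float      # binary human feedback: 1.0 = matches the prompt, 0.0 = does not
```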
Once the labeled dataset is prepared, the reward function can be trained. The reward function is trained to predict human feedback given an image and its text prompt. To exploit the human feedback more effectively, it also uses an auxiliary task: identifying the original text prompt within a set of perturbed prompts. This helps the reward function generalize better to unseen images and text prompts.
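As a sketch, the two training signals could look like the PyTorch snippet below, assuming a reward_model that maps an (image embedding, prompt embedding) pair to a scalar score. The embedding inputs and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def reward_training_losses(reward_model, image_emb, prompt_emb, label,
                           perturbed_prompt_embs, temperature=0.1):
    """Sketch of the main and auxiliary reward-learning losses.

    reward_model(image_emb, prompt_emb) is assumed to return a scalar score,
    and label is a scalar tensor; the 0.1 temperature is an illustrative choice.
    """
    # Main objective: regress the binary human label (1 = aligned, 0 = not).
    score = reward_model(image_emb, prompt_emb)
    feedback_loss = F.mse_loss(score, label)

    # Auxiliary objective: pick the original prompt out of a set of perturbed ones.
    # The reward scores act as logits of a softmax classifier over the candidates,
    # with the true prompt placed at index 0.
    candidates = [prompt_emb] + list(perturbed_prompt_embs)
    logits = torch.stack([reward_model(image_emb, c) for c in candidates]) / temperature
    aux_loss = F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

    return feedback_loss, aux_loss
```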
The last step is to update the weights of the text-to-image model using reward-weighted likelihood maximization to better align its outputs with human feedback.
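Written as a loss to minimize, a reward-weighted objective of this kind might look like the sketch below, where the per-example negative log-likelihoods are assumed to come from the diffusion model's usual training loss. The extra regularization term on pre-training data and its beta weight are a common safeguard against drifting too far from the original model, included here as an assumption rather than a detail stated in this post.

```python
def reward_weighted_objective(nll_generated, rewards, nll_pretrain=None, beta=0.01):
    """Sketch of reward-weighted likelihood maximization (as a loss to minimize).

    nll_generated: per-example negative log-likelihoods on human-labeled generations.
    rewards:       reward-model scores for the same examples.
    nll_pretrain:  optional NLLs on pre-training data used as a regularizer;
                   the beta = 0.01 weight is an illustrative assumption.
    """
    # Weight each example's NLL by its reward, so images judged well-aligned
    # with their prompts pull the model's likelihood up the hardest.
    loss = (rewards * nll_generated).mean()

    # Optional regularizer: keep some plain likelihood on pre-training data
    # to limit loss of image fidelity while aligning to feedback.
    if nll_pretrain is not None:
        loss = loss + beta * nll_pretrain.mean()

    return loss
```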
This approach was tested by fine-tuning Stable Diffusion on 27K text-image pairs with human feedback. The resulting model was better at rendering objects with specific colors and showed improved compositional generation.
Check out the Paper for more details. All credit for this research goes to the researchers of this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.