In machine learning, generative models that produce images from text input have made significant progress in recent years, with various approaches showing promising results. While these models have garnered considerable attention and have many potential applications, aligning them with human preferences remains a major challenge: the mismatch between pretraining distributions and actual user prompts leads to well-known defects in the generated images.
Several challenges arise when generating images from text prompts, including accurately aligning text and images, faithfully depicting the human body, adhering to human aesthetic preferences, and avoiding toxicity and bias in generated content. Addressing these challenges requires more than improving the model architecture and pretraining data. One approach explored in natural language processing is reinforcement learning from human feedback (RLHF), in which a reward model trained on expert-annotated comparisons guides the model towards human preferences and values. However, this annotation process is time-consuming and labor-intensive.
To address these challenges, a research team from China has proposed a novel solution: ImageReward, the first general-purpose text-to-image human preference reward model, trained on 137k pairs of expert comparisons based on real-world user prompts and model outputs.
To build ImageReward, the authors used a graph-based algorithm to select diverse prompts and provided annotators with a pipeline consisting of prompt annotation, text-image rating, and image ranking. They hired annotators with at least a college education to ensure consensus on the ratings and rankings of the generated images. The authors analyzed the performance of a text-to-image model across different types of prompts, collected a dataset of 8,878 useful prompts, and scored the generated images along three dimensions. They also identified common problems in the generated images, finding that body problems and repeated generation were the most severe. Finally, they studied the influence of "feature" words in prompts on model performance and found that appropriate feature phrases improve text-image alignment.
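To make the annotation pipeline concrete, here is a minimal sketch of how one annotation record and a prompt-level ranking might be represented. The field names and rating dimensions are illustrative assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class ImageAnnotation:
    """One annotator's judgment of a single generated image.
    Dimension names and scales are assumptions for illustration."""
    prompt: str
    image_id: str
    alignment: int                 # how well the image matches the prompt
    fidelity: int                  # e.g. body correctness, visual artifacts
    overall: int                   # overall human preference
    problems: list[str] = field(default_factory=list)  # e.g. ["body problem"]

@dataclass
class PromptRanking:
    """Annotators' final ranking of all images generated for one prompt.
    The rank order later yields pairwise training comparisons."""
    prompt: str
    ranked_image_ids: list[str]    # best image first
```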
The experimental step involved training ImageReward, a preference model for generated images, using the annotations to model human preferences. BLIP was used as the backbone, and some transformer layers were frozen to prevent overfitting. Optimal hyperparameters were determined through a grid search on a validation set. The loss function was formulated from the ranked images for each prompt, with the goal of automatically selecting the images that humans prefer, as in the sketch below.
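The following is a minimal PyTorch sketch of a pairwise ranking loss consistent with that description, assuming each (better, worse) pair of ranked images contributes a negative log-sigmoid term on the score difference; the actual ImageReward training code additionally handles batching, ties, and the BLIP backbone:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """Ranking loss over one prompt's images.

    `scores` holds the reward model's outputs for the images of a
    single prompt, ordered from most to least preferred by annotators.
    Every (better, worse) pair contributes -log(sigmoid(s_i - s_j)),
    pushing the model to score human-preferred images higher.
    """
    losses = []
    k = scores.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            # image i is ranked above image j by annotators
            losses.append(-F.logsigmoid(scores[i] - scores[j]))
    return torch.stack(losses).mean()

# Toy example: reward scores for four generated images of one prompt,
# already sorted by annotator ranking (best first).
scores = torch.tensor([1.8, 0.9, 0.2, -0.5], requires_grad=True)
print(pairwise_ranking_loss(scores))
```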
In the experiments, the model is trained on a dataset of more than 136,000 pairs of image comparisons and compared to other models using preference accuracy, recall, and filter scores. ImageReward outperforms other models, achieving a preference accuracy of 65.14%. The paper also includes an analysis of agreement between annotators, researchers, annotator ensembles, and models. The model is shown to perform better than other models in terms of image fidelity, which is more complex than aesthetics, and to maximize the gap between the top-ranked and bottom-ranked images. In addition, an ablation study was conducted to analyze the impact of removing specific components from ImageReward. Its main result is that removing any of the three branches, namely the transformer backbone, the image encoder, and the text encoder, leads to a significant drop in preference accuracy, with removal of the transformer backbone causing the largest drop, indicating the transformer's critical role in the model.
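As a concrete reading of the headline metric, the sketch below computes preference accuracy as the fraction of human-annotated (winner, loser) pairs on which the reward model scores the winner higher. This definition is an assumption based on the paper's description, and the toy inputs are made up:

```python
def preference_accuracy(model_scores: dict, human_pairs: list) -> float:
    """Fraction of annotated (winner, loser) image pairs where the
    reward model scores the human-preferred image higher.

    model_scores: maps image_id -> reward model score
    human_pairs:  list of (winner_id, loser_id) tuples
    """
    correct = sum(
        model_scores[winner] > model_scores[loser]
        for winner, loser in human_pairs
    )
    return correct / len(human_pairs)

# Toy example: the model agrees with 2 of 3 human judgments.
scores = {"a": 0.9, "b": 0.4, "c": 0.7}
pairs = [("a", "b"), ("c", "b"), ("b", "a")]
print(preference_accuracy(scores, pairs))  # 0.666...
```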
In this article, we presented new research by a Chinese team introducing ImageReward. This general-purpose text-to-image human preference reward model addresses problems in generative models by aligning them with human values. The team created an annotation pipeline and a dataset of 137,000 comparisons and 8,878 prompts. Experiments showed that ImageReward outperformed existing methods and could serve as an ideal evaluation metric. Finally, the team analyzed the human evaluations and plans to refine the annotation process, extend the model to cover more categories, and explore reinforcement learning to push the limits of text-to-image synthesis.
Check out the Paper and GitHub. Don't forget to join our 20k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Mahmoud is a PhD researcher in machine learning. He also holds a bachelor's degree in physical sciences and a master's degree in telecommunications systems and networks. His current research areas concern computer vision, stock market prediction, and deep learning. He has produced several scientific articles on person re-identification and the study of the robustness and stability of deep networks.