With the advancement of Large Language Models like ChatGPT, image-generation models like DALL-E, and the broader rise of Generative Artificial Intelligence, producing human-like content is no longer a distant goal. Tasks such as question answering, code completion, text generation from descriptions, and image creation from text are now routine. OpenAI's well-known chatbot, ChatGPT, is based on the GPT-3.5 transformer architecture and is used by millions of people. The latest version, GPT-4, is multimodal, unlike its predecessor GPT-3.5, which limits ChatGPT to text-only input.
The quality of generated content has increased significantly as a result of the development of diffusion models. Due to these advances, AI-Generated Content (AIGC) platforms such as DALL-E, Stability AI, Runway, and Midjourney have become increasingly popular, as they allow users to create high-quality images from text prompts written in natural language. Despite progress in multimodal understanding, vision-language models still struggle with generated images. Compared to real data, synthetic images exhibit a higher degree of content and style variability, making it much more difficult for models to interpret them correctly.
To address these issues, a team of researchers introduced JourneyDB, a large-scale dataset specifically curated for multimodal visual understanding of generated images. JourneyDB contains 4 million unique, high-quality generated images created from a wide range of text prompts. The dataset covers both content and style interpretation, and it aims to serve as a comprehensive resource for training and evaluating models' abilities to understand generated images.
The proposed benchmark includes the following four tasks.
- Prompt Inversion – Prompt inversion asks the model to recover the text prompt the user used to generate an image. This tests the model's understanding of both the content and the style of generated images.
- Style Retrieval – In style retrieval, the model must identify and retrieve generated images with similar stylistic attributes. This assesses the model's competence in discerning stylistic nuances within generated images.
- Image Captioning – In image captioning, the model is tasked with generating descriptive captions in natural language that accurately represent the content of a generated image. This assesses the model's ability to understand and express the visual elements of generated content effectively.
- Visual Question Answering – Through Visual Question Answering (VQA), the model must provide accurate answers to questions about a generated image. It has to understand both visual and stylistic content and give relevant answers based on the questions posed.
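The four tasks above can be pictured as fields of a single annotated record per image. The sketch below is purely illustrative: the field names and the toy word-overlap score for prompt inversion are assumptions, not JourneyDB's actual schema or evaluation metric.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a JourneyDB-style record; field names are
# illustrative, not the dataset's actual schema.
@dataclass
class GeneratedImageRecord:
    image_path: str
    prompt: str                                       # prompt used to generate the image
    caption: str = ""                                 # target for image captioning
    style_tags: list = field(default_factory=list)    # attributes for style retrieval
    qa_pairs: list = field(default_factory=list)      # (question, answer) tuples for VQA

def prompt_overlap(predicted: str, reference: str) -> float:
    """Toy prompt-inversion score: fraction of reference-prompt words
    recovered in the model's predicted prompt (not an official metric)."""
    ref = set(reference.lower().split())
    pred = set(predicted.lower().split())
    return len(ref & pred) / len(ref) if ref else 0.0

record = GeneratedImageRecord(
    image_path="images/000001.png",
    prompt="a castle in the clouds, watercolor style",
)
score = prompt_overlap("castle in clouds, watercolor", record.prompt)
```

Real evaluations of prompt inversion would use stronger text-similarity metrics than raw word overlap, but the structure of the comparison is the same: predicted prompt against ground-truth prompt.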
The team collected 4,692,751 image-text pairs and divided them into three sets: a training set, a validation set, and a test set. For evaluation, they conducted extensive experiments on the benchmark. The results showed that current state-of-the-art multimodal models perform worse on generated images than on real datasets, but fine-tuning on the proposed dataset greatly improved their performance.
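A train/validation/test partition of image-prompt pairs like the one described can be sketched as follows. The 90/5/5 ratio and fixed seed here are assumptions for illustration; the paper's actual split sizes may differ.

```python
import random

def split_dataset(pairs, train_frac=0.90, val_frac=0.05, seed=42):
    """Shuffle image-text pairs and split them into train/val/test.
    Ratios are illustrative, not the dataset's published split."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

# Usage with placeholder pairs standing in for (image, prompt) records.
pairs = [(f"img_{i}.png", f"prompt {i}") for i in range(1000)]
train, val, test = split_dataset(pairs)
```

Shuffling before splitting matters here because generated images are often grouped by prompt theme or style; a fixed seed keeps the partition reproducible across runs.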
Tanya Malhotra is a final-year BTech student in Computer Science Engineering at the University of Petroleum & Energy Studies, Dehradun, specializing in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.