A group of researchers from the University of Washington, Stanford, AI2, UCSB, and Google recently developed the OpenFlamingo project, which aims to build open-source models matching the capabilities of DeepMind's Flamingo. OpenFlamingo models accept arbitrarily interleaved sequences of images and text and produce text as output, and their ability to learn from in-context examples makes them useful for tasks such as captioning, visual question answering, and image classification.
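To make the interleaved image-and-text interface concrete, here is a minimal captioning sketch in the spirit of the usage shown in the project's README; the checkpoint names and factory arguments below are illustrative and should be checked against the current repository.

```python
from PIL import Image
from open_flamingo import create_model_and_transforms

# Build the model: a frozen CLIP vision encoder and a frozen language model,
# joined by trainable cross-attention layers. Argument values are illustrative.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Preprocess a single image into the expected
# (batch, num_images, frames, channels, height, width) layout.
image = Image.open("cat.jpg")
vision_x = image_processor(image).unsqueeze(0)  # (1, C, H, W)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)   # (1, 1, 1, C, H, W)

# "<image>" marks where an image appears in the text stream.
tokenizer.padding_side = "left"
lang_x = tokenizer(["<image>An image of"], return_tensors="pt")

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated[0]))
```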
Now the team has announced the v2 release, with five OpenFlamingo models at the 3B, 4B, and 9B parameter scales. These models are built on open-source language models with more permissive licenses than LLaMA, including MosaicML's MPT-1B and MPT-7B and Together.XYZ's RedPajama-3B.
The researchers follow the Flamingo modeling paradigm: visual features are injected into the layers of a frozen, pretrained language model. The vision encoder and language model stay frozen, and only the connecting modules are trained, using sequences of interleaved images and text scraped from the web, as in Flamingo.
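The pattern can be sketched in a few lines of PyTorch. This is not OpenFlamingo's actual code; it is a simplified illustration of freezing the backbone weights and training only newly inserted cross-attention modules, and every class and parameter name here is invented for the example.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable connector: text tokens attend to visual features.
    A tanh gate initialized at zero means training starts from the
    frozen language model's original behavior (as in Flamingo)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, text_h: torch.Tensor, vis_h: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text_h, key=vis_h, value=vis_h)
        return text_h + torch.tanh(self.gate) * attended

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

# Frozen backbones: toy stand-ins for a pretrained LM stack and a CLIP encoder.
dim = 512
lm_layers = freeze(nn.ModuleList(
    [nn.TransformerEncoderLayer(dim, 8, batch_first=True) for _ in range(4)]))
vision_encoder = freeze(nn.Linear(768, dim))

# Only these newly inserted blocks receive gradients during training.
xattn_blocks = nn.ModuleList([GatedCrossAttentionBlock(dim) for _ in lm_layers])

def forward(text_h: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    vis_h = vision_encoder(image_feats)
    for lm_layer, xattn in zip(lm_layers, xattn_blocks):
        text_h = xattn(text_h, vis_h)  # inserted, trainable
        text_h = lm_layer(text_h)      # pretrained, frozen
    return text_h
```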
The team evaluated their models on captioning, VQA, and classification tasks across vision-language datasets. Their findings show significant progress between the v1 release and the OpenFlamingo-9B v2 model.
To assess the models, they aggregate results over seven datasets and five in-context settings: zero-shot, 4-shot, 8-shot, 16-shot, and 32-shot. They compare the OpenFlamingo (OF) 3B and 4B models against Flamingo-3B, and OF-9B against Flamingo-9B, finding that OpenFlamingo achieves more than 80% of the corresponding Flamingo performance on average. The researchers also compare their results against the fine-tuned state-of-the-art results published on PapersWithCode: pretrained on web-scraped data only, the OpenFlamingo-3B and OpenFlamingo-9B models reach more than 55% of fine-tuned performance with 32 in-context examples. On average, the OpenFlamingo models trail DeepMind's models by 10% in the 0-shot setting and 15% in the 32-shot setting.
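For context, a k-shot evaluation prompt of this kind interleaves k demonstration image-caption pairs before the query image. The helper below is a hypothetical illustration; the `<image>` and `<|endofchunk|>` markers follow the convention OpenFlamingo uses for interleaved sequences.

```python
def build_kshot_prompt(demo_captions, query_prefix="An image of"):
    """Assemble a k-shot captioning prompt: one "<image>" marker per
    demonstration (the actual images are passed to the model separately,
    in the same order), then a marker and an open prefix for the query."""
    shots = [f"<image>An image of {c}.<|endofchunk|>" for c in demo_captions]
    return "".join(shots) + f"<image>{query_prefix}"

# A 2-shot prompt; a 32-shot prompt is built the same way with 32 captions.
print(build_kshot_prompt(["two cats sleeping on a couch", "a bathroom sink"]))
```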
The team continues to advance the training and release of state-of-the-art multimodal models. Next, they aim to improve the quality of the data used for pre-training.
Check out the GitHub repository and the blog post for more details.
Dhanshree Shenwai is a computer engineer with experience at FinTech companies spanning the finance, cards & payments, and banking domains, and a strong interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today's changing world, making everyone's life easier.