Alibaba researchers have announced the release of Qwen2-VL, the latest generation of vision-language models in the Qwen family, built on Qwen2. The release represents a significant advance in multimodal AI, building on the foundation laid by its predecessor, Qwen-VL. After a year of intensive development, the improvements in Qwen2-VL open up exciting possibilities for a wide range of applications in visual understanding and interaction.
Researchers evaluated Qwen2-VL's visual capabilities across seven key dimensions: college-level complex problem solving, mathematical reasoning, document and table understanding, multilingual text and image understanding, general scenario question answering, video understanding, and agent-based interactions. The 72B model demonstrated top-tier performance across most metrics, often outperforming even closed-source models such as GPT-4V and Claude 3.5 Sonnet. Notably, Qwen2-VL exhibited a significant advantage in document understanding, highlighting its versatility and advanced capabilities in visual information processing.
The 7B-scale model of Qwen2-VL retains support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective size. This version excels at document understanding tasks, as demonstrated by its performance on benchmarks such as DocVQA. Additionally, the model shows impressive capabilities in understanding multilingual text from images, achieving state-of-the-art performance on the MTVQA benchmark. These achievements highlight the model’s efficiency and versatility across a variety of visual and linguistic tasks.
A new compact Qwen2-VL 2B model has also been introduced, optimized for potential mobile deployment. Despite its small size, this version demonstrates strong performance in image, video, and multilingual understanding. The 2B model particularly excels in video-related tasks, document understanding, and general scenario question answering compared to other similarly scaled models. This development demonstrates the researchers’ ability to create efficient, high-performance models suitable for resource-constrained environments.
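For readers who want to try these checkpoints, the open-weight variants are distributed through Hugging Face. The snippet below is a minimal inference sketch, not the official recipe: it assumes the `Qwen/Qwen2-VL-2B-Instruct` repository name and the `Qwen2VLForConditionalGeneration` / `AutoProcessor` classes described in the model cards, so verify both against your installed `transformers` version.

```python
# Minimal sketch (not the official recipe): single-image question answering
# with a Qwen2-VL Instruct checkpoint via Hugging Face transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # swap 2B for 7B/72B if resources allow

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single chat turn: one image placeholder plus a text question.
image = Image.open("receipt.png")  # hypothetical local file
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the total amount on this receipt?"},
        ],
    }
]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

For multi-image or video prompts, the model cards reference a small helper package (qwen-vl-utils) that assembles the visual inputs; the single-image path above avoids that dependency.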
Qwen2-VL introduces significant improvements in object recognition, including complex relationships between multiple objects and enhanced recognition of handwritten and multilingual text. The model's mathematical and coding capabilities have been greatly enhanced, allowing it to solve complex problems by analyzing charts and graphs and to interpret even distorted images. Extraction of information from real-world images and graphics has been strengthened, along with improved instruction-following capabilities. Additionally, Qwen2-VL now excels at video content analysis, offering summarization, question answering, and real-time conversation capabilities. These advancements position Qwen2-VL as a versatile visual agent, capable of bridging abstract concepts with practical solutions across multiple domains.
The researchers have retained the Qwen-VL architecture for Qwen2-VL, which pairs a Vision Transformer (ViT) with Qwen2 language models. All variants use a ViT of approximately 600 million parameters that handles both image and video inputs. Key improvements include naive dynamic resolution support, which lets the model process images of arbitrary resolution by mapping them to a dynamic number of visual tokens, an approach that more closely mimics human visual perception. Additionally, the innovative Multimodal Rotary Position Embedding (M-RoPE) allows the model to simultaneously capture and integrate 1D textual, 2D visual, and 3D video positional information.
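To make the dynamic-resolution idea concrete, the sketch below estimates how many visual tokens an image of a given size might occupy. The 14-pixel patch size and 2x2 token merging follow the released model description; the specific token budget and rounding policy here are illustrative assumptions, not the exact preprocessing used by Qwen2-VL.

```python
# Illustrative sketch of naive dynamic resolution: an arbitrary image size
# maps to a variable number of visual tokens instead of a fixed grid.
import math

PATCH = 14            # ViT patch size described for Qwen2-VL
MERGE = 2             # 2x2 spatial merge -> one visual token per 28x28-pixel block
UNIT = PATCH * MERGE  # 28 pixels per token along each side

def visual_token_count(height: int, width: int,
                       min_tokens: int = 4, max_tokens: int = 1280) -> int:
    """Estimate the visual token count for an arbitrary resolution.

    Each side is rounded to a whole number of 28-pixel units; if the raw
    count exceeds the assumed budget, the grid is scaled down proportionally.
    """
    h_units = max(1, round(height / UNIT))
    w_units = max(1, round(width / UNIT))
    tokens = h_units * w_units
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)
        h_units = max(1, int(h_units * scale))
        w_units = max(1, int(w_units * scale))
    return max(min_tokens, h_units * w_units)

# A 1080x1920 screenshot maps to roughly 39x69 = 2691 raw units, which the
# assumed 1280-token budget scales down to about 1200 tokens.
print(visual_token_count(1080, 1920))  # ~1222 with these assumptions
print(visual_token_count(224, 224))    # a small thumbnail needs only ~64 tokens
```

The point the sketch illustrates is that the token count grows with image area rather than being fixed, so a dense document page receives far more visual tokens than a thumbnail, which helps explain the strong document-understanding results noted above.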
In summary, Alibaba has introduced Qwen2-VL, the latest vision-language model in the Qwen family, which enhances multimodal AI capabilities. Available in 72B, 7B, and 2B versions, Qwen2-VL excels at complex problem solving, document understanding, multilingual text and image understanding, and video analysis, often outperforming models such as GPT-4V. Key innovations include better object recognition, enhanced mathematical and coding abilities, and the ability to handle complex visual tasks. The model integrates a Vision Transformer with naive dynamic resolution and Multimodal Rotary Position Embedding (M-RoPE), making it a versatile and efficient tool for a variety of applications.
Take a look at the model card and details. All credit for this research goes to the researchers of this project.
Asjad is a consulting intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in the healthcare domain.