Edge AI has long faced the challenge of balancing efficiency and effectiveness. Deploying vision language models (VLMs) on edge devices is difficult due to their large size, high computational demands, and latency issues. Models designed for cloud environments often struggle with the limited resources of edge devices, resulting in excessive battery usage, slower response times, and inconsistent connectivity. Demand for lightweight yet capable models has been growing, driven by applications such as augmented reality, smart home assistants, and industrial IoT, which require rapid processing of visual and textual inputs. These challenges are further complicated by hallucinations and unreliable results in tasks such as visual question answering and image captioning, where quality and accuracy are essential.
Nexa AI has launched OmniVision-968M: the world's smallest vision language model, featuring a 9x image-token reduction for edge devices. OmniVision-968M is built on an improved LLaVA (Large Language and Vision Assistant) architecture, reaching a new level of compactness and efficiency that makes it well suited to running at the edge. By cutting the number of image tokens by a factor of nine (from 729 to just 81), the design dramatically reduces the latency and computational load typically associated with such models.
The OmniVision architecture is based on three main components:
Base language model: Qwen2.5-0.5B-Instruct serves as the core model for processing text inputs.
Vision encoder: SigLIP-400M, with a resolution of 384 and a patch size of 14×14, generates image embeddings.
Projection layer: A multilayer perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Unlike the standard LLaVA architecture, this projector reduces the number of image tokens by a factor of nine, as sketched below.
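The announcement does not spell out exactly how the reduction is implemented, so the following is a minimal PyTorch sketch of one plausible approach: grouping each 3×3 neighborhood of SigLIP patch embeddings (a 27×27 = 729-token grid at 384-pixel resolution with patch size 14) into a single token before the MLP projection, yielding 9×9 = 81 tokens. The class name and the embedding dimensions (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are assumptions for illustration, not Nexa AI's actual code.

```python
import torch
import torch.nn as nn

class TokenReducingProjector(nn.Module):
    """Hypothetical projector: maps 729 vision tokens (27x27 grid) down to
    81 tokens (9x9) by folding each 3x3 neighborhood into one vector,
    then projecting into the language model's embedding space."""

    def __init__(self, vision_dim=1152, llm_dim=896, group=3):
        super().__init__()
        self.group = group
        # MLP that aligns the grouped vision features with the LLM token space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                       # x: (B, 729, vision_dim)
        b, n, d = x.shape
        side = int(n ** 0.5)                    # 27
        g = self.group
        x = x.view(b, side, side, d)
        # Space-to-depth: fold each 3x3 block of patches into the channel dim.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)                      # (B, 81, llm_dim)

tokens = torch.randn(1, 729, 1152)              # dummy SigLIP output
print(TokenReducingProjector()(tokens).shape)   # torch.Size([1, 81, 896])
```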
OmniVision-968M integrates several key technical advances that make it well suited to edge deployment. The model architecture builds on LLaVA, allowing it to process visual and text inputs with high efficiency. Reducing image tokens from 729 to 81 represents a significant optimization, making the model almost nine times more token-efficient than comparable existing models. This has a profound impact on latency and computational cost, both critical factors for edge devices. Additionally, OmniVision-968M leverages direct preference optimization (DPO) training with trusted data sources, helping to mitigate hallucinations, a common challenge in multimodal AI systems. By focusing on visual question answering and image captioning, the model aims to deliver an accurate and fluid user experience, ensuring reliability and robustness in edge applications where real-time response and energy efficiency are crucial.
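Nexa AI has not published its training code, so the snippet below is only a generic sketch of the direct preference optimization objective referenced above: given sequence log-probabilities for a trusted ("chosen") response and a hallucinated ("rejected") one under both the policy and a frozen reference model, the loss rewards widening the margin between them. The function name, beta value, and dummy inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the trusted
    (chosen) caption/answer over the hallucinated (rejected) one,
    measured relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```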
The launch of OmniVision-968M represents a notable advance for several reasons. Most importantly, the reduction in token count significantly decreases the computational resources required for inference. For developers and enterprises looking to deploy VLMs in constrained environments (such as wearables, mobile devices, and IoT hardware), OmniVision-968M's compact size and efficiency make it an attractive solution. Additionally, the DPO training strategy helps minimize hallucinations, a common problem where models generate incorrect or misleading information, making OmniVision-968M both efficient and reliable. Preliminary benchmarks indicate that OmniVision-968M achieves a 35% reduction in inference time compared to previous models, while maintaining or even improving accuracy in tasks such as visual question answering and image captioning. These advances are expected to drive adoption in industries that require high-speed, low-power AI interactions, such as healthcare, smart cities, and the automotive sector.
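To make the efficiency argument concrete, here is a back-of-the-envelope comparison of prompt (prefill) length with and without the token reduction. The 50-token question length is an arbitrary assumption, and the 35% end-to-end latency reduction cited above is Nexa AI's reported figure, not something this snippet reproduces.

```python
# Illustrative prefill-length comparison for a single image plus question.
image_tokens_before, image_tokens_after = 729, 81
text_tokens = 50  # assumed question length

before = image_tokens_before + text_tokens
after = image_tokens_after + text_tokens
print(f"prefill tokens: {before} -> {after} "
      f"({100 * (1 - after / before):.0f}% fewer)")
```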
In conclusion, Nexa AI's OmniVision-968M addresses a long-standing gap in the AI industry: the need for highly efficient vision language models that can run seamlessly on edge devices. By reducing image tokens, optimizing the LLaVA architecture, and incorporating DPO training to ensure reliable results, OmniVision-968M marks a new frontier in edge AI. This model brings us closer to the vision of ubiquitous AI, where smart, connected devices can perform sophisticated multimodal tasks locally without the need for constant cloud support.
Check out the model on Hugging Face and the project blog post (https://nexa.ai/blogs/omni-vision) for further details. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform receives more than 2 million monthly visits, illustrating its popularity among readers.