Vision-language models (VLMs) are an advanced class of artificial intelligence systems that integrate computer vision and natural language processing to handle multimodal data. They allow systems to understand and process images and text simultaneously, enabling applications such as medical imaging, automated systems, and digital content analysis. Their ability to bridge visual and textual data has made them a cornerstone of multimodal intelligence research.
One of the main challenges in developing VLMs is ensuring the safety of their outputs. Visual inputs can carry malicious or unsafe content that evades a model's defense mechanisms and triggers harmful or insensitive responses. Textual inputs benefit from relatively mature safeguards, but visual inputs do not: visual embeddings are continuous, which makes them harder to screen and more vulnerable to adversarial manipulation. As a result, evaluating safety under multimodal inputs is considerably more difficult.
Current approaches to VLM safety fall mainly into fine-tuning-based and inference-based defenses. Fine-tuning methods, such as supervised fine-tuning and reinforcement learning from human feedback, are effective but resource-intensive, demanding large amounts of data, annotation effort, and compute, and they risk degrading the model's general usefulness. Inference-based methods use safety evaluators to check outputs against predefined criteria, but they focus primarily on textual inputs and neglect the safety implications of visual content. As a result, unsafe visual inputs pass through unevaluated, undermining the model's reliability.
To address these issues, researchers at Purdue University introduced the "Evaluating Then Aligning" (ETA) framework. This inference-time method improves VLM safety without requiring additional data or fine-tuning. ETA tackles the limitations of existing approaches by splitting the safety mechanism into two phases: multimodal evaluation and two-level alignment. The researchers designed ETA as a plug-and-play solution that adapts to various VLM architectures while remaining computationally efficient.
The ETA framework operates in two stages. In the pre-generation evaluation stage, a predefined safety guard based on CLIP scores checks the visual input, flagging potentially harmful visual content before a response is generated. In the post-generation evaluation stage, a reward model assesses the safety of the textual output. If unsafe behavior is detected, the framework applies two alignment strategies: shallow alignment prepends an interference prefix to shift the model's generative distribution toward safer outputs, while deep alignment performs sentence-level optimization to refine the response further. Together, these mechanisms preserve both the safety and the usefulness of the generated outputs, as sketched below.
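To make the two-stage flow concrete, here is a minimal, hypothetical Python sketch. It is not the authors' code: `vlm_generate`, `clip_unsafe_score`, and `safety_reward` are placeholder callables, and the thresholds, prefix text, and best-of-N candidate count are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of an ETA-style inference-time guard (illustrative, not the authors' code).
from typing import Callable, List

SAFETY_PREFIX = "As a responsible assistant, "  # shallow-alignment prefix (illustrative)
CLIP_THRESHOLD = 0.25                           # image-guard cutoff (assumed, not from the paper)
REWARD_THRESHOLD = 0.0                          # text-safety cutoff (assumed, not from the paper)

def eta_generate(
    image,
    prompt: str,
    vlm_generate: Callable[..., str],              # (image, prompt, prefix) -> response text
    clip_unsafe_score: Callable[[object], float],  # image -> similarity to unsafe concepts
    safety_reward: Callable[[str, str], float],    # (prompt, response) -> higher means safer
    n_candidates: int = 4,
) -> str:
    """Two-stage safety evaluation with shallow/deep alignment, in the spirit of ETA."""
    # Stage 1 (pre-generation): screen the visual input with a CLIP-based guard.
    image_is_risky = clip_unsafe_score(image) > CLIP_THRESHOLD

    # Stage 2 (post-generation): generate once, then score the text with a reward model.
    response = vlm_generate(image, prompt, prefix="")
    if not image_is_risky and safety_reward(prompt, response) >= REWARD_THRESHOLD:
        return response  # both checks passed, return the unmodified response

    # Shallow alignment: regenerate with an interference prefix that nudges the
    # generative distribution toward safer continuations.
    # Deep alignment is simplified here to best-of-N selection: sample several
    # prefixed candidates and keep the one the reward model rates safest.
    candidates: List[str] = [
        vlm_generate(image, prompt, prefix=SAFETY_PREFIX) for _ in range(n_candidates)
    ]
    return max(candidates, key=lambda r: safety_reward(prompt, r))
```

Note that the paper describes deep alignment as sentence-level optimization during decoding; the best-of-N resampling above is only a coarse stand-in for that step.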
The researchers tested ETA extensively across multiple benchmarks. The framework reduced the unsafe response rate by 87.5% under multimodal attacks and significantly outperformed existing methods such as ECSO. On the SPA-VL Harm dataset, ETA decreased the unsafe rate from 46.04% to 16.98%. On multimodal benchmarks such as MM-SafetyBench and FigStep, ETA reliably improved safety when handling adversarial and harmful visual inputs. Notably, it achieved a 96.6% tie rate in GPT-4 utility evaluations, demonstrating that it maintains model utility while improving safety. The researchers also demonstrated the framework's efficiency: it adds just 0.1 seconds to inference time, compared to the 0.39-second overhead of competing methods like ECSO.
The proposed method achieves this balance of safety and utility by targeting the root cause of vulnerabilities in VLMs: the continuous nature of visual token embeddings. By mapping visual content into the discrete textual space that existing safety mechanisms were built for, ETA lets those mechanisms operate effectively on both modalities. As a result, visual and textual inputs alike undergo rigorous safety checks, making it far harder for harmful content to slip through. One plausible way to implement the CLIP-based visual guard mentioned above is sketched below.
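As one illustration of how visual content can be scored against textual safety concepts, the sketch below computes a CLIP cosine similarity between an image and a textual description of unsafe content using Hugging Face `transformers`. The checkpoint name, the unsafe-content prompt, and the example threshold are assumptions for demonstration only; the guard used in the paper may differ in its details.

```python
# Illustrative CLIP-based image guard (assumed implementation, not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint for this sketch
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

UNSAFE_PROMPT = "an image containing violent, explicit, or otherwise harmful content"

def clip_unsafe_score(image: Image.Image, unsafe_prompt: str = UNSAFE_PROMPT) -> float:
    """Cosine similarity between the image and a textual description of unsafe content."""
    inputs = processor(text=[unsafe_prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings and take their dot product (cosine similarity).
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Usage: flag the image if its similarity to the unsafe-content prompt exceeds a threshold.
# is_risky = clip_unsafe_score(Image.open("example.jpg")) > 0.25
```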
Through this work, the research team has delivered a scalable and efficient solution to one of the most challenging problems in multimodal AI systems. The ETA framework shows how well-designed evaluation and alignment strategies can strengthen the safety of VLMs while preserving their overall capabilities. This advance addresses current safety gaps and lays the groundwork for deploying VLMs in real-world applications with greater confidence.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he explores new advances and creates opportunities to contribute.