Vision-language models (VLMs) have demonstrated impressive capabilities in general image understanding, but they face significant challenges when processing text-rich visual content such as charts, documents, diagrams, and screenshots. These specialized images demand complex reasoning that combines textual comprehension with spatial understanding, a skill set critical for analyzing scientific literature, improving accessibility features, and enabling AI agents to function effectively in real-world environments. Current VLMs struggle with these tasks primarily due to the scarcity of high-quality training data that realistically represents the diverse variety of text-embedded visual formats found in practical applications. This data limitation has created a performance gap in scenarios requiring nuanced interpretation of structured visual information, hindering the deployment of these models in specialized domains where text-rich image processing is essential.
Several approaches have been developed to improve how vision-language models process visual content. Early architectures explored different integration strategies, including cross-attention mechanisms, Q-Former structures, and MLP projection layers, to bridge visual and linguistic features. However, these models often suffer from a significant imbalance: their language components substantially outpace their visual processing capabilities, leading to hallucinations when high-quality training data is scarce. Existing benchmarks for text-rich image understanding (charts, documents, infographics, diagrams, screenshots) remain limited in size, scope, and diversity, making them suitable for evaluation but inadequate for comprehensive training. Previous synthetic data generation efforts have typically focused on narrow domains, using small sets of chart types with handcrafted question templates. Some approaches use text-only LLMs to generate annotations from tables or captions, while others explore rendering charts from synthetic code. Despite these advances, current synthetic datasets remain limited in topic diversity, figure variety, and rendering methodology, critical limitations that hinder generalization to novel, out-of-domain tasks.
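To make one of these integration strategies concrete, here is a minimal sketch of an MLP projection layer of the kind used to map vision-encoder features into a language model's embedding space. This is an illustrative PyTorch sketch, not the architecture of any specific model discussed here; the dimensions and the two-layer design are common conventions, used as assumptions.

```python
# Minimal sketch of an MLP projection layer bridging a vision encoder and an
# LLM. Dimensions (1024 -> 4096) are illustrative placeholders, not taken
# from any specific model in this article.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP, a common choice for vision-language connectors
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # Returns visual tokens sized to the LLM's embedding dimension
        return self.proj(patch_features)

# Usage: project 576 image-patch features into LLM-sized visual tokens
tokens = VisionProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```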
A team of researchers from the University of Pennsylvania and the Allen Institute for Artificial Intelligence introduced the Code-Guided Synthetic data generation system (CoSyn), a flexible framework that addresses the challenges of text-rich image processing by creating diverse synthetic multimodal training data. This innovative system leverages the code generation capabilities of text-only LLMs to produce both data and rendering code for various text-rich visual formats, using 11 supported rendering tools including Python, HTML, and LaTeX. CoSyn generates not only the images but also the corresponding textual instructions grounded in the underlying rendering code, creating comprehensive vision-language datasets. The researchers used this framework to build CoSyn-400K, a large-scale, diverse synthetic dataset specifically designed for text-rich image understanding.
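To illustrate the kind of artifact this produces, below is a hypothetical example of rendering code that a text-only LLM might emit for a chart-style image; instruction data is later generated from this source code alone. The chart title, labels, and values are invented for illustration and do not come from the paper.

```python
# Hypothetical example of LLM-generated rendering code of the kind CoSyn
# produces: it draws a text-rich chart image; Q/A instruction pairs are then
# generated from this code alone. All data values here are invented.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.2, 5.1, 4.8, 6.3]  # in millions (illustrative)

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(quarters, revenue, color="#4c72b0")
ax.set_title("Acme Corp. Quarterly Revenue (2024)")
ax.set_ylabel("Revenue ($M)")
for q, r in zip(quarters, revenue):
    ax.text(q, r + 0.1, f"{r}", ha="center")  # annotate bars with values
fig.savefig("synthetic_chart.png", dpi=150, bbox_inches="tight")

# A question generated from the code alone might be:
# "Which quarter had the highest revenue, and by how much did it exceed Q1?"
```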
The CoSyn system operates through a four-stage workflow that begins with a natural-language query such as "generate a dataset of book covers." First, the system selects one of 20 generation pipelines built on 11 diverse rendering tools, including Matplotlib, Plotly, LaTeX, HTML, Mermaid, and specialized tools such as LilyPond for sheet music and RDKit for chemical structures. The process starts with persona-guided topic generation, which improves content diversity, followed by detailed data generation that fills in content specific to the chosen topic. Next, the system generates executable code that renders the synthetic image with the appropriate tool. Finally, using only the code as context, the system prompts language models to generate the corresponding textual instructions, including questions, answers, and chain-of-thought reasoning explanations. To push diversity beyond what sampling parameters alone can achieve, CoSyn incorporates 200K unique personas during topic generation, effectively counteracting the repetitive output tendencies of language models. The implementation leverages the DataDreamer library for robust multi-stage generation, using Claude-3.5-Sonnet for code generation and GPT-4o-mini for instruction data generation.
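The sketch below outlines how these four stages could be chained together. It is a minimal sketch under stated assumptions, not CoSyn's actual implementation: `call_llm` is a hypothetical stub standing in for the real API clients (the paper's system uses DataDreamer with Claude-3.5-Sonnet and GPT-4o-mini), and the prompts are paraphrased from the workflow described above.

```python
# Minimal sketch of CoSyn's four-stage workflow. Function names and prompts
# are hypothetical; the real system is built on the DataDreamer library.
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call (wire up your own API client)."""
    raise NotImplementedError

def generate_example(query: str, persona: str) -> dict:
    # Stage 1: persona-guided topic generation for content diversity
    topic = call_llm(f"As {persona}, propose a specific topic for: {query}")
    # Stage 2: detailed data generation filling in topic-specific content
    data = call_llm(f"Generate detailed content for the topic: {topic}")
    # Stage 3: executable rendering code (e.g., Matplotlib) that draws the image
    code = call_llm(f"Write Python code that renders an image of: {data}")
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    subprocess.run(["python", script_path], check=True)  # render the image
    # Stage 4: instructions generated using ONLY the code as context
    qa = call_llm(
        "Given only this rendering code, write question-answer pairs with "
        f"chain-of-thought reasoning:\n{code}"
    )
    return {"topic": topic, "code": code, "instructions": qa}
```

Keying stage 4 on the code rather than the rendered image is the design choice that lets a text-only LLM produce faithful annotations: the code fully determines what appears in the picture.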
The model trained on CoSyn's synthetic data demonstrates exceptional performance on benchmarks for text-rich image understanding. Evaluated across seven specialized datasets, the 7B-parameter model achieves the highest average score, surpassing the second-best model (Llama 3.2 11B) by a significant 3.9% margin. The model ranks first on four of the seven benchmarks and second on the remaining three, highlighting its consistent capability across diverse text-rich image tasks. Perhaps most notably, even the zero-shot version of the model, with no exposure to training instances from the evaluation datasets, outperforms the most competitive open and closed models, including those fine-tuned on the benchmarks' training data. This result provides compelling evidence that the skills acquired from CoSyn's synthetic data transfer effectively to downstream tasks without requiring task-specific training examples. Ablation studies further show that combining synthetic data with auxiliary and evaluation datasets yields the best performance (80.9%), substantially outperforming models trained on evaluation data alone (75.9%).
The CoSyn framework represents a significant advance in vision-language model development, using synthetic data generation to substantially improve performance on text-rich image understanding tasks. By leveraging the code generation capabilities of LLMs, the system creates diverse, high-quality training data that enables models to generalize across domains with remarkable efficiency. The analysis confirms that CoSyn-generated data successfully mitigates biases present in existing datasets, resulting in models that perform robustly on realistic, human-written queries rather than only template-based questions. The demonstrated improvements in zero-shot learning, multi-hop reasoning, and novel-domain adaptation highlight the crucial role of synthetic data in developing VLMs capable of handling complex text-rich visual content in practical applications.
Check out the Paper and Dataset here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter (https://x.com/intent/follow?screen_name=marktechpost) and don't forget to join our 80k+ ML SubReddit.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.