Business documents such as contracts, reports, invoices, and receipts come with complex layouts. Interpreting and analyzing these documents automatically is valuable and can enable AI-powered solutions. Doing so is challenging, however, because their rich semantics lie at the intersection of textual and spatial modalities: the complex layouts provide crucial visual cues that are necessary for efficient interpretation.
While Document AI (DocAI) has made significant progress in areas such as question answering, categorization, and extraction, real-world applications continue to face persistent obstacles related to accuracy, reliability, contextual understanding, and generalization to new domains.
To address these issues, a team of researchers at JPMorgan AI Research has introduced DocLLM, a lightweight extension of conventional large language models (LLMs) that takes into account both textual semantics and spatial layout, created specifically to reason over visual documents.
DocLLM is inherently multimodal, as it represents both text semantics and spatial layout. Unlike traditional methods, it uses bounding-box coordinates obtained through optical character recognition (OCR) to incorporate spatial layout information, thereby eliminating the need for a sophisticated visual encoder. This design decision reduces processing times, increases model size only slightly, and preserves the causal decoder architecture.
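To make this concrete, below is a minimal sketch, not the authors' implementation, of how OCR-derived bounding boxes might be embedded as a spatial stream kept separate from the token embeddings; all module and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialTokenEmbedding(nn.Module):
    """Hypothetical sketch: embed tokens and their OCR bounding boxes
    as two parallel streams, without any visual encoder."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # Project normalized (x0, y0, x1, y1) box coordinates into hidden space.
        self.box_proj = nn.Linear(4, hidden_dim)

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor):
        # token_ids: (batch, seq_len); boxes: (batch, seq_len, 4), values in [0, 1]
        text = self.token_emb(token_ids)   # textual modality
        spatial = self.box_proj(boxes)     # spatial modality
        # The streams stay separate so attention can mix them later.
        return text, spatial
```

Because the spatial stream is just a linear projection of four coordinates, it adds far fewer parameters than a full image encoder, which is consistent with the lightweight design described above.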
The team has shared that for various document intelligence tasks, including form understanding, table alignment, and visual question answering, the spatial layout structure alone is sufficient. By separating spatial information from textual information, the method extends the self-attention mechanism typical of transformers to capture cross-modal interactions.
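The sketch below shows one way such a disentangled attention could look: a simplified, single-head version in which textual and spatial queries and keys interact through learnable mixing weights. The naming and the scalar weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn

class DisentangledSelfAttention(nn.Module):
    """Simplified single-head sketch: attention scores combine text-text,
    text-spatial, spatial-text, and spatial-spatial interactions."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_t = nn.Linear(hidden_dim, hidden_dim)  # textual queries
        self.k_t = nn.Linear(hidden_dim, hidden_dim)  # textual keys
        self.v_t = nn.Linear(hidden_dim, hidden_dim)  # values from text
        self.q_s = nn.Linear(hidden_dim, hidden_dim)  # spatial queries
        self.k_s = nn.Linear(hidden_dim, hidden_dim)  # spatial keys
        # Learnable scalars weighting the cross-modal interaction terms.
        self.lam_ts = nn.Parameter(torch.tensor(1.0))
        self.lam_st = nn.Parameter(torch.tensor(1.0))
        self.lam_ss = nn.Parameter(torch.tensor(1.0))

    def forward(self, text: torch.Tensor, spatial: torch.Tensor):
        # text, spatial: (batch, seq_len, hidden_dim)
        qt, kt, vt = self.q_t(text), self.k_t(text), self.v_t(text)
        qs, ks = self.q_s(spatial), self.k_s(spatial)
        d = qt.size(-1)
        scores = (qt @ kt.transpose(-2, -1)
                  + self.lam_ts * (qt @ ks.transpose(-2, -1))
                  + self.lam_st * (qs @ kt.transpose(-2, -1))
                  + self.lam_ss * (qs @ ks.transpose(-2, -1))) / math.sqrt(d)
        # Causal mask preserves the decoder-only, left-to-right structure.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ vt
```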
Visual documents often contain fragmented text sections, irregular layouts, and heterogeneous content. To address this, the study has suggested changing the pre-training objective during the self-supervised pre-training phase, recommending an infilling approach to accommodate varied text arrangements and cohesive blocks of text. With this setting, the model can more effectively handle mixed data types, complex layouts, contextual completions, and misaligned text.
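As a rough illustration of an infilling-style objective, the snippet below builds a training example in which one cohesive text block is masked out and becomes the autoregressive target. The sentinel tokens and the helper function are invented for this sketch and are not taken from the paper.

```python
def make_infilling_example(blocks, masked_idx):
    """blocks: list of token lists, one per cohesive text block on a page.
    Returns a sequence where one block is replaced by a sentinel and must
    be reconstructed after an <infill> separator (token names hypothetical)."""
    context, target = [], []
    for i, block in enumerate(blocks):
        if i == masked_idx:
            context.append(f"<mask_{i}>")  # placeholder where the block was
            target.extend(block)           # the model must reconstruct this span
        else:
            context.extend(block)
    # The model reads the context, then autoregressively infills the target.
    return context + ["<infill>"] + target

example = make_infilling_example(
    [["Invoice", "No.", "1042"], ["Total:", "$312.50"], ["Due", "Mar", "1"]],
    masked_idx=1,
)
print(example)
# ['Invoice', 'No.', '1042', '<mask_1>', 'Due', 'Mar', '1', '<infill>', 'Total:', '$312.50']
```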
DocLLM's pre-trained knowledge has been fine-tuned on instruction data drawn from multiple datasets to suit different document intelligence tasks. These tasks include document classification, visual question answering, natural language inference, and key information extraction.
The instruction-tuning data covers both single-page and multi-page documents, and layout cues such as field separators, headings, and captions can be included to help the model understand the logical structure of a document. Applied to the Llama2-7B model, DocLLM's changes have produced notable performance improvements, ranging from 15% to 61%, on four of the five previously unseen datasets.
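For a sense of what such instruction data could look like, here is a hypothetical prompt/response template; the field markers and function below are invented for illustration and do not reflect the paper's exact format.

```python
def build_instruction_example(task: str, document_text: str, answer: str) -> dict:
    """Frame a document intelligence task as an instruction-tuning pair."""
    prompt = (
        f"### Instruction: {task}\n"
        f"### Document:\n{document_text}\n"
        f"### Response:"
    )
    return {"prompt": prompt, "response": answer}

ex = build_instruction_example(
    task="Extract the total amount due from the document.",
    document_text="Invoice No. 1042 | Total: $312.50 | Due Mar 1",
    answer="$312.50",
)
print(ex["prompt"])
```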
The team has summarized its main contributions as follows.
- A lightweight extension to typical LLMs, designed especially for visual document interpretation, has been introduced.
- The study provides a unique attention mechanism that can distinguish between textual and spatial information, enabling efficient capture of cross-modal alignment between layout and text.
- A pre-training objective has been outlined to address the difficulties caused by irregular layouts in visual documents.
- A specialized instruction-tuning dataset has been curated for visual document intelligence tasks, to be used for fine-tuning the model effectively.
- In-depth testing has been performed, yielding important insights into how the proposed model behaves and performs when handling visual documents.
Check out the Paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.