The last few decades have witnessed the rapid development of optical character recognition (OCR), which has evolved from an academic benchmark task used in the early days of deep learning research to tangible products available on consumer devices and to third-party developers for daily use. These OCR products digitize and democratize valuable information that is stored in paper or image-based sources (for example, books, magazines, newspapers, forms, street signs, restaurant menus) so that it can be indexed, searched, translated, and further processed with state-of-the-art natural language processing techniques.
Research on scene text detection and recognition (or scene text spotting) has been the major driver of this rapid development by adapting OCR to natural images that have more complex backgrounds than document images. These research efforts, however, focus on the detection and recognition of each individual word in images, without understanding how these words compose sentences and articles.
Layout analysis is another relevant line of research that takes a document image and extracts its structure, that is, title, paragraphs, headings, figures, tables, and captions. These layout analysis efforts have developed in parallel with OCR and have largely been treated as stand-alone techniques that are typically evaluated only on document images. As such, the synergy between OCR and layout analysis remains underexplored. We believe that OCR and layout analysis are mutually complementary tasks that enable machine intelligence to interpret text in images and, when combined, could improve the accuracy and efficiency of both tasks.
With this in mind, we announce the Hierarchical Text Detection and Recognition Competition (the HierText Challenge), organized as part of the 17th International Conference on Document Analysis and Recognition (ICDAR 2023). The competition is hosted on the Robust Reading Competition website and represents the first major effort to unify OCR and layout analysis. In this competition, we invite researchers from around the world to build systems that can produce hierarchical annotations of text in images, with words grouped into lines and paragraphs. We expect this competition to have a significant and long-term impact on image-based text understanding, with the goal of consolidating research efforts across OCR and layout analysis and creating new signals for downstream information processing tasks.
The concept of hierarchical text representation.
Building a hierarchical text dataset
In this competition, we use the HierText dataset that we published at CVPR 2022 with our paper “Towards End-to-End Unified Scene Text Detection and Layout Analysis”. It is the first real-image dataset to provide hierarchical text annotations, containing word-, line-, and paragraph-level annotations. Here, “words” are defined as sequences of textual characters not interrupted by spaces. “Lines” are then interpreted as “space”-separated clusters of “words” that are logically connected in one direction and aligned in spatial proximity. Finally, “paragraphs” are composed of “lines” that share the same semantic topic and are geometrically coherent.
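To make this hierarchy concrete, below is a minimal Python sketch of what one hierarchical annotation might look like. The nesting (paragraphs containing lines containing words, each with polygon vertices) mirrors the description above, but the exact field names are illustrative assumptions; refer to the dataset’s released JSON schema for the authoritative format.

```python
# Illustrative sketch of one hierarchical text annotation.
# Field names approximate the released format but are not authoritative.
sample_annotation = {
    "image_id": "example_0001",
    "paragraphs": [
        {
            "vertices": [[10, 10], [300, 10], [300, 70], [10, 70]],  # paragraph polygon
            "lines": [
                {
                    "vertices": [[12, 12], [298, 12], [298, 60], [12, 60]],
                    "text": "HIERARCHICAL TEXT",
                    "words": [
                        {"vertices": [[12, 12], [150, 12], [150, 60], [12, 60]],
                         "text": "HIERARCHICAL"},
                        {"vertices": [[160, 12], [298, 12], [298, 60], [160, 60]],
                         "text": "TEXT"},
                    ],
                },
            ],
        },
    ],
}

def iter_words(annotation):
    """Yield (paragraph_index, line_index, word_dict) for every word."""
    for p_idx, paragraph in enumerate(annotation["paragraphs"]):
        for l_idx, line in enumerate(paragraph["lines"]):
            for word in line["words"]:
                yield p_idx, l_idx, word

print(sum(1 for _ in iter_words(sample_annotation)))  # 2 words in this sample
```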
To construct this dataset, we first annotated images from the Open Images dataset using the Google Cloud Platform (GCP) Text Detection API. We filtered these annotated images, keeping only images rich in text content and layout structure. We then worked with our third-party partners to manually correct all transcriptions and label word, line, and paragraph composition. As a result, we obtained 11,639 transcribed images, divided into three subsets: (1) a train set with 8,281 images, (2) a validation set with 1,724 images, and (3) a test set with 1,634 images. As detailed in the paper, we also checked for overlap between our dataset, TextOCR, and Intel OCR (both of which also sourced annotated images from Open Images), ensuring that test images in the HierText dataset were not included in the TextOCR or Intel OCR training and validation splits, and vice versa. Below, we visualize examples from the HierText dataset and demonstrate the concept of hierarchical text by shading each text entity with a different color. We can see that HierText features a diversity of image domains, varied text layouts, and high text density.
Samples from the HierText dataset. Left: illustration of word entities. Middle: illustration of line clustering. Right: illustration of paragraph grouping.
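As a side note, the cross-dataset de-duplication described above reduces to set operations on Open Images IDs. The snippet below is a rough sketch of such a check; the ID sets and their loading are hypothetical placeholders, not part of any released tooling.

```python
# Rough sketch of a split-overlap check between datasets that share
# Open Images sources. The ID sets below are hypothetical placeholders.
def assert_disjoint(test_ids: set, other_ids: set, other_name: str) -> None:
    overlap = test_ids & other_ids
    if overlap:
        raise ValueError(
            f"{len(overlap)} HierText test images also appear in {other_name}")
    print(f"No overlap between HierText test and {other_name}")

hiertext_test_ids = {"img_a", "img_b", "img_c"}      # placeholder IDs
textocr_train_val_ids = {"img_x", "img_y"}           # placeholder IDs
assert_disjoint(hiertext_test_ids, textocr_train_val_ids, "TextOCR train/val")
```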
Dataset with the highest text density
In addition to the novel hierarchical representation, HierText represents a new domain of text images. We note that HierText is currently the densest publicly available OCR dataset. Below we summarize the characteristics of HierText in comparison with other OCR datasets. HierText contains 103.8 words per image on average, which is more than 3 times the density of TextOCR and 25 times the density of ICDAR-2015. This high density poses unique challenges for detection and recognition, and as a consequence HierText is used as one of the primary datasets for OCR research at Google.
Dataset | Train split | Validation split | Test split | Words per image
ICDAR-2015 | 1,000 | 0 | 500 | 4.4
TextOCR | 21,778 | 3,124 | 3,232 | 32.1
Intel OCR | 191,059 | 16,731 | 0 | 10.0
HierText | 8,281 | 1,724 | 1,634 | 103.8
Comparison of several OCR datasets with the HierText dataset.
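The words-per-image column can be reproduced from hierarchical annotations by counting leaf word entries, as in this small sketch (it reuses the illustrative schema from the earlier snippet, so the field names are again assumptions):

```python
def average_words_per_image(annotations):
    """Average word count over a list of hierarchical annotations."""
    counts = [
        sum(len(line["words"])
            for paragraph in ann["paragraphs"]
            for line in paragraph["lines"])
        for ann in annotations
    ]
    return sum(counts) / max(len(counts), 1)

# e.g. average_words_per_image([sample_annotation]) -> 2.0
```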
Spatial distribution
We also found that text in the HierText dataset has a much more even spatial distribution than other OCR datasets, including TextOCR, Intel OCR, IC19 MLT, COCO-Text, and IC19 LSVT. These older datasets tend to have well-composed images, where text is placed in the middle of the image and is therefore easier to identify. In contrast, the text entities in HierText are broadly distributed across the images, which is evidence that our images come from more diverse domains. This characteristic makes HierText uniquely challenging among public OCR datasets.
Spatial distribution of text instances in different datasets.
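One simple way to produce such a visualization is to accumulate word-box centers into a 2D histogram over normalized image coordinates. Below is a sketch with NumPy; it assumes you have each word’s polygon vertices and the size of its source image, and it is not the exact plotting code used for the figure.

```python
import numpy as np

def word_center_heatmap(word_polygons, image_sizes, bins=64):
    """2D histogram of word centers in normalized (x, y) coordinates.

    word_polygons: list of (N, 2) arrays of pixel-space vertices, one per word.
    image_sizes:   list of (width, height) pairs, one per word's source image.
    """
    xs, ys = [], []
    for poly, (width, height) in zip(word_polygons, image_sizes):
        cx, cy = np.asarray(poly, dtype=float).mean(axis=0)  # polygon centroid
        xs.append(cx / width)
        ys.append(cy / height)
    heatmap, _, _ = np.histogram2d(ys, xs, bins=bins, range=[[0, 1], [0, 1]])
    return heatmap / max(heatmap.sum(), 1)  # normalize to a density map

# Example: two words in a single 640x480 image.
polygons = [[[10, 10], [100, 10], [100, 40], [10, 40]],
            [[300, 200], [400, 200], [400, 240], [300, 240]]]
sizes = [(640, 480), (640, 480)]
print(word_center_heatmap(polygons, sizes).shape)  # (64, 64)
```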
The HierText Challenge
The HierText Challenge represents a novel task with unique challenges for OCR models. We invite researchers to participate in this challenge and to join us at ICDAR 2023 this year in San Jose, CA. We hope this competition will spark the research community’s interest in OCR models with information-rich representations that are useful for new downstream tasks.
Acknowledgements
The main contributors to this project are Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii, and Michalis Raptis. Ashok Popat and Jake Walker provided valuable advice. We also thank Dimosthenis Karatzas and Sergi Robles from the Autonomous University of Barcelona for helping us set up the contest website.