In a recent tweet, Dataquest.io founder Vik Paruchuri announced the release of Surya, a multilingual document OCR toolset. The framework can efficiently detect line-level bounding boxes (bboxes) and column breaks in documents, scanned images, and presentations. Existing text detection models like Tesseract work at the word or character level, while this open-source model works at the line level. The biggest challenge in building a text line detection model is the lack of fully accurate datasets with line-level annotations.
Surya is an encoder-decoder model that takes a document image as input and produces an image with boxes drawn around the text lines of the original. The initial layers of the decoder contain SegFormer, a transformer for semantic segmentation, while a 2D convolutional layer with batch normalization layers makes up the end of the decoder network. Before an image or PDF is passed to the model, each page is split into segments no larger than a maximum image dimension and goes through several preprocessing steps.
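The article does not spell out how pages are segmented, so the following is only a minimal sketch of one plausible reading: a tall page is cut into vertical spans, each no taller than a maximum height. The function name `split_spans` and the example values (a 2500-pixel page, a 1200-pixel limit) are hypothetical and not taken from Surya.

```python
def split_spans(height: int, max_height: int) -> list[tuple[int, int]]:
    """Return (top, bottom) row spans that tile a page of `height` rows.

    Each span is at most `max_height` rows tall; the last one may be shorter.
    This is an illustrative assumption, not Surya's actual preprocessing.
    """
    if height <= 0 or max_height <= 0:
        raise ValueError("height and max_height must be positive")
    return [
        (top, min(top + max_height, height))
        for top in range(0, height, max_height)
    ]


# Hypothetical example: a 2500-row page with a 1200-row limit.
spans = split_spans(2500, 1200)
print(spans)
```

Each span could then be cropped from the page and passed to the model separately, with the detected boxes shifted back by the span's `top` offset.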
To evaluate bbox accuracy, the researchers used precision and recall based on coverage area instead of the traditional IoU (intersection over union) metric. Precision measures how well the predicted bboxes cover the actual bboxes, and recall measures how well the actual bboxes cover the predicted bboxes. In experiments comparing the two, Surya's precision was much higher than Tesseract's, while Tesseract's recall was slightly higher than Surya's; overall, Surya outperformed Tesseract. Another advantage of Surya over Tesseract is that it can run on both CPU and GPU and is much faster.
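The difference between IoU and a coverage-based score can be illustrated with a small sketch. It assumes boxes are `(x1, y1, x2, y2)` tuples in integer pixel coordinates with exclusive right and bottom edges, and it measures coverage as the share of one set of boxes' area that also lies inside the other set. The exact formulas used in the Surya benchmark are not given here, so the helper names and the toy boxes are assumptions; the sketch computes both coverage directions without claiming which one the benchmark reports as precision and which as recall.

```python
from itertools import product


def pixels(box):
    """Set of (x, y) pixels inside a box (x1, y1, x2, y2), right/bottom exclusive."""
    x1, y1, x2, y2 = box
    return set(product(range(x1, x2), range(y1, y2)))


def union_pixels(boxes):
    """Pixels covered by at least one of the boxes."""
    return set().union(*(pixels(b) for b in boxes))


def coverage(targets, covering):
    """Fraction of the area of `targets` that also lies inside `covering`."""
    target_px = union_pixels(targets)
    if not target_px:
        return 0.0
    return len(target_px & union_pixels(covering)) / len(target_px)


def iou(a, b):
    """Plain intersection over union of two boxes."""
    pa, pb = pixels(a), pixels(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


# Toy data (hypothetical): one annotated text line, predicted as two word-sized boxes.
gt = [(0, 0, 100, 10)]
pred = [(0, 0, 45, 10), (55, 0, 100, 10)]

pred_in_gt = coverage(pred, gt)   # share of predicted area lying on real text
gt_in_pred = coverage(gt, pred)   # share of annotated area that was detected
first_iou = iou(pred[0], gt[0])   # IoU of a single word box against the line

print(pred_in_gt, gt_in_pred, first_iou)
```

In this toy case both coverage scores are high (1.0 and 0.9), while the IoU of each word-sized box against the full line is only 0.45. That illustrates why an area-coverage measure can be more forgiving than IoU when one system outputs word-level boxes and the annotations are line-level.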
Surya, named after the Hindu sun god, has worked successfully on several languages and is expected to work on almost all languages. Its main limitation is that it does not work with photographs or other non-document images, since it is specialized for documents. Experiments also show that it does not work well with images that look like advertisements. Despite this limitation, the model is still very useful and could be extended further to text, table, and graph detection.
Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing a B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in data science software and applications, and she is always reading about advances in different fields of AI and ML.