Scene Text Recognition (STR) continues to challenge researchers because of the sheer diversity of text in natural environments: detecting text in a scanned document is one thing, and reading it off a person's t-shirt is quite another. Multi-Granularity Prediction for Scene Text Recognition (MGP-STR), presented at ECCV 2022, addresses this challenge by combining the robustness of Vision Transformers (ViT) with multi-granularity linguistic predictions. The result is a simple yet powerful model that improves accuracy and usability across a wide range of challenging real-world STR scenarios.
Learning objectives
- Understand the architecture and components of MGP-STR, including Vision Transformers (ViT).
- Discover how multi-granularity predictions improve the accuracy and versatility of scene text recognition.
- Explore practical applications of MGP-STR in real-world OCR tasks.
- Get hands-on experience implementing and using MGP-STR with PyTorch for scene text recognition.
This article was published as part of the Data Science Blogathon.
What is MGP-STR?
MGP-STR is a vision-based STR model designed to excel without relying on a standalone language model. Instead, it integrates linguistic information directly into its architecture through the Multi-Granularity Prediction (MGP) strategy. This implicit approach allows MGP-STR to outperform both pure vision models and language-augmented methods, achieving state-of-the-art results in STR.
The architecture consists of two main components, both of which are critical to ensuring the model's exceptional performance and its ability to handle various scene text challenges:
- Vision Transformer (ViT)
- A³ modules
Fusing predictions at the character, subword, and word levels through a simple yet effective strategy ensures that MGP-STR captures the complexities of visual and linguistic features.
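To make the fusion idea concrete, here is a minimal, illustrative Python sketch (not the actual MGP-STR code): each granularity head produces a candidate reading with a confidence score, and the most confident reading is kept. The texts and scores below are invented purely for illustration.

# Illustrative sketch only: in MGP-STR the fusion happens inside the model itself.
def fuse_predictions(candidates):
    """candidates: list of (text, confidence) pairs, one per granularity
    (character, subword, word). Return the most confident reading."""
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical scores: the word-level head is most confident, so its reading wins.
best_text, best_score = fuse_predictions([
    ("cr0codiles", 0.81),   # character-level reading
    ("crocodiles", 0.90),   # subword-level reading
    ("crocodiles", 0.95),   # word-level reading
])
print(best_text)  # crocodiles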
MGP-STR Applications and Use Cases
MGP-STR is primarily designed for optical character recognition (OCR) tasks on text images. Its unique ability to incorporate linguistic knowledge implicitly makes it particularly effective in real-world scenarios where text variations and distortions are common. Examples include:
- Reading text in natural scenes, such as road signs, billboards, and store names in outdoor environments.
- Extracting handwritten or printed text from scanned forms and official documents.
- Analyzing text in industrial applications, such as reading labels, barcodes, or product serial numbers.
- Translating or transcribing text in augmented reality (AR) applications for travel or education.
- Extracting information from scanned documents or photographs of printed materials.
- Supporting accessibility solutions, such as screen readers for visually impaired users.
Key features and benefits
- Elimination of standalone language models: linguistic knowledge is captured implicitly within the vision architecture.
- Multi-granularity predictions at the character, subword, and word levels.
- State-of-the-art performance on STR benchmarks.
- Ease of use via the Hugging Face Transformers library.
Getting started with MGP-STR
Before delving into the code, let's understand its purpose and prerequisites. This example demonstrates how to use the MGP-STR model to perform scene text recognition on sample images. Make sure you have PyTorch, the Transformers library, and the necessary dependencies (such as PIL and requests) installed in your environment so the code runs smoothly. Below is an example of how to use the MGP-STR model in PyTorch (for instance, in a Jupyter notebook).
Step 1: Import dependencies
Start by importing the essential libraries required for MGP-STR: transformers for the model and processor, PIL for image manipulation, and requests to fetch images from the web. These libraries provide the fundamental tools to load, process, and display text images efficiently.
from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
import requests
import base64
from io import BytesIO
from PIL import Image
from IPython.display import display, Image as IPImage
Step 2: Load the base model
Load the MGP-STR base model and its processor from the Hugging Face Transformers library. This initializes the pre-trained model and its accompanying utilities, enabling smooth preprocessing of images and decoding of scene text predictions.
processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')
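As an optional sanity check, you can switch the model to evaluation mode (which disables dropout) and print its parameter count; this step is not required for the examples below.

# Optional: evaluation mode and a quick look at the model size
model.eval()
num_params = sum(p.numel() for p in model.parameters())
print(f"MGP-STR base has roughly {num_params / 1e6:.0f}M parameters")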
Step 3: Define a helper function to predict text in an image
Define a helper function that takes an image URL, processes the image with the MGP-STR processor, runs the model, and decodes the recognized text. The function also converts the image to base64 so it can be displayed inline alongside the prediction.
def predict(url):
    # Download the image and convert it to RGB
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Process the image to prepare it for the model
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Generate the text from the model
    outputs = model(pixel_values)
    generated_text = processor.batch_decode(outputs.logits)["generated_text"]

    # Convert the image to base64 and display it inline
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    image_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    display(IPImage(data=base64.b64decode(image_base64)))
    print("\n\n")

    return generated_text
Example 1:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_7.png?raw=true")
['7']
Example 2:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_BAR.png?raw=true")
['bar']
Example 3:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_CROCODILES.png?raw=true")
['crocodiles']
Example 4:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_DAY.png?raw=true")
['day']
Even on these challenging curved-text samples from the CUTE80 benchmark, the predictions are accurate. With this level of accuracy, the model is easy to deploy and delivers reliable results out of the box. It also runs on a single CPU and uses less than 3 GB of RAM, which makes it practical to fine-tune further for domain-specific use cases.
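If your own images are stored locally rather than behind a URL, the same pipeline applies. The following is a minimal sketch assuming a hypothetical file sample.png sitting next to your notebook (adjust the path to your own image); it wraps the forward pass in torch.no_grad() since no gradients are needed for inference.

import torch
from PIL import Image

def predict_local(path):
    # Load a local image and prepare it for the model
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(pixel_values)
    decoded = processor.batch_decode(outputs.logits)
    # decoded is a dictionary: "generated_text" holds the fused reading and,
    # in the current Transformers implementation, "scores" holds its confidence
    return decoded["generated_text"], decoded["scores"]

# print(predict_local("sample.png"))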
Conclusion
MGP-STR exemplifies the combination of vision and language knowledge within a unified framework. By innovatively integrating multi-granularity predictions into the STR process, MGP-STR ensures a holistic approach to scene text recognition by combining information at the character, subword, and word levels. This results in greater accuracy, adaptability to diverse data sets, and efficient performance without relying on external language models. It simplifies the architecture while achieving remarkable precision. For researchers and developers in OCR and STR, MGP-STR offers a next-generation tool that is efficient and accessible. With its open source implementation and comprehensive documentation, MGP-STR is poised to drive further advances in the field of scene text recognition.
Key takeaways
- MGP-STR integrates vision and linguistic knowledge without relying on separate linguistic models, streamlining the STR process.
- The use of multi-granularity predictions allows MGP-STR to excel in various text recognition challenges.
- MGP-STR sets a new benchmark for STR models by achieving state-of-the-art results with a simple and efficient architecture.
- Developers can easily adapt and implement MGP-STR for a variety of OCR tasks, enhancing both research and practical applications.
Frequently asked questions
Q1: What is MGP-STR, and how does it differ from traditional STR models?
A1: MGP-STR is a scene text recognition model that integrates linguistic predictions directly into its vision-based framework using Multi-Granularity Prediction (MGP). Unlike traditional STR models, it eliminates the need for a separate language model, simplifying the pipeline and improving accuracy.
Q2: Which datasets was MGP-STR trained on?
A2: The base-size MGP-STR model was trained on the MJSynth and SynthText datasets, which are widely used for scene text recognition tasks.
Q3: Can MGP-STR handle distorted or low-quality text images?
A3: Yes, MGP-STR's multi-granularity prediction mechanism allows it to handle various challenges, including distorted or low-quality text images.
Q4: Does MGP-STR support languages other than English?
A4: While the current implementation is optimized for English, the architecture can be adapted to support other languages by training it on relevant datasets.
Q5: What is the role of the A³ module in MGP-STR?
A5: The A³ module refines the ViT outputs by mapping combinations of tokens to characters and enabling predictions at the subword level, implicitly incorporating linguistic knowledge into the model.