Scene Text Recognition (STR) continues to challenge researchers because of the sheer diversity of text in natural environments: detecting text in a scanned document is one thing, and reading it off a person's t-shirt is quite another. Multi-Granularity Prediction for Scene Text Recognition (MGP-STR), presented at ECCV 2022, addresses this challenge by combining the robustness of Vision Transformers (ViT) with multi-granularity linguistic predictions. The result is a simple yet powerful model that improves accuracy and usability across a wide range of challenging real-world STR scenarios.
Learning objectives
- Understand the architecture and components of MGP-STR, including Vision Transformers (ViT).
- Discover how multi-granularity predictions improve the accuracy and versatility of scene text recognition.
- Explore practical applications of MGP-STR in real-world OCR tasks.
- Get hands-on experience implementing and using MGP-STR with PyTorch for scene text recognition.
This article was published as part of the Data Science Blogathon.
What is MGP-STR?
MGP-STR is a vision-based STR model designed to excel without relying on a standalone language model. Instead, it integrates linguistic information directly into its architecture through the Multi-Granularity Prediction (MGP) strategy. This implicit approach allows MGP-STR to outperform both pure vision models and language-augmented methods, achieving state-of-the-art results in STR.
The architecture consists of two main components, both of which are critical to ensuring the model's exceptional performance and its ability to handle various scene text challenges:
- Vision Transformer (ViT)
- A³ modules
Fusing predictions at the character, subword, and word levels through a simple yet effective strategy ensures that MGP-STR captures the complexities of visual and linguistic features.
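To make the fusion idea concrete, here is a minimal, illustrative Python sketch (not the actual MGP-STR code): each granularity head produces a candidate reading with a confidence score, and the most confident reading is kept. The texts and scores below are invented purely for illustration.

# Illustrative sketch only: in MGP-STR the fusion happens inside the model itself.
def fuse_predictions(candidates):
    """candidates: list of (text, confidence) pairs, one per granularity
    (character, subword, word). Return the most confident reading."""
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical scores: the word-level head is most confident, so its reading wins.
best_text, best_score = fuse_predictions([
    ("cr0codiles", 0.81),   # character-level reading
    ("crocodiles", 0.90),   # subword-level reading
    ("crocodiles", 0.95),   # word-level reading
])
print(best_text)  # crocodiles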
MGP-STR Applications and Use Cases
MGP-STR is primarily designed for optical character recognition (OCR) tasks on text images. Its unique ability to incorporate linguistic knowledge implicitly makes it particularly effective in real-world scenarios where text variations and distortions are common. Examples include:
- Reading text in natural scenes, such as road signs, billboards, and store names in outdoor environments.
- Extracting handwritten or printed text from scanned forms and official documents.
- Analyzing text in industrial applications, such as reading labels, barcodes, or product serial numbers.
- Translating or transcribing text in augmented reality (AR) applications for travel or education.
- Extracting information from scanned documents or photographs of printed materials.
- Supporting accessibility solutions, such as screen readers for visually impaired users.
Key features and benefits
- Elimination of standalone language models: linguistic knowledge is captured implicitly within the vision architecture.
- Multi-granularity predictions at the character, subword, and word levels.
- State-of-the-art performance on STR benchmarks.
- Ease of use via the Hugging Face Transformers library.
Getting started with MGP-STR
Before delving into the code, let's understand its purpose and prerequisites. This example demonstrates how to use the MGP-STR model to perform scene text recognition on sample images. Make sure you have PyTorch, the Transformers library, and the necessary dependencies (such as PIL and requests) installed in your environment so the code runs smoothly. Below is an example of how to use the MGP-STR model in PyTorch (for instance, in a Jupyter notebook).
Step 1: Import dependencies
Start by importing the essential libraries required for MGP-STR: transformers for the model and processor, PIL for image manipulation, and requests to fetch images from the web. These libraries provide the fundamental tools to load, process, and display text images efficiently.
from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
import requests
import base64
from io import BytesIO
from PIL import Image
from IPython.display import display, Image as IPImage
Step 2: Load the base model
Load the MGP-STR base model and its processor from the Hugging Face Transformers library. This initializes the pre-trained model and its accompanying utilities, enabling smooth preprocessing of images and decoding of scene text predictions.
processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')
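As an optional sanity check, you can switch the model to evaluation mode (which disables dropout) and print its parameter count; this step is not required for the examples below.

# Optional: evaluation mode and a quick look at the model size
model.eval()
num_params = sum(p.numel() for p in model.parameters())
print(f"MGP-STR base has roughly {num_params / 1e6:.0f}M parameters")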
Step 3: Define a helper function to predict text in an image
Define a helper function that takes an image URL, processes the image with the MGP-STR processor, runs the model, and decodes the recognized text. The function also converts the image to base64 so it can be displayed inline alongside the prediction.
def predict(url):
    # Download the image and convert it to RGB
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Process the image to prepare it for the model
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Generate the text from the model
    outputs = model(pixel_values)
    generated_text = processor.batch_decode(outputs.logits)["generated_text"]

    # Convert the image to base64 and display it inline
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    image_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    display(IPImage(data=base64.b64decode(image_base64)))
    print("\n\n")

    return generated_text
Example 1:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_7.png?raw=true")
['7']
Example 2:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_BAR.png?raw=true")
['bar']
Example 3:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_CROCODILES.png?raw=true")
['crocodiles']
Example 4:
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_DAY.png?raw=true")
['day']
Even on these challenging curved-text samples from the CUTE80 benchmark, the predictions are accurate. With this level of accuracy, the model is easy to deploy and delivers reliable results out of the box. It also runs on a single CPU and uses less than 3 GB of RAM, which makes it practical to fine-tune further for domain-specific use cases.
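If your own images are stored locally rather than behind a URL, the same pipeline applies. The following is a minimal sketch assuming a hypothetical file sample.png sitting next to your notebook (adjust the path to your own image); it wraps the forward pass in torch.no_grad() since no gradients are needed for inference.

import torch
from PIL import Image

def predict_local(path):
    # Load a local image and prepare it for the model
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(pixel_values)
    decoded = processor.batch_decode(outputs.logits)
    # decoded is a dictionary: "generated_text" holds the fused reading and,
    # in the current Transformers implementation, "scores" holds its confidence
    return decoded["generated_text"], decoded["scores"]

# print(predict_local("sample.png"))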
Conclusion
MGP-STR exemplifies the combination of vision and language knowledge within a unified framework. By innovatively integrating multi-granularity predictions into the STR process, MGP-STR ensures a holistic approach to scene text recognition by combining information at the character, subword, and word levels. This results in greater accuracy, adaptability to diverse data sets, and efficient performance without relying on external language models. It simplifies the architecture while achieving remarkable precision. For researchers and developers in OCR and STR, MGP-STR offers a next-generation tool that is efficient and accessible. With its open source implementation and comprehensive documentation, MGP-STR is poised to drive further advances in the field of scene text recognition.
Key takeaways
- MGP-STR integrates vision and linguistic knowledge without relying on separate linguistic models, streamlining the STR process.
- The use of multi-granularity predictions allows MGP-STR to excel in various text recognition challenges.
- MGP-STR sets a new benchmark for STR models by achieving state-of-the-art results with a simple and efficient architecture.
- Developers can easily adapt and implement MGP-STR for a variety of OCR tasks, enhancing both research and practical applications.
Frequently asked questions
Q1: What is MGP-STR, and how does it differ from traditional STR models?
A1: MGP-STR is a scene text recognition model that integrates linguistic predictions directly into its vision-based framework using Multi-Granularity Prediction (MGP). Unlike traditional STR models, it eliminates the need for a separate language model, simplifying the pipeline and improving accuracy.
Q2: Which datasets was MGP-STR trained on?
A2: The base-size MGP-STR model was trained on the MJSynth and SynthText datasets, which are widely used for scene text recognition tasks.
Q3: Can MGP-STR handle distorted or low-quality text images?
A3: Yes, MGP-STR's multi-granularity prediction mechanism allows it to handle various challenges, including distorted or low-quality text images.
Q4: Does MGP-STR support languages other than English?
A4: While the current implementation is optimized for English, the architecture can be adapted to support other languages by training it on relevant datasets.
Q5: What is the role of the A³ module in MGP-STR?
A5: The A³ module refines the ViT outputs by mapping combinations of tokens to characters and enabling predictions at the subword level, implicitly incorporating linguistic knowledge into the model.