Image segmentation has become a popular technology, with different fine-tuned models available for various purposes. A segmentation model labels each pixel of an image by classifying every region of the input; this is what makes semantic segmentation practical in real applications.
This facial analysis model is a semantic segmentation model fine-tuned from NVIDIA's mit-b5 backbone on the CelebAMask-HQ dataset. Its intended use is face analysis: labeling different areas of an image, especially facial features.
It can also detect and label the objects it was trained on, so you can get tags for everything from the background to the eyes, nose, skin, eyebrows, clothing, hat, neck, hair, and other features.
Learning objectives
- Understand the concept of face analysis as a semantic segmentation model.
- Highlight some key points about face analysis.
- Learn how to run the face analysis model.
- Learn about real-life applications of this model.
This article was published as part of the Data Science Blogathon.
What is facial analysis?
Facial analysis is a computer vision technique that segments the facial parts of an input image, along with other visible areas, into pixel-level regions. With this image segmentation task, users can further modify, analyze, and apply the model's output in various ways.
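Before looking at the architecture, it helps to see how little code a first experiment takes. The sketch below uses the Transformers pipeline API with the 'jonathandinu/face-parsing' checkpoint; the image path is a hypothetical placeholder, and it assumes the transformers and Pillow packages are installed.
from transformers import pipeline

# load the face-parsing checkpoint as an image-segmentation pipeline
segmenter = pipeline("image-segmentation", model="jonathandinu/face-parsing")

# run it on a local image (hypothetical path) and list the regions it found
segments = segmenter("face.jpg")
for segment in segments:
    print(segment["label"])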
Understanding the model's architecture is key to understanding how it works. Although the model relies on a large amount of pre-trained data, much of its efficiency comes from its vision transformer architecture.
Model architecture of the face analysis model
This model uses a transformer-based architecture for semantic segmentation, following the SegFormer design. In addition to the transformer encoder, it relies on a lightweight decoding mechanism when processing an image.
If you look at the key components of this mechanism, you will see that it consists of a transformer encoder, an MLP decoder, and no positional encodings. These are vital attributes of how transformer models work in image segmentation.
The transformer encoder is an essential part of the mechanism; it extracts multi-scale features from the input image. This lets the model capture information at different spatial scales, improving its efficiency.
The lightweight decoder is another vital part of this model's architecture. It is a multi-layer perceptron (MLP) decoder that aggregates information from the different layers of the transformer encoder. The model combines local and global attention mechanisms: local attention helps recognize fine facial features, while global attention ensures good coverage of the overall facial structure.
This mechanism balances the model's performance and efficiency, allowing compute resources to be minimized without affecting the result.
The absence of positional encodings is another essential part of the face analysis architecture and has become a staple in many computer vision transformer models. This design avoids resolution mismatch problems when input images differ in size from the training data, so the model maintains accuracy regardless of input resolution.
Overall, the model design performs well on standard facial segmentation benchmarks. It is efficient and can generalize across a variety of face images, making it a strong choice for tasks like facial recognition, avatar generation, or AR filters. The model maintains sharp boundaries between facial regions, an essential requirement for accurate facial analysis.
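To see these design choices reflected in the checkpoint itself, you can inspect its configuration. This is a minimal sketch, assuming the attribute names used by the Transformers SegFormer implementation:
from transformers import SegformerConfig

config = SegformerConfig.from_pretrained("jonathandinu/face-parsing")

# the hierarchical encoder: several stages with growing channel widths
print(config.num_encoder_blocks)  # number of encoder stages
print(config.hidden_sizes)        # channel width of each stage
print(config.depths)              # transformer blocks per stage

# the lightweight all-MLP decoder is described by a single hidden width
print(config.decoder_hidden_size)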
How to run the face analysis model
This section describes the steps to run this model with resources from the Hugging Face ecosystem. The result shows labels for each facial feature the model can recognize. You can run this model using either the hosted inference API or the libraries locally. So, let's explore both methods.
Running inference on the face analysis model using Hugging Face
You can use the inference API available on Hugging Face to complete face analysis tasks. The API takes an image as input, and the model labels the face parts in the image using colors.
import requests

API_URL = "https://api-inference.huggingface.co/models/jonathandinu/face-parsing"
headers = {"Authorization": "Bearer hf_xxx"}  # replace with your own access token

def query(filename):
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

output = query("/content/IMG_20221108_073555.jpg")
The above code starts with the requests library to handle HTTP requests and communicate with the API over the web. The URL specifies the model endpoint, and the token, which you can create for free on Hugging Face, authenticates your requests to the Hugging Face API.
The rest of the code sends an image file to the API and collects the results. The query function is called with the path to the image file; it sends the image to the API and stores the JSON response in the output variable.
output
Next, you enter the output variable to display the result of the inference.
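The exact structure of the response depends on the Hugging Face endpoint, but for image segmentation models it is typically a list of segments, each carrying a label, an optional score, and a base64-encoded mask. The sketch below inspects the output under that assumption:
import base64
import io
from PIL import Image

# list the region labels the API returned (assumed response structure)
for segment in output:
    print(segment["label"], segment.get("score"))

# decode and view one mask, assuming it is a base64-encoded PNG
mask_bytes = base64.b64decode(output[0]["mask"])
mask = Image.open(io.BytesIO(mask_bytes))
mask.show()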
Import essential libraries
This code imports the necessary libraries for the image segmentation task, using SegFormer as the base model. It imports the image processor and model classes from the Transformers library to preprocess images and run the SegFormer model, PIL to handle image loading, and Matplotlib to visualize the segmentation results. Finally, it imports requests to retrieve images from URLs.
import torch
from torch import nn
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import matplotlib.pyplot as plt
import requests
Selecting hardware: GPU/CPU
device = (
    "cuda"  # device for NVIDIA (and ROCm-enabled AMD) GPUs
    if torch.cuda.is_available()
    else "mps"  # device for Apple Silicon (Metal Performance Shaders)
    if torch.backends.mps.is_available()
    else "cpu"
)
This code selects the best available hardware on the local device to run the model. As shown in the code, it maps 'cuda' to NVIDIA or AMD GPUs and 'mps' to Apple Silicon devices. If no accelerator is available, the model falls back to the CPU.
Loading the processor and model
The following code loads the SegFormer image processor and semantic segmentation model from the 'jonathandinu/face-parsing' checkpoint, which was pre-trained for face analysis tasks.
image_processor = SegformerImageProcessor.from_pretrained("jonathandinu/face-parsing")
model = SegformerForSemanticSegmentation.from_pretrained("jonathandinu/face-parsing")
model.to(device)
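Although the original walkthrough does not include it, it is idiomatic PyTorch to put the model in evaluation mode before inference; this disables training-only behavior such as dropout.
# optional: switch to evaluation mode for inference
model.eval()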
The next step is to fetch the image for the image segmentation task. You can do this by uploading a local file or loading the image from a URL, as shown below:
url = "https://images.unsplash.com/photo-1539571696357-5a69c17a67c6"
image = Image.open(requests.get(url, stream=True).raw)
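If you would rather use a local file than a URL, the same PIL call accepts a path; the filename below is a placeholder.
# alternatively, load a local image (hypothetical path)
image = Image.open("face.jpg").convert("RGB")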
This code processes the image using image_processor, converting it to a PyTorch tensor and moving it to the selected device (GPU, MPS, or CPU).
inputs = image_processor(images=image, return_tensors="pt").to(device)
outputs = model(**inputs)
logits = outputs.logits # shape (batch_size, num_labels, ~height/4, ~width/4)
The processed tensor is fed into the SegFormer model to generate segmentation results. The logits extracted from the model output represent the raw scores for each pixel across the different labels, with the height and width reduced by a factor of 4.
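You can confirm the reduced resolution by printing the tensor shape. The size shown in the comment is illustrative, assuming the processor's default 512x512 resize and the checkpoint's face-parsing label set:
print(logits.shape)
# e.g., torch.Size([1, 19, 128, 128]): a batch of 1, one channel per label,
# with height and width downsampled by a factor of 4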
Output
To display the results, a few lines of code are needed. First, resize the output so that it matches the dimensions of the input image. This is done using bilinear interpolation to estimate pixel values at the upsampled resolution.
# resize output to match input image dimensions
upsampled_logits = nn.functional.interpolate(logits,
    size=image.size[::-1],  # H x W
    mode="bilinear",
    align_corners=False)
Second, compute the label masks by taking the argmax over the class dimension, which assigns each pixel its most likely label.
# get label masks
labels = upsampled_logits.argmax(dim=1)[0]
Finally, you can display the image using the matplotlib library.
# move to CPU to visualize in matplotlib
labels_viz = labels.cpu().numpy()
plt.imshow(labels_viz)
plt.show()
The output image displays the facial feature labels as distinct colored regions.
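To turn the numeric labels in the plot into readable names, Hugging Face checkpoints usually ship an id2label mapping in their config; the sketch below assumes this model provides one as well.
import numpy as np

# print the name of each region actually present in this image
for label_id in np.unique(labels_viz):
    print(label_id, model.config.id2label[int(label_id)])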
Real-life applications of the facial analysis model
This model has several applications across industries, and many similar fine-tuned models are already in use. These are some of the popular applications of face analysis technology:
- Security: This model has facial recognition capabilities, which allow it to identify people through facial features. It can also help identify a list of people allowed into a private event or meeting while blocking unrecognized faces.
- Social media: Image segmentation has become widespread in the social media space, and this model brings value to that industry too. The model can modify skin tones and other facial features, which can be used to create beauty effects in photos, videos, and online meetings.
- Entertainment: Facial analysis has a great influence on the entertainment industry. Various analysis attributes can help producers change colors and tones at different positions in an image. You can analyze the image, add embellishments, and modify some parts of an image or video.
Conclusion
The facial analysis model is a powerful semantic segmentation tool designed to accurately label and analyze facial features in images and videos. This model uses a transformer-based architecture to efficiently extract multi-scale features while ensuring performance through a lightweight decoding mechanism and the absence of positional encodings.
Its versatility enables a variety of real-world applications, from improving security through facial recognition to providing advanced image editing features in social media and entertainment.
Key takeaways
- Transformer-based architecture: This mechanism plays an essential role in the efficiency and performance of the model. Furthermore, its lack of positional encodings avoids image resolution problems.
- Versatile applications: This model can be used in different industries; security, entertainment, and social media spaces can all find valuable uses for facial analysis technology.
- Semantic segmentation: By precisely segmenting each pixel related to facial features, the model facilitates detailed analysis and manipulation of images, providing users with valuable information and capabilities in facial analysis.
Frequently asked questions
Q1. What is facial analysis?
A. Facial analysis is a computer vision technology that segments an image into different facial features, labeling each area, such as the eyes, nose, mouth, and skin.
Q2. How does the model process images?
A. The model processes input images through a transformer-based architecture that captures multi-scale features. This is followed by a lightweight decoder that aggregates information to produce accurate segmentation results.
Q3. What are the key applications of this model?
A. Key applications include security (facial recognition), social media (photo and video enhancement), and entertainment (image and video editing).
Q4. What are the benefits of the transformer-based architecture?
A. The transformer architecture enables efficient image processing, better handling of different image resolutions, and improved segmentation accuracy without the need for positional encoding.
The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.