Introduction
Image matting is an image processing and computer vision technique for separating foreground objects from the background in an image. The goal of image matting is to accurately estimate the transparency, or alpha value, of each pixel, indicating how much of that pixel belongs to the foreground and how much to the background.
Image matting is used in many applications, most notably image and video editing, where objects must be extracted precisely. It enables realistic compositions, seamlessly integrating objects from one image into another while preserving intricate details such as hair or fur. Advanced matting techniques often rely on machine learning and deep learning models to improve accuracy. These models analyze the visual characteristics of the image, including color, texture, and lighting, to estimate alpha values and separate foreground from background elements.
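To make the role of the alpha value concrete, here is a minimal sketch of the standard compositing equation, composite = alpha * foreground + (1 - alpha) * background, using small made-up NumPy arrays (the shapes and values are purely illustrative):
import numpy as np
# Toy 2x2 RGB foreground and background images with values in [0, 1]
foreground = np.ones((2, 2, 3)) * [1.0, 0.0, 0.0]  # pure red
background = np.ones((2, 2, 3)) * [0.0, 0.0, 1.0]  # pure blue
# Alpha matte: 1 = fully foreground, 0 = fully background, values in between = mixed pixels
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])[..., None]  # add a channel axis for broadcasting
# The compositing equation that matting models effectively invert
composite = alpha * foreground + (1 - alpha) * background
print(composite)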
Image matting involves accurately estimating the foreground object in an image. The technique is used in many applications, a familiar example being the background blur applied during video calls or when capturing portrait selfies on smartphones. ViTMatte is a recent addition to the Transformers library, a model designed specifically for image matting. It uses a vision transformer (ViT) as its backbone, complemented by a lightweight decoder, which makes it capable of distinguishing intricate details such as individual hairs.
Learning objectives
In this article, we will cover:
- The ViTMatte model and how it performs image matting.
- A step-by-step guide to using it, with code.
This article was published as part of the Data Science Blogathon.
Understanding the ViTMatte model
The ViTMatte architecture is built on a vision transformer (ViT), which serves as the model's backbone. The key advantage of this design is that the backbone does the heavy lifting, benefiting from large-scale self-supervised pre-training, which translates into strong image matting performance. The model was introduced in the paper “Boosting Image Matting with Pretrained Plain Vision Transformers” by Jingfeng Yao, Xinggang Wang, Shusheng Yang, and Baoyuan Wang. It leverages plain Vision Transformers (ViTs) for image matting, the task of accurately estimating the foreground object in images and videos.
ViTMatte’s key contributions, as described in the paper, are as follows:
- Hybrid attention mechanism: ViTMatte uses a hybrid attention mechanism combined with a convolutional “neck”. This helps the ViT backbone strike a balance between performance and computation in matting.
- Detail capture module: To recover the fine-grained information that matting depends on, ViTMatte introduces a detail capture module consisting of lightweight convolutions that complement the backbone's features.
ViTMatte inherits several traits from ViTs, including various pre-training strategies, optimized architectural design, and adaptive inference strategies.
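If you want to see how these pieces are exposed in the Transformers implementation, you can inspect the model's configuration. The snippet below is a small exploratory sketch; it assumes the VitMatteConfig and VitMatteForImageMatting classes from Transformers, and the printed values depend on the configuration or checkpoint you use:
from transformers import VitMatteConfig, VitMatteForImageMatting
# Build a ViTMatte model from its default configuration (random weights, no download needed)
config = VitMatteConfig()
model = VitMatteForImageMatting(config)
# Inspect the ViT backbone configuration (patch size, hidden size, number of layers, ...)
print(config.backbone_config)
# Rough parameter count, to get a feel for the model's size
num_params = sum(p.numel() for p in model.parameters())
print(f"Approximate number of parameters: {num_params / 1e6:.1f}M")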
State-of-the-art performance
The model was evaluated on Composition-1k and Distinctions-646, two standard benchmarks for image matting. On both, ViTMatte achieved state-of-the-art performance, surpassing previous matting methods by a clear margin. This underlines ViTMatte's potential in the field of image matting.
A diagram in the paper gives an overview of ViTMatte alongside other plain vision transformer applications. Previous approaches, shown in figure (a), relied on the simple feature pyramid designed by ViTDet. Figure (b) presents the new adaptation strategy for image matting, ViTMatte, which uses simple convolution layers to extract detail from the image and combines it with the feature map generated by the vision transformer (ViT).
Practical implementation
Let’s move on to the practical implementation of ViTMatte and walk through the steps needed to take advantage of its capabilities. We cannot continue this article without mentioning Niels Rogge's constant efforts for the Hugging Face ecosystem; the ViTMatte contribution to Hugging Face is attributed to Niels (https://huggingface.co/nielsr), and the original ViTMatte code can be found at https://github.com/hustvl/ViTMatte. Let’s dive into the code!
Setting up the environment
To get started with ViTMatte, you must first set up your environment. The complete tutorial code is available at https://github.com/inuwamobarak/ViTMatte. Start by installing the Transformers library, which includes the ViTMatte model, using the following command:
!pip install -q git+https://github.com/huggingface/transformers.git
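The rest of the walkthrough also uses PyTorch, torchvision, Pillow, matplotlib, and requests. These are already available in most Colab-style environments; if you are working locally and any of them are missing, install them in the usual way (adjust to your own setup):
!pip install -q torch torchvision pillow matplotlib requests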
Loading the image and trimap
In image matting, we manually label a hint map known as a trimap, which outlines the foreground in white, the background in black, and the unknown regions in gray. The ViTMatte model expects both the input image and the trimap in order to perform image matting.
The following code demonstrates how to load an image and its corresponding trimap:
# Import necessary libraries
import matplotlib.pyplot as plt
from PIL import Image
import requests
# Load the image and trimap
url = "https://github.com/hustvl/ViTMatte/blob/main/demo/bulb_rgb.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
url = "https://github.com/hustvl/ViTMatte/blob/main/demo/bulb_trimap.png?raw=true"
trimap = Image.open(requests.get(url, stream=True).raw)
# Display the image and trimap
plt.figure(figsize=(15, 15))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.subplot(1, 2, 2)
plt.imshow(trimap)
plt.show()
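In this demo the trimap is shipped alongside the example image. If you only have a binary segmentation mask for your own images, a rough trimap can be derived from it by eroding the mask to get confident foreground and dilating it to bound the unknown band. The sketch below illustrates the idea with scipy; the mask, the pixel values (0, 128, 255), and the number of iterations are illustrative assumptions rather than part of the original tutorial:
import numpy as np
from scipy import ndimage
from PIL import Image
def mask_to_trimap(mask: np.ndarray, iterations: int = 10) -> Image.Image:
    """Turn a binary foreground mask of shape (H, W) into a rough trimap image."""
    sure_fg = ndimage.binary_erosion(mask, iterations=iterations)    # shrink: confident foreground
    maybe_fg = ndimage.binary_dilation(mask, iterations=iterations)  # grow: outside this is background
    trimap = np.zeros(mask.shape, dtype=np.uint8)  # 0 = background
    trimap[maybe_fg] = 128                         # 128 = unknown band
    trimap[sure_fg] = 255                          # 255 = foreground
    return Image.fromarray(trimap)
# Example usage with a made-up square mask (replace with your own segmentation mask)
dummy_mask = np.zeros((256, 256), dtype=bool)
dummy_mask[64:192, 64:192] = True
rough_trimap = mask_to_trimap(dummy_mask)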
Loading the ViTMatte model
Next, let’s load the ViTMatte model and its processor from the hub. The processor is responsible for image preprocessing, while the model itself is the core of image matting.
# Import ViTMatte related libraries
from transformers import VitMatteImageProcessor, VitMatteForImageMatting
# Load the processor and model
processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-distinctions-646")
model = VitMatteForImageMatting.from_pretrained("hustvl/vitmatte-small-distinctions-646")
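Here we load the small variant trained on the Distinctions-646 dataset. The Hugging Face Hub also lists other ViTMatte checkpoints under the hustvl organization, such as a Composition-1k variant; the identifier below reflects the Hub listing at the time of writing, so verify it before relying on it:
# An alternative checkpoint trained on Composition-1k (verify the name on the Hub before use)
checkpoint = "hustvl/vitmatte-small-composition-1k"
processor = VitMatteImageProcessor.from_pretrained(checkpoint)
model = VitMatteForImageMatting.from_pretrained(checkpoint)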
Performing a forward pass
With the image, trimap, and model in place, let's preprocess the inputs and run a forward pass to obtain the predicted alpha values, which represent the transparency of each pixel in the image.
# Import necessary libraries
import torch
# Prepare the inputs: the processor combines the image and trimap into pixel values
pixel_values = processor(images=image, trimaps=trimap, return_tensors="pt").pixel_values
# Perform a forward pass
with torch.no_grad():
    outputs = model(pixel_values)
# Extract the alpha values
alphas = outputs.alphas.flatten(0, 2)
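Before compositing anything, it can be useful to look at the predicted alpha matte itself. The short snippet below simply visualizes the alphas tensor with matplotlib (note that the processor may pad the input, so the matte can be slightly larger than the original image):
# Visualize the predicted alpha matte (values near 1 = foreground, near 0 = background)
plt.figure(figsize=(7, 7))
plt.imshow(alphas, cmap="gray")
plt.title("Predicted alpha matte")
plt.show()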
Viewing the foreground
To display the foreground object, we can use the following code, which extracts the foreground of the image based on the predicted alpha values:
# Import necessary libraries
import PIL
from torchvision.transforms import functional as F
# Convert the predicted alpha matte into an 8-bit grayscale PIL image
prediction = F.to_pil_image((alphas * 255).to(torch.uint8))
# Define a function to calculate the foreground
def cal_foreground(image: PIL.Image, alpha: PIL.Image):
    image = image.convert("RGB")
    alpha = alpha.convert("L")
    alpha = F.to_tensor(alpha).unsqueeze(0)
    image = F.to_tensor(image).unsqueeze(0)
    # Composite against a white background using the alpha matte
    foreground = image * alpha + (1 - alpha)
    foreground = foreground.squeeze(0).permute(1, 2, 0).numpy()
    return foreground
# Calculate and display the foreground
fg = cal_foreground(image, prediction)
plt.figure(figsize=(7, 7))
plt.imshow(fg)
plt.show()
Background replacement
A great use of image matting is replacing the background with a new one. The following code shows how to combine the predicted alpha matte with a new background image:
# Load the new background image
url = "https://github.com/hustvl/ViTMatte/blob/main/demo/new_bg.jpg?raw=true"
background = Image.open(requests.get(url, stream=True).raw).convert("RGB")
plt.imshow(background)
plt.show()
# Define a function to merge the foreground with a new background
def merge_new_bg(image, background, alpha):
    image = image.convert("RGB")
    bg = background.convert("RGB")
    alpha = alpha.convert("L")
    image = F.to_tensor(image)
    bg = F.to_tensor(bg)
    # Resize the background to match the foreground image's height and width
    bg = F.resize(bg, list(image.shape[-2:]))
    alpha = F.to_tensor(alpha)
    new_image = image * alpha + bg * (1 - alpha)
    new_image = new_image.permute(1, 2, 0).numpy()
    return new_image
# Merge with the new background
new_image = merge_new_bg(image, background, prediction)
plt.figure(figsize=(7, 7))
plt.imshow(new_image)
plt.show()
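If you want to keep the composite rather than just display it, it can be converted back to an 8-bit image and saved with Pillow; the output filename below is just an example:
import numpy as np
# Convert the float composite (values in [0, 1]) back to an 8-bit image and save it
result = Image.fromarray((np.clip(new_image, 0, 1) * 255).astype(np.uint8))
result.save("composited_image.png")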
Find the complete code in the repository linked above, and feel free to follow my GitHub. ViTMatte is a powerful addition to image matting, making it easy to accurately estimate foreground objects in images and videos. Video conferencing tools such as Zoom could use this kind of technology to remove or replace backgrounds effectively.
Conclusion
ViTMatte is an innovative addition to the world of image matting, making it easier than ever to accurately estimate foreground objects in images and videos. By leveraging pre-trained vision transformers, ViTMatte delivers strong results with a simple architecture. By following the steps outlined in this article, you can take advantage of ViTMatte's capabilities to improve image matting and explore creative applications such as background replacement. Whether you are a developer, a researcher, or simply curious about the latest advances in computer vision, ViTMatte is a valuable tool to know.
Key takeaways:
- ViTMatte is a model that uses plain Vision Transformers (ViTs) to excel at image matting, accurately estimating the foreground object in images and videos.
- ViTMatte incorporates a hybrid attention mechanism and a detail capture module to balance performance and computation, making it efficient and robust for image matting.
- ViTMatte has achieved state-of-the-art performance on benchmark datasets, outperforming previous image matting methods by a clear margin.
- It inherits properties from ViTs, including pre-training strategies, a concise architectural design, and flexible inference strategies.
Frequently asked questions
Q1: What is image matting?
A1: Image matting is the process of accurately estimating the foreground object in images and videos. It is crucial for applications such as background blur in video calls and portrait photography.
Q2: What makes ViTMatte different from previous matting methods?
A2: ViTMatte leverages plain vision transformers (ViTs) and an innovative attention mechanism to achieve state-of-the-art performance and adaptability in image matting.
Q3: What are ViTMatte's key contributions?
A3: The key contributions of ViTMatte include introducing ViTs to image matting, a hybrid attention mechanism, and a detail capture module. It inherits the strengths of ViTs and achieves strong performance on benchmark datasets.
Q4: Who contributed ViTMatte to Hugging Face?
A4: ViTMatte was contributed to Hugging Face by Niels Rogge (nielsr), based on the work of Yao et al. (2023); the original code can be found at https://github.com/hustvl/ViTMatte.
Q5: What applications does image matting with ViTMatte enable?
A5: Image matting with ViTMatte opens up creative possibilities such as background replacement, artistic effects, and video-conferencing features like virtual backgrounds, giving developers and researchers new ways to enhance images and videos.
The media shown in this article is not the property of Analytics Vidhya and is used at the author’s discretion.