A visually appealing, animated 3D avatar is a key entry point into the digital world, which plays an ever larger role in modern life for socializing, shopping, gaming, and other activities. A good avatar should be attractive and customized to match the user's appearance. Many popular avatar systems, such as Zepeto and ReadyPlayer, adopt stylized, cartoonish looks because they are fun and approachable. However, creating and customizing an avatar by hand usually involves painstaking adjustments to many graphical assets, which is time-consuming and challenging for novice users. In this research, the authors investigate the automated generation of stylized 3D avatars from a single front-facing selfie.
Specifically, given a selfie image, the algorithm predicts an avatar vector as the complete configuration for a graphics engine to generate a 3D avatar and render avatar images from predefined 3D assets. The avatar vector consists of parameters specific to the predefined assets, which can be continuous (e.g., head length) or discrete (e.g., hair type). A naive solution is to annotate a set of selfie images and train a model to predict the avatar vector through supervised learning, but large-scale annotation would be needed to cover the large variety of assets (usually hundreds). To reduce annotation cost, the authors instead propose a self-supervised approach: they train a differentiable imitator that replicates the renderings of the graphics engine, so that the produced avatar image can be automatically matched to the selfie image using identity and semantic segmentation losses.
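To make this self-supervised objective concrete, here is a minimal PyTorch-style sketch of such a matching loss. It assumes a pretrained differentiable imitator plus off-the-shelf identity-embedding and face-parsing networks; the function names and loss weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the self-supervised matching objective: a pretrained,
# differentiable "imitator" stands in for the graphics engine so gradients can
# flow from image-space losses back to the predicted avatar vector.
# `imitator`, `id_encoder`, and `face_parser` are assumed pretrained networks.
def self_supervised_loss(avatar_vector, selfie, imitator, id_encoder, face_parser,
                         w_id=1.0, w_seg=1.0):
    rendered = imitator(avatar_vector)  # differentiable stand-in for the renderer
    # Identity loss: the rendered avatar should share the selfie's identity embedding.
    id_loss = 1.0 - F.cosine_similarity(id_encoder(rendered),
                                        id_encoder(selfie), dim=-1).mean()
    # Semantic segmentation loss: regions such as hair, skin, and glasses should line up.
    seg_target = face_parser(selfie).argmax(dim=1)   # pseudo-labels from the selfie
    seg_loss = F.cross_entropy(face_parser(rendered), seg_target)
    return w_id * id_loss + w_seg * seg_loss
```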
The proposed architecture consists of three stages: portrait stylization, self-supervised avatar parameterization, and avatar vector conversion. As shown in Fig. 1, identity information (hairstyle, skin tone, glasses, etc.) is retained throughout the pipeline while the domain gap is closed gradually over the three stages. The portrait stylization stage focuses on crossing from the real to the stylized 2D appearance domain. It stays in image space and turns the input selfie into a stylized avatar image. Naively applying existing stylization techniques for this translation would preserve attributes such as facial expression, which needlessly complicates the later stages. A skeleton of the three-stage flow is sketched below.
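The following sketch outlines the pipeline with stub modules standing in for the real components (a stylization GAN, a relaxed avatar-vector predictor, and the non-differentiable graphics engine). All class and function names here are hypothetical placeholders chosen to mirror the three stages, not the authors' actual code.

```python
import torch
import torch.nn as nn

class PortraitStylizer(nn.Module):
    """Stage 1: real selfie -> stylized portrait (stays in image space)."""
    def forward(self, selfie):
        return selfie  # identity stub; the real stage is a modified AgileGAN

class AvatarParameterizer(nn.Module):
    """Stage 2: stylized portrait -> relaxed avatar vector."""
    def __init__(self, vector_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vector_dim))
    def forward(self, stylized):
        return self.encoder(stylized)

def convert_to_strict(relaxed_vector):
    """Stage 3: snap relaxed entries back to valid discrete asset choices."""
    return relaxed_vector  # stub; the paper uses a cascaded search here

def selfie_to_avatar_vector(selfie):
    stylized = PortraitStylizer()(selfie)        # stage 1
    relaxed = AvatarParameterizer()(stylized)    # stage 2
    return convert_to_strict(relaxed)            # stage 3 -> graphics engine

# Example usage with a dummy 256x256 RGB selfie tensor:
vector = selfie_to_avatar_vector(torch.rand(1, 3, 256, 256))
```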
As a result, they developed a modified version of AgileGAN that ensures expression uniformity while preserving user identity. The self-supervised avatar parameterization stage handles the transition from the pixel-based image to the vector-based avatar. The authors found that strictly enforcing parameter discreteness prevents the optimization from converging. To overcome this, they adopt a lenient formulation called the relaxed avatar vector, which encodes discrete parameters as continuous one-hot vectors. To make training differentiable, they trained an imitator to mimic the behavior of the non-differentiable graphics engine. In the avatar vector conversion stage, all discrete parameters are converted back to hard one-hot vectors, crossing from the relaxed avatar vector space to the strict avatar vector space. The graphics engine can then build the final avatars and render them from the strict avatar vector. Here the authors use a novel cascaded search strategy that produces results superior to direct quantization. They evaluate their method with a human preference study, comparing against baseline approaches such as F2P and against manual creation to measure how well personal identity is preserved. Their results score substantially higher than the baselines and come close to those of manual creation.
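To illustrate the relaxation, the snippet below encodes a discrete choice (e.g., one hair type out of several assets) as a continuous one-hot vector and then snaps an optimized soft entry back to a hard selection. Plain argmax quantization is shown only for clarity; per the paper, a cascaded search over the discrete options replaces this final step and performs better.

```python
import torch

def relax_discrete(choice_index, num_options):
    """Encode a discrete asset choice as a continuous one-hot vector."""
    return torch.nn.functional.one_hot(
        torch.tensor(choice_index), num_options).float()

def quantize_to_strict(relaxed):
    """Naive conversion: pick the highest-scoring option as a hard one-hot vector."""
    hard = torch.zeros_like(relaxed)
    hard[relaxed.argmax()] = 1.0
    return hard

# Example: a relaxed "hair type" entry after optimization...
relaxed_hair = torch.tensor([0.10, 0.70, 0.15, 0.05])
print(quantize_to_strict(relaxed_hair))  # tensor([0., 1., 0., 0.])
```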
They also provide an ablation study to support their pipeline design decisions. In summary, their technical contributions include the following:
• A novel self-supervised learning framework to produce high-quality, stylized 3D avatars with a mix of continuous and discrete parameters.
• A novel approach to crossing the large style domain gap in stylized 3D avatar creation through portrait stylization.
• A cascaded relaxation-and-search pipeline to address the convergence problem in discrete avatar parameter optimization.
You can find a video demo of the paper on their site.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.