With the advent of large-scale image-text datasets and sophisticated generative architectures such as diffusion models, generative models have made great progress in producing high-fidelity 2D images. These models let users create realistic images directly from text prompts, largely eliminating manual effort. 3D generative models, by contrast, still face significant challenges, because 3D training data is far less diverse and accessible than its 2D counterpart. The supply of high-quality 3D assets remains limited, since building them manually in modeling software is laborious and requires specialized expertise.
Recently, researchers have explored using pretrained text-to-image generative models to create high-fidelity 3D content and address this problem. These models encode rich priors about the geometry and appearance of objects, which makes it easier to produce realistic and varied 3D assets. In this study, researchers from Tencent, Nanyang Technological University, Fudan University, and Zhejiang University present a novel method for creating stylized 3D avatars that builds on pretrained text-to-image diffusion models, letting users specify the avatar's style and facial features through text prompts. They adopt EG3D, a GAN-based 3D generation network, because it offers several benefits.
First, EG3D trains on calibrated images rather than 3D data, so the diversity and realism of the generated 3D models can keep improving as the image data improves, which is comparatively easy to achieve with 2D imagery. Second, because the training images do not require strict multi-view consistency in appearance, each view can be generated independently, which helps control randomness during image synthesis. To produce the calibrated 2D training images for EG3D, their method uses a Stable Diffusion-based ControlNet, which generates images guided by predefined poses.
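As a rough illustration of this pose-guided generation step, the sketch below uses the Hugging Face diffusers library with an OpenPose-conditioned ControlNet. The checkpoint names, prompt, and pose file are assumptions made for illustration and are not the authors' exact setup.

```python
# Minimal sketch of pose-guided image synthesis with a Stable Diffusion ControlNet,
# in the spirit of how the paper creates calibrated 2D training images for EG3D.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# An OpenPose-conditioned ControlNet; the paper's exact conditioning may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A pose rendering whose camera parameters are known, so the generated image
# inherits a calibration that can be reused for EG3D training (hypothetical file).
pose_image = load_image("pose_front_yaw0.png")

image = pipe(
    "a 3D cartoon-style avatar, front view, studio lighting",
    image=pose_image,
    num_inference_steps=30,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("avatar_front.png")
```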
These poses can be synthesized or sampled from avatars in existing engines, and the camera parameters of the pose images can be reused for training. Even with accurate pose images as guidance, however, ControlNet often struggles to generate views at large angles, such as the back of the head, and these failure cases would hinder the generation of complete 3D models. The researchers take two measures to address this. First, they design view-specific prompts for different views during image generation, which drastically reduces failure cases. Even with view-specific prompts, though, the synthesized images may only partially match the pose images.
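The snippet below sketches what such view-specific prompting might look like in practice: the prompt text is adjusted according to the camera yaw of the pose image so that extreme views are described explicitly. The yaw thresholds, wording, and function name are hypothetical, not taken from the paper.

```python
def view_specific_prompt(base_prompt: str, yaw_deg: float) -> str:
    """Append a view description chosen from the camera yaw (0 = frontal)."""
    yaw = abs(yaw_deg)
    if yaw < 30:
        view = "front view of the face"
    elif yaw < 90:
        view = "side view of the face"
    elif yaw < 150:
        view = "view from behind the shoulder, hair visible"
    else:
        view = "back of the head, no face visible"
    return f"{base_prompt}, {view}"

# Example: the same avatar description yields different prompts per camera pose.
print(view_specific_prompt("a 3D cartoon-style avatar", yaw_deg=170.0))
# -> "a 3D cartoon-style avatar, back of the head, no face visible"
```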
Second, to handle this remaining mismatch, they design a coarse-to-fine pose-aware discriminator for 3D GAN training. Each training image in their system carries both a coarse and a fine pose annotation, and one of the two is selected at random during training: reliable views such as the frontal face are given a high probability of using the fine annotation, while the remaining views rely more on the coarse one. This scheme yields more accurate and diverse 3D models even when the training images carry noisy pose annotations. In addition, the authors build a latent diffusion model in StyleGAN's latent style space to enable image-conditioned 3D generation.
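A minimal sketch of this sampling scheme, assuming each image comes with a fine and a coarse pose label plus a known yaw angle; the probabilities and frontal threshold are illustrative values, not the paper's.

```python
import random

def sample_pose_annotation(fine_pose, coarse_pose, yaw_deg: float,
                           p_fine_frontal: float = 0.9,
                           p_fine_other: float = 0.3):
    """Return the pose label used to condition the discriminator for this image.

    Near-frontal views are trusted, so the fine annotation is picked with high
    probability; large-angle views fall back to the coarse annotation more often.
    """
    p_fine = p_fine_frontal if abs(yaw_deg) < 45 else p_fine_other
    return fine_pose if random.random() < p_fine else coarse_pose
```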
Because the style code is low-dimensional, expressive, and compact, the diffusion model can be trained quickly. To train it, they directly sample image and style-code pairs from their trained 3D generators. They conducted extensive experiments on several large-scale datasets to measure the effectiveness of the proposed approach, and the results show that the method surpasses current state-of-the-art techniques in both visual quality and diversity. In conclusion, this research presents a novel framework that leverages pretrained text-to-image diffusion models to produce high-fidelity 3D avatars.
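The sketch below illustrates how such image and style-code pairs could be collected from a trained 3D generator to supervise the latent diffusion model. The generator interface (mapping, synthesis, camera sampling) and the image encoder are placeholders for illustration, not the authors' actual code.

```python
import torch

@torch.no_grad()
def sample_style_pairs(generator, image_encoder, n_samples: int, z_dim: int = 512):
    """Return (image_embedding, style_code) pairs for diffusion-model training."""
    pairs = []
    for _ in range(n_samples):
        z = torch.randn(1, z_dim)           # latent noise
        cam = generator.sample_camera()     # hypothetical camera sampler
        w = generator.mapping(z, cam)       # compact style code in style space
        img = generator.synthesis(w, cam)   # rendered view of the avatar
        cond = image_encoder(img)           # conditioning embedding of the image
        pairs.append((cond, w))
    return pairs

# The latent diffusion model is then trained to denoise style codes w conditioned
# on the image embedding, which is fast because w is low-dimensional and compact.
```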
The framework greatly increases the flexibility of avatar creation by allowing facial features and styles to be specified via text prompts. To address image-pose misalignment, the authors also propose a coarse-to-fine pose-aware discriminator, which makes better use of image data with inaccurate pose annotations. Finally, they introduce a conditional generation module that enables image-conditioned 3D generation in the latent style space. This module further increases the adaptability of the framework and allows users to create customized 3D avatars to their liking. The authors also plan to open-source their code.
Check out the Paper and GitHub link.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.