Generative models, such as Generative Adversarial Networks (GANs), can produce realistic images of clothed people after being trained on an extensive collection of images. Although the resulting output is a 2D image, many applications require diverse, high-quality 3D virtual avatars that allow control over camera pose and viewpoint while remaining 3D-consistent. To address this demand, the research community is exploring generative models capable of automatically generating 3D shapes of humans and clothing conditioned on input parameters such as pose and body shape. Despite considerable advances, most existing methods ignore texture and rely on precise, clean 3D scans of humans for training. Acquiring such scans is expensive, limiting their availability and diversity.
Learning to generate 3D human shapes and textures from unstructured image data is a challenging and underconstrained problem: each training instance exhibits a unique shape and appearance, observed only once from a specific viewpoint and pose. While recent 3D-aware GANs have shown impressive results for rigid objects, these methods struggle to generate realistic humans due to the complexity of human articulation. Although some recent work demonstrates the feasibility of learning articulated humans, existing approaches suffer from limited quality and resolution and have difficulty modeling loose clothing.
The paper discussed in this article presents a novel method for 3D human generation from 2D image collections, achieving state-of-the-art image quality and geometry while effectively modeling loose clothing.
An overview of the proposed method follows.
The method adopts a holistic design capable of modeling both the human body and loose clothing, departing from the common approach of representing humans as separate body parts. Multiple discriminators are incorporated to enhance geometric detail and to focus on perceptually important regions.
A novel generator design addresses the goals of high image quality and flexible handling of loose clothing by modeling 3D humans holistically in a canonical space. The articulation module, Fast-SNARF, is responsible for the movement and positioning of body parts and is adapted to the generative setting. Furthermore, the model adopts empty-space skipping, which accelerates rendering by omitting regions without significant content, improving overall efficiency.
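The idea behind empty-space skipping can be sketched in a few lines: before evaluating an expensive generator network at every sample along a camera ray, samples falling in coarsely known empty voxels are discarded. The sketch below is illustrative only and assumes hypothetical names (`filter_samples`, `occupancy_grid`); it is not taken from the paper's released code.

```python
import numpy as np

def filter_samples(points, occupancy_grid, grid_min, grid_max):
    """Keep only ray samples that land inside occupied voxels.

    points:         (N, 3) sample positions along camera rays
    occupancy_grid: (R, R, R) boolean grid marking coarse occupancy
    grid_min/max:   scalar bounds of the cubic volume
    """
    res = occupancy_grid.shape[0]
    # Map world coordinates to voxel indices.
    idx = ((points - grid_min) / (grid_max - grid_min) * res).astype(int)
    idx = np.clip(idx, 0, res - 1)
    mask = occupancy_grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    # Only the surviving points need a (costly) network evaluation.
    return points[mask], mask

# Toy usage: only the lower half of the volume is "occupied".
grid = np.zeros((8, 8, 8), dtype=bool)
grid[:, :4, :] = True
pts = np.array([[0.1, 0.1, 0.5], [0.1, 0.9, 0.5]])
kept, mask = filter_samples(pts, grid, grid_min=0.0, grid_max=1.0)
print(mask)  # → [ True False]
```

In practice the occupancy grid itself is updated from the generator's density predictions during training, so the skipped region tracks the current shape estimate.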
The modular 2D discriminators are guided by normal information; that is, they take into account the orientation of surfaces in 3D space. This guidance helps the model focus on regions that are perceptually important to human observers and prioritizes geometric detail, contributing to a more accurate, realistic, and visually pleasing representation of 3D humans.
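To make the normal-guidance idea concrete: a normal map can be derived from a rendered depth map and given to a 2D discriminator alongside (or instead of) RGB, so the discriminator judges geometry, not just appearance. This is a minimal illustrative sketch, not the authors' implementation; `depth_to_normals` is a hypothetical helper.

```python
import numpy as np

def depth_to_normals(depth):
    """Estimate per-pixel surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)
    # Normal direction is (-dz/dx, -dz/dy, 1), normalized per pixel.
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals

# A planar depth map yields normals pointing straight at the camera.
flat = np.full((4, 4), 2.0)
n = depth_to_normals(flat)
print(n[0, 0])  # → [0. 0. 1.]
```

A discriminator fed such normal maps penalizes bumpy or implausible surfaces even when the corresponding RGB rendering looks acceptable, which is why this kind of guidance sharpens geometric detail.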

The reported experimental results demonstrate a significant improvement of the proposed method over previous 3D- and articulation-aware methods in terms of geometry and texture quality, validated quantitatively, qualitatively, and through perceptual studies.
In summary, the contributions include a generative model of 3D articulated humans with state-of-the-art appearance and geometry, an efficient generator for loose clothing, and specialized discriminators that improve visual and geometric fidelity. The authors plan to release the code and models for further exploration.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) in Klagenfurt. He currently works at the ATHENA Christian Doppler Laboratory, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.