Image generation has never been easier. With the rise of generative AI models, getting started takes little more than a prompt. It’s like having a designer working for you: all you need to do is guide it toward the image you want to see.
The same applies to image editing. These generative models can be used not only to generate new images but also to edit existing ones, thanks to recent advances driven by extensive research.
All of this was made possible by denoising diffusion models, which have transformed the image generation domain entirely. It was one of the biggest leaps we have witnessed in this area, and these models have since been applied to image, audio, and video applications.
Still, we are missing one component here, if you have noticed. Where is the third dimension? Image generation has already reached a point of photorealism, and there have been numerous attempts at video and audio generation, which are getting better day by day. One can expect them to reach a truly realistic level soon as well. But why don’t we hear much about 3D object generation?
We live in a 3D world, one characterized by both static and dynamic 3D objects. This makes bridging the gap between 2D and 3D a formidable challenge. Let us meet 3DVADER, a new challenger that tries to bridge this gap.
3DVADER addresses the core challenge in 3D generative models: how to seamlessly combine the geometric detail of the 3D world with the impressive capabilities of modern image generation techniques.
3DVADER rethinks how we design and train models for 3D content. Unlike previous methods, which struggled with scalability and diversity, this approach boldly tackles these challenges, offering a fresh perspective on the future of 3D content generation.
3DVADER achieves this with a unique approach. Instead of relying on conventional autoencoders for training, it introduces a volumetric autodecoder: each object is assigned a 1D latent vector, which is decoded into a volumetric 3D representation. This removes the need for 3D supervision and caters to a wide range of object categories. The approach learns 3D representations from 2D observations, using rendering consistency as its guiding principle: the decoded volume, rendered from a camera viewpoint, must match the corresponding 2D image. This novel representation also accommodates articulated parts, a necessity for modeling non-rigid objects.
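To make the autodecoder idea concrete, here is a minimal PyTorch sketch of the general pattern: per-object latent codes are optimized directly (there is no encoder), a decoder maps each code to a voxel grid, and the only supervision is a 2D reconstruction loss on rendered views. All names, sizes, and the toy renderer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VolumetricAutoDecoder(nn.Module):
    def __init__(self, num_objects, latent_dim=128, grid_size=16):
        super().__init__()
        # One learnable 1D latent vector per training object.
        # No encoder exists: the codes are optimized directly.
        self.latents = nn.Embedding(num_objects, latent_dim)
        self.grid_size = grid_size
        # Decode a latent into a voxel grid of (density, r, g, b).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, grid_size ** 3 * 4),
        )

    def forward(self, object_ids):
        z = self.latents(object_ids)          # (B, latent_dim)
        vol = self.decoder(z)                 # (B, G^3 * 4)
        g = self.grid_size
        return vol.view(-1, 4, g, g, g)       # (B, 4, G, G, G)

def render_front_view(volume):
    # Toy placeholder renderer: alpha-composite front-to-back along
    # the depth axis. A real pipeline would ray-march from an
    # arbitrary camera pose.
    density = torch.sigmoid(volume[:, :1])    # (B, 1, G, G, G)
    rgb = torch.sigmoid(volume[:, 1:])        # (B, 3, G, G, G)
    trans = torch.cumprod(1.0 - density + 1e-6, dim=2)
    trans = torch.cat(
        [torch.ones_like(trans[:, :, :1]), trans[:, :, :-1]], dim=2)
    weights = density * trans                 # contribution per voxel
    return (weights * rgb).sum(dim=2)         # (B, 3, G, G)

# Training loop: only 2D images supervise the 3D representation.
model = VolumetricAutoDecoder(num_objects=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
object_ids = torch.arange(4)                  # a mini-batch of objects
target_views = torch.rand(4, 3, 16, 16)       # their 2D observations
for step in range(200):
    opt.zero_grad()
    rendered = render_front_view(model(object_ids))
    loss = torch.nn.functional.mse_loss(rendered, target_views)
    loss.backward()
    opt.step()
```

The key design point is that gradients from the 2D rendering loss flow back through the differentiable renderer into both the decoder weights and the per-object latents, so a 3D-consistent representation emerges without any 3D ground truth.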
The other issue concerns the dataset. Since images and monocular videos make up most of the available data, preparing a robust and versatile 3D dataset remains an open problem. Unlike previous approaches, which rely on painstakingly captured 3D data, 3DVADER leverages multi-view images and monocular videos to generate 3D-aware content. It navigates the limited diversity of object poses by remaining robust whether pose information is ground truth, estimated, or entirely absent during training (one common way to handle the missing-pose case is sketched below). Moreover, 3DVADER accommodates datasets spanning multiple categories of diverse objects, which addresses the scalability problem.
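When poses are unavailable, a standard trick in this line of work is to treat each view's camera pose as a free parameter and optimize it jointly with the model, since the rendering loss provides gradients for it too. The sketch below illustrates that general pattern; the class and parameterization are hypothetical, and the paper's actual pose-handling mechanism may differ.

```python
import torch
import torch.nn as nn

class LearnablePoses(nn.Module):
    """Per-view camera poses optimized jointly with the 3D model."""
    def __init__(self, num_views):
        super().__init__()
        # Axis-angle rotation (3) + translation (3) per training view,
        # initialized to the identity pose.
        self.poses = nn.Parameter(torch.zeros(num_views, 6))

    def forward(self, view_ids):
        return self.poses[view_ids]  # (B, 6)

# If ground-truth or estimated poses exist, they can be fed to the
# renderer directly; otherwise the rendering loss backpropagates into
# `poses`, refining the camera estimates during training.
```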
Overall, 3DVADER is a novel approach for generating static and articulated 3D assets, with a 3D autodecoder serving as its core. It can use existing camera supervision or learn this information during training, and it achieves superior generation performance compared to state-of-the-art alternatives.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.