In recent years, generative models for content production have advanced significantly, enabling user-controllable, high-quality synthesis of images and videos. Users can interactively generate and modify a high-resolution image from a 2D input label map using image-to-image translation techniques. However, current image-to-image translation methods operate only in 2D and do not explicitly account for the underlying 3D structure of the content. As seen in Figure 1, the researchers' goal is to make conditional image synthesis 3D-aware, allowing users to create 3D content, manipulate the viewpoint, and edit attributes (for example, modifying the shape of cars in 3D). Creating such 3D content that depends on human input is difficult: obtaining large datasets that pair user inputs with the desired 3D outputs is expensive for model training.
Although a user may wish to describe the details of 3D objects from various angles through 2D interfaces, producing 3D content typically requires user input from multiple views. Meanwhile, these inputs may not be 3D-consistent, giving mixed signals for 3D content creation. To overcome these problems, the researchers integrate 3D neural scene representations into conditional generative models. These representations also carry 3D semantic information, which facilitates cross-view editing and can later be rendered as 2D label maps from various angles. Only 2D supervision, in the form of image reconstruction and adversarial losses, is needed to learn this 3D representation.
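To make the idea concrete, below is a minimal, illustrative sketch of a conditional neural field that predicts color, density, and semantic logits for each 3D point from a latent code derived from the 2D user input. The class name, dimensions, and layer sizes are assumptions for illustration and are not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalSemanticField(nn.Module):
    """Toy conditional neural field: maps a 3D point plus a latent code
    (encoded from the 2D user input) to color, density, and semantic logits."""
    def __init__(self, latent_dim=256, num_classes=19, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden, 3)               # RGB
        self.density_head = nn.Linear(hidden, 1)             # volume density
        self.semantic_head = nn.Linear(hidden, num_classes)  # per-point label logits

    def forward(self, points, latent):
        # points: (N, 3) sampled along camera rays; latent: (latent_dim,)
        # broadcast the conditioning code to every sampled point
        h = self.mlp(torch.cat([points, latent.expand(points.shape[0], -1)], dim=-1))
        return self.color_head(h), self.density_head(h), self.semantic_head(h)
```

Volume rendering along camera rays would then aggregate these per-point outputs into both an RGB image and a 2D label map for any chosen viewpoint, which is what allows the same 3D content to be viewed and edited from different angles.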
In particular, their pixel-aligned conditional discriminator encourages the appearance and labels to look realistic and to remain pixel-aligned when rendered from novel views, while the reconstruction loss ensures alignment between the 2D user input and the corresponding 3D content. They also propose a cross-view consistency loss that requires latent codes to remain constant across different perspectives. They focus on the CelebAMask-HQ, AFHQ-Cat, and ShapeNet-Car datasets for 3D-aware semantic image synthesis. Their approach handles different kinds of 2D user input, such as segmentation maps and edge maps, and surpasses several 2D and 3D baselines, including SEAN, SofGAN, and variants of Pix2NeRF. Furthermore, they ablate the effects of different design decisions and show how their method can be used in applications such as cross-view editing and explicit user control over semantics and style.
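The sketch below illustrates how these three losses could fit together in a single training step, under the simplifying assumptions that a generator G renders an (image, label) pair from a latent code and a camera pose, D is a conditional discriminator over image–label pairs, and E encodes a label map back into a latent code. All names and signatures here are hypothetical and do not reflect the paper's actual code.

```python
import torch
import torch.nn.functional as F

def generator_losses(G, D, E, input_label, real_image, pose_in, pose_novel):
    """Hedged sketch of the combined objective (generator side only).
    G, D, and E are placeholders for the conditional 3D generator, the
    pixel-aligned conditional discriminator, and a label-map encoder."""
    z = E(input_label)  # latent code inferred from the 2D user input

    # Reconstruction loss: the input view, re-rendered from the 3D representation,
    # should match the user's 2D label map and the paired training image.
    rgb_in, label_logits_in = G(z, pose_in)
    loss_recon = (F.l1_loss(rgb_in, real_image)
                  + F.cross_entropy(label_logits_in, input_label.argmax(dim=1)))

    # Adversarial loss: at a novel view, the discriminator scores the rendered
    # image together with its rendered label map, so realism and pixel alignment
    # between appearance and labels are encouraged jointly.
    rgb_nv, label_logits_nv = G(z, pose_novel)
    loss_adv = F.softplus(-D(rgb_nv, label_logits_nv)).mean()  # non-saturating GAN loss

    # Cross-view consistency loss: re-encoding the novel-view rendering should
    # recover (approximately) the same latent code.
    loss_consistency = F.mse_loss(E(label_logits_nv.softmax(dim=1)), z)

    return loss_recon + loss_adv + loss_consistency
```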
More results and code are available on their website. The current approach has two significant drawbacks. First, it primarily focuses on modeling the appearance and geometry of a single object category; determining a canonical pose for generic scenes is a difficult task, and extending the approach to more complex scene datasets with many objects is an interesting next step. Second, training the model requires camera poses associated with each training image, although the approach does not require poses at inference time. Eliminating the need for pose information would further expand the range of applications.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 14k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.