Novel view synthesis is a hot topic in computer graphics and vision, with applications such as virtual and augmented reality, immersive photography, and the creation of digital replicas. The goal is to generate additional views of an object or scene from a limited set of initial viewpoints. This task is particularly demanding because newly synthesized views must account for occluded areas and previously unseen regions.
Recently, neural radiance fields (NeRFs) have shown exceptional results in generating high-quality novel views. However, NeRF requires a large number of images, typically tens to hundreds, to capture a scene effectively, which makes it prone to overfitting and unable to generalize to new scenes.
Previous attempts have introduced generalizable NeRF models that condition the NeRF representation on image features extracted at the projections of 3D points into the input view. These approaches produce satisfactory results, particularly for target views close to the input image. However, when the target view differs significantly from the input, they produce blurry results. The challenge lies in resolving the uncertainty associated with large regions that are invisible in the input but revealed in novel views.
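To make the conditioning idea concrete, the sketch below shows how such a generalizable NeRF can look up an image feature for each 3D sample point: the point is projected into the input view with the camera parameters, a feature is sampled from a CNN feature map at that pixel, and the radiance MLP receives both the point and the feature. This is a minimal illustration in the spirit of these prior methods, not NerfDiff's actual implementation; all names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def condition_on_image_features(points_world, K, cam_pose, feat_map, mlp):
    """Hypothetical feature-conditioned NeRF query (illustrative only).

    points_world: (N, 3) 3D sample points along camera rays.
    K:            (3, 3) intrinsics of the input view.
    cam_pose:     (3, 4) world-to-camera extrinsics of the input view.
    feat_map:     (1, C, H, W) CNN features of the input image.
    mlp:          callable mapping (N, 3 + C) -> (N, 4) RGB + density.
    """
    N = points_world.shape[0]
    # Transform the points into the input camera's coordinate frame.
    pts_h = torch.cat([points_world, torch.ones(N, 1)], dim=-1)  # (N, 4)
    pts_cam = (cam_pose @ pts_h.T).T                             # (N, 3)
    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                  # (N, 2)
    # Normalize to [-1, 1] and bilinearly sample a feature per point.
    H, W = feat_map.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, N, 1, 2)
    feats = F.grid_sample(feat_map, grid, align_corners=True)    # (1, C, N, 1)
    feats = feats.squeeze(0).squeeze(-1).T                       # (N, C)
    # Condition the radiance MLP on [3D point, projected image feature].
    return mlp(torch.cat([points_world, feats], dim=-1))
```

Because the feature lookup is tied to the input view's camera, points far from the observed surface (e.g., in occluded regions) receive ambiguous features, which is precisely where these methods become blurry.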
An alternative approach to the uncertainty problem in single-image view synthesis is to use 2D generative models that predict novel views while conditioning on the input view. The risk with these methods, however, is that the generated images may be inconsistent with the underlying 3D structure.
To address this, a new technique called NerfDiff has been presented. NerfDiff is a framework designed to synthesize high-quality, multi-view-consistent images from a single input view. An overview of the workflow is presented in the following figure.
The proposed approach consists of two stages: training and fine-tuning.
During the training stage, a triplane-based NeRF model in camera space and a 3D-aware conditional diffusion model (CDM) are co-trained on a collection of scenes. In the fine-tuning stage, the NeRF representation is first initialized from the input image. The NeRF parameters are then fitted to a set of virtual images generated by the CDM, which is conditioned on the NeRF's rendered outputs.

However, a naive fine-tuning strategy that optimizes the NeRF parameters directly on the CDM outputs produces poor-quality representations, because the CDM's outputs are not consistent across views. To address this problem, the researchers propose NeRF-guided distillation, an alternating process that updates the NeRF representation while guiding the multi-view diffusion process. Specifically, the CDM provides additional information that resolves the uncertainty inherent in single-image view synthesis, while the NeRF model simultaneously guides the CDM to ensure consistency across multiple views during sampling. A minimal sketch of this loop is shown below.
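The following sketch illustrates the alternating structure of such a NeRF-guided distillation loop under stated assumptions: `render` and `guided_denoise` are hypothetical callables standing in for NerfDiff's actual renderer and guided diffusion sampler, and all hyperparameters are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def nerf_guided_distillation(nerf, cdm, render, guided_denoise,
                             input_image, virtual_cams,
                             n_steps=1000, lr=1e-3):
    """Illustrative alternating-update loop (not NerfDiff's exact algorithm).

    nerf:           NeRF module whose parameters are being fine-tuned.
    cdm:            conditional diffusion model (assumed pretrained).
    render:         callable (nerf, camera) -> image tensor.
    guided_denoise: callable running the CDM's reverse diffusion while
                    steering each denoising step toward a NeRF rendering.
    virtual_cams:   list of virtual camera poses around the scene.
    """
    opt = torch.optim.Adam(nerf.parameters(), lr=lr)
    for step in range(n_steps):
        cam = virtual_cams[step % len(virtual_cams)]
        with torch.no_grad():
            # 1) Render the virtual view from the current NeRF; this both
            #    conditions the CDM and anchors the diffusion trajectory.
            nerf_view = render(nerf, cam)
            # 2) Sample a refined view from the CDM, guided toward the NeRF
            #    rendering so the outputs stay multi-view consistent.
            target = guided_denoise(cdm, cond=input_image,
                                    guidance=nerf_view, camera=cam)
        # 3) Update the NeRF parameters to match the CDM's refined view.
        pred = render(nerf, cam)
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return nerf
```

The key design choice this captures is the two-way exchange: the CDM fills in plausible content for unseen regions (step 2), while the NeRF rendering constrains the diffusion trajectory so that the virtual images the NeRF is fitted to (step 3) agree with one another in 3D.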
Reported below are some of the results obtained with NerfDiff (where NGD stands for NeRF-guided distillation).
This was a brief overview of NerfDiff, a novel AI framework that generates consistent, high-quality novel views from a single input image. If you are interested, you can learn more about this technique via the links below.
Check out the Paper and Project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.