Artificial intelligence is evolving rapidly with the introduction of generative AI and large language models (LLMs). Well-known models such as GPT, BERT, and PaLM are notable additions to the long list of LLMs that are transforming the way humans and computers interact. In image generation, diffusion models have drawn much attention from researchers because they capture the complex probability distribution of an image data set and generate new samples that resemble the training data. 3D scene understanding is also evolving, enabling the development of geometry-free neural networks that can be trained on a large data set of scenes to learn scene representations. These networks generalize well to unseen scenes and objects, render views from a single or a few input images, and need only a few observations per scene for training.
By combining the capabilities of diffusion models and 3D scene representation learning models, a team of researchers from UC Berkeley, Google Research, and Google DeepMind has introduced DORSal (Diffusion for Object-centric Representations of Scenes et al.), an approach for generating novel views of 3D scenes by combining object-centric representations with diffusion decoders. DORSal is geometry-free: it learns the structure of the 3D scene solely from data, without the need for expensive volume rendering.
To generate views of 3D scenes, DORSal uses a video diffusion architecture that was initially created for image synthesis. The main idea is to condition the diffusion model on object-centric, slot-based scene representations. These representations capture crucial details about the objects in the scene and their properties. By conditioning the diffusion model on these object-centric representations, DORSal enables high-fidelity novel view synthesis of 3D scenes. It also retains object-level scene editing, allowing users to change and modify particular elements in the scene.
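To make the idea of conditioning on object slots more concrete, here is a minimal NumPy sketch of one common conditioning mechanism, cross-attention from image features to a set of slot vectors. It is an illustration under stated assumptions, not DORSal's actual implementation: the function `cross_attend`, the random weights, and all shapes are hypothetical, and the "edit" shown (dropping one slot) only demonstrates that the conditioned output depends on which slots are present.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(features, slots, w_q, w_k, w_v):
    """Hypothetical conditioning step: each feature vector attends over the slots.

    features: (N, d) per-pixel (or per-patch) denoiser features
    slots:    (S, d) object slots produced by some scene encoder
    Returns the conditioned features (N, d) and the attention weights (N, S).
    """
    q = features @ w_q                    # queries come from the image features
    k = slots @ w_k                       # keys come from the object slots
    v = slots @ w_v                       # values come from the object slots
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return features + attn @ v, attn      # residual connection keeps the input

# Toy inputs (assumed shapes, random values stand in for learned quantities).
rng = np.random.default_rng(0)
n_pixels, n_slots, dim = 16, 4, 8
features = rng.normal(size=(n_pixels, dim))
slots = rng.normal(size=(n_slots, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

# Condition on all slots, then on an "edited" scene with the first slot removed.
conditioned, attn = cross_attend(features, slots, w_q, w_k, w_v)
edited, attn_edit = cross_attend(features, slots[1:], w_q, w_k, w_v)
```

Because the slots form a set, removing or swapping one changes the conditioning signal without touching the rest of the model, which is the intuition behind object-level editing at inference time.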
The main contributions shared by the team are the following:
- DORSal, an approach to novel view synthesis for 3D scenes, combines the strengths of diffusion models and object-centric scene representations to improve the quality of rendered views.
- DORSal outperforms previous methods in the 3D scene understanding literature and can generate views that are significantly more accurate, with a 5x-10x improvement in Fréchet Inception Distance (FID).
- Compared to previous work on 3D diffusion models, DORSal shows superior performance in handling more complex scenes. When evaluated on real-world Street View data, DORSal performs significantly better in terms of rendering quality.
- DORSal conditions the diffusion model on a structured, object-based scene representation. Using this representation, DORSal learns to compose scenes from individual objects, which enables basic object-level scene editing during inference: users can manipulate and modify specific objects within the scene.
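For readers unfamiliar with the FID metric mentioned in the list above, the following is a minimal sketch of the Fréchet distance between two sets of feature vectors, each modeled as a Gaussian. Lower values mean the two sets are statistically closer. In the standard metric the features come from a pretrained Inception network; here, as an assumption for illustration only, random arrays stand in for those features, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Implements ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_a C_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Stand-in "features": in practice these would be Inception activations
# for real and generated images.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
shifted_feats = real_feats + 2.0          # same covariance, shifted mean

same_score = frechet_distance(real_feats, real_feats)
shifted_score = frechet_distance(real_feats, shifted_feats)
```

In this toy setup the two sets share a covariance, so the score for the shifted set reduces to the squared distance between the means, while comparing a set with itself gives a score of essentially zero.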
In conclusion, the efficacy of DORSal is demonstrated in experiments on both complex synthetic multi-object scenes and large-scale real-world data sets such as Google Street View. Its ability to enable scalable neural rendering of 3D scenes with object-level editing makes it a promising approach for the future, and its improved rendering quality shows potential to advance 3D scene understanding.
Tanya Malhotra is a final-year student at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.