When choosing a venue, we are often faced with questions such as: Does this restaurant have the right atmosphere for a date? Is there good outdoor seating? Are there enough screens to watch the game? While photos and videos can partially answer questions like these, they don’t replace the feeling of being there, even when visiting in person isn’t an option.
Immersive experiences that are interactive, photorealistic, and multidimensional serve to bridge this gap and recreate the feel and ambience of a space, allowing users to naturally and intuitively find the information they need. To help with this, Google Maps released Immersive View, which uses advances in machine learning (ML) and computer vision to fuse billions of Street View and aerial images into a rich digital model of the world. Beyond that, it layers helpful information on top, like the weather, traffic, and how busy a place is. Immersive View offers indoor views of restaurants, cafes, and other venues to give users a virtual up-close look that can help them confidently decide where to go.
Today we describe the work we have done to offer these indoor views in Immersive View. We build on neural radiance fields (NeRF), a state-of-the-art approach to fusing photographs into a realistic, multidimensional reconstruction within a neural network. We describe our pipeline for NeRF creation, which includes custom photo capture of the space with DSLR cameras, image processing, and scene rendering. We take advantage of Alphabet's recent advances in the field to design a method that matches or exceeds the prior state of the art in visual fidelity. These models are then embedded as interactive 360° videos following predefined flight paths, making them accessible on smartphones.
The reconstruction of The Seafood Bar in Amsterdam in Immersive View.
From photos to NeRF
At the center of our work is NeRF, a recently developed method for 3D reconstruction and novel view synthesis. Given a collection of photos describing a scene, NeRF distills these photos into a neural field, which can then be used to render photos of the scene from viewpoints not present in the original collection.
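As a rough illustration of the idea, the sketch below shows the classic NeRF volume-rendering step in NumPy: sample points along a camera ray, query a field for densities and colors, and alpha-composite them into a pixel. Here `field_fn` is a hypothetical stand-in for the trained neural field, and the uniform sampling scheme is deliberately simplified.

```python
import numpy as np

def render_ray(field_fn, origin, direction, near=0.1, far=6.0, num_samples=64):
    """Volume-render one pixel of a radiance field along a single ray.

    `field_fn(points, direction)` is a hypothetical stand-in for the trained
    neural field: it returns per-sample densities (sigma) and RGB colors.
    """
    # Sample points along the ray between the near and far planes.
    t = np.linspace(near, far, num_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]
    sigma, rgb = field_fn(points, direction)  # shapes (N,) and (N, 3)

    # Convert densities into per-sample opacities.
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-sigma * delta)

    # Accumulated transmittance: how much light survives up to each sample.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))

    # Alpha-composite the sampled colors into the final pixel color.
    weights = alpha * transmittance
    return (weights[:, None] * rgb).sum(axis=0)
```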
While NeRF largely solves the reconstruction challenge, a user-facing product based on real-world data brings a wide variety of challenges to the table. For example, reconstruction quality and the user experience must remain consistent across venues, from dimly lit bars to outdoor cafes to hotel restaurants. At the same time, privacy must be respected and any potentially personally identifiable information must be removed. Importantly, scenes must be captured consistently and efficiently, reliably producing high-quality reconstructions while minimizing the effort required to capture the necessary photos. Finally, the same natural experience should be available to all mobile users, regardless of the device at hand.
Immersive View’s indoor reconstruction pipeline.
Capture and preprocessing
The first step in producing a high-quality NeRF is the careful capture of a scene: a dense collection of photos from which 3D geometry and color can be derived. To obtain the best possible reconstruction quality, every surface should be observed from multiple different directions. The more information the model has about an object’s surface, the better it will be at estimating the object’s shape and the way it interacts with light.
In addition, NeRF models place further assumptions on the camera and the scene itself. For example, most camera properties, such as white balance and aperture, are assumed to be fixed throughout the capture. Likewise, the scene itself is assumed to be frozen in time: lighting changes and movement should be avoided. This must be balanced against practical concerns, including the time required for capture, available lighting, equipment weight, and privacy. In partnership with professional photographers, we developed a strategy to quickly and reliably capture venue photos with DSLR cameras in just one hour. This approach has been used for all of our NeRF reconstructions to date.
Once the capture is uploaded to our system, processing begins. Because photos may inadvertently contain sensitive information, we automatically scan and blur personally identifiable content. We then apply a structure-from-motion pipeline to solve for each photo’s camera parameters: its position and orientation relative to other photos, along with lens properties such as focal length. These parameters associate each pixel with a point and a direction in 3D space and constitute a key signal in the NeRF reconstruction process.
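To make concrete how the solved camera parameters tie pixels to rays, the sketch below maps a pixel to a world-space ray using a pinhole intrinsics tuple and a 4×4 camera-to-world pose. The names and the OpenCV-style axis convention are assumptions for illustration, not details of the production pipeline.

```python
import numpy as np

def pixel_to_ray(u, v, intrinsics, cam_to_world):
    """Map pixel (u, v) of a photo to a ray origin and direction in world space.

    `intrinsics` is the solved (fx, fy, cx, cy); `cam_to_world` is the 4x4 pose
    recovered by structure-from-motion. The conventions here are an assumption
    made for this sketch.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel into a direction in the camera frame.
    d_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Rotate into world space; the camera center is the ray origin.
    rotation, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d_world = rotation @ d_cam
    return origin, d_world / np.linalg.norm(d_world)
```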
NeRF reconstruction
Unlike many ML models, a new NeRF model is trained from scratch on each captured location. To achieve the best possible reconstruction quality within a target computational budget, we incorporate features from a variety of published NeRF works developed at Alphabet. Some of these include:
- We build on mip-NeRF 360, one of the best-performing NeRF models to date. While it is more computationally intensive than Nvidia’s widely used Instant NGP, we find that mip-NeRF 360 consistently produces fewer artifacts and higher reconstruction quality.
- We incorporate the low-dimensional generative latent optimization (GLO) vectors introduced in NeRF in the Wild as an auxiliary input to the model’s radiance network. These are learned real-valued latent vectors that encode appearance information for each image. By assigning each image its own latent vector, the model can capture phenomena such as lighting changes without resorting to cloudy geometry, a common artifact in casual NeRF captures.
- We also incorporate exposure conditioning as introduced in Block-NeRF. Unlike GLO vectors, which are non-interpretable model parameters, exposure is derived directly from a photo’s image metadata and is fed as an additional input to the model’s radiance network. This offers two major benefits: it opens up the possibility of varying ISO and provides a method to control an image’s brightness at inference time. We find both properties invaluable for capturing and reconstructing dimly lit venues. (A minimal sketch of how these auxiliary inputs feed the radiance network follows this list.)
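To make the data flow of these auxiliary inputs concrete, here is a minimal, hypothetical sketch of a color head conditioned on the view direction, a per-image GLO latent, and an exposure value. The shapes, layer sizes, and random-weight MLP are placeholders rather than the model actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, widths):
    """Tiny stand-in MLP with random weights, only to show the data flow."""
    for i, w in enumerate(widths):
        x = x @ rng.normal(size=(x.shape[-1], w)) * 0.1
        if i < len(widths) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def radiance_branch(features, view_dir, glo_vector, exposure):
    """Color head conditioned on the view direction, a per-image GLO latent,
    and an exposure value read from the photo's metadata (shapes are made up)."""
    exposure_feat = np.log(np.atleast_1d(exposure))  # log-exposure is better behaved
    inputs = np.concatenate([features, view_dir, glo_vector, exposure_feat])
    return 1.0 / (1.0 + np.exp(-mlp(inputs, widths=(64, 3))))  # RGB in [0, 1]

# GLO latents are learned jointly with the field during training; at inference
# time a fixed latent and a chosen exposure control the rendered appearance.
color = radiance_branch(
    features=np.zeros(128),             # spatial features from the density network
    view_dir=np.array([0.0, 0.0, 1.0]),
    glo_vector=np.zeros(8),             # per-image appearance latent
    exposure=1.0 / 60.0,                # e.g. derived from shutter speed and ISO
)
```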
We train each NeRF model on GPU or TPU accelerators, which offer different trade-offs. As with all Google products, we continue to look for new ways to improve, from reducing compute requirements to improving reconstruction quality.
A side-by-side comparison of our method and a mip-NeRF 360 baseline.
A scalable user experience
Once a NeRF is trained, we have the ability to produce new photos of a scene from any vantage point and camera lens of our choosing. Our goal is to provide a meaningful and useful user experience: not just the reconstructions themselves, but also guided and interactive tours that give users the freedom to explore spaces naturally from the comfort of their smartphones.
To this end, we designed a controllable 360° video player that simulates flying through an indoor space along a predefined path, allowing the user to freely look around and travel forwards or backwards. Because this is the first Google product to explore this new technology, 360° video was chosen as the format to deliver the generated content for several reasons.
On the technical side, real-time inference and baked representations are still resource-intensive on a per-client basis (whether on-device or cloud-computed), and relying on them would limit the number of users able to access this experience. With video, we can scale storage and delivery to all users by leveraging the same video management and serving infrastructure used by YouTube. On the operations side, video gives us clearer editorial control over the browsing experience and is easier to inspect for quality at high volumes.
While we had considered capturing the space directly with a 360° camera, using a NeRF to reconstruct and render the space has several advantages. A virtual camera can fly anywhere in space, including over obstacles and through windows, and can use any desired camera lens. The camera path can also be edited post-hoc for smoothness and speed, unlike live recording. A NeRF capture also does not require the use of specialized camera hardware.
Our 360° videos are rendered by casting rays through each pixel of a virtual, spherical camera and compositing the visible elements of the scene. Each video follows a smooth path defined by a sequence of keyframe photos taken by the photographer during capture. The camera position for each keyframe is computed during structure-from-motion, and the sequence is smoothly interpolated into a flight path.
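For intuition about how a spherical frame is formed, the following sketch computes per-pixel ray directions for an equirectangular 360° frame; the axis convention and resolution handling are assumptions for illustration. Each direction, paired with the interpolated camera position, would then be rendered with a volume-rendering step like the one sketched earlier.

```python
import numpy as np

def spherical_ray_directions(height, width):
    """Per-pixel ray directions for a virtual 360° (equirectangular) camera frame.

    Each pixel maps to a longitude/latitude pair; the resulting rays, together
    with the interpolated camera position, are rendered and composited into one
    video frame. The axis convention here (y up, z forward) is an assumption.
    """
    # Pixel centers mapped to longitude in [-pi, pi) and latitude in [-pi/2, pi/2].
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit direction vectors on the sphere.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # shape (height, width, 3)
```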
To keep the speed consistent across locations, we calibrate the distances for each one by capturing pairs of images, each of which is 3 meters apart. Knowing these real-world measurements, we scale the generated model and render all videos at a natural speed.
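A simple way to turn that 3-meter baseline into a metric scale factor is sketched below; the function name and inputs are hypothetical, but the arithmetic (known distance divided by the reconstructed distance, averaged over pairs) mirrors the calibration described above.

```python
import numpy as np

def metric_scale(positions_a, positions_b, true_distance_m=3.0):
    """Estimate a metric scale factor for a reconstruction.

    `positions_a` and `positions_b` are camera centers, in arbitrary
    structure-from-motion units, for the image pairs captured 3 meters apart.
    """
    sfm_distances = np.linalg.norm(positions_a - positions_b, axis=-1)
    return true_distance_m / np.mean(sfm_distances)

# Multiplying all camera positions (and hence the flight path) by this factor
# puts the scene in meters, so rendered videos play back at a natural speed.
```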
The final experience is surfaced to the user within Immersive View: the user can seamlessly fly into restaurants and other indoor venues and discover the space by flying through the photorealistic 360° videos.
Open research questions
We believe this feature is the first step of many in a journey toward immersive, AI-powered, and universally accessible experiences. From a NeRF research perspective, many open questions remain. Some of these include:
- Improving reconstructions with scene segmentation, adding semantic information that could, for example, make scenes searchable and easier to navigate.
- Adapting NeRF to outdoor photo collections, in addition to indoor ones. Doing so would unlock similar experiences in every corner of the world and change how users can experience the outdoor world.
- Enabling real-time, interactive 3D exploration through on-device neural rendering.
Reconstruction of an outdoor scene with a NeRF model trained on Street View panoramas.
As we continue to grow, we look forward to engaging with and contributing to the community to build the next generation of immersive experiences.
Acknowledgements
This work is a collaboration between several teams at Google. Contributors to the project include Jon Barron, Julius Beres, Daniel Duckworth, Roman Dudko, Magdalena Filak, Mike Harm, Peter Hedman, Claudio Martella, Ben Mildenhall, Cardin Moffett, Etienne Pot, Konstantinos Rematas, Yves Sallat, Marcos Seefelder, Lilyana Sirakova, Sven Tresp, and Peter Zhizhin.
In addition, we would like to extend our thanks to Luke Barrington, Daniel Filip, Tom Funkhouser, Charles Goran, Pramod Gupta, Mario Lučić, Isalo Montacute, and Dan Thomasset for their valuable comments and suggestions.