In our daily lives, we often use natural language to describe our 3D environment, drawing on various properties of the objects around us: their semantics, the entities they are associated with, and their overall appearance. In the digital realm, neural radiance fields, commonly known as NeRFs, are a type of neural network that has become a powerful tool for capturing photorealistic digital representations of real-world 3D scenes. These state-of-the-art networks can render detailed novel views of even the most complicated scenes from just a small collection of 2D photos.
However, NeRFs have one major shortcoming: their immediate output is difficult to interpret, since it consists only of a field of color and density with no context or meaning attached. This makes it extremely tedious for researchers to build interfaces that interact with the resulting 3D scenes. Consider, for example, a person orienting themselves in a 3D environment such as their study by asking where the “papers” or “pens” are through ordinary everyday conversation. This is where integrating natural language queries with neural representations like NeRF becomes extremely useful, as the combination makes it much easier to navigate 3D scenes. To this end, a team of graduate researchers at the University of California, Berkeley, proposed a novel approach called Language Embedded Radiance Fields (LERF), which grounds language embeddings from off-the-shelf vision-language models such as CLIP (Contrastive Language-Image Pretraining) into NeRF. The method supports natural-language queries over a wide range of concepts, including abstract ideas like electricity and visual attributes like size and color. For each text prompt, an RGB image and a relevance map highlighting the region with the highest relevance activation are generated in real time.
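The text prompt itself is handled by an unmodified, off-the-shelf CLIP text encoder. The snippet below is a minimal sketch of that step using the openai/CLIP Python package; the “ViT-B/32” checkpoint and the prompt strings are illustrative choices, not necessarily the ones used in the paper.

```python
# Minimal sketch: turning natural-language prompts into CLIP embeddings
# using the off-the-shelf openai/CLIP package (model choice is illustrative).
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize(["papers", "pens"]).to(device)          # example queries
    text_embs = model.encode_text(tokens)                          # (2, 512)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)   # unit-norm for cosine comparison
```

These unit-normalized text embeddings can then be compared against the language embeddings stored in the scene, as described next.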
The Berkeley researchers built LERF by combining a NeRF model with a language field. This field takes both a 3D position and a physical scale as input and outputs a single CLIP vector. During training, the language field is supervised with a multi-scale image pyramid of CLIP embeddings computed from crops of the training views. This allows the CLIP encoder to capture the different scales of context present in an image, ensuring consistency across multiple views and associating the same 3D position with language embeddings at different scales. At test time, the language field can be queried at arbitrary scales to obtain real-time 3D relevance maps showing how the different objects in the scene relate to the language query. To regularize the optimized CLIP embeddings, the researchers also used DINO features. Because 3D CLIP embeddings can be sensitive to floaters and poorly observed regions, this considerably improved the quality of object boundaries.
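To make the idea concrete, here is a small, hedged sketch of what such a language field and its supervision could look like. The layer widths, the absence of a positional or hash-grid encoding, and the `language_loss` helper are simplifying assumptions for illustration, not the authors’ exact architecture or training objective.

```python
# Hypothetical sketch of a LERF-style language field: an MLP mapping a 3D
# position plus a physical scale to a unit-norm, CLIP-dimensional embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageField(nn.Module):
    def __init__(self, clip_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),   # input: (x, y, z) position + physical scale
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, xyz: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) sample positions, scale: (N, 1) physical crop scale
        emb = self.mlp(torch.cat([xyz, scale], dim=-1))
        return F.normalize(emb, dim=-1)            # CLIP embeddings live on the unit sphere

def language_loss(rendered_emb: torch.Tensor, crop_clip_emb: torch.Tensor) -> torch.Tensor:
    # Supervision sketch: pull embeddings rendered along a pixel's ray toward the
    # CLIP embedding of the multi-scale image crop around that training pixel.
    # Both inputs are assumed unit-norm, shape (N, clip_dim).
    return 1.0 - (rendered_emb * crop_clip_emb).sum(dim=-1).mean()
```

In this sketch, the per-pixel crop embeddings come from the multi-scale image pyramid described above, and the rendered embeddings are obtained by volume-rendering the field along each training ray (omitted here for brevity).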
Relevance maps for text queries are computed from 3D CLIP embeddings rather than 2D CLIP embeddings. This has the advantage that 3D CLIP embeddings are substantially more robust to occlusion and viewpoint changes than their 2D counterparts. In addition, 3D CLIP embeddings are more localized and conform better to the 3D structure of the scene, giving the relevance maps a much crisper appearance. To test their approach, the team ran several experiments on a collection of in-the-wild, hand-captured scenes and found that LERF can localize fine-grained queries relating to very specific parts of the scene geometry, as well as abstract queries relating to multiple objects. This innovative method generates 3D view-consistent relevance maps for a wide variety of queries and scenes. The researchers concluded that LERF’s zero-shot capabilities hold enormous potential in several areas, including robotics, analyzing vision-and-language models, and interacting with 3D environments.
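As a rough illustration, the sketch below computes a relevance map from rendered per-pixel 3D CLIP embeddings by taking the cosine similarity with a text query and keeping the best-matching scale per pixel. This is a simplified stand-in, not the paper’s exact relevancy formulation.

```python
# Simplified relevance-map sketch (assumed, not the authors' exact scoring):
# compare rendered per-pixel embeddings against a query embedding.
import torch

def relevance_map(pixel_embs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """
    pixel_embs: (H, W, S, D) unit-norm embeddings rendered at S candidate scales
    text_emb:   (D,) unit-norm CLIP embedding of the natural-language query
    returns:    (H, W) relevance, taking the best-matching scale per pixel
    """
    sims = torch.einsum("hwsd,d->hws", pixel_embs, text_emb)  # cosine similarity per scale
    return sims.max(dim=-1).values                             # query at the most relevant scale
```

The resulting map can be overlaid on the rendered RGB image to highlight the region most activated by the query.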
Although these use cases show that LERF has a lot of potential, it still has several drawbacks. As a hybrid of CLIP and NeRF, it inherits the limitations of both technologies. Like CLIP, LERF struggles to capture spatial relationships between objects and is prone to false positives on queries that are visually or semantically similar, for example, “a wooden spoon” versus another similar utensil. In addition, LERF requires NeRF-quality multi-view imagery with known, calibrated camera poses, which are not always available. In a nutshell, LERF is a novel technique for densely embedding raw CLIP embeddings into a NeRF without the need for fine-tuning. The Berkeley researchers also demonstrated that LERF significantly outperforms current state-of-the-art approaches in handling a wide variety of natural-language queries across diverse real-world scenes.
Check out the Paper and project page. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 16k+ ML SubReddit, Discord channel, and Email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in various challenges.