Developing robots that can handle everyday tasks for us is an enduring dream of mankind. We want them to help around the house, boost production in factories, improve agricultural yields, and more. Robots are the assistants we have always wanted.
Building intelligent robots that can navigate and interact with objects in the real world requires accurate 3D mapping of the environment. Unless they can correctly understand the world around them, we cannot call them true assistants.
There have been many approaches to teaching robots about their environment. However, most of them are limited to closed-set settings, meaning they can only reason about a finite set of concepts that are predefined during training.
On the other hand, recent developments in AI can “understand” concepts in relatively open-ended data. For example, CLIP can caption and describe images it never saw during training and still produce reliable results. Or take DINO, which can locate and draw boundaries around objects it has not encountered before. We need to find a way to bring this capability to robots so that we can say they truly understand their environment.
What does it take to understand and model the environment? If we want our robot to be broadly applicable across a variety of tasks, it should be able to use its environment model without retraining for each new task. That model must have two main properties: it must be open-set and multimodal.
Open-set modeling means the robot can capture a wide variety of concepts in great detail. For example, if we ask it to bring us a can of soda, it must understand the can as “something to drink” and also be able to associate it with a brand, a flavor, and so on. Then there is multimodality: the robot should be able to use more than one “sense”, understanding text, images, audio, and more, all together.
Let’s meet ConceptFusion, a solution designed to deal with the aforementioned limitations.
ConceptFusion is an open-set and inherently multimodal scene representation. It allows reasoning beyond a closed set of concepts and enables a wide range of possible queries against the 3D environment. Once up and running, the robot can reason about its environment using language, images, audio, or even 3D geometry.
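To make this concrete, here is a minimal sketch (not the authors’ code) of how a language query could work once every point in the 3D map carries a fused feature vector. It assumes a CLIP-style model loaded through the open_clip library; the `point_features` array and the `text_query` function are illustrative placeholders.

```python
import numpy as np
import torch
import open_clip

# Illustrative sketch: query a fused 3D map with free-form text.
# `point_features` is assumed to hold one CLIP-aligned feature per 3D map point.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def text_query(point_features: np.ndarray, prompt: str) -> np.ndarray:
    """Return a per-point similarity score for a natural-language query."""
    with torch.no_grad():
        tokens = tokenizer([prompt])
        q = model.encode_text(tokens).squeeze(0).numpy()
    q = q / np.linalg.norm(q)
    feats = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    return feats @ q  # cosine similarity per 3D point

# e.g., highlight the map points that likely belong to "a can of soda":
# scores = text_query(point_features, "a can of soda")
```

Because image, audio, and text encoders can share the same embedding space, the same similarity lookup would serve queries in other modalities as well.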
ConceptFusion utilizes advances in large-scale models in the language, image, and audio domains. It builds on a simple observation: pixel-aligned open-set features can be fused into 3D maps via traditional simultaneous localization and mapping (SLAM) and multi-view fusion approaches. This enables effective zero-shot reasoning without any additional fine-tuning or training.
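The fusion step itself can be pictured as a running average: each time a map point is observed in a new frame, the pixel-aligned feature at its projected location is folded into that point’s stored feature. Below is a rough numpy sketch under the assumption that the SLAM system already supplies point-to-pixel correspondences per frame; the data layout is a simplification, not the paper’s implementation.

```python
import numpy as np

def fuse_frame(map_feats, counts, correspondences, pixel_feats):
    """Fold one frame's pixel-aligned features into the 3D map.

    map_feats:       (P, D)    running fused feature per map point
    counts:          (P,)      number of observations per point so far
    correspondences: list of (point_idx, (u, v)) pairs from the SLAM frontend
    pixel_feats:     (H, W, D) pixel-aligned features for this frame
    """
    for point_idx, (u, v) in correspondences:
        f = pixel_feats[v, u]
        counts[point_idx] += 1
        # incremental mean over all views that have observed this point
        map_feats[point_idx] += (f - map_feats[point_idx]) / counts[point_idx]
    return map_feats, counts
```

Averaging across views is what makes the map robust: a feature seen from many angles drifts toward a stable, view-independent description of the underlying object.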
The input images are processed to generate generic object masks that do not belong to any particular class. Next, a local feature is extracted for each masked region, and a global feature is computed for the entire input image. A zero-shot pixel-alignment technique then combines the region-specific features with the global feature, producing pixel-aligned features.
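As a rough illustration of that combination step, one simple scheme (a sketch, not necessarily the paper’s exact weighting) blends each region’s local feature with the image-level global feature, weighted by how similar the two are, and writes the result to every pixel in that region:

```python
import numpy as np

def pixel_aligned_features(global_feat, region_feats, region_masks, hw):
    """Blend per-region local features with the image-level global feature.

    global_feat:  (D,)      feature of the whole image
    region_feats: (R, D)    one feature per class-agnostic mask
    region_masks: (R, H, W) boolean masks from the mask generator
    hw:           (H, W)    output resolution
    """
    H, W = hw
    g = global_feat / np.linalg.norm(global_feat)
    out = np.zeros((H, W, g.shape[0]), dtype=np.float32)
    for mask, local in zip(region_masks, region_feats):
        l = local / np.linalg.norm(local)
        # regions that resemble the global context lean more on it;
        # clip so the mixing weight stays in [0, 1]
        w = max(0.0, float(g @ l))
        blended = w * g + (1.0 - w) * l
        out[mask] = blended / np.linalg.norm(blended)
    return out  # (H, W, D) pixel-aligned feature map
```

The intuition is that the global feature carries scene context (“a kitchen counter”) while the local feature carries object detail (“a red soda can”), and every pixel ends up with a mixture of both.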
ConceptFusion is evaluated on a mix of real-world and simulated scenarios. It retains long-tail concepts better than supervised approaches and outperforms existing SoTA methods by more than 40%.
Overall, ConceptFusion is an innovative answer to the limitations of existing 3D mapping approaches. By introducing an open-set and multimodal scene representation, it enables more flexible and effective reasoning about the environment without any additional training or fine-tuning.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.