People are excellent navigators of the physical world, due in part to their remarkable ability to build cognitive maps that form the basis of spatial memory: from localizing landmarks at varying ontological levels (such as a book on a shelf in the living room) to determining whether a layout permits navigation from point A to point B. Building robots that are proficient at navigation requires an interconnected understanding of (a) vision and natural language (to associate landmarks or follow instructions), and (b) spatial reasoning (to connect a map representing an environment to the true spatial distribution of objects). While there has been much recent progress in training joint visual-language models on Internet-scale data, figuring out how best to connect them to a spatial representation of the physical world that robots can use remains an open research question.
To explore this, we collaborated with researchers at the University of Freiburg and Nuremberg to develop Visual Language Maps (VLMaps), a map representation that directly fuses pre-trained visual-language embeddings into a 3D reconstruction of the environment. VLMaps, which is set to appear at ICRA 2023, is a simple approach that allows robots to (1) index visual landmarks in the map using natural language descriptions, (2) employ Code as Policies to navigate to spatial goals, such as “go in between the sofa and the TV” or “move three meters to the right of the chair”, and (3) generate open-vocabulary obstacle maps, allowing multiple robots with different morphologies (mobile manipulators vs. drones, for example) to use the same VLMap for path planning. VLMaps can be used out-of-the-box without additional labeled data or model fine-tuning, and outperforms other zero-shot methods by over 17% on challenging object-goal and spatial-goal navigation tasks in Habitat and Matterport3D. We are also releasing the code used for our experiments along with an interactive simulated robot demo.
VLMaps can be built by fusing pre-trained visual-language embeddings into a 3D reconstruction of the environment. At runtime, a robot can query the VLMap to locate visual landmarks given natural language descriptions, or to build open-vocabulary obstacle maps for path planning.
Classic 3D maps with a modern multimodal twist
VLMaps combines the geometric structure of classic 3D reconstructions with the expressivity of modern visual-language models pre-trained on Internet-scale data. As the robot moves around, VLMaps uses a pre-trained visual-language model to compute dense per-pixel embeddings from posed RGB camera views and integrates them into a large map-sized 3D tensor aligned with an existing 3D reconstruction of the physical world. This representation allows the system to localize landmarks given their natural language descriptions (such as “a book on a shelf in the living room”) by comparing their text embeddings to all locations in the tensor and finding the closest match. Querying these target locations can be used directly as goal coordinates for language-conditioned navigation, as primitive API function calls for Code as Policies to process spatial goals (for example, code-writing models interpret “in between” as arithmetic between two locations), or to sequence multiple navigation goals for long-horizon instructions.
# move first to the left side of the counter, then move between the sink and the oven, then move back and forth to the sofa and the table twice.
robot.move_to_left('counter')
robot.move_in_between('sink', 'oven')
pos1 = robot.get_pos('sofa')
pos2 = robot.get_pos('table')
for i in range(2):
    robot.move_to(pos1)
    robot.move_to(pos2)

# move 2 meters north of the laptop, then move 3 meters rightward.
robot.move_north('laptop')
robot.face('laptop')
robot.turn(180)
robot.move_forward(2)
robot.turn(90)
robot.move_forward(3)
VLMaps can be used to return the map coordinates of landmarks given natural language descriptions, which can be wrapped as a primitive API function call for Code as Policies to sequence multiple navigation goals for long-horizon navigation instructions.
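For intuition, below is a minimal sketch of this kind of landmark lookup. It assumes the VLMap has already been flattened into a set of per-location embeddings with known map coordinates and that a CLIP-style text encoder provides the query embedding; the array names and the localize_landmark helper are illustrative, not part of the released code.

import numpy as np

def localize_landmark(map_embeddings, map_coords, text_embedding):
    """Return the map coordinate whose fused embedding best matches a text query.

    map_embeddings: (N, D) per-location visual-language embeddings from the VLMap.
    map_coords:     (N, 3) corresponding 3D map coordinates.
    text_embedding: (D,)   embedding of the query text (e.g., "a book on a shelf").
    """
    # Cosine similarity between the text query and every location in the map tensor.
    map_norm = map_embeddings / np.linalg.norm(map_embeddings, axis=1, keepdims=True)
    scores = map_norm @ (text_embedding / np.linalg.norm(text_embedding))
    # The closest match can be used directly as a goal coordinate for navigation.
    return map_coords[np.argmax(scores)]

The returned coordinate is what a primitive such as robot.get_pos('sofa') could expose to a code-writing model.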
Results
We evaluate VLMaps on challenging zero-shot spatial-goal and object-goal navigation tasks in Habitat and Matterport3D, without additional training or fine-tuning. The robot is asked to navigate to four subgoals sequentially specified in natural language. We observe that VLMaps significantly outperforms strong baselines (including CoW and LM-Nav) by up to 17% due to its improved visuo-lingual grounding.
| Method | 1 subgoal in a row | 2 subgoals in a row | 3 subgoals in a row | 4 subgoals in a row | Independent subgoals |
|---|---|---|---|---|---|
| LM-Nav | 26 | 4 | 1 | 1 | 26 |
| CoW | 42 | 15 | 7 | 3 | 36 |
| CLIP Map | 33 | 8 | 2 | 0 | 30 |
| VLMaps (ours) | 59 | 34 | 22 | 15 | 59 |
| GT Map | 91 | 78 | 71 | 67 | 85 |
The VLMaps approach performs favorably over alternative open-vocabulary baselines on multi-object navigation (success rate [%]) and specifically excels at long-horizon tasks with multiple subgoals.
A key advantage of VLMaps is its ability to understand spatial goals, such as “go in between the sofa and the TV” or “move three meters to the right of the chair”. Experiments on long-horizon spatial-goal navigation show an improvement of up to 29%. To gain insight into the regions of the map that are activated for different language queries, we visualize the heatmaps for the object type “chair”.
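As a rough illustration of how such spatial goals reduce to arithmetic over landmark coordinates, the sketch below assumes the robot.get_pos and robot.move_to primitives shown earlier return and accept 2D map coordinates; the helper names and the axis convention for “right of” are hypothetical, not the paper's implementation.

import numpy as np

def goal_in_between(robot, name_a, name_b):
    # "go in between A and B": the midpoint of the two landmark coordinates.
    pos_a = np.array(robot.get_pos(name_a))
    pos_b = np.array(robot.get_pos(name_b))
    return 0.5 * (pos_a + pos_b)

def goal_right_of(robot, name, meters):
    # "move N meters to the right of X": an offset from the landmark position
    # along the map's x-axis (the choice of axis here is illustrative).
    return np.array(robot.get_pos(name)) + np.array([meters, 0.0])

# e.g., robot.move_to(goal_in_between(robot, 'sofa', 'TV'))
#       robot.move_to(goal_right_of(robot, 'chair', 3.0))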
Open-vocabulary obstacle maps
A single VLMap of the same environment can also be used to build open-vocabulary obstacle maps for path planning. This is done by taking the union of binary-thresholded detection maps over a list of landmark categories that the robot can or cannot traverse (such as “tables”, “chairs”, “walls”, etc.). This is useful since robots of different morphologies may move around the same environment differently. For example, “tables” are obstacles for a large mobile robot, but may be traversable for a drone. We observe that using VLMaps to create multiple robot-specific obstacle maps improves navigation efficiency by up to 4% (measured in terms of task success rates weighted by path length) over using a single shared obstacle map for each robot. See the paper for more details.
Experiments with a mobile robot (LoCoBot) and a drone in AI2THOR simulated environments. Left: top-down view of an environment. Middle columns: agents' observations during navigation. Right: obstacle maps generated for different embodiments, with the corresponding navigation paths.
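To make the obstacle-map construction described above concrete, here is a minimal sketch under the assumption that querying the VLMap with each landmark category yields a 2D grid of per-cell similarity scores; the dictionary layout, function name, and threshold value are illustrative rather than the released implementation.

import numpy as np

def build_obstacle_map(category_score_maps, obstacle_categories, threshold=0.5):
    """Union of binary-thresholded detection maps over non-traversable categories.

    category_score_maps: dict mapping a category name (e.g., 'table', 'chair',
                         'wall') to an (H, W) array of per-cell VLMap scores.
    obstacle_categories: the categories this particular robot cannot traverse.
    """
    height, width = next(iter(category_score_maps.values())).shape
    obstacles = np.zeros((height, width), dtype=bool)
    for category in obstacle_categories:
        # Threshold the detection map for this category and add it to the union.
        obstacles |= category_score_maps[category] > threshold
    return obstacles

# The same VLMap yields different obstacle maps for different embodiments:
# locobot_map = build_obstacle_map(scores, ['table', 'chair', 'sofa', 'wall'])
# drone_map   = build_obstacle_map(scores, ['wall'])  # tables are traversable for a drone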
Conclusion
VLMaps takes an initial step toward grounding pre-trained visual-language information onto spatial map representations that robots can use for navigation. Experiments in simulated and real environments show that VLMaps can enable language-using robots to (i) index landmarks (or spatial locations relative to them) given their natural language descriptions, and (ii) generate open-vocabulary obstacle maps for path planning. Extending VLMaps to handle more dynamic environments (e.g., with moving people) is an interesting avenue for future work.
Open-source release
We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional videos and the code to benchmark agents in simulation.
Acknowledgments
We would like to thank the co-authors of this research: Chenguang Huang and Wolfram Burgard.