Google is training its robots with Gemini AI to improve their navigation and task performance. In a new research paper, DeepMind's robotics team explains how Gemini 1.5 Pro's long context window (which determines how much information an AI model can process) allows users to more easily interact with its RT-2 robots using natural language instructions.
This works by filming a video tour of a designated area, such as a home or office; researchers then use Gemini 1.5 Pro to have the robot “watch” the video and learn about the environment. The robot can then carry out commands based on what it has observed, using verbal and/or image outputs, such as guiding users to an electrical outlet when shown a phone and asked “where can I charge it?” DeepMind says its Gemini-powered robot achieved a 90 percent success rate on more than 50 user instructions given in an operating area of more than 9,000 square feet.
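For a rough sense of what that workflow looks like from the model side, here is a minimal sketch using Google's publicly available `google-generativeai` Python SDK. The file name, prompt, and surrounding logic are hypothetical illustrations of the “prompt the tour video with a question” step only, not DeepMind's actual robot control stack:

```python
# Illustrative sketch: feed a recorded tour video plus a user question into
# Gemini 1.5 Pro's long context window. File name and prompt are hypothetical.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available

# Upload the video tour of the home or office.
tour = genai.upload_file(path="office_tour.mp4")

# Wait until the uploaded video has finished server-side processing.
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask a question grounded in the tour; the long context window lets the model
# reason over the whole video when answering.
response = model.generate_content(
    [tour, "I'm holding a phone with a dead battery. Where in this space can I charge it?"]
)
print(response.text)  # e.g. a description of where an outlet appeared in the tour
```

In the research setup, an answer like this would still have to be translated into navigation actions by the robot itself; the sketch only covers the video-grounded question answering that the long context window enables.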
The researchers also found “preliminary evidence” that Gemini 1.5 Pro enabled the droids to plan how to fulfill instructions beyond navigation alone. For example, when a user with lots of Coca-Cola cans on their desk asks the droid whether their favorite drink is available, the team says Gemini “knows that the robot should navigate to the fridge, inspect for any Cokes, and then return to the user to report the result.” DeepMind says it plans to investigate these results further.
The video demonstrations Google provided are impressive, though obvious cuts after the droid acknowledges each request obscure the fact that it takes between 10 and 30 seconds to process these instructions, according to the research paper. It may be a while before we share our homes with more advanced environment-mapping robots, but at least these might be able to find our lost keys or wallets.