With the growing popularity of and advances in artificial intelligence, AI has made its way into the field of robotics. Robotics is a branch of engineering in which machines are developed and programmed to perform tasks without human intervention. Various AI technologies are used in robotics, such as Natural Language Processing (NLP) for giving voice commands to a robot and edge computing for better data management and safety. Developing generalizable perception and good communication systems for robots has long been a subject of research. With recent developments in robotics, various approaches have been introduced for robots to learn visual representations.
Recently, researchers at Stanford University have devised a new framework called Voltron that learns visual representations driven by language. For a long time, many researchers have been trying to find ways to make a robot learn by watching videos of humans. Some of the methods already in use are masked autoencoding and contrastive learning. In addition to being able to control their actions, robots must also be able to understand a scene the way humans do and communicate effectively. Combining visual and language information is necessary for a robot to understand human intent in a video. Voltron captures the fine-grained details of a video. It targets both low-level visual reasoning and high-level semantic understanding of the actions that take place in a video.
Voltron takes videos together with their associated language captions as input. It uses a masked autoencoding pipeline and reconstructs frames from a masked context. Voltron also uses language supervision to generate the associated captions. The reconstruction encourages low-level pattern recognition at the spatial level, while the language supervision gives rise to high-level features that capture intent. In this way, language supervision shapes the visual representations learned for robotics. Videos of humans performing everyday tasks, drawn from various sources, can serve as datasets. These videos come with many natural language annotations that are useful for robotic manipulation and representation learning. Voltron builds on this, improving representation learning by using these large human video datasets.
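To make the masked autoencoding idea concrete, here is a minimal sketch of the masking and reconstruction objective in plain Python. It is not Voltron's implementation: the helper names, the 4x4 toy frame, the 0.75 mask ratio, and the mean-value "decoder" are all assumptions chosen for illustration, and the language branch is left out entirely.

```python
import random


def patchify(frame, patch):
    """Split a 2D frame (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened to a list of pixels.
    """
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Return the sorted indices of patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_reconstruction_loss(targets, predictions, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Toy example: a 4x4 "frame" split into four 2x2 patches.
rng = random.Random(0)
frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(frame, 2)
masked = random_mask(len(patches), 0.75, rng)  # assumed mask ratio

# Hypothetical stand-in for a learned decoder: predict the mean of the
# visible pixels everywhere. A real model would be a trained network.
visible_pixels = [v for i, p in enumerate(patches) if i not in masked for v in p]
mean_visible = sum(visible_pixels) / len(visible_pixels)
predicted = [[mean_visible] * len(p) for p in patches]

loss = masked_reconstruction_loss(patches, predicted, masked)
print(f"masked patches: {masked}, reconstruction loss: {loss:.3f}")
```

Because the loss is computed only on the hidden patches, the model has to infer missing pixels from the visible context, which is what pushes this kind of objective toward low-level spatial features.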
Comparing Voltron with existing approaches, the team reports that Voltron is much more consistent than masked autoencoding and contrastive learning. Those two methods do not perform consistently across problems such as grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration. According to the researchers, the results are inconsistent because masked autoencoding favors low-level spatial features at the expense of high-level semantics. Contrastive learning, on the other hand, captures high-level semantics at the price of low-level features. The team also introduced the Voltron Evaluation Suite, which consists of evaluation problems spanning five applications: grasp affordance prediction, referring expression grounding, single-task visuomotor control, language-conditioned imitation learning on a real robot, and intent scoring. According to the team, Voltron outperforms previous approaches in all of these applications.
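For contrast, a contrastive objective can be sketched as follows. This is a generic InfoNCE-style loss between image and caption embeddings, written in plain Python as an illustration only. The function names, the temperature value, and the toy 2D embeddings are assumptions, not details taken from the Voltron work or from any particular baseline.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE loss.

    The i-th image is treated as matching the i-th caption; every other
    caption in the batch acts as a negative.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    losses = []
    for i, image in enumerate(images):
        logits = [dot(image, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: two images with correctly paired and then swapped captions.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(image_embs, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

The loss only sees one embedding per image, so nothing in it requires the encoder to retain pixel-level detail. That is consistent with the trade-off the researchers describe, where contrastive learning captures semantics at the cost of low-level features.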
Voltron is a notable addition to both robotics and AI. Its representations are not tied to a single task and can be used across five downstream applications. It is a significant step forward in learning visual representations for robotics and looks promising for future advances.
Check out the Paper, Models, and Evaluation. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.