There have been tremendous advances in machine learning in recent years, from models that can explain jokes or answer visual questions in a variety of languages to models that can generate images based on text descriptions. Such innovations have been possible due to the increased availability of large-scale datasets, along with novel advances that enable training models on these data. While scaling robotics models has seen some success, it is outpaced by other domains due to the lack of datasets available on a scale comparable to large text corpora or image datasets.
Today we present PaLM-E, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with PaLM, a powerful large language model, and "embodied" it (the "E" in PaLM-E) by supplementing it with sensor data from the robotic agent. This is the key difference from prior efforts to bring large language models to robotics: rather than relying on text input alone, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent capabilities on language-only tasks.
An embodied language model, and also a visual-language generalist
On the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, such as quoting poetry, solving math equations, or generating code.
PaLM-E combines our most recent large language model, PaLM, together with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language OK-VQA benchmark, without task-specific fine-tuning, while essentially retaining the same general language performance as PaLM-540B.
How does PaLM-E work?
Technically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.
Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back into the input, the language model can iteratively generate longer and longer text.
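As a rough illustration of this token-embedding and autoregressive-generation loop, here is a toy sketch in Python. The vocabulary, the mean-pooling "model," and all names are made up for illustration; they are not part of PaLM or PaLM-E.

```python
# Toy sketch of the token-embedding / next-token loop described above.
# All names (toy_tokenize, VOCAB, etc.) are hypothetical, not PaLM-E code.
import numpy as np

VOCAB = ["<pad>", "robot", "picks", "up", "the", "green", "block", "."]
EMBED_DIM = 16

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(len(VOCAB), EMBED_DIM))   # one vector per token
output_projection = rng.normal(size=(EMBED_DIM, len(VOCAB)))  # hidden state -> vocab logits

def toy_tokenize(text):
    """Split text into known tokens (a stand-in for a real subword tokenizer)."""
    return [VOCAB.index(w) for w in text.split() if w in VOCAB]

def next_token(token_ids):
    """Embed the tokens, pool them, and predict the most likely next token."""
    vectors = token_embeddings[token_ids]   # sequence of token embeddings
    hidden = vectors.mean(axis=0)           # a real LM uses a transformer here
    logits = hidden @ output_projection
    return int(np.argmax(logits))

# Autoregressive generation: feed each predicted token back into the input.
ids = toy_tokenize("robot picks up")
for _ in range(3):
    ids.append(next_token(ids))
print(" ".join(VOCAB[i] for i in ids))
```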
The inputs to PaLM-E are text and other modalities (images, robot states, scene embeddings, etc.) in an arbitrary order, which we call "multimodal sentences". For example, an input might look like "What happened between <img_1> and <img_2>?", where <img_1> and <img_2> are two images. The output is text generated autoregressively by PaLM-E, which could be an answer to a question or a sequence of decisions in text form.
Architecture of the PaLM-E model, showing how PaLM-E ingests different modalities (states and/or images) and approaches tasks through multimodal language modelling. |
The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles "words" (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.
We initialize PaLM-E for training with pretrained models for both the language (PaLM) and the vision components (Vision Transformer, also known as ViT). All model parameters can be updated during training.
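To make this mechanism concrete, the sketch below interleaves projected image features with text token embeddings into a single "multimodal sentence". The linear projection, toy feature extractors, and all dimensions are hypothetical stand-ins, not the actual PaLM-E encoders.

```python
# Toy sketch of projecting continuous inputs into the word-embedding space and
# interleaving them with text tokens. Shapes and the single-layer encoder are
# illustrative; the real model uses a ViT and a pre-trained language model.
import numpy as np

EMBED_DIM = 16          # dimensionality of the LM's token embeddings
IMAGE_FEATURE_DIM = 32  # dimensionality of the vision encoder's output

rng = np.random.default_rng(0)
projection = rng.normal(size=(IMAGE_FEATURE_DIM, EMBED_DIM))  # hypothetical trained encoder

def embed_text(tokens):
    """Stand-in for the language model's token-embedding lookup."""
    return rng.normal(size=(len(tokens), EMBED_DIM))

def embed_image(image_features):
    """Project continuous image features into the word-embedding space."""
    return image_features @ projection  # shape: (num_patches, EMBED_DIM)

# Build a "multimodal sentence": text and image embeddings in arbitrary order.
prefix = embed_text(["What", "happened", "between"])
img_1 = embed_image(rng.normal(size=(4, IMAGE_FEATURE_DIM)))
middle = embed_text(["and"])
img_2 = embed_image(rng.normal(size=(4, IMAGE_FEATURE_DIM)))
suffix = embed_text(["?"])

# All pieces share EMBED_DIM, so they can be concatenated and fed to the LM.
multimodal_sentence = np.concatenate([prefix, img_1, middle, img_2, suffix], axis=0)
print(multimodal_sentence.shape)  # (num_text_tokens + num_image_tokens, EMBED_DIM)
```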
Knowledge transfer from large-scale training to robots
PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.
Positive knowledge transfer from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains. |
The results show that PaLM-E can address a large set of robotics, vision, and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Moreover, the visual-language data actually significantly improves the performance of the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.
Results
We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA) and image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy to translate text into low-level robot actions.
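As a rough sketch of what pairing the high-level model with a low-level policy can look like, the snippet below runs a simple closed loop: observe, ask the planner for the next textual step, execute it with the policy, and repeat. The planner and policy here are canned stand-ins, not real PaLM-E or robot APIs.

```python
# Minimal sketch of a high-level planner driving a low-level language-to-action
# policy. The planner and policy are hypothetical stand-ins (canned responses
# and print statements), not actual PaLM-E or robot interfaces.

CANNED_PLAN = ["go to the drawer", "open the drawer", "pick up the bag of chips",
               "bring it to the user", "done"]

def plan_step(instruction: str, image, history: list[str]) -> str:
    """Stand-in for the planner: given the goal, the current camera image, and
    the steps taken so far, return the next step as text."""
    return CANNED_PLAN[min(len(history), len(CANNED_PLAN) - 1)]

def low_level_policy(step_text: str, image) -> None:
    """Stand-in for a language-to-action policy that would turn a short textual
    step into low-level robot actions (arm motions, gripper commands, ...)."""
    print(f"executing: {step_text}")

def run_task(instruction: str, get_camera_image, max_steps: int = 10) -> list[str]:
    """Closed loop: replan after every step so the planner can react to changes
    in the environment (e.g., the bag of chips being put back in the drawer)."""
    history: list[str] = []
    for _ in range(max_steps):
        image = get_camera_image()                  # fresh observation each step
        step = plan_step(instruction, image, history)
        if step == "done":
            break
        low_level_policy(step, image)               # execute the textual step
        history.append(step)
    return history

print(run_task("bring me the bag of chips", get_camera_image=lambda: None))
```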
In the first example below, a person asks a mobile robot to bring them a bag of chips. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it, and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to pick up a green block. Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond that robot's training data.
PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a bag of chips. PaLM-E shows robustness against adversarial disturbances, such as putting the bag of chips back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (the green star). This capability is facilitated by transfer learning from the vision and language models. |
In the second environment below, the same PaLM-E model solves long-horizon, precise tasks, such as "sort the blocks by colors into corners", on a different type of robot. It directly looks at the images and produces a sequence of shorter textually represented actions, e.g., "Push the blue cube to the bottom right corner", "Push the blue triangle there too." These are long-horizon tasks that were out of reach for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup.
PaLM-E controls a tabletop robot to successfully complete long-term tasks. |
The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very large number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E can not only solve these tasks, but also leverage visual and language knowledge transfer to do so more effectively.
PaLM-E produces plans for a task and motion planning environment. |
As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. Notably, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Furthermore, this result is achieved with a generalist model, without fine-tuning specifically on only that task.
PaLM-E exhibits capabilities like visual chain-of-thought reasoning, in which the model breaks down its answering process into smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images, although it is trained only on single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms CC BY 2.0 and was posted to Flickr by kowarski. The image of Kobe Bryant is in the Public Domain. The other images were taken by us. |
Conclusion
PaLM-E pushes the boundaries of how generally capable models can be trained to simultaneously address vision, language, and robotics, while also being able to transfer knowledge from vision and language to the robotics domain. There are additional topics investigated in more detail in the paper, such as how to leverage neural scene representations with PaLM-E, and also the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.
PaLM-E not only provides a path toward building more capable robots that benefit from other data sources, but could also be a key enabler for other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.
Acknowledgements
This work was done in collaboration with several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussaint. Danny is a PhD student advised by Marc Toussaint at TU Berlin. We would also like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.