Across industries, artificial intelligence (AI) and machine learning (ML) technologies ultimately aim to improve people's quality of life. One of the main applications of AI in recent times is designing agents that can perform decision-making tasks in various domains. For example, large language models like GPT-3 and PaLM and vision models like CLIP and Flamingo have proven exceptionally good at zero-shot learning in their respective fields. However, training such agents comes with a major drawback: environmental diversity. In simple terms, training on different tasks or environments requires different state spaces, which can impede learning, knowledge transfer, and the generalizability of models across domains. Moreover, for reinforcement learning (RL) based tasks, it is difficult to design task-specific reward functions for every setting.
Working on this problem statement, a team at Google Research investigated whether such tools can be used to build more general-purpose agents. For their research, the team focused specifically on text-guided video generation, in which a desired goal in the form of text is fed to a planner, which synthesizes a sequence of frames representing the intended course of action, after which control actions are extracted from the generated video. In their recent paper, "Learning Universal Policies via Text-Guided Video Generation," the Google team proposed a Universal Policy (UniPi) that addresses the challenges of environmental diversity and reward specification. UniPi uses text as the universal interface for task descriptions and video as the universal interface for communicating action and observation behavior in diverse situations. Specifically, the team designed a video generator as a planner that takes the current image frame and a text prompt describing the current goal as input and generates a trajectory in the form of a sequence of images, or video. The generated video is then fed into an inverse dynamics model that extracts the underlying actions being executed. This approach stands out because it leverages the universal nature of language and video to generalize to new goals and tasks across diverse settings.
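To make this two-stage pipeline concrete, here is a minimal PyTorch sketch of how a text-conditioned video planner and an inverse dynamics model might be composed. The class names, stub implementations, and tensor shapes (VideoPlanner, InverseDynamics, a 512-dimensional text embedding) are hypothetical placeholders for illustration, not the authors' actual models or API.

```python
import torch

class VideoPlanner(torch.nn.Module):
    """Stub planner: maps (first frame, text embedding) to a frame trajectory."""
    def __init__(self, horizon=16):
        super().__init__()
        self.horizon = horizon

    def forward(self, first_frame, text_emb):
        # A real planner would run text-conditioned video diffusion here;
        # this stub just repeats the first frame as a placeholder trajectory.
        return first_frame.unsqueeze(1).repeat(1, self.horizon, 1, 1, 1)

class InverseDynamics(torch.nn.Module):
    """Stub inverse dynamics: maps consecutive frame pairs to actions."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Flatten(),
                                       torch.nn.LazyLinear(action_dim))

    def forward(self, frame_t, frame_tp1):
        return self.net(torch.cat([frame_t, frame_tp1], dim=1))

def unipi_style_rollout(planner, inv_dyn, first_frame, text_emb):
    """Plan a video from the current frame and goal text, then decode it
    into a sequence of low-level actions."""
    video = planner(first_frame, text_emb)            # (B, T, C, H, W)
    actions = [inv_dyn(video[:, t], video[:, t + 1])
               for t in range(video.shape[1] - 1)]
    return torch.stack(actions, dim=1)                # (B, T-1, action_dim)

frame = torch.randn(1, 3, 64, 64)        # current observation
text_emb = torch.randn(1, 512)           # e.g. a T5 sentence embedding
actions = unipi_style_rollout(VideoPlanner(), InverseDynamics(), frame, text_emb)
print(actions.shape)                     # torch.Size([1, 15, 7])
```

In a real system, the stub planner would be replaced by a trained video diffusion model and the stub decoder by a trained inverse dynamics network; only the overall plan-then-decode structure is meant to reflect the paper.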
In recent years, significant progress has been made in text-guided image synthesis, yielding models with an exceptional ability to generate sophisticated images. This further motivated the team to choose video generation as the backbone of their decision-making approach. The UniPi method proposed by the Google researchers consists of four main components: trajectory consistency through tiling, hierarchical planning, flexible behavior modulation, and task-specific action adaptation, which are described in detail below:
1. Trajectory consistency through tiling:
Existing text-to-video methods often produce videos in which the underlying environment state changes substantially over time. However, keeping the environment consistent across all timesteps is essential for building an accurate trajectory planner. Therefore, to enforce environment consistency in conditional video synthesis, the researchers additionally provide the observed image while denoising each frame in the synthesized video. To preserve the underlying environment state across time, UniPi directly concatenates each intermediate noisy frame with the conditioned observed image across sampling steps, as sketched below.
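A minimal sketch of this conditioning scheme follows, assuming a generic video denoiser; the function, tensor shapes, and toy denoiser are illustrative stand-ins rather than UniPi's actual implementation.

```python
import torch

def denoise_step_with_tiling(denoiser, noisy_video, observed_frame, t):
    """One (hypothetical) denoising step in which the observed first frame
    is tiled across time and concatenated channel-wise with every noisy
    frame, so the model conditions on a fixed underlying environment state.

    noisy_video:    (B, T, C, H, W) current noisy sample
    observed_frame: (B, C, H, W)    conditioning image
    """
    b, T, c, h, w = noisy_video.shape
    cond = observed_frame.unsqueeze(1).expand(b, T, c, h, w)  # tile over time
    model_in = torch.cat([noisy_video, cond], dim=2)          # (B, T, 2C, H, W)
    return denoiser(model_in, t)

# Toy denoiser that just returns the first C channels, to show shapes.
denoiser = lambda x, t: x[:, :, :3]
out = denoise_step_with_tiling(denoiser,
                               torch.randn(1, 16, 3, 64, 64),
                               torch.randn(1, 3, 64, 64), t=0)
print(out.shape)  # torch.Size([1, 16, 3, 64, 64])
```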
2. Hierarchical Planning:
Generating every necessary action is difficult when planning in complex, sophisticated environments over long horizons. Planning methods overcome this problem by exploiting a natural hierarchy: creating coarse plans in a smaller space and refining them into more detailed ones. Similarly, in the video generation process, UniPi first creates a video at a coarse level that demonstrates the desired agent behavior, and then refines it to make it more realistic by filling in missing frames and smoothing the result. This is done with a hierarchy of stages, each improving the video's quality until it reaches the desired level of detail, as in the sketch below.
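The sketch below caricatures this coarse-to-fine process under a simplifying assumption: the learned frame-infilling stage is stood in for by linear interpolation along the time axis. In UniPi itself, refinement is performed by additional learned diffusion stages, so this is a shape-level illustration only.

```python
import torch
import torch.nn.functional as F

def hierarchical_plan(coarse_planner, first_frame, text_emb, upsample=4):
    """Sketch of coarse-to-fine planning: generate a short coarse video,
    then fill in intermediate frames. A learned frame-infilling diffusion
    stage is approximated here by linear interpolation along time."""
    coarse = coarse_planner(first_frame, text_emb)       # (B, T, C, H, W)
    b, T, c, h, w = coarse.shape
    flat = coarse.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, T)
    fine = F.interpolate(flat, size=T * upsample, mode="linear",
                         align_corners=True)
    return fine.reshape(b, c, h, w, -1).permute(0, 4, 1, 2, 3)

coarse_planner = lambda f, e: f.unsqueeze(1).repeat(1, 8, 1, 1, 1)  # toy stub
plan = hierarchical_plan(coarse_planner, torch.randn(1, 3, 64, 64),
                         torch.randn(1, 512))
print(plan.shape)  # torch.Size([1, 32, 3, 64, 64])
```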
3. Flexible behavior modulation:
When planning a sequence of actions toward a given goal, external constraints can easily be included to modify the generated plan. This can be done by incorporating a probabilistic prior that reflects the desired constraints on the plan's properties. The prior can be described by a learned classifier, or by a Dirac delta distribution on a particular image to guide the plan toward specific states; UniPi supports both. The researchers trained the text-conditioned video generation model with the video diffusion algorithm, conditioning on pretrained language features encoded by the Text-to-Text Transfer Transformer (T5). A sketch of this guidance mechanism follows.
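The sketch below illustrates both forms of guidance under stated assumptions: constraint_logp is a hypothetical learned classifier returning the log-probability that a trajectory satisfies a constraint, and pin_goal_frame imposes a Dirac-delta-style constraint by clamping the final frame to a goal image. Neither is UniPi's exact mechanism; this is the generic classifier-guidance pattern for diffusion samplers.

```python
import torch

def guided_denoise_step(denoiser, constraint_logp, x_t, t, scale=1.0):
    """One sampling step with classifier-style guidance. The gradient of
    the (hypothetical) constraint log-probability nudges the denoised
    estimate toward constraint-satisfying plans."""
    x_t = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(constraint_logp(x_t).sum(), x_t)[0]
    return denoiser(x_t, t) + scale * grad

def pin_goal_frame(x_t, goal_image):
    """Dirac-delta-style constraint: clamp the final frame of the noisy
    trajectory to a desired goal image at every sampling step."""
    x_t = x_t.clone()
    x_t[:, -1] = goal_image
    return x_t

# Toy usage with stand-in components.
denoiser = lambda x, t: x * 0.9
constraint_logp = lambda x: -(x ** 2).mean(dim=(1, 2, 3, 4))  # prefer small values
x = torch.randn(1, 16, 3, 64, 64)
x = pin_goal_frame(x, torch.randn(1, 3, 64, 64))
x = guided_denoise_step(denoiser, constraint_logp, x, t=0)
print(x.shape)  # torch.Size([1, 16, 3, 64, 64])
```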
4. Task-specific action adaptation:
A small inverse dynamics model is trained to translate video frames into low-level control actions using a set of synthesized videos. This model is separate from the planner and can be trained on a smaller dataset generated by a simulator. Given the input frame and a text description of the current goal, the planner synthesizes the image frames, and the inverse dynamics model then extracts from them the sequence of actions for future steps. An agent executes these low-level control actions using closed-loop control. A minimal training sketch is shown below.
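As a rough illustration, the sketch below trains a small convolutional inverse dynamics model on (frame, next frame, action) triples. The architecture, the 7-dimensional action space, and the random placeholder data are assumptions for demonstration, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

class InverseDynamics(torch.nn.Module):
    """Small model mapping a pair of consecutive frames to the action that
    transitions between them. The architecture here is illustrative."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(6, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(64, action_dim))

    def forward(self, frame_t, frame_tp1):
        return self.encoder(torch.cat([frame_t, frame_tp1], dim=1))

# Hypothetical training loop on (frame_t, frame_t+1, action) triples that a
# simulator might produce; random tensors stand in for the real dataset.
model = InverseDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(3):
    f_t, f_tp1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
    action = torch.randn(8, 7)
    loss = F.mse_loss(model(f_t, f_tp1), action)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```

Because this decoder is decoupled from the planner, it can be retrained cheaply for a new embodiment without touching the video generation model, which is the design choice the paper highlights.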
In summary, Google researchers have made an impressive contribution by showing the value of text-guided video generation for learning policies capable of combinatorial generalization, multi-task learning, and real-world transfer. The researchers evaluated their approach on a number of novel language-based tasks and found that UniPi generalizes well to both seen and unseen combinations of language prompts, compared with baselines such as Transformer BC, Trajectory Transformer, and Diffuser. These encouraging findings highlight the potential of generative models and the vast amounts of available data as valuable resources for building versatile decision-making systems.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in various challenges.