Powerful AI models can now be operated through natural-language commands, making them widely accessible and adaptable. Stable Diffusion, which turns a natural-language prompt into an image, and ChatGPT, which responds to messages written in natural language and performs a wide range of tasks, are examples of such models. While training these models can cost anywhere from tens of thousands to millions of dollars, there has been an equally exciting development: robust open-source base models such as LLaMA can be made instruction-following with surprisingly little compute and data.
In this research, researchers from the University of Toronto and the Vector Institute for Artificial Intelligence investigate the feasibility of such a strategy in sequential decision-making domains. Unlike the text and image domains, demonstration data for sequential decision-making is very expensive to collect and often lacks a user-friendly "instruction" label analogous to image captions. Building on recent advances in instruction-tuned LLMs such as Alpaca, they propose fine-tuning pretrained generative behavior models with instruction data. In the past year, two foundation models for the popular open-ended video game Minecraft have been released: MineCLIP, a model that aligns text with video clips, and VPT, a behavior model.
This has created a fascinating opportunity to investigate instruction-following fine-tuning in Minecraft's sequential decision-making domain. Because VPT was trained on 70,000 hours of Minecraft gameplay, the agent has extensive knowledge of the Minecraft world. Just as aligning LLMs to obey instructions unlocked their enormous potential, tuning the VPT model to follow instructions could unlock broad, controllable behavior. Specifically, they show how to fine-tune VPT to follow short-horizon text instructions using only $60 of compute and around 2,000 instruction-labeled trajectory segments.
Their methodology is inspired by unCLIP, which was used to develop the popular DALL·E 2 text-to-image model. They decompose the challenge of building an instruction-following Minecraft agent into two models: a VPT model fine-tuned to achieve visual goals embedded in MineCLIP's latent space, and a prior model that translates text instructions into visual MineCLIP embeddings. Rather than relying on expensive text instruction labels, they fine-tune VPT with behavioral cloning on self-supervised data produced by hindsight relabeling, using visual MineCLIP embeddings as goals.
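To make the hindsight-relabeling idea concrete, here is a minimal sketch of how goal-conditioned training tuples could be mined from unlabeled gameplay. The names used here (`mineclip_encode`, the `trajectory` record layout, `clip_len`) are illustrative assumptions, not the authors' actual code.

```python
import random

def hindsight_relabel(trajectory, mineclip_encode, clip_len=16):
    """Build (observation, goal_embedding, action) training tuples from an
    unlabeled trajectory by treating a future video clip as the goal the
    agent was implicitly pursuing. `mineclip_encode` stands in for a
    MineCLIP-style video encoder (an assumption for illustration).
    """
    examples = []
    T = len(trajectory)
    for t in range(T - clip_len):
        # Sample a future timestep and embed the clip starting there.
        g = random.randint(t + 1, T - clip_len)
        goal_clip = [step["frame"] for step in trajectory[g : g + clip_len]]
        goal_emb = mineclip_encode(goal_clip)  # visual MineCLIP embedding
        # The action actually taken at time t becomes the supervised label
        # for "what to do from this observation when pursuing this goal".
        examples.append((trajectory[t]["obs"], goal_emb, trajectory[t]["action"]))
    return examples
```

Because the goals come from the agent's own future observations, no human instruction labels are needed at this stage; the text prior bridges the gap to language at inference time.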
They combine unCLIP with classifier-free guidance to develop their agent, called STEVE-1, which significantly exceeds the benchmark established by Baker et al. for following open-ended commands in Minecraft using low-level controls (mouse and keyboard) and raw pixel inputs.
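The classifier-free guidance step can be sketched as follows: the policy is queried twice per step, once conditioned on the goal embedding and once on an unconditional "null" embedding, and the two sets of action logits are combined to push behavior toward the goal. The interface (`policy`, `null_emb`, `scale`) is a hypothetical stand-in, not STEVE-1's actual API.

```python
def cfg_action_logits(policy, obs, goal_emb, null_emb, scale=6.0):
    """Classifier-free guidance over action logits (illustrative sketch).

    `cond` and `uncond` are logit tensors of the same shape; the guidance
    scale amplifies the difference between goal-conditioned and
    unconditional behavior before sampling an action.
    """
    cond = policy(obs, goal_emb)    # goal-conditioned action logits
    uncond = policy(obs, null_emb)  # unconditional action logits
    return uncond + scale * (cond - uncond)
```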
Their main contributions are the following:
• They develop STEVE-1, a Minecraft agent that executes open-ended text and visual commands with high precision. They perform an in-depth analysis of their agent, showing that it can perform a wide variety of short-horizon tasks in Minecraft, and that chaining commands can significantly improve performance on longer-horizon tasks such as building and crafting (see the sketch after this list).
• They explain how to build STEVE-1 with only $60 of compute, demonstrating that unCLIP and classifier-free guidance are crucial for strong performance in sequential decision-making.
• They release the STEVE-1 model weights, evaluation scripts, and training scripts to encourage future research on instructable, open-ended sequential decision-making agents.
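The command-chaining approach mentioned above can be sketched as a simple loop over short-horizon subgoal prompts. Everything here (`text_prior`, the `agent.act`/`agent.step` interface, `steps_per_goal`) is a hypothetical interface for illustration, not the released code.

```python
def run_chained(agent, text_prior, subgoal_prompts, steps_per_goal=1000):
    """Run a long-horizon task as a chain of short-horizon text prompts.

    `text_prior` maps a text instruction to a visual MineCLIP goal
    embedding; `agent.act` maps (observation, goal) to a low-level
    (mouse/keyboard) action; `agent.step` applies it to the environment.
    """
    obs = agent.reset()
    for prompt in subgoal_prompts:
        goal_emb = text_prior(prompt)          # text -> visual goal embedding
        for _ in range(steps_per_goal):
            action = agent.act(obs, goal_emb)  # goal-conditioned policy
            obs = agent.step(action)
    return obs

# Example: decompose a crafting task into short-horizon steps.
# run_chained(agent, prior, ["chop a tree", "craft planks", "craft a table"])
```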
Video demos of the agent playing in-game are available on the project website.
Check out the Paper, Code, and Project Page. Don't forget to join our 23k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at [email protected]
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.