A team of researchers from Peking University, UCLA, Beijing University of Posts and Telecommunications, and Beijing Institute of Artificial General Intelligence presents JARVIS-1, a multimodal agent designed for open-world tasks in Minecraft. Leveraging pre-trained multimodal language models, JARVIS-1 interprets visual observations and human instructions and generates sophisticated plans for embodied control.
JARVIS-1 combines pre-trained multimodal language models with a multimodal memory, so its planning draws on both pre-trained knowledge and in-game experience. It achieves near-perfect performance on 200 diverse tasks and notably excels at the challenging long-horizon diamond pickaxe task, where it improves the completion rate five-fold. The study emphasizes the importance of multimodal memory for improving agent autonomy and general intelligence in open-world scenarios.
The research addresses the challenges of building sophisticated agents for complex tasks in open-world environments. Existing approaches struggle with multimodal data, long-term planning, and lifelong learning. The proposed JARVIS-1 agent, built on pre-trained multimodal language models, excels at Minecraft tasks. It achieves near-perfect performance on more than 200 tasks and significantly improves results on the long-horizon diamond pickaxe task. The agent also learns autonomously, evolving with minimal external intervention, which contributes to the pursuit of generally capable artificial intelligence.
JARVIS-1, built on pre-trained multimodal language models, combines visual and textual input to generate plans. The agent's multimodal memory integrates pre-trained knowledge with in-game experiences for planning. Existing approaches use a hierarchical goal-execution architecture with large language models as high-level planners. JARVIS-1 is evaluated on 200 tasks from the Minecraft Universe Benchmark; the evaluation reveals difficulties on diamond-related tasks, which stem from the controller's imperfect execution of short-horizon text instructions.
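To make the idea of planning with a multimodal memory more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class and function names (`MemoryEntry`, `MultimodalMemory`, `text_similarity`, `state_similarity`), the word-overlap and cosine scoring, and the two-number "state" vectors standing in for embeddings of visual observations are all assumptions made for illustration. The sketch only shows the general pattern described above: plans that worked are stored together with the task text and the situation in which they worked, and the closest past experience is retrieved as a reference when a new task arrives.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class MemoryEntry:
    task: str            # text instruction, e.g. "craft wooden pickaxe"
    state: list[float]   # hypothetical stand-in for an embedded visual observation
    plan: list[str]      # sequence of sub-goals that succeeded in this situation


def text_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the words in two instructions (assumed scoring)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def state_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two observation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class MultimodalMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, task: str, state: list[float], plan: list[str], success: bool) -> None:
        # Only successful experiences are kept as references for later planning.
        if success:
            self.entries.append(MemoryEntry(task, state, plan))

    def retrieve(self, task: str, state: list[float], k: int = 1) -> list[MemoryEntry]:
        # Rank stored experiences by combined text and state similarity.
        def score(entry: MemoryEntry) -> float:
            return text_similarity(task, entry.task) + state_similarity(state, entry.state)

        return sorted(self.entries, key=score, reverse=True)[:k]


memory = MultimodalMemory()
memory.add("craft wooden pickaxe", [1.0, 0.0],
           ["mine log", "craft planks", "craft sticks", "craft wooden pickaxe"], success=True)
memory.add("cook beef", [0.0, 1.0],
           ["kill cow", "craft furnace", "cook beef"], success=True)
memory.add("craft stone pickaxe", [1.0, 0.0], ["dig down"], success=False)  # not stored

# The retrieved plan would be handed to the planner as a reference for the new task.
reference = memory.retrieve("craft stone pickaxe", [0.9, 0.1], k=1)[0]
print(reference.task, reference.plan)
```

In a real agent the retrieved plan would be passed to the language model as context rather than executed directly, and the state vectors would come from a learned vision encoder; both of those parts are left out of this sketch.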
JARVIS-1's multimodal memory enables self-improvement, which increases the agent's autonomy and helps it outperform other instruction-following agents. Compared with DEPS, which has no memory, JARVIS-1 performs better on challenging tasks, and its success rate on diamond-related tasks nearly triples. The study highlights the importance of refining plan generation so that plans are easier to execute, and of improving the controller's ability to follow instructions, particularly in diamond-related tasks.
JARVIS-1, an open-world agent built on pre-trained multimodal language models, handles multimodal perception, plan generation, and embodied control within the Minecraft universe. Its multimodal memory improves decision-making by drawing on both pre-trained knowledge and real-time experience. JARVIS-1 substantially increases completion rates on tasks such as the long-horizon diamond pickaxe task, surpassing previous records by up to five times. This advance lays the foundation for future work on versatile and adaptable agents in complex virtual environments.
For future work, the study suggests refining plan generation so that plans are easier to execute and improving the controller's ability to follow instructions in diamond-related tasks. It also proposes exploring further ways to guide decision-making in open-world scenarios through multimodal memory and real-time experience. Extending JARVIS-1 to a broader range of Minecraft tasks, and possibly adapting it to other virtual environments, is recommended. The study encourages continuous improvement through lifelong learning, with the aim of greater general intelligence and autonomy in JARVIS-1.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and I want to create new products that make a difference.