For natural language to be an effective form of communication, the parties involved must be able to understand the words and their context, assume the content is shared largely in good faith and is trustworthy, reason about the information being shared, and then apply it to real-world scenarios. MIT PhD students interning with the MIT-IBM Watson AI Lab (Athul Paul Jacob SM '22, Maohao Shen SM '23, Victor Butoi, and Andi Peng SM '23) are working to attack each step of this process, which is baked into natural language models, so that AI systems can be more reliable and accurate for users.
To achieve this, Jacob's research strikes at the heart of existing natural language models to improve their output, using game theory. His interests, he says, are twofold: “One is understanding how humans behave, using the lens of multi-agent systems and language understanding, and the second is, 'How can that be used as knowledge to build better AI systems?'” His work stems from the board game “Diplomacy,” where his research team developed a system that could learn and predict human behaviors and negotiate strategically to achieve a desired, optimal outcome.
“This was a game in which you need to build trust; you need to communicate using language. You also need to play against six other players at the same time, which was very different from all the types of tasks people had tackled in the past,” says Jacob, referring to other games, such as poker and Go, that researchers have applied neural networks to. “In doing so, there were many research challenges. One was: 'How do you model humans? How do you know if humans tend to act irrationally?'” Jacob and his research mentors, including Associate Professor Jacob Andreas and Assistant Professor Gabriele Farina of the MIT Department of Electrical Engineering and Computer Science (EECS) and the MIT-IBM Watson AI Lab's Yikang Shen, reframed the problem of language generation as a two-player game.
Using “generator” and “discriminator” models, Jacob's team developed a natural language system that produces answers to questions and then observes those answers and determines whether they are correct. If they are, the AI system receives a point; if not, no points are awarded. Language models are notorious for hallucinating, which makes them less trustworthy; this no-regret learning algorithm cooperatively takes a natural language model and encourages the system's answers to be more truthful and reliable, while keeping the solutions close to those of the pre-trained language model. Jacob says that using this technique in conjunction with a smaller language model could likely make it competitive with a model many times its size.
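To make the idea concrete, here is a minimal, hedged sketch of how a generator-discriminator “game” over candidate answers might be resolved with a no-regret update (multiplicative weights is used here only as a stand-in no-regret method); the candidate answers, scores, and hyperparameters are invented for illustration and are not the team's implementation.

```python
# Toy sketch: answer selection as a two-player game between a "generator"
# and a "discriminator", resolved with a no-regret (multiplicative-weights)
# update. All numbers below are invented for illustration.
import math

candidates = ["Paris", "Lyon", "Marseille"]

# Hypothetical scores from a pre-trained generator LM for each candidate answer
gen_prior = {"Paris": 0.5, "Lyon": 0.3, "Marseille": 0.2}
# Hypothetical probabilities that a discriminator would call each answer correct
disc_prior = {"Paris": 0.8, "Lyon": 0.15, "Marseille": 0.05}

def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

gen_policy = normalize(gen_prior)
disc_policy = normalize(disc_prior)

eta = 0.5  # step size of the no-regret update
lam = 0.1  # pull toward each player's prior, keeping answers close to the base model

for _ in range(50):
    # Each player is rewarded for agreeing with the other, regularized toward its prior.
    gen_policy = normalize({
        a: gen_policy[a] * math.exp(eta * (disc_policy[a] + lam * math.log(gen_prior[a])))
        for a in candidates
    })
    disc_policy = normalize({
        a: disc_policy[a] * math.exp(eta * (gen_policy[a] + lam * math.log(disc_prior[a])))
        for a in candidates
    })

print("Consensus answer:", max(gen_policy, key=gen_policy.get))
```

In this toy setup, both players converge toward the answer they can agree on, while the regularization term keeps their choices anchored to the pre-trained models' original preferences.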
Once a language model generates a result, researchers ideally want its confidence in that generation to align with its accuracy, but this is often not the case. Hallucinations can occur when the model reports high confidence where it should be low. Maohao Shen and his group, with their mentors Gregory Wornell, Sumitomo Professor of Engineering in EECS, and IBM Research lab researchers Subhro Das, Prasanna Sattigeri, and Soumya Ghosh, aim to fix this through uncertainty quantification (UQ). “Our project aims to calibrate language models when they are poorly calibrated,” says Shen. Specifically, they are looking at the classification problem. To do this, Shen lets a language model generate free text, which is then converted into a multiple-choice classification task. For instance, they might ask the model to solve a math problem and then ask it whether the answer it generated is correct: “yes, no, or maybe.” This helps determine whether the model is overconfident or underconfident.
By automating this, the team developed a technique that helps tune the confidence output of a pre-trained language model. The researchers trained an auxiliary model using ground-truth data so that their system can correct the language model. “If your model is overconfident in its prediction, we can detect that and make it less confident, and vice versa,” Shen explains. The team evaluated their technique on several popular benchmark datasets to show how well it generalizes to unseen tasks, realigning the accuracy and confidence of the language model's predictions. “After training, you can simply plug in and apply this technique to new tasks without any other supervision,” Shen says. “All you need is the data for that new task.”
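As a rough illustration of what this kind of recalibration can look like, the sketch below uses classic temperature scaling as a simple stand-in for the auxiliary model the team trained (it is not their method); the confidences and ground-truth labels are made up.

```python
# Minimal calibration sketch: fit a temperature on labeled data so that the
# model's self-reported confidences better match how often it is right.
import numpy as np

# Hypothetical self-assessments: the model's probability that its own answer
# is correct, and the ground-truth label of whether it actually was.
conf = np.array([0.99, 0.95, 0.90, 0.97, 0.92, 0.88, 0.96, 0.93])
correct = np.array([1, 0, 1, 1, 0, 1, 0, 1])

def nll(temperature: float) -> float:
    """Negative log-likelihood of the labels under temperature-rescaled confidences."""
    logits = np.log(conf / (1 - conf)) / temperature  # soften (or sharpen) the logits
    p = 1 / (1 + np.exp(-logits))
    return -np.mean(correct * np.log(p) + (1 - correct) * np.log(1 - p))

# Grid-search the temperature on the held-out, labeled examples.
temps = np.linspace(0.5, 10, 200)
best_t = temps[np.argmin([nll(t) for t in temps])]

recalibrated = 1 / (1 + np.exp(-np.log(conf / (1 - conf)) / best_t))
print(f"fitted temperature: {best_t:.2f}")        # > 1 means the model was overconfident
print("recalibrated confidences:", np.round(recalibrated, 2))
```

Here the average reported confidence (about 0.94) is well above the actual accuracy (5 of 8 correct), so the fitted temperature comes out greater than 1 and the rescaled confidences are pushed downward, mirroring the “make it less confident” behavior Shen describes.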
Victor Butoi also enhances model capability, but his lab team, which includes John Guttag, Dugald C. Jackson Professor of Computer Science and Electrical Engineering in EECS; lab researchers Leonid Karlinsky and Rogerio Feris of IBM Research; and lab affiliates Hilde Kühne of the University of Bonn and Wei Lin of Graz University of Technology, is instead creating techniques that allow vision-language models to reason about what they are seeing, and is designing prompts to unlock new learning abilities and understand key phrases.
Compositional reasoning is just another aspect of the decision-making process that we ask machine-learning models to perform in order for them to be useful in real-world situations, Butoi explains. “You need to be able to think about problems compositionally and solve subtasks,” says Butoi. “For example, if you say the chair is to the left of the person, you need to recognize both the chair and the person. You need to understand directions.” Then, once the model understands “left,” the research team wants the model to be able to answer other questions involving “left.”
Surprisingly, vision-language models don't reason well about composition, Butoi explains, but they can be helped by using a model that can “lead the witness,” so to speak. The team developed a model that was fine-tuned using a technique called low-rank adaptation of large language models (LoRA) and trained on an annotated dataset called Visual Genome, which has objects in an image and arrows denoting relationships, like directions. In this case, the trained LoRA model would be guided to say something about “left” relationships, and this caption output would then be used to provide context and prompt the vision-language model, making it a “significantly easier task,” says Butoi.
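One hedged way to picture that pipeline, assuming standard open-source tooling (Hugging Face transformers and peft) and a placeholder base model rather than the team's actual setup, is sketched below; the training data format, caption, and prompt wording are illustrative assumptions.

```python
# Sketch of the two-stage idea: (1) adapt a language model with LoRA so it can
# describe spatial ("left of") relations, then (2) use its caption as extra
# context in the prompt handed to a vision-language model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "gpt2"  # placeholder base model, chosen only for illustration
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA adds small, trainable low-rank matrices while freezing the base weights.
lora_cfg = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection layers
    fan_in_fan_out=True,        # GPT-2 uses Conv1D-style weight layout
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Fine-tuning on Visual Genome-style relation annotations
# (e.g., "chair -- to the left of --> person") would go here; omitted for brevity.

# After adaptation, the model's description of the relevant relation is used
# as context in the prompt given to the vision-language model.
relation_caption = "The chair is to the left of the person."  # hypothetical adapter output
question = "Is the chair to the left or to the right of the person?"
vlm_prompt = f"Context: {relation_caption}\nQuestion: {question}"
print(vlm_prompt)
```

The point of the sketch is the division of labor: the small, cheaply adapted model supplies the relational “hint,” and the vision-language model then answers a question that has become significantly easier.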
In the world of robotics, AI systems also engage with their surroundings through computer vision and language, and those environments can range from warehouses to the home. Andi Peng and her mentors, MIT's H.N. Slater Professor in Aeronautics and Astronautics Julie Shah and Chuang Gan of the lab and the University of Massachusetts at Amherst, are focusing on assisting people with physical limitations, using virtual worlds. To do this, Peng's group is developing two embodied AI models (a “human” who needs support and a helper agent) in a simulated environment called ThreeDWorld. Focusing on human-robot interactions, the team leverages the semantic knowledge captured by large language models to help the helper AI infer which abilities the “human” agent might not be able to perform and the motivation behind the “human's” actions, using natural language. The team is looking to strengthen the helper's sequential decision-making, two-way communication, ability to understand the physical scene, and how best to contribute.
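To give a flavor of the language side of this, here is a heavily simplified, hypothetical sketch (not the team's system) of turning an observed struggle into a question for a language model, whose answer could then inform the helper agent's next action.

```python
# Illustrative only: the observation, prompt wording, and helper function are
# invented; in the actual research this reasoning would be grounded in the
# ThreeDWorld simulation rather than a hand-written string.

def build_inference_prompt(observation: str) -> str:
    """Turn an observed struggle into a question for a large language model."""
    return (
        "A person in a household environment was observed doing the following: "
        f"{observation}\n"
        "Which physical ability might they be lacking, and how could a helper "
        "robot best assist them? Answer briefly."
    )

observation = "They repeatedly reached toward a mug on a high shelf, then gave up."
prompt = build_inference_prompt(observation)
print(prompt)
# In practice, the prompt would be sent to a large language model, and the
# reply (e.g., "they may be unable to reach high; fetch the mug for them")
# would shape the helper agent's plan.
```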
“Many people think that AI programs should be autonomous, but I think an important part of the process is that we build robots and systems for humans, and we want to pass on human knowledge,” Peng says. “We don't want a system to do something in a strange way; we want them to do it in a human way that we can understand.”