Mark Hamilton, an MIT doctoral student in electrical engineering and computer science and an affiliate of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate. To achieve this, he first set out to create a system that could learn human language “from scratch.”
“Interestingly, the key moment of inspiration came from the movie 'March of the Penguins.' There's a scene where a penguin falls while crossing the ice and lets out a little groan as it gets up. When you watch it, it's almost obvious that the groan is standing in for a four-letter word. This was the moment we thought maybe we needed to use audio and video to learn language,” says Hamilton. “Is there some way we could let an algorithm watch TV all day and from that figure out what we're talking about?”
“Our model, 'DenseAV,' aims to learn language by predicting what it sees from what it hears, and vice versa. For example, if you hear the sound of someone saying 'bake the cake at 350,' chances are you're looking at a cake or an oven. To succeed at this audio-video matching game across millions of videos, the model has to learn what people are talking about,” says Hamilton.
Once they trained DenseAV on this matching game, Hamilton and his colleagues examined which pixels the model looked at when it heard a sound. For example, when someone says “dog,” the algorithm immediately starts searching for dogs in the video stream. By seeing which pixels the algorithm selects, you can discover what the algorithm thinks a word means.
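To picture this readout concretely, it can be treated as a simple similarity computation. The sketch below (in PyTorch, and not the authors' code; the tensor names and shapes are illustrative assumptions) compares one audio feature vector against per-pixel visual features and keeps the cosine similarities as a heatmap over the frame.

import torch
import torch.nn.functional as F

def word_heatmap(audio_feat, visual_feats):
    # audio_feat:   (D,)      hypothetical feature for the moment a word is spoken
    # visual_feats: (D, H, W) hypothetical per-pixel features for one video frame
    # returns:      (H, W)    similarity map; high values mark the pixels the model
    #                         associates with the spoken word
    a = F.normalize(audio_feat, dim=0)      # unit-length audio vector
    v = F.normalize(visual_feats, dim=0)    # unit-length feature vector at every pixel
    return torch.einsum("d,dhw->hw", a, v)  # cosine similarity per pixel

Displaying or thresholding such a map shows which pixels “light up” when a given word is heard.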
A similar search process occurs when DenseAV hears a dog barking: it searches for a dog in the video stream. “This piqued our interest. We wanted to see if the algorithm knew the difference between the word ‘dog’ and a dog's bark,” says Hamilton. The team explored this by giving DenseAV a “two-sided brain.” Interestingly, they found that one side of DenseAV's brain naturally focused on language, like the word “dog,” and the other side focused on sounds, like barking. This showed that DenseAV not only learned the meaning of words and the locations of sounds, but also learned to distinguish between these types of cross-modal connections, all without human intervention or any knowledge of written language.
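One way to picture the “two-sided brain” is to split the shared feature dimension into heads and let each head produce its own audio-visual similarity map; nothing in training labels the heads, so any separation of words from sounds has to emerge on its own. The sketch below is an illustrative assumption about how such a split could be computed, not DenseAV's actual interface.

import torch
import torch.nn.functional as F

def per_head_similarity(audio_feat, visual_feats, num_heads=2):
    # audio_feat:   (D,)      audio feature at one time step (hypothetical; D divisible by num_heads)
    # visual_feats: (D, H, W) per-pixel visual features (hypothetical)
    # returns:      (num_heads, H, W) one similarity map per head
    D, H, W = visual_feats.shape
    a = F.normalize(audio_feat.reshape(num_heads, D // num_heads), dim=1)
    v = F.normalize(visual_feats.reshape(num_heads, D // num_heads, H, W), dim=1)
    return torch.einsum("kd,kdhw->khw", a, v)  # per-head cosine similarity per pixel

In this picture, the finding described above corresponds to one head responding to the spoken word “dog” and the other to the sound of barking.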
One branch of applications is learning from the enormous amount of video posted to the internet every day. “We want systems that can learn from massive amounts of video content, such as instructional videos,” says Hamilton. “Another exciting application is understanding new languages, like dolphin or whale communication, which don't have a written form. Our hope is that DenseAV can help us understand these languages that have eluded human translation efforts from the start. Finally, we hope this method can be used to discover patterns between other pairs of signals, like the seismic sounds the earth makes and its geology.”
A formidable challenge lay ahead of the team: learning language without any text input. Their goal was to rediscover the meaning of language from a blank slate, avoiding the use of pre-trained language models. This approach is inspired by how children learn, by observing and listening to their environment to understand language.
To achieve this feat, DenseAV uses two main components to process audio and visual data separately. This separation made it impossible for the algorithm to cheat by letting the visual side peek at the audio, and vice versa. It forced the algorithm to recognize objects, and it created detailed and meaningful features for both visual and audio signals. DenseAV learns by comparing pairs of audio and visual signals to find which signals match and which don't. This method, called contrastive learning, doesn't require labeled examples, and it allows DenseAV to discover the important predictive patterns of language itself.
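The sketch below shows a generic, InfoNCE-style contrastive loss of the kind described here, assuming each clip has already been reduced to one audio embedding and one visual embedding; matching pairs from the same clip are pulled together, and mismatched pairs from other clips in the batch are pushed apart. It illustrates the training signal, not DenseAV's exact loss.

import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (B, D) one embedding per clip in the batch (hypothetical shapes)
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(visual_emb, dim=1)
    logits = a @ v.t() / temperature     # (B, B) pairwise similarities across the batch
    targets = torch.arange(a.shape[0])   # matching audio-visual pairs sit on the diagonal
    # symmetric loss: audio-to-video and video-to-audio
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))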
An important difference between DenseAV and previous algorithms is that prior work focused on a single notion of similarity between a sound and an image: an entire audio clip, such as someone saying “the dog sat on the grass,” was matched to an entire image of a dog. This prevented previous methods from discovering fine-grained details, like the connection between the word “grass” and the grass underneath the dog. The team's algorithm instead finds and aggregates all possible matches between an audio clip and an image's pixels. This not only improved performance, but allowed the team to precisely localize sounds in a way that previous algorithms could not. “Conventional methods use a single class token, but our approach compares every pixel and every second of sound. This fine-grained method lets DenseAV make more detailed connections for better localization,” says Hamilton.
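The difference can be sketched as building a full similarity volume between every audio time step and every pixel, and only then pooling it into a single clip-level score. The aggregation below (max over pixels, then mean over time) is one plausible choice rather than DenseAV's exact pooling, and the tensor names are assumptions.

import torch
import torch.nn.functional as F

def dense_clip_similarity(audio_feats, visual_feats):
    # audio_feats:  (T, D)    one feature per audio time step
    # visual_feats: (D, H, W) one feature per pixel
    # returns:      scalar    aggregated similarity for the clip/image pair
    a = F.normalize(audio_feats, dim=1)
    v = F.normalize(visual_feats, dim=0).flatten(1)  # (D, H*W)
    sim = a @ v                                      # (T, H*W) dense similarity volume
    return sim.max(dim=1).values.mean()              # best pixel per time step, averaged over time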
The researchers trained DenseAV on AudioSet, which includes 2 million YouTube videos. They also created new datasets to test how well the model can link sounds and images. In these tests, DenseAV outperformed other top models on tasks like identifying objects from their names and sounds, demonstrating its effectiveness. “Previous datasets only supported coarse evaluations, so we created a dataset using semantic segmentation datasets. This helps with pixel-perfect annotations for precise evaluation of our model's performance. We can prompt the algorithm with specific sounds or images and get those detailed localizations,” says Hamilton.
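A pixel-level evaluation of this kind can be scored by thresholding the model's similarity map for a given sound or spoken-word prompt and comparing the resulting mask against a ground-truth segmentation with intersection-over-union. The threshold and helper names below are illustrative assumptions, not the paper's evaluation code.

import torch

def iou(pred_mask, gt_mask):
    # pred_mask, gt_mask: (H, W) boolean tensors
    inter = (pred_mask & gt_mask).sum().item()
    union = (pred_mask | gt_mask).sum().item()
    return inter / union if union > 0 else 0.0

def evaluate_prompt(similarity_map, gt_mask, threshold=0.5):
    # similarity_map: (H, W) model scores for one sound or spoken-word prompt
    return iou(similarity_map > threshold, gt_mask.bool())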
Due to the enormous amount of data involved, the project took approximately a year to complete. The team says the transition to a large transformer architecture presented challenges, as these models can easily miss fine details. Encouraging the model to focus on these details was a major hurdle.
Looking ahead, the team aims to create systems that can learn from massive amounts of video-only or audio-only data. This is crucial for new domains where there is plenty of either modality on its own, but not paired together. They also aim to scale this up using larger backbones, and possibly integrate knowledge from language models to improve performance.
“Recognizing and segmenting visual objects in images, as well as environmental sounds and spoken words in audio recordings, are each difficult problems in their own right. Historically, researchers have relied on expensive human-provided annotations in order to train machine learning models to perform these tasks,” says David Harwath, an assistant professor of computer science at the University of Texas at Austin, who was not involved in the work. “DenseAV makes significant progress toward developing methods that can learn to solve these tasks simultaneously simply by observing the world through sight and sound, building on the insight that the things we see and interact with often make sound, and that we also use spoken language to talk about them. This model also makes no assumptions about the specific language being spoken and could therefore, in principle, learn from data in any language. It would be exciting to see what DenseAV could learn by scaling it up to thousands or millions of hours of video data across a multitude of languages.”
Additional authors on a paper describing the work are Andrew Zisserman, professor of computer vision engineering at the University of Oxford; John R. Hershey, AI perception researcher at Google; and William T. Freeman, professor of electrical engineering and computer science at MIT and a CSAIL principal investigator. Their research was supported, in part, by the US National Science Foundation, a Royal Society Research Professorship, and a grant from the EPSRC Visual AI program. This work will be presented at the IEEE/CVF Computer Vision and Pattern Recognition Conference this month.