Researchers at MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine learning models used in applications like speech recognition and object detection. The work, for the first time, combines two self-supervised learning architectures, contrastive learning and masked data modeling, in an effort to scale machine learning tasks such as event classification in single-modal and multi-modal data without the need for annotation, thus replicating how humans understand and perceive our world.
“A lot of human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want to allow the machine learning model to have the same capability,” says Yuan Gong, an MIT postdoc at the Computer Science and Artificial Intelligence Laboratory (CSAIL).
“So, another way of saying it is that self-supervised learning often forms the foundation of an initial model, because it can learn on large amounts of unlabeled data. And then you can use classic supervised learning or reinforcement learning to fine-tune the model to something particular if you want,” says Jim Glass, a senior research scientist at MIT and a member of the MIT-IBM Watson AI Lab.
The technique, called the contrastive audiovisual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations in high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.
Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu of MIT, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.
A joint and coordinated approach
CAV-MAE works by “learning by prediction” and “learning by comparison,” says Gong. Masked data modeling, the prediction method, takes a video along with its coordinated audio waveform, converts the audio into a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed to separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audiovisual combination is then used to train the model for better performance. An example of this would be masking part of a video of a piano and part of a spectrogram of piano music, and then asking the model to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning leverages this association but can discard information unique to one modality, such as the background in a video.
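To make the prediction branch more concrete, the following is a minimal PyTorch sketch of that mask-then-reconstruct pipeline. Only the 75 percent mask ratio and the overall flow described above (mask both modalities, encode them separately, fuse them in a joint encoder/decoder, score with a reconstruction loss) come from the article; the layer sizes, class names such as ToyMaskedAVModel, and the simplification of treating tokens as already-embedded vectors are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256          # token embedding size (illustrative)
MASK_RATIO = 0.75  # fraction of tokens hidden, as described above

def mask_tokens(x, mask_ratio=MASK_RATIO):
    """Randomly hide a fraction of patch tokens; return the visible tokens,
    their original positions, and a boolean mask (True = masked)."""
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N, device=x.device).argsort(dim=1)
    keep_idx = perm[:, :n_keep]
    visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, keep_idx, False)
    return visible, keep_idx, mask

class ToyMaskedAVModel(nn.Module):
    """Separate audio/video encoders feeding a joint encoder/decoder that
    reconstructs the full token sequences of both modalities."""
    def __init__(self):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.enc_a = nn.TransformerEncoder(make_layer(), num_layers=2)  # audio-only encoder
        self.enc_v = nn.TransformerEncoder(make_layer(), num_layers=2)  # video-only encoder
        self.joint = nn.TransformerEncoder(make_layer(), num_layers=2)  # joint encoder/decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))          # placeholder for hidden positions

    def forward(self, a_tokens, v_tokens):
        # 1) Mask 75 percent of the audio-spectrogram and video patch tokens.
        a_vis, a_idx, a_mask = mask_tokens(a_tokens)
        v_vis, v_idx, v_mask = mask_tokens(v_tokens)
        # 2) Encode the visible tokens with modality-specific encoders.
        a_enc, v_enc = self.enc_a(a_vis), self.enc_v(v_vis)
        # 3) Put learnable mask tokens back in the hidden positions and
        #    run the joint encoder/decoder over both modalities together.
        a_full = self._restore(a_enc, a_idx, a_tokens.shape)
        v_full = self._restore(v_enc, v_idx, v_tokens.shape)
        joint = self.joint(torch.cat([a_full, v_full], dim=1))
        a_out, v_out = joint[:, :a_tokens.size(1)], joint[:, a_tokens.size(1):]
        # 4) Reconstruction loss: how far the predictions at the masked
        #    positions are from the original (hidden) tokens.
        return (F.mse_loss(a_out[a_mask], a_tokens[a_mask]) +
                F.mse_loss(v_out[v_mask], v_tokens[v_mask]))

    def _restore(self, visible, keep_idx, full_shape):
        B, N, _ = full_shape
        full = self.mask_token.expand(B, N, -1).clone()
        return full.scatter(1, keep_idx.unsqueeze(-1).expand(-1, -1, DIM), visible)

# Example: a batch of 10-second clips, each represented by 64 audio and 196 video tokens.
model = ToyMaskedAVModel()
loss = model(torch.randn(2, 64, DIM), torch.randn(2, 196, DIM))
loss.backward()
```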
Contrastive learning aims to map representations that are similar close to each other. For example, the model will try to place video and audio data from different parrots close to each other and farther away from video and audio pairs of guitars playing. Similar to masked autoencoding, the audiovisual pairs are passed to separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and computes the contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the speaker’s mouth movements with the spoken words. It then adjusts the model’s parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques with multiple forward data streams with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
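As a rough sketch of the comparison branch, the snippet below shows how pooled audio and video representations of paired clips could be scored with a symmetric, InfoNCE-style contrastive loss: matched pairs are pulled together, and unmatched pairs within the batch are pushed apart. The encoder outputs are stand-in random tensors here, and the temperature value and function names are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.05):
    """audio_emb, video_emb: (B, D) pooled clip representations, where row i
    of each tensor comes from the same 10-second clip."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature               # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs sit on the diagonal; treat matching in both directions
    # (audio -> video and video -> audio) as classification over the batch.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Stand-ins for mean-pooled outputs of the modality-specific encoders:
audio_emb = torch.randn(8, 256)  # 8 audio clips
video_emb = torch.randn(8, 256)  # the 8 paired video clips
print(contrastive_loss(audio_emb, video_emb))
```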
“We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining masked autoencoder and contrastive learning, we can improve performance,” says Gong, “and the results support our hypothesis that there is an obvious improvement.”
The researchers tested CAV-MAE, as well as their method without contrastive loss or a masked autoencoder, against other state-of-the-art methods on audiovisual retrieval and audiovisual event classification tasks using the standard AudioSet (20K and 2M) and VGGSound datasets: labeled, realistic short clips, which can include multiple sounds. Audiovisual retrieval means the model is given either the audio or visual component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within the data, such as a person singing or a car driving.
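For a sense of how such a retrieval task can be scored, here is a small sketch that ranks candidate clips of the missing modality by cosine similarity of their learned representations and reports recall@k, i.e., how often the true partner appears among the top k results. The metric choice and variable names are assumptions for illustration; the article only specifies that one modality is used to query for the other.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k=10):
    """query_emb[i] and gallery_emb[i] represent the same clip in different
    modalities (e.g., audio queries retrieving their paired video clips)."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                               # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices            # best k candidates per query
    truth = torch.arange(q.size(0), device=topk.device).unsqueeze(-1)
    hits = (topk == truth).any(dim=-1)             # was the true partner in the top k?
    return hits.float().mean().item()

# Example: 100 audio queries against 100 candidate video representations.
print(recall_at_k(torch.randn(100, 256), torch.randn(100, 256), k=10))
```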
In general, they found contrastive learning and masked data modeling to be complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pretraining) by about 2 percent on event classification performance against models with comparable computation and, most impressively, kept pace with or exceeded models with industry-level computational resources. The team’s model also ranked similarly to models trained with only the contrastive loss. And, surprisingly, the team says, incorporating multimodal data into CAV-MAE pretraining greatly improves the fine-tuning of single-modality representations via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This demonstrates that, like humans, multimodal information provides an additional “soft label” boost even for audio-only or visual-only tasks; for instance, it helps the model understand whether it is looking for an electric or acoustic guitar, a richer supervision signal.
“I think people like the elegance of this model for combining information in the different audio and visual streams. It has contrastive and reconstruction loss, and compared to models that have been tested with similar data, it clearly does very well on a variety of these tasks,” says Glass.
Building on this, “one special thing is that our model can do both classification and retrieval, which is not common,” Gong adds. “Before this work, these methods were used separately, but after this work, I see that most audiovisual learning frameworks use contrastive loss and masked autoencoder together, implicitly or explicitly.”
Bringing self-supervised audiovisual learning into our world
The researchers see their contribution of the contrastive audiovisual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single modality to multimodality and which require or take advantage of audiovisual fusion. They hypothesize that it could one day be used for action recognition in fields like sports, education, entertainment, motor vehicles, and public safety. It could also, one day, be extended to other modalities. Right now, the fact that this only applies to audiovisual data may be a limitation, but “we are targeting multimodal learning, which is the trend of machine learning,” says Gong. “As humans, we have multiple modalities, we have smell, touch, many more things than just audiovisual. So when we try to build AI, we try to mimic humans in some way, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities.”
As machine learning models continue to play an increasingly important role in our lives, techniques like this will become increasingly valuable.
This research was supported by the MIT-IBM Watson AI Lab.