Understanding how people feel when they interact with content is a crucial capability. Emotions are internal mental states tied to thinking and feeling; because they cannot be observed directly, machines must infer them from outward cues such as speech, gestures, facial expressions, and tone of voice.
Emotion Recognition in Conversations (ERC) aims to identify the emotions expressed in a conversation by analyzing textual, visual, and auditory information. ERC has quickly become an important tool for analyzing and moderating multimedia content, with applications in AI-driven interviews, one-on-one chat interfaces, user sentiment analysis, and contextualizing material on social media platforms such as YouTube, Facebook, and Twitter.
Many state-of-the-art methods for performing robust ERC rely on text-based processing, which ignores the vast amounts of information available from the auditory and visual channels.
Sony Research India’s media analytics group believes that the performance and robustness of existing systems can be improved by merging the three modalities present in ERC data: text, visual, and audio. Their ERC system takes emotional expressions in all three modalities as input and predicts the corresponding emotion for each utterance.
Their new study presents a Multi-Modal Fusion Network (M2FNet) that uses a novel multi-headed fusion attention layer to take full advantage of the inherent diversity of the media. The fusion layer maps audio and visual features into the latent space of the textual features, producing rich, emotion-relevant representations. Using all three modalities improves accuracy, and the proposed fusion mechanism increases it further.
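To make the fusion idea concrete, here is a minimal sketch of such a multi-headed fusion attention block in PyTorch, assuming text features act as queries and the audio/visual features are projected into the text latent space. The module name, dimensions, and residual combination are illustrative assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Minimal sketch of a multi-headed fusion attention block (hypothetical
    names, not the exact M2FNet implementation). Text features serve as
    queries; audio and visual features are projected into the text latent
    space and serve as keys/values."""
    def __init__(self, text_dim=512, audio_dim=256, visual_dim=256, heads=8):
        super().__init__()
        # Project audio and visual features into the text latent space.
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.audio_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text, audio, visual):
        # text:   (batch, num_utterances, text_dim)
        # audio:  (batch, num_utterances, audio_dim)
        # visual: (batch, num_utterances, visual_dim)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        # Text queries attend over each non-text modality separately.
        fused_a, _ = self.audio_attn(text, a, a)
        fused_v, _ = self.visual_attn(text, v, v)
        # Residual combination of the attended modalities with the text stream.
        return self.norm(text + fused_a + fused_v)
```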
The architecture has two key phases:
- At the utterance level, features are extracted from each individual utterance (intra-speaker) and from each modality separately.
- At the dialogue level, inter-speaker features are extracted and the contextual information spanning the conversation is captured.
Once the relationships between the modalities have been captured, the final emotion labels are predicted.
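As a rough illustration of this two-phase design, the sketch below (with hypothetical module names, not the released M2FNet code) assumes per-utterance feature vectors have already been extracted and fused, then passes them through a dialogue-level Transformer encoder that captures inter-speaker context before classifying each utterance.

```python
import torch
import torch.nn as nn

class TwoStageERC(nn.Module):
    """Illustrative sketch of the two-phase design: utterance-level features
    go in, a dialogue-level encoder adds conversational context, and a
    classifier predicts an emotion per utterance."""
    def __init__(self, utt_dim=512, num_emotions=7, heads=8, layers=2):
        super().__init__()
        # Phase 2: dialogue-level encoder captures inter-speaker context
        # across the whole conversation.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=utt_dim, nhead=heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(utt_dim, num_emotions)

    def forward(self, utterance_feats):
        # utterance_feats: (batch, num_utterances, utt_dim), assumed to be
        # the fused per-utterance features from phase 1.
        context = self.context_encoder(utterance_feats)
        return self.classifier(context)  # per-utterance emotion logits
```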
A previous study showed that treating speech as an image of its Mel spectrogram, rather than as a plain sequence of frequency features, improves emotion recognition accuracy. Taking inspiration from this, M2FNet extracts audio features from the Mel spectrogram in much the same way an image model extracts features from a picture. To pull more emotion-related information out of video, M2FNet introduces a dual network that considers not only a person’s facial expressions but also the entire frame, capturing the scene context.
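A minimal sketch of the “speech as image” idea, assuming librosa is available; the sampling rate and Mel parameters here are illustrative, not the paper’s exact settings.

```python
import librosa
import numpy as np

def mel_spectrogram_image(wav_path, n_mels=128):
    """Compute a log-Mel spectrogram and normalize it to [0, 1] so it can be
    treated as a 2-D image and fed to an image CNN (illustrative settings)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Min-max normalize so values behave like pixel intensities.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    return img  # shape: (n_mels, time_frames); resize/stack channels for a CNN
```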
Furthermore, they propose a new model for feature extraction. They develop a new adaptive margin-based triplet loss function that helps the proposed extractor learn accurate, emotion-relevant representations.
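The sketch below illustrates the general idea of an adaptive-margin triplet loss, where the margin is adjusted per triplet instead of being fixed; the exact formulation used in the paper may differ, so treat this only as an illustration of the concept.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor, positive, negative):
    """Illustrative adaptive-margin triplet loss: the margin is largest for
    hard triplets (negative about as similar to the anchor as the positive)
    and shrinks for easy ones. Not the paper's exact formulation."""
    sim_ap = F.cosine_similarity(anchor, positive)   # anchor-positive similarity
    sim_an = F.cosine_similarity(anchor, negative)   # anchor-negative similarity
    # Adaptive margin in [0, 1]: close similarities -> margin near 1.
    margin = 1.0 - (sim_ap - sim_an).clamp(min=0.0)
    d_ap = (anchor - positive).pow(2).sum(dim=1)     # squared distance to positive
    d_an = (anchor - negative).pow(2).sum(dim=1)     # squared distance to negative
    return F.relu(d_ap - d_an + margin).mean()
```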
The team reports that neither facial cues nor scene cues alone are enough to raise accuracy, which demonstrates the importance of scene context in addition to facial expressions for emotion recognition. Their dual network therefore fuses the emotional content of the scene with features of the different people who appear in it. The researchers also note that the performance of state-of-the-art ERC approaches declines on more complicated datasets such as MELD, despite their success on benchmark datasets like IEMOCAP.
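A rough sketch of such a dual visual branch, assuming standard torchvision backbones: one branch encodes cropped faces, the other encodes the full frame for scene context, and the two embeddings are fused. The module names and backbone choices are assumptions for illustration, not the paper’s specification.

```python
import torch
import torch.nn as nn
from torchvision import models

class DualVisualNet(nn.Module):
    """Illustrative dual visual network: face branch + scene branch,
    with the two embeddings concatenated and projected."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.face_backbone = models.resnet18(weights=None)
        self.scene_backbone = models.resnet18(weights=None)
        # Replace the classification heads with projection layers.
        self.face_backbone.fc = nn.Linear(self.face_backbone.fc.in_features, out_dim)
        self.scene_backbone.fc = nn.Linear(self.scene_backbone.fc.in_features, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, face_crop, full_frame):
        # face_crop, full_frame: (batch, 3, 224, 224)
        face_feat = self.face_backbone(face_crop)     # facial expression cues
        scene_feat = self.scene_backbone(full_frame)  # scene context cues
        return self.fuse(torch.cat([face_feat, scene_feat], dim=1))
```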
MELD consists of more than 1,400 dialogues and 13,000 utterances from the television series “Friends.” Each utterance is annotated with one of seven emotion labels (anger, disgust, sadness, joy, surprise, fear, and neutral). The predefined train/validation split is used as is.
IEMOCAP is a conversational database with six emotion labels: happy, sad, neutral, angry, excited, and frustrated. Because it has no predefined validation split, 10% of the training data was randomly selected and used to fit the hyperparameters.
The team compared the proposed network against existing text-based and multimodal ERC techniques to verify its robustness, evaluating on the MELD and IEMOCAP datasets with the weighted average F1 score. The results show that M2FNet outperforms the competition by a significant margin on this metric and indicate that the model effectively exploits multimodal features to improve emotion recognition accuracy.
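For reference, the weighted average F1 score used for these comparisons can be computed with scikit-learn as shown below; the labels and predictions are made-up toy values purely for illustration.

```python
from sklearn.metrics import f1_score

# Toy example of the weighted-average F1 metric (values are illustrative).
y_true = ["joy", "anger", "neutral", "neutral", "sadness", "joy"]
y_pred = ["joy", "neutral", "neutral", "neutral", "sadness", "anger"]

# 'weighted' averages per-class F1 scores, weighting each class by its
# support (number of true instances), which accounts for label imbalance.
print(f1_score(y_true, y_pred, average="weighted"))
```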
Check out the paper. All credit for this research goes to the researchers of this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.