Talking face generation makes it possible to create realistic video portraits of a target person that match the content of a speech signal. Because it supplies the visual presence of the person in addition to the voice, it holds great promise for applications such as virtual avatars, online conferencing, and animated movies. The most common approaches to audio-driven talking face generation use a two-stage framework: first, an intermediate representation (e.g., 2D landmarks or blendshape coefficients of 3D face models) is predicted from the input audio; a renderer then synthesizes the video portrait from the predicted representation. Along this path, great progress has been made toward improving the overall realism of video portraits by producing natural head movements, increasing lip-sync quality, adding emotional expression, and so on.
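To make the pipeline concrete, below is a minimal sketch of such a two-stage setup in PyTorch. The module names (AudioToExpression, NeuralRenderer), layer choices, and feature dimensions are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of the generic two-stage pipeline described above, not the
# authors' code. All sizes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Stage 1: map an audio feature sequence to an intermediate representation
    (e.g. 3DMM expression coefficients or 2D landmarks)."""
    def __init__(self, audio_dim=80, expr_dim=64, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, expr_dim)

    def forward(self, audio_feats):          # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)     # (B, T, hidden)
        return self.head(h)                  # (B, T, expr_dim)

class NeuralRenderer(nn.Module):
    """Stage 2: render video frames conditioned on the predicted representation.
    A toy decoder producing low-resolution RGB frames stands in for the renderer."""
    def __init__(self, expr_dim=64, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.decoder = nn.Sequential(
            nn.Linear(expr_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * img_size * img_size), nn.Tanh(),
        )

    def forward(self, expr):                 # (B, T, expr_dim)
        frames = self.decoder(expr)
        B, T, _ = expr.shape
        return frames.view(B, T, 3, self.img_size, self.img_size)

if __name__ == "__main__":
    audio = torch.randn(1, 25, 80)           # one clip, 25 frames of mel features
    expr = AudioToExpression()(audio)
    video = NeuralRenderer()(expr)
    print(expr.shape, video.shape)           # (1, 25, 64) and (1, 25, 3, 64, 64)
```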
However, it should be noted that talking face generation is inherently a one-to-many mapping problem: given an input audio clip, there are several plausible visual appearances of the target person due to variation in phonetic context, mood, lighting conditions, and other factors. The approaches mentioned above, by contrast, are biased toward learning a deterministic mapping from the given audio to a video. The ambiguity this introduces during training makes it harder to produce realistic visual results with a deterministic mapping. The two-stage framework, which splits the task into two sub-problems (i.e., an audio-to-expression problem and a neural rendering problem), partially eases this one-to-many mapping. Although effective, each of the two stages is still asked to predict information that is missing from its input, which makes the prediction difficult. For example, the audio-to-expression model learns to create an expression that semantically matches the input audio, yet the audio lacks high-level semantics such as personal habits and attitudes. Similarly, the neural rendering model synthesizes the visual appearance from the predicted expression, which has already lost pixel-level details such as wrinkles and shadows. This work proposes MemFace, which complements the missing information with memories, namely an implicit memory and an explicit memory that follow the sense of the two stages respectively, to further ease the one-to-many mapping problem.
More precisely, the explicit memory is constructed in a non-parametric, person-specific way for each target individual to complement pixel-level visual details, whereas the implicit memory is jointly optimized with the audio-to-expression model to complete the semantically aligned information. The audio-to-expression model therefore uses the extracted audio feature as a query to attend over the implicit memory, rather than predicting the expression directly from the input audio. The attention output, which serves as the semantically aligned information, is then combined with the audio feature to produce the expression output. Because the model is trained end to end, the implicit memory is encouraged to associate high-level semantics in the shared space of audio and expression, narrowing the semantic gap between the input audio and the output expression.
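The sketch below shows one way such a memory-augmented audio-to-expression stage could look, assuming a learnable slot memory queried with the audio feature via dot-product attention and fused with it by concatenation; these names and sizes are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of an implicit memory queried by the audio feature.
import torch
import torch.nn as nn

class ImplicitMemory(nn.Module):
    def __init__(self, feat_dim=256, num_slots=1000):
        super().__init__()
        # Learnable keys/values, jointly optimized with the rest of the model.
        self.keys = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)

    def forward(self, query):                         # (B, T, feat_dim)
        scores = query @ self.keys.t() / query.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)          # attend over memory slots
        return attn @ self.values                     # (B, T, feat_dim)

class MemoryAugmentedAudioToExpression(nn.Module):
    def __init__(self, audio_dim=80, feat_dim=256, expr_dim=64):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, feat_dim, batch_first=True)
        self.memory = ImplicitMemory(feat_dim)
        # Retrieved (semantically aligned) feature is fused with the audio feature.
        self.expr_head = nn.Linear(2 * feat_dim, expr_dim)

    def forward(self, audio_feats):                   # (B, T, audio_dim)
        q, _ = self.audio_encoder(audio_feats)        # audio feature used as the query
        retrieved = self.memory(q)                    # attention output from memory
        return self.expr_head(torch.cat([q, retrieved], dim=-1))

if __name__ == "__main__":
    model = MemoryAugmentedAudioToExpression()
    out = model(torch.randn(2, 25, 80))
    print(out.shape)                                  # (2, 25, 64) expression coefficients
```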
After the expression is obtained, the neural rendering model synthesizes the visual appearance conditioned on the mouth shapes derived from the predicted expression. To complement the missing pixel-level information, the explicit memory is first built for each target person, using the vertices of the 3D face model as keys and their corresponding image patches as values. For each input, the associated vertices are used as a query to find similar keys in the explicit memory, and the corresponding image patches are returned to the neural rendering model as pixel-level information.
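The following sketch shows how such a non-parametric, per-person explicit memory could be organized, assuming flattened vertex coordinates as keys, image patches as values, and Euclidean nearest-neighbor retrieval; these specifics are illustrative assumptions rather than the authors' exact design.

```python
# Illustrative sketch of a non-parametric explicit memory for one target person.
import torch

class ExplicitMemory:
    def __init__(self, vertex_keys, patch_values):
        # vertex_keys:  (N, Dv)       flattened vertex coordinates per training frame
        # patch_values: (N, C, H, W)  image patches cropped from the same frames
        self.keys = vertex_keys
        self.values = patch_values

    def retrieve(self, query_vertices, topk=1):
        # query_vertices: (B, Dv) vertices driven by the predicted expression
        dists = torch.cdist(query_vertices, self.keys)        # (B, N) pairwise distances
        idx = dists.topk(topk, largest=False).indices         # nearest keys
        return self.values[idx]                               # (B, topk, C, H, W)

if __name__ == "__main__":
    memory = ExplicitMemory(torch.randn(500, 300), torch.randn(500, 3, 32, 32))
    patches = memory.retrieve(torch.randn(4, 300))
    # Retrieved patches would be fed to the renderer as pixel-level cues.
    print(patches.shape)                                       # (4, 1, 3, 32, 32)
```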
Intuitively, the explicit memory eases the generation process by allowing the model to selectively retrieve the pixel-level details required by the expression instead of generating them. Extensive experiments on several widely used datasets (such as Obama and HDTF) show that the proposed MemFace delivers state-of-the-art lip-sync and rendering quality, consistently and considerably outperforming all baseline approaches in various settings. For example, MemFace improves the subjective score on the Obama dataset by 37.52% relative to the baseline. Generated samples can be found on the project website.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.