Recently, the computer vision and computer graphics communities have paid considerable attention to text-to-video generation. Creating and controlling video content from text input has many applications in both academia and industry. However, text-to-video generation still faces difficulties, particularly in the face-centric setting, where the quality and relevance of the generated video frames leave room for improvement. One of the main problems is the lack of a face-centric text-video dataset that contains high-quality video samples together with text descriptions of the attributes crucial for facial video generation.
Building a high-quality facial text-video dataset poses difficulties in three areas. 1) Data collection: the quantity and quality of the video samples strongly influence the quality of the generated videos, and it is not easy to acquire a dataset of this size with high-quality samples while preserving a natural distribution and smooth video motion.
2) Data annotation: the text-video pairs must be relevant, which requires the text to fully cover both the static content and the motion in the video, such as lighting conditions and head movements.
3) Text production: producing varied and natural texts is difficult. Manual text writing is expensive and hard to scale, while automatically generated text scales easily but is limited in naturalness.
To overcome these issues, the authors carefully design an end-to-end data construction pipeline comprising data collection and processing, data annotation, and semi-automatic text production. They adopt the collection procedures of CelebV-HQ, which have proven successful for obtaining raw videos, and make a small adjustment at the processing stage to further improve video smoothness.
They then annotate both the temporal dynamics and the static content of each video to ensure highly relevant text-video pairs, building a set of attributes that either vary or stay constant over time. Lastly, they propose a template-based, semi-automatic approach to producing varied and natural text. This strategy combines the benefits of automatic and manual text creation: they design a wide range of grammar templates that can be dynamically combined and adjusted to describe the annotations and manual texts with great diversity, complexity, and naturalness.
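A minimal sketch of how such template-based text production might look in practice is given below; the templates, attribute names, and example values are purely illustrative placeholders, not the grammar actually used for CelebV-Text.

```python
import random

# Illustrative grammar templates; CelebV-Text's actual templates
# (three per attribute type, 18 in total) are richer and more varied.
APPEARANCE_TEMPLATES = [
    "This person has {attrs}.",
    "A face with {attrs} appears in the video.",
    "The subject is characterized by {attrs}.",
]
ACTION_TEMPLATES = [
    "They {actions} over the course of the clip.",
    "Throughout the video, the person {actions}.",
    "The subject {actions}.",
]

def join_phrases(phrases):
    """Join a list of phrases into a natural-sounding enumeration."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]

def generate_description(appearance, actions):
    """Fill randomly chosen templates with annotated attributes."""
    first = random.choice(APPEARANCE_TEMPLATES).format(attrs=join_phrases(appearance))
    second = random.choice(ACTION_TEMPLATES).format(actions=join_phrases(actions))
    return first + " " + second

# Hypothetical annotation for a single clip.
print(generate_description(
    appearance=["wavy hair", "a slight smile"],
    actions=["talks", "turns the head to the left"],
))
```

Because templates are sampled and combined at random, the same annotation can yield many distinct but equally faithful descriptions, which is what gives the semi-automatic approach its diversity.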
Using the proposed pipeline, researchers from the University of Sydney, SenseTime Research, NTU, and the Shanghai AI Lab build CelebV-Text, a large-scale facial text-video dataset consisting of 70,000 in-the-wild video clips, each with a resolution of at least 512×512, paired with 1,400,000 text descriptions. CelebV-Text combines high-quality video samples with text descriptions to enable realistic facial video generation, as shown in Figure 1. Each video is annotated with three categories of static attributes (40 general appearances, 5 detailed appearances, and 6 lighting conditions) and three categories of dynamic attributes (37 actions, 8 emotions, and 6 light directions). All dynamic attributes are densely annotated with start and end timestamps, while manual texts are provided for labels that cannot be discretized.
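To make that annotation layout concrete, here is one plausible way to represent a per-clip record with static attributes, timestamped dynamic attributes, and text descriptions; the field names and example values are assumptions for illustration, not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DynamicAttribute:
    """A time-varying label annotated with start/end timestamps (seconds)."""
    name: str      # e.g. "talking" or "smiling"
    start: float
    end: float

@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record combining static and dynamic labels."""
    clip_id: str
    static_attributes: List[str]                 # appearance and lighting labels
    dynamic_attributes: List[DynamicAttribute]   # actions, emotions, light directions
    descriptions: List[str]                      # auto-generated and manual texts

ann = ClipAnnotation(
    clip_id="clip_000001",
    static_attributes=["wavy hair", "eyeglasses", "soft lighting"],
    dynamic_attributes=[
        DynamicAttribute("talking", 0.0, 4.2),
        DynamicAttribute("smiling", 1.5, 3.0),
    ],
    descriptions=["A person with wavy hair and eyeglasses talks and then smiles."],
)
```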
They also create three templates for each attribute type, yielding 18 templates that can be combined in various ways, so that the generated texts naturally describe all annotated attributes and manual labels. Existing facial video datasets cannot compete with CelebV-Text: it offers higher video resolution (more than 2×), a larger sample size, and a more varied distribution. Compared with existing text-video datasets, the CelebV-Text sentences also show greater diversity, richness, and naturalness. Text-video retrieval experiments on CelebV-Text confirm that its text-video pairs are highly relevant.
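Such a retrieval check can be illustrated with a simple Recall@K computation over paired embeddings; the dual-encoder embeddings below are random placeholders, and this is only a sketch of the general protocol, not the paper's exact evaluation setup.

```python
import numpy as np

def recall_at_k(text_embs, video_embs, k=5):
    """Text-to-video Recall@K, assuming text_embs[i] matches video_embs[i].
    Embeddings stand in for the output of a text/video dual encoder."""
    # L2-normalize so the dot product equals cosine similarity.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = t @ v.T                        # (num_texts, num_videos)
    ranks = np.argsort(-sims, axis=1)     # best-matching videos first
    hits = (ranks[:, :k] == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# Random placeholder embeddings for 100 paired text/video clips.
texts = np.random.randn(100, 512)
videos = np.random.randn(100, 512)
print(f"Recall@5: {recall_at_k(texts, videos, k=5):.3f}")
```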
They further evaluate CelebV-Text with a representative facial text-to-video generation baseline to investigate its efficacy and potential. Compared to a state-of-the-art large-scale pretrained model, their results show higher relevance between the generated facial videos and the input texts. They also show how a simple modification, interpolating text embeddings, can greatly improve temporal consistency. To standardize the facial text-to-video generation task, they provide a new generation benchmark comprising representative models across three text-video datasets.
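The text-interpolation idea can be sketched as linearly blending the embeddings of consecutive text conditions so that neighboring frames receive smoothly varying conditioning signals; the encoder output is replaced here with random placeholder vectors, and this is an illustration of the general technique rather than the authors' implementation.

```python
import numpy as np

def interpolate_text_embeddings(emb_a, emb_b, num_frames):
    """Linearly interpolate between two text embeddings across frames,
    so consecutive frames are conditioned on smoothly varying vectors
    rather than an abrupt switch from one condition to the next."""
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * emb_a[None, :] + weights * emb_b[None, :]

# Placeholder vectors standing in for the output of a text encoder.
emb_start = np.random.randn(512)   # e.g. embedding of "a neutral face"
emb_end = np.random.randn(512)     # e.g. embedding of "a smiling face"
per_frame_conditions = interpolate_text_embeddings(emb_start, emb_end, num_frames=16)
print(per_frame_conditions.shape)  # (16, 512)
```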
The key contributions of this work can be summarized as follows: 1) They propose CelebV-Text, the first large-scale facial text-video dataset with high-quality videos and rich, highly relevant texts, to facilitate research on facial text-to-video generation. 2) Extensive statistical analyses of video/text quality, diversity, and text-video relevance demonstrate CelebV-Text's superiority. 3) Extensive self-evaluations are conducted to show the effectiveness and potential of CelebV-Text. 4) A new benchmark is constructed to promote the standardization of the facial text-to-video generation task.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.