Text-to-image generation is a challenging task at the intersection of computer vision and natural language processing. Generating high-quality visual content from textual descriptions requires capturing the intricate relationship between language and visual information. If text-to-image generation is already a challenge, text-to-video synthesis is even more complex: on top of the spatial content of each frame, the model must also capture the temporal dependencies between video frames.
A classic approach when dealing with such complex content is to exploit diffusion models. Diffusion models have emerged as a powerful technique to address this problem, harnessing the power of deep neural networks to generate photorealistic images that align with a given textual description or video frames with temporal consistency.
Diffusion models work by iteratively refining the generated content through a sequence of denoising steps, during which the model learns to capture the complex dependencies between the textual and visual domains. These models have shown impressive results in recent years, achieving state-of-the-art performance in text-to-image and text-to-video synthesis.
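To make the iterative refinement idea concrete, here is a minimal sketch of a generic DDPM/DDIM-style reverse (denoising) loop in PyTorch. The `denoiser` network, the conditioning embedding `text_emb`, and the noise schedule `alphas_cumprod` are placeholders for illustration, not Dreamix's actual components.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, text_emb, shape, alphas_cumprod):
    """Generic reverse diffusion loop (illustrative sketch).

    `denoiser(x, t, text_emb)` is assumed to predict the noise present in
    the sample at step t; `alphas_cumprod` is the cumulative product of the
    noise schedule.
    """
    x = torch.randn(shape)  # start from pure Gaussian noise
    T = len(alphas_cumprod)
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        # Predict the noise component and estimate the clean sample x0
        eps = denoiser(x, t, text_emb)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Move one step toward x0 (deterministic DDIM-style update for brevity)
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x
```

Each iteration removes a little noise while re-injecting information from the conditioning signal, which is what lets the model gradually align the output with the text prompt.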
Although these models offer new creative processes, they are mostly limited to creating new images rather than editing existing ones. Some recent approaches have been developed to fill this gap, focusing on preserving particular features of the image, such as facial features, background, or foreground, while editing others.
For video editing, the situation changes. To date, only a few models have addressed this task, and with limited success. The performance of an editing technique can be described along three axes: alignment, fidelity, and quality. Alignment refers to the degree of consistency between the input text prompt and the output video. Fidelity accounts for how well the original input content is preserved (or at least the part of it not referenced in the text prompt). Quality represents the visual definition of the output, such as the presence of fine-grained detail.
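As a rough illustration of how such criteria are often quantified in practice (not necessarily the metrics used by the Dreamix authors), alignment can be approximated with CLIP-style text-frame similarity and fidelity with frame-wise similarity between the edited and source videos. The sketch below assumes precomputed embeddings from any CLIP-like encoder.

```python
import torch
import torch.nn.functional as F

def alignment_score(frame_embs, text_emb):
    """Average cosine similarity between each output frame and the text prompt.

    frame_embs: (N, D) image embeddings of the edited frames.
    text_emb:   (D,)   embedding of the text prompt.
    Higher means the output follows the prompt more closely.
    """
    frame_embs = F.normalize(frame_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (frame_embs @ text_emb).mean().item()

def fidelity_score(edited_embs, source_embs):
    """Average frame-wise cosine similarity between edited and source frames.

    Higher means more of the original content is preserved.
    """
    edited_embs = F.normalize(edited_embs, dim=-1)
    source_embs = F.normalize(source_embs, dim=-1)
    return (edited_embs * source_embs).sum(dim=-1).mean().item()
```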
The most challenging part of this type of video editing is maintaining temporal consistency between frames. Since applying image-level editing methods frame by frame cannot guarantee such consistency, different solutions are needed.
An interesting approach to the video editing task comes from Dreamix, a novel artificial intelligence (AI) framework for text-driven video editing based on video diffusion models.
An overview of Dreamix is shown below.
The core of this method is to enable a text-conditioned video diffusion model (VDM) to maintain high fidelity to the given input video. But how?
First, instead of following the classical approach of feeding pure noise to the model as initialization, the authors use a degraded version of the original video. This version retains little spatio-temporal information and is obtained by downscaling the video and adding noise.
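A minimal sketch of such a corruption step is shown below, assuming a standard forward-diffusion noising formula; the downscaling factor and noise level here are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def degrade_video(frames, alphas_cumprod, t, downscale=8):
    """Build a low-information initialization from the source video.

    frames: (T, C, H, W) tensor of source frames.
    The frames are aggressively downscaled and re-upscaled to discard fine
    spatio-temporal detail, then noised to diffusion timestep `t` so the
    sampler can start from them instead of pure noise.
    """
    _, _, h, w = frames.shape
    low = F.interpolate(frames, scale_factor=1 / downscale,
                        mode="bilinear", align_corners=False)
    coarse = F.interpolate(low, size=(h, w),
                           mode="bilinear", align_corners=False)
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(coarse)
    # Forward-diffusion mixing: keep only coarse structure plus noise
    return a_t.sqrt() * coarse + (1 - a_t).sqrt() * noise
```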
Second, the generative model is fine-tuned on the original video to further improve fidelity.
Fine-tuning ensures that the model can capture the finer details of the high-resolution video. However, if the model is simply fine-tuned on the input video, it may not be able to edit motion, since it will tend to reproduce the original motion instead of following the text prompt.
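For illustration, a naive fine-tuning loop on a single clip might look like the sketch below, where `vdm(x_t, t, text_emb)` is a hypothetical model assumed to predict the added noise. This is exactly the setup whose limitation the next step addresses.

```python
import torch
import torch.nn.functional as F

def finetune_on_video(vdm, frames, text_emb, alphas_cumprod, steps=100, lr=1e-5):
    """Naive fine-tuning of a video diffusion model on one input clip.

    Training only on the full, ordered clip improves fidelity but biases the
    model toward the original motion.
    """
    opt = torch.optim.AdamW(vdm.parameters(), lr=lr)
    video = frames.unsqueeze(0)  # (1, T, C, H, W)
    for _ in range(steps):
        t = torch.randint(0, len(alphas_cumprod), (1,))
        a_t = alphas_cumprod[t].view(1, 1, 1, 1, 1)
        noise = torch.randn_like(video)
        x_t = a_t.sqrt() * video + (1 - a_t).sqrt() * noise  # forward diffusion
        loss = F.mse_loss(vdm(x_t, t, text_emb), noise)      # denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vdm
```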
To address this problem, the authors propose a new approach called mixed fine-tuning. In mixed fine-tuning, the video diffusion model (VDM) is also fine-tuned on individual frames of the input video without regard to their temporal order. This is achieved by masking the temporal attention. Mixed fine-tuning leads to a significant improvement in the quality of motion edits.
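The sketch below shows one way such a mixed objective could look, alternating between the full ordered clip and shuffled individual frames with temporal attention masked; the mixing probability and the `mask_temporal` flag are assumptions for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def mixed_finetune_step(vdm, frames, text_emb, alphas_cumprod, p_frames=0.5):
    """One step of a mixed fine-tuning objective (illustrative sketch).

    With probability `p_frames`, train on shuffled individual frames with
    temporal attention masked (hypothetical `mask_temporal` flag); otherwise
    train on the full ordered clip.
    """
    use_frames = torch.rand(()) < p_frames
    if use_frames:
        perm = torch.randperm(frames.shape[0])  # discard temporal order
        x0 = frames[perm].unsqueeze(0)
        mask_temporal = True
    else:
        x0 = frames.unsqueeze(0)                # full clip, ordered
        mask_temporal = False
    t = torch.randint(0, len(alphas_cumprod), (1,))
    a_t = alphas_cumprod[t].view(1, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise
    pred = vdm(x_t, t, text_emb, mask_temporal=mask_temporal)
    return F.mse_loss(pred, noise)
```

Training on unordered frames forces the model to learn the appearance of the scene without locking onto the original motion, which is why motion edits improve.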
A comparison of the results between Dreamix and state-of-the-art approaches is shown below.
This was a brief overview of Dreamix, a novel AI framework for text-driven video editing.
If you are interested or would like more information on this framework, you can find links to the paper and the project page below.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to join our 16k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory and his research interests include adaptive video streaming, immersive media, machine learning and QoS / QoE evaluation.