We have witnessed the rise of generative AI models in recent months. They went from producing low-resolution, face-like images to high-resolution, photorealistic images remarkably quickly. It is now possible to obtain unique photorealistic images of whatever we describe. Perhaps even more impressive, we can now use diffusion models to generate videos as well.
The key contributor to generative AI is the diffusion model. Diffusion models take a text prompt and generate an output that matches the description. They do this by gradually transforming a set of random numbers into an image or video, adding more detail at each step until the result matches the prompt. These models learn from datasets containing millions of samples, so they can generate new images that resemble what they have seen before. However, the dataset itself can be the key problem.
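To make this concrete, here is a minimal sketch of generating a single image from a text prompt with a pretrained text-to-image diffusion model, using the Hugging Face diffusers library. The model id and settings are illustrative assumptions, not the exact setup used by Text2Video-Zero.

```python
# Minimal sketch: text-to-image generation with a pretrained diffusion model.
# The model id and parameters below are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photorealistic portrait of an astronaut riding a horse"
# The pipeline starts from random Gaussian noise in latent space and
# iteratively denoises it, guided by the text prompt.
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("astronaut.png")
```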
Training a diffusion model for video generation from scratch is almost never feasible. It requires extremely large datasets and hardware to match. Building such datasets is possible for only a handful of institutions around the world, as accessing and collecting data at that scale is out of reach for most due to cost. The rest of us have to work with existing models and adapt them to our use case.
Even if you somehow manage to prepare a text-video dataset with millions, if not billions, of pairs, you still need the hardware to power such large-scale models. The high cost of video diffusion models therefore makes it difficult for many users to customize these technologies for their own needs.
What if there were a way to circumvent this requirement? Could we somehow reduce the cost of training video diffusion models? Time to meet Text2Video-Zero.
Text2Video-Zero is a zero-shot text-to-video generative model, which means it requires no training to be customized. It takes pretrained text-to-image models and converts them into a temporally consistent video generation model. In the end, a video is just a sequence of images shown in quick succession to simulate motion, so generating those images one after another and stringing them together sounds like a simple solution.
However, we cannot simply run an image generation model hundreds of times and stitch the results together at the end. This does not work because there is no way to guarantee that the model will draw the same objects every time. We need a way to enforce temporal consistency in the model.
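For illustration, here is what that naive approach looks like, reusing the `pipe` from the earlier sketch. Because each frame starts from a fresh random latent, the content changes from frame to frame and the result flickers instead of forming a coherent video.

```python
# A naive (and flawed) baseline, sketched for illustration: generate each frame
# independently with the same prompt. Every call starts from a fresh random
# latent, so objects and backgrounds differ between frames.
frames = []
for i in range(16):  # 16 frames, an arbitrary choice for this sketch
    frame = pipe(prompt, num_inference_steps=50).images[0]
    frames.append(frame)
# Stitching these frames together (e.g., with imageio.mimsave) would not yield
# temporally consistent motion.
```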
To enforce temporal consistency, Text2Video-Zero uses two lightweight modifications.
First, it enriches the latent codes of the generated frames with motion dynamics to keep the global scene and background temporally consistent. Rather than sampling the latent code of each frame independently at random, motion information is added to the latent vectors. However, these latent codes alone do not impose enough constraints to pin down specific colors, shapes, or identities, which leads to temporal inconsistencies, particularly for the foreground object. A second modification is therefore required to address this issue.
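Below is a hedged sketch of this first modification, simplified to a global translation in latent space: each frame's starting latent is derived from the first frame's latent by a small, frame-dependent shift. The function name, warp direction, and step sizes are assumptions for illustration, and the paper's actual scheme is more involved than shifting the initial noise directly.

```python
# Hedged sketch: derive each frame's latent from the first frame's latent
# by applying a frame-dependent global translation ("motion dynamics").
import torch

def latents_with_motion(base_latent, num_frames, dx=2, dy=2):
    """base_latent: (1, C, H, W) Gaussian latent for the first frame."""
    latents = [base_latent]
    for k in range(1, num_frames):
        # Shift the shared latent by k * (dx, dy) positions in latent space,
        # so consecutive frames share global structure but drift smoothly.
        shifted = torch.roll(base_latent, shifts=(k * dy, k * dx), dims=(-2, -1))
        latents.append(shifted)
    return torch.cat(latents, dim=0)  # (num_frames, C, H, W)

base = torch.randn(1, 4, 64, 64)  # a typical Stable Diffusion latent shape
video_latents = latents_with_motion(base, num_frames=8)
```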
The second modification concerns the attention mechanism. To harness the power of cross-frame attention while still exploiting a pretrained diffusion model without retraining, each self-attention layer is replaced with cross-frame attention, with every frame attending to the first frame. This helps Text2Video-Zero preserve the context, appearance, and identity of the foreground object throughout the entire sequence.
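Here is a hedged sketch of that idea: every frame's queries attend to the keys and values computed from the first frame. The shapes and projection layers are placeholders standing in for pretrained weights; in practice one would patch the self-attention blocks of an existing UNet rather than define new layers.

```python
# Hedged sketch: cross-frame attention where all frames attend to frame 1.
import torch

def cross_frame_attention(hidden_states, to_q, to_k, to_v):
    """hidden_states: (num_frames, seq_len, dim) latent features per frame."""
    q = to_q(hidden_states)                      # queries from every frame
    first = hidden_states[:1].expand_as(hidden_states)
    k = to_k(first)                              # keys from the first frame
    v = to_v(first)                              # values from the first frame
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                              # appearance anchored to frame 1

# Toy usage with a random projection standing in for the pretrained weights.
dim = 320
to_q = to_k = to_v = torch.nn.Linear(dim, dim)
features = torch.randn(8, 64, dim)               # 8 frames, 64 tokens each
out = cross_frame_attention(features, to_q, to_k, to_v)
```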
Experiments show that these modifications lead to consistent, high-quality video generation even though the model is not trained on large-scale video data. Moreover, the approach is not limited to text-to-video synthesis; it also applies to specialized and conditional video generation, as well as video editing through textual instructions.
Check out the Paper and GitHub. Don't forget to join our 19k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.