The field of generative models has recently seen a surge of interest in visual synthesis. Prior work can already generate high-quality images, but videos pose additional practical difficulties because of their length: the average feature film runs more than 90 minutes, a typical cartoon episode runs about 30 minutes, and even the ideal length for a TikTok-style video is 21 to 34 seconds.
The Microsoft research team has developed a novel architecture for generating long videos. Most existing works generate long videos segment by segment in a sequential manner, which typically leads to a gap between training on short videos and inferring long ones; sequential generation is also inefficient. Instead, this method uses a coarse-to-fine process in which frames at the same granularity are generated in parallel: a global diffusion model first produces keyframes spanning the whole time range, and local diffusion models then recursively fill in the content between adjacent frames. This simple but effective approach allows direct training on long videos, bridging the training-inference gap, and lets all segments be generated in parallel.
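The coarse-to-fine recursion can be pictured with a minimal Python sketch. The callables global_diffusion and local_diffusion below are hypothetical stand-ins for the paper's global and local diffusion models, and the recursion depth is an assumed parameter; this is an illustration of the structure, not the authors' implementation.

```python
# Minimal sketch of the "diffusion over diffusion" coarse-to-fine process.
# `global_diffusion` and `local_diffusion` are hypothetical callables standing
# in for the paper's global/local diffusion models.

def generate_long_video(prompts, global_diffusion, local_diffusion, depth):
    # Global diffusion: produce sparse keyframes covering the full time range.
    keyframes = global_diffusion(prompts)  # one keyframe per prompt

    def fill(first, last, prompt, level):
        # Local diffusion: complete the frames between two adjacent frames,
        # conditioned on the prompt plus the first and last frames.
        middle = local_diffusion(prompt, first_frame=first, last_frame=last)
        segment = [first] + middle + [last]
        if level == 0:
            return segment
        # Recurse: each adjacent pair at this level is refined again at the
        # next, finer level; all segments could be filled in parallel.
        out = []
        for a, b in zip(segment[:-1], segment[1:]):
            out.extend(fill(a, b, prompt, level - 1)[:-1])
        out.append(segment[-1])
        return out

    video = []
    for i, (a, b) in enumerate(zip(keyframes[:-1], keyframes[1:])):
        frames = fill(a, b, prompts[i], depth)
        video.extend(frames[:-1])
    video.append(keyframes[-1])
    return video
```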
The most important contributions are the following:
- The research team proposes NUWA-XL, a “Diffusion over Diffusion” architecture, by viewing long video generation as a novel “coarse-to-fine” process.
- NUWA-XL is the first model trained directly on long videos (3376 frames), bridging the training-inference gap in long video generation.
- NUWA-XL enables parallel inference, which drastically reduces the time needed to generate long videos. When generating 1024 frames, NUWA-XL speeds up inference by 94.26%.
- To validate the effectiveness of the model and to provide a benchmark for long video generation, the research team built a new dataset called FlintstonesHD.
Methods
Temporal KLVAE (T-KLVAE)
KLVAE encodes an input image into a low-dimensional latent representation before applying the diffusion process, avoiding the computational burden of training and sampling diffusion models directly on pixels. To transfer knowledge from the pretrained image KLVAE to videos, the researchers propose Temporal KLVAE (T-KLVAE), which augments the original spatial modules with additional temporal convolution and attention layers.
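A minimal PyTorch sketch of the underlying idea follows: a temporal 1-D convolution is appended after a pretrained spatial block and initialized as an identity mapping, so the video VAE initially behaves exactly like the image KLVAE. The class and tensor layout are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Hypothetical sketch of the T-KLVAE idea: a temporal convolution is
    appended after a pretrained spatial block and initialized as an identity
    mapping, so the video VAE starts out reproducing the image KLVAE."""

    def __init__(self, spatial_block: nn.Module, channels: int, kernel_size: int = 3):
        super().__init__()
        # Pretrained spatial layer, assumed to preserve channel and spatial dims.
        self.spatial_block = spatial_block
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2)
        # Identity initialization: only the center tap of each channel is 1.
        nn.init.zeros_(self.temporal_conv.weight)
        nn.init.zeros_(self.temporal_conv.bias)
        with torch.no_grad():
            for c in range(channels):
                self.temporal_conv.weight[c, c, kernel_size // 2] = 1.0

    def forward(self, x):
        # x: (batch, time, channels, height, width) -- assumed layout
        b, t, c, h, w = x.shape
        y = self.spatial_block(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Mix information across time at each spatial location.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal_conv(y)
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
```

A temporal attention layer can be added analogously, also initialized so it contributes nothing at the start of fine-tuning.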
Masked Temporal Diffusion (MTD)
The researchers present Masked Temporal Diffusion (MTD) as the basic diffusion model of the proposed diffusion-over-diffusion architecture. The global diffusion model forms the coarse storyline of the video from L prompts alone, while the local diffusion models additionally take the first and last frames as inputs. MTD supports both global and local diffusion because it can accept input conditions with or without the first and last frames. The researchers first present the full MTD pipeline and then use an UpBlock as an example to illustrate how the different input conditions are fused.
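One way to realize "with or without first and last frames" is a masking scheme: known frames are copied into a conditioning tensor and flagged in a binary mask, while unknown positions stay zero. The sketch below is a hypothetical illustration of that scheme under an assumed (batch, time, channels, height, width) layout; it is not the authors' implementation.

```python
import torch

def build_mtd_condition(noisy_latents, first_frame=None, last_frame=None):
    """Hypothetical sketch of masked conditioning for MTD: known frames are
    placed in a conditioning tensor and marked in a binary mask, so the same
    diffusion model handles the global case (no frames given) and the local
    case (first and last frames given)."""
    b, t, c, h, w = noisy_latents.shape
    condition = torch.zeros_like(noisy_latents)
    mask = torch.zeros(b, t, 1, h, w, device=noisy_latents.device)

    if first_frame is not None:          # local diffusion: first frame known
        condition[:, 0] = first_frame
        mask[:, 0] = 1.0
    if last_frame is not None:           # local diffusion: last frame known
        condition[:, -1] = last_frame
        mask[:, -1] = 1.0
    # Global diffusion leaves both unset, so the mask stays all zeros.

    # The diffusion network then consumes the concatenated input.
    return torch.cat([noisy_latents, condition, mask], dim=2)
```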
Some limitations remain, even though the proposed NUWA-XL improves the quality of long video generation and accelerates inference. First, the researchers validate NUWA-XL only on the publicly available cartoon The Flintstones, because long open-domain videos (such as movies and TV episodes) are not yet available; with preliminary progress on building an open-domain long video dataset, they hope to eventually extend NUWA-XL to the open domain. Second, although direct training on long videos bridges the gap between training and inference, it poses a significant data challenge. Finally, while NUWA-XL speeds up inference, the gain requires a powerful GPU to support parallel inference.
In summary, the researchers propose NUWA-XL, a “Diffusion over Diffusion” architecture, by framing long video generation as a novel “coarse-to-fine” process. NUWA-XL is the first model trained directly on long videos (3376 frames), bridging the gap between training and inference in long video generation. NUWA-XL supports parallel inference, which speeds up long video generation by 94.26% when producing 1024 frames. To further validate the model and provide a benchmark for long video generation, they build FlintstonesHD, a new dataset.