Today’s visual generative models, particularly diffusion-based models, have made great strides in automating content creation. Thanks to compute, data scalability, and advances in architectural design, designers can now generate realistic images or videos from a simple text prompt. To achieve unparalleled fidelity and diversity, these methods typically train a powerful text-conditioned diffusion model on massive video-text and image-text datasets. Despite these remarkable advances, a major obstacle remains: the limited controllability of the synthesis process, which severely restricts its usefulness.
Most current approaches enable controllable generation by introducing conditions beyond text, such as segmentation maps, inpainting masks, or sketches. Composer expands on this idea by proposing a new compositional generative paradigm that can synthesize an image under a wide range of input conditions with extraordinary flexibility. While Composer excels at handling multi-level conditions in the spatial dimension, it struggles with video generation due to the unique characteristics of video data. The difficulty stems from the complex temporal structure of videos, which exhibit a wide range of temporal dynamics while requiring consistency across individual frames. Combining appropriate temporal conditions with spatial cues therefore becomes essential for controllable video synthesis.
These considerations inspired researchers at Alibaba Group and Ant Group to develop VideoComposer, which provides improved spatial and temporal control over video synthesis. This is achieved by first decomposing a video into its constituent elements (textual conditions, spatial conditions, and, crucially, temporal conditions) and then training a latent diffusion model to reconstruct the input video conditioned on these elements. In particular, to explicitly capture inter-frame dynamics and provide direct control over motion, the team introduces video-specific motion vectors as a form of temporal guidance during video synthesis.
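To make the idea of a temporal condition concrete, here is a minimal sketch of how per-frame motion maps could be extracted from a clip. Dense optical flow from OpenCV stands in for the compressed-stream motion vectors described in the paper; the function name, frame shapes, and flow parameters below are illustrative assumptions, not part of VideoComposer's released code.

```python
# Hedged sketch: optical flow as a stand-in for VideoComposer-style motion vectors.
import cv2
import numpy as np

def extract_motion_maps(frames):
    """frames: list of HxWx3 uint8 RGB frames -> (T-1, H, W, 2) float32 flow maps."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    maps = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Farneback dense flow: pyr_scale=0.5, levels=3, winsize=15,
        # iterations=3, poly_n=5, poly_sigma=1.2, flags=0 (arbitrary defaults)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        maps.append(flow.astype(np.float32))
    return np.stack(maps)  # one 2-channel (dx, dy) map per frame transition

# Example: 8 dummy 256x256 frames -> 7 motion maps usable as a temporal condition
video = [np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8) for _ in range(8)]
motion = extract_motion_maps(video)
print(motion.shape)  # (7, 256, 256, 2)
```

Each 2-channel map describes how pixels move between consecutive frames, which is the kind of signal the model can use to control internal motion directly.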
In addition, they introduce a unified Spatio-Temporal Condition encoder (STC-encoder) that employs cross-frame attention to capture the spatio-temporal relationships within sequential inputs, resulting in improved cross-frame coherence of the output videos. The STC-encoder also acts as an interface that allows control signals from a wide range of condition sequences to be used in a unified and efficient way. As a result, VideoComposer is flexible enough to compose a video under various combinations of conditions while keeping synthesis quality consistent.
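The sketch below illustrates the general shape of such an encoder: per-frame condition maps are embedded with convolutions, then self-attention is applied across the frame axis so the resulting embedding is coherent over time. The module name, channel sizes, and tensor shapes are assumptions for illustration; this is not the authors' released implementation.

```python
# Hedged sketch of an STC-style encoder with cross-frame (temporal) attention.
import torch
import torch.nn as nn

class STCEncoderSketch(nn.Module):
    def __init__(self, in_channels=2, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Sequential(              # lightweight per-frame embedding
            nn.Conv2d(in_channels, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cond):                     # cond: (B, T, C, H, W)
        b, t, _, _, _ = cond.shape
        x = self.embed(cond.flatten(0, 1))       # (B*T, dim, h', w')
        d, hp, wp = x.shape[1:]
        # attend across the T frames at each spatial location
        x = x.view(b, t, d, hp * wp).permute(0, 3, 1, 2).reshape(b * hp * wp, t, d)
        attn, _ = self.temporal_attn(x, x, x)
        x = self.norm(x + attn)
        return x.reshape(b, hp * wp, t, d).permute(0, 2, 3, 1).reshape(b, t, d, hp, wp)

# Example: motion-map conditions for a batch of 2 clips, 8 frames each
enc = STCEncoderSketch()
out = enc(torch.randn(2, 8, 2, 64, 64))
print(out.shape)  # torch.Size([2, 8, 128, 16, 16])
```

Because every condition sequence (sketches, depth maps, motion maps, and so on) can be mapped through the same kind of encoder, the control signals end up in a shared representation that the diffusion model can consume uniformly.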
Importantly, unlike conventional approaches, the method can steer motion patterns with simple hand-drawn strokes, such as an arrow indicating the moon's trajectory. The researchers carry out extensive qualitative and quantitative experiments that demonstrate the effectiveness of VideoComposer, showing that the method achieves remarkable results across a variety of downstream generative tasks.
Check out the Paper, GitHub, and Project. Don’t forget to join our 23k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions about the article above or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast and has a strong interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advances in technology and their real-life applications.