Text-to-video diffusion models have made significant progress in recent times. By simply providing a textual description, users can now create realistic or imaginative videos. These foundation models have also been fine-tuned to generate content matching particular appearances, styles, and subjects. However, motion customization in text-to-video generation remains largely unexplored. Users may want to create videos with specific motions, such as a car moving forward and then turning left. It is therefore important to adapt diffusion models to generate more specific content that meets user preferences.
To address this, the authors propose MotionDirector, which helps foundation models achieve motion customization while maintaining appearance diversity. The technique uses a dual-path architecture that trains the model to learn appearance and motion separately from single or multiple reference videos, making it easier to generalize the customized motion to other settings.
The dual-path architecture comprises a spatial path and a temporal path. The spatial path consists of the foundation model with trainable spatial LoRAs (low-rank adaptations) injected into its spatial transformer layers for each video. These spatial LoRAs are trained on a single randomly sampled frame at each training step to capture the visual attributes of the input videos. The temporal path duplicates the foundation model and shares the spatial LoRAs with the spatial path so that it adapts to the appearance of the given input video. In addition, the temporal transformers in this path are augmented with temporal LoRAs, which are trained on multiple frames of the input videos to capture the underlying motion patterns.
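The core building block is straightforward to illustrate. Below is a minimal PyTorch sketch of a low-rank adaptation (LoRA) wrapper of the kind that could be attached to spatial and temporal transformer projections; the class name, ranks, and dimensions are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adaptation starts as the identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Toy illustration of the dual-path idea: one set of LoRAs sits on spatial layers
# (trained on single randomly sampled frames), another on temporal layers
# (trained on multi-frame clips). The 320-dim projections are placeholders.
spatial_proj = LoRALinear(nn.Linear(320, 320), rank=8)
temporal_proj = LoRALinear(nn.Linear(320, 320), rank=8)

frame_tokens = torch.randn(2, 77, 320)         # (batch, tokens, channels) from one frame
print(spatial_proj(frame_tokens).shape)        # torch.Size([2, 77, 320])
```

Because the low-rank residual is initialized to zero, adding these adapters leaves the pretrained model's behavior unchanged until training updates them, which is what allows appearance and motion to be learned without touching the foundation model's weights.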
By deploying only the trained temporal LoRAs, the foundation model can synthesize videos of the learned motion with diverse appearances. Because the dual-path design learns appearance and motion in separate sets of weights, MotionDirector can isolate the appearance and motion of videos and then recombine them from multiple source videos.
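As a rough illustration of what deploying only the motion component looks like, the snippet below folds trained low-rank factors directly into a layer's weights. The merging shortcut, tensor names, and sizes are assumptions for demonstration; in practice the temporal LoRAs are injected into the video diffusion model's temporal transformers.

```python
import torch
import torch.nn as nn

def merge_lora_into_linear(layer: nn.Linear, down: torch.Tensor, up: torch.Tensor,
                           scale: float = 1.0) -> None:
    """Fold a trained low-rank update (up @ down) into a linear layer's weights in place."""
    with torch.no_grad():
        layer.weight.add_(scale * (up @ down))

# Pretend these factors are temporal LoRAs learned from a reference video; merging
# them into the base model transfers the motion while leaving appearance untouched.
base_layer = nn.Linear(320, 320)
down = torch.randn(8, 320) * 0.01   # rank-8 "down" projection
up = torch.randn(320, 8) * 0.01     # rank-8 "up" projection
merge_lora_into_linear(base_layer, down, up, scale=0.5)
```

Since the appearance and motion updates live in separate LoRA sets, spatial LoRAs from one video and temporal LoRAs from another can in principle be combined in the same way to mix an appearance with a motion.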
The researchers evaluated MotionDirector on two benchmarks covering more than 80 different motions and 600 text prompts. On the UCF Sports Action benchmark (95 videos and 72 text prompts), human raters preferred MotionDirector's outputs about 75% of the time for better motion fidelity, and it outperformed the base models by 25% in human preference. On the second benchmark, LOVEU-TGVE-2023 (76 videos and 532 text prompts), MotionDirector outperformed competing methods based on controllable generation and tuning. The results demonstrate that a variety of foundation models can be customized with MotionDirector to produce videos that combine appearance diversity with the desired motion concepts.
MotionDirector is a promising new method for adapting text-to-video diffusion models to generate videos with specific motions. It excels at learning and adapting specific subject and camera movements, and can be used to generate videos with a wide range of visual styles.
One area where MotionDirector could be improved is learning the motions of multiple subjects in reference videos. Even with this limitation, however, MotionDirector has the potential to increase the flexibility of video generation, allowing users to create videos tailored to their preferences and requirements.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.