Video generation has improved with models like Sora, which uses the Diffusion Transformer (DiT) architecture. While text-to-video (T2V) models have advanced, they often struggle to produce clear and temporally consistent videos without additional references. Text-image-to-video (TI2V) models address this limitation by using an initial image frame as a basis to improve fidelity. Reaching Sora-level performance remains difficult, however: integrating image-based inputs into the model effectively is hard, and higher-quality datasets are needed to improve model output.
Current methods have explored integrating image conditions into U-Net architectures, but applying these techniques to DiT models remained unresolved. While diffusion-based approaches came to dominate text-to-video generation through latent diffusion models, model scaling, and the shift to transformer-based architectures, many studies focused on isolated aspects, overlooking their combined impact on performance. Techniques such as cross-attention in PixArt-α, self-attention in SD3, and stability tricks like QK-norm showed some improvements but became less effective as models grew. Despite these advances, no unified model had successfully combined T2V and TI2V capabilities, limiting progress toward more efficient and versatile video generation.
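As a concrete illustration of one such stability trick, QK-norm simply normalizes the query and key projections before the attention dot product so that attention logits stay bounded as the model scales. Below is a minimal PyTorch-style sketch; the layer names, norm choice, and head configuration are assumptions for illustration, not the exact design used in SD3 or in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with per-head normalization of queries and keys (QK-norm).

    Illustrative sketch only; the norm type and head setup are assumptions,
    not the exact layers used in STIV or SD3.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Norms applied to queries and keys before the attention dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                       # each: (B, N, H, Dh)
        q, k = self.q_norm(q), self.k_norm(k)             # QK-norm keeps attention logits bounded
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, H, N, Dh)
        out = F.scaled_dot_product_attention(q, k, v)     # standard scaled dot-product attention
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```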
To address this, researchers from Apple and the University of California developed a comprehensive framework that systematically examines the interplay between model architectures, training methods, and data curation strategies. The result, STIV, is a simple and scalable image- and text-conditioned video generation method. It incorporates the image condition into a Diffusion Transformer (DiT) through frame replacement and applies text conditioning through joint image-text classifier-free guidance. This design allows STIV to perform text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. In addition, STIV can be easily extended to applications such as video prediction, frame interpolation, multi-view generation, and long video generation.
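Frame replacement and joint classifier-free guidance are straightforward to express in code: the noisy latent of the first frame is swapped for the clean latent of the conditioning image, and at sampling time the model is queried with and without both conditions to form a single guided estimate. The sketch below is a simplified illustration under assumed tensor shapes and a hypothetical `dit` callable; it is not the authors' implementation.

```python
import torch

def apply_frame_replacement(noisy_latents: torch.Tensor,
                            image_latent: torch.Tensor | None) -> torch.Tensor:
    """Replace the first-frame noisy latent with the clean conditioning-image latent.

    noisy_latents: (B, T, C, H, W) latent video corrupted by the diffusion/flow noise.
    image_latent:  (B, C, H, W) latent of the conditioning frame, or None for pure T2V.
    """
    if image_latent is None:
        return noisy_latents
    out = noisy_latents.clone()
    out[:, 0] = image_latent        # the image condition enters the DiT through frame 0
    return out

@torch.no_grad()
def jit_cfg_step(dit, latents, t, text_emb, null_text_emb, image_latent, scale=7.5):
    """Joint image-text classifier-free guidance with a single shared scale.

    `dit` is a hypothetical callable dit(latents, t, text_emb) -> model prediction.
    The conditional branch sees both the text embedding and the replaced first frame;
    the unconditional branch sees neither condition.
    """
    cond_in = apply_frame_replacement(latents, image_latent)
    cond = dit(cond_in, t, text_emb)          # text + image conditioned prediction
    uncond = dit(latents, t, null_text_emb)   # fully unconditional prediction
    return uncond + scale * (cond - uncond)   # guided estimate used by the sampler
```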
The researchers detailed the process of setting up, training, and evaluating text-to-video (T2V) and text-to-image (T2I) models. The models used the AdaFactor optimizer with a specific learning rate and gradient clipping and were trained for 400k steps. Data preparation relied on a video data engine that analyzed video frames, performed scene segmentation, and extracted features such as motion and clarity scores; training used curated datasets, including more than 90 million high-quality video-caption pairs. Key evaluation metrics, covering temporal quality, semantic alignment, and image alignment, were measured with VBench, VBench-I2V, and MSR-VTT. The study also explored ablations of different architectural designs and training strategies, including Flow Matching, CFG renormalization, and the AdaFactor optimizer. Experiments on model initialization showed that joint initialization from lower- and higher-resolution models improved performance. Additionally, using more frames during training improved metrics, particularly motion smoothness and dynamic degree.
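To make the training recipe concrete, here is a minimal sketch of a flow-matching training step paired with an AdaFactor-style optimizer and gradient clipping; the interpolation path, loss weighting, learning rate, and clipping threshold below are placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F
# AdaFactor is available in several libraries; transformers' Adafactor is used here as a stand-in.
from transformers import Adafactor

def flow_matching_loss(model, x1, text_emb):
    """One flow-matching training step on clean video latents x1 of shape (B, T, C, H, W).

    Uses a linear path x_t = (1 - t) * x0 + t * x1 with Gaussian noise x0;
    the model regresses the constant velocity (x1 - x0). Shapes and the exact
    path/weighting are illustrative assumptions, not the paper's exact recipe.
    """
    x0 = torch.randn_like(x1)                               # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1, 1)
    xt = (1.0 - t) * x0 + t * x1                            # point on the straight path
    target_v = x1 - x0                                      # ground-truth velocity
    pred_v = model(xt, t.flatten(), text_emb)               # DiT predicts the velocity
    return F.mse_loss(pred_v, target_v)

# Illustrative optimizer setup: a fixed learning rate with gradient clipping,
# mirroring the described AdaFactor-plus-clipping configuration (values are placeholders).
# optimizer = Adafactor(model.parameters(), lr=1e-4, relative_step=False, scale_parameter=False)
# for step in range(400_000):
#     loss = flow_matching_loss(model, batch_latents, batch_text_emb)
#     loss.backward()
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#     optimizer.step(); optimizer.zero_grad()
```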
Both the T2V and STIV models improved significantly when scaled from 600M to 8.7B parameters. For T2V, the VBench-Semantic score increased from 72.5 to 74.8 with larger model sizes and improved to 77.0 when the resolution was raised from 256 to 512. Fine-tuning with high-quality data lifted the VBench-Quality score from 82.2 to 83.9, and the best model achieved a VBench-Semantic score of 79.5. Similarly, the STIV model showed progress, with the STIV-M-512 model reaching a VBench-I2V score of 90.1. In video prediction, the STIV-V2V model outperformed T2V with an FVD score of 183.7 versus 536.2. The STIV-TUP model delivered strong results in frame interpolation, with FID scores of 2.0 and 5.9 on the MSR-VTT and MovieGen datasets. In multi-view generation, the proposed STIV model maintained 3D coherence and achieved performance comparable to Zero123++, with a PSNR of 21.64 and an LPIPS of 0.156. In long video generation, it produced 380 frames, demonstrating its potential for further advances.
In conclusion, the proposed framework provided a scalable and flexible solution for video generation by integrating text and image conditioning within a unified model. It demonstrated strong performance on public benchmarks and adaptability across applications, including controllable video generation, video prediction, frame interpolation, long video generation, and multi-view generation. This approach highlights its potential to support future advances in video generation and to contribute to the broader research community.
Check out the Paper. All credit for this research goes to the researchers of this project.
Divyesh is a Consulting Intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering from the Indian Institute of Technology, Kharagpur. He is a data science and machine learning enthusiast who wants to integrate these leading technologies into agriculture and solve its challenges.