LLM-based video generation is an emerging field with a promising growth trajectory. While autoregressive large language models (LLMs) have excelled at generating long, coherent sequences of tokens in natural language processing, their application to video generation has been limited to clips of a few seconds. To address this, researchers have introduced Loong, an LLM-based autoregressive video generator capable of producing minute-long videos.
Training a video generation model like Loong involves a distinctive process. The model is trained from scratch, with text tokens and video tokens treated as a unified sequence. The researchers propose a short-to-long progressive training approach and a loss reweighting scheme to mitigate the loss imbalance problem in long-video training. This allows Loong to be trained on 10-second videos and then extended to generate minute-long videos conditioned on text prompts.
However, generating long videos is considerably more complicated and faces several challenges. First, there is the problem of imbalanced losses during training. Under the next-token-prediction objective, predicting early-frame tokens from text prompts is harder than predicting late-frame tokens from the preceding frames, which produces unequal losses across the sequence. As the video length grows, the accumulated loss of the easy tokens eclipses that of the difficult tokens and dominates the gradient direction. Second, the model predicts the next token from ground-truth tokens during training but relies on its own predictions during inference. This discrepancy causes error accumulation, aggravated by the strong dependencies between frames and the large number of video tokens, and leads to visual quality degradation in long-video inference.
To mitigate the imbalanced losses across video tokens, the researchers propose a short-to-long progressive training strategy with loss reweighting, as outlined below:
Progressive training from short to long
Training is divided into three stages of increasing video duration:
Stage 1: The model is pre-trained on text-to-image generation with a large static-image dataset, establishing a solid foundation for modeling per-frame appearance.
Stage 2: The model is trained jointly on images and short video clips, learning to capture short-term temporal dependencies.
Stage 3: The number of video frames is increased and joint training continues on longer videos.
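To make the loss-reweighting idea concrete, here is a minimal PyTorch sketch that down-weights the easier late-frame tokens relative to the early-frame tokens. The linear decay schedule and the `early_weight` parameter are illustrative assumptions, not the exact weighting used in the paper.

```python
import torch
import torch.nn.functional as F

def reweighted_video_loss(logits, targets, tokens_per_frame, early_weight=2.0):
    """Cross-entropy over video tokens with larger weights on early-frame tokens.

    logits:  (seq_len, vocab_size) next-token predictions for the video part
    targets: (seq_len,) ground-truth video token ids
    tokens_per_frame: number of discrete tokens per latent frame
    early_weight: hypothetical weight for the first frame's tokens, decaying
                  linearly to 1.0 for the last frame (an assumed schedule).
    """
    seq_len = targets.shape[0]
    num_frames = seq_len // tokens_per_frame
    # Per-token cross-entropy, no reduction yet.
    per_token = F.cross_entropy(logits, targets, reduction="none")
    # Per-frame weights that decay from early_weight down to 1.0.
    frame_weights = torch.linspace(early_weight, 1.0, num_frames)
    token_weights = frame_weights.repeat_interleave(tokens_per_frame)
    return (per_token * token_weights).sum() / token_weights.sum()

# Usage: 17 latent frames of 256 tokens each, over an 8,192-way video codebook.
logits = torch.randn(17 * 256, 8192)
targets = torch.randint(0, 8192, (17 * 256,))
print(reweighted_video_loss(logits, targets, tokens_per_frame=256).item())
```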
Loong is designed as a two-component system: a video tokenizer that compresses videos into discrete tokens (together with a decoder that maps tokens back to pixels), and a decoder-only transformer that predicts the next video token conditioned on the text tokens.
Loong's tokenizer uses a causal 3D CNN architecture inspired by MAGViT2. The model operates on low-resolution videos and leaves super-resolution to post-processing. The tokenizer compresses a 10-second video (65 frames at 128×128 resolution) into a sequence of 17×16×16 discrete tokens. Converting video frames into discrete tokens allows text and video tokens to form a unified sequence, so text-to-video generation is modeled as autoregressive prediction of video tokens conditioned on text tokens using a decoder-only Transformer.
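A small Python sketch makes the compression arithmetic explicit, assuming the causal tokenizer encodes the first frame on its own and each subsequent group of four frames into one latent frame, consistent with the figures quoted above.

```python
def video_token_grid(num_frames=65, height=128, width=128,
                     temporal_stride=4, spatial_stride=8):
    """Token-grid size for a causal 3D CNN tokenizer (MAGViT2-style).

    The first frame is encoded on its own; every subsequent group of
    `temporal_stride` frames maps to one latent frame.
    """
    latent_frames = 1 + (num_frames - 1) // temporal_stride   # 1 + 64/4 = 17
    latent_h = height // spatial_stride                       # 128/8  = 16
    latent_w = width // spatial_stride                        # 128/8  = 16
    return latent_frames, latent_h, latent_w

t, h, w = video_token_grid()
print(t, h, w, t * h * w)   # 17 16 16 -> 4352 discrete tokens per 10-second clip
```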
Autoregressive language models can generalize to videos longer than those seen in training, but extending beyond the trained duration risks error accumulation and quality degradation. The researchers address this with several inference-time techniques (see the sketch after this list):
- Video token re-encoding
- Sampling strategy
- Super resolution and refinement
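As one illustration of the sampling-strategy point, the sketch below restricts autoregressive decoding to the top-k video-token candidates, which limits the chance of picking a low-probability token whose error would compound across the strongly dependent tokens of later frames. The specific k, temperature, and decoding details here are assumptions for illustration, not the paper's exact settings.

```python
import torch

def sample_next_video_token(logits, top_k=100, temperature=1.0):
    """Top-k sampling over the logits of the next video token.

    logits: (vocab_size,) unnormalized scores over the video codebook.
    Only the k most likely tokens are kept before sampling (illustrative
    values; the paper describes its own sampling strategy).
    """
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k=top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]

# Usage: logits over an 8,192-entry video-token codebook.
logits = torch.randn(8192)
print(sample_next_video_token(logits, top_k=50).item())
```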
The model uses the LLaMA architecture, with sizes ranging from 700M to 7B parameters. The models are trained from scratch without pre-trained text weights. The vocabulary contains 32,000 text tokens, 8,192 video tokens, and 10 special tokens (40,202 in total). The video tokenizer follows MAGViT2, using a causal 3D CNN structure so that the first video frame is encoded independently of later frames. Spatial dimensions are compressed 8× and the temporal dimension 4×. Clustered vector quantization (CVQ) is used for quantization, which improves codebook utilization over standard VQ. The video tokenizer has 246 million parameters.
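The vocabulary split can be illustrated with a short sketch. The offset layout and the "begin-of-video" special token below are hypothetical choices to show how text and video ids can share one sequence, not the paper's exact index assignment.

```python
# Hypothetical unified-vocabulary layout: text ids first, then video ids,
# then special tokens (the exact ordering is an assumption).
TEXT_VOCAB = 32_000
VIDEO_VOCAB = 8_192
SPECIAL_TOKENS = 10
TOTAL_VOCAB = TEXT_VOCAB + VIDEO_VOCAB + SPECIAL_TOKENS   # 40,202

BOV = TEXT_VOCAB + VIDEO_VOCAB   # hypothetical "begin-of-video" special id

def build_sequence(text_ids, video_ids):
    """Concatenate text tokens and offset video tokens into one sequence."""
    offset_video = [TEXT_VOCAB + v for v in video_ids]   # map into [32000, 40192)
    return text_ids + [BOV] + offset_video

print(TOTAL_VOCAB)                                    # 40202
print(build_sequence([5, 17, 301], [0, 8191])[-3:])   # [40192, 32000, 40191]
```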
The Loong model produces long videos with consistent appearance, strong motion dynamics, and natural scene transitions. Loong models text tokens and video tokens as a unified sequence and overcomes the challenges of long-video training with its short-to-long progressive training scheme and loss reweighting. The model could assist visual artists, film producers, and the entertainment industry, but it could also be misused to create fake content and spread misleading information.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nazmi Syed is a Consulting Intern at MarktechPost and is pursuing a Bachelor of Science degree at the Indian Institute of Technology (IIT) Kharagpur. He has a deep passion for data science and is actively exploring the broad applications of artificial intelligence in various industries. Fascinated by technological advancements, Nazmi is committed to understanding and implementing cutting-edge innovations in real-world contexts.