LLM-based video generation is an emerging field with a promising growth trajectory. While autoregressive large language models (LLMs) have excelled at generating long, coherent sequences of tokens in natural language processing, their application to video generation has been limited to clips of a few seconds. To address this, researchers have introduced Loong, an LLM-based autoregressive video generator capable of producing minute-long videos.
Training a video generation model like Loong involves a distinctive process. The model is trained from scratch, with text tokens and video tokens treated as a unified sequence. The researchers propose a short-to-long progressive training approach and a loss reweighting scheme to mitigate the loss imbalance problem in long-video training. This allows Loong to be trained on 10-second videos and then extended to generate minute-long videos conditioned on text prompts.
However, generating long videos is considerably more complicated and faces several challenges. First, there is the problem of imbalanced losses during training. Under the next-token-prediction objective, predicting early-frame tokens from text prompts is harder than predicting late-frame tokens from the preceding frames, which produces unequal losses across the sequence. As the video length grows, the accumulated loss of the easy tokens eclipses that of the difficult tokens and dominates the gradient direction. Second, the model predicts the next token from ground-truth tokens during training but relies on its own predictions during inference. This discrepancy causes error accumulation, aggravated by the strong dependencies between frames and the large number of video tokens, and leads to visual quality degradation in long-video inference.
To mitigate the imbalanced losses across video tokens, the researchers propose a short-to-long progressive training strategy with loss reweighting, as outlined below:
Progressive training from short to long
Training is divided into three stages of increasing video duration:
Stage 1: The model is pre-trained on text-to-image generation with a large static-image dataset, establishing a solid foundation for modeling per-frame appearance.
Stage 2: The model is trained jointly on images and short video clips, learning to capture short-term temporal dependencies.
Stage 3: The number of video frames is increased and joint training continues on longer videos.
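To make the loss-reweighting idea concrete, here is a minimal PyTorch sketch that down-weights the easier late-frame tokens relative to the early-frame tokens. The linear decay schedule and the `early_weight` parameter are illustrative assumptions, not the exact weighting used in the paper.

```python
import torch
import torch.nn.functional as F

def reweighted_video_loss(logits, targets, tokens_per_frame, early_weight=2.0):
    """Cross-entropy over video tokens with larger weights on early-frame tokens.

    logits:  (seq_len, vocab_size) next-token predictions for the video part
    targets: (seq_len,) ground-truth video token ids
    tokens_per_frame: number of discrete tokens per latent frame
    early_weight: hypothetical weight for the first frame's tokens, decaying
                  linearly to 1.0 for the last frame (an assumed schedule).
    """
    seq_len = targets.shape[0]
    num_frames = seq_len // tokens_per_frame
    # Per-token cross-entropy, no reduction yet.
    per_token = F.cross_entropy(logits, targets, reduction="none")
    # Per-frame weights that decay from early_weight down to 1.0.
    frame_weights = torch.linspace(early_weight, 1.0, num_frames)
    token_weights = frame_weights.repeat_interleave(tokens_per_frame)
    return (per_token * token_weights).sum() / token_weights.sum()

# Usage: 17 latent frames of 256 tokens each, over an 8,192-way video codebook.
logits = torch.randn(17 * 256, 8192)
targets = torch.randint(0, 8192, (17 * 256,))
print(reweighted_video_loss(logits, targets, tokens_per_frame=256).item())
```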
Loong is designed as a two-component system: a video tokenizer that compresses videos into discrete tokens (together with a decoder that maps tokens back to pixels), and a decoder-only transformer that predicts the next video token conditioned on the text tokens.
Loong's tokenizer uses a causal 3D CNN architecture inspired by MAGViT2. The model operates on low-resolution videos and leaves super-resolution to post-processing. The tokenizer compresses a 10-second video (65 frames at 128×128 resolution) into a sequence of 17×16×16 discrete tokens. Converting video frames into discrete tokens allows text and video tokens to form a unified sequence, so text-to-video generation is modeled as autoregressive prediction of video tokens conditioned on text tokens using a decoder-only Transformer.
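A small Python sketch makes the compression arithmetic explicit, assuming the causal tokenizer encodes the first frame on its own and each subsequent group of four frames into one latent frame, consistent with the figures quoted above.

```python
def video_token_grid(num_frames=65, height=128, width=128,
                     temporal_stride=4, spatial_stride=8):
    """Token-grid size for a causal 3D CNN tokenizer (MAGViT2-style).

    The first frame is encoded on its own; every subsequent group of
    `temporal_stride` frames maps to one latent frame.
    """
    latent_frames = 1 + (num_frames - 1) // temporal_stride   # 1 + 64/4 = 17
    latent_h = height // spatial_stride                       # 128/8  = 16
    latent_w = width // spatial_stride                        # 128/8  = 16
    return latent_frames, latent_h, latent_w

t, h, w = video_token_grid()
print(t, h, w, t * h * w)   # 17 16 16 -> 4352 discrete tokens per 10-second clip
```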
Autoregressive language models can generalize to videos longer than those seen in training, but extending beyond the trained duration risks error accumulation and quality degradation. The researchers address this with several inference-time techniques (see the sketch after this list):
- Video token re-encoding
- Sampling strategy
- Super resolution and refinement
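As one illustration of the sampling-strategy point, the sketch below restricts autoregressive decoding to the top-k video-token candidates, which limits the chance of picking a low-probability token whose error would compound across the strongly dependent tokens of later frames. The specific k, temperature, and decoding details here are assumptions for illustration, not the paper's exact settings.

```python
import torch

def sample_next_video_token(logits, top_k=100, temperature=1.0):
    """Top-k sampling over the logits of the next video token.

    logits: (vocab_size,) unnormalized scores over the video codebook.
    Only the k most likely tokens are kept before sampling (illustrative
    values; the paper describes its own sampling strategy).
    """
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k=top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]

# Usage: logits over an 8,192-entry video-token codebook.
logits = torch.randn(8192)
print(sample_next_video_token(logits, top_k=50).item())
```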
The model uses the LLaMA architecture, with sizes ranging from 700M to 7B parameters. The models are trained from scratch without pre-trained text weights. The vocabulary contains 32,000 text tokens, 8,192 video tokens, and 10 special tokens (40,202 in total). The video tokenizer follows MAGViT2, using a causal 3D CNN structure so that the first video frame is encoded independently of later frames. Spatial dimensions are compressed 8× and the temporal dimension 4×. Clustered vector quantization (CVQ) is used for quantization, which improves codebook utilization over standard VQ. The video tokenizer has 246 million parameters.
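The vocabulary split can be illustrated with a short sketch. The offset layout and the "begin-of-video" special token below are hypothetical choices to show how text and video ids can share one sequence, not the paper's exact index assignment.

```python
# Hypothetical unified-vocabulary layout: text ids first, then video ids,
# then special tokens (the exact ordering is an assumption).
TEXT_VOCAB = 32_000
VIDEO_VOCAB = 8_192
SPECIAL_TOKENS = 10
TOTAL_VOCAB = TEXT_VOCAB + VIDEO_VOCAB + SPECIAL_TOKENS   # 40,202

BOV = TEXT_VOCAB + VIDEO_VOCAB   # hypothetical "begin-of-video" special id

def build_sequence(text_ids, video_ids):
    """Concatenate text tokens and offset video tokens into one sequence."""
    offset_video = [TEXT_VOCAB + v for v in video_ids]   # map into [32000, 40192)
    return text_ids + [BOV] + offset_video

print(TOTAL_VOCAB)                                    # 40202
print(build_sequence([5, 17, 301], [0, 8191])[-3:])   # [40192, 32000, 40191]
```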
The Loong model produces long videos with consistent appearance, strong motion dynamics, and natural scene transitions. Loong models text tokens and video tokens as a unified sequence and overcomes the challenges of long-video training with its short-to-long progressive training scheme and loss reweighting. The model could assist visual artists, film producers, and the entertainment industry, but it could also be misused to create fake content and spread misleading information.
Check out the Paper. All credit for this research goes to the researchers of this project.
Nazmi Syed is a Consulting Intern at MarktechPost and is pursuing a Bachelor of Science degree at the Indian Institute of Technology (IIT) Kharagpur. He has a deep passion for data science and is actively exploring the broad applications of artificial intelligence in various industries. Fascinated by technological advancements, Nazmi is committed to understanding and implementing cutting-edge innovations in real-world contexts.