Autoregressive pretraining has proven transformative in machine learning, particularly for sequential data. Predicting the next element of a sequence has been remarkably effective in natural language processing and is increasingly being explored in computer vision. Video remains comparatively underexplored, yet it offers natural extensions to action recognition, object tracking, and robotic applications. This progress has been driven by ever-larger datasets and by transformer architectures that treat visual inputs as structured tokens suitable for autoregressive training.
Modeling videos presents unique challenges due to their temporal dynamics and redundancy. Unlike text, which has a clear sequential structure, neighboring video frames often carry redundant information, which complicates tokenization and representation learning. An effective video model must compress away this redundancy while still capturing spatiotemporal relationships across frames. Most existing frameworks focus on image-based representations, leaving the design of video architectures an open problem. The task calls for methods that balance efficiency and performance, especially for applications such as video forecasting and robotic manipulation.
Learning visual representations with convolutional networks and masked autoencoders has proven effective for image tasks, but these approaches often transfer poorly to video because they do not fully capture temporal dependencies. Tokenization methods such as dVAE and VQGAN convert visual content into discrete tokens and have proven effective, yet scaling them becomes challenging on mixed datasets that include both images and videos. Patch-based tokenization alone does not generalize efficiently across the many tasks a video model must support.
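To make the tokenization idea concrete, the sketch below shows the core step shared by VQGAN/dVAE-style quantizers: continuous patch features are snapped to the nearest entry in a learned codebook, and the resulting indices serve as discrete visual tokens. This is a minimal illustration; the feature dimension and the `quantize` helper are assumptions for the example, not values or code from any specific model.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous features to discrete token ids via nearest codebook entry.

    features: (num_patches, dim) continuous encoder outputs
    codebook: (vocab_size, dim) learned embedding table
    returns:  (num_patches,) integer token ids
    """
    # Pairwise Euclidean distance between every feature and every code.
    dists = torch.cdist(features, codebook)  # (num_patches, vocab_size)
    return dists.argmin(dim=-1)

# Illustrative sizes: 256 patches per frame, an 8k-entry vocabulary, 16-dim codes.
codebook = torch.randn(8192, 16)
patch_features = torch.randn(256, 16)
token_ids = quantize(patch_features, codebook)
print(token_ids.shape)  # torch.Size([256])
```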
A research team from Meta FAIR and UC Berkeley has introduced Toto, a family of autoregressive video models. Their approach addresses the limitations of traditional methods by treating videos as sequences of discrete visual tokens and applying causal transformer architectures to predict subsequent tokens. The researchers combined image and video training in a unified dataset containing more than one trillion image and video tokens, allowing them to leverage the strengths of autoregressive pretraining in both domains.
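As a rough illustration of the training objective, the snippet below sketches next-token prediction over a flattened stream of visual tokens. The `model` here stands in for any causal transformer that returns per-position logits; this is the generic autoregressive loss, not the released Toto implementation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Standard autoregressive objective: predict token t+1 from tokens <= t.

    tokens: (batch, seq_len) discrete visual token ids
    model:  any causal transformer returning (batch, seq_len, vocab) logits
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # causal masking is assumed to happen inside the model
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```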
Toto models use dVAE tokenization with an 8k-token vocabulary to process images and video frames. Each frame is resized and tokenized independently at a resolution of 128 × 128 pixels, producing sequences of 256 tokens. These tokens are processed by a causal transformer that incorporates RMSNorm and RoPE embeddings to improve model performance. Training was performed on the ImageNet and HowTo100M datasets. The researchers also optimized the models for downstream tasks by replacing average pooling with attention pooling, which yields better representation quality.
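One detail worth unpacking is the pooling swap: rather than averaging token features for downstream heads, a learned query cross-attends over them. The minimal sketch below shows the idea; the class name, dimensions, and head count are assumptions for illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool a sequence of token features into one vector with a learned query."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> pooled: (batch, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)

# Illustrative use on one frame's worth of tokens (256 tokens, 512-dim features).
feats = torch.randn(2, 256, 512)
pool = AttentionPool(dim=512)
print(pool(feats).shape)  # torch.Size([2, 512])
```

Unlike simple averaging, the learned query can weight informative tokens more heavily, which is why attention pooling tends to give better representations for downstream probing.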
The models perform well across benchmarks. On ImageNet classification, the largest Toto model achieves a top-1 accuracy of 75.3%, outperforming other generative models such as MAE and iGPT. On Kinetics-400 action recognition, the models reach up to 74.4% accuracy, demonstrating their ability to capture complex temporal dynamics. On the DAVIS dataset for semi-supervised video tracking, they achieve J&F scores of up to 62.4, improving on previous results from DINO and MAE. In robotic tasks such as object manipulation, Toto models also learn faster and more sample-efficiently; for example, the Toto-base model achieves 63% accuracy on a real-world cube-picking task with a Franka robot. Together, these results highlight the versatility and scalability of the proposed models across diverse applications.
The work represents a significant advance in video modeling, addressing the challenges of redundancy and tokenization. The researchers demonstrated "through unified training on both images and videos, that this form of autoregressive pretraining is generally effective on a variety of tasks." The architecture and tokenization strategies provide a foundation for future research on dense prediction and recognition, marking a meaningful step toward unlocking the full potential of video modeling in real-world applications.
Check out the Paper and Project page. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advances and creating opportunities to contribute.