In recent years, image generation has made significant progress, driven by advances in both transformers and diffusion models. Mirroring trends in generative language modeling, many modern image generation models rely on a standard image tokenizer and de-tokenizer. Despite their success in image generation, these tokenizers face fundamental limitations rooted in their design: they assume the latent space must preserve a 2D grid structure so that each latent token maps directly to a corresponding image patch.
This article reviews three strands of prior work in image processing and understanding. First, image tokenization has been a fundamental technique since the early days of deep learning, with autoencoders compressing high-dimensional images into low-dimensional latent representations and then decoding them back. Second, tokenization for image understanding underpins tasks such as image classification, object detection, segmentation, and multimodal large language models (MLLMs). Finally, image generation methods have evolved from sampling variational autoencoders (VAEs) to generative adversarial networks (GANs), diffusion models, and autoregressive models.
Researchers from the Technical University of Munich and ByteDance have proposed an innovative approach, the Transformer-based 1-Dimensional Tokenizer (TiTok), which tokenizes images into 1D latent sequences. TiTok consists of a Vision Transformer (ViT) encoder, a ViT decoder, and a vector quantizer, similar to typical Vector-Quantized (VQ) model designs. During tokenization, the image is split into patches, which are flattened and concatenated with a sequence of latent tokens. After the ViT encoder processes this combined sequence, only the latent tokens are retained as the latent representation of the image.
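To make this step concrete, below is a minimal PyTorch sketch of a TiTok-style 1D tokenizer. The module names, layer sizes, and nearest-neighbor codebook lookup are illustrative assumptions, not the authors' implementation: patches are embedded, concatenated with learnable latent tokens, passed through a ViT-style encoder, and only the latent tokens are quantized into discrete IDs.

```python
import torch
import torch.nn as nn

class OneDTokenizerSketch(nn.Module):
    """Hypothetical TiTok-style encoder: patches + latent tokens -> 32 discrete IDs."""
    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.latent_tokens = nn.Parameter(torch.randn(1, num_latent_tokens, dim) * 0.02)
        self.pos_embed = nn.Parameter(
            torch.randn(1, num_patches + num_latent_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.codebook = nn.Embedding(codebook_size, dim)   # vector-quantizer codebook

    def forward(self, images):
        b = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, 256, dim)
        latents = self.latent_tokens.expand(b, -1, -1)                 # (B, 32, dim)
        num_latents = latents.shape[1]
        x = torch.cat([patches, latents], dim=1) + self.pos_embed
        x = self.encoder(x)
        latents = x[:, -num_latents:]                 # keep only the latent tokens
        # Nearest-codebook-entry lookup (straight-through estimator omitted).
        flat = latents.reshape(-1, latents.shape[-1])                  # (B*32, dim)
        dists = torch.cdist(flat, self.codebook.weight)                # (B*32, 4096)
        return dists.argmin(dim=-1).view(b, num_latents)               # (B, 32) token IDs

token_ids = OneDTokenizerSketch()(torch.randn(2, 3, 256, 256))
print(token_ids.shape)  # torch.Size([2, 32])
```

The key design choice is that the number of latent tokens is fixed (e.g., 32) and decoupled from the number of image patches, which is what lets the representation break free of the 2D grid.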
Beyond reconstruction, TiTok also demonstrates its efficiency in image generation using a typical pipeline. MaskGIT is adopted as the generation framework for its simplicity and effectiveness: a MaskGIT model can be trained by simply replacing its VQGAN tokenizer with TiTok. The image is first pre-tokenized into discrete 1D tokens, and at each training step a random proportion of the latent tokens is replaced with mask tokens. A bidirectional transformer then takes this masked sequence as input and predicts the discrete token IDs at the masked positions.
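The masked-token training described above can be sketched as follows, operating on the 1D token IDs produced by the tokenizer sketch. The function name and the uniform mask-ratio sampling are illustrative assumptions; MaskGIT itself uses a cosine masking schedule and class conditioning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

codebook_size, mask_id, seq_len, dim = 4096, 4096, 32, 512

# Bidirectional (non-causal) transformer over token embeddings.
embed = nn.Embedding(codebook_size + 1, dim)   # extra entry for the mask token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6)
head = nn.Linear(dim, codebook_size)           # predicts a codebook ID per position

def training_step(token_ids):
    """One masked-modeling step on pre-tokenized 1D sequences of shape (B, 32)."""
    mask_ratio = torch.empty(()).uniform_(0.1, 1.0)    # MaskGIT uses a cosine schedule
    mask = torch.rand(token_ids.shape) < mask_ratio    # positions to hide
    masked_ids = token_ids.masked_fill(mask, mask_id)  # replace with the mask token
    logits = head(encoder(embed(masked_ids)))          # (B, 32, codebook_size)
    # Cross-entropy is computed only on the masked positions.
    return F.cross_entropy(logits[mask], token_ids[mask])

loss = training_step(torch.randint(0, codebook_size, (8, seq_len)))
loss.backward()
```

At inference time, MaskGIT-style decoding starts from an all-masked sequence and iteratively fills in the most confident token predictions over a handful of steps, which is much faster with 32 tokens than with 256 or 1024.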
TiTok provides a much more compact latent representation, making it far more efficient than traditional methods. For example, a 256 × 256 × 3 image can be reduced to only 32 discrete tokens, compared with the 256 or 1024 tokens used by previous techniques. Using the same generation framework, TiTok achieves a gFID of 1.97, outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 × 256 benchmark. TiTok's advantages are even more pronounced at higher resolutions: on the ImageNet 512 × 512 benchmark, it not only outperforms the leading diffusion model DiT-XL/2 but also uses 64 times fewer image tokens, resulting in a 410 times faster generation process.
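As a rough illustration of where the baseline token counts come from, the sketch below assumes a conventional 2D VQ tokenizer with 16× spatial downsampling (exact baselines vary by model).

```python
# Illustrative arithmetic only: a 2D tokenizer with 16x downsampling produces
# (size / 16)^2 tokens, versus TiTok's fixed 1D budget of 32 tokens at 256x256.
def tokens_2d(image_size, downsample=16):
    return (image_size // downsample) ** 2

print(tokens_2d(256))        # 256 tokens for a 256x256 image
print(tokens_2d(512))        # 1024 tokens for a 512x512 image
print(tokens_2d(256) // 32)  # 8x fewer tokens with TiTok's 32-token sequence
```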
In this paper, the researchers introduce TiTok, an innovative method that tokenizes images into 1D latent sequences and can be used to reconstruct and generate natural images. Its compact formulation represents an image with 8 to 64 times fewer tokens than commonly used 2D tokenizers. These compact 1D tokens also improve the training and inference speed of the generation model while achieving competitive FID scores on ImageNet benchmarks. Future work will focus on more efficient image generation and representation models built on 1D image tokenization.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter.
Join our Telegram Channel and LinkedIn Group.
If you like our work, you will love our Newsletter.
Don't forget to join our 44k+ ML SubReddit
Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.