This work was accepted at the I Can't Believe It's Not Better! (ICBINB) workshop at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using autoregressive methods, similar to language modeling. However, these methods have yet to take advantage of pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap and find that pre-trained language models offer limited help in autoregressive text-to-image generation. We provide a two-fold explanation by analyzing the tokens of each modality. First, we show that image tokens possess significantly different semantics from text tokens, making pre-trained language models no more effective at modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, causing any randomly initialized small language model to reach the same perplexity as larger pre-trained ones and causing catastrophic degradation of pre-trained language models' capabilities.
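To make the setup concrete, the sketch below (not the authors' code; all vocabulary sizes, model dimensions, and the toy batch are illustrative assumptions) shows the standard recipe the abstract refers to: text tokens and discrete VQ-VAE codebook indices are concatenated into a single sequence and modeled with a causal transformer via next-token prediction, after which perplexity can be compared between a pre-trained and a randomly initialized backbone.

```python
# Minimal sketch of autoregressive text-to-image modeling over a joint
# text + image-token vocabulary. Sizes and data are placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000        # assumed text-tokenizer vocabulary size
IMAGE_CODEBOOK = 1024    # assumed VQ-VAE codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK   # image codes are offset past text ids
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 256, 4, 4, 128

class TinyARTextToImage(nn.Module):
    """Causal transformer over the concatenated [text tokens ; image tokens]."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids):
        B, T = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=causal))

# Toy batch: 16 text tokens followed by 64 VQ-VAE code indices (offset by TEXT_VOCAB).
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(0, IMAGE_CODEBOOK, (2, 64)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)

model = TinyARTextToImage()
logits = model(seq[:, :-1])   # predict token t+1 from the prefix up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                    seq[:, 1:].reshape(-1))
print("next-token loss:", loss.item(), "perplexity:", loss.exp().item())
```

In this framing, swapping the randomly initialized backbone for pre-trained language-model weights is the intervention the paper studies; the finding is that doing so yields little benefit on the image-token portion and can degrade the model's original language capability.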