This work was carried out in collaboration with the Swiss Federal Institute of Technology Lausanne (EPFL).
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenizations, recent methods such as TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256×256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok compares to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
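To make the nested-dropout idea concrete, the sketch below shows one plausible way to truncate each ordered 1D token sequence to a randomly sampled prefix length during training, so the decoder learns to reconstruct images from any number of tokens. This is a minimal, hypothetical PyTorch rendering, not the authors' implementation; the function name `nested_dropout_prefix` and the padding convention are illustrative assumptions.

```python
# Hypothetical sketch of nested dropout over ordered 1D token sequences.
# Names and the pad_id convention are assumptions, not the authors' code.
import torch

def nested_dropout_prefix(tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Keep a random-length prefix of each ordered token sequence.

    tokens: (batch, seq_len) discrete token ids, ordered coarse-to-fine.
    Returns a copy where every token past a per-example cutoff
    k ~ Uniform{1, ..., seq_len} is replaced by pad_id, so the decoder
    must produce a plausible reconstruction from any prefix length.
    """
    batch, seq_len = tokens.shape
    keep = torch.randint(1, seq_len + 1, (batch, 1), device=tokens.device)
    positions = torch.arange(seq_len, device=tokens.device).unsqueeze(0)
    mask = positions < keep  # broadcasts to (batch, seq_len); True on kept prefix
    return tokens.masked_fill(~mask, pad_id)

# Example: a batch of 2 sequences of 8 tokens each.
toks = torch.randint(1, 1024, (2, 8))
print(nested_dropout_prefix(toks))
```

Because the sequence is ordered coarse-to-fine, dropping a suffix removes only the finest details, which is what makes reconstructions plausible at every prefix length.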
*Equal contribution.
† Affiliated jointly with Apple and the Swiss Federal Institute of Technology Lausanne (EPFL).
‡ Swiss Federal Institute of Technology Lausanne (EPFL).