Pre-training in the image domain
Moving to the image domain, the immediate question is how to form the "token sequence" of an image. The natural thought is to simply follow the ViT architecture: divide the image into a grid of patches and treat each patch as a visual token.
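As a concrete illustration, here is a minimal PyTorch sketch of this patchification step, assuming a 224×224 input and 16×16 patches (which gives the 14×14 grid of 196 visual tokens discussed below); the module name `PatchEmbed` and the embedding width are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts the image into patches and linearly
        # projects each one in a single operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```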
BEiT. Released as an arXiv preprint in 2021, the idea of BEiT is simple. After splitting an image into a 14×14 grid of patches, roughly 40% of the patches are randomly masked, replaced by a learnable embedding, and the full sequence is fed into the transformer. The pre-training goal is to maximize the log-likelihood of the correct visual tokens (discrete codes produced by a separate image tokenizer) at the masked positions; no decoder is needed for this stage. The pipeline is shown in the following figure.
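In code, the masking-and-prediction step looks roughly like the sketch below. It approximates the BEiT objective under simplifying assumptions: uniform random masking instead of BEiT's blockwise strategy, toy module sizes, and randomly generated `visual_token_ids` standing in for the output of a frozen discrete tokenizer. All names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def beit_mim_loss(encoder, head, mask_embed, patch_embeds, visual_token_ids,
                  mask_ratio=0.4):
    """Cross-entropy over the visual-token vocabulary at masked positions."""
    B, N, D = patch_embeds.shape
    # Randomly mask ~40% of positions (uniform masking keeps this sketch short).
    mask = torch.rand(B, N, device=patch_embeds.device) < mask_ratio
    # Replace masked patch embeddings with a learnable mask embedding.
    x = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, N, D), patch_embeds)
    logits = head(encoder(x))                      # (B, N, vocab_size)
    # Maximize the log-likelihood of the correct visual tokens where masked.
    return F.cross_entropy(logits[mask], visual_token_ids[mask])

# Illustrative usage with toy modules and random stand-in visual tokens.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
head = nn.Linear(768, 8192)                        # toy 8192-entry vocabulary
mask_embed = nn.Parameter(torch.zeros(1, 1, 768))
patch_embeds = torch.randn(2, 196, 768)
visual_token_ids = torch.randint(0, 8192, (2, 196))
loss = beit_mim_loss(encoder, head, mask_embed, patch_embeds, visual_token_ids)
```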
In the original article, the authors also provided a theoretical link between BEiT and the variational autoencoder. So the natural question is: can an autoencoder be used for pre-training?
MAE-ViT. This article answered the above question by designing a masked autoencoder architecture. Using the same ViT formulation and random masking, the authors proposed to discard the masked patches during pre-training and feed only the unmasked patches, as a shorter visual token sequence, into the encoder. Mask tokens are introduced only at the decoding stage, where a lightweight decoder reconstructs the masked patches. The decoder is flexible: it can range from 1 to 12 transformer blocks, with widths between 128 and 1024. More detailed architectural information can be found in the original article.
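To make the "encode only the visible patches" idea concrete, here is a minimal, self-contained sketch in the spirit of MAE. The class name `TinyMAE`, the toy depths, and the 75% masking ratio are illustrative assumptions, not the paper's exact configuration; positional embeddings and the masked-patches-only loss are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, dim=768, dec_dim=512, patch_dim=768):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, 12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerEncoderLayer(dec_dim, 8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pred = nn.Linear(dec_dim, patch_dim)   # reconstruct pixel patches

    def forward(self, patches, mask_ratio=0.75):    # patches: (B, N, dim)
        B, N, D = patches.shape
        keep = int(N * (1 - mask_ratio))
        # Per-sample random shuffle; the first `keep` indices stay visible.
        ids = torch.argsort(torch.rand(B, N, device=patches.device), dim=1)
        visible = torch.gather(patches, 1, ids[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(visible)              # encoder never sees mask tokens
        # Decoder input: projected visible latents plus mask tokens, un-shuffled.
        dec_in = torch.cat([self.enc_to_dec(latent),
                            self.mask_token.expand(B, N - keep, -1)], dim=1)
        restore = torch.argsort(ids, dim=1)[:, :, None].expand(-1, -1, dec_in.size(-1))
        dec_in = torch.gather(dec_in, 1, restore)
        return self.pred(self.decoder(dec_in))      # (B, N, patch_dim)

recon = TinyMAE()(torch.randn(2, 196, 768))  # loss would use masked patches only
```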
SimMIM. Slightly different from BEiT and MAE-ViT, this article proposes using a flexible backbone, such as the Swin Transformer, for encoding. The proposed prediction head is extremely lightweight: a single linear layer or a 2-layer MLP that regresses the raw pixel values of the masked patches.
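A rough sketch of such a lightweight head is shown below, assuming an L1 pixel-regression loss on the masked patches and stubbing out the backbone with random features; the function and tensor names, as well as the feature and patch sizes, are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simmim_loss(backbone_features, target_patches, patch_mask, head):
    # backbone_features: (B, N, C) features from e.g. a Swin Transformer.
    # target_patches:    (B, N, P) raw pixel values of each patch.
    # patch_mask:        (B, N) boolean, True where the patch was masked.
    pred = head(backbone_features)                 # (B, N, P) pixel predictions
    # Regression loss only on the masked positions.
    return F.l1_loss(pred[patch_mask], target_patches[patch_mask])

head = nn.Linear(1024, 16 * 16 * 3)    # one linear layer as the whole decoder
feats = torch.randn(2, 49, 1024)       # toy 7x7 grid of backbone features
targets = torch.randn(2, 49, 16 * 16 * 3)
mask = torch.rand(2, 49) < 0.6
loss = simmim_loss(feats, targets, mask, head)
```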