Deep generative modeling has emerged as a powerful approach for producing high-quality images in recent years. In particular, advances in diffusion and autoregressive models have enabled the generation of striking, photorealistic images conditioned on a text prompt. Despite their remarkable performance, these models suffer from a major limitation: slow sampling. A large neural network must be evaluated between 50 and 1,000 times to generate a single image, since each step of the iterative generative process re-runs the same network. This inefficiency matters in real-world deployments and may be an obstacle to the widespread adoption of these models.
A popular technique in this field is the deep variational autoencoder (VAE), which combines deep neural networks with probabilistic modeling to learn latent representations of the data. These representations can be used to generate new images that resemble the original data while exhibiting unique variations. Deep VAEs have enabled remarkable progress in image generation.
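To make the idea concrete, here is a minimal numpy sketch of the VAE objective: an encoder maps an image to a Gaussian over a latent code, a sample is drawn via the reparameterization trick, and the loss combines reconstruction error with a KL term toward a standard-normal prior. All names, the linear encoder/decoder, and the toy dimensions are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Hypothetical linear encoder: maps an image vector to the
    mean and log-variance of a diagonal Gaussian over the latent."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps so gradients can flow through mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec):
    """Hypothetical linear decoder: maps the latent back to pixel space."""
    return z @ W_dec

def elbo_terms(x, x_hat, mu, logvar):
    """Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
    return recon, kl

# Toy setup: 64-pixel "images", an 8-dimensional latent.
x = rng.standard_normal((4, 64))
W_mu = rng.standard_normal((64, 8)) * 0.1
W_logvar = rng.standard_normal((64, 8)) * 0.1
W_dec = rng.standard_normal((8, 64)) * 0.1

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
recon, kl = elbo_terms(x, decode(z, W_dec), mu, logvar)
loss = recon + kl  # minimized during training
```

In a real VAE the linear maps are deep networks and the loss is minimized by gradient descent, but the two-term structure of the objective is the same.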
However, hierarchical VAEs have yet to produce high-quality images on large and diverse datasets, which is particularly surprising given that their hierarchical generation process seems well suited to images. By contrast, autoregressive models have been far more successful, even though their inductive bias amounts to generating images in a simple raster-scan order. The authors of the paper discussed in this article therefore examined the factors behind the success of autoregressive models and transferred them to VAEs.
For example, a key to the success of autoregressive models is training on sequences of compressed image tokens rather than raw pixel values. This lets them focus on learning the relationships between the semantic elements of an image while ignoring its imperceptible details. By the same logic, existing pixel-space hierarchical VAEs may, like pixel-space autoregressive models, spend most of their capacity on fine-grained features, limiting their ability to capture the underlying composition of image concepts.
Based on these considerations, the work trains deep VAEs in the latent space of a deterministic autoencoder (DAE).
This approach comprises two stages: first, a DAE is trained to reconstruct images from low-dimensional latents; then, a VAE is trained as a generative model over those latents.
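The two-stage pipeline can be sketched as follows. As stand-ins, PCA plays the role of the deterministic autoencoder and a single Gaussian plays the role of the generative model over latents; the paper uses a deep DAE and a hierarchical VAE, so every component here is a simplified assumption meant only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: fit a deterministic autoencoder on pixels. ---
# Stand-in: PCA gives a deterministic encode/decode pair
# (the paper trains a deep DAE instead).
images = rng.standard_normal((256, 64))       # toy 64-pixel images
mean = images.mean(axis=0)
_, _, Vt = np.linalg.svd(images - mean, full_matrices=False)
components = Vt[:8]                           # 8-dimensional latent space

def dae_encode(x):
    return (x - mean) @ components.T          # pixels -> compact latent

def dae_decode(z):
    return z @ components + mean              # latent -> pixels

# --- Stage 2: fit a generative model on the frozen latents. ---
# Stand-in: a single Gaussian over latent space
# (the paper trains a hierarchical VAE here instead).
latents = dae_encode(images)
z_mean = latents.mean(axis=0)
z_cov = np.cov(latents, rowvar=False)

def sample_image(rng):
    z = rng.multivariate_normal(z_mean, z_cov)  # sample a latent
    return dae_decode(z)                        # decode back to pixels

x_new = sample_image(rng)
```

The key point is the separation of concerns: stage 1 handles perceptual detail, so stage 2 only has to model an 8-dimensional code instead of 64 pixels.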
Training the VAE on low-dimensional latents instead of pixel space brings two critical benefits. First, training becomes more robust and lightweight: the compressed latent code is much smaller than its RGB representation, yet it retains almost all of the image's perceptual information. A shorter code is advantageous because it emphasizes global features, which account for only a few bits, and the VAE can focus entirely on image structure because imperceptible details have been discarded. Second, the reduced dimensionality of the latent variable lowers computational costs and allows larger models to be trained with the same resources.
In addition, large-scale diffusion and autoregressive models use classifier-free guidance to improve image fidelity. This technique trades off sample diversity against quality, since likelihood-based models tend to generate samples that do not align well with the data distribution. The guidance mechanism steers samples toward regions most consistent with a desired label by contrasting the model's conditional and unconditional predictions. The authors therefore extend classifier-free guidance to deep VAEs.
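The core of classifier-free guidance is a simple extrapolation between two predictions from the same network: one conditioned on the label and one with the label dropped. A minimal sketch, with toy arrays standing in for the network outputs (the function name and values are illustrative):

```python
import numpy as np

def guided_prediction(cond, uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one.
    w = 0 -> unconditional, w = 1 -> conditional, w > 1 -> stronger guidance."""
    return uncond + w * (cond - uncond)

# Toy outputs from two passes of the same hypothetical network:
# one conditioned on a label, one with the label dropped.
uncond = np.array([0.0, 0.0])
cond = np.array([1.0, -1.0])

guided = guided_prediction(cond, uncond, w=2.0)  # pushes past the conditional
```

With w greater than 1, the sample is pushed further toward the conditional mode, trading diversity for label consistency, which is exactly the balance described above.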
A comparison of results between the proposed method and state-of-the-art approaches is shown below.
This was a brief overview of a novel lightweight deep VAE architecture for image generation.
If you are interested or would like more information on this framework, see the links to the paper and the project page.
Check out the Paper.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory and his research interests include adaptive video streaming, immersive media, machine learning and QoS / QoE evaluation.