A year ago, generating realistic images with AI was a dream. We were impressed when we saw generated faces that looked just like real ones, even though most outputs had three eyes, two noses, or similar artifacts. However, things changed quickly with the release of diffusion models. Today, it is difficult to distinguish an AI-generated image from a real one.
The ability to generate high-quality images is only one part of the equation. To put these images to use, compressing them efficiently plays an essential role in tasks such as content generation, data storage, transmission, and bandwidth optimization. However, image compression has been dominated by traditional methods such as transform coding and quantization techniques, with limited exploration of generative models.
Despite their success in generating images, diffusion models and score-based generative models have not yet become mainstream approaches for image compression, lagging behind GAN-based methods. They often perform worse than or on par with GAN-based approaches like HiFiC on high-resolution images. Even attempts to repurpose text-to-image models for image compression have yielded unsatisfactory results, producing reconstructions that deviate from the original input or contain unwanted artifacts.
The gap between the performance of score-based generative models on image generation tasks and their limited success in image compression raises intriguing questions and motivates further investigation. It is surprising that models capable of generating high-quality images have not been able to outperform GANs in the specific task of image compression. This discrepancy suggests that there are unique challenges and considerations when applying score-based generative models to compression tasks, requiring specialized approaches to realize their full potential.
So we know it is possible to use score-based generative models for image compression. The question is: how can it be done? Let's jump to the answer.
Google researchers proposed a method that combines a standard autoencoder, optimized for mean squared error (MSE), with a diffusion process that retrieves and adds the fine details discarded by the autoencoder. The bit rate for encoding an image is determined solely by the autoencoder, since the diffusion process does not require any additional bits. By tuning diffusion models specifically for image compression, the authors show that they can outperform several recent generative approaches in terms of image quality.
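To make the division of labor concrete, here is a minimal PyTorch sketch of such a two-stage pipeline. The names (`Autoencoder`, `compress_and_refine`, `refine_fn`) and the layer choices are illustrative placeholders, not the paper's actual architecture or training setup:

```python
# Minimal sketch: an MSE autoencoder fixes the bit rate; a generative
# refiner adds detail afterwards at zero extra bit cost. All names and
# layer sizes here are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """MSE-optimized autoencoder; only its quantized latent is transmitted."""
    def __init__(self, ch=192):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):
        # Quantized symbols are the only bits sent; training would need a
        # straight-through estimator or uniform noise instead of round().
        y_hat = torch.round(self.enc(x))
        return self.dec(y_hat)

def compress_and_refine(x, autoencoder, refine_fn):
    x_hat = autoencoder(x)   # coarse, MSE-faithful reconstruction (costs all the bits)
    return refine_fn(x_hat)  # generative detail synthesis (costs zero extra bits)

# Shape check with a placeholder refiner (identity):
out = compress_and_refine(torch.rand(1, 3, 64, 64), Autoencoder(), lambda x_hat: x_hat)
```

The key design point is that the refiner is conditioned only on the decoded reconstruction, so it never changes what must be transmitted.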
The method explores two closely related approaches: diffusion models, which exhibit impressive performance but require a large number of sampling steps, and rectified flows, which perform better when fewer sampling steps are allowed.
The two-step approach consists of first encoding the input image with the MSE-optimized autoencoder and then applying a diffusion process or a rectified flow to improve the realism of the reconstruction. The diffusion model uses a noise schedule that is shifted in the opposite direction compared to text-to-image models, prioritizing fine detail over global structure. The rectified-flow model, on the other hand, takes advantage of the pairing provided by the autoencoder to map autoencoder outputs directly to uncompressed images.
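One way to picture the rectified-flow variant: integrate a learned velocity field that carries the autoencoder output toward a sharp image. Below is a hedged sketch under that reading; `velocity_net` is a hypothetical network trained on (reconstruction, original) pairs, and the small Euler step count reflects why nearly straight flows need few steps:

```python
# Hedged sketch of few-step rectified-flow refinement. `velocity_net` is a
# hypothetical model; the paper's exact parameterization may differ.
import torch

@torch.no_grad()
def rectified_flow_refine(x_hat, velocity_net, num_steps=2):
    """Euler-integrate dx/dt = v(x, t) from t=0 (autoencoder output) to t=1.

    Because training pairs (autoencoder output, original image) give the flow
    nearly straight trajectories, a handful of steps is often enough.
    """
    x, dt = x_hat, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)  # one explicit Euler step along the flow
    return x

# Shape check with a dummy (zero) velocity field:
x = rectified_flow_refine(torch.rand(1, 3, 64, 64), lambda x, t: torch.zeros_like(x))
```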
Furthermore, the study reveals specific details that may be useful for future research in this domain. For example, the noise schedule and the amount of noise injected during image generation are shown to significantly impact the results. Interestingly, while text-to-image models benefit from higher noise levels when trained on high-resolution images, reducing the overall noise of the diffusion process turns out to be advantageous for compression. This setting allows the model to focus more on fine details, since the coarse details are already adequately captured by the autoencoder reconstruction.
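To illustrate what "less overall noise" can mean, the sketch below shifts a standard cosine noise schedule up in signal-to-noise ratio. The `shift` knob is a generic parameterization of such a shift, not necessarily the exact one used in the paper:

```python
# Illustrative shifted cosine noise schedule (assumption: the paper's exact
# parameterization may differ; this shows the generic mechanism only).
import math

def shifted_logsnr(t, shift=4.0):
    """Log-SNR of a cosine schedule for t in (0, 1), shifted by 2*log(shift).

    shift > 1 raises the SNR everywhere (less noise overall), letting the
    model focus on fine detail; high-resolution text-to-image models
    typically shift the other way (shift < 1, more noise).
    """
    return -2.0 * math.log(math.tan(math.pi * t / 2.0)) + 2.0 * math.log(shift)

def sigma(t, shift=4.0):
    """Noise level at time t: sigma_t = sqrt(sigmoid(-logSNR))."""
    return (1.0 / (1.0 + math.exp(shifted_logsnr(t, shift)))) ** 0.5

for t in (0.25, 0.5, 0.75):
    print(f"t={t}: sigma={sigma(t):.3f} shifted vs {sigma(t, shift=1.0):.3f} base")
```

At every timestep the shifted schedule injects less noise than the base schedule, which matches the intuition that only fine detail, not global structure, needs to be resynthesized.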
Check out the Paper. Don't forget to join our 24k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He received his Ph.D. in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Improvements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.