Generative models have revolutionized fields such as language, vision, and biology thanks to their ability to learn and sample from complex data distributions. While these models benefit from scaling during training through increased data, computational resources, and model sizes, their inference-time scaling capabilities face significant challenges. In particular, diffusion models, which excel at generating continuous data such as images, audio, and video through a denoising process, see limited gains when they simply increase the number of function evaluations (NFEs) during inference. The traditional approach of adding more denoising steps yields diminishing returns, preventing these models from converting additional computation into better results.
Various approaches have been explored to improve the performance of generative models at inference time. Expanding test-time computation has proven effective for LLMs through improved search algorithms, verification methods, and computation-allocation strategies. For diffusion models, researchers have pursued multiple directions, including fine-tuning approaches, reinforcement learning techniques, and direct preference optimization. Additionally, sample selection and optimization methods have been developed using random search algorithms, VQA models, and human-preference models. However, these methods focus on training-time improvements or limited test-time optimizations, leaving room for more comprehensive inference-time scaling solutions.
Researchers from NYU, MIT, and Google have proposed a fundamental framework for scaling diffusion models at inference time. Their approach goes beyond simply adding denoising steps and introduces a search-based methodology that improves generation performance through better identification of initial noises. The framework operates along two key dimensions: verifiers that provide feedback on sample quality, and algorithms that search for superior noise candidates. This addresses the limitations of conventional scaling by structuring how additional inference-time compute is spent, and the framework's flexibility allows combinations of components to be tailored to specific application scenarios.
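The two axes of the framework, a verifier that scores finished samples and a search algorithm that proposes noise candidates, can be thought of as pluggable components. A minimal sketch, where the function names and the toy denoiser and verifier are hypothetical stand-ins, not the paper's actual models:

```python
import random
from typing import Callable, List

def search_noises(
    propose: Callable[[random.Random], List[float]],  # search algorithm: proposes a noise candidate
    denoise: Callable[[List[float]], List[float]],    # fixed pre-trained sampler (denoising steps unchanged)
    verify: Callable[[List[float]], float],           # verifier: scores a finished sample
    search_budget: int,
    seed: int = 0,
) -> List[float]:
    """Spend extra inference compute searching over initial noises instead of
    adding denoising steps; keep the sample the verifier scores highest."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(search_budget):
        noise = propose(rng)
        sample = denoise(noise)
        score = verify(sample)
        if score > best_score:
            best, best_score = sample, score
    return best

# Toy stand-ins: 3-dim "latents", identity denoiser, verifier rewards proximity to zero.
propose = lambda rng: [rng.gauss(0.0, 1.0) for _ in range(3)]
denoise = lambda z: z
verify = lambda x: -sum(v * v for v in x)

best = search_noises(propose, denoise, verify, search_budget=32)
```

Swapping in a different `verify` or `propose` changes the search behavior without touching the sampler, which is the sense in which components can be combined per application.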
The framework implementation focuses on class-conditional ImageNet generation using a pre-trained SiT-XL model at 256 × 256 resolution with a second-order Heun sampler. The setup keeps the number of denoising steps fixed at 250 while devoting additional NFEs to search operations. The main search mechanism is random search, a Best-of-N strategy that selects the optimal noise candidate. For verification, the system uses two oracle verifiers: Inception Score (IS) and Fréchet Inception Distance (FID). IS selection picks the sample with the highest classification probability under a pre-trained InceptionV3 model, while FID selection minimizes divergence from pre-computed ImageNet Inception feature statistics.
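As a concrete sketch of IS-style Best-of-N selection: score each candidate by the top class probability a classifier assigns to it, then keep the highest-scoring candidate. The logits below are toy stand-ins for InceptionV3 outputs, not real model activations:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_style_score(logits) -> float:
    """IS-style verifier: the highest class probability the classifier
    assigns to a sample (InceptionV3 in the paper; toy logits here)."""
    return float(softmax(np.asarray(logits, dtype=float)).max())

def best_of_n(candidate_logits) -> int:
    """Best-of-N selection: index of the candidate the verifier scores highest."""
    scores = [is_style_score(l) for l in candidate_logits]
    return int(np.argmax(scores))

# A confidently classified candidate should beat an ambiguous one.
candidates = [np.array([0.1, 0.2, 0.1]),  # near-uniform logits: low verifier score
              np.array([6.0, 0.1, 0.2])]  # confident logits: high verifier score
chosen = best_of_n(candidates)            # expected to pick index 1
```

An FID-style verifier would instead compare feature statistics of the candidate set against pre-computed reference statistics, selecting the candidate that reduces the distance most.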
The effectiveness of the framework has been demonstrated through extensive testing on different benchmarks. On DrawBench, which spans diverse text prompts, evaluation with an LLM grader shows that searching with multiple verifiers consistently improves sample quality, albeit with different patterns across settings. ImageReward and the Verifier Ensemble perform well, improving all metrics thanks to their nuanced evaluation capabilities and alignment with human preferences. T2I-CompBench, which emphasizes prompt accuracy over visual quality, reveals different optimal settings: ImageReward emerges as the best performer, while Aesthetic Score has minimal or negative impact and CLIP provides modest improvements.
In conclusion, the researchers establish a significant advance in diffusion models by introducing a framework for inference-time scaling through strategic search mechanisms. The study shows that scaling compute via search can achieve substantial performance improvements across model sizes and generation tasks, with different computational budgets producing different scaling behaviors. While the approach is successful, it also reveals inherent biases in different verifiers and underscores the importance of developing task-specific verification methods. These findings open new avenues for future research into more specialized and efficient verification systems for vision-generation tasks.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on <a target="_blank" href="https://x.com/intent/follow?screen_name=marktechpost" rel="noreferrer noopener">Twitter</a> and join our Telegram Channel and LinkedIn Group. Don't forget to join our 65k+ ML SubReddit.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to articulate complex AI concepts in a clear and accessible way.