Text-to-image models have been at the center of AI discussions for the past year. The field has advanced rapidly, and as a result we now have impressive text-to-image models. Generative AI has entered a new phase.
Diffusion models were the key contributors to this advancement and have emerged as a powerful class of generative models. They generate high-quality images by gradually denoising random noise into the desired image, capturing hidden data patterns along the way and producing diverse, realistic samples.
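To make the "gradual denoising" idea concrete, here is a minimal sketch of a generic DDPM-style sampling loop. This is not Subject-Diffusion's actual code; the model interface, noise schedule, and image shape are illustrative assumptions:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, shape=(1, 3, 64, 64), device="cpu"):
    """Generic reverse-diffusion loop: start from pure Gaussian noise and
    iteratively denoise it into an image. `model(x, t)` is assumed to
    predict the noise that was added at step t."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # begin with pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)             # predicted noise at this step
        alpha_t, alpha_bar_t = alphas[t], alphas_cumprod[t]
        # Posterior mean of the reverse process: remove a little of the noise
        x = (x - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject scheduled noise
    return x
```

Text-to-image systems condition `model` on an encoded prompt, so every denoising step is steered toward the description.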
The rapid advancement of diffusion-based generative models has revolutionized text-to-image generation. You can describe whatever image you can think of, and the models will generate it quite accurately. As they improve further, it is becoming difficult to tell which images were generated by AI.
However, there is a catch. These models rely solely on textual descriptions to generate images; you can only “describe” what you want to see. Moreover, they are not easy to personalize, as that usually requires fine-tuning.
Imagine redesigning the interior of your house with an architect who can only offer designs made for previous clients. When you try to personalize part of the design, the architect simply ignores your request and offers yet another recycled style. Does not sound very pleasing, does it? This might be the experience you get with text-to-image models if you are looking for personalization.
Thankfully, there have been attempts to overcome these limitations. Researchers have explored integrating textual descriptions with reference images to achieve more personalized image generation. While some methods require fine-tuning on specific reference images, others retrain the base models on personalized datasets, leading to potential drawbacks in fidelity and generalization. Additionally, most existing algorithms cater to specific domains, leaving gaps in handling multi-concept generation, test-time fine-tuning, and open-domain zero-shot capability.
So, today we look at a new approach that brings us closer to open-domain personalization. Time to meet Subject-Diffusion.
Subject-Diffusion is an innovative open-domain personalized text-to-image generation framework. It requires only one reference image and eliminates the need for test-time fine-tuning. To support training at scale, the authors built an automatic data labeling tool and used it to construct the Subject-Diffusion Dataset (SDD), which contains an impressive 76 million images and 222 million entities.
Subject-Diffusion has three main components: location control, fine-grained reference image control, and attention control. Location control adds mask images of the main subjects during the noise injection process. Fine-grained reference image control uses a combined text-image information module to better integrate coarse textual and fine-grained visual information. Finally, attention control is introduced during training to enable the smooth generation of multiple subjects.
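The paper's exact architecture is not reproduced here, but the following hypothetical PyTorch sketch illustrates how the three signals could come together in a single conditioning module. The layer sizes, module names, and fusion scheme are assumptions made purely for illustration:

```python
import torch
import torch.nn as nn

class FusedConditioning(nn.Module):
    """Hypothetical illustration of the three controls described above.
    The real Subject-Diffusion architecture differs; this only shows the
    general pattern of fusing text, image, and location signals."""

    def __init__(self, text_dim=768, img_dim=1024, hidden=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)  # project text-encoder tokens
        self.img_proj = nn.Linear(img_dim, hidden)    # project reference-image patch features
        self.fuse = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

    def forward(self, text_tokens, image_patches, subject_mask, noisy_latents):
        # 1) Location control: concatenate the subject mask with the noisy
        #    latents so the model knows where each subject should appear.
        latents_with_mask = torch.cat([noisy_latents, subject_mask], dim=1)

        # 2) Fine-grained reference image control: let text tokens attend to
        #    dense patch features extracted from the reference image.
        txt = self.text_proj(text_tokens)
        img = self.img_proj(image_patches)
        fused, attn_weights = self.fuse(query=txt, key=img, value=img)

        # 3) Attention control would constrain these attention maps during
        #    training (e.g., keeping each subject's tokens focused on its own
        #    region); here we only return them alongside the fused features.
        return latents_with_mask, fused, attn_weights
```

In a real pipeline, `latents_with_mask` would be fed to the denoising U-Net and `fused` injected through cross-attention, while the returned attention weights are what a training-time attention-control loss would act on.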
Subject-Diffusion achieves impressive fidelity and generalization, capable of generating single, multiple, and human-subject personalized images with modifications to shape, pose, background, and style based on just one reference image per subject. The model also enables smooth interpolation between customized images and text descriptions through a specially designed denoising process. Quantitative comparisons show that Subject-Diffusion outperforms or matches other state-of-the-art methods, both with and without test-time fine-tuning, on various benchmark datasets.
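The summary above does not spell out the interpolation mechanism, but one common way to obtain such behavior, offered here purely as an assumption rather than the paper's actual scheme, is to blend the noise predictions produced under text-only and reference-image conditioning at each denoising step:

```python
import torch

@torch.no_grad()
def blended_noise_prediction(model, x_t, t, text_cond, image_cond, alpha=0.5):
    """Hypothetical interpolation step: blend the noise predictions obtained
    under text-only and reference-image conditioning. `alpha` slides the
    result from "follow the text" (0.0) toward "follow the reference image"
    (1.0). The `model(x, t, cond=...)` interface is an assumption."""
    eps_text = model(x_t, t, cond=text_cond)
    eps_image = model(x_t, t, cond=image_cond)
    return (1.0 - alpha) * eps_text + alpha * eps_image
```

Sweeping `alpha` across the denoising steps would trace a smooth path between a purely text-driven result and one dominated by the customized subject.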
Check out the Paper. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.