The development of large generative AI models such as ChatGPT and DALL-E has been a topic of great interest in the artificial intelligence community. Using advanced deep learning techniques, these models do everything from generating text to producing images. DALL-E, developed by OpenAI, is a text-to-image generation model that produces high-quality images from an entered textual description. Trained on massive datasets of text and images, such text-to-image models learn to build a visual representation of a given prompt. Several of them go further, generating a new image not only from a textual description but also from an existing image, a capability built on Stable Diffusion. ControlNet, a newly introduced neural network framework, significantly improves control over these text-to-image diffusion models.
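As a rough illustration of the text-to-image and image-to-image workflows mentioned above, here is a minimal sketch using the Hugging Face diffusers library; the checkpoint name, prompts, and file names are illustrative assumptions rather than anything specified in this article.

```python
# Minimal sketch: text-to-image and image-to-image generation with Stable Diffusion
# via the Hugging Face diffusers library (model ID and prompts are illustrative).
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint; any SD 1.x model works

# Text-to-image: generate an image directly from a textual description.
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = txt2img("a watercolor painting of a lighthouse at dawn").images[0]

# Image-to-image: start from an existing image and re-generate it under a new prompt.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
variation = img2img(
    prompt="the same lighthouse in a snowstorm",
    image=image,      # the existing image to transform
    strength=0.6,     # how far to deviate from the input image
).images[0]
variation.save("lighthouse_variation.png")
```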
Developed by Stanford University researchers Lvmin Zhang and Maneesh Agrawala, ControlNet enables precise, detailed control over the image generation process of diffusion models. A diffusion model is a generative model that produces an image from text by iteratively updating the variables that represent the image: with each iteration, noise is removed and detail is added, gradually moving toward the target image. These diffusion models are implemented with the help of Stable Diffusion, in which an enhanced diffusion process is used to train them, producing varied images with much more stability and ease.
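To make the iterative refinement concrete, the following is a simplified schematic of a denoising loop, not the actual Stable Diffusion implementation: the sample starts as pure noise, and at each step the model's noise estimate is subtracted, adding detail as the sample moves toward the target image. Real systems use a scheduler (e.g. DDPM or DDIM) with more careful update rules.

```python
# Schematic denoising loop for a diffusion model (illustrative only; the update rule
# here is deliberately simplified compared to real DDPM/DDIM schedulers).
import torch

def generate(model, text_embedding, steps=50, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)                            # start from pure noise
    for t in reversed(range(steps)):                  # iterate from noisiest to cleanest step
        predicted_noise = model(x, t, text_embedding) # model estimates the noise at step t
        x = x - predicted_noise / steps               # remove a little noise, adding detail
    return x                                          # final sample (decoded to an image elsewhere)

if __name__ == "__main__":
    # Dummy "model" that treats a fraction of the current sample as noise, purely to
    # show the shape of the loop; a real model is a trained text-conditioned U-Net.
    dummy = lambda x, t, emb: 0.1 * x
    print(generate(dummy, text_embedding=None).shape)
```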
ControlNet works in combination with pre-trained diffusion models to enable the generation of images that capture every aspect of the textual description fed as input. This neural network structure allows high-quality images to be produced by taking additional input conditions into account. ControlNet does this by making two copies of each Stable Diffusion block: a trainable variant and a locked variant. During generation of the target image, the trainable variant learns the new conditions used to synthesize the image and adds detail even when only small datasets are available, while the locked variant preserves the capabilities of the pre-trained diffusion model.
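A rough PyTorch-style sketch of this locked/trainable pairing is shown below. The zero-initialized connection follows the design described in the ControlNet paper, while the class and argument names here are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of ControlNet's locked/trainable block pairing.
# The locked copy keeps the pretrained diffusion weights frozen; the trainable copy
# learns the new condition and feeds back through a zero-initialized convolution,
# so training starts from the unmodified pretrained behaviour.
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block                     # frozen pretrained block
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.trainable = copy.deepcopy(pretrained_block)   # trainable copy of the same block
        self.zero_conv = nn.Conv2d(channels, channels, 1)  # zero-initialized 1x1 convolution
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, condition):
        out = self.locked(x)                      # original diffusion pathway, unchanged
        control = self.trainable(x + condition)   # pathway that learns the new condition
        return out + self.zero_conv(control)      # contributes zero at the start of training
```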
A key strength of ControlNet is its ability to determine which parts of the input image matter for generating the target image and which do not. Unlike traditional methods that cannot examine the input image closely, ControlNet addresses the problem of spatial consistency by allowing Stable Diffusion models to use complementary input conditions to guide generation. The researchers behind ControlNet report that it can even be trained on a Graphics Processing Unit (GPU) with eight gigabytes of graphics memory.
ControlNet is a definite breakthrough, as it can be trained to learn conditions ranging from edge maps and keypoints to segmentation maps. It is a strong addition to already popular image-generation techniques and, combined with large datasets and Stable Diffusion, can be used in many applications for finer control over image generation.
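For readers who want to try it, here is a minimal usage sketch with the diffusers ControlNet integration, using a Canny edge map as the condition; the checkpoint names are assumptions based on commonly released weights and may differ from your setup.

```python
# Minimal sketch: conditioning Stable Diffusion on a Canny edge map via ControlNet
# (Hugging Face diffusers; checkpoint names are assumed, based on common releases).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn an existing photo into an edge map that will guide the generation.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe("a futuristic city at night", image=edge_map).images[0]
result.save("controlled_output.png")
```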
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.