The concept of diffusion
Denoising diffusion models are trained to extract patterns from noise and generate a desirable image. The training process involves showing the model examples of images (or other data) at different noise levels, determined by a noise scheduling algorithm, with the objective of predicting which parts of the data are noise. If training succeeds, the noise prediction model can gradually build a realistic-looking image from pure noise, subtracting an increment of noise from the image at each time step.
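As a rough, minimal sketch of that training objective (illustrative only, not part of the pipeline we build below: the tiny fully connected toy_model and the simplified linear noise schedule are stand-ins for a real U-Net and scheduler), the core idea looks like this:
# Illustrative sketch of the noise-prediction training objective.
# "toy_model" and the linear beta schedule are stand-ins, not the real thing.
import torch

toy_model = torch.nn.Sequential(torch.nn.Linear(16, 64),
        torch.nn.ReLU(), torch.nn.Linear(64, 16))
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-4)

betas = torch.linspace(1e-4, 0.02, 1000)            # noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

clean_data = torch.randn(8, 16)            # stand-in for (encoded) training images
timesteps = torch.randint(0, 1000, (8,))   # a random noise level for each example
noise = torch.randn_like(clean_data)

# forward process: mix the clean data with noise according to the schedule
signal_scale = alphas_cumprod[timesteps].sqrt().unsqueeze(-1)
noise_scale = (1.0 - alphas_cumprod[timesteps]).sqrt().unsqueeze(-1)
noisy_data = signal_scale * clean_data + noise_scale * noise

# the model is trained to predict the noise that was added
predicted_noise = toy_model(noisy_data)
loss = torch.nn.functional.mse_loss(predicted_noise, noise)
loss.backward()
optimizer.step()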
<img decoding="async" alt="noise diffusion and elimination process" src="https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/blt0794e83bc6735188/66060b30dd5b9e25b6a87c57/EXX-Blog-explaining-text-to-image-generative-ai-2.jpg?format=webp"/>
Unlike in the image at the top of this section, modern diffusion models do not predict noise from an image with added noise, at least not directly. Instead, they predict noise in a latent-space representation of the image. Latent space represents an image as a compressed set of numerical features, the output of the encoder module of a variational autoencoder, or VAE. This trick put the "latent" in latent diffusion and greatly reduced the time and compute required to generate images. As reported by the paper's authors, latent diffusion speeds up inference by a factor of at least ~2.7 over direct pixel-space diffusion and trains about three times faster.
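As a rough illustration of that compression step (a sketch only, assuming the vae object and my_device defined later in this article, and a hypothetical some_image.png on disk), an image can be round-tripped through latent space like this:
# Sketch: moving an image into and out of latent space with the VAE.
# Assumes `vae` (AutoencoderKL) and `my_device` are set up as shown later;
# "some_image.png" is a hypothetical placeholder file.
import numpy as np
import torch
from PIL import Image

img = Image.open("some_image.png").convert("RGB").resize((512, 512))
x = torch.tensor(np.array(img), dtype=torch.float32)
x = (x.permute(2, 0, 1)[None] / 127.5) - 1.0  # scale pixels to [-1, 1]

with torch.no_grad():
    latents = vae.encode(x.to(my_device)).latent_dist.sample() * 0.18215
    reconstruction = vae.decode(latents / 0.18215).sample

print(latents.shape)  # e.g. torch.Size([1, 4, 64, 64]) for a 512x512 input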
People who work with latent diffusion often talk about using a "diffusion model," but in reality the diffusion process uses several modules. As shown in the architecture diagram, a diffusion pipeline for text-to-image workflows typically includes a text embedding model (and its tokenizer), a denoising prediction/diffusion model, and an image decoder. Another important part of latent diffusion is the scheduler, which determines how the noise is scaled and updated over a series of "time steps" (iterative updates that gradually remove noise from latent space); a short sketch after the diagram shows how a scheduler's noise levels can be inspected.
<img decoding="async" alt="latent diffusion model architecture diagram" src="https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/bltac74210e6f7acef9/66060b300713758c2fc413f1/EXX-Blog-explaining-text-to-image-generative-ai-3.jpg?format=webp"/>
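As a small, self-contained sketch of what the scheduler is responsible for (the scheduler settings below are illustrative and independent of the pipeline we build next), you can inspect the time steps and the corresponding noise levels directly:
# Sketch: inspecting a scheduler's time steps and noise levels.
# LMSDiscreteScheduler exposes per-step noise scales via `.sigmas`;
# the beta settings here are illustrative.
from diffusers import LMSDiscreteScheduler

sched = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
        beta_schedule="scaled_linear")
sched.set_timesteps(10)
print(sched.timesteps)  # which points of the training schedule are visited
print(sched.sigmas)     # how much noise is assumed at each of those steps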
Latent diffusion code example
We will use CompVis/stable-diffusion-v1-4 for most of our examples. Text embedding is handled by a CLIPTextModel and CLIPTokenizer. Noise prediction uses a U-Net, a type of image-to-image model that originally gained traction in biomedical imaging applications (especially segmentation). To generate images from de-noised latent arrays, the pipeline uses a variational autoencoder (VAE) for image decoding, converting those arrays into images.
We'll start by building our own version of this pipeline from the Hugging Face components.
# local setup
virtualenv diff_env --python=python3.8
source diff_env/bin/activate
pip install diffusers transformers huggingface-hub
pip install torch --index-url https://download.pytorch.org/whl/cu118
Be sure to check pytorch.org for the correct torch version for your system if you are working locally. Our imports are relatively simple, and the code snippet below is sufficient for all of the demos that follow.
import os
import numpy as np
import torch
from diffusers import StableDiffusionPipeline, AutoPipelineForImage2Image
from diffusers.pipelines.pipeline_utils import numpy_to_pil
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, \
PNDMScheduler, LMSDiscreteScheduler
from PIL import Image
import matplotlib.pyplot as plt
Now for the details. Start by defining the image and diffusion parameters and a prompt.
prompt = (" ")
# image settings
height, width = 512, 512
# diffusion settings
number_inference_steps = 64
guidance_scale = 9.0
batch_size = 1
Initialize your pseudorandom number generator with a seed of your choice for reproducible results.
def seed_all(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)

seed_all(193)
Now we can initialize the text embedding model, the autoencoder, a U-Net, and the time step scheduler.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", \
subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4",\
subfolder="unet")
scheduler = PNDMScheduler()
scheduler.set_timesteps(number_inference_steps)
my_device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
vae = vae.to(my_device)
text_encoder = text_encoder.to(my_device)
unet = unet.to(my_device)
Encoding the text prompt as an embedding requires first tokenizing the string input. Tokenization replaces characters with integer codes corresponding to a vocabulary of semantic units, for example via byte pair encoding (BPE). Our pipeline embeds an empty prompt (no text) alongside the text prompt for our image. This balances the diffusion process between the provided description and natural-looking images in general. We'll look at how to change the relative weighting of these components later in this article.
prompt = prompt * batch_size
tokens = tokenizer(prompt, padding="max_length",\
max_length=tokenizer.model_max_length, truncation=True,\
return_tensors="pt")
empty_tokens = tokenizer([""] * batch_size, padding="max_length",\
max_length=tokenizer.model_max_length, truncation=True,\
return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to(my_device))[0]
    max_length = tokens.input_ids.shape[-1]
    notext_embeddings = text_encoder(empty_tokens.input_ids.to(my_device))[0]
    text_embeddings = torch.cat([notext_embeddings, text_embeddings])
We initialize the latent space as random normal noise and scale it according to our diffusion time step scheduler.
latents = torch.randn(batch_size, unet.config.in_channels, \
height//8, width//8)
latents = (latents * scheduler.init_noise_sigma).to(my_device)
Everything is ready, and we can dive into the diffusion loop itself. We can track the generated images by sampling periodically, so we can see how the noise gradually decreases.
images = []
display_every = number_inference_steps // 8
# diffusion loop
for step_idx, timestep in enumerate(scheduler.timesteps):
    with torch.no_grad():
        # concatenate latents, to run null/text prompt in parallel.
        model_in = torch.cat([latents] * 2)
        model_in = scheduler.scale_model_input(model_in,\
                timestep).to(my_device)
        predicted_noise = unet(model_in, timestep, \
                encoder_hidden_states=text_embeddings).sample
        # pnu - empty prompt unconditioned noise prediction
        # pnc - text prompt conditioned noise prediction
        pnu, pnc = predicted_noise.chunk(2)
        # weight noise predictions according to guidance scale
        predicted_noise = pnu + guidance_scale * (pnc - pnu)
        # update the latents
        latents = scheduler.step(predicted_noise, \
                timestep, latents).prev_sample
        # periodically log images and print progress during diffusion
        if step_idx % display_every == 0\
                or step_idx + 1 == len(scheduler.timesteps):
            image = vae.decode(latents / 0.18215).sample[0]
            image = ((image / 2.) + 0.5).cpu().permute(1,2,0).numpy()
            image = np.clip(image, 0, 1.0)
            images.extend(numpy_to_pil(image))
            print(f"step {step_idx}/{number_inference_steps}: {timestep:.4f}")
At the end of the diffusion process, we have a decent rendering of what we wanted to generate. Next, we'll go over additional techniques for greater control. Since we've already built a diffusion pipeline by hand, we can use Hugging Face's optimized diffusion pipelines for the rest of our examples.
Controlling the diffusion pipeline
We will use a set of helper functions in this section:
def seed_all(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)

def grid_show(images, rows=3):
    number_images = len(images)
    height, width = images[0].size
    columns = int(np.ceil(number_images / rows))
    grid = np.zeros((height*rows, width*columns, 3))
    for ii, image in enumerate(images):
        grid[ii//columns*height:ii//columns*height+height, \
                ii%columns*width:ii%columns*width+width] = image
    fig, ax = plt.subplots(1, 1, figsize=(3*columns, 3*rows))
    ax.imshow(grid / grid.max())
    return grid, fig, ax

def callback_stash_latents(ii, tt, latents):
    # adapted from fastai/diffusion-nbs/stable_diffusion.ipynb
    latents = 1.0 / 0.18215 * latents
    image = pipe.vae.decode(latents).sample[0]
    image = (image / 2. + 0.5).cpu().permute(1,2,0).numpy()
    image = np.clip(image, 0, 1.0)
    images.extend(pipe.numpy_to_pil(image))

my_seed = 193
We will begin with the most well-known and straightforward application of diffusion models: generating images from textual prompts, known as text-to-image generation. The model we will use was released (on the Hugging Face Hub) by the academic lab that published the latent diffusion paper. Hugging Face coordinates workflows like latent diffusion through its convenient pipeline API. We define the device and floating-point precision to compute with based on whether or not we have a GPU.
if torch.cuda.is_available():
    # run CompVis/stable-diffusion-v1-4 on GPU
    pipe_name = "CompVis/stable-diffusion-v1-4"
    my_dtype = torch.float16
    my_device = torch.device("cuda")
    my_variant = "fp16"
    pipe = StableDiffusionPipeline.from_pretrained(pipe_name,\
            safety_checker=None, variant=my_variant,\
            torch_dtype=my_dtype).to(my_device)
else:
    # run CompVis/stable-diffusion-v1-4 on CPU
    pipe_name = "CompVis/stable-diffusion-v1-4"
    my_dtype = torch.float32
    my_device = torch.device("cpu")
    pipe = StableDiffusionPipeline.from_pretrained(pipe_name, \
            torch_dtype=my_dtype).to(my_device)
Guidance scale
If you use a very unusual text prompt (very different from those in the training data set), you may end up in a less-traveled part of latent space. The empty-prompt embedding provides a counterweight, and combining the two according to guidance_scale lets you trade off the specificity of your prompt against common image features.
guidance_images = []
for guidance in [0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 20.0]:
    seed_all(my_seed)
    my_output = pipe(my_prompt, num_inference_steps=50, \
            num_images_per_prompt=1, guidance_scale=guidance)
    guidance_images.append(my_output.images[0])
    for ii, img in enumerate(my_output.images):
        img.save(f"prompt_{my_seed}_g{int(guidance*2)}_{ii}.jpg")

temp = grid_show(guidance_images, rows=3)
plt.savefig("prompt_guidance.jpg")
plt.show()
Since we generated the prompt with 9 guidance coefficients, we can plot the results and see how the diffusion developed. The default guidance coefficient is 7.5, so the 7th image (guidance scale 8.0) is the closest to the default output.
Negative prompts
Sometimes latent diffusion really "wants" to produce an image that does not match your intentions. In these scenarios, you can use a negative prompt to steer the diffusion process away from unwanted outputs. For example, we could use a negative prompt to make our Martian astronaut renders a little less human.
my_prompt = " "
my_negative_prompt = " "
output_x = pipe(my_prompt, num_inference_steps=50, num_images_per_prompt=9, \
negative_prompt=my_negative_prompt)
temp = grid_show(output_x.images)
plt.show()
You should get results that follow your prompt while avoiding the things described in your negative prompt.
Image variation
Text-to-image generation from scratch is not the only application of diffusion pipelines. Diffusion is actually well suited to modifying images, starting from an initial image. We will use a slightly different pipeline and a pre-trained model fine-tuned for image-to-image diffusion.
pipe_img2img = AutoPipelineForImage2Image.from_pretrained(\
"runwayml/stable-diffusion-v1-5", safety_checker=None,\
torch_dtype=my_dtype, use_safetensors=True).to(my_device)
One application of this approach is to generate variations on a theme. A concept artist could use this technique to quickly iterate on different ideas for illustrating an exoplanet based on the latest research.
First we will download a public-domain artist's concept of planet 1e in the TRAPPIST-1 system (credit: NASA/JPL-Caltech).
Then, after scaling it down to remove detail, we will use a diffusion pipeline to create several different versions of the TRAPPIST-1e exoplanet.
url = \
"https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/TRAPPIST-1e_artist_impression_2018.png/600px-TRAPPIST-1e_artist_impression_2018.png"
img_path = url.split("/")[-1]
if not os.path.exists(img_path):
    os.system(f"wget '{url}'")
init_image = Image.open(img_path)
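# downscale the concept art before diffusion so fine detail is discarded
# (the 512x512 target size is an illustrative choice)
init_image = init_image.convert("RGB").resize((512, 512))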
seed_all(my_seed)
trappist_prompt = "Artist's impression of TRAPPIST-1e"\
"large Earth-like water-world exoplanet with oceans,"\
"NASA, artist concept, realistic, detailed, intricate"
my_negative_prompt = "cartoon, sketch, orbiting moon"
my_output_trappist1e = pipe_img2img(prompt=trappist_prompt, num_images_per_prompt=9, \
image=init_image, negative_prompt=my_negative_prompt, guidance_scale=6.0)
grid_show(my_output_trappist1e.images)
plt.show()
<img decoding="async" alt="diffusion image variation test" src="https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/blt3a1f4de2890c7936/66060b30f407ddbc2a314310/EXX-Blog-explaining-text-to-image-generative-ai-4.jpg?format=webp"/>
By feeding the model an example initial image, we can generate similar images. You can also use a text-guided image-to-image pipeline to change the style of an image by turning up the guidance scale and adding negative prompts such as "unrealistic," "watercolor," or "paper sketch." Your mileage may vary, and adjusting your prompts will be the easiest way to find the image you want to create.
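As a rough sketch of that kind of adjustment (the guidance value and negative prompt below are illustrative choices, re-using the image-to-image pipeline and prompt from above):
# Sketch: pushing the style with a higher guidance scale and style terms
# as negative prompts (the specific values here are illustrative).
style_output = pipe_img2img(prompt=trappist_prompt, image=init_image,
        negative_prompt="unrealistic, watercolor, paper sketch",
        guidance_scale=12.0, num_images_per_prompt=4)
grid_show(style_output.images)
plt.show()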
Conclusions
Despite the discourse around diffusion models and their imitation of human-generated art, diffusion models serve other, more impactful purposes. Diffusion has been applied to protein folding prediction for protein design and drug development. Text-to-video is also an active area of research and is offered by several companies (e.g. Stability AI, Google). Diffusion is also an emerging approach for text-to-speech applications.
It is clear that the diffusion process is taking a central role in the evolution of AI and in the technology's interaction with the broader human environment. The complexities of copyright and other intellectual property law, along with the impact on human art and science, are evident in both positive and negative ways. But the real positive is AI's unprecedented ability to understand language and generate images. It was AlexNet that let computers analyze an image and generate text, and only now can computers analyze textual prompts and generate coherent images.
Original. Republished with permission.
Kevin Vu manages the Exxact Corp. blog and works with many of its talented authors who write about different aspects of deep learning.