The recent release of Black Forest Labs' Flux model made waves thanks to its impressive image generation capabilities. However, the model is heavy and does not fit on an end-user or free-tier machine, which pushed most usage toward API platforms that host the model externally rather than loading it locally. Organizations that prefer to host their models themselves face high GPU costs. Thankfully, the Hugging Face team has added BitsAndBytes quantization support to the Diffusers library, which means we can now run Flux inference on a machine with 8 GB of GPU RAM.
Learning objectives
- Understand the process of configuring dependencies to work with FLUX in a Colab environment.
- Demonstrate how to encode a text prompt using a 4-bit quantized text encoder to reduce memory usage.
- Implement memory-efficient techniques to load and run mixed-precision image generation models on GPUs.
- Generate images from text prompts using the FLUX pipeline in Colab.
This article was published as part of the Data Science Blogathon.
What is Flux?
Flux is a series of advanced text-to-image and image-to-image models created by Black Forest Labs, the team behind Stable Diffusion. It can be seen as the next step in the evolution of text-to-image models, incorporating cutting-edge techniques. As a successor to Stable Diffusion, Flux improves on it in both performance and output quality.
As mentioned in the introduction, Flux can be quite expensive to run on commodity hardware. However, users with limited GPU memory can apply optimizations to run it in a more memory-friendly manner. In this article, we will see how Flux benefits from quantization, using pre-quantized checkpoints loaded with BitsAndBytes.
Flux comes in two main distillation variants, timestep-distilled and guidance-distilled, and its architecture is built on several advanced components:
- Two pre-trained text encoders: Flux uses CLIP and T5 text encoders to better understand and translate text prompts into images, giving it a superior grasp of prompt semantics.
- Transformer-based DiT model: A diffusion transformer (DiT) acts as the denoising backbone, delivering high-quality generation with more efficient and accurate denoising.
- Variational Autoencoder (VAE): Instead of removing pixel-level noise, Flux operates in a latent space, similar to Stable Diffusion, reducing computational load while maintaining high output quality.
Flux comes in multiple variants:
- Flux Schnell: A fast, distilled, open-source version available on Hugging Face.
- Flux Dev: An open-weights model released under a more restrictive, non-commercial license.
- Flux Pro: A closed-source version accessible through various APIs.
These features allow Flux to surpass many of its predecessors, offering a more refined and flexible image generation experience.
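For orientation, this is roughly what a plain, unquantized load of Flux-Dev looks like with Diffusers; a minimal sketch that assumes a GPU with far more VRAM than the 8 GB this guide targets, shown only as a point of comparison for the quantized workflow below.

import torch
from diffusers import FluxPipeline

# Unquantized reference load: the bf16 weights alone need well over 8 GB of VRAM
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # park idle components on the CPU

image = pipe("a cute dog in paris photoshoot", num_inference_steps=50).images[0]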
Why is quantization important?
If you're familiar with running large language models (LLMs) locally, you may have encountered quantization before. Although less commonly used for image models, quantization is a powerful technique that reduces the size of a model by storing its parameters in fewer bits, shrinking the memory footprint with minimal impact on output quality. Typically, neural network parameters are stored in 32 bits (full precision), but quantization can reduce them to as few as 4 bits. This reduction in precision allows large models like Flux to run on commodity hardware.
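To make that concrete, here is a back-of-the-envelope calculation. The 12-billion-parameter figure for the Flux transformer is the commonly cited size and is used here purely as an illustration; only the weights are counted, not activations or the text encoders.

# Approximate memory needed just to store ~12B transformer weights at different precisions
params = 12e9
for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    gigabytes = params * bits / 8 / 1024**3
    print(f"{name}: ~{gigabytes:.1f} GB")
# fp32: ~44.7 GB, fp16: ~22.4 GB, 4-bit: ~5.6 GB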
Quantization with BitsAndBytes
A key innovation that makes it possible to run Flux on an 8 GB GPU is quantization, powered by the BitsAndBytes library. This library brings k-bit quantization to PyTorch models and offers three main features that dramatically reduce memory consumption for inference and training.
The Diffusers library, which powers imaging models like Flux, recently added support for this quantization technique. As a result, you can now generate complex images directly on your laptop or platforms like Google Colab's free tier using just 8GB of GPU RAM.
How does BitsAndBytes work?
BitsAndBytes is the ideal choice for quantizing models to 8-bit and 4-bit precision. The 8-bit quantization process multiplies the outliers in fp16 with the non-outliers in int8, converts the non-outliers back to fp16, and then sums them to return the weights in fp16. This approach minimizes the degrading effect that outliers have on a model's performance. 4-bit quantization compresses the model even further and is commonly used with QLoRA to fine-tune quantized LLMs.
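The walkthrough below avoids on-the-fly quantization by pulling a pre-quantized NF4 checkpoint, but for completeness, quantizing the Flux transformer yourself through Diffusers' BitsAndBytes integration looks roughly like this; a sketch that assumes a recent Diffusers build with quantization support (installed in the next section), not the exact code used later.

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# 4-bit NF4 quantization applied at load time via bitsandbytes
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.float16,
)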
In this guide, we'll show how you can load and run Flux using 4-bit quantization, dramatically reducing memory requirements.
Flux configuration on consumer hardware
STEP 1: Set up the environment
To get started, make sure your machine is running in a GPU-enabled environment (such as an NVIDIA T4 or L4 GPU). Let's dive into the technical steps to run Flux on a machine with only 8GB of GPU memory (your free Google Colab!).
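Before installing anything, it is worth confirming that a GPU is actually attached to the runtime; torch ships preinstalled on Colab, and the device name you see (for example "Tesla T4") will vary.

import torch

# Should print True and the attached GPU's name on a GPU runtime
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")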
!pip install -Uq git+https://github.com/huggingface/diffusers@main
!pip install -Uq git+https://github.com/huggingface/transformers@main
!pip install -Uq bitsandbytes
!pip install -Uq accelerate  # needed for CPU offloading via enable_model_cpu_offload()
These packages provide all the tools needed to run Flux memory-efficiently: loading the pre-trained text encoders, handling efficient model loading and CPU offloading, and quantizing large models to fit on smaller hardware. Next, we import the dependencies.
import diffusers
import transformers
import bitsandbytes as bnb
from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch
import gc
STEP 2: GPU memory management
We need every bit of memory we can get. To ensure smooth operation and avoid wasted memory, we define a function that clears GPU memory between model loads. It empties the GPU cache and resets memory statistics, ensuring optimal resource usage throughout the notebook.
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024

flush()
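The bytes_to_giga_bytes helper pairs with torch.cuda.max_memory_allocated() whenever you want to report peak GPU usage after a heavy step; a typical (illustrative) usage looks like this.

# Call after a model load or inference step to see the peak allocation since the last reset
print(f"Max GPU memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated()):.2f} GB")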
STEP 3: Loading the T5 text encoder in 4-bit mode
Flux uses two pre-trained text encoders: CLIP and T5. To minimize memory usage, we load the T5 encoder in 4-bit quantized form, which cuts the memory it requires by almost 90% compared with full precision.
# Checkpoints
ckpt_id = "black-forest-labs/FLUX.1-dev"
ckpt_4bit_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
prompt = "a cute dog in paris photoshoot"

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    ckpt_4bit_id,
    subfolder="text_encoder_2",
)
With the T5 encoder loaded, we can now move on to the next step: generating text embeddings. This step dramatically reduces memory consumption, allowing us to load the encoder on a resource-constrained machine.
STEP 4: Generate text embeddings
Now that the 4-bit quantized T5 text encoder is loaded, we can encode the text prompt. This converts the input prompt into embeddings, which will then be used to guide the image generation process.
Now, we load the Flux pipeline with only the T5 encoder and enable CPU offloading. This technique helps balance memory usage by moving large parameters that do not fit in GPU memory to the CPU.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2_4bit,
    transformer=None,
    vae=None,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()  # CPU offloading, as described above

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=256
    )

# Free the text-encoder pipeline before loading the transformer and VAE
del pipeline
flush()
After encoding, the prompt embeddings are stored in prompt_embeds, which will condition the model during generation. This step converts the prompt into a form the model can understand and use to generate images.
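If you want to sanity-check the result before moving on, you can print the tensor shapes. The dimensions noted in the comments are what we would expect from the T5 and CLIP encoders with max_sequence_length=256; treat them as illustrative rather than guaranteed.

# prompt_embeds come from T5, pooled_prompt_embeds from CLIP
print(prompt_embeds.shape)         # expected: torch.Size([1, 256, 4096])
print(pooled_prompt_embeds.shape)  # expected: torch.Size([1, 768])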
STEP 5: Loading the 4-bit Transformer and the VAE
With the text embeddings ready, we now load the remaining parts of the model: the transformer and the VAE. The transformer comes from the pre-quantized 4-bit checkpoint, while the small VAE is loaded by the pipeline in fp16, keeping overall memory usage to a minimum.
transformer_4bit = FluxTransformer2DModel.from_pretrained(ckpt_4bit_id, subfolder="transformer")

pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
This step completes the model loading, and you are ready to generate images on an 8 GB machine.
STEP 6: Generating the image
print("Running denoising.")
height, width = 512, 768
images = pipeline(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=50,
guidance_scale=5.5,
height=height,
width=width,
output_type="pil",
).images
# Display the image
images(0)
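In a notebook, the final expression above renders the image inline. To keep the result on disk and see how much GPU memory the run peaked at, you can add something like the following; the filename is arbitrary and the memory figure will vary by runtime.

# Save the generated image and report peak GPU memory using the helper from Step 2
images[0].save("flux_dog_in_paris.png")
print(f"Peak GPU memory: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated()):.2f} GB")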
The future of on-device image generation
This progress in quantization and efficient model management brings us closer to a future where powerful AI models run directly on consumer hardware. You no longer need access to high-end GPUs, expensive cloud resources, or paid serverless API calls. With improvements in the underlying technology and quantization techniques like BitsAndBytes, the possibilities for democratized AI are vast. Whether you are a hobbyist, developer, or researcher, these advancements make it easier than ever to create, experiment, and innovate in image generation.
Conclusion
With the introduction of Flux and the clever use of quantization, you can now generate striking images on hardware as modest as an 8 GB GPU. This is an important step toward making advanced AI accessible to a broader audience, and the technology will only get better from here. While full-precision models require more memory and resources, techniques such as 4-bit quantization offer a practical way to deploy large models on constrained systems. The approach applies not only to Flux but to other large models as well, opening the door to high-quality AI generation on smaller, more affordable hardware. So grab your laptop, set up Flux, and start creating!
If you are looking for an online generative AI course, explore the GenAI Pinnacle Program.
Key takeaways
- FLUX is a powerful text-to-image generation model that can be run efficiently in Colab by using memory optimization techniques such as 4-bit quantization and mixed precision.
- You can take advantage of tools like diffusers and transformers to streamline the process of generating images from text prompts.
- Efficient memory management allows large models to run with limited resources, such as Colab GPUs.
Resources
- Flux
- Flux image generation with Diffusers
- BitsAndBytes
- Black Forest Labs
The media shown in this article is not the property of Analytics Vidhya and is used at the author's discretion.
Frequently asked questions
Q1. Why use 4-bit quantization for FLUX?
Answer. 4-bit quantization reduces the model's memory footprint, allowing large models like FLUX to run more efficiently on limited resources, such as Colab GPUs.
Q2. How do I change the prompt to generate a different image?
Answer. Simply replace the prompt variable in the script with the new text description you want the model to render. For example, changing it to "A serene landscape with mountains" will generate an image of that scene.
Q3. How can I adjust the quality or detail of the generated image?
Answer. You can adjust num_inference_steps (controls quality) and guidance_scale (controls how closely the image follows the prompt) in the pipeline call. Higher values generally produce better quality and more detailed images, but may also take longer to generate.
Q4. What should I do if I run out of memory or hit errors in Colab?
Answer. Make sure you are running the notebook on a GPU and using the mixed-precision and 4-bit quantization settings. If errors persist, consider reducing num_inference_steps or keeping CPU offloading enabled to reduce memory usage.
Q5. Can I run this script on my local machine?
Answer. Yes, you can run this script on any machine that has Python and the necessary libraries installed. Make sure your local machine has enough GPU and memory resources if you are working with large models like FLUX.