In recent years, diffusion models have shown great success in text-to-image generation, achieving high image quality, improved inference performance, and expanding our creative inspiration. However, it is still a challenge to efficiently control the generation, especially with conditions that are difficult to describe with text.
Today, we announce MediaPipe diffusion plugins, which enable controllable text-to-image generation to run on-device. Expanding on our prior work on GPU inference for on-device large generative models, we introduce new low-cost solutions for controllable text-to-image generation that can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
Text-to-image generation with control plugins running on-device.
Background
With diffusion models, image generation is modeled as an iterative denoising process. Starting from a noise image, at each step the diffusion model gradually removes noise to reveal an image of the target concept. Research shows that leveraging language understanding via text prompts can greatly improve image generation. For text-to-image generation, the text embedding is connected to the model via cross-attention layers. Yet, some information is difficult to describe with text prompts, e.g., the position and pose of an object. To address this problem, researchers add additional models into the diffusion model to inject control information from a condition image.
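To make the iterative denoising concrete, here is a minimal, self-contained sketch of the loop described above. The `denoiser` is a dummy stand-in for a trained diffusion UNet and the update rule is deliberately simplified; a real pipeline would use a proper noise scheduler (e.g., DDIM) and decode the final latent with a VAE.

```python
import torch

# Hypothetical stand-in: a real pipeline uses a trained UNet that predicts the
# noise in `latent` at timestep `t`, conditioned on the text embedding through
# cross-attention layers (omitted here).
def denoiser(latent, text_embedding, t):
    return 0.1 * latent

text_embedding = torch.randn(1, 77, 768)   # e.g., a CLIP text-encoder output
latent = torch.randn(1, 4, 64, 64)         # start from pure noise

num_steps = 50
for t in reversed(range(num_steps)):
    noise_pred = denoiser(latent, text_embedding, t)
    # Each step removes part of the predicted noise, gradually revealing an
    # image of the target concept (toy update rule, not a real scheduler).
    latent = latent - noise_pred / (t + 1)

# A real pipeline would now decode `latent` to pixels with a VAE decoder.
```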
Common approaches for controlled text-to-image generation include Plug-and-Play, ControlNet, and T2I Adapter. Plug-and-Play applies a widely used denoising diffusion implicit model (DDIM) inversion approach that reverses the generation process starting from an input image to derive an initial noise input, and then employs a copy of the diffusion model (860M parameters for Stable Diffusion 1.5) to encode the condition from an input image. Plug-and-Play extracts spatial features with self-attention from the copied diffusion model and injects them into the text-to-image diffusion model. ControlNet creates a trainable copy of the encoder of a diffusion model, which connects via a convolution layer with zero-initialized parameters to encode conditioning information that is conveyed to the decoder layers. However, as a result, the size is large: half that of the diffusion model (430M parameters for Stable Diffusion 1.5). T2I Adapter is a smaller network (77M parameters) and achieves similar effects in controllable generation. T2I Adapter only takes the condition image as input, and its output is shared across all diffusion iterations. Still, the adapter model is not designed for portable devices.
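The zero-initialized convolution mentioned above is the trick that lets a ControlNet-style encoder copy be attached to a frozen model without disturbing it at the start of training. Below is a hedged, illustrative sketch; the channel counts and shapes are our own assumptions, not ControlNet's exact configuration.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at zero, so the control
    branch initially contributes nothing and training can grow its influence."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

# Illustrative shapes only: a ControlNet-style setup adds the zero-conv'ed
# output of a trainable encoder copy to the frozen decoder features of the
# base diffusion model.
channels = 320
zero_conv = ZeroConv2d(channels)
control_features = torch.randn(1, channels, 64, 64)   # from the encoder copy
decoder_features = torch.randn(1, channels, 64, 64)   # from the frozen base model

fused = decoder_features + zero_conv(control_features)
print(torch.allclose(fused, decoder_features))  # True at initialization
```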
MediaPipe Diffusion Plugins
To make conditioned generation efficient, customizable, and scalable, we design the MediaPipe diffusion plugin as a separate network that is:
- Pluggable: It can be easily connected to a pre-trained base model.
- Trained from scratch: It does not use pre-trained weights from the base model.
- Portable: It runs outside the base model on mobile devices, with negligible cost compared to base-model inference.
| Method | Parameter size | Pluggable | Trained from scratch | Portable |
| --- | --- | --- | --- | --- |
| Plug-and-Play | 860M* | ✔️ | ❌ | ❌ |
| ControlNet | 430M* | ✔️ | ❌ | ❌ |
| T2I Adapter | 77M | ✔️ | ✔️ | ❌ |
| MediaPipe Plugin | 6M | ✔️ | ✔️ | ✔️ |
Comparison of Plug-and-Play, ControlNet, T2I Adapter, and the MediaPipe diffusion plugin. * The number of parameters varies with the specifics of the diffusion model.
The MediaPipe diffusion plugin is a portable on-device model for text-to-image generation. It extracts multiscale features from a conditioning image, which are added to the encoder of a diffusion model at the corresponding levels. When connected to a text-to-image diffusion model, the plugin model can provide an extra conditioning signal for image generation. We design the plugin network to be a lightweight model with only 6M parameters. It uses depthwise convolutions and inverted bottlenecks from MobileNetV2 for fast inference on mobile devices.
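The plugin architecture is not spelled out layer by layer in this post; the sketch below only illustrates the general shape of such a network, combining MobileNetV2-style inverted bottlenecks built on depthwise convolutions to emit one feature map per encoder level. All channel counts and strides are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """MobileNetV2-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.skip else y

class ControlPlugin(nn.Module):
    """Maps a condition image to multiscale features, one per encoder level."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        self.stages = nn.ModuleList(
            [InvertedBottleneck(channels[i], channels[i + 1], stride=2)
             for i in range(len(channels) - 1)]
        )

    def forward(self, cond_image):
        feats, x = [], self.stem(cond_image)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # added to the diffusion encoder at matching scales
        return feats

plugin = ControlPlugin()
features = plugin(torch.randn(1, 3, 512, 512))
print([f.shape for f in features])
```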
Unlike ControlNet, we inject the same control features into all diffusion iterations. That is, we only run the plugin once per image generation, which saves computation. Below we illustrate some intermediate results of a diffusion process. The control is effective at every diffusion step and enables controlled generation even at early steps. More iterations improve the alignment of the image with the text prompt and generate more details.
Illustration of the generation process using the MediaPipe diffusion plugin.
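Extending the earlier denoising sketch, the snippet below highlights the run-once design: the plugin's features are computed a single time and reused at every step, whereas a ControlNet-style encoder copy would have to run inside the loop. All components are dummy stand-ins with illustrative shapes.

```python
import torch

# Hypothetical stand-ins for a trained control plugin, diffusion UNet, and sampler.
def plugin(cond_image):
    return [torch.zeros(1, c, s, s) for c, s in ((64, 32), (128, 16))]

def denoiser(latent, text_emb, control_feats, t):
    return 0.1 * latent  # dummy noise prediction

def scheduler_step(latent, noise_pred, t):
    return latent - noise_pred / (t + 1)  # toy update rule

cond_image = torch.randn(1, 3, 256, 256)
text_emb = torch.randn(1, 77, 768)
latent = torch.randn(1, 4, 32, 32)

# The plugin runs ONCE per generated image ...
control_feats = plugin(cond_image)

# ... and its features are reused at every denoising step, whereas a
# ControlNet-style encoder copy would run inside this loop at every step.
for t in reversed(range(20)):
    noise_pred = denoiser(latent, text_emb, control_feats, t)
    latent = scheduler_step(latent, noise_pred, t)
```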
Examples
In this work, we develop plugins for a diffusion-based text-to-image generation model with MediaPipe for face landmark, holistic landmark, depth maps, and Canny edge. For each task, we select about 100K images from a web-scale image-text dataset and compute control signals using the corresponding MediaPipe solutions. We use refined captions from PaLI to train the plugins.
Face landmark
The MediaPipe Face Landmarker task computes 478 landmarks (with attention) of a human face. We use the drawing utilities in MediaPipe to render a face, including face contour, mouth, eyes, eyebrows, and irises, with different colors. The following table shows randomly generated samples conditioned on the face mesh and text prompts. As a comparison, both ControlNet and the plugin can control text-to-image generation with given conditions.
Face landmark plugin for text-to-image generation, compared with ControlNet.
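As a hedged illustration of how a face-mesh condition image can be rendered, the snippet below uses MediaPipe's legacy Python Solutions API; the post itself refers to the MediaPipe Face Landmarker task, but the legacy `FaceMesh` solution with `refine_landmarks=True` also yields 478 landmarks. The file paths and drawing choices here are our own.

```python
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
mp_drawing = mp.solutions.drawing_utils
mp_styles = mp.solutions.drawing_styles

image = cv2.imread("face.jpg")  # hypothetical input path

# refine_landmarks=True enables the attention model, yielding 478 landmarks
# that include the irises.
with mp_face_mesh.FaceMesh(static_image_mode=True, refine_landmarks=True) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

canvas = 0 * image  # black canvas; the rendered mesh becomes the condition image
if results.multi_face_landmarks:
    for landmarks in results.multi_face_landmarks:
        mp_drawing.draw_landmarks(
            image=canvas,
            landmark_list=landmarks,
            connections=mp_face_mesh.FACEMESH_CONTOURS,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_styles.get_default_face_mesh_contours_style(),
        )
        mp_drawing.draw_landmarks(
            image=canvas,
            landmark_list=landmarks,
            connections=mp_face_mesh.FACEMESH_IRISES,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_styles.get_default_face_mesh_iris_connections_style(),
        )

cv2.imwrite("face_condition.png", canvas)
```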
Holistic landmark
The MediaPipe Holistic Landmarker task includes landmarks of body pose, hands, and face mesh. Below, we generate various stylized images by conditioning on the holistic features.
Holistic landmark plugin for text-to-image generation.
Depth
Depth plugin for text-to-image generation.
Canny edge
Canny-edge plugin for text-to-image generation.
Evaluation
We conduct a quantitative study of the face landmark plugin to demonstrate the model's performance. The evaluation dataset contains 5K human images. We compare the generation quality as measured by the widely used metrics Fréchet Inception Distance (FID) and CLIP scores. The base model is a pre-trained text-to-image diffusion model; we use Stable Diffusion v1.5 here.
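The post does not publish its evaluation code; as a rough, hedged sketch of how FID and CLIP scores can be computed, the snippet below uses the `torchmetrics` package (installed with its image and multimodal extras, plus `transformers` for the CLIP backbone). The batch sizes, images, and captions are placeholders; a real evaluation would use the full 5K image set.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Dummy uint8 image batches stand in for the evaluation set and the generated
# samples; captions are illustrative placeholders.
real_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)
captions = ["a portrait of a person"] * 8

# FID: lower is better. A real evaluation accumulates thousands of images.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

# CLIP score: higher is better (image-text alignment).
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip_score(fake_images, captions).item())
```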
As shown in the table below, both ControlNet and the MediaPipe diffusion plugin produce much better sample quality than the base model, in terms of FID and CLIP scores. Unlike ControlNet, which must run at every diffusion step, the MediaPipe plugin runs only once per generated image. We measured the performance of all three models on a server machine (with an Nvidia V100 GPU) and a mobile phone (Galaxy S23). On the server, we run all three models with 50 diffusion steps, and on mobile, we run 20 diffusion steps using the MediaPipe image generation application. Compared with ControlNet, the MediaPipe plugin shows a clear advantage in inference efficiency while preserving sample quality.
| Model | FID↓ | CLIP↑ | Inference time (s), Nvidia V100 | Inference time (s), Galaxy S23 |
| --- | --- | --- | --- | --- |
| Base | 10.32 | 0.26 | 5.0 | 11.5 |
| Base + ControlNet | 6.51 | 0.31 | 7.4 (+48%) | 18.2 (+58.3%) |
| Base + MediaPipe Plugin | 6.50 | 0.30 | 5.0 (+0.2%) | 11.8 (+2.6%) |
Quantitative comparison of FID, CLIP, and inference time.
We tested the plugin's performance on a wide range of mid-to-high-end mobile devices. We list results for some representative devices in the table below, covering both Android and iOS.
| Device | Pixel 4 | Pixel 6 | Pixel 7 | Galaxy S23 | iPhone 12 Pro | iPhone 13 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Time (ms) | 128 | 68 | 50 | 48 | 73 | 63 |
Inference time (ms) of the plugin on different mobile devices.
Conclusion
In this post, we introduce MediaPipe diffusion plugins, portable plugins for conditioned text-to-image generation. The plugin injects features extracted from a condition image into a diffusion model and controls image generation accordingly. Portable plugins can be connected to pre-trained diffusion models running on servers or devices. By running text-to-image generation and plugins fully on-device, we enable more flexible applications of generative AI.
Acknowledgments
We would like to thank all team members who contributed to this work: Raman Sarokin and Juhyun Lee for the GPU inference solution; Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, and Matthias Grundmann for their leadership. Special thanks to Jiuqiang Tang, Joe Zou, and Lu Wang, who made this technology and all the demos run on-device.