Fashion photographs are ubiquitous on online platforms, including social media and e-commerce websites. As static images, however, they are limited in how much they can convey about a garment, particularly how it fits and moves on a person's body.
By contrast, fashion videos offer a fuller, more immersive experience, showing the texture of the fabric, the way it drapes and flows, and other essential details that are difficult to capture through photos.
Fashion videos can be an invaluable resource for consumers looking to make informed purchasing decisions. They offer a deeper look at clothing in action, allowing shoppers to better assess its suitability for their needs and preferences. Despite these benefits, fashion videos remain relatively rare, and many brands and retailers still rely primarily on photography to showcase their products. As the demand for more engaging and informative content grows, the production of high-quality fashion videos is likely to increase across the industry.
A novel way to address this gap comes from Artificial Intelligence (AI). It is called DreamPose, a new approach to transforming fashion photos into realistic animated videos.
The method is a diffusion-based video synthesis model built on Stable Diffusion. Given one or more images of a person and a corresponding pose sequence, DreamPose can generate a high-fidelity, realistic video of the subject in motion. An overview of the authors' workflow is shown in the figure below.
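To make the pipeline concrete, the sketch below shows what pose- and image-conditioned sampling could look like with diffusers-style components (a UNet, a noise scheduler, and a VAE). This is a minimal illustration under those assumptions, not the authors' released code; all names and shapes are placeholders.

```python
import torch

@torch.no_grad()
def animate(image_embedding, pose_sequence, unet, scheduler, vae, steps=50):
    """Hypothetical DreamPose-style sampling: one denoising pass per target
    pose, each conditioned on the same subject-image embedding."""
    frames = []
    for pose in pose_sequence:                # pose: (1, C_pose, 64, 64) map
        latents = torch.randn(1, 4, 64, 64)   # start each frame from Gaussian noise
        scheduler.set_timesteps(steps)
        for t in scheduler.timesteps:
            # The pose map is spatially aligned with the output frame, so it
            # is concatenated channel-wise with the noisy latents; the subject
            # image enters through cross-attention via its embedding.
            unet_input = torch.cat([latents, pose], dim=1)
            noise_pred = unet(unet_input, t,
                              encoder_hidden_states=image_embedding).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        # 0.18215 is the usual Stable Diffusion latent scaling factor
        frames.append(vae.decode(latents / 0.18215).sample)
    return frames
```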
The task of generating high-quality, realistic videos from images poses several challenges. While image diffusion models have shown impressive results in terms of quality and fidelity, the same cannot be said for video diffusion models, which are often limited to generating simple motions or cartoon-like imagery. Existing video diffusion models also suffer from several problems, including poor temporal consistency, motion instability, lack of realism, and limited control over motion in the target video. These limitations stem in part from the fact that existing models are conditioned primarily on text rather than on other cues, such as motion, that can provide more precise control.
In contrast, DreamPose takes advantage of a pose-and-image conditioning scheme to achieve greater appearance fidelity and frame-to-frame consistency. This approach overcomes many of the shortcomings of existing video diffusion models and enables the production of high-quality videos that accurately capture the movement and appearance of the input subject.
The model is fine-tuned from a pretrained image diffusion model that already captures the distribution of natural images very well. Starting from such a model reduces the animation task to identifying the subspace of natural images consistent with the conditioning signals. To achieve this, the Stable Diffusion architecture was modified: the encoder and conditioning mechanisms were redesigned to support the two added signals, the target pose (which is spatially aligned with the output frame) and the subject image (which is not). A sketch of these changes follows.
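The sketch below illustrates, under stated assumptions, the two architectural changes just described: widening the UNet's input convolution so pose maps can be concatenated with the noisy latents, and a small adapter that fuses CLIP image embeddings with VAE embeddings for cross-attention conditioning. Class and parameter names are hypothetical, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

def widen_unet_input(unet, extra_channels):
    """Expand the UNet's first conv to accept pose channels concatenated with
    the noisy latents. New weights are zero-initialized so the pretrained
    behavior is unchanged at the start of fine-tuning."""
    old = unet.conv_in
    new = nn.Conv2d(old.in_channels + extra_channels, old.out_channels,
                    kernel_size=old.kernel_size, padding=old.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :old.in_channels] = old.weight
        new.bias.copy_(old.bias)
    unet.conv_in = new

class ImageConditioningAdapter(nn.Module):
    """Hypothetical adapter that fuses CLIP image embeddings (global
    semantics) with VAE embeddings (fine appearance detail) into one token
    sequence for the UNet's cross-attention layers."""
    def __init__(self, clip_dim=1024, vae_dim=1024, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(clip_dim + vae_dim, out_dim)

    def forward(self, clip_tokens, vae_tokens):
        # both inputs: (batch, seq_len, dim); output feeds cross-attention
        return self.proj(torch.cat([clip_tokens, vae_tokens], dim=-1))
```

Zero-initializing the widened convolution is a common trick when adding conditioning channels to a pretrained network: the model initially ignores the new signal and gradually learns to use it during fine-tuning.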
Furthermore, DreamPose employs a two-stage fine-tuning process for the UNet and VAE components: a general fine-tuning pass on the training data, followed by a subject-specific pass on one or more input images. This optimizes the model to generate high-quality, realistic video that accurately captures the appearance and movement of the input subject. A sketch of the schedule is shown below.
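The following is a hedged sketch of that two-phase schedule. The optimizer, learning rate, and exact sets of trainable parameters are assumptions for illustration; `diffusion_loss` stands in for the standard noise-prediction objective.

```python
import itertools
import torch

def finetune(unet, adapter, vae, full_dataset, subject_frames, diffusion_loss):
    # Phase 1: fine-tune the UNet and conditioning adapter on the full video
    # dataset so the model learns pose-conditioned frame synthesis.
    opt = torch.optim.AdamW(
        itertools.chain(unet.parameters(), adapter.parameters()), lr=1e-5)
    for batch in full_dataset:
        loss = diffusion_loss(unet, adapter, vae, batch)
        opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2: subject-specific refinement on the input image(s); including
    # the VAE decoder here sharpens the subject's identity and garment detail.
    opt = torch.optim.AdamW(
        itertools.chain(unet.parameters(), adapter.parameters(),
                        vae.decoder.parameters()), lr=1e-5)
    for batch in subject_frames:
        loss = diffusion_loss(unet, adapter, vae, batch)
        opt.zero_grad(); loss.backward(); opt.step()
```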
Some example results reported by the authors are illustrated in the figure below, which also includes a comparison between DreamPose and state-of-the-art techniques.
This was a brief overview of DreamPose, a novel AI framework for synthesizing photorealistic fashion videos from a single input image. If you are interested, you can learn more about this technique at the links below.
Check out the Paper, Code, and Project page.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler ATHENA Laboratory and his research interests include adaptive video streaming, immersive media, machine learning and QoS / QoE evaluation.