There has been notable progress in the text-to-image domain, sparking a surge of enthusiasm within the research community to expand into 3D generation. This excitement is largely due to the emergence of approaches that make use of pre-trained 2D text-to-image diffusion models.
An important development in this area is DreamFusion, which introduced the Score Distillation Sampling (SDS) algorithm. SDS has made a big difference because it can create a wide variety of 3D objects from text prompts alone. Despite its revolutionary approach, it comes with its own set of challenges. A significant limitation is its weak control over the geometry and texture of the generated models, which often leads to issues such as oversaturated colors and multi-faced geometry.
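To make the idea concrete, here is a minimal PyTorch sketch of one SDS update: noise a rendered image, ask a frozen diffusion prior to predict that noise given the text, and push the 3D parameters to close the gap. The `renderer` and `unet` callables, the timestep range, and the weighting are illustrative assumptions for this sketch, not DreamFusion's exact implementation.

```python
import torch

def sds_step(renderer, unet, text_emb, alphas_cumprod, optimizer):
    """One Score Distillation Sampling update (sketch).

    renderer       -- differentiable renderer: 3D params -> image tensor
    unet           -- frozen pre-trained diffusion model predicting noise
    text_emb       -- embedding of the text prompt (conditioning)
    alphas_cumprod -- 1-D tensor of the diffusion schedule's cumulative alphas
    """
    x = renderer()                                     # render; grads flow to the 3D params
    t = torch.randint(20, 980, (1,), device=x.device)  # random diffusion timestep
    eps = torch.randn_like(x)                          # the noise we will try to recover
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * eps    # forward-diffused render

    with torch.no_grad():                              # the 2D prior stays frozen
        eps_pred = unet(x_t, t, text_emb)              # prior's noise prediction

    w = 1.0 - a_t                                      # common timestep weighting
    grad = w * (eps_pred - eps)                        # SDS gradient on the image
    optimizer.zero_grad()
    x.backward(gradient=grad)                          # push the gradient into the 3D params
    optimizer.step()
```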
Additionally, researchers have observed that simply making the text prompts more elaborate does not improve the results.
To combat these challenges, researchers have introduced IT3D, an enhanced methodology for text-to-3D generation. The method centers on generating images of the desired 3D model from multiple viewpoints and using these images to refine the 3D object. The process starts with an existing text-to-3D model, such as DreamFusion, producing a coarse representation that captures the object's basic shape and spatial layout. IT3D then improves the rendered views with an image-to-image (I2I) generation process.
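As a rough illustration of the I2I stage, the sketch below refines renderings of the coarse model into a small posed dataset using Hugging Face diffusers' StableDiffusionImg2ImgPipeline. The model ID, strength, and guidance scale are assumptions for this sketch, not IT3D's exact configuration, and the input renders are assumed to be supplied by the caller.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Illustrative I2I refinement stage (not IT3D's exact setup).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def build_posed_dataset(coarse_renders, prompt):
    """coarse_renders: list of (camera_pose, PIL.Image) renders of the coarse model."""
    dataset = []
    for pose, render in coarse_renders:
        refined = pipe(
            prompt=prompt,
            image=render,
            strength=0.5,        # keep enough of the render to preserve the viewpoint
            guidance_scale=7.5,
        ).images[0]
        dataset.append((pose, refined))  # (pose, refined view) pairs for training
    return dataset
```

Keeping the denoising strength moderate is what lets each refined image stay aligned with its camera pose while still gaining detail from the 2D prior.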
IT3D supports different 3D output representations, such as meshes and NeRFs, and an additional strength is its ability to efficiently edit the appearance of 3D models using text inputs. The image above presents the IT3D pipeline. Starting from a coarse 3D model, IT3D first generates a small posed dataset using an image-to-image pipeline conditioned on renderings of the coarse model. It then incorporates a randomly initialized discriminator to distill knowledge from the generated dataset, updating the 3D model with a combination of a discrimination loss and the SDS loss.
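The combined update can be pictured as a GAN-style loop in which the refined dataset provides the "real" samples and fresh renders provide the "fake" ones. Below is a minimal PyTorch sketch under assumptions of our own: the discriminator architecture, the non-saturating logistic loss, and the `lambda_gan` weight are illustrative, not the paper's exact choices.

```python
import torch.nn.functional as F

def it3d_style_step(renderer, discriminator, opt_3d, opt_d,
                    real_view, pose, sds_loss_fn, lambda_gan=0.1):
    """One combined update (sketch): discrimination loss + SDS loss.

    real_view   -- refined image from the I2I-generated posed dataset
    pose        -- camera pose associated with that image
    sds_loss_fn -- callable returning an SDS-style loss on a rendered image
    """
    fake_view = renderer(pose)               # current render of the 3D model

    # Discriminator update: refined dataset images are "real", renders are "fake".
    loss_d = (F.softplus(-discriminator(real_view)).mean()
              + F.softplus(discriminator(fake_view.detach())).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 3D model update: fool the discriminator while staying close to the 2D prior.
    loss_gan = F.softplus(-discriminator(fake_view)).mean()
    loss_3d = lambda_gan * loss_gan + sds_loss_fn(fake_view)
    opt_3d.zero_grad()
    loss_3d.backward()
    opt_3d.step()
    return loss_d.item(), loss_3d.item()
```

Because the discriminator starts from a random initialization and trains alongside the 3D model, it can adapt to a small, imperfect dataset rather than demanding perfectly view-consistent ground truth.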
Moreover, the analysis shows that this method speeds up training: it needs fewer training steps while keeping the total training time comparable. As the image above shows, the method can also tolerate high-variance datasets. Finally, the empirical findings show that the proposed method significantly improves on the baseline models in terms of texture detail, geometry, and fidelity between the text prompts and the resulting 3D objects.
This technique offers a fresh perspective on text-to-3D generation and is the first work to combine a GAN with a diffusion prior to improve the text-to-3D task.
Check out the Paper and GitHub link. All credit for this research goes to the researchers on this project.