Because natural-language prompts are an intuitive way to specify desired 3D models, recent advances in text-to-image generation have sparked much interest in zero-shot text-to-3D generation. Such systems could increase the productivity of the 3D modeling workflow and lower the barrier to entry for beginners. Text-to-3D generation remains difficult, however, because, unlike the text-to-image setting, where paired data is abundant, it is impractical to collect large amounts of paired text and 3D data. To get around this data constraint, pioneering works such as CLIP-Mesh, Dream Fields, DreamFusion, and Magic3D optimize a 3D representation using pre-trained text-to-image models such as CLIP or image diffusion models. This enables text-driven 3D generation without any labeled 3D data.
Despite the enormous success of these works, the scenes they generate are generally limited to simple geometry and surreal styles. These limitations may stem from the priors used to optimize the 3D representation: the guidance derived from pre-trained image models enforces constraints only on high-level semantics while ignoring low-level details. In contrast, SceneScape and Text2Room, two recent concurrent efforts, directly use the color images produced by a text-to-image diffusion model to guide the reconstruction of 3D scenes. While these methods support the generation of realistic 3D scenes, they rely on explicit 3D mesh representations, whose limitations, including stretched geometry caused by naive triangulation and noisy depth estimation, mean they focus primarily on indoor scenes and are difficult to extend to outdoor ones. In this study, researchers from the University of Hong Kong present Text2NeRF, a text-driven 3D scene synthesis framework that combines a pre-trained text-to-image diffusion model with a Neural Radiance Field (NeRF), a 3D representation better suited to modeling diverse scenarios with complex geometry.
They chose NeRF as the 3D representation because of its strength in modeling fine-grained, realistic details across varied environments, which greatly reduces the artifacts induced by triangulated meshes. Rather than driving the 3D representation with purely semantic priors, as earlier techniques such as DreamFusion do, they use finer-grained image priors inferred from the diffusion model, which allows Text2NeRF to produce more delicate geometric structures and realistic textures in 3D scenes. In addition, by using a pre-trained text-to-image diffusion model as the image prior, they can optimize NeRF from scratch without any additional 3D supervision or multi-view training data.
The NeRF parameters are optimized using depth and content priors. More precisely, they use a monocular depth estimation method to provide the geometric prior of the generated scene, and the diffusion model to produce a text-related image as the content prior. Furthermore, to extend the single-view synthesis into a full 3D scene while ensuring consistency across viewpoints, they propose a progressive inpainting and updating (PIU) strategy. With PIU, the generated scene is expanded and refined view by view along a camera path: by rendering the updated NeRF, the newly inpainted region of the current view is reflected in the next view, ensuring that the same region is not inpainted again during scene expansion and maintaining the continuity and consistency of the generated scene. In short, the PIU strategy combined with the NeRF representation ensures that the diffusion model produces view-consistent images while building the 3D scene. They also observe that, due to the lack of multi-view constraints, training NeRF on a single view results in overfitting to that view, leading to geometric ambiguity during the view-by-view updates.
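As a rough illustration of the PIU idea, the view-by-view loop can be sketched as below. This is not the authors' code: the function names, the camera-path list, and the abstraction of scene content as named "regions" are all invented for clarity, standing in for rendered pixels and inpainting masks.

```python
def piu_loop(camera_path, visible_regions, inpaint, update_nerf):
    """Sketch of progressive inpainting and updating (PIU).

    camera_path     : ordered list of viewpoints along the camera trajectory
    visible_regions : viewpoint -> set of scene regions visible from that view
    inpaint         : stand-in for the diffusion model filling missing regions
    update_nerf     : stand-in for retraining the NeRF on the new content
    """
    inpainted_log = []
    covered = set()  # regions the NeRF can already render
    for view in camera_path:
        visible = visible_regions(view)
        missing = visible - covered      # holes the current NeRF cannot render
        if missing:
            inpaint(view, missing)       # diffusion model fills only the holes
            update_nerf(view)            # so the next rendered view sees them
            covered |= missing
            inpainted_log.append((view, frozenset(missing)))
    return inpainted_log


# Toy demo: three overlapping views; overlapping regions must not be
# inpainted twice, which is exactly what PIU guarantees.
views = {"v0": {"a", "b"}, "v1": {"b", "c"}, "v2": {"c", "d"}}
log = piu_loop(["v0", "v1", "v2"], lambda v: views[v],
               lambda v, m: None, lambda v: None)
```

The key property is that `covered` grows monotonically, so a region inpainted in one view is never re-inpainted from a later viewpoint, mirroring how rendering the updated NeRF propagates new content to subsequent views.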
To solve this problem, they build a support set of generated views that provides multi-view constraints for the NeRF model. In addition, inspired by prior depth-supervised NeRF work, they use an L2 depth loss alongside the RGB image loss to achieve depth-aware NeRF optimization and to improve the convergence rate and stability of the NeRF model. They also introduce a two-stage depth alignment strategy to align the depth values of the same point estimated from different viewpoints, since depth maps in separate views are predicted independently and may be inconsistent in overlapping regions. Thanks to these well-designed components, Text2NeRF can produce diverse high-fidelity, view-consistent 3D scenes from natural language descriptions.
Because the method is general, Text2NeRF can generate a wide variety of 3D scenes, including artistic, indoor, and outdoor scenes. It is also not limited in view range and can create 360-degree views. Extensive experiments show that Text2NeRF outperforms previous techniques both qualitatively and quantitatively. Their contributions can be summarized as follows: • They propose a text-driven framework for creating realistic 3D scenes that combines a diffusion model with NeRF representations and enables zero-shot generation of a variety of indoor and outdoor scenes from diverse natural language prompts.
• They propose the PIU strategy, which progressively generates novel view-consistent content for 3D scenes, and build a support set that provides multi-view constraints for the NeRF model during view-by-view updating.
• They introduce a two-stage depth alignment strategy to eliminate misalignment between depths estimated in different views, and employ a depth loss to achieve depth-aware NeRF optimization. The code will soon be released on GitHub.
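To make the depth-alignment idea concrete: a plausible global first stage is a least-squares scale-and-shift fit on the pixels where a new view overlaps already-reconstructed content. The sketch below is a hypothetical simplification, not the paper's procedure; the local second refinement stage is omitted entirely.

```python
def align_depth_global(d_new, d_ref, overlap_idx):
    """Fit scale s and shift t so that s * d_new + t best matches d_ref
    (least squares) on the overlapping pixels, then apply the fit to the
    whole new depth map. Illustrative only: real depth maps are 2D arrays
    and the alignment here is the global stage of a two-stage scheme."""
    xs = [d_new[i] for i in overlap_idx]
    ys = [d_ref[i] for i in overlap_idx]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    # Closed-form 1D least squares; fall back to identity scale if degenerate.
    s = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 1.0
    t = my - s * mx
    return [s * d + t for d in d_new]


# Toy example: the reference depths on the overlap follow 2*d + 1,
# so the non-overlapping pixel (index 3) is corrected consistently.
aligned = align_depth_global([1.0, 2.0, 3.0, 4.0],
                             [3.0, 5.0, 7.0, 0.0],  # index 3 unused
                             overlap_idx=[0, 1, 2])
```

After such a global fit, residual per-pixel inconsistencies in the overlap would still remain, which is presumably what a second, local alignment stage addresses.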
Check out the Paper and project page.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.