Forming a holistic 3D understanding of a scene is a major perception challenge for autonomous vehicles (AVs), and it directly influences downstream tasks such as planning and mapping. Limited sensor resolution and the partial observations caused by narrow fields of view and occlusions make it difficult to obtain accurate and complete 3D information about the real environment. Semantic scene completion (SSC), a task of jointly inferring the geometry and semantics of the whole scene from sparse observations, was proposed to address these problems. An SSC solution must handle two subtasks simultaneously: scene reconstruction for visible areas and scene hallucination for occluded regions. Humans can easily reason about scene geometry and semantics from such imperfect observations, which motivates this line of work.
However, modern SSC techniques still lag behind human perception in driving scenarios. Most current SSC systems rely on LiDAR as the primary modality because it provides accurate 3D geometric measurements. Cameras, by contrast, are more affordable, more portable, and offer richer visual cues about the driving environment. This inspired the investigation of camera-based SSC solutions, first featured in the groundbreaking work of MonoScene. MonoScene uses dense feature projection to lift 2D image inputs into 3D. However, such a projection assigns 2D features from visible regions to empty or occluded voxels as well. An empty voxel occluded by a car, for example, will still receive the visual features of the car.
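The ambiguity can be seen with a pinhole camera model: every voxel along a camera ray projects to the same pixel, so dense projection gives them identical image features. The following minimal numpy sketch (with illustrative intrinsics and voxel positions, not MonoScene's actual code) demonstrates this:

```python
import numpy as np

# Illustrative pinhole intrinsics for a 640x480 image (assumed values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(voxel_xyz):
    """Project a 3D point in camera coordinates to integer pixel coordinates."""
    uvw = K @ voxel_xyz
    return tuple((uvw[:2] / uvw[2]).round().astype(int))

# A dummy 2D feature map: the "feature" at (u, v) is just u*1000 + v.
feature_map = np.fromfunction(lambda v, u: u * 1000 + v, (480, 640))

# Two voxels on the SAME camera ray: one on a car's surface (4 m away),
# one in the empty/occluded space behind it (8 m away).
visible_voxel = np.array([1.0, 0.5, 4.0])
occluded_voxel = visible_voxel * 2.0   # same ray direction, twice the depth

u1, v1 = project(visible_voxel)
u2, v2 = project(occluded_voxel)

# Both voxels sample the identical pixel, so dense 2D-to-3D projection hands
# the car's appearance to the empty voxel behind it.
print((u1, v1) == (u2, v2))
print(feature_map[v1, u1] == feature_map[v2, u2])
```

Both comparisons print `True`: the occluded voxel inherits the visible surface's feature, which is exactly the ambiguity VoxFormer's sparse-query design is built to avoid.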
As a result, the generated 3D features perform poorly in terms of geometric completion and semantic segmentation. VoxFormer, unlike MonoScene, builds its representation from a sparse set of queries that cross-attend from 3D to 2D. The design is motivated by two observations: (1) sparsity in 3D space: since a significant portion of 3D space is typically empty, a sparse rather than a dense representation is more efficient and scalable; (2) reconstruction before hallucination: the 3D information of non-visible regions is best completed by starting from the reconstructed visible areas.
In summary, the authors make the following contributions:
• A state-of-the-art two-stage framework that lifts images into a complete 3D voxelized semantic scene.
• An innovative convolution-based 2D query proposal network that produces reliable queries from image depth.
• A novel transformer, similar in spirit to the Masked Autoencoder (MAE), that completes the full 3D scene representation.
• As seen in Fig. 1(b), VoxFormer advances next-generation camera-based SSC.
VoxFormer consists of two stages: Stage 1 proposes a sparse set of occupied voxels, and Stage 2 densifies the scene representation from those proposals. Stage 1 is class-agnostic, while Stage 2 is class-specific. As illustrated in Fig. 1(a), Stage 2 follows a sparse-to-dense MAE-like design. In particular, Stage 1 contains a lightweight, CNN-based 2D query proposal network that reconstructs scene geometry from image depth. It then selects, across the entire field of view, a sparse subset of predefined learnable voxel queries to propose.
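The core idea of Stage 1 can be sketched in a few lines of numpy: back-project a predicted depth map into a point cloud, quantize it into a voxel grid, and treat the occupied cells as the positions whose voxel queries get proposed. The image size, intrinsics, and voxel size below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Hypothetical depth-map resolution and pinhole intrinsics.
H, W = 48, 64
K = np.array([[50.0, 0.0, W / 2],
              [0.0, 50.0, H / 2],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

# Stand-in for a predicted depth map: flat background with a nearby "object".
depth = np.full((H, W), 10.0)
depth[20:30, 25:40] = 4.0

# Back-project every pixel (u, v) at its depth to a 3D point in camera coords.
v, u = np.mgrid[0:H, 0:W]
pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
pts = (pix @ K_inv.T) * depth.reshape(-1, 1)

# Quantize the points into a voxel grid; unique occupied cells become the
# positions whose learnable voxel queries are proposed to Stage 2.
voxel_size = 0.5
occupied = np.unique(np.floor(pts / voxel_size).astype(int), axis=0)

print(f"{len(occupied)} voxel queries proposed from {H * W} pixels")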
The proposed voxels first strengthen their features by attending to the image observations. The non-proposed voxels are then assigned a learnable mask token, and the full set of voxels is processed by self-attention to complete the scene representation for per-voxel semantic segmentation. VoxFormer achieves state-of-the-art semantic segmentation and geometric completion performance in extensive experiments on the large-scale SemanticKITTI dataset. More critically, as demonstrated in Fig. 1, the gains are largest in safety-critical short-range regions.
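The Stage 2 flow described above can be sketched as a toy numpy example. The shapes are tiny and the plain softmax attention is a simplification (VoxFormer uses deformable attention over multi-scale image features), so this is only an illustration of the cross-attend, mask, then self-attend pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                  # feature dimension (toy value)
n_voxels, n_proposed = 32, 10           # tiny scene for illustration

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# 1) Cross-attention: only the PROPOSED voxel queries attend to the 2D image
#    features, so empty space never gathers image evidence.
image_feats = rng.normal(size=(100, D))         # flattened 2D features
proposed_q = rng.normal(size=(n_proposed, D))   # proposed voxel queries
proposed_feats = attention(proposed_q, image_feats, image_feats)

# 2) Densify: every non-proposed voxel shares one learnable mask token,
#    mirroring a masked autoencoder's placeholder for unseen patches.
mask_token = rng.normal(size=(D,))
dense = np.tile(mask_token, (n_voxels, 1))
proposed_idx = rng.choice(n_voxels, n_proposed, replace=False)
dense[proposed_idx] = proposed_feats

# 3) Self-attention over ALL voxels lets masked voxels borrow information
#    from the reconstructed ones, yielding the completed representation
#    that a per-voxel head then classifies.
completed = attention(dense, dense, dense)
print(completed.shape)  # (32, 16)
```

Step 3 is where "reconstruction before hallucination" shows up: the masked (non-visible) voxels are filled in only after the visible ones have been grounded in image evidence.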
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.