For tasks such as autonomous driving, an AI model must understand not only the 3D structure of roads and sidewalks but also identify and recognize traffic signs and traffic lights. In the driving setting, this task is made easier by LiDAR sensors mounted on the car that capture 3D data. Understanding the environment from one's own perspective in this way is called egocentric scene understanding. The problem is that, beyond the autonomous driving domain, there are no publicly available datasets that generalize to egocentric human scene understanding.
Google researchers have presented SANPO (Scene understanding, Accessibility, Navigation, Wayfinding, Obstacle avoidance), a multi-attribute video dataset for human egocentric scene understanding. SANPO consists of both real-world and synthetic data, called SANPO-Real and SANPO-Synthetic, respectively. SANPO-Real covers diverse environments and includes video from two stereo cameras to support multi-view methods. The real dataset comprises 11.4 hours of video captured at 15 frames per second (FPS) with dense annotations.
SANPO is a large-scale video dataset for human egocentric scene understanding, consisting of more than 600,000 real-world frames and more than 100,000 synthetic frames with dense prediction annotations.
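As a quick sanity check, the 11.4 hours of real video at 15 FPS line up with the 600,000+ real-world frames quoted above (the exact published frame count may differ slightly; this is just arithmetic on the figures in the text):

```python
hours = 11.4   # hours of SANPO-Real video
fps = 15       # capture rate in frames per second

frames = hours * 3600 * fps
print(f"{frames:,.0f} frames")  # 615,600 frames, consistent with "more than 600,000"
```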
Google researchers have prioritized privacy protection. They collected the data in compliance with local, municipal, and state laws, and removed any personal information, such as faces and vehicle license plates, before submitting the data for annotation.
To overcome imperfections in real-world capture, such as motion blur and human annotation errors, SANPO-Synthetic was introduced to augment the real dataset. The researchers partnered with Parallel Domain to create a high-quality synthetic dataset that closely matches real-world conditions. SANPO-Synthetic consists of 1,961 sessions recorded with virtualized ZED cameras, split evenly between head-mounted and chest-mounted positions.
The synthetic dataset and part of the real dataset have been annotated with panoptic instance masks, which assign a class and an instance ID to each pixel. In SANPO-Real, only a few frames contain more than 20 instances per frame; SANPO-Synthetic, in contrast, presents many more instances per frame than the real dataset.
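To make "a class and an ID per pixel" concrete, here is a minimal sketch of a common panoptic label encoding (the divisor-based scheme used by systems such as DeepLab; SANPO's exact on-disk format and class IDs are defined by its own release, so the divisor and class numbers below are illustrative assumptions):

```python
import numpy as np

LABEL_DIVISOR = 1000  # assumed divisor; each dataset/toolkit picks its own

def encode_panoptic(semantic: np.ndarray, instance: np.ndarray) -> np.ndarray:
    """Pack per-pixel class and instance IDs into a single panoptic map."""
    return semantic * LABEL_DIVISOR + instance

def decode_panoptic(panoptic: np.ndarray):
    """Recover the (class, instance ID) pair for every pixel."""
    return panoptic // LABEL_DIVISOR, panoptic % LABEL_DIVISOR

# Hypothetical class IDs: 7 = sidewalk (stuff), 21 = person (thing).
semantic = np.array([[7, 7], [21, 21]])
instance = np.array([[0, 0], [1, 2]])  # two distinct people get distinct IDs
panoptic = encode_panoptic(semantic, instance)
sem, ins = decode_panoptic(panoptic)
assert (sem == semantic).all() and (ins == instance).all()
```

The single integer per pixel is what lets an annotation distinguish, say, two pedestrians of the same class, which is exactly the per-frame instance count discussed above.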
Other important video datasets in this field include SCAND, MuSoHu, Ego4D, VIPSeg, and Waymo Open. SANPO was benchmarked against these datasets and is the first to combine panoptic masks, depth, camera pose, multi-view stereo, and both real and synthetic data. Aside from SANPO, only Waymo Open offers both panoptic segmentation and depth maps.
The researchers trained two state-of-the-art models, BinsFormer (for depth estimation) and kMaX-DeepLab (for panoptic segmentation), on the SANPO dataset. They observed that the dataset is quite challenging for dense prediction tasks, and that both models score higher on the synthetic data than on the real data. This is mainly because real-world environments are far more complex than synthetic ones, and segmentation annotations are more precise for synthetic data.
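For readers unfamiliar with how depth-estimation quality is scored, below is a sketch of absolute relative error, one of the standard metrics for this task (the SANPO paper defines its own exact evaluation protocol, so treat this as a generic illustration rather than the authors' code):

```python
import numpy as np

def abs_rel_error(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Mean absolute relative error between predicted and ground-truth depth."""
    mask = gt > eps  # score only pixels with valid ground-truth depth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Toy example: three pixels with ground-truth depths in meters.
gt = np.array([2.0, 4.0, 8.0])
pred = np.array([2.2, 3.6, 8.0])
print(round(abs_rel_error(pred, gt), 3))  # 0.067
```

Lower is better; the harder, more cluttered real-world scenes drive this kind of error up relative to the cleaner synthetic ones, which matches the gap the researchers report.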
Introduced to address the lack of datasets for human egocentric scene understanding, SANPO is a significant advancement spanning synthetic and real-world datasets. Its dense annotations, multi-attribute features, and unique combination of panoptic segmentation and depth information distinguish it from other datasets in the field. Additionally, the researchers’ commitment to privacy allows the dataset to help other researchers create visual navigation systems for the visually impaired and push the boundaries of advanced visual scene understanding.
Review the Paper and the Google Blog. All credit for this research goes to the researchers of this project.