Have you ever come across illusions where a child in an image appears taller and larger than an adult? The Ames room is a famous illusion that involves a room shaped like a trapezoid, with one corner closer to the viewer than the other. When you look at it from a certain point, the objects in the room appear normal, but as you move to a different position, everything changes size and shape, and it becomes hard to tell what is close to you and what is not.
Illusion tricks aside, however, this is rarely a problem for us humans: when looking at a scene, we normally estimate the depth of objects quite accurately. Computers, on the other hand, are far less successful, and depth estimation remains a fundamental problem in computer vision.
Depth estimation is the process of determining the distance between the camera and objects in the scene. Depth estimation algorithms take an image or a sequence of images as input and generate a corresponding depth map or 3D representation of the scene. This is an important task as we need to understand the depth of the scene in numerous applications like robotics, autonomous vehicles, virtual reality, augmented reality, etc. For example, if you want to have a safe self-driving car, understanding the distance to the car in front of you is crucial to adjusting your driving speed.
There are two branches of depth estimation algorithms: metric depth estimation (MDE), where the goal is to estimate the absolute distance, and relative depth estimation (RDE), where the goal is to estimate the relative distance between objects in the scene.
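The difference between the two can be made concrete with a toy example (a sketch with made-up numbers, not from the paper): a relative depth map matches the metric one only up to an unknown scale and shift, so ordering is preserved but absolute distances are not.

```python
# Illustrative sketch: relative depth equals metric depth up to an
# unknown scale and shift. All values here are hypothetical.

metric_depth = [2.0, 4.0, 8.0]   # true distances in meters (made up)

# A relative-depth model might output values transformed like this:
scale, shift = 0.5, 1.0          # unknown to the model's user
relative_depth = [d * scale + shift for d in metric_depth]

# Ordering (which object is closer) is preserved...
assert sorted(relative_depth) == relative_depth
# ...but metric distances are only recoverable if scale and shift are known.
recovered = [(r - shift) / scale for r in relative_depth]
assert recovered == metric_depth
```

This is why RDE outputs alone cannot answer questions like "how many meters away is the car ahead?" even when the depth ordering is perfect.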
MDE models are useful for mapping, planning, navigation, object recognition, 3D reconstruction, and image editing. However, the performance of MDE models can deteriorate when a single model is trained on multiple datasets, especially if the images have large differences in depth scale (e.g., indoor versus outdoor images). As a result, current MDE models often overfit to specific datasets and do not generalize well to others.
RDE models, on the other hand, use disparity as a source of supervision. Depth predictions in RDE are only consistent relative to each other across image frames, and the scale factor is unknown. This allows RDE methods to be trained on a diverse set of scenes and datasets, including 3D movies, which can help improve model generalization across domains. However, the tradeoff is that the depth predicted by RDE has no metric meaning, which limits its applications.
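For a stereo pair, disparity and depth are linked by depth = focal length × baseline / disparity, so without the camera's focal length and baseline, disparity supervision pins down only relative depth. A minimal sketch, assuming hypothetical camera parameters:

```python
# Sketch of the stereo disparity-depth relation: depth = f * B / disparity.
# The camera parameters below are assumed for illustration.

focal_length_px = 700.0   # focal length in pixels (hypothetical)
baseline_m = 0.5          # distance between the two cameras in meters (hypothetical)

def disparity_to_depth(disparity_px):
    """Convert a stereo disparity (pixels) to metric depth (meters)."""
    return focal_length_px * baseline_m / disparity_px

# Larger disparity means a closer object.
near = disparity_to_depth(70.0)   # 5.0 m
far = disparity_to_depth(35.0)    # 10.0 m
assert near < far

# Without f and B, only the ratio (relative depth) is determined:
assert far / near == 70.0 / 35.0
```

The last assertion captures the tradeoff in the paragraph above: depth ratios survive even when the metric scale factor f × B is unknown.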
What would happen if we combined these two approaches? We can have a depth estimation model that can generalize well to different domains while maintaining a precise metric scale. This is exactly what ZoeDepth has achieved.
ZoeDepth is a two-stage framework that combines the MDE and RDE approaches. The first stage consists of an encoder-decoder structure that is trained to estimate relative depth. This model is trained on a wide variety of datasets, which improves generalization. The second stage attaches additional heads that are responsible for estimating metric depth.
The metric head design used in this approach is based on a metric bins module, which estimates a set of candidate depth values for each pixel instead of a single value. Capturing a range of possible depths per pixel helps improve accuracy and robustness, and allows for depth measurements that reflect the physical distances between objects in the scene. These heads are trained on metric depth datasets and are lightweight compared to the first stage.
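The idea of predicting over depth bins rather than regressing a single value can be sketched as follows (a simplified illustration in the spirit of the metric bins module, not the paper's exact formulation; the bin centers and logits are invented): per-pixel logits are softmaxed over bin centers, and the final depth is the probability-weighted average.

```python
import math

def depth_from_bins(logits, bin_centers):
    """Final depth = probability-weighted average of bin centers (one pixel)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]                # softmax over bins
    return sum(p * c for p, c in zip(probs, bin_centers))

# Hypothetical bin centers in meters and per-pixel logits from a network.
bin_centers = [1.0, 2.0, 4.0, 8.0]
logits = [0.1, 2.0, 0.5, -1.0]       # the bin at 2.0 m is most likely

depth = depth_from_bins(logits, bin_centers)
assert 1.0 < depth < 4.0  # a soft blend of the most plausible bins
```

Because the output is a distribution over bins, the model can express uncertainty (e.g., at depth discontinuities) instead of committing to one hard value.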
When it comes to inference, a classifier model selects the appropriate head for each image based on the encoder features. This allows the model to specialize in estimating depth for specific domains or scene types while still benefiting from relative depth pretraining. In the end, we get a flexible model that can be used in multiple configurations.
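This routing step can be pictured as a tiny classifier over a summary of the encoder features choosing which metric head to apply. The sketch below is a toy illustration; the heads, scales, and threshold are invented and are not ZoeDepth's actual code.

```python
# Toy sketch of routing an image to a domain-specific metric head.
# The classifier, heads, and scale factors are hypothetical placeholders.

def indoor_head(relative_depth):
    # Hypothetical indoor scale: rooms rarely exceed ~10 m.
    return [d * 10.0 for d in relative_depth]

def outdoor_head(relative_depth):
    # Hypothetical outdoor scale: street scenes span ~80 m.
    return [d * 80.0 for d in relative_depth]

def route(encoder_features, relative_depth):
    """Pick a metric head from a crude summary of the encoder features."""
    score = sum(encoder_features) / len(encoder_features)
    head = indoor_head if score < 0.5 else outdoor_head
    return head(relative_depth)

rel = [0.125, 0.5, 0.75]   # normalized relative depths for three pixels
indoor = route([0.2, 0.3], rel)    # low score -> indoor head
outdoor = route([0.7, 0.9], rel)   # high score -> outdoor head
```

In the real system the router is learned rather than a hand-set threshold, but the shape of the computation is the same: shared relative-depth features in, one specialized metric head out.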
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 15k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher in the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.