Humans see the world through two eyes. One of the main benefits of this binocular vision is the ability to perceive depth: how close or far away objects are. The human brain infers the depth of objects by comparing the images captured by the left and right eyes at the same time and interpreting the disparities. This process is known as stereopsis.
Just as depth perception plays a crucial role in human vision and navigation, the ability to estimate depth is critical for a wide range of computer vision applications, from autonomous driving to robotics and even augmented reality. However, a number of practical considerations, from space limitations to budget constraints, often restrict these applications to a single camera.
Monocular depth estimation (MDE) is the task of predicting the depth of a scene from a single image. Computing depth from a single image is inherently ambiguous, as there are multiple ways to project the same 3D scene onto the 2D plane of an image. As a result, MDE is a challenging task that requires (either explicitly or implicitly) taking into account many cues such as object size, occlusion, and perspective.
In this post, we will illustrate how to load and visualize depth map data, run monocular depth estimation models, and evaluate depth predictions. We will do this using data from the SUN RGB-D dataset.
In particular, we will cover the following:
- loading the SUN RGB-D data and visualizing its ground truth depth maps with FiftyOne
- generating depth predictions with DPT and Marigold, locally or via Replicate
- evaluating the predictions with RMSE, PSNR, and SSIM
We will use the Hugging Face transformers and diffusers libraries for inference, FiftyOne for data management and visualization, and scikit-image for evaluation metrics. All of these libraries are open source and free to use. Disclaimer: I work at Voxel51, the lead maintainer of one of these libraries (FiftyOne).
Before you begin, make sure you have all the necessary libraries installed:
pip install -U torch fiftyone diffusers transformers scikit-image
Then we will import the modules we will use throughout the post:
from glob import glob
import numpy as np
from PIL import Image
import torch

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone import ViewField as F
The SUN RGB-D dataset contains 10,335 RGB-D images, each of which has a corresponding RGB image, depth image, and camera intrinsics. It contains images from the NYU Depth v2, Berkeley B3DO, and SUN3D datasets. SUN RGB-D is one of the most popular datasets for monocular depth estimation and semantic segmentation tasks!
For this tutorial, we will only use the NYU Depth v2 portion. NYU Depth v2 is permissively licensed for commercial use (MIT) and can be downloaded directly from Hugging Face.
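As a quick sketch of that route (the repository id sayakpaul/nyu_depth_v2 and the image / depth_map column names are assumptions about the Hub dataset, so verify them before relying on this), you could stream a few samples with the datasets library:
from datasets import load_dataset

## stream the dataset so we don't download the full archive up front
ds = load_dataset("sayakpaul/nyu_depth_v2", split="train", streaming=True)

example = next(iter(ds))
rgb_image = example["image"]       ## PIL image (assumed column name)
depth_map = example["depth_map"]   ## per-pixel depths (assumed column name)
In this tutorial, however, we will work from the raw SUN RGB-D download described below.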
Downloading the raw data
First, download the SUN RGB-D data set from here and unzip it, or use the following command to download it directly:
curl -o sunrgbd.zip https://rgbd.cs.princeton.edu/data/SUNRGBD.zip
And then unzip it:
unzip sunrgbd.zip
If you want to use the dataset for other tasks, you can fully convert the annotations and load them into your fiftyone.Dataset. For this tutorial, however, we will only use the RGB images and the depth images (stored in the depth_bfx subdirectories).
Creating the data set
Because we're just interested in illustrating the process, we'll limit ourselves to the first 20 samples, which are all from the NYU Depth v2 portion of the dataset:
## create, name, and persist the dataset
dataset = fo.Dataset(name="SUNRGBD-20", persistent=True)

## pick out first 20 scenes
scene_dirs = glob("SUNRGBD/kv1/NYUdata/*")[:20]

samples = []

for scene_dir in scene_dirs:
    ## Get image file path from scene directory
    image_path = glob(f"{scene_dir}/image/*")[0]

    ## Get depth map file path from scene directory
    depth_path = glob(f"{scene_dir}/depth_bfx/*")[0]

    depth_map = np.array(Image.open(depth_path))
    depth_map = (depth_map * 255 / np.max(depth_map)).astype("uint8")

    ## Create sample
    sample = fo.Sample(
        filepath=image_path,
        gt_depth=fo.Heatmap(map=depth_map),
    )

    samples.append(sample)

## Add samples to dataset
dataset.add_samples(samples)
Here we are storing the depth maps as heat maps. Everything is represented in terms of normalized, relative distances, where 255 represents the maximum distance in the scene and 0 represents the minimum distance in the scene. This is a common way to represent depth maps, although it is far from the only way to do it. If we were interested in absolute distances, we could store sample parameters for the minimum and maximum distances in the scene, and use them to reconstruct the absolute distances from the relative distances.
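As a minimal sketch of what that could look like (min_depth and max_depth are hypothetical field names introduced here for illustration, not part of the dataset above), we could record the raw depth range inside the loop above and later invert the normalization:
## Hypothetical sketch: record the raw depth range when building each sample
## (uses depth_path, sample, np, and Image from the dataset-creation loop above)
raw_depth = np.array(Image.open(depth_path)).astype("float32")
sample["min_depth"] = float(raw_depth.min())  ## in the dataset's native depth units
sample["max_depth"] = float(raw_depth.max())

## Later, recover absolute depths from the 0-255 heatmap; since the map was
## scaled by 255 / max, inverting that scaling is enough here
absolute_depth = sample["gt_depth"].map.astype("float32") * sample["max_depth"] / 255.0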
Viewing ground truth data
With heatmaps stored in our samples, we can visualize the ground truth data:
session = fo.launch_app(dataset, auto=False)
## then open tab to localhost:5151 in browser
When working with depth maps, the color scheme and opacity of the heatmap matter. I'm colorblind, so I find that the viridis colormap with opacity turned all the way up works best for me.
Ground truth?
By inspecting these RGB images and depth maps, we can see that there are some inaccuracies in the ground truth depth maps. For example, in this image, the dark crack running through the center of the image is actually the farthest part of the scene, but the ground truth depth map shows it as the closest part of the scene:
This is one of the key challenges for MDE tasks: ground truth data is hard to come by and is often noisy! It is essential to be aware of this when evaluating your MDE models.
Now that we have our dataset loaded, we can run monocular depth estimation models on our RGB images.
For a long time, the state-of-the-art models for monocular depth estimation, such as DORN and DenseDepth, were built with convolutional neural networks. Recently, however, both transformer-based models such as DPT and GLPN, and diffusion-based models such as Marigold, have achieved remarkable results!
In this section, we will show you how to generate MDE depth map predictions with DPT and Marigold. In both cases, you can optionally run the model locally with the respective Hugging Face library or run it remotely with Replicate.
To run via Replicate, install the Python client:
pip install replicate
And export your Replicate API token:
export REPLICATE_API_TOKEN=r8_<your_token_here>
With Replicate, it may take a minute for the model to load into server memory (cold start issue), but once it does, the prediction should only take a few seconds. Depending on your local computing resources, running on the server can give you huge speedups compared to running locally, especially for Marigold and other diffusion-based depth estimation approaches.
Monocular depth estimation with DPT
The first model we will run is a dense prediction transformer (DPT). DPT models have found utility in both MDE and semantic segmentation, tasks that require “dense” pixel-level predictions.
The checkpoint we will use next is based on MiDaS, which returns the inverse depth map, so we have to flip it to get a comparable depth map.
To run locally with transformers
First we load the model and the image processor:
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

## swap for "Intel/dpt-large" if you'd like
pretrained = "Intel/dpt-hybrid-midas"
image_processor = AutoImageProcessor.from_pretrained(pretrained)
dpt_model = AutoModelForDepthEstimation.from_pretrained(pretrained)
Below, we encapsulate the logic for running inference on a sample, including pre- and post-processing:
def apply_dpt_model(sample, model, label_field):
    image = Image.open(sample.filepath)
    inputs = image_processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predicted_depth = outputs.predicted_depth

    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    output = prediction.squeeze().cpu().numpy()
    ## flip b/c MiDaS returns inverse depth
    formatted = (255 - output * 255 / np.max(output)).astype("uint8")

    sample[label_field] = fo.Heatmap(map=formatted)
    sample.save()
Here, we are storing the predictions in a label_field field on our samples, represented with heatmaps just like the ground truth labels.
Note that in the apply_dpt_model() function, between the model's forward pass and the heatmap generation, we make a call to torch.nn.functional.interpolate(). This is because the model's forward pass is run on a downsized version of the image, and we want to return a heatmap that is the same size as the original image.
Why do we need to do this? If we only want to *look* at the heatmaps, it wouldn't matter. But if we want to compare the ground truth depth maps with the model's predictions on a per-pixel basis, we need to make sure they are the same size.
All that's left to do is loop through the data set:
for sample in dataset.iter_samples(autosave=True, progress=True):
    apply_dpt_model(sample, dpt_model, "dpt")

session = fo.launch_app(dataset)
To run with Replicate, you can use this model. This is what the API looks like:
import replicate

## example application to first sample
rgb_fp = dataset.first().filepath

output = replicate.run(
    "cjwbw/midas:a6ba5798f04f80d3b314de0f0a62277f21ab3503c60c84d4817de83c5edfdae0",
    input={
        "model_type": "dpt_beit_large_512",
        "image": open(rgb_fp, "rb")
    }
)
print(output)
Monocular depth estimation with Marigold
Following their tremendous success in text-to-image contexts, diffusion models are being applied to an increasingly wide range of problems. Marigold "repurposes" diffusion-based image generation models for monocular depth estimation.
To run Marigold locally, you will need to clone the git repository:
git clone https://github.com/prs-eth/Marigold.git
This repository provides a new diffusers pipeline, MarigoldPipeline, which makes applying Marigold easy:
## load model
from Marigold.marigold import MarigoldPipeline
pipe = MarigoldPipeline.from_pretrained("Bingxin/Marigold")

## apply to first sample, as example
rgb_image = Image.open(dataset.first().filepath)
output = pipe(rgb_image)
depth_image = output["depth_colored"]
We then need to post-process the output depth image.
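As a rough sketch of that post-processing (the "depth_np" key and the marigold_local field name are assumptions for illustration; check what your MarigoldPipeline version actually returns), we could normalize a raw depth array to 0-255 and store it as a heatmap, just as we did for DPT:
## Hypothetical post-processing sketch; assumes `output` also exposes a raw
## depth array under "depth_np" (adapt the key to your pipeline version)
depth_array = np.array(output["depth_np"], dtype="float32")

## normalize to 0-255; apply (255 - ...) instead if you prefer the opposite near/far convention
formatted = (depth_array * 255 / np.max(depth_array)).astype("uint8")

sample = dataset.first()
sample["marigold_local"] = fo.Heatmap(map=formatted)  ## illustrative field name
sample.save()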
To run via Replicate instead, we can create an apply_marigold_model() function in analogy with the DPT case above and iterate over the samples in our dataset:
import replicate
import requests
import io

def marigold_model(rgb_image):
    output = replicate.run(
        "adirik/marigold:1a363593bc4882684fc58042d19db5e13a810e44e02f8d4c32afd1eb30464818",
        input={
            "image": rgb_image
        }
    )
    ## get the black and white depth map
    response = requests.get(output[1]).content
    return response

def apply_marigold_model(sample, model, label_field):
    rgb_image = open(sample.filepath, "rb")
    response = model(rgb_image)
    depth_image = np.array(Image.open(io.BytesIO(response)))[:, :, 0]  ## all channels are the same
    formatted = (255 - depth_image).astype("uint8")
    sample[label_field] = fo.Heatmap(map=formatted)
    sample.save()

for sample in dataset.iter_samples(autosave=True, progress=True):
    apply_marigold_model(sample, marigold_model, "marigold")
session = fo.launch_app(dataset)
Now that we have predictions from multiple models, let's evaluate them! We will leverage scikit-image to apply three simple metrics commonly used for monocular depth estimation: root mean squared error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM).
Higher PSNR and SSIM scores indicate better predictions, while lower RMSE scores indicate better predictions.
Note that the specific values I arrive at are a consequence of the specific pre- and post-processing steps I performed along the way. What matters is relative performance!
We will define the evaluation routine:
from skimage.metrics import peak_signal_noise_ratio, mean_squared_error, structural_similarity

def rmse(gt, pred):
    """Compute root mean squared error between ground truth and prediction"""
    return np.sqrt(mean_squared_error(gt, pred))

def evaluate_depth(dataset, prediction_field, gt_field):
    """Run 3 evaluation metrics for all samples for `prediction_field`
    with respect to `gt_field`"""
    for sample in dataset.iter_samples(autosave=True, progress=True):
        gt_map = sample[gt_field].map
        pred = sample[prediction_field]
        pred_map = pred.map

        pred["rmse"] = rmse(gt_map, pred_map)
        pred["psnr"] = peak_signal_noise_ratio(gt_map, pred_map)
        pred["ssim"] = structural_similarity(gt_map, pred_map)
        sample[prediction_field] = pred

    ## add dynamic fields to dataset so we can view them in the App
    dataset.add_dynamic_sample_fields()
And then apply the evaluation to the predictions of both models:
evaluate_depth(dataset, "dpt", "gt_depth")
evaluate_depth(dataset, "marigold", "gt_depth")
Computing the average performance for a given model/metric is as simple as calling the dataset's mean() method on that field:
print("Mean Error Metrics")
for model in ("dpt", "marigold"):
print("-"*50)
for metric in ("rmse", "psnr", "ssim"):
mean_metric_value = dataset.mean(f"{model}.{metric}")
print(f"Mean {metric} for {model}: {mean_metric_value}")
Mean Error Metrics
--------------------------------------------------
Mean rmse for dpt: 49.8915828817003
Mean psnr for dpt: 14.805904629602551
Mean ssim for dpt: 0.8398022368184576
--------------------------------------------------
Mean rmse for marigold: 104.0061165272178
Mean psnr for marigold: 7.93015537185192
Mean ssim for marigold: 0.42766803372861134
All metrics seem to agree that DPT outperforms Marigold. However, it is important to note that these metrics are not perfect. For example, RMSE is very sensitive to outliers, and SSIM is not very sensitive to small errors. For a more thorough evaluation, we can filter by these metrics in the App to visualize what the models are getting right and wrong, or where the metrics fail to capture a model's performance.
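For instance, here is a quick sketch using standard FiftyOne view stages (the field paths match the metrics we stored above) to surface the worst-performing samples in the App:
## sort samples by DPT's per-sample RMSE, worst first, and show them in the App
session.view = dataset.sort_by("dpt.rmse", reverse=True)

## or keep only the samples where DPT's SSIM falls below some threshold, e.g. 0.8
session.view = dataset.match(F("dpt.ssim") < 0.8)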
Finally, turning masks on and off is a great way to visualize the differences between the ground truth and the model predictions:
In summary, we learned how to run monocular depth estimation models on our data, how to evaluate the predictions using common metrics, and how to visualize the results. We also learned that monocular depth estimation is a notoriously difficult task.
The quality and quantity of data are very limiting factors; models often have difficulty generalizing to new environments; and metrics are not always good indicators of model performance. The specific numerical values that quantify model performance can depend heavily on your processing pipeline. And even your qualitative assessment of the predicted depth maps can be strongly influenced by their color schemes and opacity scales.
If there's one thing you learn from this post, I hope it's this: it's mission critical that you look at the depth maps themselves, and not just the metrics!